Why doesn't findstr use the standard regular expression library?


Tim wonders why there isn't a standard library that was settled on and used by everyone by now. (While you're at it, why isn't there a standard for electrical outlets that was settled on and used by everyone by now?) And the answer is the same: Things started out with everybody doing their own thing, and by the time a standard emerged, it was too late.

The findstr program was written in 1990 by a colleague of mine who retired in the year 2000. Let's call him Bob. It was originally written for MS-DOS and was called qgrep. I don't know what the q stands for, but it couldn't call itself grep because that name was already taken. And since this was Bob's little program, he got to choose which regular expression language it accepted.

The qgrep program sat in Bob's bag of tricks, and he shared it with his closest friends, who shared it with their friends, and so on. Meanwhile, Bob ported qgrep to OS/2 (because he needed a version that ran on OS/2) and eventually Windows NT (because he needed a version that ran on Windows NT).

At this point, qgrep caught the attention of the people in charge of the Windows Resource Kit. They were on the prowl for handy little utilities that could be tossed onto the CD. They said, "Hey, can we put qgrep on the Resource Kit CD?" Bob said, "Sure, here you go." And then the Resource Kit people said, "Okay, but we are afraid to call it qgrep because that might create licensing or trademark problems, so we're going to have to call it something else. We'll call it findstr! Also, we'd like to change the command line switches to match the other Resource Kit tools, and we'd like to change the help text based on this feedback from our editors."

"Whatever," said Bob. "I'm going to keep calling mine qgrep, thanks all the same. Here you go: I created a clone of qgrep, renamed it findstr, and made the changes you requested."

That's how things stood for several years. Bob had qgrep. The Resource Kit had findstr, a mutant offshoot of qgrep.

One piece of common feedback from system administrators was that a lot of the Resource Kit tools were really handy, but it was a pain to have to install them on every computer. And since they weren't part of the core Windows installation, the tools weren't available for use in logon scripts either.

And that's how findstr ended up part of Windows. It came in on the coattails of the Resource Kit. I remember when the Resource Kit tools were added to Windows because becoming part of the core product meant another round of security reviews.

Okay, that's a nice story, but it doesn't answer the question. Why wasn't findstr upgraded to use a newer regular expression engine?

Recall that Bob retired in 2000. And since qgrep was Bob's baby, all development on qgrep stopped when he retired. When Bob gave the findstr project to the Resource Kit team, they got the source code, but there was no knowledge hand-off so that somebody on the Resource Kit team understood how the program worked, in case they needed to fix a bug or add a feature. Not that there was anybody on the Resource Kit team available to receive said knowledge. The Resource Kit was primarily a book, so the Resource Kit team consisted mostly of writers and editors, not programmers. (That's probably why they were so excited about changing the help text.) The CD filled with tools was considered a bonus feature, not the primary product. I guess they figured that if they needed a bug fixed or a feature added, they'd just ask Bob.

Besides, you can't change the regular expression language accepted by a program after it has been released, because that would break all the scripts that used the old language. Remember those logon scripts that use findstr? If any of them used a regular expression whose meaning changed between the old syntax and the new syntax, those scripts would subtly stop working properly. A change in the regular expression syntax would require a new switch to opt into the new behavior.
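To make the breakage concrete, here is a minimal sketch in Python. It does not model findstr's actual engine. It assumes only that the old dialect treats parentheses as ordinary characters and that a Perl-style dialect treats them as a grouping construct. The function names and sample lines are hypothetical, made up for illustration.

```python
import re


def matches_old_dialect(pattern: str, line: str) -> bool:
    """Stand-in for a dialect in which ( and ) are ordinary characters.

    Assumption: only the parentheses are reinterpreted as literals; this
    is not a faithful emulation of any real tool's regular expression syntax.
    """
    literal_parens = pattern.replace("(", r"\(").replace(")", r"\)")
    return re.search(literal_parens, line) is not None


def matches_perl_style(pattern: str, line: str) -> bool:
    """Perl-style interpretation: ( and ) form a capturing group."""
    return re.search(pattern, line) is not None


# The same pattern text, as it might appear in an old script.
pattern = "(x)"
lines = ["f(x) = 1", "xylophone", "banana"]

old_hits = [line for line in lines if matches_old_dialect(pattern, line)]
new_hits = [line for line in lines if matches_perl_style(pattern, line)]

print(old_hits)  # ['f(x) = 1']
print(new_hits)  # ['f(x) = 1', 'xylophone']
```

The script did not change, but its output did: under the Perl-style reading, the pattern matches any line containing an x, parenthesized or not. Nothing crashes and nothing reports an error, which is what makes this kind of breakage so hard to notice.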

I don't recall Bob ever mentioning to me that somebody asked him to upgrade the regular expression engine in qgrep. I suspect nobody asked, seeing as Perl-style regular expressions didn't become popular until long after Bob retired. Also, Bob is not a lawyer, so he doesn't want to have to read the license for a third-party library and figure out how to remain in compliance with it.

(From reading the PCRE license, it appears that if your program uses PCRE, you must reproduce "the above copyright notice", but there are three copyright notices on that page. Does the program need to reproduce all of them? Or just the last one? It seems to me that nearly everybody just ignores the license requirements. For example, Safari uses PCRE, but the PCRE copyright, licensing terms, and disclaimer do not appear in the Safari EULA or any other Safari documentation I can find.)

Comments (45)
  1. Ken in NH says:

    I know! There should be a checkbox for power users. It should be global in nature of course.

    Ok, Ok, I'll stop now.

    1. Grep has such a checkbox. Two of them, actually.

      You want Basic Regular Expressions? Use the default options. You want Extended Regular Expressions? Use the -E flag. You want Perl Regular Expressions? Use the -P flag.

      It would not be a totally unreasonable thing to do to add a new flag to findstr to support a more modern flavor of regexes, which new scripts have to opt into using (licensing issues/implementation details notwithstanding) -- I think this particular feature would meet the -100 points bar.

      It would also not be a totally unreasonable thing to do to keep findstr unchanged. If someone wants to use PCRE-compatible regexes in new scripts, there are plenty of alternatives to findstr available.

      1. voo says:

        Well since findstr is by now relegated to backwards compatibility cruft and every sane person uses PowerShell for new scripts anyhow, this problem has been solved nicely.

        Because PowerShell has support for regex out of the box and (obviously) uses .NET's regex library, which while probably not identical to Perl's supports all the useful things you'd usually need (I imagine, but then I personally prefer a short little parser to a 3 line regex that absolutely nobody understands and that you'll have to rewrite if you want to make a trivial change).

  2. Antonio Rodríguez says:

    The Gizmodo article linked in the first paragraph is really interesting, and worrying, too. Knowing how powerful standards can be when followed, it's a pity to see those messes caused by parallel development. But, once more, XKCD's Randall Munroe hits the spot and demonstrates that -sadly- there is no solution :-( .

  3. Anders Munch says:

    Safari does show the PCRE license. It seems to be an older version of the license, from 2005; it doesn't look like the one you linked to. It's in Help|Acknowledgements. There's an ftp download link to get the source, and the link actually works! Remarkable.

    1. The requirements say "in the documentation and/or other materials provided with the distribution." The "Acknowledgements" box is not documentation, but is it "other materials provided with the distribution"? I interpreted that to mean a LICENSE.txt file or something similar. Maybe this is why software developers shouldn't try to interpret licensing terms.

  4. MacIn173 says:

    I BEG you, please return the old style.

    1. Boris says:

      I second the imploration.

      1. vbdotnet says:

        +1. At least put the lovely orange background color back, that alone handles 80% of my nostalgia factor :)

        1. Boris says:

          To be clear, I have nothing against the nested comments or the newfound ability to post them from iOS as well, but the white background, the blue hyperlinks, the red-on-pink code and all these new fonts just aren't oldnew any more. They're simply new.

      2. not important says:

        ++

      3. Rafael says:

        return new Promise( bringBack( oldStyle ) );

        1. You seem to be under the impression that I control the blog software.

          1. MacIn173 says:

            Thanks to this blog, we're under the strong impression that MS is good at keeping backwards compatibility.

    2. voo says:

      Yeah, we really don't want a working mobile site and nested comments and.. wait. Sure a different stylesheet could give it a more unique flair (hey I've been a reader for a decade or so too, I understand the nostalgia!), but the new blog software is a vast, vast improvement to the previous cruft..

      1. MacIn173 says:

        Is there mutual exclusion between usual design and new functionality?

  5. Anon says:

    Safari doesn't use PCRE any more - they switched to their own regex engine called YARR (which Firefox used to use before switching to Chromium's irregexp).

  6. Slashdotter says:

    And thank you for the follow up on the Quake credits post (https://blogs.msdn.microsoft.com/oldnewthing/20151111-00/?p=91972/ ) - I'm the one who asked you about it in your suggestion box originally! I only wish I'd posted a comment on it while the comments were still open.

  7. Killer{R} says:

    Windows Server 2003 Resource Kit notice mentions qgrep, but not findstr: http://www.microsoft.com/en-us/download/details.aspx?id=17657
    And after being installed on Win7 it works, and its command line help differs from findstr's. So the renaming and code change occurred no earlier than 2003. And anybody can still take Bob's original qgrep :) BTW a little googling discovers at least a couple of other qgreps on GitHub.

  8. hacksoncode says:

    I realize that it would have been a *way* more boring article, but couldn't all of this have just been replaced with "Backward Compatibility, duh."?

  9. Waleri says:

    Now would be a good time to hear the story about why Visual Studio *did* change the regular expression library.

    1. Probably because there are no batch files that script the Visual Studio search UI.

      1. roeland says:

        Visual Studio is a bit of an outlier when it comes to regular expressions. It's the only program I know of which uses curly brackets { } for capturing groups, and for word boundaries.

        Was that introduced before others settled on ( ) and \b respectively?

      2. smf says:

        I'm not sure that is a valid argument as human brains are much harder to change than scripts. I've never seen a benefit in the regular expression change in visual studio, but somehow it was still more important to change it than the impact it would have on me and other developers.

        I'd expect that type of reasoning from a manager, but not from a developer.

        1. AndyCadley says:

          I think the "human brains are harder to change than scripts" was pretty much the exact reason for the change. It was annoyingly difficult to mentally switch between the regular expression syntax you were using in code to the Visual Studio only one used by the UI. Having the same regular expressions in both removed an enormous pain point for many developers.

  10. asdf says:

    > Besides, you can’t change the regular expression language accepted by a program after it has been released, because that would break all the scripts that used the old language.

    That's a non-sequitur.

    1. Suppose you have a batch file that searches for a parenthesized "x" by doing findstr /r /c:"(x)" if you switch to PCRE, then that will find all occurrences of x, even those that aren't in parentheses.

      1. cheong00 says:

        IMO, if they can still find Bob, just adding a new switch for a different set of RegEx is acceptable. That'll be unlikely to break things because the parsing code would be a separate one.

  11. lowell says:

    > For example, Safari uses PCRE, but the PCRE copyright, licensing terms, and disclaimer do not appear in the Safari EULA or any other Safari documentation I can find.

    It's included with the product (the only place it needs to be); Help -> Acknowledgements.

    http://i.imgur.com/98rCr3t.png

  12. Mason Wheeler says:

    ...and now we have three problems.

  13. — "From reading the PCRE license, it appears that if your program uses PCRE, you must reproduce “the above copyright notice”, but there are three copyright notices on that page. Does the program need to reproduce all of them?"

    What you are referring to is a 3-clause BSD license. The person creating a derivative work must reproduce all applicable copyright notices, although I don't think anyone would include all three works (basic library, PCRE2 JIT, and Stackless JIT) in one derivative work. So, most of the time, one of the notices gets reproduced along with three clauses. For more information, look up "BSD licenses" on English Wikipedia.

  14. Kaz says:

    Reams of TL;DR surrounding the only real point here, which is obvious: changing the default regex language accepted by a tool breaks existing scripts.

    Who cares about some guy who wrote a poor imitation of a Unix tool on Windows. It is insignificant and irrelevant.

    1. GregM says:

      If you don't want all the extra flavor in your stories, feel free to not read the stories here, because they all contain a lot of that, which is what makes them interesting.

    2. Those who matter must care.

      Even in art, where originality of the work is important, poor or rudimentary renditions do not rapidly fall into "who cares?" category. (Copyright laws make sure that it does not.) In science, on the other hand, the first rudimentary work is celebrated. Example: Wheel. It is celebrated as a cornerstone invention, even though that wheel is considered junk today. (The Nobel Prize is also given for the *advancement* of science; something that Leonard Hofstadter's mother does not seem to know!)

      Software is both art and science. In this case, it is more science. This, plus the compatibility issue that Raymond explained.

  15. 640k says:

    Scripts written in the 80s by developers who don't care to keep them updated and compatible with new standards *should* break. If the owner of software isn't interested in keeping it updated, replace the software. Software should in all cases be updated continuously, and stuff that isn't compatible with latest standards should not be used. Embrace change.

    1. Tell that to the company with 9000 install scripts. "Your scripts are supposed to be broken. Replace them all. Embrace change."

      1. 640k says:

        That's what successful companies do and it works perfectly. Being compatible with scripts from the 80s will limit your ability to innovate.

        Even luddites with 9k scripts from the last millennium will probably make more money if they continuously update their scripts to the latest and greatest tech.

        1. Richard says:

          @640k:

          What if the 9k scripts were done by government agencies (and they STILL work, thanks to MS$ "backwards compatibility"). Are you OK spending tax payer dollars on "innovation" or "latest and greatest technologies", just for the sake of it? Even more so for a non-profit organization, where any $$ spent on such "fluff" takes away from the reason of the organization (like feeding the homeless).

          1. 640k says:

            Keeping a system compatible with 30 year old batch files instead of refactoring it to be compatible with modern and mobile apps is questionable and will help neither your employer nor your customers.

          2. The company doesn't need to keep the system compatible with 30 year old batch files. Microsoft takes care of that. And how do you refactor a batch file to be compatible with mobile apps? Batch files don't run on phones.

        2. Do you want me to forward your contact information to them, so you can show them how spending 10 man-years of effort upgrading 9000 perfectly-functional scripts will make them more money?

  16. Henri Hein says:

    "The wonderful thing about standards is there are so many of them to choose from." - Grace Hopper (maybe)

Comments are closed.
