Stupid command-line trick: Counting the number of lines in stdin


On unix, you can use wc -l to count the number of lines in stdin. Windows doesn't come with wc, but there's a sneaky way to count the number of lines anyway:

some-command-that-generates-output | find /c /v ""

It is a special quirk of the find command that the null string is treated as never matching. The /v flag reverses the sense of the test, so now it matches everything. And the /c flag returns the count.

It's pretty convoluted, but it does work.

(Remember, I provide the occasional tip on batch file programming as a public service to those forced to endure it, not as an endorsement of batch file programming.)

Now come da history: Why does the find command say that a null string matches nothing? Mathematically, the null string is a substring of every string, so it should be that if you search for the null string, it matches everything. The reason dates back to the original MS-DOS version of find.exe, which according to the comments appears to have been written in 1982. And back then, pretty much all of MS-DOS was written in assembly language. (If you look at your old MS-DOS floppies, you'll find that find.exe is under 7KB in size.) Here is the relevant code, though I've done some editing to get rid of distractions like DBCS support.

        mov     dx,st_length            ;length of the string arg.
        dec     dx                      ;adjust for later use
        mov     di, line_buffer
lop:
        inc     dx
        mov     si,offset st_buffer     ;pointer to beg. of string argument

comp_next_char:
        lodsb
        cmp     al,byte ptr [di]
        jnz     no_match

        dec     dx
        jz      a_matchk                ; no chars left: a match!
        call    next_char               ; updates di
        jc      no_match                ; end of line reached
        jmp     comp_next_char          ; loop if chars left in arg.

If you're rusty on your 8086 assembly language, here's how it goes in pseudocode:

 int dx = st_length - 1;
 char *di = line_buffer;
lop:
 dx++;
 char *si = st_buffer;
comp_next_char:
 char al = *si++;
 if (al != *di) goto no_match;
 if (--dx == 0) goto a_matchk;
 if (!next_char(&di)) goto no_match;
 goto comp_next_char;

In sort-of-C, the code looks like this:

 int l = st_length - 1;
 char *line = line_buffer;

 l++;
 char *string = st_buffer;
 while (*string++ == *line && --l && next_char(&line)) {} 

The weird - 1 followed by l++ is an artifact of code that I deleted, which needed the decremented value. If you prefer, you can look at the code this way:

 int l = st_length;
 char *line = line_buffer;
 char *string = st_buffer;
 while (*string++ == *line && --l && next_char(&line)) {} 

Notice that if the string length is zero, there is an integer underflow, and we end up reading off the end of the buffers. The comparison loop does stop, because we eventually hit bytes that don't match. (No virtual memory here, so there is no page fault when you run off the end of a buffer; you just keep going and reading from other parts of your data segment.)

In other words, due to an integer underflow bug, a string of length zero was treated as if it were a string of length 65536, which doesn't match anywhere in the file.

This bug couldn't be fixed, because by the time you got around to trying, there were already people who discovered this behavior and wrote batch files that relied on it. The bug became a feature.

The integer underflow was fixed, but the code is careful to treat null strings as never matching, in order to preserve existing behavior.

Exercise: Why is the loop label called lop instead of loop?

Comments (42)
  1. Anonymous says:

    I believe the loop label has to be called "lop" because "loop" is an x86 assembly instruction (it decrements ecx and jumps to a label if ecx is not zero).

  2. Dan Bugglin says:

    Martin beat me to it.  Not knowing x86 asm my first impulse is to say "loop" is a keyword reserved by the assembler and thus unusable as a label.

  3. Anonymous says:

    (So if batch files are to be avoided, what's the better option? PowerShell maybe?)

  4. Anonymous says:

    Just about anything is a better option. PowerShell is a great option.

  5. Anonymous says:

    Looking at the ChangeLog for GNU Textutils, it looks like the original wc program dates back to September 1991.

  6. Anonymous says:

    But of course the wc in GNU Textutils is not the original one. The original was from (AT&T) Unix somewhen back in the 1970s. The GNU project decided they needed to rewrite it all from scratch, to excape copyright restrictions that they found onerous.

  7. Anonymous says:

    And, yes, you should avoid batch files totally and instead use CMD files :-)

  8. Anonymous says:

    @Henning: Ah, you're right.  Looking at the manual pages for the first edition of Unix (cm.bell-labs.com/…/1stEdman.html), the original AT&T wc was written no later than November 1971.

  9. Anonymous says:

    Now that was fascinating. Useless, but fascinating.

  10. Anonymous says:

    It's not useless.

    I wanted to write (or download) an app to count my code lines.

    :-)

  11. Anonymous says:

    Yes, but why are there find and findstr, with exclusively-excluded features? find supports UTF-16, while findstr doesn't, but does support RegExps.

  12. Anonymous says:

    I'm not immediately convinced that it's "lop" because "loop" is an instruction.  The whole point of the trailing colon is to identify a label, so if you have "loop:", then it's clearly a label and not an instruction.  You'd have to make all reserved words a special case in your parser instead of just picking up a token, seeing that there was a colon, and adding it to your list of labels.

    In 1982 when memory was precious, I'd think that a programmer wouldn't bother.

    Even when you used it, like "jmp loop", the 2nd token to jmp is a label or offset or whatever and I don't see why the parser would try to interpret it as an instruction.  If the label exists, great.  If not, an instruction certainly doesn't make sense there so stop parsing and return an error.

  13. Anonymous says:

    Interpreting "loop:" as a label makes your parser more complex as you need to special case for it. Go write a parser and you will be convinced.

  14. Anonymous says:

    Thanks a million times over Raymond!  For years, I've used:

    my-output-generating-command | find /c /v "{wiggle fingers around on keyboard for good 20 chars or so}"  

    Much the same effect, however now I know I don't have to come up with garbage text either.

  15. Anonymous says:

    "here's how it goes in pseudocode"? – mate, when *I* went to school, pseudo-code was a little more English-looking than that. This looks like pseudo-code from the guys who wrote code first then built the design from that using their favourite text editor global search and replace commands :-)

  16. Anonymous says:

    loop is an MASM macro. That is why it cannot be used as a label target.

    If you want to create a loop that executes some number of times within your program, use the loop instruction. Although the following two code sequences produce the same result, they are not the same:

    ; Code sequence using a run-time loop:

                   mov     cx, 10

    AddLp:          add     ax, [bx]

                   add     bx, 2

                   loop    AddLp

    ; Code sequence using an assembly-time loop:

                   repeat  10

                   add     ax, [bx]

                   add     bx, 2

                   endm

    From: http://www.arl.wustl.edu/…/CH08-9.html

  17. Anonymous says:

    Feroze:

    Actually, loop is an x86 instruction, not (necessarily) an assembler macro.

    Though it's honestly a good idea never to use loop for actually looping, because it has compatibility baggage that requires it to be very slow.

  18. Anonymous says:

    Great tip and very handy to have if you're working on someone else's computer.

    I always find myself going to find the GNU coreutils for Windows not long after installing a new system.  Tossing wc, du, tr, head, tail, and friends in my PATH always seems to make interacting with the command line on Windows more enjoyable :)

    gnuwin32.sourceforge.net/…/coreutils.htm

  19. Anonymous says:

    @kzinti: That's probably true for a hand-coded parser (which I bet would be most/all parsers written pre-1982), but for a parser generated from a grammar-based parser generator like yacc or GNU bison, it's simpler to write the grammar without the special case.  Something like this, maybe:

    program := line*

    line := [ t]* (intrinsic | label | instruction) "n"

    intrinsic := "." intrinsic_keyword intrinsic_operands?

    label := label_name ":"

    label_name := [A-Za-z0-9_]*

    instruction := opcode instruction_operands?

    opcode := "loop" | "mov" | "add" | (etc.)

    This is a straightforward, unambiguous grammar (like JJJ suggested), but the resulting generated parser is more complex than the equivalent hand-coded parser would be.

  20. cheong00 says:

    Regarding "lop", I think it might also carry to meaning "to lop one character from buffer at a time".

  21. Anonymous says:

    @Adam

    I'm mostly neutral on the larger issue, but you're forgetting that most parser generators (in particular, those that follow the Yacc lineage) have dedicated lexers. (This is even sort of true for some like ANTLR.) This means that that the assembler must commit *at lexing time* to what token "loop" becomes. If you have a dedicated token for each keyword and your keywords include instruction mnemonics (like LOOP), then your 'label' production has to be 'label := IDENT COLON | LOOP COLON | MOV COLON | ADD COLON | …'. (More likely you'd split that up, but the point is you need an explicit list of keywords that you can use as a label somewhere.) That grammar is a bit more annoying, since that's a list you probably don't need otherwise. (I'm assuming that the list of opcodes doesn't exactly correspond to the list of opcodes that you give.)

    You could also lex "loop" and "mov" and other keywords like "eax" into an IDENT. This solves this problem, but makes it so that the grammar can't distinguish between tokens which are keywords and which aren't, so it can't enforce that the first token in, say, an 'instruction' is in fact an OPCODE: you'd need to do this separately. It's even possible that you could have two productions that you can only distinguish between in the action, and that is always quite obnoxious to deal with in a Yacc-like parser generator.

  22. Yuhong Bao says:

    "Though it's honestly a good idea never to use loop for actually looping, because it has compatibility baggage that requires it to be very slow."

    In particular, Windows 95 had a timing loop that used this instruction that overflowed on faster processors:

    support.microsoft.com/…/192841

  23. Anonymous says:

    @AsmGuru62

    If you want to count the lines in a file from the command prompt, you can search a certain popular Q&A site named after a programming error condition to discover a powershell one-liner that will do it for you.  :)

    Duplicated here for ease of copying:

    PS C:Path> (dir -include *.cs,*.xaml -recurse | select-string .).Count

    8396

    PS C:Path>

  24. Anonymous says:

    The source code for wc from Unix V5:

    minnie.tuhs.org/…/utree.pl

  25. Anonymous says:

    Greg D:

    (ls -rec -inc *.cs,*.xaml | cat).count

    or just

    (cat *.cs).count

    if no recursion needed

  26. Anonymous says:

    caf: Ah, the good ol' days: back when men were men, Unix utilities did one thing and did it well, ints were 16 bits, and it was safe to use gets!

  27. Anonymous says:

    It doesn't matter if loop is a macro, instruction, or whatever. And an assembly parser is trivial to write.

    Every line consists of up to 4 things (all optional). A label, an instruction/pseudo-instruction, operands (requires instruction to precede it), and comments. The label begins a line. If there is no label, then the line must start with whitespace. Operands cannot exist without instructions, and there's a delimiter to indicate comments.

    Some assemblers are fussy enough that a comment cannot start on the first column (symbol/label). And most assemblers ignore ":" ending a label since whitespace must separate the label from the next thing (instruction or comment).

    Quite simple, really.

  28. Anonymous says:

    Nitpickers corner: "" is not the null string. It's called the empty string.

  29. cheong00 says:

    @ThomasX: Except when talking about null terminated string, "" is not different to "". (Remember, you just pass the pointer to start of string.)

  30. Anonymous says:

    Instead of the useless slow loop instruction (/macro?), use:

    dec cx

    jnz lop

    @ThomasX: NULL != Null string

    [Not sure what you mean by "useless slow". The loop instruction was faster on an 8086 (by five clocks) than the dec/jnz equivalent. -Raymond]
  31. cheong00 says:

    Oops, mixed a few things up. Null string is pointer to 0.

  32. Anonymous says:

    The reason the Loop instruction performs so badly on modern processors is because one of the operations required is not like any other… any form of decrement on a general purpose register other than the one generated by loop will update the flags register.

    So regular decrementing is prioritized such that there is a specific micro-operation dedicated to it, while the loop instruction generates many micro-operations.. save the flags.. decrement.. restore the flags..

    Such is life in a world where most of the instruction set is just something to be emulated by the instruction decoder, rather than something the execution units speaks natively.

  33. Anonymous says:

    @Joseph Koss

    True, but strangely loop was already slower than a dec cx/jnz pair (or whatever is faster) on the 486.. which was still a very "CISCy" cpu.

    I think it's also a vicious cycle of compilers not using loop, leading to the cpu having a slower loop, leading to even less compilers using it etc. etc.

    This also applies to ENTER, LEAVE and other instruction meant more for human usage than compiler usage.

  34. Anonymous says:

    @cheong00: If you're going to be nit-picking, then "" is different from "" (assuming that the empty string is what you meant by "").  "" is a two-character array (const char[2]) consisting of two NUL bytes, while "" is a one-character array (const char[1]) consisting of one NUL byte.  They both decay into pointers that would both be interpreted as the empty string, but sizeof("") != sizeof("").

  35. Anonymous says:

    I measured the LOOP vs. SUB R32,1/JNZ – LOOP slower about 15% on average on modern CPUs.

    So, IMO, it is not VERY SLOW – just a tad slower.

    :-)

  36. Anonymous says:

    @AsmGuru:

    Do a real test that measures the pairing opportunities. We are talking about 3 cycles of latency for dec/sub on core2/phenom which is enough to sneak in 6+ micro-ops (max of 6 on phenom because of limits on instruction retirement), and thats just between just the dec/sub and the jnz.

    Never time instructions in isolation.

  37. Anonymous says:

    I'm glad somebody has finally given me a use for FIND! Ever since I discovered FINDSTR, I have always wondered why FIND was still around. Now I know.

    For optimization purposes, I've discovered that the spaces before and after the switches are unnecessary and I can save a couple keystrokes by typing:

    command | find/c /v""

  38. cheong00 says:

    @Adam: if your string variable is being passed as char*, can you tell the difference from an empty string if the only content is ""? Remember, for null terminated strings the length is "position of first null character encountered" – 1.

  39. Anonymous says:

    Hmm, I didn't know about "findstr".  Windows now just needs xargs and sed.  One of the first things I do with a fresh Windows installation is install UnxUtils.

    loop is slower because it's not a RISC-like instruction, and is thus typically implemented in microcode instead of being hardwired.  Same with instructions like xlat, enter, aad…

  40. Anonymous says:

    char* isn't a string. It's a char pointer, which often happens to be used as a string because the c/c++ languages lack a native string type.

  41. Anonymous says:

    Another quirk of find is that it treats as a line break. Here is an example of using find to count characters in a variable:

    set str=can you count me [32 characters]

    for /f %%a in ('cmd /u /c set /p "=%str:"= %"^<nul^|find /v /c ""') do echo %%a

  42. Anonymous says:

    @cheong00: Of course you can't tell the difference between "" and "" if all you have is a pointer, but as 640k said, char* isn't a string.  There's no such thing as strings in C, just character arrays and character pointers.  String literals are character arrays which happen to conveniently decay into pointers to their first characters.

Comments are closed.