CLC and its Data Viewer CLCV

CLC is a native C++ program I developed that counts lines of source code.  It is designed to handle very large source code trees, such as all of Windows Vista.  For large trees, it saves its output in comma-separated value (CSV) format, which is easily read by Excel.

CLC is extremely fast - it uses very efficient I/O and easily drives the disk at its maximum sequential read rate.  The way it processes source data is also efficient - as a result, CLC is very I/O bound, spending most of its time waiting on disk reads to complete.

For small trees, or a few files, the console output from CLC is very readable.  For larger trees, it's not too hard to use Excel.  But for tens of thousands of files, the data becomes unwieldy.  So, I decided this was a great opportunity to write my first practical WPF application, a utility for graphically viewing CLC data.  I've named it CLCV ('V' for viewer).

Ultimately, CLCV will display CLC's data in a tree-map with a few other charts and graphs.  In the meantime, developing CLCV is how I'm learning WPF and C#.

I'm writing a series of blog posts as I develop CLCV, and learn WPF and C#.  You can find them under the Beyond Hello World tag.  

So far, I'm on V2 of CLCV.  You can find the first three posts here.

  1. Beyond Hello World - My First WPF Application (with source)
  2. Beyond Hello World - An Update On My First WPF Application (with source)
  3. Beyond Hello World - Update 3, Control Templates, Multithreading, and more... (with source)

OK, so you may ask, "Why did you write your own code line counter?  Aren't there a bunch of them already out there?"  I wrote CLC because our team needed a good code counting tool.  Count-based metrics - commonly called "Source Lines of Code" (SLOC) - are common.  For example, bugs per line of code.  Of course, count-based metrics have all kinds of issues, but they are commonly used and can be useful if used in the proper context.  Other count-based metrics include total lines in source files, lines of code, lines of comments, and various ratios between them.
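As a rough illustration of these counts, the sketch below sorts each line into blank, comment, or code and derives one ratio from the totals.  It is a simplified example and not how CLC works internally: it recognizes only C++ style "//" comments that start a line, and the names LineCounts, count_lines, and comment_to_code_ratio are made up for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for this example (not CLC's data model).
struct LineCounts {
    std::size_t total = 0;    // every physical line
    std::size_t blank = 0;    // empty or whitespace-only lines
    std::size_t comment = 0;  // lines whose first non-space text is "//"
    std::size_t code = 0;     // everything else

    // One example of a ratio between the counts.
    double comment_to_code_ratio() const {
        return code == 0 ? 0.0
                         : static_cast<double>(comment) / static_cast<double>(code);
    }
};

// Classifies each line of the stream. Simplifying assumption: only "//"
// comments are recognized, so /* ... */ blocks are counted as code.
LineCounts count_lines(std::istream& in)
{
    LineCounts counts;
    std::string line;
    while (std::getline(in, line)) {
        ++counts.total;
        std::size_t i = 0;
        while (i < line.size() &&
               std::isspace(static_cast<unsigned char>(line[i]))) {
            ++i;  // skip leading whitespace
        }
        if (i == line.size()) {
            ++counts.blank;
        } else if (line.compare(i, 2, "//") == 0) {
            ++counts.comment;
        } else {
            ++counts.code;  // includes code followed by a trailing comment
        }
    }
    return counts;
}

// Convenience overload for in-memory text.
LineCounts count_lines(const std::string& text)
{
    std::istringstream in(text);
    return count_lines(in);
}
```

A line that holds code followed by a trailing comment is counted as code here, which is only one of several reasonable conventions.  A real counter also has to deal with block comments, string literals that contain comment markers, and the different comment syntaxes of each supported language.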

There are hundreds of code counting utilities available.  Many of them have some interesting problems, and I didn't find one that was ideal - or even satisfactory.

  1. Most are focused only on C or C++.
  2. Almost all of them only counted physical source lines of code - see footnote [1] below.
  3. Few handled UNICODE input files.
  4. Some are pretty "hacky".
  5. Several utilities for Perl were written in Perl (slow, hard to maintain, etc.).
  6. Most do not handle multiple programming languages in one utility with a unified set of output statistics.
  7. Few handled makefiles.
  8. None handled batch files.
  9. Most did a poor job of reporting statistics.
  10. Some tools need Cygwin to run (not evil, just a pain).
  11. Many tools are out of date and haven't been revised in several years.

I could have leveraged some existing utilities and patched together a hodgepodge of ad hoc utilities using some Perl scripts, but this would have been a poor approach.  It would have been very slow, hard to maintain, and difficult to deploy.  Stitching together counts from various utilities into a cohesive whole would have been problematic at best.

So, I decided to write my own utility in native C++.  This approach solved all the above problems and gave me a utility that knows about effectively blank lines, handles multiple languages in one tool, and reports statistics consistently across all the supported languages.  It is also very fast, and will get faster as I improve the file read mechanisms.


[1] Commonly called "Physical Source Lines of Code" or simply SLOC.  Defined as "A physical source line of code is a line ending in a newline or end-of-file marker, and which contains at least one non-whitespace non-comment character".  The most interesting (and probably best) work on this is from David A. Wheeler.

[2] In a purist sense, CLC counts "Logical Source Lines of Code", which is essentially any count that isn't SLOC.