OpenSource CSV Reader on Nuget


I did some volunteer work a few years ago that required processing lots of CSV files. So I solved the problems by writing a C# CSV reader, which I wanted to share here. The basic features here are:

  1. be easy to use
  2. read and write CSV files (and support tab and “|” delimiters too)
  3. create CSV files around IEnumerable<T>, dictionaries, and other sources.
  4. Provide a “linq to CSV” experience
  5. provide both in-memory mutable tables and streaming over large data sources (thank you polymorphism!)
  6. provide basic analysis operations like histogram, join, find duplicates, etc. The operations I implemented were driven entirely by the goals I had for my volunteer work.
  7. Read from Excel
  8. Work with Azure. (This primarily means no foolish dependencies, and support TextReader/TextWriter instead of always hitting the file system)

I went ahead and put it on github  at https://github.com/MikeStall/DataTable. And it’s available for download via Nuget (see “CsvTools”).  It’s nice to share, and maybe somebody else will find this useful. But selfishly, I’ve used this library for quite a few tasks over the years and putting it on Github and Nuget also makes it easier for me to find for future projects.

There are the obvious disclaimers here that this was just a casual side project I did as a volunteer, and so use as is.

Step 1: Install “CsvTools” via Nuget:

When you right click on the project references node, just select “Add Library Package Reference”. That will bring up the nuget dialog which will search the online repository for packages. Search for “CsvTools” and then you can instantly install it. It’s built against CLR 4.0, but has no additional dependencies.

image

 

Example 1: Loading from a CSV file

Here’s a CSV at: c:\temp\test.csv

name, species
Kermit, Frog
Ms. Piggy, Pig
Fozzy, Bear

To open and print the contents of the file:

using System;
using DataAccess; // namespace that Csv reader lives in

class Program
{
    static void Main(string[] args)
    {
        DataTable dt = DataTable.New.ReadCsv(@"C:\temp\test.csv");

        // Query via the DataTable.Rows enumeration.
        foreach (Row row in dt.Rows)
        {
            Console.WriteLine(row["name"]);
        }        
    }
}

There are a bunch of extension methods hanging off “DataTable.New” to provide different ways of loading a table. ReadCsv will load everything into memory, which allows mutation operations (see below).  But this also supports streaming operations via the methods with “lazy” in their name, such as ReadLazy().

Example 2: Creating a CSV from an IEnumerable<T> and saving back to a file

Here’s creating a table from an IEnumerable<T>, and then saving that back to a TextWriter (in this case, Console.Out).

var vals = from i in Enumerable.Range(1, 10) select new { N = i, NSquared = i * i };
DataTable dt = DataTable.New.FromEnumerable(vals);
dt.SaveToStream(Console.Out);  


Which produces this CSV:

N,NSquared

1,1

2,4

3,9

4,16

5,25

6,36

7,49

8,64

9,81

10,100

 

Example 3: Mutations

DataTable is actually an abstract base class. There are two primary derived classes:

  1. MutableDataTable,, which loads everything into memory, stores it in column major order, and provides mutation operations.
  2. streaming data table, which provides streaming access over a rows. This is obviously row major order, and doesn’t support mutation. The streaming classes are non-public derived classes of DataTable.

Most of the builder functions that load in memory actually return the derived MutableDataTable object anyways. A MutableDataTable is conceptually a giant 2d string array stored in column major order. So adding new columns or rearranging columns is cheap. Adding rows is expensive. Here’s an example of some mutations:

static void Main(string[] args)
{
    MutableDataTable dt = DataTable.New.ReadCsv(@"C:\temp\test.csv");

    // Mutations
    dt.ApplyToColumn("name", originalValue => originalValue.ToUpper());
    dt.RenameColumn(oldName:"species", newName: "kind");
    
    
    int id = 0;
    dt.CreateColumn("id#", row => { id++; return id.ToString(); });

    dt.GetRow(1)["kind"] = "Pig!!"; // update in place by row
    dt.Columns[0].Values[2] = "Fozzy!!"; // update by column

    // Print out new table
    dt.SaveToStream(Console.Out);        
}

Produces and prints this table:

name,kind,id#

KERMIT,Frog,1

MS. PIGGY,Pig!!,2

Fozzy!!,Bear,3

 

There’s a builder function, DataTable.New.GetMutableCopy, which produces a mutable copy from an arbitrary DataTable.

Example 4: Analysis

I needed some basic analysis functions, like join, histogram, select duplicates, sample, and where. These sit as static methods in the Analyze class.

Here’s an example of creating a table with random numbers, and then printing the histogram:

static void Main(string[] args)
{   
    // Get a table of 1000 random numbers
    Random r = new Random();
    DataTable dt = DataTable.New.FromEnumerable(
        from x in Enumerable.Range(1, 1000) 
        select r.Next(1, 10));

    Tuple<string,int>[] hist = Analyze.AsHistogram(dt, columnIdx: 0);
    
    // Convert the tuple[] to a table for easy printing
    DataTable histTable = DataTable.New.FromTuple(hist, 
        columnName1: "value",
        columnName2: "frequency");
    histTable.SaveToStream(Console.Out);
}

Produces this result:

value,frequency

9,151

8,124

2,118

7,110

3,107

5,104

1,101

6,99

4,86

Comments (5)

  1. Mike says:

    Not too sure about the naming of namespace DataTable (confusing with built-in class DataTable, and doesn't indicate fixed text format) and the class New (seems like an attempt at a fluent interface, but it's be better to follow reader/writer design of your classes imho).

    I'm not ungrateful, just wanted to give you some feedback on the design!

  2. I am curious, since I wrote a similar function this past week, as to why you decided to make AsHistogram return a Tuple<string, int>[] instead of a Dictionary<string, int> or even a KeyValuePair<string, int>[].

    Also, why not implement the Analyze functions as extensions on DataTable (e.g. dt.ToHistogram())?

    Thanks for sharing a quite comprehensive CSV parser!

  3. James says:

    Why not use FileHelpers – free import/export routine(s)

    filehelpers.sourceforge.net

  4. Your welcome! I hope you find it useful. I have some excel and Azure functions that I want to polish and add to it.  

    Honestly, I didn't put a ton of thought into it.  I was mostly goal driven. I needed to produce some results, and a CSV reader was a biproduct of that.

    KeyValuePair would probably have been better than Tuple, since you get better names (Key and Value rather than Item1 and Item2). I skipped on Dictionary because I didn't need lookup functionality and was just using the histogram to produce another table.  It's open source, so if you want to make that change, I'd support it. 🙂

    Re Analyze vs. extension methods:

    – There were historical reasons here.

    – it felt more discoverable to have all the analysis functions grouped on one class (Analyze). This was particularly useful when working through analysis heavy algorithms. Once everything becomes an extension method on DataTable, then it gets noisy and hard to discover new things.

    – some of the operations are symmetric (like Join) and I personally felt a static captured that symmetry better than an extension method.

  5. @mike – Yeah, I struggled with DataTable. I liked the name and couldn't come up with a better one. but it does have an unfortunate conflict with System.Data.  

    Regarding the New pattern, what I really wanted was Extension Constructors. I wanted to add new builder methods that were discoverable. For example, I have another library that loads a table from Excel. That could be discoverable via DataTable.New.ReadExcel(…)

    @james – fair question. This was code I started 2 years ago, and I didn't find the right library at the time. I also wasn't sure exactly what I needed for the volunteer project so I just started writing and it morphed into a CSV reader.