LINQ to TEXT and LINQ to CSV

LINQ is a great tool for writing ad-hoc queries and transforms, and occasionally I need to write queries or transforms on text files.  And sometimes I receive CSV files, and need to do something with them.  I wrote a blog post on LINQ to Text files over two years ago.  My coding practices today differ from what I presented in that blog post.  This post presents my current approach for dealing with text files using LINQ, and includes a function for splitting CSV lines.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCIn that post, I detailed an approach for writing lazy queries on text files.  Well, to tell you the truth, I never write lazy queries on text files – I simply use File.ReadAllLines and write queries over the returned string array.  After all, my computer has a lot of RAM (and probably yours does too), and the CSV files I receive are maybe in the 50K size, and are rarely greater than a few hundred K.  It simply doesn’t matter that I read the entire text file into memory.  If I were to write the exact same two queries today, here is how I would write them:

string[] lines = File.ReadAllLines("TextFile.txt");
var t1 = lines
.Where(l => !l.StartsWith("#"))
.Select(l => l.Split(','))
.Select(items => String.Format("{0}{1}{2}",
items[1].PadRight(16),
items[2].PadRight(16),
items[3].PadRight(16)));
var t2 = t1
.Select(l => l.ToUpper());
foreach (var t in t2)
Console.WriteLine(t);

I have a small extension method (CsvSplit) that I use to split lines that are in CSV format.  I’ve had this method around for a while – it’s not written in the functional style.  Instead, it’s a state machine.  I’ve thought about what it would take to rewrite this method in the functional style, and as far as I know, the only way to do it would be to define a grammar, and maybe write a recursive descent parser.  (Actually, there is another approach, but it would be very inefficient, and the code would be longer and less readable.)  Well the job of parsing CSV files simply isn’t worth the effort!  So a state machine it is.  In any case, the ‘functional impurity’ is local to the function.

The semantics of CsvSplit are:

  • CsvSplit is an extension method on the String class.
  • Only a comma is valid for the separator character.
  • Values can be quoted.  The quotes are trimmed.
  • Quoted values can have, of course, internal commas.  Quoted values can also have internal escape sequences: backslash followed by any character, including quote (\”), backslash (\\) or any other character (\a).
  • CsvSplit will throw an exception for incorrectly formatted strings.

If the CSV file that I receive isn’t in this format, then I just load it into Excel and save so that it is in this format.

If you have this for a source file:

Bob,"Bob said to go to the store."
Mary,"Mary said, \"Whatever.\""
Jim,Jim's quote doesn't contain quotes or commas.

Then you can query the CSV file like this:

var data = File.ReadAllLines("TextFile.txt")
.Select(
l => {
var split = l.CsvSplit();
return new {
Person = split[0],
Quote = split[1]
};
}
);

foreach (var item in data)
Console.WriteLine("{0}:{1}", item.Person, item.Quote);

The function is composable.  If you want to convert the CSV file to an XML file, you can do so like this:

XElement xmlDoc = new XElement("Root",
File.ReadAllLines("TextFile.txt")
.Select
(
line => {
var split = line.CsvSplit();
return new XElement("Quote",
new XElement("Person", split[0]),
new XElement("Text", split[1])
);
}
)
);
Console.WriteLine(xmlDoc);

Here is the listing for CsvSplit (also attached):

publicclassCsvParseException : Exception
{
public CsvParseException(string message) : base(message) { }
}

publicstaticclassMyExtensions
{
privateenumState
{
AtBeginningOfToken,
InNonQuotedToken,
InQuotedToken,
ExpectingComma,
InEscapedCharacter
};

publicstaticstring[] CsvSplit(thisString source)
{
List<string> splitString = newList<string>();
List<int> slashesToRemove = null;
State state = State.AtBeginningOfToken;
char[] sourceCharArray = source.ToCharArray();
int tokenStart = 0;
int len = sourceCharArray.Length;
for (int i = 0; i < len; ++i)
{
switch (state)
{
caseState.AtBeginningOfToken:
if (sourceCharArray[i] == '"')
{
state = State.InQuotedToken;
slashesToRemove = newList<int>();
continue;
}
if (sourceCharArray[i] == ',')
{
splitString.Add("");
tokenStart = i + 1;
continue;
}
state = State.InNonQuotedToken;
continue;
caseState.InNonQuotedToken:
if (sourceCharArray[i] == ',')
{
splitString.Add(
source.Substring(tokenStart, i - tokenStart));
state = State.AtBeginningOfToken;
tokenStart = i + 1;
}
continue;
caseState.InQuotedToken:
if (sourceCharArray[i] == '"')
{
state = State.ExpectingComma;
continue;
}
if (sourceCharArray[i] == '\\')
{
state = State.InEscapedCharacter;
slashesToRemove.Add(i - tokenStart);
continue;
}
continue;
caseState.ExpectingComma:
if (sourceCharArray[i] != ',')
thrownewCsvParseException("Expecting comma");
string stringWithSlashes =
source.Substring(tokenStart, i - tokenStart);
foreach (int item in slashesToRemove.Reverse<int>())
stringWithSlashes =
stringWithSlashes.Remove(item, 1);
splitString.Add(
stringWithSlashes.Substring(1,
stringWithSlashes.Length - 2));
state = State.AtBeginningOfToken;
tokenStart = i + 1;
continue;
caseState.InEscapedCharacter:
state = State.InQuotedToken;
continue;
}
}
switch (state)
{
caseState.AtBeginningOfToken:
splitString.Add("");
return splitString.ToArray();
caseState.InNonQuotedToken:
splitString.Add(
source.Substring(tokenStart,
source.Length - tokenStart));
return splitString.ToArray();
caseState.InQuotedToken:
thrownewCsvParseException("Expecting ending quote");
caseState.ExpectingComma:
string stringWithSlashes =
source.Substring(tokenStart, source.Length - tokenStart);
foreach (int item in slashesToRemove.Reverse<int>())
stringWithSlashes = stringWithSlashes.Remove(item, 1);
splitString.Add(
stringWithSlashes.Substring(1,
stringWithSlashes.Length - 2));
return splitString.ToArray();
caseState.InEscapedCharacter:
thrownewCsvParseException("Expecting escaped character");
}
thrownewCsvParseException("Unexpected error");
}
}

LinqToCsv.cs