Kirk Evans Blog

.NET From a Markup Perspective

Convert CSV to XML

[Update: As Paul Owen notes in the comments to this post, there were issues with the markup as presented in the post.  I removed the formatting and replaced the XSLT to clarify the stylesheet.]

In the SQLXML group on Yahoo! groups, Jonathan Smith asks:

I’m trying to find a way to convert a .csv file to xml using stylesheets and I can’t seem to find anything. I did read about something called fxml or something like that for flat files but I don’t think it’s a standard. Can anyone help me in this regard?

This is certainly doable with XSLT, provided that the data contains characters allowable in XML. If not, you will have some data scrubbing and character escaping to do to represent the same data as XML. In the simple case of “<” characters, this is easily done with character code escapes like &lt;. For high-order values, this is going to require some other scrubbing and character replacement because XSLT cannot transform the data even with character entity references for the characters in unacceptable ranges.

The first order of business is to put a root tag on the data. This will allow us to load the data into a DOM and ensure that the data can be represented as XML.

<root>1,5 main st, ,Cumming, GA, 30040,Kirk Evans
2,13 elm st, ,Anywhere, NJ, 07825,Bob Smith</root>

Once we have an XML document where all of the data is well-formed, we next need to start the transformation process. Logically, we can break the problem into 2 steps. We need to get the rows of data, and then break each row up by its individual fields.

The XPath string function substring-before() can be used to grab all of the data before the first carriage return. The function substring-after() can then be used to grab all of the data after the first carriage return. We just call the same function recursively until the string we are processing no longer has any carriage returns, at which point we add the remaining string to the result tree.

 <!– template that actually does the conversion–>
 <xsl:template name =”texttorows” >
  <!– import $StringToTransform–>
  <xsl:param name =”StringToTransform” select =””” />
  <xsl:choose>
   <!– string contains linefeed–>
   <xsl:when test =”contains($StringToTransform,’&#xA;’)” >
    <!– Get everything up to the first carriage return–>
    <row>
     <xsl:call-template name =”csvtoxml” >
      <xsl:with-param name =”StringToTransform” select =”substring-before($StringToTransform,’&#xA;’)” />
     </xsl:call-template>
    </row>
    <!– repeat for the remainder of the original string–>
    <xsl:call-template name =”texttorows” >
     <xsl:with-param name =”StringToTransform” >
      <xsl:value-of select =”substring-after($StringToTransform,’&#xA;’)” />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!– string does not contain newline, so just output it–>
   <xsl:otherwise>
    <row>
     <xsl:call-template name =”csvtoxml” >
      <xsl:with-param name =”StringToTransform” select =”$StringToTransform” />
     </xsl:call-template>
    </row>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

To break up the columns in each row, we go through the same process. Instead of looking for carriage returns, we look for the first comma. We then call ourselves recursively with the rest of the string until the string contains no more commas.

 <xsl:template name =”csvtoxml” >
  <!– import $StringToTransform–>
  <xsl:param name =”StringToTransform” select =””” />
  <xsl:choose>
   <!– string contains linefeed–>
   <xsl:when test =”contains($StringToTransform,’,’)” >
    <!– Get everything up to the first carriage return–>
    <elem>
     <xsl:value-of select =”substring-before($StringToTransform,’,’)” />
    </elem>
    <!– repeat for the remainder of the original string–>
    <xsl:call-template name =”csvtoxml” >
     <xsl:with-param name =”StringToTransform” >
      <xsl:value-of select =”substring-after($StringToTransform,’,’)” />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!– string does not contain newline, so just output it–>
   <xsl:otherwise>
    <elem>
     <xsl:value-of select =”$StringToTransform” />
    </elem>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

Now that we have the 2 template rules in place, we form the entire stylesheet together.

<xsl:stylesheet version =”1.0″ xmlns:xsl =”http://www.w3.org/1999/XSL/Transform” >
 <xsl:output method =”html” />
 <!– template that matches the root node–>
 <xsl:template match =”/” >
  <root>
   <xsl:call-template name =”texttorows” >
    <xsl:with-param name =”StringToTransform” select =”/root” />
   </xsl:call-template>
  </root>
 </xsl:template>
 <!– template that actually does the conversion–>
 <xsl:template name =”texttorows” >
  <!– import $StringToTransform–>
  <xsl:param name =”StringToTransform” select =””” />
  <xsl:choose>
   <!– string contains linefeed–>
   <xsl:when test =”contains($StringToTransform,’&#xA;’)” >
    <!– Get everything up to the first carriage return–>
    <row>
     <xsl:call-template name =”csvtoxml” >
      <xsl:with-param name =”StringToTransform” select =”substring-before($StringToTransform,’&#xA;’)” />
     </xsl:call-template>
    </row>
    <!– repeat for the remainder of the original string–>
    <xsl:call-template name =”texttorows” >
     <xsl:with-param name =”StringToTransform” >
      <xsl:value-of select =”substring-after($StringToTransform,’&#xA;’)” />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!– string does not contain newline, so just output it–>
   <xsl:otherwise>
    <row>
     <xsl:call-template name =”csvtoxml” >
      <xsl:with-param name =”StringToTransform” select =”$StringToTransform” />
     </xsl:call-template>
    </row>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
 <xsl:template name =”csvtoxml” >
  <!– import $StringToTransform–>
  <xsl:param name =”StringToTransform” select =””” />
  <xsl:choose>
   <!– string contains linefeed–>
   <xsl:when test =”contains($StringToTransform,’,’)” >
    <!– Get everything up to the first carriage return–>
    <elem>
     <xsl:value-of select =”substring-before($StringToTransform,’,’)” />
    </elem>
    <!– repeat for the remainder of the original string–>
    <xsl:call-template name =”csvtoxml” >
     <xsl:with-param name =”StringToTransform” >
      <xsl:value-of select =”substring-after($StringToTransform,’,’)” />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!– string does not contain newline, so just output it–>
   <xsl:otherwise>
    <elem>
     <xsl:value-of select =”$StringToTransform” />
    </elem>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>

The result of the transformation is:

<root>
<row>
<elem>1</elem>
<elem>5 main st</elem>
<elem/>
<elem>Cumming</elem>
<elem> GA</elem>
<elem> 30040</elem>
<elem>Kirk Evans</elem>
</row>
<row>
<elem>2</elem>
<elem>13 elm st</elem>
<elem/>
<elem>Anywhere</elem>
<elem> NJ</elem>
<elem> 07825</elem>
<elem>Bob Smith</elem>
</row>
</root>