How to Create Hierarchy from Flat Data using LINQ

This is a fun, geeky post for those interested in functional programming.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCSometimes you have flat data where there is hierarchy expressed in the data, but the form of the data is flat. You may need to transform this data into a hierarchy, such as an XML tree.

For instance, you may have the following data:

Heading[] headings = new[] {

    new Heading { Level = 1, Text = "Intro" },

    new Heading { Level = 2, Text = "Hello" },

    new Heading { Level = 2, Text = "Summary" },

    new Heading { Level = 1, Text = "Technical Info" },

    new Heading { Level = 2, Text = "Class API" },

    new Heading { Level = 3, Text = "Class1" },

    new Heading { Level = 3, Text = "Class2" },

    new Heading { Level = 2, Text = "Interoperability" },

    new Heading { Level = 3, Text = "Scenario 1" },

    new Heading { Level = 3, Text = "Scenario 2" },

    new Heading { Level = 1, Text = "In Conclusion" }

};

And you want to transform this data into the following XML:

<Root>

  <Heading1Name="Intro">

    <Heading2Name="Hello" />

    <Heading2Name="Summary" />

  </Heading1>

  <Heading1Name="Technical Info">

    <Heading2Name="Class API">

      <Heading3Name="Class1" />

      <Heading3Name="Class2" />

    </Heading2>

    <Heading2Name="Interoperability">

      <Heading3Name="Scenario 1" />

      <Heading3Name="Scenario 2" />

    </Heading2>

  </Heading1>

  <Heading1Name="In Conclusion" />

</Root>

This is easy to do – you create a recursive function that returns the headings for the next level down. This function can use the SkipWhile and TakeWhile extension methods to return the next level down headings. This is certainly not the most efficient way to do it, but I was interested in coding this as fast as possible – I’m only going to run this once, and don’t care very much how long it takes.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Xml.Linq;

 

namespace ConsoleApplication1

{

    class Heading

    {

        public int Level { get; set; }

        public string Text { get; set; }

    }

 

    class Program

    {

        public static IEnumerable<XElement> GetChildrenHeadings(

            IEnumerable<Heading> headingList,

            Heading parent)

        {

            return

                headingList

                    .SkipWhile(h => h != parent)

                    .Skip(1)

     .TakeWhile(h => h.Level > parent.Level)

                    .Where(h => h.Level == parent.Level + 1)

                    .Select(h =>

                        new XElement("Heading" + h.Level,

                            new XAttribute("Name", h.Text),

                            GetChildrenHeadings(headingList, h)

                        )

                    );

        }

 

        public static void Main(string[] args)

        {

            Heading[] headings = new[] {

                new Heading { Level = 1, Text = "Intro" },

                new Heading { Level = 2, Text = "Hello" },

                new Heading { Level = 2, Text = "Summary" },

                new Heading { Level = 1, Text = "Technical Info" },

                new Heading { Level = 2, Text = "Class API" },

                new Heading { Level = 3, Text = "Class1" },

                new Heading { Level = 3, Text = "Class2" },

                new Heading { Level = 2, Text = "Interoperability" },

                new Heading { Level = 3, Text = "Scenario 1" },

                new Heading { Level = 3, Text = "Scenario 2" },

                new Heading { Level = 1, Text = "In Conclusion" }

            };

 

            var toc = new XElement("Root",

                    headings

   .Where(h => h.Level == 1)

                        .Select

                        (

                            h => new XElement("Heading" + h.Level,

                                new XAttribute("Name", h.Text),

                    GetChildrenHeadings(headings, h)

                            )

                        )

                );

 

            Console.WriteLine(toc);

        }

    }

}

Today, I had the need to take the Ecma Open XML spec part 4 and generate a TOC in XML. I used the code from this blog post to return a collection of the paragraphs, and the technique shown above to generate the TOC. Loading the doc takes 7 seconds on my machine, generating the TOC in this manner is less than a second, so even though it is not the most efficient, it is more than efficient enough for my purposes.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Xml.Linq;

using Microsoft.Examples.LtxOpenXml;

using DocumentFormat.OpenXml.Packaging;

 

class Heading

{

    public string StyleName { get; set; }

    public string Text { get; set; }

    public int HeadingLevel { get; set; }

}

 

class Program

{

    public static IEnumerable<XElement> GetChildrenHeadings(

        IEnumerable<Heading> headingList, Heading parent)

    {

        return

            headingList

                .SkipWhile(h => h != parent)

                .Skip(1)

               .TakeWhile(h => h.HeadingLevel > parent.HeadingLevel)

                .Where(h => h.HeadingLevel == parent.HeadingLevel + 1)

                .Select(h =>

                    new XElement(h.StyleName,

                        new XAttribute("Name", h.Text),

                        GetChildrenHeadings(headingList, h)

                    )

                );

    }

 

    public static void Main(string[] args)

    {

        using (WordprocessingDocument doc =

            WordprocessingDocument.Open(

            "Office Open XML Part 4 - Markup Language Reference.docx",

           false))

        {

            // materialize to list for efficiency

            List<Heading> headingList =

                doc.MainDocumentPart

                    .Paragraphs()

                    .Where(p => p.StyleName.StartsWith("Heading"))

                    .Select(p =>

                        {

                            int headingLevel;

                            Int32.TryParse(p.StyleName.Substring(7),

                                out headingLevel);

                            return new Heading

                {

                                StyleName = p.StyleName,

                                Text = p.Text,

                                HeadingLevel = headingLevel

                            };

                        }

                    )

                    .ToList();

 

            var toc = new XElement("Root",

                    headingList

                        .Where(p => p.HeadingLevel == 1)

                        .Select

                        (

                            p => new XElement(p.StyleName,

                                new XAttribute("Name", p.Text),

                                GetChildrenHeadings(headingList, p)

                            )

                        )

                );

 

            toc.Save("toc.xml");

            Console.WriteLine(headingList.Count());

            Console.WriteLine(toc.Descendants().Count());

        }

    }

}

When run, it produces:

<?xmlversion="1.0"encoding="utf-8"?>

<Root>

  <Heading1Name="Part Overview">

    <Heading2Name="WordprocessingML Part Summary" />

    <Heading2Name="SpreadsheetML Part Summary" />

    <Heading2Name="PresentationML Part Summary" />

    <Heading2Name="DrawingML Part Summary" />

    <Heading2Name="Shared Part Summary" />

  </Heading1>

  <Heading1Name="WordprocessingML Reference Material">

    <Heading2Name="Table of Contents" />

    <Heading2Name="Main Document Story">

      <Heading3Name="background (Document Background)" />

      <Heading3Name="body (Document Body)" />

      <Heading3Name="document (Document)" />

    </Heading2>

    <Heading2Name="Paragraphs and Rich Formatting">

      <Heading3Name="Paragraphs">

        <Heading4Name="adjustRightInd (Automatically Adjust Right Indent When Using Document Grid)" />

        <Heading4Name="autoSpaceDE (Automatically Adjust Spacing of Latin and East Asian Text)" />

        <Heading4Name="autoSpaceDN (Automatically Adjust Spacing of East Asian Text and Numbers)" />

        <Heading4Name="bar (Paragraph Border Between Facing Pages)" />

        ...

Code is attached.

CreateHierarchyFromFlat.zip