Retrieving the Two Code Groups - VB

There are two groups of paragraphs in our document that are styled as "Code".  The first group contains the C# code that we want to test.  The second group contains a single paragraph that is the output of the code in the first group.  Next in the process of formulating our query, we want to retrieve each block of code as a separate group.

The problem is that the GroupBy extension method doesn't do what we want.  It groups together all items in the collection that share a key, regardless of whether they are separated by other items.  It would join our two groups of code, which we want to keep separate.

For instance, if we amend the code to group the paragraphs, adding one more query to the bottom of our string of queries, as follows:

Dim defaultStyle As String = _
    CStr( _
        ( _
            From style In styleDoc.Root _
                .Elements(w + "style") _
            Where( _
                CStr(style.Attribute(w + "type")) = "paragraph" And _
                CStr(style.Attribute(w + "default")) = "1") _
        ) _
        .First() _
        .Attribute(w + "styleId") _
    )

Dim paragraphs = _
    mainPartDoc.Root _
    .Element(w + "body") _
    .Descendants(w + "p") _
    .Select(Function(p) _
        New With { _
            .ParagraphNode = p, _
            .Style = GetParagraphStyle(p, defaultStyle) _
        } _
    )

Dim r As XName = w + "r"
Dim ins As XName = w + "ins"

Dim paragraphsWithText = _
    paragraphs.Select(Function(p) _
        New With { _
            .ParagraphNode = p.ParagraphNode, _
            .Style = p.Style, _
            .Text = p.ParagraphNode _
                .Elements() _
                .Where(Function(z) z.Name = r Or z.Name = ins) _
                .Descendants(w + "t") _
                .StringConcatenate(Function(s) CStr(s)) _
        } _
    )

Dim groupedCodeParagraphs = _
    paragraphsWithText.GroupBy(Function(p) p.Style)

For Each g In groupedCodeParagraphs
    Console.WriteLine("Group of paragraphs styled {0}", g.Key)
    For Each p In g
        Console.WriteLine("{0} {1}", _
            p.Style.PadRight(12), _
            p.Text)
    Next
    Console.WriteLine()
Next

Then we see:

Group of paragraphs styled Heading1
Heading1 Parsing WordprocessingML with LINQ to XML

Group of paragraphs styled Normal
Normal The following example prints to the console.
Normal This example produces the following output:

Group of paragraphs styled Code
Code using System;
Code
Code class Program {
Code public static void Main(string[] args) {
Code Console.WriteLine("Hello World");
Code }
Code }
Code
Code Hello World

This grouped the "Hello World" with the code, which is not what we want.
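The same behavior can be reproduced without a document.  The following is a minimal sketch, assuming a hypothetical array of style names that stands in for the paragraph styles in document order; the module name and the data are made up for illustration.

```vb
Imports System
Imports System.Linq

Module GroupByDemo
    Sub Main()
        ' Hypothetical style names, in document order: two runs of "Code"
        ' separated by a "Normal" paragraph.
        Dim styles = New String() {"Code", "Code", "Normal", "Code"}

        ' GroupBy collects every item with the same key into one group,
        ' no matter where the item sits in the sequence.
        For Each g In styles.GroupBy(Function(s) s)
            Console.WriteLine("{0}: {1} item(s)", g.Key, g.Count())
        Next
    End Sub
End Module
```

This should print one "Code" group with three items and one "Normal" group with one item: the two separate runs of "Code" have been merged.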

As it turns out, there isn't a standard query operator that does exactly what we want.  We want an operator that groups only adjacent items with a common key.  So let's write one.  In addition to the GroupAdjacent extension method, we need a GroupOfAdjacent class that we can iterate through for each grouping.  It only takes a couple dozen lines of code to implement this.

Unlike the C# version, the GroupAdjacent implementation for Visual Basic is not lazy.  But this really doesn’t impact performance in any noticeable way, even for large documents.

Before this version of GroupAdjacent returns the first group, it iterates through the entire collection, creating a list of lists.

To use GroupAdjacent, we pass it a lambda that selects a key for each item.  Whenever the key changes from one item to the next, the operator starts a new group.  GroupAdjacent then returns a sequence of groups, each of which contains a sequence of items of type TElement.
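To make the calling pattern concrete, here is a rough sketch that uses a simplified stand-in rather than the real operator.  AdjacentRuns is a hypothetical helper written only for this illustration; it returns key/list pairs instead of the GroupOfAdjacent class, and the sample data is made up.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module GroupAdjacentSketch
    ' Hypothetical simplified stand-in for GroupAdjacent: each run of
    ' adjacent items with the same key becomes one key/list pair.
    Function AdjacentRuns(Of T, TKey)(ByVal source As IEnumerable(Of T), _
            ByVal keySelector As Func(Of T, TKey)) _
            As List(Of KeyValuePair(Of TKey, List(Of T)))
        Dim runs As New List(Of KeyValuePair(Of TKey, List(Of T)))()
        For Each item In source
            Dim key As TKey = keySelector(item)
            ' Open a new run on the first item or when the key changes.
            If runs.Count = 0 OrElse _
               Not EqualityComparer(Of TKey).Default.Equals(runs.Last().Key, key) Then
                runs.Add(New KeyValuePair(Of TKey, List(Of T))(key, New List(Of T)()))
            End If
            runs.Last().Value.Add(item)
        Next
        Return runs
    End Function

    Sub Main()
        ' Hypothetical style names, in document order.
        Dim styles = New String() {"Code", "Code", "Normal", "Code"}
        For Each run In AdjacentRuns(styles, Function(s) s)
            Console.WriteLine("{0}: {1} item(s)", run.Key, run.Value.Count)
        Next
    End Sub
End Module
```

With this input the sketch should report three runs (two "Code" items, one "Normal" item, one "Code" item), so the second "Code" run stays separate from the first.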

Here is the listing:

Imports System.IO
Imports System.Xml
Imports System.Text
Imports DocumentFormat.OpenXml.Packaging

Public Class GroupOfAdjacent(Of TElement, TKey)
    Implements IEnumerable(Of TElement)

    Private _key As TKey
    Private _groupList As List(Of TElement)

    Public Property GroupList() As List(Of TElement)
        Get
            Return _groupList
        End Get
        Set(ByVal value As List(Of TElement))
            _groupList = value
        End Set
    End Property

    Public ReadOnly Property Key() As TKey
        Get
            Return _key
        End Get
    End Property

    Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of TElement) _
        Implements System.Collections.Generic.IEnumerable(Of TElement).GetEnumerator
        Return _groupList.GetEnumerator
    End Function

    Public Function GetEnumerator1() As System.Collections.IEnumerator _
        Implements System.Collections.IEnumerable.GetEnumerator
        Return _groupList.GetEnumerator
    End Function

    Public Sub New(ByVal key As TKey)
        _key = key
        _groupList = New List(Of TElement)
    End Sub
End Class

Module Module1
    <System.Runtime.CompilerServices.Extension()> _
    Public Function GroupAdjacent(Of TElement, TKey)(ByVal source As IEnumerable(Of TElement), _
            ByVal keySelector As Func(Of TElement, TKey)) As List(Of GroupOfAdjacent(Of TElement, TKey))
        Dim lastKey As TKey = Nothing
        Dim currentGroup As GroupOfAdjacent(Of TElement, TKey) = Nothing
        Dim allGroups As New List(Of GroupOfAdjacent(Of TElement, TKey))()
        For Each item In source
            Dim thisKey As TKey = keySelector(item)
            ' Start a new group on the first item and whenever the key changes.
            ' EqualityComparer handles Nothing keys and value-type keys safely.
            If currentGroup Is Nothing OrElse _
               Not EqualityComparer(Of TKey).Default.Equals(thisKey, lastKey) Then
                currentGroup = New GroupOfAdjacent(Of TElement, TKey)(thisKey)
                allGroups.Add(currentGroup)
            End If
            currentGroup.GroupList.Add(item)
            lastKey = thisKey
        Next
        Return allGroups
    End Function

    <System.Runtime.CompilerServices.Extension()> _
    Public Function GetPath(ByVal el As XElement) As String
        Return el _
            .AncestorsAndSelf _
            .InDocumentOrder _
            .Aggregate("", Function(seed, i) seed & "/" & i.Name.LocalName)
    End Function

    <System.Runtime.CompilerServices.Extension()> _
    Function StringConcatenate(Of T) _
        (ByVal source As IEnumerable(Of T), ByVal projectionFunc As Func(Of T, String)) _
        As String
        Return source.Aggregate(New StringBuilder, _
            Function(sb, i) sb.Append(projectionFunc(i)), _
            Function(sb) sb.ToString)
    End Function

    Public Function LoadXDocument(ByVal part As OpenXmlPart) _
        As XDocument
        Using streamReader As StreamReader = New StreamReader(part.GetStream())
            ' The variable is named "reader" so that it does not shadow the XmlReader type.
            Using reader As XmlReader = XmlReader.Create(streamReader)
                Return XDocument.Load(reader)
            End Using
        End Using
    End Function

    Public Function GetParagraphStyle(ByVal para As XElement, _
            ByVal defaultStyle As String) As String
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        Dim paraStyle = CStr(para.Elements(w + "pPr") _
            .Elements(w + "pStyle") _
            .Attributes(w + "val") _
            .FirstOrDefault())
        ' Paragraphs without an explicit style use the document default.
        If (paraStyle Is Nothing) Then
            Return defaultStyle
        Else
            Return paraStyle
        End If
    End Function

    Sub Main()
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        Dim filename As String = "SampleDoc.docx"
        Using wordDoc As WordprocessingDocument = _
            WordprocessingDocument.Open(filename, True)
            Dim mainPart As MainDocumentPart = _
                wordDoc.MainDocumentPart
            Dim styleDefinitionPart As StyleDefinitionsPart = _
                mainPart.StyleDefinitionsPart