ConvertFrom-String: Example-based text parsing

 


Intro

I’m sure most of you are familiar with the powerful tools for text parsing available in PowerShell. A presentation at the PowerShell Summit a couple of weeks ago provides a good overview of these and mentions a new Powershell cmdlet, ConvertFrom-String, that was introduced in Windows Management Framework 5.0 Preview September 2014. ConvertFrom-String lets you parse a file by providing a template that contains examples of the desired output data rather than by writing a (potentially complex) script.


A Simple Example

The namesAndCities.input.txt attached to this post contains simple names together with cities, and namesAndCities.namesOnly.template.txt copies the first two records and wraps their names in template markup:

{Name*:Craig Trudeau}

Buffalo, NY

 

{Name*:Merle Baldridge}

Baltimore, MD

 

This defines the extraction of all names from the file. (In this case a single example would have worked due to the distinct formatting of the lines, but in general it is better to supply two examples to give FlashExtract a better idea of the context.) Now let’s run it:

 

gc .\namesAndCities.input.txt | ConvertFrom-String -templateFile .\namesAndCities.namesOnly.template.txt

 

ExtentText                                Name

———-                                —-

Craig Trudeau …                         Craig Trudeau

Merle Baldridge …                       Merle Baldridge

Vicente Saul …                          Vicente Saul

Lydia Parsons …                         Lydia Parsons

Cheryl Booth …                          Cheryl Booth

Shannon Holland …                       Shannon Holland

Libby Stevens …                         Libby Stevens

Thomas Donnelly …                       Thomas Donnelly      

 

The rest of this post describes how ConvertFrom-String works and develops a more full-featured address file and templates for these addresses. I’ll also describe some ways to figure out what to change when you don’t get the results you want.


How ConvertFrom-String Works

ConvertFrom-String is built on top of FlashExtract, a program-synthesis technology developed by Microsoft Research. FlashExtract uses an improved version of the substring-extraction techniques that were developed in Flash Fill, which ships in Excel 2013.  In Flash Fill, those substrings are extracted from one or more source strings and combined into a target string. FlashExtract learns substrings to perform a top-down partitioning of the file into regions that are either nested or completely non-overlapping, and then to extract the contents of some subset of those regions as the desired output strings. In ConvertFrom-String, the regions are defined by examples in the template markup, and the substrings that are extracted by FlashExtract become the values of properties on a sequence of output objects.

The program synthesis in FlashExtract is based upon analyzing the substrings surrounding the beginning and ending of each example region and generating programs that are combinations of various primitive string operations such as regular expressions. For each region, it finds the set of these programs that are consistent with all examples for that region and ranks them. The combination of the best-ranking sub-program for each region becomes the final FlashExtract program.


Defining a Structure and Fields

Now let’s look at a more realistic address file and template. Here are the first two examples in addresses.PersonInfo.template.txt:

{PersonInfo*:{Name:Craig Trudeau}

{Address:{Street:4567 Main St NE}

{[string]City:Buffalo}, {State:NY} {Zip:98052}}

{Phone:(425) 555-0100}}

 

{PersonInfo*:{Name:Merle Baldridge}

{Address:{Street:1234 First Ave}

{City:Baltimore}, {State:MD} {Zip:98101}}

{Phone:(425) 555-0101}}

 

In this template we define a hierarchy with the PersonInfo structure defined at the highest level and within this the individual fields of the structure, including another structure for Address.  In a bit more detail we are defining:

  • Examples of regions that define a sequence of structures named PersonInfo.  The ‘*’ suffix defines a sequence within its parent region. In this case, the parent region is the entire file.
  • An example of a region that defines a non-sequence (there is no ‘*’ suffix) leaf value Name within the PersonInfo structure.
  • An example of a region that defines a non-sequence structure Address within the PersonInfo structure, along with the fields of Address. If all addresses in the file had exactly the same format, only the first PersonInfo definition would be necessary. However, the second Address.Street does not have the “NE” suffix. If we did not define the Address structure and its Street field in the second example, our extracted addresses would only recognize Streets that had such a suffix. By supplying the second example, we tell FlashExtract to be more flexible in its extraction of Street. (In this particular example, simply defining the second PersonInfo structure, even without its Address.Street definition, results in a correct program. However, it is safest to provide the definition for the field, to ensure that changes in other areas such as ranking will continue to give the desired results.)
  • Notice the [string] type cast on City.  This is the default, so it is merely illustrative here. As in PowerShell, you can specify a type cast to any .NET type.  For example, to use Sort-Object on a field with integers, define it with [int] so it is sorted as integers rather than text.


Debugging ConvertFrom-String

Now let’s look at a program that FlashExtract might generate for these two examples.  To do so, pass the -Debug parameter to the ConvertFrom-String cmdlet (“cfs” is the alias for ConvertFrom-String):

gc .\addresses.input.txt | cfs -templateFile .\addresses.PersonInfo.template.txt -Debug

 

Important Note: The -Debug output shown here is specific to the current preview version of ConvertFrom-String and will change in later versions. And as always in PowerShell, text output is not a contract.

Running this gives us 8 programs, one for each field in the template.  Here are the first two:

DEBUG: Property: PersonInfo

Program: EndSSL(ESPL((StartsWith(Left parenthesis(\(), Number([0-9]+(\,[0-9]{3})*(\.[0-9]+)?), Right parenthesis(\)))): 0, 1, …: ε…ε, 0)Line Separator([ \t]*((\r)?\n)

)…Camel Case(\p{Lu}(\p{Ll})+), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+), -1)

————————————————-

Property: Name

Program: ESSL((EndsWith(Camel Case(\p{Lu}(\p{Ll})+), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+))): 0, 1, …: ε…ε, 1 + ε…ε, 0)

 

 

Before we dive into the details, here’s a high-level view of what’s happening. FlashExtract first learns how to recognize the start and end positions of the PersonInfo structure examples. Then it evaluates the subfield examples within each of those structure examples to learn those subfields’ boundaries. In this case, we have two examples of the Address structure, one within each of the two PersonInfo examples.  For each Address example, FlashExtract learns programs to recognize the start and end positions of that Address example within its parent PersonInfo example, then combines these to create a single substring-recognition program that satisfies both Address examples. In the same way, we learn a substring-recognition program for each of the fields Street, City, State, and Zip within Address, and for the field Phone within PersonInfo.

Now let’s look in more detail at the -Debug output, starting with PersonInfo.

A line sequence is a subset of the lines in the file that match certain criteria. A position may be either a constant or a location in a string where the substrings to either side of that location match certain regular expressions.  In this example, for the multiline PersonInfo region FlashExtract first learned the line sequence that identifies the end positions, then learned a function that, for each line in this sequence, backs up to identify the region start positions. (FlashExtract can also learn it in the other direction, first learning the starting position sequence and then a function that moves forward to find the ending position).  In the above program, EndSSL is the function that drives this process, ESPL defines the ending-position line sequence, and after ESPL is the function that maps an ending position to the start position. So the PersonInfo program breaks out as:

EndSSL(

   ESPL((StartsWith(/*area code*/)):

            0, 1, …: // This represents a filter that accepts all

                       // matching lines (starts at first position

                       // and increments by one.

            e…e, 0   // The end position is the last position in

                       // the line.

         )

    Line Separator()… // Find the start position by looking for

                       // a line start

        Camel Case(), WhiteSpace(), Camel Case(), // that is followed

                       // by two "names" separated by whitespace;

                       // the start position is at the beginning of

                       // the first "name".

        -1             // Move backward from the end position to

                             // find the match.

    )

 

Notice that in the comments above, “name” is in quotes. So far our examples assume that a name consists of an initial uppercase letter followed by lowercase letters (sometimes called “proper case”). We’ll see later that this is not always correct.

As mentioned above this output will change but to help you diagnose problems in the meantime here is a list of the sequence-generating functions you might see in the current version:

ESSL: This returns a sequence of (single-line) substrings by finding a sequence of lines and extracting a substring from each line.

EndSSL: This returns a sequence of (possibly multiline) substrings by finding a sequence of ending positions, and for each ending position, finding the starting position.

StartSSL: This returns a sequence of (possibly multiline) substrings by finding a sequence of starting positions, and for each starting position, finding the ending position.

ESPL: This returns a sequence of positions by finding a sequence of lines, and for each line, finding a position within it.

SPL: This returns a sequence of positions. It takes four parameters: re1, re2, init, incr. SPL finds all positions that match regex re1 on its left and match regex re2 on its right. From this sequence it selects every incr’th item starting at index init.

 

Why does FlashExtract break the file into lines?  Consider the following:

$123 one two three four {CapitalLetters*:ABC} five six seven eight

123 put four words here DEF and another four here

123 2001 2002 2003 2004 GHI 2005 2006 2007 2008
$123 eleven twelve thirteen fourteen {CapitalLetters*:JKL} fifteen sixteen seventeen eighteen

 

Here we want to capture CapitalLetters only if the line starts with $. However, learning the start and end positions of CapitalLetters can be done over a much shorter span (“extract an all-capital sequence that is between two lower-case alphabetical sequences”), which would mistakenly capture the DEF line.  By splitting the file into lines, we can use shorter ranges on both line selection and position selection.

Now that we’ve got the outer structure, let’s look at the Name field inside it.

Property: Name

Program: ESSL((EndsWith(Camel Case(\p{Lu}(\p{Ll})+), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+))): 0, 1, …: ε…ε, 1 + ε…ε, 0)

 

This breaks out as:

 

ESSL((EndsWith(/*"name", whitespace, "name")): // find lines that end with this pattern

            0, 1, …: // accept all matching lines

            e…e, 1   // For each matching line, the start position is the first occurrence of an empty string

            +             // (separate start and end positions)

            e…e, 0)  // and the end position is the end of the line

 

Now let’s see this in action.  Add Format-Table to the command line.

PersonInfo                                                                                                                                                              

———-                                                                                                                                                              

{@{ExtentText=Craig Trudeau …   

{@{ExtentText=Merle Baldridge …                     

{@{ExtentText=Los Angeles, CA 98102…

{@{ExtentText=Randolph LaBelle…

{@{ExtentText=Lydia Parsons …

{@{ExtentText=Cheryl Booth …

{@{ExtentText=Shannon Holland …

{@{ExtentText=San Diego, CA 98107…

{@{ExtentText=Hannah McStorey…

{@{ExtentText=Thomas Donnelly …   

 

Notice that we have “Los Angeles” and “San Diego” where we expect names.  These city names contain a space, so they match the beginning position program for PersonInfo.  Let’s provide another example:

{PersonInfo*:Vicente Saul

{Address:2345 Second Ave SE

{City:Los Angeles}, CA 98102}

(425) 555-0102}

 

Because the field we’re concerned about providing another example for is City, we only need to provide its direct hierarchy; we don’t need Name, Street, State, etc.

Now the Format-Table output looks good.  Let’s dig in a bit more with Format-List.  Now we see a couple of incorrect names:

PersonInfo : {@{ExtentText=Randolph LaBelle

             3456 Third Ave

             Fargo, ND 98103

             (425) 555-0183; Name=3456 Third Ave; Phone=(425) 555-0183}}

 

PersonInfo : {@{ExtentText=Hannah McStorey

             8901 Pine St

             Portland, OR 98108

             (425) 555-0108; Name=8901 Pine St; Address=; Phone=(425) 555-0108}}

 

 

As mentioned earlier, learning that a name has an uppercase letter only at the beginning is not always correct.  We’ll add one more example, again defining only the fields necessary to resolve the ambiguity:

 

{PersonInfo*:{Name:Randolph LaBelle}

3456 Third Ave

Fargo, ND 98103

(425) 555-0183}

 

With this, we can see that the full output with Format-Custom (or your favorite formatting command) is correct.  In fact, we can now remove the example for Merle Baldridge because the example we added for Randolph LaBelle has no suffix on the street.

Now let’s see how we can write the examples even more easily.


Defining Implicit Structures

Above, we defined the PersonInfo structure explicitly, with a name and boundaries.  This is not always necessary.  FlashExtract can often infer the boundaries of a parent structure if the first subfield of that structure is defined as a sequence.  In this case, FlashExtract learns an implicit region for the parent structure that extends from the beginning of one instance of the first subfield to just before the beginning of the next instance.  The attached file addresses.ImplicitStruct. template.txt modifies our example above to illustrate this:

 

{Name*:Craig Trudeau}

{Address:{Street:4567 Main St NE}

{City:Buffalo}, {State:NY} {Zip:98052}}

{Phone:(425) 555-0100}

 

{Name*:Vicente Saul}

2345 Second Ave SE

{!Name*:Los Angeles}, CA 98102

(425) 555-0102

 

{Name*:Randolph LaBelle}

3456 Third Ave

Fargo, ND 98103

(425) 555-0183

 

Now there is no PersonInfo field defined, and Name has become a sequence (with the ‘*’ suffix). This also illustrates another aspect of field definitions, a negative example. Without the {!Name*:Los Angeles} definition, “Los Angeles” matches a line-starting expression that will extract it as a name. As before, we need Randolph LaBelle as an example of capitalization inside a name.

The only difference in -debug output from the final template file with PersonInfo is that we don’t have a PersonInfo property here, and Name has a different program:

DEBUG: Property: Name

Program: ESSL((SucceedingStartsWith(Number([0-9]+(\,[0-9]{3})*(\.[0-9]+)?), WhiteSpace(( )+), Camel Case(\p{Lu}(\p{Ll})+))): 0, 1, …: ε…ε, 1 + ε…ε, 0)

 

Our negative example for “Los Angeles” resulted in the best-ranked program changing from looking for a line starting with Camel Case to looking for a line followed by a line that starts with a number.

Although these examples don’t show it, it is possible to define sequences within any parent structure. This includes defining nested sequences within an implicit parent structure:

                {Name*:…}

                                {Phone*:…}

                                {Phone*:…}

                {Name*:…}

 

However, it is often easier to remove ambiguity by defining explicit regions. This is particularly important when there may be zero instances of a child sequence within a parent sequence.


Syntax Summary

As you can probably piece together from the examples above and the release notes, the syntax of a template field specification is:

                {[optional-typecast]namesequence-spec:example-value}

The field specification is enclosed with curly braces. If there are curly braces already in the file they must be escaped by adding a ‘\’ before them (and any ‘\’ characters already in the file must be escaped by doubling them).  Optional-typecast is the usual PowerShell type-cast syntax, a .NET type within square brackets (such as [string] or [int]).  Name is the name of the field.  Sequence-spec is “*” if the field will be a sequence (i.e. will have multiple instances) within its parent, else empty.  All of these are text that is added to the actual data in the template file. “Example-value” is the actual data. This data starts with the character immediately after the “:” following name and ends with the character immediately preceding the field’s closing “}”, including all whitespace.

As illustrated above, fields may be nested; this is done by creating a field specification within the “example-value” of the enclosing field.

ConvertFrom-String does not recognize regular expressions in the field value; it interprets them as the literal string.  For example, this LazyWinAdmin post provides a very nice example of using ConvertFrom-String with the output of netstat.  However, the use of “{State:\s}” works not because it is recognized as a regular expression, but because it tries to match against the literal string “\s”, which happens to result in FlashExtract selecting a program that returns an blank field.  Using “{State:\q}” or “{State:#}” works as well.  The latter, because it does not have an alphabet character, learns a program that avoids a problem in the template that (as of this writing) converts “TIME_WAIT” to “WAIT”; this could also be fixed by adding examples of State with the “_” character.


Common Problems

Many problems are due to over-learning:

·         Specific case.  Sometimes a word will begin with lowercase when your examples are all uppercase, or as above, an uppercase letter will be in the middle of a word.  Or there may be special characters like underscore, apostrophe, or hyphen.

·         A program may learn to search for only a specific number of characters, or for a specific string, if all the examples are the same in certain areas. For example, I recently parsed a file that used spaces for alignment; both my examples had one digit followed by 9 spaces, so it missed lines where 2 digits were followed by 8 spaces.

 

Over-learning can be fixed by adding diverse examples to relax the restrictions. In other cases, as with “Los Angeles” above, the learned program may be too lenient and you may need to specify a negative example.

 

Other things to look for:

·         Be sure that spaces in the template file match those in the data file.  Sometimes there may be trailing spaces or spaces between fields in the original file, and these may be relevant in learning.

·         If you are having trouble getting the correct region boundaries with implicit regions, consider adding explicit regions.

 

We would love to know how ConvertFrom-String meets your needs or where it does not do what you expect.  Please give it a workout and send your examples, problems, suggestions, and other feedback to psdmfb–at–microsoft.com. Happy parsing!

 

Ted Hart [MSFT]

Microsoft Research