Home Automation Part 3 - Using Regular Expressions to Parse Sensor Data

 

In the previous installment I managed to set up basic communications with the sensor network and try out some Link45 commands using my Super-Simple Serial Console. In order to actually do anything with the sensor data, I would need a function that could crack the Link45 responses into sensor IDs and temperature and humidity values.

This seemed like a job for System.Text.RegularExpressions. My only problem was that no matter how carefully I constructed my regular expression, there would always be a bug. The edit-compile-test cycle was getting tedious so I decided to take a detour and code up a simple interactive regular expression tester using WinForms. This post describes the regular expression tester and then explains how I used it to develop the regular expression necessary to extract the sensor ID, temperature, and humidity values from the Link45 response string.

An Interactive Regular Expression Tester

My goal was to create a simple application that could show the results of a regular expression in real time as I was editing the expression. I also wanted to prototype the C# code that I would need to navigate the Regex's resulting hierarchy of Matches, Groups, and Captures to pull out the values matched by my regular expression. If you develop a lot of regular expressions you might consider using a more sophisticated and full featured application such as Chris Sells' excellent RegexDesigner.NET.

My simple application provides two text boxes, one for entering sample data and one for the regular expression (see Figure 1). Edits to either the sample data or the regular expression trigger an application of the regular expression, the results of which are shown in the bottom pane in real time.

Figure 1. A very simple interactive regular expression evaluator.

The application is built in WinForms using C#. You can find the complete source code and executable in an attachment at the end of this post. Here's a rough outline of the steps you would need to take to write a similar application:

  1. From Visual Studio, use File -> New Project to create a Visual C# Windows Application.
  2. In Form1.cs [design]
    1. Set the window title to "Interactive Regular Expression Evaluator"
    2. Add a Label control with Text = "Sample Data"
    3. Add a TextBox control named SampleData to hold the sample input string.
    4. Add another Label with Text = "Regular Expression"
    5. Add a TextBox named RegularExpression to hold the text of the regular expression.
    6. Add another Label with Text = "Results"
    7. Add a read-only TextBox named Results to display the results of the expression application.
    8. Set all TextBox anchors to include Top, Left, and Right. Set the anchor for the Results TextBox to include Bottom as well.
  3. Wire up the TextChanged event handlers for SampleData and RegularExpression to invoke a new method called ApplyRegularExpression().
  4. Modify the constructor of Form1 to call the ApplyRegularExpression() method on application startup.
  5. Write the ApplyRegularExpression() method.

The heart of the application is the ApplyRegularExpression() method which is shown in Listing 1. Whenever the user changes either the sample data or the regular expression, the ApplyRegularExpression() method is invoked to update the results panel. Let's walk through the code.

Listing 1. The ApplyRegularExpression() method updates the results panel whenever the sample data or regular expression changes.

    1:  void ApplyRegularExpression()
    2:  {
    3:      try
    4:      {
    5:          Results.Text = "";
    6:          string results = "";
    7:   
    8:          Regex r = new Regex(RegularExpression.Text);
    9:          MatchCollection matches = r.Matches(SampleData.Text);
   10:          for (int i=0; i < matches.Count; ++i)
   11:          {
   12:              Match m = matches[i];
   13:              results += "Matches[" + i.ToString() + "]";
   14:              results += "\r\n";
   15:              for (int j = 0; j < m.Groups.Count; ++j)
   16:              {
   17:                  Group g = m.Groups[j];
   18:   
   19:                  results += "    .Groups[";
   20:                  string name = r.GroupNameFromNumber(j);
   21:   
   22:                  if (name == j.ToString())
   23:                      results += name;
   24:                  else
   25:                      results += "\"" + name + "\"";
   26:                  results += "]\r\n";
   27:   
   28:                  for (int k=0; k < g.Captures.Count; ++k) 
   29:                  {
   30:                      Capture c = g.Captures[k];
   31:                      results += "        .Captures["+k+"]=\"" + c.Value + "\"\r\n";
   32:                  }
   33:                  
   34:              }
   35:              
   36:          }
   37:          Results.Text = results;
   38:          RegularExpression.ForeColor = Color.Black;
   39:      }
   40:      catch (System.Exception e)
   41:      {
   42:          Results.Text = e.Message;
   43:          RegularExpression.ForeColor = Color.Red;
   44:      }
   45:  }

 

The try-catch block (lines 3-44) is used to catch errors that occur during the application of the regular expression. If an error is detected, the error message is shown in the results pane (lines 40-44) and the regular expression text is colored red.

Normally the code enters the try block at line 3. The string called results (line 6) is used to accumulate the text that will go into the results panel. Accumulating the results into a string seems to be more performant than accumulating them directly into Results.Text, probably because each change to Results.Text forces the TextBox to redo its text layout computations. In this code, all the changes are accumulated into results and only at the end is the string copied into Results.Text.

The regular expression is constructed in line 8 and applied to the sample text in line 9, which returns the set of Matches as a MatchCollection. The nested for-loops in lines 10-36, 15-34, and 28-32 walk through the Match/Group/Capture hierarchy, converting it into human-readable text which is accumulated in the results string. The results string is used to set the Text property of the results region on line 37. The only part that is not completely straightforward is the if statement on lines 22-25, which checks whether the Group is named in order to decide between a numerical and a string indexer in the output.
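For experimenting outside of WinForms, the same walk over the hierarchy can be sketched as a plain function. This is only a rough console variant, not the code from the attachment: the names HierarchyDump and Describe are made up for illustration, and it swaps the string concatenation of Listing 1 for a StringBuilder.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical console counterpart to ApplyRegularExpression().
static class HierarchyDump
{
    // Returns a text rendering of the Match/Group/Capture hierarchy.
    public static string Describe(string pattern, string input)
    {
        Regex r = new Regex(pattern);
        StringBuilder sb = new StringBuilder();
        MatchCollection matches = r.Matches(input);

        for (int i = 0; i < matches.Count; ++i)
        {
            Match m = matches[i];
            sb.Append("Matches[").Append(i).AppendLine("]");

            for (int j = 0; j < m.Groups.Count; ++j)
            {
                // Named groups are shown with a quoted string indexer,
                // unnamed groups with their number.
                string name = r.GroupNameFromNumber(j);
                string indexer = (name == j.ToString()) ? name : "\"" + name + "\"";
                sb.Append("    .Groups[").Append(indexer).AppendLine("]");

                Group g = m.Groups[j];
                for (int k = 0; k < g.Captures.Count; ++k)
                {
                    sb.Append("        .Captures[").Append(k).Append("]=\"")
                      .Append(g.Captures[k].Value).AppendLine("\"");
                }
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.Write(Describe("(dog|cat){3}", "dogdogcat"));
    }
}
```

Because Describe() returns a string instead of touching a TextBox, it can be called from a unit test or a scratch console project.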

To understand how to actually extract matched values from an input string, it helps to understand the hierarchical Match/Group/Capture data structure created by the Regex in line 9. The root of the hierarchy is a MatchCollection with one Match for each portion of the input string that matches the entire regular expression string (line 9). Each Match contains one or more Groups (line 15). The first Group always corresponds to the regular expression in its entirety. Subsequent Groups correspond to grouping constructs within the regular expression. A grouping construct is just a parenthesized sub-expression of a regular expression. Grouping constructs are sometimes used to capture the input text matching specific portions of a larger match. As an example, one could use a grouping construct to capture the area code from a larger expression that matches phone numbers:

(\d\d\d)-\d\d\d-\d\d\d\d
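As a small sketch of how that capture could be read from C#, the following uses Regex.Match() and the second Group. The class name, method name, and the phone number are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class AreaCodeExample
{
    // Returns the area code captured by the parenthesized group,
    // or null if the input contains no phone number.
    public static string ExtractAreaCode(string input)
    {
        Match m = Regex.Match(input, @"(\d\d\d)-\d\d\d-\d\d\d\d");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Made-up sample data.
        Console.WriteLine(ExtractAreaCode("Call 425-555-0123 today"));
    }
}
```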

Grouping constructs can also be used when applying quantifiers in regular expressions. As an example, one could match the string "abcabcabc" using a quantifier in combination with an expression that matches "abc":

(abc){3}

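Each Group contains zero or more Captures, one for each time the Group's subexpression matched. A Group inside an optional part of the expression that did not take part in the match has none, while a Group under a quantifier can have several. As an example, when the expression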

(dog|cat){3}

is applied to the string "dogdogcat", the resulting subexpression group will contain three Captures, one each for "dog", "dog", and "cat" (see Figure 2).

Figure 2. A subexpression with a quantifier maps to a group with multiple captures.
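The same result can be checked from a few lines of C#. This is a minimal sketch with made-up class and method names; it collects the Captures of the first parenthesized Group into an array.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureExample
{
    // Returns the value of every Capture recorded for Groups[1] in the first match.
    public static string[] CapturesOfGroup1(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        CaptureCollection captures = m.Groups[1].Captures;

        string[] values = new string[captures.Count];
        for (int i = 0; i < captures.Count; ++i)
            values[i] = captures[i].Value;
        return values;
    }

    static void Main()
    {
        foreach (string value in CapturesOfGroup1("(dog|cat){3}", "dogdogcat"))
            Console.WriteLine(value);
    }
}
```

Note that Groups[1].Value on its own only reports the last Capture ("cat" here), which is why the Captures collection matters for quantified groups.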

As you can see, the Interactive Regular Expression Evaluator makes it easy to quickly examine the Match/Group/Capture hierarchy for a particular combination of sample data and regular expression. I encourage you to follow along in the evaluator as you read through the next section which describes the regular expression that extracts useful values from the Link45 data.

Building An Expression to Extract Sensor ID, Temperature, and Humidity

In the previous installment we saw that a "D" command to the Link45 would trigger a response that looks something like

 2621D453000000F1 19,17.81,64.00,44
262FD453000000E2 19,18.12,64.56,43
EOD

Basically, each sensor returns a line of data of the form

xxxxxxxxxxxxxxxx ss,ccc.cc,fff.ff,hhh

where

  • xxxxxxxxxxxxxxxx is a 16 digit hexadecimal sensor ID
  • ss is the sensor type
  • ccc.cc is the temperature in degrees Celsius
  • fff.ff is the temperature in degrees Fahrenheit
  • hhh is the humidity

After the last sensor, the result set is terminated by a line containing the string "EOD".

Matching the Sensor ID. Let’s start by building the portion of the regular expression that extracts the sensor ID into a named Group called “Id”. We will start by using the [] character class construct to match a single hexadecimal digit. The []-construct will match a single instance of any character that appears between the square brackets. As an example, [abc] would match ‘a’, ‘b’, or ‘c’. We can match a single hexadecimal digit with

[0123456789ABCDEF]

The - operator allows us to specify ranges of Unicode characters in a character class, leading to more compact regular expressions. As an example, [a-z] will match any lower case English letter. Using -, we can write the following equivalent but more compact expression to match a single hexadecimal digit:

[0-9A-F]

Sensor IDs consist of a sequence of 16 hexadecimal digits. We can use the {} quantifier construct to specify the number of adjacent digits to match. The {} construct modifies a regular expression to match multiple copies of its original matching input. As an example, the regular expression a{3} will match “aaa” and is equivalent to the regular expressions aa{2} and aaa. We can match exactly 16 hexadecimal digits with

[0-9A-F]{16}
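A quick way to sanity-check this fragment is Regex.IsMatch(). The sketch below uses a made-up helper name; the first sample string is one of the Link45 lines shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

static class HexIdExample
{
    // True if the input contains a run of 16 upper-case hexadecimal digits.
    public static bool ContainsSensorId(string input)
    {
        return Regex.IsMatch(input, "[0-9A-F]{16}");
    }

    static void Main()
    {
        Console.WriteLine(ContainsSensorId("2621D453000000F1 19,17.81,64.00,44"));
        Console.WriteLine(ContainsSensorId("2621D453"));          // too short
        Console.WriteLine(ContainsSensorId("2621d453000000f1"));  // lower case is not in [0-9A-F]
    }
}
```

The last case is worth keeping in mind: the character class as written is case-sensitive, which is fine as long as the Link45 reports IDs in upper case as in the sample data.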

Now that we can match the sensor ID, we’d like to actually extract it from the input string. I use a parenthesized subexpression to put the sensor ID into a match Group:

([0-9A-F]{16})

Figure 3 shows the match/group/capture hierarchy for this expression when applied to sample data from the Link45. Matches[0].Groups[0] is the match for the entire regular expression. Matches[0].Groups[1] is the match for the parenthesized subexpression. Since the entire regular expression is parenthesized, the two Groups correspond to the same matched substring.

Figure 3. This regular expression extracts the first sensor ID.

Accessing the extracted sensor ID from code is simple:

 System.Console.WriteLine(Matches[0].Groups[1].Captures[0].Value);

This code is somewhat brittle because it assumes that the sensor ID will always be in the second Group. Any change to the regular expression that introduces another subexpression before the sensor ID subexpression will break the code. We can fix this problem by assigning a name to the sensor ID subexpression using the ?<> construct. Here’s an expression that captures a sequence of 16 hexadecimal digits into a Group called “Id”:

(?<Id>[0-9A-F]{16})

The code to access the sensor ID now becomes:

 System.Console.WriteLine(Matches[0].Groups["Id"].Captures[0].Value);

This code is better because it does not depend on the position of the sensor ID subexpression relative to other subexpressions. Now let's extend the regular expression to grab the temperature.
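To see the difference, here is a small sketch (class and method names are made up) that looks up the "Id" Group by name. The second pattern in Main() adds an extra parenthesized subexpression in front of the sensor ID, which would shift a numeric index but leaves the named lookup untouched.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupExample
{
    // Returns the text captured by the group named "Id", or null if nothing matched.
    // The pattern is assumed to define a group called Id.
    public static string ExtractId(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        if (matches.Count == 0)
            return null;
        return matches[0].Groups["Id"].Captures[0].Value;
    }

    static void Main()
    {
        string line = " 2621D453000000F1 19,17.81,64.00,44";

        Console.WriteLine(ExtractId(@"(?<Id>[0-9A-F]{16})", line));
        // An extra group before the ID does not disturb the lookup by name.
        Console.WriteLine(ExtractId(@"(\s*)(?<Id>[0-9A-F]{16})", line));
    }
}
```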

Matching the Temperature and Humidity. I want to use Fahrenheit temperatures so after matching the sensor ID, my expression will need to skip over the sensor type and Celsius temperature before matching the Fahrenheit temperature. Skipping over the sensor type is easy - we just need to skip over a space, a two digit sensor type, and a comma. This can be done by appending

\s\d\d,

to the end of the regular expression for the sensor ID. In this expression, \s denotes a character class that matches any whitespace and \d denotes a character class that matches any decimal digit in the range 0 to 9.

The Celsius temperature is a bit more complex because when the temperature is below freezing, it starts with a minus sign. After that there are 1-3 decimal digits followed by a decimal point followed by two more decimal digits.

We can use the ? quantifier to handle the minus sign. ? modifies an expression to match 0 or 1 instances, so the expression -? matches the minus sign when it's cold but doesn't complain when it's hot.

Earlier I used the {16} quantifier to match exactly 16 hexadecimal digits in the sensor ID. Another form of the {} quantifier allows you to specify an upper and lower bound for the number of matches. A quantifier of the form {x,y} will match at least x, but no more than y instances. Putting it all together, we can match the Celsius temperature with

-?\d{1,3}\.\d{2}

In this expression, the period is preceded by a backslash to indicate a match to a period. Without the backslash, period corresponds to a character class that matches any character.
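The effect of the backslash is easy to check with a short experiment. The sketch below uses a made-up helper, FirstMatch(), and made-up inputs; it compares the escaped expression with a variant where the period is left unescaped.

```csharp
using System;
using System.Text.RegularExpressions;

static class CelsiusExample
{
    // Returns the first substring matched by the pattern, or null if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string escaped = @"-?\d{1,3}\.\d{2}";
        string unescaped = @"-?\d{1,3}.\d{2}";

        Console.WriteLine(FirstMatch(escaped, "17.81"));
        Console.WriteLine(FirstMatch(escaped, "-5.25"));

        // "17081" has no decimal point, so the escaped form finds nothing,
        // while the unescaped period happily matches the '0'.
        Console.WriteLine(FirstMatch(escaped, "17081") == null);
        Console.WriteLine(FirstMatch(unescaped, "17081"));
    }
}
```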

We can use the identical expression to match the Fahrenheit temperature, but we'll add a named subexpression to allow us to capture the value into a group:

(?<Temp>-?\d{1,3}\.\d{2})

After this, matching the humidity is a snap - it is just a pair of decimal digits. There is one catch, though. The sensor network could contain sensors that only report temperature. In this case, the humidity value and the comma that separates it from the temperature would not appear in the Link45 data. We can use the ? quantifier to make the entire humidity expression optional:

(,(?<Humidity>\d{2}))?

This expression is less efficient than it could be because it captures the humidity value twice, once with the comma and once without. The reason is that I had to add a second parenthesized subexpression in order to make the ? quantifier apply to the sequence containing the comma and the humidity. We can improve the expression by using ?: to denote a non-capturing group:

(?:,(?<Humidity>\d{2}))?
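The difference between the two forms shows up in the number of Groups. The following sketch is an illustration only: the class name is made up, and the humidity fragment is appended to the temperature fragment so that there is something for it to follow.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionalHumidityExample
{
    public const string WithCapture  = @"(?<Temp>-?\d{1,3}\.\d{2})(,(?<Humidity>\d{2}))?";
    public const string NonCapturing = @"(?<Temp>-?\d{1,3}\.\d{2})(?:,(?<Humidity>\d{2}))?";

    static void Main()
    {
        // The capturing form has one extra Group for ",44".
        Console.WriteLine(Regex.Match("64.00,44", WithCapture).Groups.Count);
        Console.WriteLine(Regex.Match("64.00,44", NonCapturing).Groups.Count);

        // A temperature-only reading still matches, but the Humidity
        // Group reports Success == false and an empty Value.
        Match m = Regex.Match("64.00", NonCapturing);
        Console.WriteLine(m.Groups["Temp"].Value);
        Console.WriteLine(m.Groups["Humidity"].Success);
    }
}
```

Checking Success on the Humidity Group is the piece to remember when converting the value to a number, since an empty string will not parse as an integer.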

Now we're ready to put it all together and write one giant expression to grab the sensor IDs, temperatures, and humidity values from the Link45 data. We do this by appending the following expression fragments:

  • Match sensor ID: (?<Id>[0-9A-F]{16})
  • Skip space, sensor type, and comma: \s\d\d,
  • Skip Celsius temperature and comma: -?\d{1,3}\.\d{2},
  • Match Fahrenheit temperature: (?<Temp>-?\d{1,3}\.\d{2})
  • Match optional humidity: (?:,(?<Humidity>\d{2}))?

Here's the final, almost incomprehensible expression:

(?<Id>[0-9A-F]{16})\s\d\d,-?\d{1,3}\.\d{2},(?<Temp>-?\d{1,3}\.\d{2})(?:,(?<Humidity>\d{2}))?

Figure 4 shows the expression in action.

Figure 4. The final regular expression applied to real Link45 data.
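If you don't have the evaluator handy, the same check can be run from a console project. This sketch applies the final expression to the two sample lines shown earlier; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class FinalExpressionExample
{
    public const string Pattern =
        @"(?<Id>[0-9A-F]{16})\s\d\d,-?\d{1,3}\.\d{2},(?<Temp>-?\d{1,3}\.\d{2})(?:,(?<Humidity>\d{2}))?";

    // Returns one "id temp humidity" line per sensor found in the response.
    public static string[] Summarize(string response)
    {
        MatchCollection matches = Regex.Matches(response, Pattern);
        string[] lines = new string[matches.Count];
        for (int i = 0; i < matches.Count; ++i)
        {
            Match m = matches[i];
            lines[i] = m.Groups["Id"].Value + " "
                     + m.Groups["Temp"].Value + " "
                     + m.Groups["Humidity"].Value;
        }
        return lines;
    }

    static void Main()
    {
        string response =
            " 2621D453000000F1 19,17.81,64.00,44\n" +
            "262FD453000000E2 19,18.12,64.56,43\n";

        foreach (string line in Summarize(response))
            Console.WriteLine(line);
    }
}
```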

Now that we have the regular expression, let's modify the Super-Simple Serial Console from my previous installment to poll the sensor network and print out the sensor IDs, temperatures, and humidity levels.

Polling the Sensor Network. The goal here is to write a small console application that initializes the sensor network and then every 5 seconds, takes a set of readings and prints them to the console. This application will form the heart of a data logging service that we will implement in a future installment.

It turns out that System.IO.Ports.SerialPort has a feature that allows us to greatly simplify the code from the Super-Simple Serial Console. Recall that the Super-Simple Serial Console needed a bit of multiple-thread coordination to avoid printing the prompt before the Link45 had finished responding. Since our new application only runs the Link45 "D" command, we can take advantage of the fact that all of the Link45 responses will be terminated by the characters "EOD", allowing us to use the serial port's ReadTo() method. 

ReadTo() takes a single string argument and blocks until this string appears in the serial port's input buffer or a read timeout limit has been exceeded. When ReadTo() detects the termination string, it returns the string of characters received up until the termination string. The termination string itself is not returned, but it is removed from the serial port's input buffer. Listing 2 shows the sensor reading application.

Listing 2. This application reads the sensor data and prints it to the console.

    1:  using System;
    2:  using System.IO.Ports;
    3:  using System.Text.RegularExpressions;
    4:  using System.Threading;
    5:   
    6:  namespace SensorConsole
    7:  {
    8:      // SensorConsole reads from the sensor network every 5 seconds and prints the
    9:      // sensor id, temperature, and humidity for each sensor.
   10:      class SensorConsole
   11:      {
   12:          private SerialPort port;
   13:   
   14:          public void Run()
   15:          {
   16:              Console.WriteLine("Temperature and Humidity Logger");
   17:              try
   18:              {
   19:                  Open();
   20:   
   21:                  Regex r = new Regex(@"(?<Id>[0-9A-F]{16})\s\d\d,-?\d{1,3}\.\d{2},(?<Temp>-?\d{1,3}\.\d{2})(?:,(?<Humidity>\d{2}))?");
   22:   
   23:                  while (true)
   24:                  {
   25:                      try
   26:                      {
   27:                          port.Write("D");
   28:                          string s = port.ReadTo("EOD");
   29:   
   30:                          MatchCollection matches = r.Matches(s);
   31:                          foreach (Match m in matches)
   32:                          {
   33:                              Int64 id = Int64.Parse(m.Groups["Id"].Value, System.Globalization.NumberStyles.HexNumber);
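   34:                              float temp = float.Parse(m.Groups["Temp"].Value, System.Globalization.CultureInfo.InvariantCulture);
   35:                              int humidity = m.Groups["Humidity"].Success ? int.Parse(m.Groups["Humidity"].Value) : -1;  // -1: sensor reported no humidity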
   36:                              Console.WriteLine("{0:X}: {1} degrees F and {2}% humidity", id, temp, humidity);
   37:                          }
   38:                          Console.WriteLine();
   39:                      }
   40:                      catch (System.Exception e)
   41:                      {
   42:                          if (e is System.TimeoutException)
   43:                          {
   44:                              Console.WriteLine("Error: timeout reading sensor data.");
   45:                          }
   46:                          else
   47:                              throw;  // rethrow without resetting the stack trace
   48:                      }
   49:   
   50:                      // Wait 5s between sensor readings.
   51:                      Thread.Sleep(5000);
   52:                  }
   53:              }
   54:              finally
   55:              {
   56:                  if (port != null && port.IsOpen)
   57:                      port.Close();
   58:              }
   59:          }
   60:   
   61:          void Open()
   62:          {
   63:              port = new SerialPort("COM1", 9600, Parity.None, 8, StopBits.One);
   64:   
   65:              // The iButton Link45 needs DTR and RTS enabled.
   66:              port.DtrEnable = true;
   67:              port.RtsEnable = true;
   68:   
   69:              // Bump up the read timeout from the previous 500ms to ensure time for 
   70:              // reporting from large sensor networks.
   71:              port.ReadTimeout = 5000;
   72:   
   73:              // The serial port is now configured and ready to be opened.
   74:              port.Open();
   75:   
   76:              // Shock the Link45 to life by sending a break.
   77:              port.BreakState = true;
   78:              port.BreakState = false;
   79:          }
   80:   
   81:          void Close()
   82:          {
   83:              port.Close();
   84:          }
   85:      }
   86:   
   87:      class Program
   88:      {
   89:          static void Main(string[] args)
   90:          {
   91:              SensorConsole sc = new SensorConsole();
   92:              sc.Run();
   93:          }
   94:      }
   95:  }

Let's walk through the code in execution order, starting at the entry point on line 89. Main() just creates a SensorConsole and invokes its Run() method. The body of the Run() method is protected by a try-finally block (lines 17-58) that ensures the serial port is closed before returning from Run().

The first order of business inside the try block is to call the Open() method (lines 61-79), which initializes the serial port. The Open() method is nearly the same as the one in the Super-Simple Serial Console, with two differences. We no longer need to wire up the serial port's DataReceived event because we are using ReadTo(). We also bumped up the serial port's ReadTimeout from 500ms to 5000ms to be on the safe side in case the Link45 takes more than half a second to return data from a large sensor network.

After returning from Open(), the regular expression is built using the Regex class (line 21). Note the @-sign immediately before the regular expression string. The @-sign indicates to the compiler that the following string should be treated literally with no special meaning for the backslashes. If I didn't use the @-sign, I would have had to escape each of the backslashes with "\\" which would have led to an even less comprehensible regular expression.
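A tiny sketch makes the equivalence concrete. It uses the "skip space, sensor type, and comma" fragment from earlier, written once as a verbatim string and once with escaped backslashes; the class name is made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimStringExample
{
    public const string Verbatim = @"\s\d\d,";
    public const string Escaped  = "\\s\\d\\d,";

    static void Main()
    {
        // Both literals produce the same seven characters.
        Console.WriteLine(Verbatim == Escaped);   // prints True

        // So they behave identically as regular expressions.
        Console.WriteLine(Regex.IsMatch(" 19,", Escaped));
    }
}
```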

The main loop spans from line 23 to 52. Right inside the loop a try block (lines 25-48) is used to separate serial port timeout exceptions from other potential problems. As this code morphs into a more sophisticated always-on data logger, I will need to extend this error handling so that the logger stays up and running in the presence of intermittent communication or sensor failures.

Inside the try block, we issue a "D" command to the Link45 on line 27 and then wait for a response on line 28. Assuming we didn't get a timeout exception, control passes to line 30 where our regular expression is applied to the Link45 data. The foreach loop on lines 31-37 walks over each of the Matches, extracting the sensor ID, temperature, and humidity. Since the expression is designed to return a single Capture per Group, I've used the shorthand of m.Groups["Id"].Value instead of the more verbose m.Groups["Id"].Captures[0].Value. The Group values are returned as strings so I pass the ID, temperature, and humidity to Int64.Parse(), float.Parse(), and int.Parse() in order to get the data into the correct types. After converting the strings, we print the values back out to the console (line 36) and head to the top of the foreach loop (line 31) to process the next sensor's results. When we run out of sensors, we print out a blank line (line 38), sleep for 5 seconds (line 51), and then head to the top of the while loop (line 23) to do it all over again.
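One way to try the parsing step without a Link45 attached is to pull it out of the loop into its own function. The sketch below is not part of Listing 2: the Reading class and ReadingParser.Parse() are hypothetical names, a nullable int stands in for a missing humidity value, and the numbers are parsed with the invariant culture on the assumption that the Link45 always uses a period as the decimal separator.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed sensor line.
class Reading
{
    public long Id;
    public float TempF;
    public int? Humidity;   // null when the sensor reports temperature only
}

static class ReadingParser
{
    static readonly Regex Pattern = new Regex(
        @"(?<Id>[0-9A-F]{16})\s\d\d,-?\d{1,3}\.\d{2},(?<Temp>-?\d{1,3}\.\d{2})(?:,(?<Humidity>\d{2}))?");

    // Converts the text returned by ReadTo("EOD") into typed readings.
    public static List<Reading> Parse(string response)
    {
        List<Reading> readings = new List<Reading>();
        foreach (Match m in Pattern.Matches(response))
        {
            Reading r = new Reading();
            r.Id = long.Parse(m.Groups["Id"].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            r.TempF = float.Parse(m.Groups["Temp"].Value, CultureInfo.InvariantCulture);

            // The humidity group is optional, so check Success before parsing.
            Group h = m.Groups["Humidity"];
            r.Humidity = h.Success ? int.Parse(h.Value, CultureInfo.InvariantCulture) : (int?)null;

            readings.Add(r);
        }
        return readings;
    }

    static void Main()
    {
        string response = " 2621D453000000F1 19,17.81,64.00,44\n262FD453000000E2 19,18.12,64.56\n";
        foreach (Reading r in Parse(response))
        {
            string humidity = r.Humidity.HasValue ? r.Humidity.Value + "%" : "n/a";
            Console.WriteLine("{0:X}: {1} degrees F, humidity {2}", r.Id, r.TempF, humidity);
        }
    }
}
```

Keeping the parser free of serial port code also means it can be exercised with canned response strings, such as the temperature-only second line in Main().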

We can now read the sensor network and extract useful fields like sensor ID, temperature, and humidity. Our next step will be to log the sensor values to a SQL database.

Next Installment: Logging sensor values to a database.  


RegExEvaluator.zip