Regex 101 Exercise S4 - Extract load average from a string - Discussion

Exercise S4 - Extract load average from a string

The shop that you work with has a server that writes a log entry every hour in the following format:

8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13

You need to write a utility that lets you track the load average on an hourly basis. Write a regex that extracts the time and the load average from the string.

****

This is pretty close to the first thing I ever did with regular expressions. I had some logfile information I needed to process. I started writing in C++, and if you've ever tried to do lots of character manipulation in C++, you know how much fun that can be.

For this sort of thing, I like to look for good delimiters. To get the time, I'll use "up" as the delimiter, which means I can match with:

.+\s*up

The \s is something new: it means "any whitespace character". Next, I need to pull out the load average. I'll use "load average:" as the delimiter, so the regex to pull that out is:

load average:\s*[0-9.]+

and I can string them together to get:

.+\s*up # match time
.+? # skip middle section
load\ average:\s*[0-9.]+ # match load average

I added the middle clause to skip the characters in the middle that I don't care about. I also split the pattern across several lines and added comments, which means that I need to use RegexOptions.IgnorePatternWhitespace (this is a different option from RegexOptions.Multiline, which only changes how ^ and $ behave). That required me to change "load average" to "load\ average" so that the regex engine wouldn't ignore the space (after I stared at it for a minute, wondering why it wasn't working...).
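Here is a minimal C# sketch of running that pattern, assuming a console program with top-level statements (the variable names are mine, chosen for illustration). It also shows what happens to an unescaped space when IgnorePatternWhitespace is on.

```csharp
using System;
using System.Text.RegularExpressions;

string entry = "8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13";

// The pattern is laid out over several lines with # comments, so it must be
// used with RegexOptions.IgnorePatternWhitespace.
string pattern = @"
    .+\s*up                     # match time
    .+?                         # skip middle section
    load\ average:\s*[0-9.]+    # match load average
";

Match match = Regex.Match(entry, pattern, RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(match.Success);
Console.WriteLine(match.Value);

// With the same option, an unescaped space in the pattern is dropped, so this
// effectively searches for "loadaverage:" and does not match the log entry.
bool unescaped = Regex.IsMatch(entry, "load average:", RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(unescaped);
```

The second check is the trap described above: the space has to be escaped (or written as \s) to survive the option.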

If I run this in Regex Workbench, it will report:

    0 => 8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13

That tells me that the match worked, but not much else. What I need is a way to extract certain parts of the string, which is done with a "capture" in the regex language. The simplest form of a capture is done by enclosing part of the regex in parentheses:

(.+)\s*up # match time
.+? # skip middle section
load\ average:\s*([0-9.]+) # match load average

Executing that gives:

    0 => 8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13
    1 => 8:00 am
    2 => 0.13

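The first capture (index 0) is always the entire match, and then subsequent captures correspond to the portions of the match enclosed in parentheses. (Strictly speaking, because .+ is greedy, capture 1 also includes the space before "up"; trim the value, or use .+? instead, if that matters.) In code, if I wanted to pull the time out, I would write something like: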

string time = match.Groups[1].Value;
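For context, here is a fuller sketch of where that line would sit, assuming a console program with top-level statements. Checking match.Success first and the use of Trim() are my additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string entry = "8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13";
string pattern = @"
    (.+)\s*up                       # group 1: time
    .+?                             # skip middle section
    load\ average:\s*([0-9.]+)      # group 2: load average
";

Match match = Regex.Match(entry, pattern, RegexOptions.IgnorePatternWhitespace);

// Groups read from a failed match are just empty strings, so check Success first.
if (!match.Success)
{
    throw new FormatException("Unrecognized log entry: " + entry);
}

string whole = match.Groups[0].Value;         // index 0 is the entire match
string time = match.Groups[1].Value.Trim();   // greedy .+ also takes the space before up
string loadAverage = match.Groups[2].Value;

Console.WriteLine(time + " -> " + loadAverage);
```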

That works fine. I could declare victory, but I don't really like the "Groups[1]" part - it doesn't tell me much. Nicely, the .NET regex variant (like some others) provides a way to name captures. That allows me to write:

(?<Time>.+)\s*up # match time
.+? # skip middle section
load\ average:\s*(?<LoadAverage>[0-9.]+) # match load average

Running that gives me:

    0 => 8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13
    Time => 8:00 am
    LoadAverage => 0.13

and I could now write code that looks like:

string time = match.Groups["Time"].Value;

which is very clear - clear enough that I often will not bother with the local variable.
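To tie this back to the hourly-tracking utility, here is a rough sketch of how the named captures might be wrapped up. TryParseEntry is a hypothetical helper name of mine, and converting the load average to a double is an assumption about what the utility would want to do with it.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Build the Regex once and reuse it for every log entry.
Regex entryRegex = new Regex(@"
    (?<Time>.+)\s*up                            # match time
    .+?                                         # skip middle section
    load\ average:\s*(?<LoadAverage>[0-9.]+)    # match load average
", RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper (not from the original post): returns false when the
// entry does not match or the load average is not a valid number.
bool TryParseEntry(string entry, out string time, out double loadAverage)
{
    time = "";
    loadAverage = 0.0;

    Match match = entryRegex.Match(entry);
    if (!match.Success)
    {
        return false;
    }

    time = match.Groups["Time"].Value.Trim();
    // [0-9.]+ would also accept text such as 1.2.3, so parse defensively.
    return double.TryParse(match.Groups["LoadAverage"].Value,
                           NumberStyles.Float, CultureInfo.InvariantCulture,
                           out loadAverage);
}

if (TryParseEntry("8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13",
                  out string parsedTime, out double parsedLoad))
{
    Console.WriteLine(parsedTime + ": " + parsedLoad.ToString(CultureInfo.InvariantCulture));
}
```

Parsing with CultureInfo.InvariantCulture keeps the "." decimal separator working on machines whose culture uses a comma.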

That gets us to where I wanted to get. You may have noticed that I didn't try to validate the time, nor did I use anchors for the beginning and end of the string. In this example, I'm dealing with well-formed text - the server log is always going to look the way that it does - and it's not worth the effort or complexity to do more than what I did.
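For comparison only, if the input were less trustworthy, a stricter variant might look something like the sketch below. The time format it enforces is an assumption based on the single sample entry, and the malformed line is invented purely to show a rejection.

```csharp
using System;
using System.Text.RegularExpressions;

// A stricter, anchored variant; the exact time format is an assumption based
// on the one sample entry in the exercise.
Regex strictRegex = new Regex(@"
    ^(?<Time>[0-9]{1,2}:[0-9]{2}\s[ap]m)     # time such as 8:00 am, at the start
    \s+up\s                                  # the up delimiter
    .+?                                      # skip middle section
    load\ average:\s*
    (?<LoadAverage>[0-9]+\.[0-9]+)$          # load average, at the end
", RegexOptions.IgnorePatternWhitespace);

Match good = strictRegex.Match("8:00 am up 23 day(s), 21:34, 7 users, load average: 0.13");
Console.WriteLine(good.Groups["Time"].Value);
Console.WriteLine(good.Groups["LoadAverage"].Value);

// A deliberately malformed line, made up here only to show the rejection.
// The looser pattern would accept it, taking the text before "up" as the time.
Match bad = strictRegex.Match("startup complete, load average: 0.13 (approx)");
Console.WriteLine(bad.Success);
```

The cost is a longer pattern that has to change whenever the log format does, which is the trade-off described above.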