How do I write a regular expression that matches an IPv4 dotted address?


Writing a regular expression that matches an IPv4 dotted address is either easy or hard, depending on how good a job you want to do. In fact, to make things easier, let's match only the decimal dotted notation, leaving out the hexadecimal variant, as well as the non-dotted variants.

For the purpose of this discussion, I'll restrict myself to the common subset of the regular expression languages shared by perl, JScript, and the .NET Framework, and I'll assume ECMA mode, wherein \d matches only the characters 0 through 9. (By default, in the .NET Framework, \d matches any decimal digit, not just 0 through 9.)

The easiest version is just to take any string of four decimal numbers separated by periods.

/^\d+\.\d+\.\d+\.\d+$/

This is nice as far as it goes, but it erroneously accepts strings like "448.90210.0.65535". A proper decimal dotted address has no value larger than 255. But writing a regular expression that matches the integers 0 through 255 is hard work because regular expressions don't understand arithmetic; they operate purely textually. Therefore, you have to describe the integers 0 through 255 in purely textual means.

  • Any single digit is valid (representing 0 through 9).
  • Any nonzero digit followed by another digit is valid (representing 10 through 99).

  • A "1" followed by two digits is valid (100 through 199).
  • A "2" followed by "0" through "4" followed by another digit is valid (200 through 249).

  • A "25" followed by "0" through "5" is valid (250 through 255).

Given this textual breakdown of the integers 0 through 255, your first try would be something like this:

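/^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$/

(The parentheses matter: without them, the ^ would apply only to the first alternative and the $ only to the last.)

This can be shrunk a bit by recognizing that the first two rules above could be combined into

  • Any digit, optionally preceded by a nonzero digit, is valid.

yielding

/^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/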
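
As a sanity check, here is a rough sketch (not part of the original discussion) that compares the octet pattern against plain arithmetic for every integer from 0 through 999, with the alternatives wrapped in a group so the anchors cover all of them. The variable names are made up for illustration, and console.log assumes a runtime such as Node.js rather than Windows Script Host.

```javascript
// Octet pattern, grouped so that ^ and $ cover every alternative.
var octet = /^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/;

// Compare the pattern's verdict with plain arithmetic for 0 through 999.
var mismatches = [];
for (var n = 0; n <= 999; n++) {
 var accepted = octet.test(String(n));
 if (accepted !== (n <= 255)) {
  mismatches.push(n);
 }
}
console.log("mismatches: " + mismatches.length);

// Without the group, the anchors bind only to the first and last alternatives,
// so "999" slips through because "99" satisfies ^[1-9]?\d on its own.
var ungrouped = /^[1-9]?\d|1\d\d|2[0-4]\d|25[0-5]$/;
console.log(ungrouped.test("999"));
console.log(octet.test("999"));
```

Note that the arithmetic check is a single comparison, while the textual check needs four alternatives to say the same thing.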

Now we just have to do this four times with periods in between:

/^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/

Congratulations, we have just taken a simple description of the dotted decimal notation in words and converted it into a monstrous regular expression that is basically unreadable. Imagine you were maintaining a program and stumbled across this regular expression. How long would it take you to figure out what it did?

Oh, and it might not be right yet, because some parsers accept leading zeroes in front of each decimal value without affecting it. (For example, 127.0.0.001 is the same as 127.0.0.1. On the other hand, some parsers treat a leading zero as an octal prefix.) Updating our regular expression to accept leading decimal zeroes means that we now have

/^0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/

This is why I both love and hate regular expressions. They are a great way to express simple patterns. And they are a horrific way to express complicated ones. Regular expressions are probably the world's most popular write-only language.
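
If you are stuck with the long pattern anyway, one way to make it a little less write-only is to assemble it from a single octet fragment instead of pasting the fragment four times. The following is only a sketch under that assumption: the names octet and dotted are illustrative, and the resulting expression is intended to be the same one shown above, leading zeroes included.

```javascript
// One octet: optional leading zeroes, then a value from 0 through 255.
// Backslashes are doubled because this is a string, not a regex literal.
var octet = "0*([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])";

// Four octets separated by literal periods, anchored at both ends.
var dotted = new RegExp("^" + [octet, octet, octet, octet].join("\\.") + "$");

console.log(dotted.test("127.0.0.001"));
console.log(dotted.test("448.90210.0.65535"));
console.log(dotted.test("microsoft.com"));
```

This keeps the range rule in one place, but it is still the textual description of 0 through 255, so it does nothing about the underlying problem.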

Aha, but you see, all this time diving into regular expressions was a mistake. Because we failed to figure out what the actual problem was. This was a case of somebody "solving" half of their problem and then asking for help with the other half: "I have a string and I want to check whether it is a dotted decimal IPv4 address. I know, I'll write a regular expression! Hey, can anybody help me write this regular expression?"

The real problem was not "How do I write a regular expression to recognize a dotted decimal IPv4 address." The real problem was simply "How do I recognize a dotted decimal IPv4 address." And with this broader goal in mind, you recognize that limiting yourself to a regular expression only made the problem harder.

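function isDottedIPv4(s)
{
 // Simple pattern: four runs of digits separated by periods.
 var match = s.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
 // Math: each captured string is converted to a number when compared to 255.
 return match != null &&
        match[1] <= 255 && match[2] <= 255 &&
        match[3] <= 255 && match[4] <= 255;
}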

WScript.StdOut.WriteLine(isDottedIPv4("127.0.0.001"));
WScript.StdOut.WriteLine(isDottedIPv4("448.90210.0.65535"));
WScript.StdOut.WriteLine(isDottedIPv4("microsoft.com"));

And this was just a simple dotted decimal IPv4 address. Woe unto you if you decide you want to parse e-mail addresses.

Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
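
The same division of labor carries over to a common variation of the question, pulling addresses out of a larger piece of text. The sketch below is one possible approach rather than a definitive one: the function name extractDottedIPv4 is hypothetical, the simple pattern grabs each maximal run of digits and periods, and ordinary code then checks that a run has exactly four parts with none larger than 255. It assumes that a run with five or more parts (such as 1.2.3.4.5) should be rejected outright, which may not suit every application.

```javascript
function extractDottedIPv4(text)
{
 // Simple pattern: maximal runs of digit groups joined by periods.
 var candidates = text.match(/\d+(?:\.\d+)*/g) || [];
 var found = [];
 for (var i = 0; i < candidates.length; i++) {
  var parts = candidates[i].split(".");
  // Math: exactly four parts, each no larger than 255.
  var ok = parts.length == 4;
  for (var j = 0; ok && j < parts.length; j++) {
   ok = Number(parts[j]) <= 255;
  }
  if (ok) {
   found.push(candidates[i]);
  }
 }
 return found;
}

var sample = "Hosts 1.2.3.4 and 127.0.0.001 are up; " +
             "448.90210.0.65535 and 10.0.0.1.5 are not addresses.";
console.log(extractDottedIPv4(sample).join(", "));
```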

Comments (54)
  1. BryanK says:

    It’s also fairly hard to match a date (in any of the valid ways of creating it — Mm/Dd, Mm/Dd/YY, Mm/Dd/YYYY, Mm-Dd, Mm-Dd-YY, Mm-Dd-YYYY, and don’t get me started on the Dd/Mm variants — oh, and my "syntax" here is that uppercase letters are required, and lowercase are not) in a single regex.

    I had a set of web pages with a client-side validation framework set up using regexes (they worked well for everything else).  Then I had to figure out how to validate a date (and also to ensure it didn’t fall on a weekend!); I quickly decided that I needed to extend the validation framework.  (The validation was repeated on the server end, because clients may not have JS enabled.  But for the ones that did, the client-side validation provided faster feedback.)

    (The framework basically consisted of setting up a JS object whose top-level properties were IDs of textboxes.  The values of these properties were each JS objects in turn, with "regex", "message", and "uppercase" properties ("regex" was the regex to match against, "message" was the message to display if the validation failed, and "uppercase" was whether to convert to uppercase before doing the validation).  Basically I added a "func" property to these objects, that took a function to call with the field being validated; the function could do anything.  If it returned true, the validation succeeded, otherwise it failed.  The function gets called if the "regex" property is undefined.)

    Anyway, yes, regexes are not a panacea.  They are often helpful, though.

    (I’ve seen that page-long regex for matching a valid RFC822 email address somewhere.  Let’s see if I can find it…  Ah yes, here: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html — I had no idea that email addresses could contain comments!)

  2. JimB says:

    "Don’t make operating systems do what they’re not good at…"

    "Don’t make vehicles do what they’re not good at…"

    Nice philosophy, but erroneous.  One has to weigh the benefit of using one tool against the issues when that tool is not the best one for the job.  I won’t argue the specific example here, but the general case is severely flawed.

    If my company operated this way, we’d have our applications written in a half dozen or dozen languages, running on a half dozen platforms in a half dozen locations.    

    The better rule is:

    One must sometimes sub-optimize parts of the item in order to optimize the item as a whole.

    Jim

  3. Richie Hindle says:

    Time for a hackneyed quotation:

    "Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems."  — Jamie Zawinski

  4. Jason says:

    Richie, can you find a single original instance of Zawinski saying that?  It’s something that’s very commonly attributed to him, but from a post in a nonexistent newsgroup (comp.lang.emacs, which doesn’t and hasn’t ever existed).  I only ask because this is one of those things that I have now seen enough to start wondering where it got legs — if it’s really a JWZ posting from the past, or whether it’s just apocryphal.

  5. Sits says:

    Jason:

    (I originally had a long reply here but a tab was accidentally closed.) Yes, JWZ said it, but Google Groups no longer carries the post. (It did, but there was a reshuffle and it disappeared. I’ve mailed Google and they can’t find it either; they suggested that perhaps someone asked for it to be deleted.) I have been unable to find the post in context on the interweb since.

    However when digging again I turned up this post which if the date was right would predate jwz’s:

    "When faced with a problem, some people say ‘Let’s use AWK.’  Now they have two problems."

                   – Zalman Stern

    http://groups.google.com/group/alt.quotations/browse_thread/thread/a008433940861d4e/dcee3d3682dd0470?lnk=st+stern&rnum=2&hl=en#dcee3d3682dd0470

  6. Irate Lout says:

    Raymond, does anybody know why or how CAtlRegExp came to have such bizarrely incompatible syntax? And why some assclown^Warguably misguided individual decided to use that syntax in the DevStudio 2003 IDE? That’s the only thing I really hate about that IDE. Well, that and the fact that if you copy text from MSDN and paste it into a .js file, the IDE pastes the HTML formatted text from the clipboard rather than the plain text (e.g. I just copied "ownerDocument" from the help file that SHIPS WITH THE IDE, and it pasted ‘<A href="xmproownerdocument.htm">ownerDocument</A>’).  But never mind that.

  7. Carlos says:

    Re the jwz quote; it appears on his website:

    http://www.jwz.org/hacks/marginal.html

  8. KeyJ says:

    To my knowledge, the shortest (and arguably fastest) way to check an IPv4 numerical address is parsing it with inet_aton() and check the return value. Or do I miss something?

  9. mikeb says:

    The "two problems" quotation can be found (referring to sed instead of regular expressions) in "The Unix Hater’s Handbook" which was published in 1994:

    http://research.microsoft.com/~daniel/uhh-download.html

    In it the author (Daniel Weise?) does not claim to have come up with the saying – instead the book indicates that he ‘should have remembered that profound truism’.  I bet it was kicking around for a while before ’93.

  10. Simon Cooke says:

    KeyJ – inet_aton() appears to be unix-only.

  11. oldnewthing says:

    KeyJ: That works great if your language/framework lets you call inet_aton in the first place. Also, this question is often asked in the larger context of "I have a string and I want to extract any IP addresses from it." In this case, inet_aton isn’t much help.

  12. when digging again I turned up this post

    This saying apparently predates computers.

    My grandfather likes to say that when people have a problem, they always go get a hammer, and soon they have two problems.

    My mother tells me he’s said this since the fifties, and I’ll bet he heard it sometime before that. He’s a plumber and steelworker. Never touched a computer in his life. He can barely read.

    So I can imagine one Cro-Magnon man saying to another, "most neanderthals have a problem and grab a rock, but then they have two problems."

    Because sometimes you need a stick, too.

  13. Maurits says:

    Followed "\d matches any decimal digit" link…

    Followed "More examples" link…

    http://samples.gotdotnet.com/quickstart/howto/doc/regexcommon.aspx

    Ugh, some of those are horrible.

    The left side of an email address commonly contains many more characters than just \w or -

    Some examples that I have seen: . & + ‘

    Credit cards can have as few as 12 digits (American Express cards frequently do.)

    I’m not even going to comment on the Internet URL one.  I’ll just point out it doesn’t match http://microsoft.com/

  14. j.edwards says:

    Rather humorously, in Google’s just-released Co-op (http://www.google.com/coop) they have an example of matching an IP with a regex (http://www.google.com/coop/docs/guide_subscribed_links.html#special) and they use "(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})".

  15. Cooney says:

    Ick, Ick, Ick

    1. you can’t capture an email address in a regexp. At most, you can exclude invalid addresses.

    2. yeah, and amex splits its numbers up oddly. also, some people don’t add dashes, so ignoring spaces and dashes would allow for a fairly simple regexp.

    3. yeah, the internet URL thing is atrocious – what, all http addresses start with www?

  16. Cooney says:

    "(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})" works fine if you post process with a range check and possibly some sanity checking (like making sure that it isn’t a network address or a broadcast address).

  17. Tim Lesher says:

    Cooney, I think that’s what Raymond meant by "…just use a hammer and a screwdriver": use the regex to mask the dotted fields, then use range-checking to test the fields themselves.

  18. kbiel says:

    There is no math to be done and dotted-decimal IPs are simple patterns.

    Assuming that you are passing in a string that has one IP address per line:

    ((^|\.)(2[0-5]{2}|[01][0-9]{2}|[0-9]{1,2})(?=\.|$)){4}

  19. foxyshadis says:

    Perl 6 has been brewing since early 2001, though, and doesn’t show any signs of nearing completion soon. Nor does PCRE seem to be including Perl 6 regexes. Maybe it just doesn’t have enough appeal to people. Sad, it seemed like a great improvement when I checked it out a while back.

  20. schwiet says:

    IPv4 addresses don’t necessarily have periods in them.

    For instance, you can use http://2637578507 in IE just as well as http://157.54.65.11.

    In win32, you can use gethostbyname with NI_NUMERICHOST to determine if a string can be interpreted as an IP address.

  21. Cooney says:

    For instance, you can use http://2637578507 in IE just as well as http://157.54.65.11.

    Is this part of the standard or just IE weirdness? The only place I’ve ever heard of this is in relation to IE, and it usually has ‘exploit’ somewhere in the sentence.

  22. oldnewthing says:

    It’s not IE weirdness. Linux works the same way:

    $ ping -c 1 2130706433

    PING 2130706433 (127.0.0.1): 56 data bytes

    64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.8 ms

    $ lynx http://2130706433

    (web page appears)

  23. Dean Earley says:

    To Cooney:

    1) Theres only two addresses you can filter out (without local knowledge) as network and broadcast addresses, 0.0.0.0 and 255.255.255.255. Everything else is perfectly valid (despite MS assuming that everything ending in .255 MUST be a broadcast address)

    1) See the MSDN page on inet_addr()

    The unix man page doesn’t mention the allowed formats but does say it is deprecated.

  24. bmm6o says:

    kbiel: that doesn’t match 249.249.249.249, does it?

    A similar problem comes up in practice when using the Google Analytics.  I had to help a coworker write a regex that matched on our assigned IP addresses (to differentiate public traffic from internal traffic).

  25. Maurits says:

    kbiel’s regex is fixable… for example,

    ^((^|\.)(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])){4}$

    is a variant that solves the ^-anchoring issue.

  26. Cooney says:

    Dean:

    Theres only two addresses you can filter out (without local knowledge) as network and broadcast addresses, 0.0.0.0 and 255.255.255.255.

    Actually, it’s more complicated than that – 0.x.x.x is not a valid inet address, except for 0.0.0.0, which means ‘I don’t know my name’

  27. Dean Harding says:

    Maurits: That still matches ".1.1.1.1" for me.

    Besides, my point wasn’t that it’s unfixable, my point is that it is still not a "simple" pattern.

    I think that "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" is going to be "good enough" because I don’t think you should be trying to get a regex to do all of your validation anyway – it should do just enough that you can be sure the "common" errors are taken care of quickly.

  28. Gabe says:

    I’m surprised that nobody has chimed in with specific quantifiers:

    /^0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])(\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])){3}$/

    One of the biggest problems with regexes is that they are inadequate for most parsing tasks, so for quite a while now they have stopped describing only regular languages (those that can be decided with finite state automata). They are creeping slowly towards more advanced grammars (things like backreferences are not possible in regular languages).

    Perl 6 is solving this by replacing the "regular" expression engine with one that can describe push-down automata and can have code embedded in it (sort of like a YACC grammar). This means that you could write the parser more like this:

    rule quad {  (\d<1,3>) :: <($1 < 256)>  };

    $str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /x;

  29. Raptor says:

    Hi, I’ve always been thinking instead of everybody trying to figure out how to do regular expressions why don’t just (someone at MS) publish the regular expressions that are used inside the compilers or .NET? like the one used for:

    Double.Parse(), DateTime.Parse() or, just like this case, IPAddress.Parse()?

    I mean that would be something for sure.  And yes, I know that MS doesn’t necessarily use the regular expression Match() method to parse strings into the required data types.  But PLEASE someone publish them as they are stated in the specification document for the C++, C# or VB.NET compilers, ’cause it must be stated somewhere!

  30. Dean Harding says:

    Raptor: That’s what Reflector is for anyway.

    An interesting aside, most of ASP.NET is parsed using regular expressions. Have a look at the System.Web.UI.TemplateParser class…

  31. Maurits says:

    > Is this part of the standard?

    No — the HTTP/1.1 standard (RFC 2616) depends on RFC 2396, which has this to say:

    hostport      = host [ ":" port ]

    host          = hostname | IPv4address

    hostname      = *( domainlabel "." ) toplabel [ "." ]

    domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum

    toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

    IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit

    port          = *digit

    Note that neither http://2637578507 nor http://66%2e102%2e7%2e99 match this syntax.

  32. Archangel says:

    Dean: I agree. A basic regex that catches silly things like "206.132..123.1" is probably going to be quite adequate; most of the issues are things you can’t save your user from anyway, like them transposing a couple of digits.

    And a regex that’s too gentle is far preferable to one that’s too strict; I’ve had my fill of websites that insist on a single period in a domain name (yes, some of us do live outside the US) or 5-digit postcodes (ditto!) – you can tell that some webmaster read about regular expressions but didn’t realise what a minefield they were getting into.

  33. Dean Harding says:

    kbiel: That matches ".1.2.3.4"

    Regular expressions, I believe, are really only good for "sanity" checking. *Especially* for email addresses when the only REAL way to check for a valid email address is by actually sending an email to it.

  34. Mike says:

    A wise woman once told me her S.O. once told her "Parsers parse". I think this is a prime example of just that.

    Parsers parse.

  35. oldnewthing says:

    Clearly those methods don’t use regular expressions since there’s more to parsing than matching; you have to parse it too!

    class Double {

    public double Parse(string s, NumberStyles styles)

    {

     Match match = Regex.Match(…, s);

     if (match.Success) {

       return /* what goes here? */

        Double.Parse(match.Value, styles);

     }

     …

    }

    }

  36. IanA says:

    Any use for sscanf here? .."%hu.%hu.%hu.%hu"..?

  37. HA HA HA says:

    IANA – no. theres no use for sscanf anyhwaer. evar. unles u coutnt dalibrate sabotage.

  38. Maurits says:

    Dean: hmmm, you’re right, it matches .1.1.1.1

    But it no longer matches:

    Once upon a time in the faraway land of 1.2.3.4 there lived a magical wizard…

    There’s probably a way to use lookahead to match a digit after the first ^ and fix the leading-. issue.

  39. Dewi Morgan says:

    "There’s probably a way to use lookahead to match a digit after the first ^ and fix the leading-. issue."

    Yup. I’ve used \b instead of ^ here, done a forward positive assertion that the first character is a number, and a backward negative assertion that the preceding character was not a dot. So, it’d match the first four octets of a longer string of octets, like so:

    1.2.3.4.5.6.7.8 – would match 1.2.3.4

    \b(?<!\.)(?=\d)((^|\.)(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){4}$

    Of course, this still doesn’t really work, and still needs program logic, since it doesn’t fail on invalid ranges 0.0.0.0/24, 127.0.0.0/24, 224.0.0.0/29 and known broadcast addresses.

    Some apps will also want to fail on private ranges: 10.0.0.0/8, 192.168.0.0/16, and 172.16.0.0/12.

    Personally, I’d do as Raymond suggests: break it into bytes. If it looked like valid IP syntax, I’d keep a sorted table of known-invalid ranges, and use a binary search to find if the given IP was valid.

    But that’s just me.

  40. kbiel says:

    Yes, Maurits, look-ahead and look-behind would have solved the problems and kept it uncomplicated.  That was how I started, but I had to rearrange and strip out the look-behind to be ECMA compliant.  It appears that I failed to account for one case, but in the whole it works well as a sanity check with your changes.

    Of course, if you are already inside of a .Net assembly, you might use a Regex to help parse the a string or just to check it but at some point you will have to actually write code to parse it if it is going to be any use to you.

    I very much like your hammer and screwdriver analogy.  On the other hand, sometimes you are only given a hammer (javascript) and it works to some degree on screws too if you pound hard enough (sanity checking values before POST/GET).

  41. Nick Lamb says:

    It’s hard to guess what this could possibly be intended for, but any range checks are unwise in the general case. IPv4 addresses really are just 32-bit values, and with a very small number of exceptions (0.0.0.0 being one of them) they are only as magic as the administrators of the network choose to make them.

    e.g. Dewi mentioned 224.0.0.0/29 but that’s not invalid at all, it’s actively in use for multicast traffic. If Dewi’s hypothetical "fail on invalid ranges" code were used in a DNS configuration manager, or a router traffic log then he’d have just removed multicast support from software that didn’t need any extra work to support it. Outstanding.

  42. Everyone’s forgetting that the base requirement was that the pattern match IPV4 addresses, not that it doesn’t match strings which are not IPV4 addresses!!!

    .*

  43. Maurits says:

    I see two broad possible uses here…

    1) Given a string, determine whether it is an IP address

    2) Extract IP addresses from a larger chunk of text

    The two questions go together, but are subtly different.

    # returns 1 for yes, 0 for no

    sub is_ip4($) {

    my $purported_ip = shift;

    $purported_ip =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/ or return 0;

    my @octets = ($1, $2, $3, $4);

    for my $octet (@octets) { $octet < 255 or return 0; }

    return 1;

    }

    # returns an array of things that, at first sight, look like IPs

    # in the order they first appear in the text

    sub extract_ips_naively($) {

    my $text = shift;

    my @purported_ips = ();

    my %seen = ();

    while ($text =~ s/(\d+\.\d+\.\d+\.\d+)//) {

    my $purported_ip = $1;

    unless ($seen{$1}++) {

    push @purported_ips, $purported_ip;

    }

    }

    return @purported_ips;

    }

    # returns an array of IPs

    sub extract_ips_fershure($)

    {

    my $text = shift;

    return grep { is_ip4($_) } extract_ips_naively($text);

    }

    my $text = join "", <STDIN>;

    print "Naive:\n";

    print join "\n", extract_ips_naively($text);

    print "\n\nValidated:\n";

    print join "\n", extract_ips_fershure($text);

  44. Maurits says:

    Oops, should be $octet <= 255, not $octet < 255

  45. work2do says:

    Why can’t MS invent a new EASY expression-matching language?

  46. Chris Brien says:

    I just saw http://makingflan.blogspot.com/2006/03/regular-expressionescence.html

    The type system described, being basically a recursive descent parser, makes it easy to match an IP address. It boils down to the one line of code:

    @ipAddress = (/str0to255,’.’):3, /str0to255;

    where the type str0to255 is defined as code:

    is @string0to255( [digit:1..3] @s )

    {

    ..s.toNumber().is( >=0 <=255 );

    }

    Which is a pretty neat way to do it, I think.

  47. Gerhard Poul says:

    I guess you’d not have thought how short the regex for parsing e-mail addresses really is, huh? :-)

    http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

  48. Norman Diamond says:

    Friday, May 26, 2006 5:35 AM by Chris Brien

    > It boils down to the one line of code:

    > @ipAddress = (/str0to255,’.’):3, /str0to255;

    > where the type str0to255 is defined as code:

    > is @string0to255( [digit:1..3] @s )

    > {

    > ..s.toNumber().is( >=0 <=255 );

    > }

    Let’s uncount comments and braces.

    > @ipAddress = (/str0to255,’.’):3, /str0to255;

    > is @string0to255( [digit:1..3] @s ) { ..s.toNumber().is( >=0 <=255 ); }

    Two is still respectable, it’s just not one.

    Meanwhile…

    > As commenter Maurits put it, "The trick is

    > not to spend time developing a combination

    > hammer/screwdriver,

    If all you have is a combination hammer/screwdriver, then everything starts looking like a combination thumb/eye.
