Regex 101 Exercise S5 – Strip out any non-letter, non-digit characters


An easy one for this holiday week…


S5 – Strip out any non-letter, non-digit characters


Remove any characters that are not alphanumeric from a string.
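As a starting point, here is a minimal sketch of one possible reading of the task. It assumes that "alphanumeric" means only the ASCII letters and digits, which the exercise does not actually state. The class and method names are hypothetical, chosen for illustration.

```csharp
using System.Text.RegularExpressions;

static class AsciiSketch
{
    // Hypothetical helper: removes every character that is not an
    // ASCII letter (A-Z, a-z) or an ASCII digit (0-9).
    // Both letter cases are spelled out so no RegexOptions are needed.
    public static string StripAsciiNonAlphaNumeric(string s)
    {
        return Regex.Replace(s, "[^A-Za-z0-9]", "");
    }
}
```

Under that assumption, "cookie plz." becomes "cookieplz", but "Crème brûlée 2.0" becomes "Crmebrle20": the accented letters are treated as non-alphanumeric and dropped.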


 


 

Comments (9)

  1. Hasani says:

    Regex.Replace("cookie plz.", "[^a-z0-9]", "", RegexOptions.IgnoreCase)

  2. Maurits says:

    Hmmm… are we allowed to assume seven-bit data?

  3. ericgu says:

    You may assume whatever you want.

    You may, however, find out later that your assumptions were wrong…

  4. Maurits says:

    Step 1: write

    string AlphaNumericCharactersOf(string s)
    {
        // TODO: strip non-alphanumeric characters
        return s;
    }

    Step 2: ?

    Step 3: Profit!

    (Apologies to the South Park underpants gnomes 😉)

  5. Maurits says:

    Does .NET support POSIX classes?

    # http://www.gnu.org/software/gawk/manual/html_node/Character-Lists.html#table-char-classes

    # For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them, and if your character set collated differently from ASCII, this might not even match the ASCII alphanumeric characters. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.

    If so…

    using System.Text.RegularExpressions;

    string AlphaNumericContentsOf(string s)
    {
        return Regex.Replace(s, "[^[:alnum:]]", "", RegexOptions.None);
    }

    Should handle Unicode data intelligently.

  6. Jay R. Wren says:

    Regex.Replace(SomeStringToStrip, @"\W", "");

    I like using the \W character class. Basically, the uppercase version means the opposite of \w.

  7. ericgu says:

    Maurits,

    .NET regex doesn’t support POSIX character classes, but it does support Unicode categories through \p{…}, for example \p{L} for letters and \p{Nd} for decimal digits.

    Jay,

    \W is close, but not quite. If you look at the docs, you will find that \w is defined by a set of Unicode categories rather than specific characters, and even in ECMAScript mode it includes "_" in addition to the alphanumerics, so replacing \W leaves underscores in the string.
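
To make the last two points concrete, here is a rough sketch comparing \W with a pattern built from Unicode categories. It assumes that "letter or digit" means the \p{L} and \p{Nd} categories, which is one reasonable reading rather than the official answer. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

static class AlphaNumericSketch
{
    // Removes everything \W matches. Because \w counts "_" as a word
    // character, underscores survive this replacement.
    public static string StripNonWord(string s)
    {
        return Regex.Replace(s, @"\W", "");
    }

    // Keeps only Unicode letters (\p{L}) and decimal digits (\p{Nd}).
    // Underscores, spaces and punctuation are removed. Combining marks
    // are removed too, so decomposed accents would lose their marks.
    public static string StripNonAlphaNumeric(string s)
    {
        return Regex.Replace(s, @"[^\p{L}\p{Nd}]", "");
    }
}
```

For the input "snake_case 42!", StripNonWord gives "snake_case42" while StripNonAlphaNumeric gives "snakecase42". With precomposed accented letters, "Crème brûlée" comes back from StripNonAlphaNumeric as "Crèmebrûlée".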