Generating initials from a name is trickier than you think


Even though I'm signed in, the page claims that anonymous comments are not allowed, so I'm reduced to posting my comment here and generating a trackback. Some time ago, Robert McLaws wrote a function that generates initials from a name. Let's set aside completely the issue of non-U.S. names; the function doesn't even handle U.S. names correctly.

Given Cal Ripken, Jr., the function comes up with the initials JCR, which is decidedly suboptimal.
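
To make the failure concrete, here is a minimal sketch of how a function might arrive at JCR. It is not Robert McLaws's actual code. The names `naive_initials`, `strip_suffix`, `suffix_aware_initials`, and the `SUFFIXES` table are all made up for illustration, and the sketch assumes the function treats any comma as a "Last, First" separator.

```python
# Hypothetical sketch; this is NOT the function discussed in the post.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}  # illustrative, far from complete


def naive_initials(name: str) -> str:
    """Assume any comma means 'Last, First' and take each word's first letter."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return "".join(word[0].upper() for word in name.split())


def strip_suffix(name: str) -> str:
    """Drop one trailing generational suffix such as 'Jr.' or 'III'."""
    words = name.split()
    if len(words) > 1 and words[-1].rstrip(".").lower() in SUFFIXES:
        return " ".join(words[:-1]).rstrip(", ")
    return name


def suffix_aware_initials(name: str) -> str:
    """Remove the suffix first, then apply the naive rule."""
    return naive_initials(strip_suffix(name))


print(naive_initials("Cal Ripken, Jr."))         # JCR
print(suffix_aware_initials("Cal Ripken, Jr."))  # CR
print(suffix_aware_initials("van Eyck, Jan"))    # JVE
```

A suffix table only moves the problem: whether "van Eyck, Jan" should yield JVE or JE is still a matter of personal preference that no table can settle.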

Comments (31)
  1. Jeff says:

    So…. um, the point is…? Some of the code you find on the net is crap? Don’t blindly cut’n’paste from your browser to your code editor?

    Is this a real post? Has someone hacked Raymond?

    [Please reread the opening sentence. -Raymond]
  2. SM says:

    It looks like the comment got severely truncated at the McLaws page.  And even stranger, when you click on the username (Noticias externas? is that you?) it goes to this posting third page, geeks.ms, which seems to be a Spanish blog.  Now I’m stumped.

    I think I get the trackback idea, but you seem to have stepped into a more advanced realm of blogging with which I’m unfamiliar.  ;)

    [Noticias is one of many sites that pretend to be Microsoft bloggers. -Raymond]
  3. Igor Levicki says:

    >Public Shared Function BrainDump(ByVal dotNet As String) As [Value]<<

    Jesus… passing brain content by value… what kind of bandwidth that would need?

    Seems that Mr. McLaws can get away with it though.

  4. Maurits says:

    Cal Ripken, Jr.

    Hmmm… good point.  How could the function be fixed, without breaking "van Eyck, Jan"?

    Perhaps a table of well-known suffixes (I had to deal with contact management software in a previous job and there is quite a long list of such suffixes)

  5. Igor Levicki says:

    @Maurits:

    The Microsoft Office installer solves the problem by suggesting initials and allowing the user to modify them.

  6. Keithius says:

    And that’s generally the best way to do it. No matter how clever your algorithm is, it will fail some of the time. Better to let the user correct the mistake (assuming it happens fairly rarely) than to force the user to accept incorrect values.

    Now, getting to "fails rarely" is a whole other ball of wax…

  7. Hieronymous Coward says:

    "No matter how clever your algorithm is, it will fail some of the time."

    Exactly. It is impossible to predict if Gregory Kenneth Van Horn prefers to go by GKVH or GKV or GVH.

  8. anon says:

    "Exactly. It is impossible to predict if Gregory Kenneth Van Horn prefers to go by GKVH or GKV or GVH."

    Actually, he prefers to go by Steve.

  9. anon2 says:

    "Actually, he prefers to go by Steve."

    In college he went by Horny, Horndog, and sometimes Hornmeister.  Funny thing is, Steve is actually a more suitable name to call him by, because he’s not a very memorable person.  You’d think everyone would remember meeting a guy with the name Gregory Kenneth Van Horn or a guy everyone called Horny, Horndog, or Hornmeister but no.  If you ask people if they remember him they always draw a blank – just as if you’d asked whether they remembered meeting a generic Steve.

  10. Yury Shatz says:

    It is of course impossible to get it right in 100% of cases. The moment you teach your program that Van is a Dutch prefix and a part of the last name, you meet a Korean named Kim Van U.

    However, Jr., II, III, etc. are easy to capture, and bring you from 90% to 95% right; that is, they make your little function twice as good as it was.

  11. Xepol says:

    I see, and by wrong, you can point to a definitive set of rules for creating initials?

  12. Igor Levicki says:

    >I see, and by wrong, you can point to a definitive set of rules for creating initials?<<

    Yes, present an on-screen keyboard and ask user to punch them in :)

  13. Tom Smith says:

    To be fair, he was writing from a UK perspective ("Scottish like me"), where "Jr." isn’t an issue. I can’t think offhand of a (traditional) UK name where his function would fail; although it could probably do with a tweak for Irish O’Connell-type surnames.

  14. Maurits says:

    The "ask the user" suggestions don’t go very well when you’re trying to parse a ten-million-name flat text file.

    Although even then there’s a germ of truth.  It is entirely reasonable to have a "dunno" return value that can be used to filter interesting names into a smaller file for human review.

  15. Zathrus says:

    My mother’s name is what I use to shoot down name parsing algorithms. Mary Ann Louise Smith  — yes, her first name is "Mary Ann". I’ve yet to see any kind of name parsing algorithm that can even attempt to handle that along with the more common "Foo Baz van Bar".

    Maurits has a reasonable suggestion with a sentinel value; the vast majority of names should be handled trivially.

  16. Randall says:

    Say what you will about Perl, but it’s great for text processing and there’s some crazy stuff in CPAN.

    Lingua::EN::NameParse will parse a huge variety of names and name formats (though not *everything*, I’m sure).  There’s even an option to handle titles like General or Mother Superior.  It’s based on a recursive descent grammar for names.

    Click my name if you want to see the module docs.

  17. Randall says:

    For giggles, I ran the names through Lingua::EN::NameParse.  None of this is meant to argue against anyone’s claims that name parsing is hard and impossible to get exactly right; I just wanted to test the module.

    Hope the linebreaks come through right — if not, the gist is that Mary Ann does confuse the module, and it says there was a parsing error and "Smith" was not matched; Foo Baz van Bar doesn’t confuse it and is parsed correctly.  "Cal Ripken, Jr." confuses it unless the module’s auto_clean option is turned on to strip the comma; in that case, it parses the suffix out right.

    Input             : Foo Baz van Bar
    conjunction_1     :
    conjunction_2     :
    given_name_1      : Foo
    given_name_2      :
    initials_1        :
    initials_2        :
    middle_name       : Baz
    precursor         :
    suffix            :
    surname_1         : Van Bar
    surname_2         :
    title_1           :
    title_2           :
    Name type         : John_Adam_Smith

    Input             : Mary Ann Louise Smith
    conjunction_1     :
    conjunction_2     :
    given_name_1      : Mary
    given_name_2      :
    initials_1        :
    initials_2        :
    middle_name       : Ann
    precursor         :
    suffix            :
    surname_1         : Louise
    surname_2         :
    title_1           :
    title_2           :
    Name type         : John_Adam_Smith
    Parsing Error     : Yes
    Non matching part : Smith

    Input             : Cal Ripken Jr.
    conjunction_1     :
    conjunction_2     :
    given_name_1      : Cal
    given_name_2      :
    initials_1        :
    initials_2        :
    middle_name       :
    precursor         :
    suffix            : Jr
    surname_1         : Ripken
    surname_2         :
    title_1           :
    title_2           :
    Name type         : John_Smith

  18. Nawak says:

    For the "Jr." "II." "III." etc, wouldn’t it be enough to "see" the dot at the end? ("van Eyck, Jan" has none)

    Of course it would just increase the correctness ratio, not put it at 100% because I am sure there are cases where people would put a dot at the end for just another reason or no good reason at all.

    Find a good algorithm with good heuristics (=make one and improve it by testing it ‘medium scale’), and just allow people to easily modify when your algorithm is wrong (not necessarily at creation time btw, else you cannot batch process lists of names, but allow users to edit their automatically created profile)

  19. Jonathan says:

    How would "The Hulk" parse?

  20. Leo Davidson says:

    I’m struggling to think of a situation where you’d want to generate initials from a list of thousands/millions of names. Maybe I’m not imaginative enough.

    If it’s for usernames, document edits, or similar then you need the generated strings to be unique. There will be many collisions with a long list of names so you’ll have to add numbers (or something) to what you generate. In which case the result is ugly, strange and "not my initials" whether or not the initials are well chosen.

  21. Steve says:

    @Leo Davidson : yes there are some situations :)

    One approach could be to have a dictionary of known first names (and another one of abbreviations if needed). Then from this you can re-order the names if needed and generate the initials you prefer (as for "Firstname Lastname" like "FL" "FLE" "FLS"). Indeed, you need to update the algorithm for local cultures… That’s quite a fun development.

  22. Porges says:

    If you want to see insane name-parsing code, go look at BibTeX ;)

  23. SCB says:

    The task is further complicated by the fact that, for example, Bill Gates’ first initial should be W (as in William), not B.

  24. Tanveer Badar says:

    You mean you read Robert McLaws. Such a time waste!

  25. Zathrus says:

    "I’m struggling to think of a situation where you’d want to generate initials from a list of thousands/millions of names."

    The typical case isn’t for initials, but taking a single name field and splitting it up into First/Middle/Last fields in a database.

    You might want the initials autogenerated for the same purpose, although I can’t really think of a good reason to do so.

    "the gist is that Mary Ann does confuse the module"

    Heh, I suspected it would. Of course, to be fair, she’s the only multiple-word first name that I’ve ever run across (either in person or in the dreaded single->multiple name case above; which I’ve done on millions of records before), so it’s not too surprising.

    Don’t suppose this is the Randall most famously associated with Perl, is it? (If you ever do read this)

  26. jondr says:

    Slightly OT, but the algorithms that scan for names and addresses for junk mail sometimes are really off in a cute way.  I once worked at the U.S. Fish and Wildlife Service.  We got a letter from Ed McMahon that used the greeting:

    U.S. FISH AND WILDLIFE SERVICE

    <address>

    DEAR MR. FISH,

    You may have already won …

  27. [Mary Ann is] the only multiple-word first name that I’ve ever run across

    You never saw Spiderman?

  28. Zathrus says:

    You never saw Spiderman?

    Yes, except that that’s her first and middle names. Wikipedia, as usual, has an absurd amount of info on this geeky topic.

  29. Xanthir, FCD says:

    Fwiw, my mother-in-law’s first name is Lee Ann.  As well, down here in Texas we have the stereotype of women named Bobby Sue or something (and it’s not a false stereotype either!).  So the two-word first name is far from a unique case, though it is fairly rare.
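
Two suggestions from the thread, a "dunno" return value and a table of suffixes, combine naturally for batch processing. The following is only a sketch under stated assumptions: the function names `initials_or_none` and `triage`, the tiny `SUFFIXES` and `PARTICLES` tables, and the rule "anything other than two or three plain words goes to review" are all hypothetical choices, not anything proposed verbatim above.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Illustrative tables only; real lists would be much longer.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
PARTICLES = {"van", "von", "de", "la", "der"}


def initials_or_none(name: str) -> Optional[str]:
    """Return initials for easy names, or None ("dunno") for anything unusual."""
    words = name.split()
    # Drop one trailing generational suffix and the comma that preceded it.
    if len(words) > 1 and words[-1].rstrip(".").lower() in SUFFIXES:
        words = words[:-1]
        words[-1] = words[-1].rstrip(",")
    # Only two or three non-empty words count as "easy".
    if not 2 <= len(words) <= 3 or not all(words):
        return None
    # Remaining commas or surname particles are left for a human.
    if any("," in w or w.lower() in PARTICLES for w in words):
        return None
    return "".join(w[0].upper() for w in words)


def triage(names: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split names into auto-handled results and a list for human review."""
    done: Dict[str, str] = {}
    review: List[str] = []
    for name in names:
        result = initials_or_none(name)
        if result is None:
            review.append(name)
        else:
            done[name] = result
    return done, review


done, review = triage(
    ["Cal Ripken, Jr.", "Mary Ann Louise Smith", "Kim Van U", "Bill Gates"]
)
print(done)    # {'Cal Ripken, Jr.': 'CR', 'Bill Gates': 'BG'}
print(review)  # ['Mary Ann Louise Smith', 'Kim Van U']
```

The sketch deliberately errs toward sending names to review. It still cannot know that Bill Gates might prefer W, which is why letting people edit the result remains the safer design.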

Comments are closed.