Parsing Name Column - Firstname, Surname

  • 09-03-2009 3:55pm
    #1
    Registered Users, Registered Users 2 Posts: 500 ✭✭✭


    Hi,

    I have a problem. I have a large Excel file that has a Name column.
    Names are in the format:
    Mary Hunt
    Martin S Ryan
    John O'Neill
    And god knows what other horrible formats.
    I need to write something that will extract each name and put it in 2 columns: Firstname and Surname.

    The language I am writing it in is Python, but I don't want exact code - just the theory. Or sample code in whatever language to point me in the right direction.

    Any help would be appreciated.


Comments

  • Registered Users, Registered Users 2 Posts: 68,317 ✭✭✭✭seamus


    I'm assuming the input file is just plain text?

    From my experience (though I haven't actually checked to see if there's a theory on this), your best bet is not to try and parse it forwards, the way you'd normally think of it.

    That is, you'd normally say "Find the first whitespace character and split the string there". But that falls down at double-barrelled names and names with initials in them and so forth. So if you work backwards - find the last whitespace character and split the string there, you'll have a much higher success rate.
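
    A minimal Python sketch of the "work backwards" split described above. The function name split_on_last_space is made up for illustration, and leaving the surname empty for a single-word name is an assumption:

```python
def split_on_last_space(full_name):
    """Split a name at its last run of whitespace."""
    # rsplit(None, 1) splits once, from the right, on any whitespace.
    parts = full_name.strip().rsplit(None, 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # Zero or one token: nothing to split, so the surname is left empty.
    return (parts[0] if parts else ""), ""


for name in ["Mary Hunt", "Martin S Ryan", "John O'Neill"]:
    print(split_on_last_space(name))
```

    With this, the initial in "Martin S Ryan" stays with the first name rather than ending up in the surname.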

    Of course, this is far from flawless - you'll have names like "De Valera", "Mc Duff", you'll also have people whose names are written without the proper punctuation, such as "O Neill".

    This is where pattern matching comes in. You can find a pretty comprehensive list of these prefixes and test for them too, which should in theory give you a well-split list with most names correctly parsed and caught. You may also be able to test for "odd" names and tell the script to spit those out to you for manual processing.
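
    One way the prefix test and the "odd name" flag might look in Python. This is only a sketch: the prefix set holds just the three examples mentioned above (a real list would be much longer), and the names split_name and SURNAME_PREFIXES, along with the rule for what counts as "odd", are assumptions:

```python
# Illustrative only: the three prefixes mentioned in the post, lower-cased.
SURNAME_PREFIXES = {"de", "mc", "o"}


def split_name(full_name):
    """Return (first_name, surname, needs_review), working from the right."""
    tokens = full_name.split()
    if len(tokens) < 2:
        # Nothing to split; hand it to a human.
        return " ".join(tokens), "", True
    surname = [tokens.pop()]
    # Pull known prefixes into the surname, always leaving one token
    # behind to act as the first name.
    while len(tokens) > 1 and tokens[-1].lower().rstrip(".'") in SURNAME_PREFIXES:
        surname.insert(0, tokens.pop())
    # Assumption: anything with more than one token left over is "odd".
    needs_review = len(tokens) > 1
    return " ".join(tokens), " ".join(surname), needs_review


for name in ["Eamon De Valera", "John O Neill", "Martin S Ryan"]:
    print(split_name(name))
```

    Rows that come back with needs_review set to True can be written to a separate list for manual processing.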


  • Registered Users, Registered Users 2 Posts: 6,570 ✭✭✭daymobrew


    I'm into doing things the easy way:
    - Split name by spaces
    - If two parts then it's easy
    - If three parts then put first part as first name and join rest to be surname.
    - If four parts then do something similar.

    I would report or otherwise flag the 3+ part ones for human review later.

    This simple method might suffice if the number of the non-trivial formations is low enough.
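
    The steps above could be sketched in Python roughly like this. The function name simple_split is hypothetical, and flagging empty or single-word names as well as the 3+ part ones is an assumption:

```python
def simple_split(full_name):
    """Return (first_name, surname, flagged) using a plain split on spaces."""
    parts = full_name.split()
    if not parts:
        return "", "", True
    first_name = parts[0]
    # Everything after the first part is joined back together as the surname.
    surname = " ".join(parts[1:])
    # Only the two-part case is trusted; everything else goes to review.
    flagged = len(parts) != 2
    return first_name, surname, flagged


names = ["Mary Hunt", "Martin S Ryan", "John O'Neill"]
for_review = [n for n in names if simple_split(n)[2]]
print(for_review)
```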


  • Registered Users, Registered Users 2 Posts: 5,618 ✭✭✭Civilian_Target


    We do this a lot where I work, quite similar to what Daymo suggests.

    First tokenize on space. If there's two tokens, done.
    If there's 3 tokens,
    - check the first one for Mr, Mrs, Ms, Miss, Dr, Fr, etc.
    - check the middle token against a list of common surname prefixes, Mc, Mac, O, De, Van. If it matches, attach to the surname
    - If the first name is some variant of Mohamed, attach the middle token to the last name; otherwise attach it to the first name.
    If there's 4 tokens, check to reduce it to 3 or 2 like above. If you're still left with 4, split it down the middle.
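
    A rough Python sketch of these rules, not the actual code used at that workplace. Several details are assumptions: a leading title is simply dropped, the spellings in MOHAMED_VARIANTS are a small illustrative sample, names with more than 4 tokens are also split down the middle, and all identifiers are made up:

```python
TITLES = {"mr", "mrs", "ms", "miss", "dr", "fr"}
SURNAME_PREFIXES = {"mc", "mac", "o", "de", "van"}
# Assumption: a few common spellings, far from a complete list.
MOHAMED_VARIANTS = {"mohamed", "mohammed", "mohammad", "muhammad"}


def norm(token):
    """Lower-case a token and drop a trailing full stop, e.g. 'Dr.' -> 'dr'."""
    return token.lower().rstrip(".")


def split_tokens(full_name):
    """Return (first_name, surname) following the token rules above."""
    tokens = full_name.split()
    # Assumption: a leading title is discarded.
    if len(tokens) > 2 and norm(tokens[0]) in TITLES:
        tokens = tokens[1:]
    # A surname prefix in second-to-last position is merged into the surname.
    if len(tokens) > 2 and norm(tokens[-2]) in SURNAME_PREFIXES:
        tokens = tokens[:-2] + [tokens[-2] + " " + tokens[-1]]
    if len(tokens) == 3:
        first, middle, last = tokens
        if norm(first) in MOHAMED_VARIANTS:
            return first, middle + " " + last
        return first + " " + middle, last
    if len(tokens) >= 4:
        # Could not be reduced any further: split down the middle.
        half = len(tokens) // 2
        return " ".join(tokens[:half]), " ".join(tokens[half:])
    if len(tokens) == 2:
        return tokens[0], tokens[1]
    return " ".join(tokens), ""


for name in ["Dr John Mc Duff", "Martin S Ryan", "Mohamed Ali Hassan"]:
    print(split_tokens(name))
```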


  • Registered Users, Registered Users 2 Posts: 500 ✭✭✭warrenaldo


    Thanks guys.

    Having thought about it last night, it looked like the only logical way of doing it was tokenizing it and handling special cases as outlined above.

    Thanks for all the help. I should be sorted now.


  • Registered Users, Registered Users 2 Posts: 68,317 ✭✭✭✭seamus


    Yep, the other ideas are more elegant than mine :)


  • Registered Users, Registered Users 2 Posts: 25 Malached


    It all depends on what you want. If you want first name/surname, looking for the first space will probably work, at least in western alphabets. If you have the freedom, use one column for firstname and one column for surname, at least in western societies.

