Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie
Hi there,
There is an issue with role permissions that is being worked on at the moment.
If you are having trouble with access or permissions on regional forums please post here to get access: https://www.boards.ie/discussion/2058365403/you-do-not-have-permission-for-that#latest

Address Data

  • 24-08-2016 1:16pm
    #1
    Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭


    Hi,

    First time poster in here (I think) I'm hoping someone can help me.

    I've been given an excel spreadsheet with customer data (name, address, phone numbers), the problem I have is that it's not as clean as it should be.

    The columns I have are:
    name_first
    name_surname
    company_name
    addr_1
    addr_2
    city
    county
    zip
    phone_day
    phone_eve
    phone_cell
    email_addr

    The problem is that some lines have content in City, others don't, some have mobile numbers in phone_day, is there an easy way to clean up address information like this? there's a few thousand rows so I'd like to have it as automatic as possible.


Comments

  • Registered Users, Registered Users 2 Posts: 23,212 ✭✭✭✭Tom Dunne


    It depends on what you mean by clean up. :)


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Very good point.

    I would like to have all the counties in 1 field, so if the city field has Ennis,Co. Clare that it takes the Co. Clare and puts it into another field. I'd also like to get all phone numbers starting with 8 into a new field called Mobile


  • Registered Users, Registered Users 2 Posts: 68,317 ✭✭✭✭seamus


    The approach here really depends on the data that you find. Address data is never clean. I have never come across an address database that is perfectly clean.

    You will have lines with missing counties, lines that are completely wrong like "Tallaght, Co. Carlow". Carlow usually appears as the first county in drop-downs, so an unusually higher number of addresses will be in Carlow.

    The other issue you'll have is that not only will the county be in the addr_2 field, the town will sometimes be in the "county" field.

    So it's a matter of doing the best you can with the data that you have. Work on it column-by-column, don't try to fix every column at the same time.

    Find an example of incorrect entry, then figure out how you would correct that entry, and apply that correction to the whole column. Do this for every errant entry you find and you'll be surprised how quickly you start cleaning up the whole thing.

    So for example, if you have "Ennis, Co. Clare" in column D, you can fix that by splitting the text at the comma. So insert three blank columns to the right, highlight column D, choose "Text to Columns" and choose the comma as the delimiter. It will then split every entry containing a comma and insert the text into the column(s) on the right. (Why 3 columns? Because you will also have, "Casa me, Ballyhasadreen, Ennis, Co. Clare" all in the "city" field)
    Then sort the "new" columns Z-A to put all non-blank entries at the top. You can then use CONCATENATE(x,x,x) to merge this data in the "county" field.


    Mobile Phone is an easier one. Assuming "phone_day" is column "H", create a new column (I), and use this formula in i1:
    =IF(MID(H1,1,1) = "8", H1, "")
    

    This will put the number in the new column if it begins with 8, or leave it blank otherwise.


  • Registered Users, Registered Users 2 Posts: 7,745 ✭✭✭StupidLikeAFox


    Clareman wrote: »
    Very good point.

    I would like to have all the counties in 1 field, so if the city field has Ennis,Co. Clare that it takes the Co. Clare and puts it into another field. I'd also like to get all phone numbers starting with 8 into a new field called Mobile

    Very quick and dirty method is use the following formula in the Mobile column:
    =IF(LEFT(A1)=8,A1,"")

    This is assuming the number is in column A. It will check the first number of the column and if its 8 it will populate the cell with the entire number. If its not it will leave it blank.


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Spot on for the phone numbers, that's a big chunk done. Just realized Excel 2016 uses ; and not ,


  • Advertisement
  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Is there an easy way of confirming an email address is in the right format, i.e. has @ and a . after the @? Also, is there a quick way to pull out the word after Co., I'm thinking it's the easiest way to pull out the country.


  • Registered Users, Registered Users 2 Posts: 1,931 ✭✭✭PrzemoF


    SEARCH, LEN, LEFT, RIGHT, MID are your friends. an example soon Edit: I'm sorry it's too complicated in excel

    P.S. My way would be export to cvs and do it in python. String parsing in excel can be done, but it's like a race backwards with closed eyes.
    P.S.S Are you familiar with regular expressions? http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops#22542835


  • Registered Users, Registered Users 2 Posts: 788 ✭✭✭pillphil


    Clareman wrote: »
    Spot on for the phone numbers, that's a big chunk done. Just realized Excel 2016 uses ; and not ,

    Took me an hour to realise that a few weeks ago :o


  • Registered Users, Registered Users 2 Posts: 788 ✭✭✭pillphil


    =and(if(find(".", right(A3, len(A3) - find("@, A3) - 1)) > -1, true, false), if(find(.", right(A3, len(A3) - find("@", A3))) > -1, true, false))

    it's not great

    it just looks for an


  • Registered Users, Registered Users 2 Posts: 788 ✭✭✭pillphil


    =RIGHT(trim(A8), find(" ", trim(A8))+1)

    also not great, it will take all the characters after the first space that's not a leading or trailing space.

    it will pull cork out from
    co cork
    co. cork
    
    but not
    co.cork
    co  cork
    co.  cork
    


  • Advertisement
  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Awesome folks, keep them coming, I'm thinking of splitting the file into 4 different files, 1 for name, 1 for address, 1 for email and 1 for phone, that way I can mess about with each independently and bring them together when they're all done.


  • Registered Users, Registered Users 2 Posts: 68,317 ✭✭✭✭seamus


    pillphil wrote: »
    =and(if(find(".", right(A3, len(A3) - find("@, A3) - 1)) > -1, true, false), if(find(.", right(A3, len(A3) - find("@", A3))) > -1, true, false))

    it's not great

    it just looks for an

    This is good. The one thing I would add is a test for whitespace. You can also clean up the AND statement a bit. If you just want to return the values TRUE or FALSE, you don't need an IF statement.

    =IF(A < B, TRUE, FALSE)

    is exactly the same as

    = A < B

    If you want to invert it, just use =NOT(A < B)

    This means that you can discard the IF statements in your AND clause and you just get
    =AND(
    	FIND(".", RIGHT(A3, LEN(A3) - FIND("@", A3) - 1)) > -1, 
    	FIND(".", RIGHT(A3, LEN(A3) - FIND("@", A3))) > -1,
    	ISERROR(FIND(" ",A3))
    )
    
    It's only cosmetic really, but the less parentheses you can get away with in Excel, the better!
    I added in the 3rd AND there to find whitespace - returns TRUE if nothing is found.

    The full statement in one line is;
    =AND((FIND(".", RIGHT(A3, LEN(A3) - FIND("@, A3) - 1)) > -1), FIND(.", RIGHT(A3, LEN(A3) - FIND("@, A3))) > -1,ISERROR(FIND( ",A3)))


  • Registered Users, Registered Users 2 Posts: 788 ✭✭✭pillphil


    Hmm, just realised one of the conditions is unnecessary, I think. I also didn't check if it began with an
    [S]=AND(
            FIND(".", RIGHT(A1, LEN(A1) - FIND("@", A1, 2) - 1)) > -1, 
            iserror(FIND(" ",A1))
    )[/S]
    

    Should fix that?
    =AND(
            FIND(".", RIGHT(A1, LEN(A1) - FIND("@", A1, 2) - 1)) > -1, 
            iserror(FIND(" ",A1)),
            iserror(find("@", RIGHT(A1, LEN(A1) - FIND("@", A1)))))
    


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    I've found a list of Irish towns (www.shamocracy.com/irishtownlands.csv), I was hoping to be able to use this to add a proper county to each town, I'm thinking of putting the CSV into sheet 2, the address lines into sheet 1 and doing a VLookup, except I'm no good at VLookups, any help?


  • Registered Users, Registered Users 2 Posts: 12,046 ✭✭✭✭L'prof


    Do you have a sample? It might be very hard to catch all cases if the data isn't all entered in the same way though?

    I'm no good at vlookups either, but here's an example: http://superuser.com/questions/1023123/how-to-simulate-a-full-outer-join-in-excel

    The 2nd option on this page is actually my preference as I really like being able to use something similar to SQL in Excel. I've used this before and it works really well


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Crosshaven,Co. Cork
    Co. Wexford
    Limerick
    Limerick
    Co Clare
    Tipperary
    Co. Tipperary
    Lahinch, Co. Clare
    Dublin 1
    Little Island
    Annacotty
    Limerick
    Co. Kerry
    Castletroy
    Mallow,Co. Cork
    Co. Louth
    Tralee
    Co. Cork
    Co Limerick
    Co. Kerry
    Limerick
    Clare
    Kildare
    Limerick
    Co. Clare
    Dublin 4
    Co. Tipperary
    Dublin 6
    Co. Tipperary
    Ballina,Co. Tipperary

    That's the current city field, I'm going to split them by comma and then put in the county field depending on the town at the left (if that makes sense)


  • Registered Users, Registered Users 2 Posts: 12,046 ✭✭✭✭L'prof


    Clareman wrote: »
    That's the current city field, I'm going to split them by comma and then put in the county field depending on the town at the left (if that makes sense)

    OK, how are you going to deal with the rows that only have counties entered or am I misinterpreting the data?


  • Registered Users, Registered Users 2 Posts: 23,212 ✭✭✭✭Tom Dunne


    Clareman wrote: »
    Crosshaven,Co. Cork
    Co. Wexford
    Limerick
    Limerick
    Co Clare
    Tipperary
    Co. Tipperary
    Lahinch, Co. Clare
    Dublin 1
    Little Island
    Annacotty
    Limerick
    Co. Kerry
    Castletroy
    Mallow,Co. Cork
    Co. Louth
    Tralee
    Co. Cork
    Co Limerick
    Co. Kerry
    Limerick
    Clare
    Kildare
    Limerick
    Co. Clare
    Dublin 4
    Co. Tipperary
    Dublin 6
    Co. Tipperary
    Ballina,Co. Tipperary

    That's the current city field, I'm going to split them by comma and then put in the county field depending on the town at the left (if that makes sense)

    Id do a number of passes over the data, to cater for each of the different permutations.

    One would search for commas, that is easy enough to separate out.

    Next, possible do a county search (anything with Co. or co.) and then work out the logic of counties vs. towns. Would it be safe to assume Cork and Limerick refer to the cities as opposed to the counties?


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Tom Dunne wrote: »
    Id do a number of passes over the data, to cater for each of the different permutations.

    One would search for commas, that is easy enough to separate out.

    Next, possible do a county search (anything with Co. or co.) and then work out the logic of counties vs. towns. Would it be safe to assume Cork and Limerick refer to the cities as opposed to the counties?

    I think both for Limerick and Cork but I'd be happy to leave it to Cities


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    Tom Dunne wrote: »
    Id do a number of passes over the data, to cater for each of the different permutations.

    One would search for commas, that is easy enough to separate out.

    Next, possible do a county search (anything with Co. or co.) and then work out the logic of counties vs. towns. Would it be safe to assume Cork and Limerick refer to the cities as opposed to the counties?

    Makes sense, I'll give that a go. I'm breaking for holidays this evening but once I get it done I'll upload the file here


  • Advertisement
  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    Clareman wrote: »
    I've found a list of Irish towns (www.shamocracy.com/irishtownlands.csv), I was hoping to be able to use this to add a proper county to each town, I'm thinking of putting the CSV into sheet 2, the address lines into sheet 1 and doing a VLookup, except I'm no good at VLookups, any help?
    You're welcome!!


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    tricky D wrote: »
    You're welcome!!

    Is that your site? Very handy thank you very much for that


  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    /a bit OT/

    Indeed it is.

    Had vague plans to do something with it years ago, but it just ended up as a repository for random stuff.


  • Moderators, Sports Moderators, Regional Midwest Moderators Posts: 24,028 Mod ✭✭✭✭Clareman


    tricky D wrote: »
    /a bit OT/

    Indeed it is.

    Had vague plans to do something with it years ago, but it just ended up as a repository for random stuff.

    If you have a look up tool to add a county field to an excel spreadsheet based on the town name that would be very handy :D


  • Registered Users, Registered Users 2 Posts: 7,468 ✭✭✭Evil Phil


    tricky D wrote: »
    /a bit OT/

    Indeed it is.

    Had vague plans to do something with it years ago, but it just ended up as a repository for random stuff.

    /Totally OT/

    Include a page for licensing terms like Creative Commons, or an Open Data License or something. Also an index of what data you have would be cool.

    /Infracts self/


  • Registered Users, Registered Users 2 Posts: 1,931 ✭✭✭PrzemoF


    /OT continues.../
    Upload it to github. If you do it I'll do some cleaning (there are duplicate entries) and add a column with English county names.


  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    ^ All that sounds like work, much easier to just rename it:pac::D:cool::p


  • Registered Users, Registered Users 2 Posts: 5,876 ✭✭✭The J Stands for Jay


    Clareman wrote: »
    Spot on for the phone numbers, that's a big chunk done. Just realized Excel 2016 uses ; and not ,

    I think changing your settings so that the decimal point is a full stop rather than a comma will fix that.


Advertisement