
PDF info extraction problem NEED HELP

  • 20-02-2007 7:50pm
    #1
    Closed Accounts Posts: 40


    Hi all,

    Doing a project at work at the moment and I'm in dire straits.
    Basically I have all these PDF docs with names, addresses and other details arranged in rows and columns. The second column contains the information I require.

    What I am hoping somebody can help me with is how to extract only the data in column 2 and import it into a spreadsheet.

    The second part of the problem is that the details are arranged under each other, i.e.

    Name
    Address 1
    Address 2
    etc...

    I need to be able to import this info into Excel in the following layout: name, address 1, address 2, etc., all on one row.

    Under pressure with this one so all help appreciated.

    P.S. Software used: Adobe Acrobat 8 Pro, Adobe Illustrator CS & MS Office 2002


Comments

  • Closed Accounts Posts: 40 AceHigh


    Like the guy with the 170Gb lost info...I too will Die...but not by my hands.

    Lollypops all round if somebody can sort me out...

    No, we don't sell lollypops, but I'm a big fan of Chupa Chups. Big tin on my desk...
    :):):):):):):):) Me with the project finished.
    :eek: :eek: :eek: :eek: Closest the funnies come to a corpse...


  • Registered Users Posts: 1,064 ✭✭✭Snowbat


    The person that should die is the one who thought it was a good idea to store this kind of data in PDF files ;)

    Can you export the PDFs to txt?
    Is there white space (blank line) separating each contact?

    If yes to both, it can be easily scripted.


  • Closed Accounts Posts: 40 AceHigh


    Yes Snowbat, I can export them to txt, but all the info is listed under each other with no breaks.

    Also, yes, there is a space between contacts...

    At the moment I am selecting the text from the PDF, copying it into the Excel sheet and rearranging it from vertical to horizontal. I have upwards of 1000 contacts to do.


    What do you mean by writing a script?


  • Closed Accounts Posts: 1,577 ✭✭✭Heinrich


    AceHigh wrote:
    Yes Snowbat, I can export them to txt, but all the info is listed under each other with no breaks.

    Also, yes, there is a space between contacts...

    At the moment I am selecting the text from the PDF, copying it into the Excel sheet and rearranging it from vertical to horizontal. I have upwards of 1000 contacts to do.


    What do you mean by writing a script?

    If we could see a sample of your file it might help.


  • Registered Users Posts: 1,064 ✭✭✭Snowbat


    There's a very handy tool called sed (it comes standard on Linux, and there is a Win32 binary) that can be used, for example, to replace tabs with commas (or another delimiter), replace line endings with commas (or another delimiter), and merge adjacent lines. The processed file can then be renamed to .csv and imported into a spreadsheet.

    If you want to post some (suitably munged) sample data exported to txt, I'll see what I can cook up.
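    For anyone without sed to hand, the same idea can be sketched in a few lines of Python. This is only an illustration, assuming the exported text has one detail per line and a blank line between contacts; the function name `records_to_csv` and the sample text are made up for the example.

```python
import csv
import io


def records_to_csv(text: str) -> str:
    """Turn blank-line-separated contact blocks into CSV, one row per contact."""
    rows = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)      # another detail for the current contact
        elif current:
            rows.append(current)      # a blank line closes the contact
            current = []
    if current:                       # last contact may lack a trailing blank line
        rows.append(current)

    out = io.StringIO()
    # The csv module quotes any field that itself contains a comma.
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample data in the assumed layout.
sample = "Name One\nAddress 1\nAddress 2\n\nName Two\nAddress 1\n"
print(records_to_csv(sample), end="")
```

    Using the csv module rather than plain string joins means commas inside an address do not need to be stripped first.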


  • Closed Accounts Posts: 40 AceHigh


    http://www.corkcoco.ie/co/pdf/41144343.pdf

    This link is a sample of what I'm talking about.

    I need to get the text under applicant name and address and transfer the details as simply as I can to a spreadsheet, with the details entered horizontally.


  • Registered Users Posts: 1,064 ✭✭✭Snowbat


    You didn't give me a .txt file and Adobe's online conversion tool seems to hang on this file, so I used the Gmail trick and saved the page as pdfsrc.html

    Then, in Linux:
    cat pdfsrc.html | egrep 'left: 131|left: 135|left: 443|left: 444' | cut -d '"' -f 3 | sed 's/>C<\/div>//' | sed 's/>P<\/div>//' | sed 's/>O<\/div>//' | sed 's/>A<\/div>//' | grep -v '><b>' | sed 's/^>//' | sed 's/<\/div>//' | sed 's/&amp;/\&/' | sed 's/, /\n/g' | sed 's/,//'> pdfsrc2.txt
    
    We make use of the fact that the second column data is all at coordinate left: 131 (except a couple of oddballs at 135). We also make use of the App Type codes at 443 and 444 to give us line breaks between each contact. Then, filter out unneeded bits of code, turn any instance of comma-space into a line feed, and strip any remaining commas as we'll be using comma as the delimiter in our CSV (comma separated values) file.
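    The filtering step can also be sketched in Python. This is a rough equivalent, not the pipeline itself: it assumes each piece of text sits on its own line as `<div style="...left: N...">text</div>`, which is inferred from the cut and sed steps rather than checked against the real file. The name `extract_column` and the sample lines in the usage note are hypothetical.

```python
import html
import re

# Assumed line shape (inferred, not verified against the real pdfsrc.html):
#   <div style="...left: 131...">Some text</div>
DIV_RE = re.compile(r'<div style="[^"]*left: (\d+)[^"]*">(.*?)</div>')

COLUMN_LEFTS = {"131", "135"}     # second-column text, including the oddballs
SEPARATOR_LEFTS = {"443", "444"}  # App Type codes, used only as record breaks


def extract_column(html_text: str) -> list:
    """Return second-column lines, with "" marking the break between contacts."""
    lines = []
    for raw in html_text.splitlines():
        match = DIV_RE.search(raw)
        if not match:
            continue
        left, content = match.groups()
        if left in SEPARATOR_LEFTS:
            lines.append("")                     # App Type code becomes a blank line
        elif left in COLUMN_LEFTS and not content.startswith("<b>"):
            text = html.unescape(content)        # e.g. &amp; becomes &
            # Comma-space starts a new line; any leftover commas are dropped.
            lines.extend(part.replace(",", "").strip() for part in text.split(", "))
    return lines
```

    Called on a line such as `<div style="top:20;left: 131">Main St, Cork</div>`, it yields `Main St` and `Cork` as separate entries, which mirrors the comma-space to line-feed step.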

    At this point you can review the intermediate text file and check if everything looks in order.

    Then,
    cat pdfsrc2.txt | sed 's/$/,/' | sed 's/^,/__/' | sed -e :a -e '/,$/N; s/,\n/,/; ta' | sed 's/,__$//' > pdfsrc3.csv
    
    Here we add a comma to the end of every line, replace any comma at the start of a line with a double underscore (this should hit only the blank separator lines, and we need them to be something other than commas to make the next bit work properly), append the following line to any line ending with a comma, remove the trailing ,__ sequence, and output to a CSV file that can be imported into any spreadsheet.


    Caveats: The longer names in the HTML file were wrapped onto two rows and therefore end up as separate values (e.g. row 86 has the surname in column B). There's nothing I can do about that. I also see some duplicate lines. If you need to get rid of dupes, you can concatenate all the CSV files together and pipe them through sort and uniq.
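    If sort and uniq aren't available, a small Python sketch can do the de-duplication instead. Unlike sort, it keeps the rows in their original order. The function names and file paths are placeholders for illustration only.

```python
import csv


def unique_rows(rows):
    """Yield each non-empty row once, in first-seen order."""
    seen = set()
    for row in rows:
        key = tuple(row)
        if row and key not in seen:
            seen.add(key)
            yield list(row)


def merge_csv_files(csv_paths, out_path):
    """Concatenate several CSV files into out_path, dropping duplicate rows."""
    def all_rows():
        for path in csv_paths:
            with open(path, newline="") as handle:
                yield from csv.reader(handle)

    with open(out_path, "w", newline="") as out:
        csv.writer(out).writerows(unique_rows(all_rows()))
```

    For example, `merge_csv_files(["a.csv", "b.csv"], "all.csv")` would write the combined, de-duplicated rows to all.csv (those file names are hypothetical).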


    Note: attachment pdfsrc3.csv renamed to pdfsrc3.csv.txt as boards file filter doesn't like .csv


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 90,843 Mod ✭✭✭✭Capt'n Midnight


    Have you checked the copyright and data protection implications?

    If you had Office 2003 (or any part of Office 2003), you could print to the Microsoft Document Imaging printer, then open the .mdi or .tif file in Microsoft Document Imaging and OCR the text to Word.

    Sometimes it keeps the formatting.


  • Registered Users Posts: 23,212 ✭✭✭✭Tom Dunne


    AceHigh wrote:
    Yes Snowbat, I can export them to txt, but all the info is listed under each other with no breaks.

    Ah, but if you look at the text file closely, you will see the following:

    Each address is preceded by a reference number, most of which are in the format 'NN/NNNN', where N is a digit. You have to be careful to differentiate between this and a date in the format 'DD/MM/YYYY'.

    Each address is also followed by a single letter, corresponding to App. Type in the pdf document.
    AceHigh wrote:
    What do you mean by writing a script?

    He means writing a short program that will sift through the data, based on what I posted above, and extract the addresses. Looks like Perl is your man here.

    Unfortunately, I don't have the time to re-learn Perl (it's been a few years since I have used it), nor do I have time to write the script.
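    As a rough illustration of the approach described above (in Python rather than Perl), the sketch below collects the lines between a reference number and the single-letter App. Type. It assumes each item sits on its own line and that only name and address lines fall between the two markers, which may not hold for the real export. The function name and sample values are made up.

```python
import re

# Reference number in the NN/NNNN format. Anchoring the pattern at both ends
# means a DD/MM/YYYY date does not match.
REF_RE = re.compile(r"^\d{2}/\d{4}$")
# Single capital letter standing for the App. Type.
APP_TYPE_RE = re.compile(r"^[A-Z]$")


def extract_addresses(lines):
    """Return one list of name/address lines per contact."""
    records = []
    current = None                      # None means "not inside a contact"
    for line in lines:
        line = line.strip()
        if REF_RE.match(line):
            current = []                # a reference number opens a contact
        elif current is not None:
            if APP_TYPE_RE.match(line):
                records.append(current)  # the App. Type letter closes it
                current = None
            elif line:
                current.append(line)
    return records


# Made-up sample in the assumed layout.
sample = ["07/1234", "John Smith", "Main St", "Cork", "P", "20/02/2007"]
print(extract_addresses(sample))
```

    Reference numbers that don't follow NN/NNNN would be missed by this pattern, so the output would need checking against the PDF.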


  • Closed Accounts Posts: 40 AceHigh


    Thanks Tom, Capt'n Midnight & Snowbat. I don't mean for anyone to spend hours writing a program to do this. There are better things to spend time on.

    I just put it out there to see if there was any quick way or application to do the same.

    Many thanks for all replies anyway.

    AceHigh

