PDF info extraction problem NEED HELP

AceHigh · 20-02-2007 07:50PM #1

Hi all,

Doing a project at work at the moment and I'm in dire straits.
Basically I have all these PDF docs with names and addresses and other details arranged in rows and columns. The second column contains the information I require.

What I am hoping somebody can help me with is how to extract only the data in column 2 and import it into a spreadsheet.

The second part of the problem is that the details are arranged under each other, i.e.

Name
Address 1
Address 2
etc...

I need to be able to import this info into Excel in the following layout: name, address 1......

Under pressure with this one so all help appreciated.

P.S. Software used: Adobe Acrobat 8 Pro, Adobe Illustrator CS & MS Office 2002

AceHigh · 20-02-2007 08:33PM

Like the guy with the 170Gb lost info...I too will Die...but not by my hands.

Lollypops all round if somebody can sort me out...

No, we don't sell lollypops, but I'm a big fan of Chupa Chups. Big tin on my desk...

Me with the project finished.
:eek: :eek: :eek: :eek: Closest the funnies come to a corpse...

Snowbat · 20-02-2007 08:42PM

The person that should die is the one who thought it was a good idea to store this kind of data in PDF files

Can you export the PDFs to txt?
Is there white space (blank line) separating each contact?

If yes to both, it can be easily scripted.
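A rough sketch of what "scripted" could look like, assuming the exported text really has one detail per line and a blank line between contacts. The function name and the sample lines are made up for illustration. It uses awk's paragraph mode (an empty RS), which treats each blank-line-separated block as one record:

```shell
# Turn blank-line-separated blocks into one comma-separated row per block.
# Assumption: no field contains a comma (those would need quoting for CSV).
blocks_to_csv() {
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "," } { $1 = $1; print }'
}

# Hypothetical sample input, not taken from the real PDFs.
printf 'Name One\nAddr 1\nAddr 2\n\nName Two\nAddr 3\n' | blocks_to_csv
# Name One,Addr 1,Addr 2
# Name Two,Addr 3
```

Assigning $1 to itself forces awk to rebuild the record with the output separator, which is what swaps the line breaks for commas.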

AceHigh · 20-02-2007 08:48PM

Yes Snowbat, I can export them to txt but all the info is listed under each other with no breaks.

Also yes there is a space between contacts...

At the moment I am selecting the text from the PDF, copying it into the Excel sheet and rearranging it from vertical to horizontal. I have upwards of 1000 contacts to do.

What do you mean by writing a script...

Heinrich · 20-02-2007 08:53PM

AceHigh wrote:

Yes Snowbat, I can export them to txt but all the info is listed under each other with no breaks.

Also yes there is a space between contacts...

At the moment I am selecting the text from the PDF, copying it into the Excel sheet and rearranging it from vertical to horizontal. I have upwards of 1000 contacts to do.

What do you mean by writing a script...

If we could see a sample of your file it might help.

Snowbat · 20-02-2007 09:02PM

There's a very handy tool called sed (comes standard in Linux, but there is a win32 binary) that can be used, e.g., to replace tabs with commas (or another delimiter), replace ends of lines with commas (or another delimiter), and merge adjacent lines. The processed file can be renamed to .csv and imported into a spreadsheet.
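To give a flavour of those two operations, here is a minimal sketch on made-up sample lines. The function names are only for illustration, and the join loop is the well-known sed one-liner idiom rather than anything specific to these files:

```shell
# Replace every tab with a comma. The literal tab is built with printf so the
# expression does not depend on sed understanding \t.
tabs_to_commas() {
  tab=$(printf '\t')
  sed "s/${tab}/,/g"
}

# Replace every end-of-line with a comma by pulling the next line into the
# pattern space (N) and looping until the last line is reached.
join_lines() {
  sed -e :a -e '$!N; s/\n/,/; ta'
}

# Hypothetical sample data.
printf 'Name\tAddress 1\tAddress 2\n' | tabs_to_commas   # Name,Address 1,Address 2
printf 'Name\nAddress 1\nAddress 2\n' | join_lines       # Name,Address 1,Address 2
```

Note that join_lines merges the whole input into a single row, so on its own it only suits one contact at a time.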

If you want to post some (suitably munged) sample data exported to txt, I'll see what I can cook up.

AceHigh · 20-02-2007 09:03PM

http://www.corkcoco.ie/co/pdf/41144343.pdf

This link is a sample of what I'm talking about.

I need to get the text under applicant name and address and transfer the details as simply as I can to a spreadsheet with details entered horizontally.

Snowbat · 21-02-2007 02:01AM

You didn't give me .txt and Adobe's online conversion tool seems to hang on this file, so I used the Gmail trick and saved the page as pdfsrc.html

Then, in Linux:

cat pdfsrc.html | grep -E 'left: 131|left: 135|left: 443|left: 444' | cut -d '"' -f 3 | sed 's/>C<\/div>//' | sed 's/>P<\/div>//' | sed 's/>O<\/div>//' | sed 's/>A<\/div>//' | grep -v '><b>' | sed 's/^>//' | sed 's/<\/div>//' | sed 's/&amp;/\&/' | sed 's/, /\n/g' | sed 's/,//g' > pdfsrc2.txt

We make use of the fact that the second column data is all at coordinate left: 131 (except a couple of oddballs at 135). We also make use of the App Type codes at 443 and 444 to give us line breaks between each contact. Then, filter out unneeded bits of code, turn any instance of comma-space into a line feed, and strip any remaining commas as we'll be using comma as the delimiter in our CSV (comma separated values) file.

At this point you can review the intermediate text file and check if everything looks in order.

Then,

cat pdfsrc2.txt | sed 's/$/,/' | sed 's/^,/__/' | sed -e :a -e '/,$/N; s/,\n/,/; ta' | sed 's/,__$//' > pdfsrc3.csv

Here we add a comma to the end of every line, replace any comma at the start of a line with a double underscore (this should hit only the line breaks, and we need them to be something other than commas to make the next bit work properly), append the following line to any line ending with a comma, remove the trailing ,__ sequence, and output to a CSV file that can be imported into any spreadsheet.
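To see what the joining step does, the same sed chain can be run on a few invented lines. The function name and sample contacts are hypothetical. The sample ends with a blank line because, in the real data, every contact is followed by an App Type line that has been blanked out:

```shell
# Same four sed steps as in the pipeline, wrapped in a function for testing.
join_contacts() {
  sed 's/$/,/' |                            # comma at the end of every line
    sed 's/^,/__/' |                        # blank lines (now ",") become "__"
    sed -e :a -e '/,$/N; s/,\n/,/; ta' |    # glue each line to the next one
    sed 's/,__$//'                          # drop the ",__" end-of-contact marker
}

# Made-up input: two contacts, each terminated by a blank line.
printf 'Name One\nAddr 1\nAddr 2\n\nName Two\nAddr 3\n\n' | join_contacts
# Name One,Addr 1,Addr 2
# Name Two,Addr 3
```

The "__" marker is what stops the gluing loop: a line ending in "__" no longer ends in a comma, so sed prints it and starts on the next contact.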

Caveats: The longer names in the html file were wrapped onto two rows and therefore end up as separate values (e.g. row 86 has the surname in its own column) - nothing I can do about that. I also see some duplicate lines - if you need to get rid of dupes you can concatenate all the csv files together and pipe through sort and uniq.
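For the dupes, something along these lines should do. The function name and the temporary sample files are made up for the demonstration:

```shell
# Concatenate every file passed in, sort the rows, and drop repeated rows.
# uniq only removes adjacent duplicates, which is why sort has to come first.
dedupe_csv() {
  cat "$@" | sort | uniq
}

# Demonstration on two throwaway files with one row in common.
tmpdir=$(mktemp -d)
printf 'B Person,2 Road\nA Person,1 Street\n' > "$tmpdir/one.csv"
printf 'A Person,1 Street\nC Person,3 Lane\n' > "$tmpdir/two.csv"
dedupe_csv "$tmpdir"/*.csv
# A Person,1 Street
# B Person,2 Road
# C Person,3 Lane
rm -r "$tmpdir"
```

Bear in mind that sorting changes the row order, so do this only if the original order of the contacts doesn't matter.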

Note: attachment pdfsrc3.csv renamed to pdfsrc3.csv.txt as boards file filter doesn't like .csv

Capt'n Midnight · 21-02-2007 02:19AM

have you checked the copyright and data protection implications ??

if you had Office 2003 (or any part of Office 2003) you could print to the Microsoft Office Document Imaging printer and then open the .mdi or .tif file in Microsoft Office Document Imaging and OCR the text to Word.

sometimes it keeps formats

Tom Dunne · 21-02-2007 11:26AM

AceHigh wrote:

Yes Snowbat, I can export them to txt but all the info is listed under each other with no breaks.

Ah, but if you look at the text file closely, you will see the following:

Each address is preceded by a reference number, most of which are in the format 'NN/NNNN' where N is a number. You have to be careful to differentiate between this and a date in the format 'DD/MM/YYYY'

Each address is also followed by a single letter, corresponding to App. Type in the pdf document.
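As a rough illustration of telling those three kinds of line apart, here is a sketch with grep. It assumes each reference number, date, or App. Type letter sits on a line of its own, which may not hold for every file, and it only covers the 'NN/NNNN' references (most, not all, follow that format). The function names and sample values are made up:

```shell
# True if the whole line is NN/NNNN. The ^ and $ anchors matter: without them
# the "02/2007" inside a date like 20/02/2007 would also match.
is_reference() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{2}/[0-9]{4}$'
}

# True if the whole line is DD/MM/YYYY (digits only, no validity check).
is_date() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{2}/[0-9]{2}/[0-9]{4}$'
}

# True if the whole line is a single capital letter (the App. Type code).
is_app_type() {
  printf '%s\n' "$1" | grep -Eq '^[[:upper:]]$'
}

# Hypothetical sample lines.
for line in '07/4321' '20/02/2007' 'C' 'Main Street'; do
  if is_reference "$line"; then echo "reference: $line"
  elif is_date "$line"; then echo "date: $line"
  elif is_app_type "$line"; then echo "app type: $line"
  else echo "other: $line"
  fi
done
```

Anything between a reference line and the next App. Type line would then be the name and address.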

AceHigh wrote:

What do you mean by writing a script...

He means writing a short program that will sift through the data, based on what I posted above, and extract the addresses. Looks like Perl is your man here.

Unfortunately, I don't have the time to re-learn Perl (it's been a few years since I have used it), nor do I have time to write the script.

AceHigh · 21-02-2007 08:20PM

Thanks Tom, Capt'n Midnight & Snowbat. I don't mean for anyone to spend hours writing a program to do this. There's better things to spend time on.

I just put it out to see if there was any quick way / application to do same.

Many thanks for all replies anyway.

AceHigh
