Web (text) Scraping

Pj! · 25-01-2012 02:00PM #1

Can those more knowledgeable than me advise on how best to 'scrape' text from the web?
If I had a website with 20 new links on it each day,

eg:

www.dmt.com/ansdfiksfd
www.gtp.com/ajklds
www.rte.com/power
www.wer.com/amsjdfs
www.sok.com/msnd
www.lop.com/lsjsd
www.joy.com/skjhjds
etc.

and each link contained a text article. Is there software that I could use to capture the articles (or the whole page) from inside each link without having to visit each one?

TonyStark · 25-01-2012 02:03PM

You'd have to visit each one, or rather the software you'd use would, but it would do so automatically. The software would also need to be run by some kind of scheduler.
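
For illustration only (this is not from the thread), here is a minimal Python sketch of "the software visits each link for you". The function names `normalise`, `fetch` and `fetch_all` are made up for this example, and the sketch assumes the links are plain HTTP pages that need no login.

```python
from urllib.request import urlopen


def normalise(url):
    """Add a scheme to bare links such as 'www.rte.com/power'."""
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


def fetch(url, timeout=10):
    """Download one page and return its HTML as text."""
    with urlopen(normalise(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetcher=fetch):
    """Visit every link in turn; record failures instead of stopping."""
    pages, errors = {}, {}
    for url in urls:
        try:
            pages[url] = fetcher(url)
        except OSError as exc:  # network errors, timeouts, HTTP errors
            errors[url] = str(exc)
    return pages, errors
```

Calling `fetch_all([...])` with the day's links returns the downloaded pages plus a record of any link that failed. Running the script by hand is the "go" button; a scheduler such as cron or Windows Task Scheduler could run it daily instead.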

Pj! · 25-01-2012 02:06PM

Yeah, I'm looking for software that could do it for me.
Would such a thing be available?

I wouldn't mind pressing the 'go' button each day. It wouldn't necessarily have to schedule itself each day.

TonyStark · 25-01-2012 06:44PM

Such things do exist. PM me for more details and a sample of the sites you want scraped and some of the data that you want to pull out.

The caveat is that if the site owner changes the site you need to change the scraper to cater for the change. It depends what you want to pull off the site.

jebuz · 27-01-2012 04:52PM

https://scraperwiki.com is your friend; it also allows you to schedule the runs. It stores the scraped data in a database, which you can either download or query via an API. Excellent service.
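
As a rough local equivalent of "store the scraped data in a database", the sketch below uses Python's built-in `sqlite3` module. It is not ScraperWiki's API; the table layout and the helper names `open_store`, `save_article` and `articles_for` are assumptions made for this example.

```python
import sqlite3
from datetime import date


def open_store(path="articles.db"):
    """Open (or create) a small database with one table of articles."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, fetched_on TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def save_article(conn, url, body, fetched_on=None):
    """Insert one article, overwriting any earlier copy of the same URL."""
    fetched_on = fetched_on or date.today().isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, fetched_on, body) VALUES (?, ?, ?)",
        (url, fetched_on, body),
    )
    conn.commit()


def articles_for(conn, fetched_on):
    """Return (url, body) pairs saved on a given ISO date, sorted by URL."""
    rows = conn.execute(
        "SELECT url, body FROM articles WHERE fetched_on = ? ORDER BY url",
        (fetched_on,),
    )
    return rows.fetchall()
```

Keying the table on the URL means re-running the scraper on the same day replaces an article rather than duplicating it.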

frezzabelle · 28-01-2012 01:31AM

I think you may have to write a parser for each link. There are plenty of nice Java CSS selector/parser libraries out there.
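
To give a feel for what "a parser for each link" might look like, here is a small sketch in Python rather than Java, using only the standard library. Instead of a full CSS selector engine it matches a single class name per site. The class names in `SITE_RULES` are hypothetical, and the sketch assumes reasonably well-nested markup (unclosed tags would confuse the depth counter).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}   # tags with no closing tag
BREAK_TAGS = {"p", "br", "div", "h1", "h2", "h3", "li"}    # tags that start a new line

# Hypothetical per-site rules: hostname -> class of the element holding the article.
SITE_RULES = {"www.rte.com": "article-body", "www.gtp.com": "story"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth and tag in BREAK_TAGS:
            self.chunks.append("\n")
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_article(url, html):
    """Pick the rule for the link's site and return the article text."""
    host = urlparse(url if "://" in url else "http://" + url).netloc
    parser = ClassTextExtractor(SITE_RULES[host])  # KeyError for an unknown site
    parser.feed(html)
    parser.close()
    lines = "".join(parser.chunks).splitlines()
    return "\n".join(" ".join(line.split()) for line in lines if line.strip())
```

If a site changes its layout, only its entry in `SITE_RULES` needs updating.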

Reamer Fanny · 28-01-2012 01:34AM

You could use cURL to fetch each page and the XPath functions in PHP, with some kind of regular expression, to parse the HTML and extract only the article's text.
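
The same idea can be sketched in Python instead of PHP. The example below uses the limited XPath support in `xml.etree.ElementTree` plus a regular expression to tidy whitespace. It assumes the page is well-formed XHTML with no namespace declaration and no HTML-only entities such as `&nbsp;`, and that the article sits in a `div` whose `id` is `article`; both the id and the function name `article_text` are made up for illustration. Real-world HTML is often messier and would need a more forgiving parser.

```python
import re
import xml.etree.ElementTree as ET


def article_text(xhtml, container_id="article"):
    """Return the paragraphs inside <div id=container_id>, one per block."""
    root = ET.fromstring(xhtml)
    container = root.find(f".//div[@id='{container_id}']")
    if container is None:
        return ""
    paragraphs = []
    for node in container.iter("p"):
        text = "".join(node.itertext())            # text including inline tags
        text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

The regular expression is used here only to clean up the extracted text; locating the article is left to the XPath lookup, which tends to be less brittle than matching HTML with regular expressions.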

Pj! · 28-01-2012 08:19PM

It's all getting a bit technical for me but very happy with the suggestions. Having a good look around scraperwiki. Thanks jebuz.

I might just pay to get a scraper created.

Liam Byrne · 29-01-2012 06:10PM

Be careful of copyright issues; make sure you have permission to scrape the content.

Also, ensure the sources are reputable; re-publishing inaccurate or libellous content leaves you open to legal issues.

Pj! · 29-01-2012 07:03PM

Thanks Liam.
I don't want to re-publish anything. Just keep a record for my own use.
