Web (text) Scraping

Comments

  • Registered Users, Registered Users 2 Posts: 851 ✭✭✭TonyStark


    Pj! wrote: »
    Can those more knowledgeable than me advise on how best to 'scrape' text from the web?
    If I had a website with 20 new links on it each day,

    eg:

    www.dmt.com/ansdfiksfd
    www.gtp.com/ajklds
    www.rte.com/power
    www.wer.com/amsjdfs
    www.sok.com/msnd
    www.lop.com/lsjsd
    www.joy.com/skjhjds
    etc.


    and each link contained a text article. Is there a software that I could use to capture the articles (or the whole page) from inside each link without having to visit each one?


    You'd have to visit each one, or rather the software you'd use would, but it would do so automatically. The software would also need to be run from some kind of scheduler.
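To make the "software visits each link for you" idea concrete, here is a minimal sketch in Python using only the standard library. It assumes you already have the day's links as full URLs (with `http://` or `https://` in front). The function names `fetch`, `filename_for` and `save_all` and the file naming scheme are made up for illustration; this is one possible approach, not a finished tool.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def filename_for(url):
    """Turn a URL into a readable file name (hypothetical naming scheme)."""
    parts = urlparse(url)
    slug = (parts.netloc + parts.path).strip("/").replace("/", "_")
    return (slug or "index") + ".html"


def save_all(urls, out_dir, fetcher=fetch):
    """Visit every URL in turn and save each page under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        try:
            text = fetcher(url)
        except OSError as error:  # urllib's URLError is a subclass of OSError
            print(f"skipped {url}: {error}")
            continue
        path = out / filename_for(url)
        path.write_text(text, encoding="utf-8")
        saved.append(path)
    return saved
```

Calling `save_all(["http://www.rte.com/power"], "articles")` would be the 'go' button: it fetches each link and writes one file per page into an `articles` folder. Running it daily would then be a job for whatever scheduler your operating system provides.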


  • Closed Accounts Posts: 3,783 ✭✭✭Pj!


    Yeah I'm looking for a software that could do it for me.
    Would such a thing be available?

    I wouldn't mind pressing the 'go' button each day. It wouldn't necessarily have to schedule itself each day.


  • Registered Users, Registered Users 2 Posts: 851 ✭✭✭TonyStark


    Pj! wrote: »
    Yeah I'm looking for a software that could do it for me.
    Would such a thing be available?

    I wouldn't mind pressing the 'go' button each day. It wouldn't necessarily have to schedule itself each day.

    Such things do exist. PM me for more details, along with a sample of the sites you want scraped and some of the data that you want to pull out.

    The caveat is that if the site owner changes the site, you need to change the scraper to cater for the change. How much work that involves depends on what you want to pull off the site.


  • Registered Users, Registered Users 2 Posts: 1,771 ✭✭✭jebuz


    https://scraperwiki.com is your friend; it also allows you to schedule the runs. It stores the scraped data in a database, which you can either download or query via an API. Excellent service.


  • Registered Users, Registered Users 2 Posts: 33 frezzabelle


    I think you may have to write a parser for each link. There are plenty of nice Java CSS selectors/parsers out there.
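As a rough illustration of "a parser for each link", here is a sketch in Python (rather than Java) that keeps one small rule per site: which tag and class wrap the article. The entries in `SITE_RULES`, the class names `article-body` and `story`, and the helper `extract_for` are all invented for the example; you would have to look at each real site's HTML to find the right values. It uses the standard library's `html.parser` instead of a real CSS selector engine, so it only handles a simple "tag with this class" match.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: which element holds the article on each site.
# The tag and class names here are invented for illustration.
SITE_RULES = {
    "www.rte.com": ("div", "article-body"),
    "www.joy.com": ("section", "story"),
}
DEFAULT_RULE = ("body", None)  # fall back to all visible body text


class ElementText(HTMLParser):
    """Collect the text inside the first element matching a tag and class."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.skip = False    # True while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
            return
        if self.done:
            return
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # same tag nested inside the match
            return
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False
            return
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.skip and data.strip():
            self.chunks.append(data.strip())


def extract_for(url, html):
    """Pick the rule for the URL's host and return the matching text."""
    host = urlparse(url).netloc
    tag, css_class = SITE_RULES.get(host, DEFAULT_RULE)
    parser = ElementText(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The point of the table is that when one site changes its layout, only that site's entry needs updating. Note that the host lookup relies on the URL including its scheme (`http://...`); without it, the default rule is used.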


  • Closed Accounts Posts: 2,828 ✭✭✭Reamer Fanny


    Pj! wrote: »
    Can those more knowledgeable than me advise on how best to 'scrape' text from the web?
    If I had a website with 20 new links on it each day,

    eg:

    www.dmt.com/ansdfiksfd
    www.gtp.com/ajklds
    www.rte.com/power
    www.wer.com/amsjdfs
    www.sok.com/msnd
    www.lop.com/lsjsd
    www.joy.com/skjhjds
    etc.


    and each link contained a text article. Is there a software that I could use to capture the articles (or the whole page) from inside each link without having to visit each one?

    You could use cURL and the xpath function in PHP, with some kind of regular expression, to parse the HTML and extract only the article's text.
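For anyone curious what the XPath-plus-regex idea looks like in practice, here is a small sketch. It is written in Python rather than PHP, and it uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and only accepts well-formed markup. The function name `article_paragraphs` and the `div` with class `article` are assumptions for the example; a real page would need its own expression, and messy real-world HTML usually needs a more forgiving parser than this.

```python
import re
import xml.etree.ElementTree as ET


def article_paragraphs(xhtml, xpath=".//div[@class='article']//p"):
    """Return the text of each element matched by xpath, whitespace-normalised."""
    root = ET.fromstring(xhtml)  # raises ParseError if the markup is not well formed
    paragraphs = []
    for node in root.findall(xpath):
        # itertext() also picks up text inside nested tags such as <em>
        text = "".join(node.itertext())
        # The regular expression collapses runs of whitespace into one space
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

Here the XPath expression picks out the paragraphs and the regular expression only tidies the whitespace, which tends to be more reliable than trying to cut the article out of the raw HTML with a regex alone.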


  • Closed Accounts Posts: 3,783 ✭✭✭Pj!


    It's all getting a bit technical for me but very happy with the suggestions. Having a good look around scraperwiki. Thanks jebuz.

    I might just pay to get a scraper created.


  • Closed Accounts Posts: 18,152 ✭✭✭✭Liam Byrne


    Be careful of copyright issues; make sure you have permission to scrape the content.

    Also, ensure the sources are reputable; re-publishing inaccurate or libellous content leaves you open to legal issues.


  • Closed Accounts Posts: 3,783 ✭✭✭Pj!


    Thanks Liam.
    I don't want to re-publish anything. Just keep a record for my own use.

