
Screen Scraping

  • 03-10-2012 3:08pm
    #1
    Registered Users, Registered Users 2 Posts: 17,974 ✭✭✭✭


    I'm looking at screen scraping data from two different sites, combining the data and importing it into a searchable database. It will be for private, non-profit use, but mostly I want to test the idea.

    How would I go about it? One of the sites I'm grabbing information from uses Flash. I googled around for a bit, but there's loads of stuff out there, some of it useless.


Comments

  • Closed Accounts Posts: 2,930 ✭✭✭COYW


    I did something like this before with Perl a few years ago. I used this library to retrieve and store the page contents.

    I noted that each page on the site that I needed to scrape had a similar URL structure, so I generated a list of the variables, looped through them, built each URL and pulled down all the pages.

    I extracted the content I wanted from the pages using regular expressions and stored it in a db for analysis.

    How much flash does this site have?
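
The steps described in this post (build the URLs from a list of variables, fetch each page, pull fields out with a regular expression, store them in a database) could be sketched roughly as follows. The post used Perl; this is a Python sketch instead. The URL template, the regular expression, and the table layout are all hypothetical placeholders, since the thread does not name the real site or its markup.

```python
import re
import sqlite3
from urllib.request import urlopen

# Hypothetical URL pattern; replace with the real site's structure.
URL_TEMPLATE = "http://example.com/stats/{key}"

# Hypothetical markup: a name cell followed by a numeric value cell.
ROW_PATTERN = re.compile(
    r'<td class="name">(.*?)</td>\s*<td class="value">(\d+)</td>', re.S
)


def build_urls(keys):
    """Build one URL per variable from the shared URL structure."""
    return [URL_TEMPLATE.format(key=k) for k in keys]


def extract_rows(html):
    """Return (name, value) pairs matched by the regular expression."""
    return [(name.strip(), int(value)) for name, value in ROW_PATTERN.findall(html)]


def store_rows(conn, url, rows):
    """Save the extracted rows, tagged with the page they came from."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stats (url TEXT, name TEXT, value INTEGER)"
    )
    conn.executemany(
        "INSERT INTO stats VALUES (?, ?, ?)", [(url, n, v) for n, v in rows]
    )
    conn.commit()


def fetch_page(url):
    """Download a page and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", "replace")


def scrape(keys, conn, fetch=fetch_page):
    """Loop over the generated URLs, fetching, extracting and storing."""
    for url in build_urls(keys):
        store_rows(conn, url, extract_rows(fetch(url)))
```

Passing `fetch` as a parameter lets you try the extraction against saved HTML before hitting the live site. Regular expressions are fine for simple, regular markup like this; an HTML parser is usually more robust when the pages are messier.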


  • Registered Users, Registered Users 2 Posts: 1,082 ✭✭✭Feathers


    Flash is a black box in terms of content. You could grab the SWF/FLV, but you're not going to be able to extract text, links or images from within it in any easy way.


  • Registered Users, Registered Users 2 Posts: 252 ✭✭sf80


    Many flash objects will read their data from text/xml; try decompiling it, you might get a very easy to parse resource.
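
If the Flash chart does turn out to load its numbers from an XML file, parsing that file is usually simpler than scraping HTML. A minimal sketch, assuming a made-up layout where each data point is a `<point>` element with `label` and `value` attributes (the real element and attribute names would have to be read off the actual resource):

```python
import xml.etree.ElementTree as ET


def parse_chart_xml(xml_text):
    """Return (label, value) pairs from a chart data file.

    Assumes a hypothetical layout with <point label="..." value="..."/>
    elements anywhere under the root.
    """
    root = ET.fromstring(xml_text)
    return [(p.get("label"), float(p.get("value"))) for p in root.iter("point")]


sample = (
    '<chart><series>'
    '<point label="Jan" value="3"/>'
    '<point label="Feb" value="4.5"/>'
    '</series></chart>'
)
points = parse_chart_xml(sample)
```

Watching the browser's network requests while the chart loads is often enough to find the data URL, without decompiling anything.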


  • Registered Users, Registered Users 2 Posts: 1,082 ✭✭✭Feathers


    sf80 wrote: »
    Many flash objects will read their data from text/xml; try decompiling it, you might get a very easy to parse resource.

    Sure, that's true - was thinking of easy vs regular screen scraping :)


  • Registered Users, Registered Users 2 Posts: 17,974 ✭✭✭✭Gavin "shels"


    COYW wrote: »
    I did something like this before with Perl a few years ago. I used this library to retrieve and store the page contents.

    I noted that each page on the site that I needed to scrape had a similar URL structure, so I generated a list of the variables, looped through them, built each URL and pulled down all the pages.

    I extracted the content I wanted from the pages using regular expressions and stored it in a db for analysis.

    How much flash does this site have?

    Seems fun... :D Any more useful links?

    It's only the Flash part of the site I'm after. It's a chart containing stats, and I want to extract the data and save it to a database.


  • Registered Users, Registered Users 2 Posts: 1,266 ✭✭✭Overflow


    Two newish frameworks for screen scraping. They are basically headless browsers, so you can programmatically browse a page just like a user would and scrape what you need. Zombie uses Node.js, and both are really easy to get started with if you know some JavaScript.

    Phantom.js
    http://phantomjs.org/

    Zombie.js
    http://zombie.labnotes.org/


  • Registered Users, Registered Users 2 Posts: 17,974 ✭✭✭✭Gavin "shels"


    Just another quick question on this: is there any way of setting a timer on which to grab the data?


  • Registered Users, Registered Users 2 Posts: 26,584 ✭✭✭✭Creamy Goodness


    look into running it from crontab http://en.wikipedia.org/wiki/Cron

    be wary though of having it run at the same time every night. if you're predictable, the sites you're scraping could get wise and block your IP or spike the data.
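
One way to keep a cron-driven scraper from hitting the site at exactly the same time every night is to have the script sleep for a random interval before it starts. A minimal sketch; the function names and the one-hour default window are assumptions for illustration, not anything from the thread:

```python
import random
import time


def jitter_seconds(max_delay=3600, rng=random):
    """Pick a random delay between 0 and max_delay seconds."""
    return rng.uniform(0, max_delay)


def run_with_jitter(job, max_delay=3600, sleep=time.sleep, rng=random):
    """Wait a random amount of time, then run the scraping job."""
    delay = jitter_seconds(max_delay, rng)
    sleep(delay)
    return job()
```

With a crontab entry such as `0 2 * * * python3 /path/to/scraper.py` (path hypothetical), a script that calls `run_with_jitter(main)` would actually start somewhere between 02:00 and 03:00 each night.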


  • Registered Users, Registered Users 2 Posts: 1,266 ✭✭✭Overflow


    Just another quick question on this: is there any way of setting a timer on which to grab the data?

    Very hard to say without knowing how you implemented your scraper.

