
Help scraping data

  • 30-01-2008 7:38pm
    #1
    Registered Users Posts: 1,322 ✭✭✭


    Hey,

    I'm currently doing a large project on public transport and I need the Dublin Bus data off their website. I've been emailing them looking for some kind of db dump but am getting nowhere with them.

    I was thinking of using Ruby, Perl or maybe Python to scrape it from the HTML. The trouble is that the HTML isn't clearly structured, it's just tables etc.

    Any ideas on which language to use? Also, any suggestions on an approach to the HTML structure problem?

    I'd appreciate any help on this. It's either this or sit for hours typing out the data. I'd rather spend a couple of weeks learning to do it in Ruby or Perl tbh, and so would my sanity :p


Comments

  • Registered Users Posts: 981 ✭✭✭fasty


    You could use regular expressions to pull data from their site. The language isn't really important; most of them support regexes these days.
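
    As a rough illustration of the regex idea, here is a minimal Python sketch. The sample markup, `SAMPLE_HTML`, `CELL_RE` and `extract_cells` are all made up for the example; the real Dublin Bus pages will look different, so the pattern would need adjusting.

```python
import re

# Hypothetical fragment standing in for one timetable page;
# the real markup on the site will differ.
SAMPLE_HTML = """
<table>
  <tr><td>Monday</td><td>06:30</td></tr>
  <tr><td>Monday</td><td>07:00</td></tr>
</table>
"""

# Capture the inner text of every <td> cell, tolerating attributes,
# upper-case tags and cells that span several lines.
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)


def extract_cells(html):
    """Return the stripped text of each table cell, in document order."""
    return [cell.strip() for cell in CELL_RE.findall(html)]


print(extract_cells(SAMPLE_HTML))
```

    This is fine for flat, regular tables. It gets fragile once cells contain nested tags or tables, and at that point a proper parser is probably the safer bet.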


  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    Fill up an array X[] with all the bus numbers.

    Then foreach x in X[]
    {
    getXML("http://www.dublinbus.ie/your_journey/viewer.asp?route=" + x)
    }

    Now all you need to do is look at the DOM and extract the relevant data.

    Xerces parser for Java is great -- you can rapidly get apps running using the Eclipse IDE.

    If your server doesn't support Java (likely), then I'd go with a Perl/C++ solution. The Xerces XML parser is brilliantly efficient and is also available for Perl and C++.
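
    For anyone who would rather try this in Python, the loop-and-parse approach above might look roughly like the sketch below. It swaps Xerces for the standard library's `html.parser`, on the assumption that the pages are ordinary HTML rather than well-formed XML. The URL pattern is the one from the pseudo-code; `route_url`, `fetch`, `TableRows`, `parse_rows` and the sample markup are hypothetical names and data for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

# URL pattern taken from the pseudo-code above.
BASE_URL = "http://www.dublinbus.ie/your_journey/viewer.asp"


def route_url(route):
    """Build the viewer URL for one bus route number."""
    return BASE_URL + "?" + urlencode({"route": route})


def fetch(url):
    """Copy a page into memory (not called below; the encoding is assumed)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()        # copes with a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()       # copes with a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()


def parse_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open at end of input
    return parser.rows


# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<table>
  <tr><th>Stop</th><th>Time</th></tr>
  <tr><td>Stop A</td><td>06:30</td></tr>
</table>
"""

print(route_url("46A"))
print(parse_rows(SAMPLE_HTML))
```

    Against the live site you would call `parse_rows(fetch(route_url(x)))` for each route number x, then work out which rows actually hold the timetable data.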


  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    Also, I'm not sure about the legalities of using Dublin Bus's information. I'm sure you can scrape it, but how you use that information is a different thing...


  • Registered Users Posts: 7,400 ✭✭✭Trampas


    Usually, if it's for private use it's OK (college included). It's only if you intend to use it commercially that you could be infringing someone's legal rights.

    I did one last year, scraping Ryanair and Aer Lingus.

    I did it by reading the returned HTML.

    I used .NET.


  • Registered Users Posts: 1,322 ✭✭✭Mad_Max


    That's excellent guys, thanks for the advice. Yeah, it's a college project. I don't have any other intentions for the data, so I assume it's the same as me just looking at the site and writing it down.

    Cantab, is that getXML function JavaScript or something?


  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    Mad_Max wrote:
    That's excellent guys, thanks for the advice. Yeah, it's a college project. I don't have any other intentions for the data, so I assume it's the same as me just looking at the site and writing it down.

    Cantab, is that getXML function JavaScript or something?

    No, 'getXML' is pseudo-code...

    Copy the page to memory, parse it with Xerces and extract the relevant information.

    Shouldn't take an experienced programmer more than a day to do.


  • Registered Users Posts: 1,322 ✭✭✭Mad_Max


    Oh right lol. I've used that method in JavaScript before so just got confuzzled.

    Thanks a mill for the help.

