
Timescaping the Irish web

  • 26-11-2011 10:03am
    #1
    Registered Users Posts: 7,329 ✭✭✭


    I've been working on an algorithm to date Irish websites to produce a timescape of how Irish websites are developed. Basically it estimates the possible age of a website from indicators in the HTML and also from the Last-Modified header in the HTTP response.

    Some search engines, as I understand it, may use the Last-Modified header to see when content was updated and then subsequently send If-Modified-Since to check whether there's been any change. What happens when the Last-Modified date served by the webserver is massively out of sync with the content's actual modification date? Would this nuke the site in terms of a search engine's appraisal of its freshness?

    Some Last-Modified dates are genuine and there are sites out there that have not been modified for over ten years.
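
    To make the Last-Modified / If-Modified-Since exchange described above concrete, here is a minimal sketch using only the Python standard library. The function names and the example URL are hypothetical, and it says nothing about how any particular search engine actually behaves; it only shows the shape of a conditional request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.request import Request


def parse_last_modified(header_value):
    """Parse a Last-Modified header value into an aware UTC datetime."""
    parsed = parsedate_to_datetime(header_value)
    if parsed.tzinfo is None:
        # Values without usable zone information are treated as UTC here.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def build_conditional_request(url, last_seen):
    """Build a GET request asking for the body only if it changed after last_seen."""
    stamp = format_datetime(last_seen.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def interpret_status(status_code):
    """Map the response status of a conditional GET to a simple label."""
    if status_code == 304:
        return "not modified since last visit"
    if status_code == 200:
        return "modified, or the server ignored the condition"
    return "other"


# Hypothetical example: the crawler last saw this header on an earlier visit.
last_seen = parse_last_modified("Sat, 01 Jan 2000 00:00:00 GMT")
request = build_conditional_request("http://example.ie/", last_seen)
```

    Passing the request to urllib.request.urlopen would perform the fetch; note that urlopen raises HTTPError for a 304, so the status would be read from the exception's code attribute.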

    Regards...jmcc


Comments

  • Registered Users Posts: 6,464 ✭✭✭MOH


    There may be something of use somewhere in here - there's posts on tons of current and potential SEO signals based on looking at registered patents, might help.


  • Registered Users Posts: 7,329 ✭✭✭jmcc


    MOH wrote: »
    There may be something of use somewhere in here - there's posts on tons of current and potential SEO signals based on looking at registered patents, might help.
    It is a fascinating site. Thanks for the link. Basically I have a dataset of the Irish web and have been testing the algorithm that I developed to see how it breaks down in terms of updates. This is the raw percentage breakdown by year, combining the Last-Modified dates and the year indications in the HTML:

    | Year | % |
    | 2011 | 59.9609 |
    | 2010 | 16.2162 |
    | 2009 | 8.9542 |
    | 2008 | 4.9075 |
    | 2007 | 2.8276 |
    | 2006 | 1.6123 |
    | 2005 | 0.9239 |
    | 2004 | 0.7447 |
    | 2003 | 2.5319 |
    | 2002 | 0.1736 |
    | 2001 | 0.1718 |
    | 2000 | 0.3845 |
    | 1999 | 0.0812 |
    | 1998 | 0.0614 |
    | 1997 | 0.0496 |
    | 1996 | 0.0471 |
    | 1995 | 0.0341 |
    | 1994 | 0.0397 |
    | 1993 | 0.0304 |
    | 1992 | 0.0298 |
    | 1991 | 0.0217 |
    | 1970 | 0.1848 |
    | 1969 | 0.0006 |

    Not all of the sites return Last-Modified data, and some of those that do are way out of sync with the likely date of the site's content. One good example would be a site with a Last-Modified date in 2000 whose HTML has references to Facebook and Google Analytics. Not all sites have year or copyright indications either, and this makes it somewhat more difficult to estimate the site's update status.
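
    The kind of cross-check described above could be sketched roughly as follows. This is not jmcc's algorithm, just an illustration under stated assumptions: the marker list, the labels, and the function names are all made up for the example, and the launch years (Facebook 2004, Google Analytics 2005) are used only as a lower bound on when a page could have been edited. Dates in 1970 or earlier are flagged separately on the assumption that they come from a zero Unix timestamp rather than a real modification date.

```python
import re
from datetime import datetime, timezone

# Illustrative markers only: a page containing one of these strings cannot
# have been last edited before the year the service launched.
MARKER_FIRST_YEAR = {
    "facebook.com": 2004,
    "google-analytics.com": 2005,
}

# Text following a copyright sign or word, up to the next tag.
COPYRIGHT_CONTEXT = re.compile(r"(?:©|&copy;|copyright)([^<]{0,40})", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def content_floor_year(html):
    """Return the latest year implied by markers and copyright notices, or None."""
    lowered = html.lower()
    years = [year for marker, year in MARKER_FIRST_YEAR.items() if marker in lowered]
    for context in COPYRIGHT_CONTEXT.findall(html):
        years.extend(int(y) for y in YEAR.findall(context))
    return max(years) if years else None


def classify(last_modified, html):
    """Label a Last-Modified date by comparing it with indications in the HTML."""
    if last_modified is None:
        return "missing"
    if last_modified.year <= 1970:
        # Assumption: almost certainly an unset (epoch) timestamp.
        return "epoch"
    floor = content_floor_year(html)
    if floor is not None and last_modified.year < floor:
        return "header-older-than-content"
    return "plausible"


sample_html = (
    '<a href="http://www.facebook.com/example">Find us</a>'
    '<script src="http://www.google-analytics.com/ga.js"></script>'
)
label = classify(datetime(2000, 6, 1, tzinfo=timezone.utc), sample_html)
```

    A page with no markers and no copyright year falls through to "plausible", which reflects the difficulty mentioned above: without indications in the content there is nothing to contradict the header.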

    Regards...jmcc


  • Registered Users Posts: 16,402 ✭✭✭✭Trojan


    Can you access archive.org via API (or scraper)? Could be very useful.


  • Registered Users Posts: 7,329 ✭✭✭jmcc


    Trojan wrote: »
    Can you access archive.org via API (or scraper)? Could be very useful.
    I have an account on archive.org but I don't think that I've applied to access it via API. Their dataset is a bit patchy though, with large segments of the Irish web missing. Much of this is down to the fact that many Irish websites still have poor external linking, so any search engine or crawler that depends on following links will miss these sites completely.
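
    For what Trojan suggests, one possible starting point is the Internet Archive's public Wayback availability lookup, which returns the snapshot closest to a requested date as JSON. The sketch below assumes that endpoint and its response layout; the helper names are hypothetical and the sample response is invented for illustration, not real archive data.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site, timestamp=None):
    """Build the lookup URL for a site, optionally near a YYYYMMDD timestamp."""
    params = {"url": site}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot_year(response_text):
    """Extract the year of the closest snapshot from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Timestamps are in YYYYMMDDhhmmss form, so the first four digits are the year.
    return int(closest["timestamp"][:4])


# Invented sample response, shaped like the layout assumed above.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {"available": True, "status": "200", "timestamp": "20070314120000"}
    }
})
lookup = availability_url("example.ie", "20070101")
year = closest_snapshot_year(sample_response)
```

    A site with no snapshot returns an empty archived_snapshots object, which is exactly the patchy-coverage case described above, so the function returns None rather than guessing.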

    I've got some backups of data from surveys going back to 2007 (at least - there are more around but I have to remember where I left them). I've got a rough database schema worked out for doing the timescape.

    Regards...jmcc

