
Some help, please

  • 21-01-2019 3:04pm
    #1
    Registered Users Posts: 1,373 ✭✭✭ezra_


    Hi there, I'm looking for some help.

    I tried searching for the answer on Stack Overflow, but realised pretty quickly that I didn't know what to search for!

    I'm looking to write a script that basically mimics the 'Sources' tab in Chrome's developer tools, but with a couple of tweaks.

    Basically, for a given website, e.g. boards.ie, I'm looking to do the following:
    • I get a list of calls that are made when you hit that page
    • For each call, I get the file that is being called (e.g. CSS files, .js calls, content pages, etc.)
    • I get the IP of the entity being called
    • The IP is resolved in terms of country
    • All this is stored in some db

    I'm happy to tinker about in Python (I'm guessing that PHP won't help me here) and I'm quite confident in SQL.

    However, I keep ending up at web-scraping scripts which, while close, don't actually help me.

    Can anyone help with a steer as to where to start?


Comments

  • Registered Users Posts: 5,992 ✭✭✭Talisman


    You should check out the documentation for the window.performance object.

    Try this code out in the console of your browser - it will list out each of the resources loaded for the page - that's your starting point.
    // List the URL of every resource (scripts, stylesheets, images, etc.)
    // that the browser loaded for the current page.
    const resources = window.performance.getEntriesByType("resource");
    resources.forEach(function (resource) {
        console.log(resource.name);
    });
    


  • Registered Users Posts: 10,454 ✭✭✭✭28064212


    It's not entirely clear what you want. Is this something that will happen when you visit the page in your browser? Or is it a stand-alone program that you pass a URL to? Given you talk about Python*, I'm assuming the latter. In which case, you've got quite a job on your hands.

    To start simply, you can use the requests library to grab "http://www.boards.ie" - a minimal sketch, assuming the requests package is installed:
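    import requests

    # Fetch the raw HTML of the page.
    response = requests.get("http://www.boards.ie")
    print(response.text)

    You'll end up with something like this: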
    <!DOCTYPE html>
    <html>
    <head>
    <!-- Google Tag Manager -->
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
    j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
    'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','GTM-NV5D5XZ');</script>
    <!-- End Google Tag Manager -->

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
    <meta name="viewport" content="width=device-width,initial-scale=1" />
    <title>boards.ie - Now Ye're Talkin'</title>
    <link rel="icon" href="https://b-static.net/www/i/favicon_16.ico?v=1&quot; type="image/x-icon" />
    <link rel="stylesheet" type="text/css" media="screen" href="https://b-static.net/www/css/themes/default/reset.css?v=502"/&gt;
    <link rel="stylesheet" type="text/css" media="screen" href="https://b-static.net/www/css/themes/default/boards-www.css?v=502"/&gt;
    <link rel="stylesheet" type="text/css" media="screen" href="https://b-static.net/www/css/themes/default/homepage.css?v=502"/&gt;
    <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
    <script type="text/javascript" src="https://b-static.net/www/js/boards-www.js?v=502"></script&gt;
    <script type="text/javascript" src="https://b-static.net/www/js/ext/handlebars.runtime.js?v=502"></script&gt;
    <script type="text/javascript" src="https://b-static.net/www/tmpl/index.js?v=502"></script&gt;
    ...

    i.e. the HTML source of the page. It would be reasonably simple to parse the source for 'script', 'stylesheet' and 'img' elements. But what if one of the JavaScript files loads a whole other set of resources? The only way to find that would be to parse each source. You're now essentially talking about writing an entire JavaScript engine from scratch.

    It might be possible using Selenium, which drives an already existing browser to do the requests and parsing for you. Some Python/Selenium approaches are suggested here: https://stackoverflow.com/questions/19786525/how-to-list-loaded-resources-with-selenium-phantomjs
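    For example, a minimal sketch (assuming Selenium 4 with Chrome and a matching chromedriver; the URL is just an example) that reads Chrome's performance log to list every network request the page triggers, including ones fired by scripts at runtime:
    import json

    from selenium import webdriver

    # Ask Chrome to record its performance (network) log.
    options = webdriver.ChromeOptions()
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

    driver = webdriver.Chrome(options=options)
    driver.get("https://www.boards.ie")

    # Each "Network.requestWillBeSent" event corresponds to one outgoing request.
    for entry in driver.get_log("performance"):
        message = json.loads(entry["message"])["message"]
        if message["method"] == "Network.requestWillBeSent":
            print(message["params"]["request"]["url"])

    driver.quit()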

    *Not sure why you think PHP couldn't help. Effectively, anything you can do in Python you can do in PHP, though Python is probably the easier one to do it in.

    Boardsie Enhancement Suite - a browser extension to make using Boards on desktop a better experience (includes full-width display, keyboard shortcuts, and dark mode). Now available through the extension stores

    Firefox: https://addons.mozilla.org/addon/boardsie-enhancement-suite/

    Chrome/Edge/Opera: https://chromewebstore.google.com/detail/boardsie-enhancement-suit/bbgnmnfagihoohjkofdnofcfmkpdmmce



  • Registered Users Posts: 1,373 ✭✭✭ezra_


    Thanks guys.

    In short, I'm agnostic as to the form of the script, but given the number of scraping scripts written in Python, I assumed it was a good place to start. Also, I have some prior experience with Python.

    Talisman - that pretty much sums up what specifically I'm looking to do. Is there a Python equivalent of window.performance?

    28064212 - I'm pretty much just looking for the sources; what they actually bring back is immaterial in this instance. If you were to visualise this on a map, I'd like to place the server of the primary URL, and then show where (geographically) other calls are made to and what the nature of these calls is.

    Is that making sense?


  • Registered Users Posts: 10,454 ✭✭✭✭28064212


    ezra_ wrote: »
    28064212 - I'm pretty much just looking for the sources; what they actually bring back is immaterial in this instance. If you were to visualise this on a map, I'd like to place the server of the primary URL, and then show where (geographically) other calls are made to and what the nature of these calls is.
    I think you're still missing the point. Take the boards.ie homepage as an example. If you look in the Sources tool, one of the resources that is loaded is called "lscache.min.js" (https://pool.journalmedia.ie/js/lscache.min.js). But if you look at the HTML source of boards.ie, you won't find lscache.min.js linked anywhere. That's because the page loads a different script called pool.journalmedia.ie/js/pool.min.js, which in turn calls lscache.min.js. And there's nothing to stop lscache.min.js calling yet another script, which calls another script, and on, and on. The only way for you to know that would be to download all the linked resources and check if they call anything else.

    If you just downloaded boards.ie and checked the resources on that page, you'd think there were about 12 loaded. In reality, when you visit boards.ie in your browser, there are around 37 requests made.




  • Registered Users Posts: 1,373 ✭✭✭ezra_


    28064212 wrote: »
    I think you're still missing the point. Take the boards.ie homepage as an example. If you look in the Sources tool, one of the resources that is loaded is called "lscache.min.js" (https://pool.journalmedia.ie/js/lscache.min.js). But if you look at the HTML source of boards.ie, you won't find lscache.min.js linked anywhere. That's because the page loads a different script called pool.journalmedia.ie/js/pool.min.js, which in turn calls lscache.min.js. And there's nothing to stop lscache.min.js calling yet another script, which calls another script, and on, and on. The only way for you to know that would be to download all the linked resources and check if they call anything else.

    If you just downloaded boards.ie and checked the resources on that page, you'd think there were about 12 loaded. In reality, when you visit boards.ie in your browser, there are around 37 requests made.

    I get that; however, for this particular exercise that doesn't actually matter. It's the first calls which are crucial; any further scripts are outside the analysis.

    Down the line it will be needed, but for the moment only the first calls matter.


  • Registered Users Posts: 10,454 ✭✭✭✭28064212


    ezra_ wrote: »
    I get that; however, for this particular exercise that doesn't actually matter. It's the first calls which are crucial; any further scripts are outside the analysis.
    Then a simplified approach would be to use the requests library to call the URL, and use the Beautiful Soup library to parse the returned HTML. Pull out the 'src' attribute of <script>, <img> and <iframe> tags, and the 'href' attribute of <link> tags. This will give you a list of URLs that are called.

    DNS lookups given a hostname are simple in Python, so you just need to get the hostname from each URL, probably using urlparse. Once you have the list of IPs, you need an IP-to-location method; geoip2 looks like the most commonly used one.
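    A rough sketch of that whole pipeline (assuming the requests and beautifulsoup4 packages are installed; using socket for the DNS step is one option among several):
    import socket
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    page_url = "https://www.boards.ie"
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

    # First-level resources: src of script/img/iframe tags, href of link tags.
    urls = set()
    for tag in soup.find_all(["script", "img", "iframe"], src=True):
        urls.add(urljoin(page_url, tag["src"]))
    for tag in soup.find_all("link", href=True):
        urls.add(urljoin(page_url, tag["href"]))

    # Resolve each hostname to an IP address.
    for url in sorted(urls):
        host = urlparse(url).hostname
        try:
            ip = socket.gethostbyname(host)
        except (socket.gaierror, TypeError):
            ip = None  # e.g. data: URIs have no hostname
        print(url, host, ip)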
    ezra_ wrote: »
    Down the line it will be needed, but for the moment only the first calls matter.
    Still think you're going down the wrong route if this is going to be needed. The approach above isn't feasibly expandable to cover this requirement; you'll have to throw it out and start again.




  • Registered Users Posts: 1,373 ✭✭✭ezra_


    28064212 wrote: »

    Still think you're going down the wrong route if this is going to be needed. The approach above isn't feasibly expandable to cover this requirement; you'll have to throw it out and start again.

    You are probably right; however, I won't know until I see the results from the first step and where that leads!


  • Registered Users Posts: 5,992 ✭✭✭Talisman


    ezra_ wrote: »
    Talisman - that pretty much sums up what specifically I'm looking to do. Is there a Python equivalent of window.performance?
    Window.performance is a browser API - you will need to use a web browser to avail of it. See: A Primer for Web Performance Timing APIs

    If you were to create an extension for your web browser you could utilise the available APIs and save yourself reinventing the wheel.

    The Firefox browser implements some APIs that are available for extensions. One of these gives you the ability to make DNS requests within the browser (dns); the developer documentation's sample JavaScript code shows how simple the task becomes.
    // Callback invoked once the DNS lookup completes.
    function resolved(record) {
      console.log(record.addresses);
    }

    // browser.dns.resolve returns a Promise that resolves to a DNSRecord.
    let resolving = browser.dns.resolve("example.com");

    resolving.then(resolved);

    // > e.g. Array [ "93.184.216.34" ]


    Everything you seem to need is already performed by the browser, so scraping the content with Python doesn't seem the most effective approach. If you are determined to use Python then none of this is of interest to you, but you should take a look at the WebPagetest project - it's a free service and everything is on GitHub, so you're free to mash it into what you need.

    EDIT: I forgot to mention that for your IP geolocation requirement you can avail of MaxMind's free GeoLite2 database, provided you meet their simple attribution requirement.
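    For the lookup itself, a minimal sketch with the geoip2 package mentioned earlier (the .mmdb filename is a placeholder for wherever you save the downloaded database):
    import geoip2.database

    # Open the GeoLite2 country database downloaded from MaxMind.
    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

    response = reader.country("8.8.8.8")
    print(response.country.iso_code)  # e.g. "US"
    reader.close()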

