Regex issue (in Python)

ezra_ · 30-01-2019 11:11am #1

Hi there,

I'm struggling with a regex issue.

Using Beautiful Soup, I've pulled down the contents of a given page. I'm looking to find all calls that are made on that page. Using BS, I can find all calls such as links, and scripts where the source is within the tag.

However, there are some scripts that I can't get. These have the form <script> blah</script>

I'm looking to
a) Search between the Script Tags
b) Pull something that ends .js

But I'm failing miserably. I've been messing about with re.findall and re.compile, and all I have been able to do is identify the .js itself - not even pull the string which it is attached to.

Can someone help?

Talisman · 30-01-2019 11:34am

Beautiful Soup alone is not meant for the purpose for which you intend to use it - you need a different tool/library.

Look at the requests-html library - it parses HTML and has JavaScript support.

Alternatively you could use Selenium with Beautiful Soup - here's a short web scraping example posted on Medium: "Better web scraping in Python with Selenium, Beautiful Soup, and pandas"

ezra_ · 30-01-2019 11:40am

Talisman wrote: »

Beautiful Soup alone is not meant for the purpose for which you intend to use it - you need a different tool/library.

Look at the requests-html library - it parses HTML and has JavaScript support.

Alternatively you could use Selenium with Beautiful Soup - here's a short web scraping example posted on Medium: "Better web scraping in Python with Selenium, Beautiful Soup, and pandas"

I don't think I explained myself properly.

I'm not looking to follow the javascript links; should they contain more links, I won't see these and that is ok. I'm not engaging in a traditional scraping exercise.

What I am trying to do is find the regex pattern that will pull the address that ends in .js from between two script tags. Whether this is performed on a BS variable, or just the output of a requests pull, I don't mind. All I am interested in is the address ending .js

Or have I misunderstood what you were saying?

30-01-2019 12:10pm

A 'simple' regex approach

['"]([^'"]*\.js[^'"]*)['"]

You can try it here: https://regex101.com/ using the sample text below:

var _pool = _pool || [];
	(function(){
		var s = document.createElement('script');
		s.type = 'text/javascript'; s.async = true;
		s.src = '//pool.journalmedia.ie/js/pool.min.js?sss'
		s.src = "//pool.journalmedia.ie/js/pool.min.js?ss2s"
		var x = document.getElementsByTagName('script')[0];
		x.parentNode.insertBefore(s, x);
	})();

It'll pull out the 2 .js links.
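
For anyone who would rather check this from Python than on regex101, here is a minimal sketch using the standard re module. It assumes a slightly tightened form of the pattern (the dot before js is escaped and the character classes hold only the two quote characters). The names JS_STRING and sample are made up for the example.

```python
import re

# A quote, then any run of non-quote characters that contains a literal ".js",
# then a closing quote. The group captures what sits between the quotes.
JS_STRING = re.compile(r"""['"]([^'"]*\.js[^'"]*)['"]""")

# The sample script from the post, held as a plain Python string.
sample = '''
var _pool = _pool || [];
(function(){
    var s = document.createElement('script');
    s.type = 'text/javascript'; s.async = true;
    s.src = '//pool.journalmedia.ie/js/pool.min.js?sss'
    s.src = "//pool.journalmedia.ie/js/pool.min.js?ss2s"
    var x = document.getElementsByTagName('script')[0];
    x.parentNode.insertBefore(s, x);
})();
'''

found = JS_STRING.findall(sample)
for url in found:
    print(url)
# //pool.journalmedia.ie/js/pool.min.js?sss
# //pool.journalmedia.ie/js/pool.min.js?ss2s
```

Because the group spans everything between the quotes, the query string after .js is kept as well.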

But it'll be defeated by even simple changes. Say the script was structured like this:

var _pool = _pool || [];
	(function(){
		var s = document.createElement('script');
		s.type = 'text/javascript'; s.async = true;
		base = '//pool.journalmedia.ie/js/'
		s.src = base + 'pool.min.js?sss'
		s.src = base + "pool.min.js?ss2s"
		var x = document.getElementsByTagName('script')[0];
		x.parentNode.insertBefore(s, x);
	})();

You won't be able to identify that 'base' needs to be prefixed to the two .js matches. The only way to actually do it properly would be to essentially build a full JavaScript compiler. And even then, there's no requirement that a script's name has to end with .js. What if the script is loading a JavaScript file called 'pool.min'?

There are far too many issues with your current approach. As per your previous thread, you should be looking into using the tool that already exists for getting every request a page makes: your browser. It already does literally everything you're looking for; you just have to figure out how to automate it, e.g. using Selenium.
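
To make the limitation concrete, the sketch below runs the same kind of quoted-string pattern over the restructured script. JS_STRING, sample and no_extension are illustrative names only, and the pattern is the escaped-dot form of the one above.

```python
import re

# Same idea as before: capture a quoted string that contains a literal ".js".
JS_STRING = re.compile(r"""['"]([^'"]*\.js[^'"]*)['"]""")

# The restructured script: the host and path now live in a separate variable.
sample = '''
base = '//pool.journalmedia.ie/js/'
s.src = base + 'pool.min.js?sss'
s.src = base + "pool.min.js?ss2s"
'''

fragments = JS_STRING.findall(sample)
print(fragments)
# ['pool.min.js?sss', 'pool.min.js?ss2s']   (the base path is lost)

# A script file whose name does not end in .js is not matched at all.
no_extension = "s.src = '//pool.journalmedia.ie/js/pool.min'"
print(JS_STRING.findall(no_extension))
# []
```

The regex only sees string literals, so anything assembled at run time (concatenation, variables, function calls) is invisible to it.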

Talisman · 30-01-2019 12:18pm

Unless you have a means of compiling the JavaScript you are hacking your way to a solution. You can use a regular expression but it won't find URLs that are dynamically created in JavaScript.

You could try something like this - I haven't tested the code but it should get you into the ballpark.

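from bs4 import BeautifulSoup
import re

# html_doc is assumed to hold the page source as a string
bs = BeautifulSoup(html_doc, 'html.parser')
# Raw string, so that \b means "word boundary" rather than a backspace character
expression = r'([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})?\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)\.js'

# Get all of the script tags
script_tags = bs.find_all('script')
for script in script_tags:
    # Search the tag's inline text, not the Tag object itself
    # (script.string is None when the tag has no inline text)
    for match in re.finditer(expression, script.string or ''):
        # group(0) is the whole match, including the trailing '.js'
        print(match.group(0))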

You might find the regular expression that you need on regexr.com

ezra_ · 07-02-2019 10:56pm

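Got it working in the end (or at least enough for what I need it to do).

# Find all js and pull the address
# (soup is the BeautifulSoup object for the page; srcs is a list defined earlier)
string = str(soup)
regex = r"([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)\.js"

# findall returns the captured group: the address up to, but not including, ".js"
matches = re.findall(regex, string)

# Add every match to the running list
srcs.extend(matches)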
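
For later readers, here is a self-contained sketch of the same idea restricted to inline scripts. Several things in it are assumptions rather than what was actually run in this thread: it uses the standard library's html.parser instead of Beautiful Soup so it runs without extra installs, the pattern is a variant that keeps ".js" inside the captured group and adds \b so that ".json" is not picked up, and the names InlineScripts, inline_js_urls and page are hypothetical.

```python
import re
from html.parser import HTMLParser

# Variant pattern (an assumption, not the one posted above): the group
# includes ".js", and \b stops it from matching ".json".
JS_URL = re.compile(r"([-a-zA-Z0-9@:%_+.~#?&/=]*\.js\b)")


class InlineScripts(HTMLParser):
    """Collect the text of <script> elements that have no src attribute."""

    def __init__(self):
        super().__init__()
        self.in_inline = False
        self.bodies = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and "src" not in dict(attrs):
            self.in_inline = True
            self.bodies.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_inline = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so append to the current body.
        if self.in_inline:
            self.bodies[-1] += data


def inline_js_urls(html):
    """Return every address ending in .js found inside inline scripts."""
    parser = InlineScripts()
    parser.feed(html)
    parser.close()
    srcs = []
    for body in parser.bodies:
        srcs.extend(JS_URL.findall(body))
    return srcs


# Invented sample page: one external script, one inline script.
page = """<html><head>
<script src="/static/app.js"></script>
<script>
  var s = document.createElement('script');
  s.src = '//pool.journalmedia.ie/js/pool.min.js?sss';
  fetch('/api/data.json');
</script>
</head></html>"""

print(inline_js_urls(page))
# ['//pool.journalmedia.ie/js/pool.min.js']
```

The external script is skipped because its address is already available from the src attribute, and the query string is dropped because the pattern stops at ".js". The earlier caveat still applies: URLs built up at run time will not be found this way.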
