
Regex issue (in Python)

  • 30-01-2019 10:11am
    #1
    Registered Users, Registered Users 2 Posts: 1,363 ✭✭✭


    Hi there,

    I'm struggling with a regex issue.

    Using Beautiful Soup, I've pulled down the contents of a given page. I'm looking to find all calls that are made on that page. Using BS, I can find all calls such as links, and scripts where the source is within the tag.

    However, there are some scripts that I can't get. These are inline scripts of the form <script> blah</script>

    I'm looking to
    a) Search between the script tags
    b) Pull out anything that ends in .js

    But I'm failing miserably. I've been messing about with re.findall and re.compile, and all I have been able to do is identify the .js itself - not even pull the string which it is attached to.

    Can someone help?


Comments

  • Registered Users, Registered Users 2 Posts: 6,335 ✭✭✭Talisman


    Beautiful Soup alone is not meant for the purpose for which you intend to use it - you need a different tool/library.

    Look at the requests-html library - it parses HTML and has JavaScript support.

    Alternatively you could use Selenium with Beautiful Soup - here's a short web scraping example posted on Medium : Better web scraping in Python with Selenium, Beautiful Soup, and pandas


  • Registered Users, Registered Users 2 Posts: 1,363 ✭✭✭ezra_


    Talisman wrote: »
    Beautiful Soup alone is not meant for the purpose for which you intend to use it - you need a different tool/library.

    Look at the requests-html library - it parses HTML and has JavaScript support.

    Alternatively you could use Selenium with Beautiful Soup - here's a short web scraping example posted on Medium : Better web scraping in Python with Selenium, Beautiful Soup, and pandas

    I don't think I explained myself properly.

    I'm not looking to follow the javascript links; should they contain more links, I won't see these and that is ok. I'm not engaging in a traditional scraping exercise.

    What I am trying to do is find the regex pattern that will pull the address that ends in .js from between two script tags. Whether this is performed on a BS variable, or just the output of a requests pull, I don't mind. All I am interested in is the address ending .js

    Or have I misunderstood what you were saying?


  • Registered Users, Registered Users 2 Posts: 10,948 ✭✭✭✭28064212


    A 'simple' regex approach
    ['"]([^'"]*\.js[^'"]*)['"]
    
    You can try it here: https://regex101.com/ using the sample text below:
    var _pool = _pool || [];
    	(function(){
    		var s = document.createElement('script');
    		s.type = 'text/javascript'; s.async = true;
    		s.src = '//pool.journalmedia.ie/js/pool.min.js?sss'
    		s.src = "//pool.journalmedia.ie/js/pool.min.js?ss2s"
    		var x = document.getElementsByTagName('script')[0];
    		x.parentNode.insertBefore(s, x);
    	})();
    
    It'll pull out the 2 .js links.
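
    If you'd rather check that in Python than on regex101, here is a minimal sketch using re.findall on the same sample text. The names JS_STRING, sample and urls are just for illustration, and the pattern is written with the dot escaped so that it only matches a literal ".js".

```python
import re

# A quoted string (single or double quotes) that contains ".js" somewhere inside it.
# The capture group is the text between the quotes.
JS_STRING = re.compile(r"""['"]([^'"]*\.js[^'"]*)['"]""")

sample = """
var _pool = _pool || [];
(function(){
    var s = document.createElement('script');
    s.type = 'text/javascript'; s.async = true;
    s.src = '//pool.journalmedia.ie/js/pool.min.js?sss'
    s.src = "//pool.journalmedia.ie/js/pool.min.js?ss2s"
    var x = document.getElementsByTagName('script')[0];
    x.parentNode.insertBefore(s, x);
})();
"""

# With a single capture group, findall returns a list of the captured strings
urls = JS_STRING.findall(sample)
for url in urls:
    print(url)
# //pool.journalmedia.ie/js/pool.min.js?sss
# //pool.journalmedia.ie/js/pool.min.js?ss2s
```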

    But it'll be defeated by even simple changes. Say the script was structured like this:
    var _pool = _pool || [];
    	(function(){
    		var s = document.createElement('script');
    		s.type = 'text/javascript'; s.async = true;
    		base = '//pool.journalmedia.ie/js/'
    		s.src = base + 'pool.min.js?sss'
    		s.src = base + "pool.min.js?ss2s"
    		var x = document.getElementsByTagName('script')[0];
    		x.parentNode.insertBefore(s, x);
    	})();
    
    You won't be able to identify that 'base' needs to be prefixed to the two .js matches. The only way to actually do it properly would be to essentially build a full JavaScript interpreter. And even then, there's no requirement that a script's address has to end with .js. What if the page is loading a JavaScript file called 'pool.min'?
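
    To make the failure concrete, here is a rough sketch that runs the same kind of quoted-string pattern over the restructured script. The names JS_STRING, sample and fragments are illustrative only. The regex still matches, but all it can return are the string literals, not the addresses the browser would actually request.

```python
import re

# A quoted string (single or double quotes) that contains ".js"
JS_STRING = re.compile(r"""['"]([^'"]*\.js[^'"]*)['"]""")

sample = """
var _pool = _pool || [];
(function(){
    var s = document.createElement('script');
    s.type = 'text/javascript'; s.async = true;
    base = '//pool.journalmedia.ie/js/'
    s.src = base + 'pool.min.js?sss'
    s.src = base + "pool.min.js?ss2s"
    var x = document.getElementsByTagName('script')[0];
    x.parentNode.insertBefore(s, x);
})();
"""

# The regex only sees text; it has no idea that 'base' is concatenated in front
fragments = JS_STRING.findall(sample)
print(fragments)
# ['pool.min.js?sss', 'pool.min.js?ss2s']
```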

    There are way too many issues with your current approach. As per your previous thread, you should be looking into using the tool that already exists for capturing every request a page makes: your browser. It already does literally everything you're looking for; you just have to figure out how to automate it, e.g. using Selenium.




  • Registered Users, Registered Users 2 Posts: 6,335 ✭✭✭Talisman


    Unless you have a means of compiling the JavaScript you are hacking your way to a solution. You can use a regular expression but it won't find URLs that are dynamically created in JavaScript.

    You could try something like this - I haven't tested the code but it should get you into the ball park. :)
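    from bs4 import BeautifulSoup
    import re
    
    # html_doc is assumed to already hold the page's HTML as a string
    bs = BeautifulSoup(html_doc, 'html.parser')
    # Raw string, so that \b is a regex word boundary rather than a backspace character
    expression = re.compile(r'([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})?\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)\.js')
    
    # Get all of the script tags
    script_tags = bs.find_all('script')
    for script in script_tags:
        # Search the text inside the tag; re needs a string, not a Tag object.
        # script.string is None for empty tags (e.g. those with a src attribute).
        text = script.string or ''
        # Use the regular expression to find the URL strings that contain '.js'.
        # finditer + group(0) gives the whole match rather than a tuple of groups.
        for match in expression.finditer(text):
            print(match.group(0))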
    
    You might find the regular expression that you need on regexr.com


  • Registered Users, Registered Users 2 Posts: 1,363 ✭✭✭ezra_


    Got it working in the end (or at least enough for what I need it to do).
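    import re
    
    # Find all js and pull the address
    string = str(soup)  # soup is the BeautifulSoup object for the page
    regex = r"([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)\.js"
    
    # findall returns only the captured group, i.e. each address without the trailing ".js"
    matches = re.findall(regex, string)
    
    # srcs is the list of addresses already collected earlier in the script
    srcs.extend(matches)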
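
    For anyone landing here later: the snippet above searches the whole page, so it will also pick up anything ending in .js in ordinary text or in src attributes. A rough sketch of limiting the search to the bodies of inline script tags is below. It uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs, and it swaps in a quoted-string pattern rather than the character-class regex above. InlineScriptCollector, inline_js_urls and sample_html are made-up names for illustration.

```python
import re
from html.parser import HTMLParser

# A quoted string (single or double quotes) that contains ".js"
JS_STRING = re.compile(r"""['"]([^'"]*\.js[^'"]*)['"]""")


class InlineScriptCollector(HTMLParser):
    """Collects the text of <script> tags that have no src attribute."""

    def __init__(self):
        super().__init__()
        self.in_inline_script = False
        self.bodies = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and "src" not in dict(attrs):
            self.in_inline_script = True
            self.bodies.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_inline_script = False

    def handle_data(self, data):
        # Script text may arrive in more than one chunk, so append rather than assign
        if self.in_inline_script:
            self.bodies[-1] += data


def inline_js_urls(html):
    """Return quoted strings containing '.js' found inside inline scripts."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    return [url for body in collector.bodies for url in JS_STRING.findall(body)]


sample_html = """<html><head>
<script src="/static/app.js"></script>
<script>
  var s = document.createElement('script');
  s.src = '//pool.journalmedia.ie/js/pool.min.js?sss';
</script>
</head><body><p>about.js is mentioned here in plain text</p></body></html>"""

print(inline_js_urls(sample_html))
# ['//pool.journalmedia.ie/js/pool.min.js?sss']
```

    The same caveat as earlier in the thread applies: this only finds addresses written out as string literals, not ones built up dynamically.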
    

