Data hidden in HTML files

  • 02-06-2020 1:37am
    #1
    Registered Users Posts: 12


    I've got a Python script for finding metadata hidden in files, but I don't think it's working as expected. I created a text file, htmlPatterns.txt, with the patterns below

    <!\-\-.+\-\->
    <[Mm][Ee][Tt][Aa]

    and pass it to grep in the script below, but it's returning every file in the directory,

    so I'm presuming it's not working correctly, as I've manually checked the files.

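    #!/usr/bin/python

    import sys
    import os
    import subprocess

    # Directory to scan: the first command-line argument, or the current directory.
    directory = "."

    if len(sys.argv) > 1:
        directory = sys.argv[1]

    for root, directories, files in os.walk(directory):
        for name in files:
            # Print the file name, then run grep on that file with the patterns file.
            print(os.path.join(root, name))
            subprocess.call(["grep", "-f", "htmlPatterns.txt", os.path.join(root, name)])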




    I also created an extra HTML file called Bob.html containing just the word Bob, but the script returned that as well, so I'm pretty sure it's not working.


Comments

  • Registered Users, Registered Users 2 Posts: 10,818 ✭✭✭✭28064212


    When you say it's "returning" every file... you're printing the file name before grepping it, so your code will print out the name of every file it scans. Is that what you want it to do?
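
To illustrate the point about printing before grepping, here is a minimal sketch that reports a file only when grep actually finds a match. The function name `files_with_matches` is made up for this example. It relies on grep's exit status (0 when at least one line matched) and adds two flags that are not in the original script: `-q` to suppress grep's own output, and `-E` on the assumption that the `+` in the patterns is meant as "one or more", which basic grep syntax does not give you.

```python
import os
import subprocess


def files_with_matches(directory, pattern_file):
    """Return the paths under `directory` in which grep finds at least one pattern.

    `files_with_matches` is a hypothetical helper, not part of the original script.
    """
    matches = []
    for root, _directories, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # -q: print nothing, only set the exit status (0 means a line matched).
            # -E: extended regex syntax, so `+` means "one or more" (an assumption
            #     about what the patterns in the file are meant to do).
            status = subprocess.call(["grep", "-q", "-E", "-f", pattern_file, path])
            if status == 0:
                matches.append(path)
    return sorted(matches)
```

Something like `for path in files_with_matches(".", "htmlPatterns.txt"): print(path)` would then list only the files that matched, so a file containing just "Bob" should no longer appear. If the patterns file lives inside the scanned directory it may show up in the results itself, since it contains text that looks like what is being searched for.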

    Boardsie Enhancement Suite - a browser extension to make using Boards on desktop a better experience (includes full-width display, keyboard shortcuts, dark mode, and more). Now available through your browser's extension store.

    Firefox: https://addons.mozilla.org/addon/boardsie-enhancement-suite/

    Chrome/Edge/Opera: https://chromewebstore.google.com/detail/boardsie-enhancement-suit/bbgnmnfagihoohjkofdnofcfmkpdmmce



  • Registered Users Posts: 12 jpc1


    Basically, I want it to look for hidden data in all files:

    Using comments: <!-- comment -->
    In META tags: <meta ...
    In CDATA sections: <![CDATA[...

    So I thought I could search for wherever a meta tag, a CDATA section or a comment appears.

    I'm doing it manually at the moment, just trying to learn, but obviously that's not feasible on bigger projects. Sorry for my crappy code, I haven't really used Python or regex/grep before.
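
The three hiding places listed above could be checked in plain Python along these lines. This is only a sketch: the names `HIDDEN_PATTERNS` and `kinds_found` and the exact regular expressions are illustrative assumptions, and regexes are a rough tool for HTML, so treat the result as a first pass rather than a guarantee.

```python
import re

# Illustrative patterns for the three kinds of hidden data mentioned in the post.
# These are assumptions for the sketch, not patterns taken from the thread.
HIDDEN_PATTERNS = {
    # DOTALL lets a comment or CDATA section span several lines.
    "comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "meta": re.compile(r"<meta\b", re.IGNORECASE),
    "cdata": re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL),
}


def kinds_found(html):
    """Return the names of the patterns that occur at least once in `html`."""
    return [kind for kind, pattern in HIDDEN_PATTERNS.items() if pattern.search(html)]
```

Reading each file's text and passing it to `kinds_found` would give an empty list for a file that only contains "Bob", and the matching kinds for anything else.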


  • Registered Users, Registered Users 2 Posts: 10,818 ✭✭✭✭28064212


    There's a much easier way of doing this. grep has an "-r" switch, meaning recursive. Use that switch and pass it a directory, for example grep -r -f htmlPatterns.txt /path/to/directory, and you don't need Python at all.




  • Registered Users, Registered Users 2 Posts: 7,453 ✭✭✭jmcc


    BeautifulSoup might be a better option. It has a learning curve, but it is excellent at extracting data from HTML.
    https://www.crummy.com/software/BeautifulSoup/
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/


    Regards...jmcc


  • Registered Users, Registered Users 2 Posts: 6,198 ✭✭✭Talisman


    BeautifulSoup will run into problems with CDATA. HTML parsers do not recognise the CDATA markers which is why they are usually commented out in HTML code. The lxml parser will strip out CDATA content by default. See https://lxml.de/api.html#cdata for details. The solution would be to use a regular expression to find the CDATA content.

    The following piece of Python code will create three iterables (metas, cdatas, comments) but I won't vouch for the CDATA regex. :D
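    import re
    import requests
    from bs4 import BeautifulSoup, Comment

    # Placeholder: replace with the URL of the page to inspect.
    WEBPAGE_URL = "https://example.com/"

    response = requests.get(WEBPAGE_URL)

    if response.status_code == 200:
        # Name the parser explicitly so the result does not depend on what is installed.
        soup = BeautifulSoup(response.content, "html.parser")
        metas = soup.find_all("meta")
        # Text nodes that mention CDATA (find_all/string= are the current names for findAll/text=).
        cdatas = soup.find_all(string=re.compile("CDATA"))
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))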
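
For the regular-expression route to CDATA suggested in this post, one possible sketch works on the raw page text instead of the parsed tree, so it does not depend on how a given parser treats CDATA. The names `CDATA_RE` and `extract_cdata` are hypothetical, and the pattern assumes sections are delimited by `<![CDATA[` and `]]>` and are not nested.

```python
import re

# Capture whatever sits between <![CDATA[ and the first following ]]>.
# DOTALL allows the section to span multiple lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def extract_cdata(raw_html):
    """Return the contents of each CDATA section, with surrounding whitespace removed."""
    return [payload.strip() for payload in CDATA_RE.findall(raw_html)]
```

With the requests code above, this could be fed `response.text`. When the markers are commented out in a script block (`//<![CDATA[` and `//]]>`), the `//` in front of the closing marker ends up inside the captured text, so some extra cleanup may be needed.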
    

