Data hidden in HTML files

jpc1 · 02-06-2020 1:37am #1

I've got a Python script that searches files for hidden metadata, but I don't think it's working as expected. I created a text file with the patterns below

<!\-\-.+\-\->
<[Mm][Ee][Tt][Aa]

and use it as the pattern file in the script below, but it's returning every file in the directory.

So I'm presuming it's not working correctly, as I've manually checked the files.

#!/usr/bin/python

import sys
import os
import subprocess

# Directory to scan; defaults to the current directory.
directory = "."

if len(sys.argv) > 1:
    directory = sys.argv[1]

for root, directories, files in os.walk(directory):
    for name in files:
        # Print the file name, then run grep on that file with the pattern file.
        print(os.path.join(root, name))
        subprocess.call(["grep", "-f", "htmlPatterns.txt", os.path.join(root, name)])

I also created an extra HTML file called Bob.html containing only the word Bob, but the script returned that as well, so I'm pretty sure it's not working.

02-06-2020 9:54am

When you say it's "returning" every file... you're printing the file name before grepping it, so your code will print out the name of every file it scans. Is that what you want it to do?
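
One way to print only the files that actually contain a match is to do the matching in Python and print the name afterwards. This is a minimal sketch rather than a fix to the original script: the names `load_patterns` and `files_with_matches` are made up for illustration, and it assumes each non-empty line of htmlPatterns.txt is also valid as a Python regular expression, which is close to grep's syntax but not identical.

```python
import os
import re


def load_patterns(pattern_file):
    """Compile each non-empty line of pattern_file as a regular expression."""
    with open(pattern_file, encoding="utf-8") as handle:
        return [re.compile(line.strip()) for line in handle if line.strip()]


def files_with_matches(directory, patterns):
    """Return the paths under directory that contain at least one match."""
    matched = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # errors="replace" keeps binary or oddly encoded files from raising.
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            if any(pattern.search(text) for pattern in patterns):
                matched.append(path)
    return sorted(matched)
```

Calling `files_with_matches(".", load_patterns("htmlPatterns.txt"))` and printing each returned path would then list only the files in which a pattern was found, so a file such as Bob.html that contains nothing but the word Bob would not appear.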

jpc1 · 02-06-2020 5:51pm

I want it to look for hidden data in all files, basically:

In comments: <!-- ...
In META tags: <meta ...
In CDATA sections: <![CDATA[...

So I thought I could search for wherever a meta tag, a CDATA section or a comment has been put in.

I'm doing it manually at the moment, just trying to learn, but obviously that's not feasible on bigger projects. Sorry for my crappy code, I haven't really used Python or regex/grep before.

02-06-2020 6:24pm

There's a much easier way of doing this. grep has an "-r" switch, meaning recursive. Use that switch and pass it a directory, and you don't need Python at all.
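
As a rough illustration of that suggestion, here is how the recursive grep call could be assembled, written in Python only to keep the thread's examples in one language. The helper names `recursive_grep_command` and `run_recursive_grep` are made up; `-r`, `-l` (print only the names of matching files), `-E` (extended regular expressions) and `-f` are standard grep options. `-E` is included on the assumption that the `.+` in the pattern file is meant as "one or more characters"; in grep's default basic syntax a bare `+` is a literal plus sign.

```python
import subprocess


def recursive_grep_command(directory, pattern_file="htmlPatterns.txt"):
    """Build the argument list for a recursive, names-only grep."""
    return ["grep", "-r", "-l", "-E", "-f", pattern_file, directory]


def run_recursive_grep(directory, pattern_file="htmlPatterns.txt"):
    """Run grep and return the list of files it reports as matching."""
    result = subprocess.run(
        recursive_grep_command(directory, pattern_file),
        capture_output=True,
        text=True,
        check=False,  # grep exits with status 1 when nothing matches
    )
    return result.stdout.splitlines()
```

Typed straight into a shell, the same command would be `grep -r -l -E -f htmlPatterns.txt .` with no Python involved.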

jmcc · 06-06-2020 1:47am

BeautifulSoup might be a better option. It has a learning curve, but it is excellent at pulling data out of HTML.
https://www.crummy.com/software/BeautifulSoup/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Regards...jmcc

Talisman · 11-06-2020 6:01am

BeautifulSoup will run into problems with CDATA. HTML parsers do not recognise the CDATA markers, which is why they are usually commented out in HTML code. The lxml parser will strip out CDATA content by default. See https://lxml.de/api.html#cdata for details. The solution would be to use a regular expression to find the CDATA content.
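
A regular expression along those lines might look like the sketch below. It works on the raw text of the file rather than on a parsed tree, so it does not depend on how any particular parser treats CDATA. The names `CDATA_RE` and `find_cdata` are made up for illustration, and the pattern assumes the standard `<![CDATA[ ... ]]>` delimiters.

```python
import re

# Matches <![CDATA[ ... ]]> and captures what lies between the markers.
# re.DOTALL lets "." span line breaks, since CDATA blocks are often multi-line.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def find_cdata(html_text):
    """Return the contents of every CDATA section found in html_text."""
    return CDATA_RE.findall(html_text)
```

Because the match is non-greedy, two CDATA sections in the same file come back as two separate strings instead of one long match running from the first opening marker to the last closing one.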

The following piece of Python code will create three iterables (metas, cdatas, comments) but I won't vouch for the CDATA regex.

import re
import requests
from bs4 import BeautifulSoup, Comment

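# Placeholder address; replace it with the page to inspect.
WEBPAGE_URL = "https://example.com/"

response = requests.get(WEBPAGE_URL)

if response.status_code == 200:
    # No parser is named, so BeautifulSoup picks the best one installed.
    soup = BeautifulSoup(response.content)
    metas = soup.find_all("meta")
    cdatas = soup.find_all(string=re.compile("CDATA"))
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))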
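
Since jpc1's files are on disk and not behind a URL, a variant that reads local files may be closer to the original question. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so nothing needs installing. The class name `HiddenDataCollector` and the function `collect_hidden_data` are made up for illustration. It collects only comments and meta tags; CDATA is left out because parsers differ in how they report it.

```python
from html.parser import HTMLParser


class HiddenDataCollector(HTMLParser):
    """Collect comments and <meta> tag attributes while parsing."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.metas = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->.
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <META ...> is caught as well.
        if tag == "meta":
            self.metas.append(dict(attrs))


def collect_hidden_data(path):
    """Parse one HTML file and return (comments, metas)."""
    collector = HiddenDataCollector()
    with open(path, encoding="utf-8", errors="replace") as handle:
        collector.feed(handle.read())
    collector.close()
    return collector.comments, collector.metas
```

Combined with `os.walk`, calling `collect_hidden_data` on each file and printing the path only when either list is non-empty would report just the files that carry comments or meta tags.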
