When is a webpage not a webpage!?

azzeretti · 11-06-2013 11:29am #1

This is kind of a "lateral thinking" question that I am having a mental block about.
I have a process that checks an IMAP account for incoming email. It strips the mail of attachments and, based on their Content-Type, I do certain things. The problem is that an email sent from a client using HTML is considered an HTML attachment, which is fine, except when the body of the email is empty. Because the content is "text/html" it will be processed as HTML. Examining an empty email in HTML shows that there is underlying HTML, but the actual appearance to the user (via their client) is a blank page. (For example, Outlook embeds some crappy MS OWA styling, all within a correctly formed HTML page, e.g. <head>, <body> etc.) I would like to ignore blank pages.

So, does anyone have any ideas/tricks for how I might determine if an HTML page is empty!? I was hoping I could write out the page to file and parse it for "<body>" tags but, as I have said, MS (and maybe other apps) will have these tags in blank email bodies too!

CreepingDeath · 12-06-2013 9:45am

Are there any fields in the e-mail header you can query, eg. message size?

Alternatively, some form of XSLT transformation to strip out all formatting/layout tags to leave the content ?

You don't say what programming language/libraries you're using, so different answers may apply. Needs more info.

azzeretti · 12-06-2013 11:28am

CreepingDeath wrote: »

Are there any fields in the e-mail header you can query, eg. message size?

Alternatively, some form of XSLT transformation to strip out all formatting/layout tags to leave the content ?

You don't say what programming language/libraries you're using, so different answers may apply. Needs more info.

I can query any part of the header, no problem. It wouldn't help, though: if the message has, say, 3 attachments, then the empty HTML body is treated as a 4th attachment and processed like any other. (And if there are no attachments, I still need to parse the message body!)

The problem with XSLT transformation is firstly, the overhead!, and secondly the unknown factor of the formatting; i.e different mail clients will have different formatting etc.

I am using Perl by the way.

stevenmu · 12-06-2013 11:39am

I was actually thinking of something similar recently, but luckily I didn't need to do it in the end. HTML is a pain to parse properly.

One simple thing you could do is use a simple regex to strip out any and all HTML tags, and then see what you are left with. The obvious problem with that is that an email might just be an image, or possibly even an iframe (I think that's possible anyway, not sure). The other issue is that a simple regex would just strip out anything between angle brackets, but you could also have genuine non-HTML content between angle brackets that would get caught up too.
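To make the image/iframe caveat concrete, here is a rough sketch using a tolerant parser rather than a regex. It is in Python for illustration (the original poster is on Perl, so treat it as a description of the idea, not a drop-in). The names `VisibleContentChecker`, `looks_blank`, `CONTENT_TAGS` and `SKIP_TAGS` are made up for this example, and the choice of which tags count as content is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these tags count as visible content even when there is no text.
CONTENT_TAGS = {"img", "iframe"}
# Assumption: text inside these elements is never shown in the message body.
SKIP_TAGS = {"style", "script", "title"}


class VisibleContentChecker(HTMLParser):
    """Collects body text and notes whether a content-bearing tag was seen."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as U+00A0
        self.text_parts = []
        self.has_content_tag = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.has_content_tag = True
        elif tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def looks_blank(html_text):
    """True if the markup has no visible text and no image/iframe."""
    checker = VisibleContentChecker()
    checker.feed(html_text)
    checker.close()  # flush any buffered trailing text
    text = "".join(checker.text_parts).strip()
    return not text and not checker.has_content_tag
```

With this, a body containing only an `<img>` is not reported as blank, while a body containing only styling and a non-breaking space is.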

CreepingDeath · 12-06-2013 12:47pm

azzeretti wrote: »

The problem with XSLT transformation is firstly, the overhead!, and secondly the unknown factor of the formatting; i.e different mail clients will have different formatting etc.

People worry too much about "overhead".
You said you could detect the HTML messages by the mime type ie. "text/html".
So you only need to process those.
If it adds 1 second on to your processing time, that's still better than displaying blank pages to the user, and you can process the e-mail folder with multiple threads if performance is an issue.

Do you have an example of one of these empty HTML pages to post?
Is the formatting embedded in the body tag?

I don't see why you couldn't knock up an XSLT just to output the text content of the body tag.

If you don't like XSLT, then maybe some regular expressions eg.

<body>(.*)</body>

As Stephen mentioned above...

azzeretti · 12-06-2013 2:31pm

CreepingDeath wrote: »

People worry too much about "overhead".
You said you could detect the HTML messages by the mime type ie. "text/html".

I don't understand this. An email could have 10 attachments with multiple content types. There could be 4 actual ("interesting") HTML pages and 2 others (say, the email body, if the message was sent using HTML, and maybe an inserted signature or image!), each with the header signifying HTML.

Do you have an example of one of these empty HTML pages to post?
Is the formatting embedded in the body tag?

I don't see why you couldn't knock up an XSLT just to output the text content of the body tag.

If you don't like XSLT, then maybe some regular expressions eg.

<body>(.*)</body>

As Stephen mentioned above...

But there will be lots of text between the body tags, mostly for styling, and this changes for almost every email client I have sent from - that's my problem. There is no method of parsing the body of the email to determine if the HTML included is "real", required HTML, or just some rubbish styled to display a blank page! An example from one email client (for a completely blank page) is:

<body lang=EN-IE link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><o:p>&nbsp;</o:p></p></div></body>

If I used a regular expression to dump the content of the body tag above I would, wrongly, have text displayed.

Graham · 12-06-2013 2:48pm

azzeretti wrote: »

If I used a regular expression to dump the content of the body tag above I would, wrongly, have text displayed.

Can you grab everything between the body tags, drop the html tags and see what's left?

stevenmu · 12-06-2013 5:22pm

If you're open to 3rd party components, there's stuff like SGML Reader, which gives a number of useful functions for parsing HTML, or HTML Agility Pack. I haven't used it myself, but it looks like it might be a little more useful for your needs.

CreepingDeath · 13-06-2013 10:47am

I was playing around with regular expressions (Eclipse has a regular expression plugin).

You could use a find and replace RegEx of

<[^>]*>(.*)</[^>]*>

And only keep the capturing group 1.

But keep repeating it until the string you get out of RegEx is the same as the string you put in, ie. no more replacements.

So a sample input of

<body lang=EN-IE link=blue vlink=purple>1<div class=WordSection1>2<p class=MsoNormal>3<o:p>&nbsp;</o:p></p></div></body>

becomes...

Iteration 1

1<div class=WordSection1>2<p class=MsoNormal>3<o:p>&nbsp;</o:p></p></div>

Iteration 2

12<p class=MsoNormal>3<o:p>&nbsp;</o:p></p>

Iteration 3

123<o:p>&nbsp;</o:p>

Iteration 4

123&nbsp;

Iteration 5

123&nbsp;

Note : No change between iteration 4 and 5 so stop replacements.
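The repeat-until-stable loop above could be sketched like this. It is written in Python purely for illustration (the thread's target language is Perl); `unwrap_tags` and `TAG_PAIR` are hypothetical names, and the DOTALL flag is an addition to the posted pattern so that bodies spanning several lines are handled.

```python
import re

# The pattern from the post: a tag, anything (greedy), then a closing tag.
# re.DOTALL is an addition so "." also matches newlines.
TAG_PAIR = re.compile(r"<[^>]*>(.*)</[^>]*>", re.DOTALL)


def unwrap_tags(text):
    """Keep only capturing group 1 until the string stops changing."""
    while True:
        # count=1 mirrors one find-and-replace per iteration
        stripped = TAG_PAIR.sub(lambda m: m.group(1), text, count=1)
        if stripped == text:
            return text  # no change since the last pass, so stop
        text = stripped


sample = ("<body lang=EN-IE link=blue vlink=purple>1<div class=WordSection1>2"
          "<p class=MsoNormal>3<o:p>&nbsp;</o:p></p></div></body>")
print(unwrap_tags(sample))  # prints 123&nbsp;
```

Note that the pattern needs a closing tag to match, so a lone `<br>` or `<img>` with no closing tag after it is left in place, and the `&nbsp;` entity still has to be decoded and trimmed before testing for emptiness.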

Fussgangerzone · 17-06-2013 5:22pm

I don't know Perl, but might this approach work?

Strip the html tags altogether
Decode the html entities to actual characters using HTML::Entities
Trim any whitespace

If the resulting piece of text has a length of zero, would this prove that the email was empty? I don't know anything about line breaks and so on in this context; it's just an idea.
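A minimal sketch of those three steps, in Python for illustration only: `html.unescape` stands in for the Perl HTML::Entities module named above, and `is_blank_html` is a made-up name. It assumes the input is just the body markup; anything inside `<style>` or `<script>` elements would survive tag stripping as text and would need removing first.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")


def is_blank_html(markup):
    """Strip tags, decode entities, trim whitespace, then test the length."""
    without_tags = TAG.sub("", markup)        # 1. strip the tags altogether
    decoded = html.unescape(without_tags)     # 2. &nbsp; becomes U+00A0
    return len(decoded.strip()) == 0          # 3. trim; U+00A0 counts as space


outlook_blank = ("<body lang=EN-IE link=blue vlink=purple>"
                 "<div class=WordSection1><p class=MsoNormal>"
                 "<o:p>&nbsp;</o:p></p></div></body>")
print(is_blank_html(outlook_blank))           # prints True
print(is_blank_html("<body><p>Hi</p></body>"))  # prints False
```

Decoding before trimming matters here: the Outlook example is only "empty" once `&nbsp;` has become a real non-breaking space that the trim step can remove. In Python, `str.strip()` treats U+00A0 as whitespace; whether the equivalent trim in another language does is worth checking.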
