
When is a webpage not a webpage!?

  • 11-06-2013 11:29am
    #1
    Registered Users Posts: 1,477 ✭✭✭


    This is a "lateral thinking" kind of question that I am having a mental block about.
    I have a process that checks an IMAP account for incoming email. It strips the attachments from each mail and, based on their Content-Type, I do certain things. The problem is that an email sent from a client using HTML is considered an HTML attachment, which is fine, except when the body of the email is empty. Because the content is "text/html" it will be processed as HTML. Examining an empty email in HTML shows that there is underlying HTML, but what the user actually sees (via their client) is a blank page. (For example, Outlook embeds some crappy MS OWA styling, all within a correctly formed HTML page, e.g. <head>, <body> etc.) I would like to ignore blank pages.

    So, does anyone have any ideas/tricks for how I might determine whether an HTML page is empty!? I was hoping I could write the page out to a file and parse it for "<body>" tags but, as I have said, MS (and maybe other apps) will have these tags in blank email bodies too!


Comments

  • Closed Accounts Posts: 8,016 ✭✭✭CreepingDeath


    Are there any fields in the e-mail header you can query, eg. message size?

    Alternatively, some form of XSLT transformation to strip out all formatting/layout tags to leave the content ?

    You don't say what programming language/libraries you're using, so different answers may apply. Needs more info.


  • Registered Users Posts: 1,477 ✭✭✭azzeretti


    CreepingDeath wrote: »
    Are there any fields in the e-mail header you can query, eg. message size?

    Alternatively, some form of XSLT transformation to strip out all formatting/layout tags to leave the content ?

    You don't say what programming language/libraries you're using, so different answers may apply. Needs more info.

    I can query any part of the header, no problem. It wouldn't help, though: if the message has, say, 3 attachments, then the empty HTML body is treated as a 4th attachment and processed like any other. (And even if there are no attachments, I still need to parse the message body!)

    The problem with XSLT transformation is firstly the overhead, and secondly the unknown factor of the formatting, i.e. different mail clients will have different formatting etc.

    I am using Perl by the way.


  • Moderators, Society & Culture Moderators Posts: 9,689 Mod ✭✭✭✭stevenmu


    I was actually thinking of something similar recently, but luckily I didn't need to do it in the end. HTML is a pain to parse properly.

    One simple thing you could do is use a regex to strip out any and all HTML tags, and then see what you are left with. The obvious problem with that is that an email might just be an image, or possibly even an iframe (I think that's possible anyway, not sure). The other issue is that a simple regex would just strip out anything between angle brackets, so genuine non-HTML content between angle brackets would get caught up too.
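
    A rough sketch of the "strip the tags and see what is left" idea, written in Python purely for illustration. It swaps the regex for the standard library's `html.parser`, and it assumes that `img` and `iframe` should count as content even when there is no text. The names `VisibleContentParser`, `looks_blank`, `VISIBLE_TAGS` and `HIDDEN_TAGS` are made up for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags count as visible content even with no text.
VISIBLE_TAGS = {"img", "iframe"}
# Assumption: text inside these tags is never shown to the reader.
HIDDEN_TAGS = {"style", "script", "title"}


class VisibleContentParser(HTMLParser):
    """Collects text a reader could see and notes image-like tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us.
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.has_visible_tag = False
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VISIBLE_TAGS:
            self.has_visible_tag = True
        elif tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a hidden element.
        if self._hidden_depth == 0:
            self.text_parts.append(data)


def looks_blank(html_source):
    """Return True if no visible text and no image-like tag was found."""
    parser = VisibleContentParser()
    parser.feed(html_source)
    parser.close()  # flush any buffered text
    text = "".join(parser.text_parts).strip()
    return not text and not parser.has_visible_tag
```

    Using a parser rather than a bare regex is a design choice here: it avoids treating stylesheet text as content, at the cost of a little more code. Whether a lone image should count as "not blank" is a judgment call for the application.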


  • Closed Accounts Posts: 8,016 ✭✭✭CreepingDeath


    azzeretti wrote: »
    The problem with XSLT transformation is firstly, the overhead!, and secondly the unknown factor of the formatting; i.e different mail clients will have different formatting etc.

    People worry too much about "overhead".
    You said you could detect the HTML messages by the MIME type, i.e. "text/html".
    So you only need to process those.
    If it adds 1 second on to your processing time, that's still better than displaying blank pages to the user, and you can process the e-mail folder with multiple threads if performance is an issue.


    Do you have an example of one of these empty HTML pages to post?
    Is the formatting embedded in the body tag?

    I don't see why you couldn't knock up an XSLT just to output the text content of the body tag.

    If you don't like XSLT, then maybe some regular expressions eg.

    <body>(.*)</body>

    As Stephen mentioned above...


  • Registered Users Posts: 1,477 ✭✭✭azzeretti


    CreepingDeath wrote: »
    People worry too much about "overhead".
    You said you could detect the HTML messages by the MIME type, i.e. "text/html".

    I don't understand this. An email could have 10 attachments with multiple content types. There could be 4 actual ("interesting") HTML pages and 2 others (the email body, say, if the message was sent using HTML, and maybe an inserted signature or image, each with the header signifying HTML).

    CreepingDeath wrote: »
    Do you have an example of one of these empty HTML pages to post?
    Is the formatting embedded in the body tag?

    I don't see why you couldn't knock up an XSLT just to output the text content of the body tag.

    If you don't like XSLT, then maybe some regular expressions eg.

    <body>(.*)</body>

    As Stephen mentioned above...

    But there will be lots of text between the body tags, mostly for styling, and this changes for almost every email client I have sent from. That's my problem. There is no method of parsing the body of the email to determine whether the HTML included is "real", required HTML or just some rubbish styled to display a blank page! An example from one email client (for a completely blank page) is:
    <body lang=EN-IE link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><o:p>&nbsp;</o:p></p></div></body>
    

    If I used a regular expression to dump the content of the body tag above I would, wrongly, have text displayed.


  • Moderators, Society & Culture Moderators Posts: 17,642 Mod ✭✭✭✭Graham


    azzeretti wrote: »
    If I used a regular expression to dump the content of the body tag above I would, wrongly, have text displayed.

    Can you grab everything between the body tags, drop the html tags and see what's left?


  • Moderators, Society & Culture Moderators Posts: 9,689 Mod ✭✭✭✭stevenmu


    If you're open to 3rd party components, there's stuff like SGML Reader, which gives a number of useful functions for parsing HTML, or HTML Agility Pack. I haven't used it myself, but it looks like it might be a little more useful for your needs.


  • Closed Accounts Posts: 8,016 ✭✭✭CreepingDeath


    I was playing around with Regular Expressions ( Eclipse has a regular expression plugin ).

    You could use a find and replace RegEx of
    <[^>]*>(.*)</[^>]*>
    
    and keep only capturing group 1.

    Keep repeating it until the string you get out of the regex is the same as the string you put in, i.e. no more replacements.

    So a sample input of
    <body lang=EN-IE link=blue vlink=purple>1<div class=WordSection1>2<p class=MsoNormal>3<o:p>&nbsp;</o:p></p></div></body>
    

    becomes...

    Iteration 1
    1<div class=WordSection1>2<p class=MsoNormal>3<o:p>&nbsp;</o:p></p></div>
    

    Iteration 2
    12<p class=MsoNormal>3<o:p>&nbsp;</o:p></p>
    

    Iteration 3
    123<o:p>&nbsp;</o:p>
    

    Iteration 4
    123&nbsp;
    

    Iteration 5
    123&nbsp;
    

    Note : No change between iteration 4 and 5 so stop replacements.
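
    The repeat-until-stable loop above could be sketched like this, in Python purely for illustration. The pattern is the one from the post; the function name `unwrap_tags` and the `max_rounds` safety limit are assumptions added for the example.

```python
import re

# The pattern from the post: an opening tag, a greedy capture, a closing tag.
# DOTALL lets the capture span line breaks in a multi-line body.
TAG_PAIR = re.compile(r"<[^>]*>(.*)</[^>]*>", re.DOTALL)


def unwrap_tags(html_source, max_rounds=1000):
    """Replace each tag pair with its contents until nothing changes."""
    current = html_source
    for _ in range(max_rounds):
        # Keep only capturing group 1 of each match.
        unwrapped = TAG_PAIR.sub(lambda m: m.group(1), current)
        if unwrapped == current:
            break  # no more replacements, so stop
        current = unwrapped
    return current


sample = (
    "<body lang=EN-IE link=blue vlink=purple>1<div class=WordSection1>2"
    "<p class=MsoNormal>3<o:p>&nbsp;</o:p></p></div></body>"
)
print(unwrap_tags(sample))  # prints 123&nbsp;
```

    Note that the entity `&nbsp;` is left in place, so a blank Outlook body would still come out as `&nbsp;` rather than an empty string, and tags with no closing partner (a lone `<br>`, for instance) are not removed by this pattern.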


  • Registered Users Posts: 371 ✭✭Fussgangerzone


    I don't know Perl, but might this approach work?

    1. Strip the html tags altogether
    2. Decode the html entities to actual characters using HTML::Entities
    3. Trim any whitespace
    If the resulting piece of text has a length of zero, would this prove that the email was empty? I don't know anything about line breaks and so on in this context; it's just an idea.
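
    The three steps above can be sketched as follows, in Python purely for illustration. Here `html.unescape` plays the role that `decode_entities` from HTML::Entities would play in Perl; the function name `is_blank_html` and the simple tag regex are assumptions for the example.

```python
import html
import re

# Naive tag matcher: anything between angle brackets (an assumption;
# it will also eat genuine non-HTML text written between < and >).
TAG = re.compile(r"<[^>]*>")


def is_blank_html(html_source):
    """Strip tags, decode entities, trim whitespace, then test for empty."""
    without_tags = TAG.sub("", html_source)   # step 1: strip the tags
    decoded = html.unescape(without_tags)     # step 2: &nbsp; -> U+00A0 etc.
    trimmed = decoded.strip()                 # step 3: trim whitespace
    return trimmed == ""
```

    On the line-break question: Python's `str.strip()` treats carriage returns, newlines and the non-breaking space produced by `&nbsp;` as whitespace, so the Outlook example from earlier in the thread comes out as empty. In Perl the trimming pattern would need to cover the non-breaking space explicitly.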

