Python iteration help please.

Dermot Illogical · 19-08-2014 02:35PM #1

Hi all,

I'm trying to divide a large text file which contains what is basically a dump of an email account. I want to extract individual emails and write them to individual files.
Each mail starts with "From " and ends with "--" followed by three line feeds (--LFLFLF).
I've written the following script but it only extracts the first mail from the file and then stops.
Can someone help identify where I'm going wrong?
I'm using IDLE 2.7 on Win7

import re

target = raw_input("Filename please: ")

def SplitEmails(infile):
    with open(infile, 'r') as f:
        count = 0
        for result in re.findall('(^From\s.*?--\n\n\n)', f.read(), re.S):
            try:
                count += 1
                result = result.rstrip()
                fn = 'email' + str(count) + '.txt'
                with open(fn, 'w') as f:
                    f.write(result)
                    f.close()
            except:
                continue

SplitEmails(target)

srsly78 · 19-08-2014 03:08PM

Probably because there are some carriage returns in there as well, not just LFLFLF. It's platform dependent: Windows and Unix treat newlines differently.

Breakpoint the code and see exactly what symbols it encounters.
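
One quick way to see the symbols without a debugger is to print the repr() of a chunk of the text and count the line-ending sequences. This is only a rough sketch: the sample string and the count_line_endings helper are made up for illustration and are not from the original script. For the real file under Python 2.7, reading it with open(infile, 'rb') should show the raw bytes, since text mode on Windows normally translates CRLF to LF on read.

```python
# Hypothetical sample standing in for a chunk of the mail dump.
sample = "From alice@example.com\r\nSubject: hi\r\n\r\nbody\r\n--\n\n\n"


def count_line_endings(text):
    """Count CRLF pairs, bare LFs and bare CRs in a string (illustrative helper)."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf   # LFs that are not part of a CRLF pair
    cr = text.count("\r") - crlf   # CRs that are not part of a CRLF pair
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# repr() makes \r and \n visible instead of rendering them as line breaks.
print(repr(sample[:40]))
print(count_line_endings(sample))
```

If the CRLF count is non-zero, a pattern that expects exactly \n\n\n after the "--" may not match where those pairs occur.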

Dermot Illogical · 19-08-2014 03:21PM

srsly78 wrote: »

Probably because there are some carriage returns in there as well, not just LFLFLF. It's platform dependent: Windows and Unix treat newlines differently.

Breakpoint the code and see exactly what symbols it encounters.

Thanks.
I have it open in Notepad++ and set to display both LF and CR, so I'm reasonably certain there aren't any CRs screwing it up.
It matches the 1st email and writes it out perfectly. If I remove that email from the original file it will match the next one only, and so on.
The regex is getting exactly what I want, but only once.

I'll try breakpoint, but will need to google it 1st as I'm basically winging it here.

Dermot Illogical · 19-08-2014 04:25PM

It's always something small, isn't it?
Adding re.M has fixed it, although I'm sure there are a million better ways to do it.

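import re

target = raw_input("Filename please: ")

def SplitEmails(infile):
    # Read the whole dump, then write each "From ... --LFLFLF" block to its own file.
    with open(infile, 'r') as f:
        data = f.read()
    count = 0
    # re.S lets . span newlines; re.M lets ^ match at the start of every line.
    for result in re.findall(r'(^From\s.*?--\n\n\n)', data, re.S | re.M):
        count += 1
        fn = 'email' + str(count) + '.txt'
        try:
            # A separate name avoids shadowing the input handle; "with" closes the file.
            with open(fn, 'w') as out:
                out.write(result.rstrip())
        except IOError:
            # Skip any mail that cannot be written.
            continue

SplitEmails(target)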
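
For anyone wondering why re.M makes the difference: without it, ^ only matches at the very start of the string, so a pattern anchored with ^ can match at most once no matter how many mails follow. With re.M, ^ also matches right after each newline. Here is a small sketch with a made-up two-mail dump (the addresses and contents are hypothetical, only the pattern is the one from the script):

```python
import re

# Hypothetical two-message dump using the "From " ... "--\n\n\n" layout from the thread.
dump = (
    "From alice@example.com Mon Aug 18 10:00:00 2014\n"
    "Subject: first\n\nHello\n--\n\n\n"
    "From bob@example.com Tue Aug 19 11:00:00 2014\n"
    "Subject: second\n\nHi again\n--\n\n\n"
)

pattern = r'(^From\s.*?--\n\n\n)'

# Without re.M, ^ matches only at position 0 of the whole string.
without_m = re.findall(pattern, dump, re.S)

# With re.M, ^ also matches immediately after every newline.
with_m = re.findall(pattern, dump, re.S | re.M)

print(len(without_m))  # 1
print(len(with_m))     # 2
```

This also lines up with the earlier observation that deleting the first mail from the file made the next one match: the next "From " line then sat at the start of the string.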
