I modified your regex pattern somewhat and added names to the groups. With names, the structure of the pattern is easier to read, and you can retrieve the matched substrings as a dictionary keyed by group name with the
MatchObject.groupdict() method. The rest is almost the same code as the non-regex solution.

import re

# Three named groups separated by single spaces: first name, last name,
# and the rest of the line as the book title.
rgx = re.compile(r'%s %s %s' % (r'(?P<first_name>(?P<first_initial>[A-Z])?\w+)',
                                r'(?P<last_name>(?P<last_initial>[A-Z])?\w+)',
                                r'(?P<book_title>.+)'))

# fn: path to the input file, assumed to be defined earlier
f = open(fn)
output = []
for line in f:
    m = rgx.search(line)
    dd = m.groupdict()
    # Author key: first initial + last name; title key: words joined by '_'
    output.append(['%s%s' % (dd['first_initial'].lower(), dd['last_name'].lower()),
                   '_'.join([word.lower() for word in dd['book_title'].split()])])
f.close()

for item in output:
    print(item)

>>> ['psmith', 'the_lobster_story']
['cbower', 'in_the_closet']
['tmartin', 'how_to_paint_your_furniture']
>>>
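
One caveat with the pattern above: first_initial is an optional group, so groupdict() returns None for it when a name does not start with an uppercase letter, and search() returns None when a line does not match at all. Here is a minimal sketch of how both cases could be guarded. The convert() helper and the sample lines are my own hypothetical additions for illustration, not part of the original data.

```python
import re

# Same pattern as above, written as one raw string.
RGX = re.compile(
    r'(?P<first_name>(?P<first_initial>[A-Z])?\w+) '
    r'(?P<last_name>(?P<last_initial>[A-Z])?\w+) '
    r'(?P<book_title>.+)'
)

def convert(line):
    """Return [author_key, title_key] for one line, or None if it does not match."""
    m = RGX.search(line)
    if m is None:
        return None
    dd = m.groupdict()
    # first_initial is None when the optional group did not take part in
    # the match, so fall back to the first character of first_name.
    initial = dd['first_initial'] or dd['first_name'][0]
    author = (initial + dd['last_name']).lower()
    title = '_'.join(word.lower() for word in dd['book_title'].split())
    return [author, title]

# Hypothetical input lines, made up for this example.
sample_lines = [
    'Paul Smith The Lobster Story',
    'paul smith The Lobster Story',
    'no-match',
]
for line in sample_lines:
    print(convert(line))
```

The first two sample lines both give ['psmith', 'the_lobster_story'], and the last one gives None instead of raising an AttributeError.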