Parsing text

I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

Thanks,
Victor

Dec 19 '05 #1
sicvic wrote:
I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.


That's a good start. Maybe you could post the code that you've already
got that does this, and people could comment on it and help you along.
(I'm suggesting that partly because this almost sounds like homework,
but you'll benefit more by doing it this way than just by having an
answer handed to you whether this is homework or not.)

-Peter

Dec 20 '05 #2
sicvic wrote:
I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"
...
Thanks,
Victor


You did not specify the "key phrase" that you are looking for, so for
the sake of this example I will assume that it is "key phrase".
I assume that you don't want "key phrase" or "---------------------" to
be returned as part of your match, so we use a non-greedy (minimal)
capturing group, (.*?). You also want your regular expression to use the
re.DOTALL flag, because that is what lets "." match newlines and so lets
the match span multiple lines. The simplest way to set this flag is to
put it at the front of your regular expression using the (?s) notation.

This gives you something like this:
print re.findall("(?s)key phrase(.*?)---------------------",
                 your_string_to_search)[0]

So what that basically says is:
1. Match across lines: (?s) lets "." match newline characters
2. Match "key phrase"
3. Capture the group matching everything, non-greedily: (.*?)
4. Match "---------------------"
5. Print the first match in the list [0]
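
For readers on Python 3, a minimal sketch of the same regular-expression idea follows. The sample `text`, the helper name `extract_after`, and the 21-dash default separator are assumptions made for illustration, not part of the post above. The flag is passed as `re.DOTALL` instead of the inline `(?s)`, which is equivalent.

```python
import re

# Sample text modeled on the data shown later in the thread (an assumption).
text = """aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
----------------------------------------------
"""

def extract_after(key_phrase, text, separator="-" * 21):
    """Return the text between each key_phrase and the next separator run."""
    # re.escape keeps both strings literal; (.*?) is non-greedy, so each
    # match stops at the first separator that follows the key phrase.
    pattern = re.escape(key_phrase) + r"(.*?)" + re.escape(separator)
    return re.findall(pattern, text, flags=re.DOTALL)

matches = extract_after("Person: Jimmy", text)
print(matches[0])
```

Note that `findall` returns an empty list when the key phrase never occurs, so indexing with `[0]` is only safe after checking that something matched.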

Yours,
Noah

Dec 20 '05 #3
On 19 Dec 2005 15:15:10 -0800, "sicvic" <mo************@gmail.com> wrote:
I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

This sounds like homework, so just a (big) hint: have a look at itertools
dropwhile and takewhile. The solution is potentially a one-liner, depending
on your matching criteria (e.g., case-sensitive fixed string vs regular expression).
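
One way to read this hint, as a sketch in Python 3 and not necessarily the solution Bengt had in mind: `dropwhile` skips lines until the key phrase appears, and `takewhile` keeps lines until the separator. The helper name `first_block`, the `sample` list, and the fixed-string, case-sensitive matching are assumptions.

```python
from itertools import dropwhile, takewhile

def first_block(lines, key_phrase, separator="-----"):
    """Return lines from the first one containing key_phrase up to the separator."""
    # dropwhile discards everything before the first line with the key phrase.
    from_key = dropwhile(lambda line: key_phrase not in line, lines)
    # takewhile then keeps lines until one starts with the separator.
    return list(takewhile(lambda line: not line.startswith(separator), from_key))

sample = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "----------------------------------------------\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "Next Location: Miami\n",
    "Next Location: New York\n",
    "----------------------------------------------\n",
]

print("".join(first_block(sample, "Person: Sarah")), end="")
```

This only returns the first matching block; collecting every recurrence of a person needs a loop around it or one of the line-by-line approaches in the later replies.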

Regards,
Bengt Richter
Dec 20 '05 #4
Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.

Since I can't show the actual output file, let's say I had an output file
that looked like this:

aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------

Now I want to put (and all recurrences of "Person: Jimmy")

Person: Jimmy
Current Location: Denver
Next Location: Chicago

in a file called jimmy.txt

and the same for Sarah in sarah.txt

The code I currently have looks something like this:

import re
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt

f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

#closes all files

person_jimmy.close()
person_sarah.close()
f.close()

However, this only produces output files that look like this:

jimmy.txt:

aaaaa bbbbb Person: Jimmy

sarah.txt:

aaaaa bbbbb Person: Sarah

My question is what else do I need to add (such as an embedded loop
where the if statements are?) so the files look like this

aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago

and

aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York

Basically I need to add statements that, after finding that line, copy
all the lines following it, stopping when it sees
'----------------------------------------------'

Any help is greatly appreciated.

Dec 20 '05 #5
"sicvic" <mo************@gmail.com> wrote in
news:11**********************@f14g2000cwb.googlegroups.com:
Basically I need to add statements that after finding that line
copy all the lines following it and stopping when it sees
'----------------------------------------------'

Any help is greatly appreciated.


Something like this, maybe?

"""
This iterates through a file, with subloops to handle the
special cases. I'm assuming that Jimmy and Sarah are not the
only people of interest. I'm also assuming (for no very good
reason) that you do want the separator lines, but do not want
the "Person:" lines in the output file. It is easy enough to
adjust those assumptions to taste.

Each "Person:" line will cause a file to be opened (if it is
not already open) and will write the subsequent lines to it
until the separator is found. Be aware that all files remain
open until the loop at the end closes them all.
"""

outfs = {}
f = open('shouldBeDatabase.txt')
for line in f:
    if line.find('Person:') >= 0:
        # the name after 'Person:' (7 characters) becomes the file name
        ofkey = line[line.find('Person:')+7:].strip()
        if ofkey not in outfs:
            outfs[ofkey] = open('%s.txt' % ofkey, 'w')
        outf = outfs[ofkey]
        # copy the following lines, up to and including the separator
        while line.find('-----------------------------') < 0:
            line = f.next()
            outf.write('%s' % line)
f.close()
for k,v in outfs.items():
    v.close()
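
A Python 3 variant of this line-by-line approach can be sketched with a single "current output file" variable in place of the inner `while` loop. This is an illustration under different assumptions from the code above: it keeps the "Person:" line and drops the separator (as the original question asked), lowercases the file name, and opens files in append mode so repeated blocks for one person accumulate. The names `split_by_person`, `PERSON`, and `sample` are hypothetical.

```python
import os
import re
import tempfile

PERSON = re.compile(r"Person:\s+(\w+)")  # assumed header format

def split_by_person(lines, outdir, separator="-----"):
    """Append each person's block to <outdir>/<name>.txt and return the paths."""
    paths = []
    out = None  # file for the block being copied, or None between blocks
    try:
        for line in lines:
            match = PERSON.search(line)
            if match:
                if out:
                    out.close()
                path = os.path.join(outdir, match.group(1).lower() + ".txt")
                if path not in paths:
                    paths.append(path)
                # Append mode, so a person who appears twice keeps both blocks.
                out = open(path, "a")
            elif line.startswith(separator):
                # End of the current block: close the file, skip the separator.
                if out:
                    out.close()
                    out = None
                continue
            if out:
                out.write(line)
    finally:
        if out:
            out.close()
    return paths

sample = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "----------------------------------------------\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "Next Location: Miami\n",
    "Next Location: New York\n",
    "----------------------------------------------\n",
]

with tempfile.TemporaryDirectory() as outdir:
    for path in split_by_person(sample, outdir):
        with open(path) as fh:
            print(os.path.basename(path))
            print(fh.read())
```

Because the files are opened in append mode, running this twice against the same directory would duplicate the blocks; the temporary directory here avoids that.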

--
rzed
Dec 20 '05 #6
sicvic wrote:
Since I cant show the actual output file lets say I had an output file
that looked like this:

aaaaa bbbbb Person: Jimmy
Current Location: Denver


It may be the output of another process but it's the input file as far
as the parsing code is concerned.

The code below gives the following output, if that's any help (just
adapting Noah's idea above). Note that it deals with the input as a
single string rather than line by line.

Jimmy
Jimmy.txt

Current Location: Denver
Next Location: Chicago

Sarah
Sarah.txt

Current Location: San Diego
Next Location: Miami
Next Location: New York


data='''
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
'''

import StringIO
import re
src = StringIO.StringIO(data)

for name in ['Jimmy', 'Sarah']:
    exp = "(?s)Person: %s(.*?)--" % name
    filename = "%s.txt" % name
    info = re.findall(exp, src.getvalue())[0]
    print name
    print filename
    print info

hth

Gerard

Dec 20 '05 #7
sicvic wrote:
Not homework...not even in school (do any universities even teach
classes using python?).

Yup, at least 6, and 20 wouldn't surprise me.

The code I currently have looks something like this:
...
f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

Using re here seems pretty excessive.
How about:
...
f = open(sys.argv[1]) # opens input file ### get comments right
source = iter(f) # files serve lines at their own pace. Let them
for line in source:
    if line.endswith('Person: Jimmy\n'):
        dest = person_jimmy
    elif line.endswith('Person: Sarah\n'):
        dest = person_sarah
    else:
        continue
    # copy lines until the separator (matched by prefix, whatever its length)
    while not line.startswith('---------------'):
        dest.write(line)
        line = source.next()
f.close()
person_jimmy.close()
person_sarah.close()
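
The same shared-iterator loop can be sketched in Python 3, where `source.next()` becomes the built-in `next(source)`. In this hedged version, `next(source, None)` is used so that input ending without a separator stops the copy instead of raising `StopIteration`. The function name `copy_blocks`, the `destinations` mapping, and the `io.StringIO` stand-ins for real files are assumptions for illustration.

```python
import io

def copy_blocks(lines, destinations, separator="-----"):
    """Copy each known person's block into destinations[name] (a writable object)."""
    source = iter(lines)
    for line in source:
        # Text after the last "Person: " on the line, or "" when it is absent.
        name = line.rpartition("Person: ")[2].strip() if "Person: " in line else ""
        dest = destinations.get(name)
        if dest is None:
            continue
        # next(source, None) returns None at end of input instead of raising
        # StopIteration, so a missing final separator does not crash the loop.
        while line is not None and not line.startswith(separator):
            dest.write(line)
            line = next(source, None)

lines = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "----------------------------------------------\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",  # no closing separator, on purpose
]
destinations = {"Jimmy": io.StringIO(), "Sarah": io.StringIO()}
copy_blocks(lines, destinations)
print(destinations["Jimmy"].getvalue(), end="")
```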

--Scott David Daniels
sc***********@acm.org
Dec 20 '05 #8
Thank you everyone!!!

I got a lot more information than I expected. You guys got my brain
thinking in the right direction and starting to like programming.
You've got a great community here. Keep it up.

Thanks,
Victor

Dec 20 '05 #9
On 20 Dec 2005 08:06:39 -0800, "sicvic" <mo************@gmail.com> wrote:
Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.

Ok, not homework.

Ok, I generalized on your theme of extracting file chunks to named files,
where the beginning line has the file name. I hardcoded the '.txt' extension
and provided a way to direct the output to a (I guess not necessarily sub) directory.
Not tested beyond what you see. Tweak to suit.

----< extractfilesegs.py >--------------------------------------------------------
"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
    where source is -tf for test file, a file name, or an open file
          outdir is a directory prefix that will be joined to output file names
          startpat is a regular expression with group 1 giving the extracted file name
          endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
        if fout:
            fout.close()
            print 'file %r ended with source file EOF, not stop mark'%filename
    return files

def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
....irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print ' "%s"'% fname
    if tf == '-tf':
        for fpath in files:
            print '====< %s >====\n%s============'%(fpath, open(fpath).read())
----------------------------------------------------------------------------------

Running on your test data:

[15:19] C:\pywk\clp>md extracteddata

[15:19] C:\pywk\clp>py24 extractfilesegs.py -tf

Files created:
"extracteddata\jimmy.txt"
"extracteddata\sarah.txt"
====< extracteddata\jimmy.txt >====
Person: Jimmy
Current Location: Denver
Next Location: Chicago
============
====< extracteddata\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:20] C:\pywk\clp>md xd

[15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----

Files created:
"xd\jimmy.txt"
====< xd\jimmy.txt >====
Jimmy
Current Location: Denver
Next Location: Chicago
============

[15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----

Files created:
"xd\sarah.txt"
====< xd\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"

Files created:
"xd\irrelevant.txt"
====< xd\irrelevant.txt >====
irrelevant
trailing stuff ...
============
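
The nested-loop core of the script can also be written as a generator in Python 3. This is a sketch, not a port of the whole script: `iter_segments` is a hypothetical name, it yields the lines of each segment instead of writing files, and it reuses the script's default start and stop patterns.

```python
import re

def iter_segments(lines, start=r"Person:\s+(\w+)", stop="-" * 30):
    """Yield (name, segment_lines) for each segment between start and stop."""
    rx_start = re.compile(start)
    rx_stop = re.compile(stop)
    line_iter = iter(lines)  # shared by both loops, so no line is read twice
    for line in line_iter:
        match = rx_start.search(line)
        if not match:
            continue
        # Keep only the matched text of the start line, as the script does.
        segment = [match.group(0) + "\n"]
        for data_line in line_iter:
            if rx_stop.search(data_line):
                break  # the stop line itself is not included
            segment.append(data_line)
        yield match.group(1), segment

test_lines = """\
....irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...
""".splitlines(keepends=True)

for name, segment in iter_segments(test_lines):
    print("====< %s >====" % name.lower())
    print("".join(segment), end="")
```

Yielding the segments leaves the choice of destination (files, a dict, a database) to the caller, which may be easier to test than writing files inside the loop.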

HTH, NO WARRANTIES ;-)
Regards,
Bengt Richter
Dec 20 '05 #10
