Parsing text - Python

sicvic

I was wondering if there's a way for Python to read through the lines
of a text file searching for a key phrase, then write that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

Thanks,
Victor

Dec 19 '05 #1


Peter Hansen

sicvic wrote:

I was wondering if theres a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

That's a good start. Maybe you could post the code that you've already
got that does this, and people could comment on it and help you along.
(I'm suggesting that partly because this almost sounds like homework,
but you'll benefit more by doing it this way than just by having an
answer handed to you whether this is homework or not.)

-Peter

Dec 20 '05 #2

Noah

sicvic wrote:

I was wondering if theres a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"
...
Thanks,
Victor

You did not specify the "key phrase" that you are looking for, so for
the sake of this example I will assume that it is "key phrase". I assume
that you don't want "key phrase" or "---------------------" to be
returned as part of your match, so we use minimal group matching (.*?).
You also want your regular expression to use the re.DOTALL flag, because
this is how you match across multiple lines. The simplest way to set
this flag is to put it at the front of your regular expression using the
(?s) notation.

This gives you something like this:

print re.findall("(?s)key phrase(.*?)---------------------",
                 your_string_to_search)[0]

So what that basically says is:
1. Match multiline -- that is, match across lines (?s)
2. Match "key phrase"
3. Capture the group matching everything, minimally (.*?)
4. Match "---------------------"
5. Print the first match in the list [0]

Yours,
Noah

Dec 20 '05 #3
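
For readers on Python 3, here is a minimal sketch of the regex approach described above, assuming the whole file fits in memory as one string. The sample text and the names `extract_block`, `SAMPLE` and `SEPARATOR` are made up for illustration; only the `(?s)` flag and the non-greedy `(.*?)` group come from the post.

```python
import re

SEPARATOR = "-" * 21  # hypothetical separator, same idea as in the question
SAMPLE = (
    "header line\n"
    "key phrase first wanted line\n"
    "second wanted line\n"
    + SEPARATOR + "\n"
    + "trailer line\n"
)

def extract_block(text, start="key phrase", stop=SEPARATOR):
    """Return the text between start and the first stop after it, or None."""
    # (?s) lets "." match newlines; (.*?) is non-greedy, so the match
    # ends at the first separator rather than the last one.
    pattern = "(?s)" + re.escape(start) + "(.*?)" + re.escape(stop)
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(extract_block(SAMPLE))
```

Using `re.search` and checking for `None` avoids the `IndexError` that indexing `findall(...)[0]` would raise when the key phrase is absent.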

Bengt Richter

On 19 Dec 2005 15:15:10 -0800, "sicvic" <mo************@gmail.com> wrote:

I was wondering if theres a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

This sounds like homework, so just a (big) hint: have a look at
itertools.dropwhile and itertools.takewhile. The solution is potentially
a one-liner, depending on your matching criteria (e.g., case-sensitive
fixed string vs. regular expression).

Regards,
Bengt Richter

Dec 20 '05 #4
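
One way the `dropwhile`/`takewhile` hint could play out in Python 3 is sketched below. It is only an illustration: the function name `first_block`, the sample lines and the choice of a plain substring test are assumptions, and it returns only the first matching block.

```python
from itertools import dropwhile, takewhile

LINES = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
]

def first_block(lines, key, separator="-----"):
    """Lines from the first one containing key up to, but not including, the separator."""
    # dropwhile skips everything before the key line ...
    from_key = dropwhile(lambda line: key not in line, lines)
    # ... and takewhile stops at the first separator line after it.
    return list(takewhile(lambda line: not line.startswith(separator), from_key))

print("".join(first_block(LINES, "Person: Jimmy")), end="")
```

If the key never appears the result is an empty list, and if no separator follows, the block simply runs to the end of the input.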

sicvic

Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.

Since I can't show the actual output file, let's say I had an output file
that looked like this:

aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------

Now I want to put (and all recurrences of "Person: Jimmy")

Person: Jimmy
Current Location: Denver
Next Location: Chicago

in a file called jimmy.txt

and the same for Sarah in sarah.txt

The code I currently have looks something like this:

import re
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt

f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

#closes all files

person_jimmy.close()
person_sarah.close()
f.close()

However, this only produces output files that look like this:

jimmy.txt:

aaaaa bbbbb Person: Jimmy

sarah.txt:

aaaaa bbbbb Person: Sarah

My question is what else do I need to add (such as an embedded loop
where the if statements are?) so the files look like this

aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago

and

aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York

Basically, I need to add statements that, after finding that line, copy
all the lines following it and stop when they see
'----------------------------------------------'

Any help is greatly appreciated.

Dec 20 '05 #5

rzed

"sicvic" <mo************@gmail.com> wrote in
news:11**********************@f14g2000cwb.googlegroups.com:

Basically I need to add statements that after finding that line
copy all the lines following it and stopping when it sees
'----------------------------------------------'

Any help is greatly appreciated.

Something like this, maybe?

"""
This iterates through a file, with subloops to handle the
special cases. I'm assuming that Jimmy and Sarah are not the
only people of interest. I'm also assuming (for no very good
reason) that you do want the separator lines, but do not want
the "Person:" lines in the output file. It is easy enough to
adjust those assumptions to taste.

Each "Person:" line will cause a file to be opened (if it is
not already open), and will write the subsequent lines to it
until the separator is found. Be aware that all files remain
open until the loop at the end closes them all.
"""

outfs = {}
f = open('shouldBeDatabase.txt')
for line in f:
    if line.find('Person:') >= 0:
        ofkey = line[line.find('Person:')+7:].strip()
        if not ofkey in outfs:
            outfs[ofkey] = open('%s.txt' % ofkey, 'w')
        outf = outfs[ofkey]
        while line.find('-----------------------------') < 0:
            line = f.next()
            outf.write('%s' % line)
f.close()
for k,v in outfs.items():
    v.close()

--
rzed

Dec 20 '05 #6
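
A Python 3 sketch of the same line-by-line idea, using a single flag instead of a nested loop and collecting lines per person before anything is written. The names `split_by_person`, `write_blocks` and `SAMPLE_LINES` are hypothetical, and unlike the post above this version keeps the "Person:" line and drops the separator, which is an assumption about what is wanted.

```python
from collections import defaultdict
from pathlib import Path

SAMPLE_LINES = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Chicago\n",
    "-" * 46 + "\n",
]

def split_by_person(lines, marker="Person:", separator="-----"):
    """Map each person's name to the lines of all of that person's blocks."""
    blocks = defaultdict(list)
    current = None  # name whose block we are inside, or None between blocks
    for line in lines:
        if line.startswith(separator):
            current = None
        elif marker in line:
            current = line.split(marker, 1)[1].strip()
            blocks[current].append(line)
        elif current is not None:
            blocks[current].append(line)
    return dict(blocks)

def write_blocks(blocks, outdir="."):
    """Write each person's lines to <name>.txt (lower-cased) in outdir."""
    for name, lines in blocks.items():
        Path(outdir, name.lower() + ".txt").write_text("".join(lines))

blocks = split_by_person(SAMPLE_LINES)
print("".join(blocks["Jimmy"]), end="")
```

Collecting first and writing afterwards means repeated "Person: Jimmy" blocks end up in one file without keeping many files open at once, at the cost of holding the selected lines in memory.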

Gerard Flanagan

sicvic wrote:

Since I cant show the actual output file lets say I had an output file
that looked like this:

aaaaa bbbbb Person: Jimmy
Current Location: Denver

It may be the output of another process but it's the input file as far
as the parsing code is concerned.

The code below gives the following output, if that's any help (just
adapting Noah's idea above). Note that it deals with the input as a
single string rather than line by line.

Jimmy
Jimmy.txt

Current Location: Denver
Next Location: Chicago

Sarah
Sarah.txt

Current Location: San Diego
Next Location: Miami
Next Location: New York

data='''
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
'''

import StringIO
import re
src = StringIO.StringIO(data)

for name in ['Jimmy', 'Sarah']:
    exp = "(?s)Person: %s(.*?)--" % name
    filename = "%s.txt" % name
    info = re.findall(exp, src.getvalue())[0]
    print name
    print filename
    print info

hth

Gerard

Dec 20 '05 #7
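
A possible Python 3 variation on the whole-string approach: instead of looping over a fixed list of names, one pattern captures the name and the block body together, and `re.finditer` walks every block. This is a sketch under assumptions, not the poster's code: names are taken to be single words (`\w+`), separators are lines of at least five dashes, and `BLOCK`, `blocks_by_name` and `DATA` are illustrative names.

```python
import re

DATA = """\
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
"""

# DOTALL lets (.*?) span lines; MULTILINE lets ^ anchor the separator
# at the start of a line.
BLOCK = re.compile(r"Person: (\w+)\n(.*?)^-{5,}\n", re.DOTALL | re.MULTILINE)

def blocks_by_name(text):
    """Map each name to a list of block bodies (the lines after the Person line)."""
    found = {}
    for match in BLOCK.finditer(text):
        name, body = match.groups()
        found.setdefault(name, []).append(body)
    return found

for name, bodies in blocks_by_name(DATA).items():
    print(name)
    print("".join(bodies), end="")
```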

Scott David Daniels

sicvic wrote:

Not homework...not even in school (do any universities even teach
classes using python?).

Yup, at least 6, and 20 wouldn't surprise me.

The code I currently have looks something like this:
...
f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

Using re here seems pretty excessive.
How about:
...
f = open(sys.argv[1]) # opens input file ### get comments right
source = iter(f) # files serve lines at their own pace. Let them
for line in source:
    if line.endswith(' Person: Jimmy\n'):
        dest = person_jimmy
    elif line.endswith(' Person: Sarah\n'):
        dest = person_sarah
    else:
        continue
    # startswith, so the test works whatever the separator's exact length
    while not line.startswith('---------------'):
        dest.write(line)
        line = source.next()
f.close()
person_jimmy.close()
person_sarah.close()

--Scott David Daniels
sc***********@acm.org

Dec 20 '05 #8
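
The key trick in the post above is that the outer and inner loops share one iterator. A Python 3 sketch of that idea follows; `copy_blocks` and the suffix-to-file mapping are hypothetical, and in-memory `io.StringIO` objects stand in for the real input and output files so the example runs without touching disk.

```python
import io

def copy_blocks(source, destinations, separator="-----"):
    """Copy each wanted block from source to the file object chosen by its first line.

    destinations maps a line ending such as "Person: Jimmy" to a writable file object.
    """
    lines = iter(source)  # one iterator shared by both loops
    for line in lines:
        dest = None
        for suffix, out in destinations.items():
            if line.rstrip("\n").endswith(suffix):
                dest = out
                break
        if dest is None:
            continue
        dest.write(line)
        # The inner loop pulls from the same iterator, so the outer loop
        # resumes after the separator; running out of input ends both loops.
        for line in lines:
            if line.startswith(separator):
                break
            dest.write(line)

source = io.StringIO("".join([
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
]))
jimmy, sarah = io.StringIO(), io.StringIO()
copy_blocks(source, {"Person: Jimmy": jimmy, "Person: Sarah": sarah})
print(jimmy.getvalue(), end="")
```

Using an inner `for` loop rather than calling `next()` directly means a block that is cut off by the end of the file is still written out instead of raising `StopIteration`.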

sicvic

Thank you everyone!!!

I got a lot more information than I expected. You guys got my brain
thinking in the right direction and starting to like programming.
You've got a great community here. Keep it up.

Thanks,
Victor

Dec 20 '05 #9

Bengt Richter

On 20 Dec 2005 08:06:39 -0800, "sicvic" <mo************@gmail.com> wrote:

Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.

Ok, not homework.

Basically I need to add statements that after finding that line copy
all the lines following it and stopping when it sees
'----------------------------------------------'

Any help is greatly appreciated.

Ok, I generalized on your theme of extracting file chunks to named files,
where the beginning line has the file name. I hardcoded the '.txt' extension
and provided a way to direct the output to a (I guess not necessarily sub) directory.
Not tested beyond what you see. Tweak to suit.

----< extractfilesegs.py >--------------------------------------------------------
"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
    where source is -tf for test file, a file name, or an open file
          outdir is a directory prefix that will be joined to output file names
          startpat is a regular expression with group 1 giving the extracted file name
          endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
        if fout:
            fout.close()
            print 'file %r ended with source file EOF, not stop mark'%filename
    return files

def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
....irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print ' "%s"'% fname
    if tf == '-tf':
        for fpath in files:
            print '====< %s >====\n%s============'%(fpath, open(fpath).read())
----------------------------------------------------------------------------------

Running on your test data:

[15:19] C:\pywk\clp>md extracteddata

[15:19] C:\pywk\clp>py24 extractfilesegs.py -tf

Files created:
"extracteddata\jimmy.txt"
"extracteddata\sarah.txt"
====< extracteddata\jimmy.txt >====
Person: Jimmy
Current Location: Denver
Next Location: Chicago
============
====< extracteddata\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:20] C:\pywk\clp>md xd

[15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----

Files created:
"xd\jimmy.txt"
====< xd\jimmy.txt >====
Jimmy
Current Location: Denver
Next Location: Chicago
============

[15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----

Files created:
"xd\sarah.txt"
====< xd\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"

Files created:
"xd\irrelevant.txt"
====< xd\irrelevant.txt >====
irrelevant
trailing stuff ...
============

HTH, NO WARRANTIES ;-)
Regards,
Bengt Richter

Dec 20 '05 #10
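
The script above was written for Python 2.4. For comparison, a condensed Python 3 sketch of its core loop is given here as a generator that only finds the segments and leaves file writing to the caller. The name `iter_segments`, the sample text and the decision to yield lists of lines are assumptions for illustration; the default start and stop patterns mirror the ones in the post.

```python
import re

def iter_segments(lines, start=r"Person:\s+(\w+)", stop=r"-{30}"):
    """Yield (name, segment_lines) for each start match; the stop line is excluded."""
    rxstart, rxstop = re.compile(start), re.compile(stop)
    name, segment = None, []
    for line in lines:
        if name is None:
            match = rxstart.search(line)
            if match:
                # Like the original, keep only the matched text of the start line.
                name, segment = match.group(1), [match.group(0) + "\n"]
        elif rxstop.search(line):
            yield name, segment
            name = None
        else:
            segment.append(line)
    if name is not None:  # input ended before a stop line was seen
        yield name, segment

TEXT = (
    "....irrelevant leading\n"
    "aaaaa bbbbb Person: Jimmy\n"
    "Current Location: Denver\n"
    "Next Location: Chicago\n"
    + "-" * 46 + "\n"
    + "aaaaa bbbbb Person: Sarah\n"
    "Current Location: San Diego\n"
)

for name, segment in iter_segments(TEXT.splitlines(keepends=True)):
    print("====< %s.txt >====" % name.lower())
    print("".join(segment), end="")
```

Keeping the parsing separate from the output makes it straightforward to append each segment to a per-name file, as the original does, or to test the parsing on its own.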
