Bytes IT Community

Search for 2 words in a file and print out all the lines between these 2 words

P: 4
I need to read in a file and split it into lines, then search for a start word and an end word and print out all the lines in between, including the lines with the start and end words.

For example: (file)
a
b
foo
c
d
bar
e

output:
foo
c
d
bar

I searched the web but without any luck.
I'm just a beginner; any help is appreciated, thanks.

import re

f = open('input', 'r')
lines = f.readlines()
for line in lines:
    #if line.startswith("main"):
    # and endswith("end"):
    print re.split(r"\s|," , line)
Dec 4 '10 #1
9 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
This is how I would approach it:

Create an open file object for reading.
Iterate on the file object: for line in fileObj:
Strip the whitespace characters from line.
If line starts with the start word, initialize a list to contain the results.
Start a try/except block to catch StopIteration, because fileObj.next() raises that exception (not EOFError) upon reaching the end of the file.
Start a while loop: while True:
Use fileObj.next().strip() to read the next line.
If the line starts with the end word, append it to the list and break out of both loops.
If the line does not start with the end word, append it to the list and continue.
Then you can print the collected lines like this: print "\n".join(resultsList)
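
The steps above can be sketched roughly as follows. This is only an illustration written in Python 3 syntax (the thread itself uses Python 2), so next(file_obj) stands in for fileObj.next(). The function name lines_between and the in-memory sample file are made up for the example, and returning the partial list when the end word never appears is an assumption, not something the steps specify.

```python
import io

def lines_between(file_obj, start_word, end_word):
    """Return the stripped lines from the first line starting with
    start_word through the next line starting with end_word."""
    for line in file_obj:
        if line.strip().startswith(start_word):
            results = [line.strip()]
            try:
                while True:
                    # next() raises StopIteration at the end of the file
                    line = next(file_obj).strip()
                    results.append(line)
                    if line.startswith(end_word):
                        return results  # leaves both loops at once
            except StopIteration:
                # Assumption: hand back what was collected if the
                # end word is missing.
                return results
    return []  # start word never found

# Hypothetical sample data matching the example in the question
sample = io.StringIO("a\nb\nfoo\nc\nd\nbar\ne\n")
print("\n".join(lines_between(sample, "foo", "bar")))
```

Putting the loops in a function means a single return replaces the two breaks.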
Dec 5 '10 #2

P: 4
Thank you bvdet for the quick reply.
I tried to follow your approach, but I got stuck on the second loop (the while loop). I get an error when I try to leave both loops, and I am not sure where to put the breaks to leave them. Thanks again for your help.
import re

fileObj = open(inputFile, 'r')
#lines = f.readlines()
aList = []
for line in fileObj:
    sLine = line.strip()
    #print lStrip
    if sLine.startswith("start"):
        aList = sLine
        #print aList
        try:
            while True:
                try:
                    fileObj.next().strip()
                    if sLine.startswith("end"):
                        aList.append(sLine)
                        break
        break
                    else:
                        aList.append(sLine)
                except EOFError:
                    print "something went wrong"
Dec 5 '10 #3

Sean Pedersen
P: 30
def search(fname, start, end):

    file = (line.strip() for line in open(fname))

    line = file.next()
    while line != start: line = file.next()
    yield line

    line = file.next()
    while line != end: yield line; line = file.next()
    yield line

for item in search("file.txt", "foo", "bar"):
    print item
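
A note for anyone porting the generator above to Python 3: there, a StopIteration that escapes from inside a generator body is turned into a RuntimeError (PEP 479, the default since Python 3.7), so calling next() unguarded no longer ends the generator quietly when the start or end word is missing. Below is one possible rewrite using plain for loops. It is a sketch only, and taking any iterable of lines instead of a file name is my own change for the sake of a self-contained example.

```python
import io

def search(lines, start, end):
    """Yield stripped lines from the first line equal to start
    through the next line equal to end."""
    stripped = (line.strip() for line in lines)

    # Skip ahead to the start line.
    for line in stripped:
        if line == start:
            yield line
            break
    else:
        return  # start never found: yield nothing

    # Continue on the same iterator until the end line.
    for line in stripped:
        yield line
        if line == end:
            return

# Hypothetical sample data matching the example in the question
sample = io.StringIO("a\nb\nfoo\nc\nd\nbar\ne\n")
for item in search(sample, "foo", "bar"):
    print(item)
```

Because both loops share one generator expression, the second loop picks up exactly where the first one stopped.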
Dec 5 '10 #4

bvdet
Expert Mod 2.5K+
P: 2,851
dann,

You are not too far off. Encapsulate the loops in a function and use the return statement. Also, you must make an assignment to sLine in the inner loop.
def main():
    fileObj = open(inputfile, 'r')
    #lines = f.readlines()
    for line in fileObj:
        sLine = line.strip()
        if sLine.startswith("foo"):
            # start word found: begin collecting
            aList = [sLine,]
            try:
                while True:
                    # assign the next line to sLine on every pass
                    sLine = fileObj.next().strip()
                    if sLine.startswith("bar"):
                        aList.append(sLine)
                        # return leaves both loops at once
                        return aList
                    else:
                        aList.append(sLine)
            except StopIteration:
                # next() raises StopIteration at end of file, not EOFError
                print "something went wrong"

print main()
Here is another way of doing it. The following returns the words in between start and end and was intended to be used on a sentence. It could easily be converted for your application.
from string import punctuation as stuff

def words_between(s, first, second):
    # return the words between first and second words
    words = [word.lower() for word in s.split()]
    # catch error if word not in sentence
    try:
        # strip punctuation for matching
        idx1 = [word.strip(stuff) for word in words].index(first.lower())
        # start search for second word after idx1
        idx2 = [word.strip(stuff) \
                for word in words].index(second.lower(),idx1+1)
        return words[idx1+1:idx2]
    except ValueError, e:
        return "ValueError: %s" % e

sentence = "The small country was ruled by a truculent dictator."
print words_between(sentence, "was", "dictator")
Dec 5 '10 #5

P: 4
bvdet,

Thanks again and to the others who replied.
That's what I need it to do, but after it finds the start word it strips all the lines, and I want to keep the file structure as it was. I tried to remove ".strip()" from "sLine = fileObj.next().strip()", but that didn't work. With my previous code I used readlines().

Another question: once aList holds the lines between the start and end words, I need to loop through aList for new start and end words. Is it better to use the current function and add a new inner loop to it, or to define a new function and loop through aList?
Dec 7 '10 #6

Sean Pedersen
P: 30
The little generator I posted doesn't modify the file. You tried removing the .strip() call from your file iterator, but it didn't do what? It's my understanding that readlines() reads the entire file into memory as a list, which is not good for large files.

If you keep the file open, simply seek back to offset 0 before searching again. If not, you'll need to reopen it.
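
To illustrate the seek suggestion: once a loop has read a file object to the end, a second loop over it gets nothing until the position is reset. Here is a small sketch in Python 3 syntax. The temporary file and its contents are invented purely so the example runs on its own.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nfoo\nc\nbar\ne\n")
    path = tmp.name

with open(path) as file_obj:
    first_pass = [line.strip() for line in file_obj]
    exhausted = file_obj.readline()   # '' because we are at end of file
    file_obj.seek(0)                  # rewind to the start of the file
    second_pass = [line.strip() for line in file_obj]

os.remove(path)
print(first_pass == second_pass)
```

Reopening the file has the same effect as seek(0), at the cost of another open() call.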
Dec 8 '10 #7

Expert 100+
P: 621
"Is it better to use the current function and add a new inner loop to it or define a new function and loop through the alist?"
I definitely prefer passing the list to a separate function for readability, but it is a personal preference.
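
As a rough idea of what the separate-function route could look like, the sketch below (Python 3 syntax) uses one helper for both passes: first on the lines read from the file, then on the list it returned. The name extract_between, the sample lines and the inner start/end words are all made up for illustration. The helper compares on a stripped copy but stores each line untouched, so the original layout is kept.

```python
def extract_between(lines, start, end):
    """Return the items of lines from the first one starting with
    start through the next one starting with end, unmodified."""
    block = []
    collecting = False
    for line in lines:
        text = line.strip()          # used for matching only
        if collecting:
            block.append(line)
            if text.startswith(end):
                break
        elif text.startswith(start):
            collecting = True
            block.append(line)
    return block

# Hypothetical data: an outer foo..bar block containing an inner c..d block
file_lines = ["a\n", "foo\n", "  x\n", "c\n", "  y\n", "d\n", "z\n", "bar\n", "e\n"]
outer = extract_between(file_lines, "foo", "bar")
inner = extract_between(outer, "c", "d")
print(inner)
```

Since the helper accepts any iterable of strings, it works the same on an open file object and on a list.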
"but after it finds the start word it strips all the lines and I want to keep the file structure as it was."
You can strip() and not store the results:
## snipped from code posted above
                while True:
                    sLine = fileObj.next()         ## eliminate strip()
                    ## added here but does not alter sLine
                    if sLine.strip().startswith("bar"):
                        aList.append(sLine)
                        return aList
                    else:
                        aList.append(sLine)
Dec 9 '10 #8

P: 4
Thanks, everybody. With your help my code is doing what it should do.
I am catching all the code between a start and an end point and storing it in a list.
import re

parsing =False
aList = []
mylist = []
bBlock = []

fileObj = open('inputfile', 'r')
for line in fileObj:
    if line.find(".ent") != -1:
        print "Now parsing **************************************"
        line
        parsing = True
    if line.find(".end") != -1:
        print "Stopped parsing **********************************"
        parsing = False

    if parsing:
        instruct = re.split(r"\s|," , line)
        aList.append(instruct)
    else:
        mylist.append(line)
The next thing I am trying to do is to iterate through the list (aList) I stored earlier, looking for a word that ends with a colon (:). I tried the following code but didn't succeed. The error I get is this:
File "\Python26\lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Any help is appreciated.
for line in aList:
    if re.search(":", line):
        bBlock.append(line)
Dec 15 '10 #9

bvdet
Expert Mod 2.5K+
P: 2,851
It could be a problem with this if block:
if parsing:
        instruct = re.split(r"\s|," , line)
        aList.append(instruct)
re.split() returns a list, which is assigned to the identifier instruct, so every item in aList is itself a list of strings. re.search() expects a string, not a list.
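
One way around it, sketched in Python 3 syntax, is to loop over the tokens inside each stored list and test each token on its own. For a simple "ends with a colon" check, str.endswith() is enough and no regular expression is needed. The sample lines below are invented, since the real input file was not posted, and the names a_list and labels are placeholders.

```python
import re

# Hypothetical input lines; the real file contents are not shown in the thread.
sample_lines = ["start: mov a,b\n", "loop: add r1, r2\n", "    sub r3,r4\n"]

# Same split as in the posted code: each entry of a_list is a list of tokens.
a_list = [re.split(r"\s|,", line) for line in sample_lines]

labels = []
for tokens in a_list:        # tokens is a list, not a string
    for token in tokens:     # token is a string
        if token.endswith(":"):
            labels.append(token)

print(labels)  # ['start:', 'loop:']
```

Note that this split pattern also produces empty strings wherever two separators are adjacent, for example a comma followed by a space. They are harmless here because an empty string never ends with a colon.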
Dec 15 '10 #10
