I am trying to write a Python 2.6 script on Win32 that reads all the text files stored in a directory and prints only the lines containing actual data. A sample file:
Set : 1
Date: 10212009
12 34 56
25 67 90
End Set
********
Set: 2
Date: 10222009
34 56 89
25 67 89
End Set
In the above example file, I want to print only lines 3, 4 and 9, 10 (the actual data values). The program should do this for every txt file in the directory.
I wrote the script as below and am testing it on a single txt file as I go.
My logic is to read the input files one by one and search for a start string. As soon as a match is found, start searching for the end string. When both are found, print the lines from the start string to the end string. Repeat on the rest of the file before opening the next file.
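The logic described above can be sketched as a single pass over the lines with a flag that records whether the reader is currently inside a set. This is only a minimal sketch under some assumptions: `extract_data_lines`, `START_RE`, `END_MARKER` and `SKIP_PREFIXES` are hypothetical names, the start pattern tolerates both `Set : 1` and `Set: 2` because the sample file shows both spellings, and the `Date` line is treated as a header rather than data.

```python
import re

# Hypothetical names for illustration; none of these come from the original script.
START_RE = re.compile(r"^Set\s*:")   # matches both "Set : 1" and "Set: 2"
END_MARKER = "End Set"
SKIP_PREFIXES = ("Date",)            # assumption: "Date" lines are headers, not data


def extract_data_lines(lines):
    """Yield (line_number, text) for each data line found inside a set."""
    inside_set = False
    for line_number, raw in enumerate(lines, 1):
        text = raw.strip()
        if not inside_set:
            # Still looking for the start string
            if START_RE.match(text):
                inside_set = True
        elif text.startswith(END_MARKER):
            # End string found: go back to looking for the next start string
            inside_set = False
        elif text and not text.startswith(SKIP_PREFIXES):
            yield line_number, text


# The sample file from the question, as a list of lines with their line endings
sample = """Set : 1
Date: 10212009
12 34 56
25 67 90
End Set
********
Set: 2
Date: 10222009
34 56 89
25 67 89
End Set
""".splitlines(True)

for line_number, text in extract_data_lines(sample):
    print("%d: %s" % (line_number, text))
```

Because the line number comes from `enumerate`, nothing has to be repositioned between sets. The same function could be fed an open file handle instead of the `sample` list.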
The problem I am having is that it successfully reads Set 1 of the data, but then goes wrong on subsequent sets in the file. For Set 2, it identifies the number of lines to read, but prints them starting at the wrong line number.
A little digging turned up the following suggestions:
1. Use seek and tell to reposition the read for the 2nd iteration of the loop. This did not work, since the file is read through a buffer and that throws off the value returned by "tell".
2. Open the file in binary mode. This helped someone else, but it is not working for me.
3. Open the file with buffering set to 0. This did not work either.
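For reference, here is a minimal sketch of the seek/tell idea from point 1. It assumes the file is opened in binary mode and read only through explicit `readline()` calls, not through `for line in handle`, so that the position reported by `tell()` is not affected by read-ahead buffering. `line_offsets` is a hypothetical helper and the temporary file exists only for the demonstration.

```python
import os
import tempfile


def line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    handle = open(path, "rb")
    try:
        while True:
            position = handle.tell()      # offset before the next line is read
            line = handle.readline()
            if not line:                  # empty result means end of file
                break
            offsets.append(position)
    finally:
        handle.close()
    return offsets


# Demonstration on a throwaway file (hypothetical data based on the sample)
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"Set : 1\nDate: 10212009\n12 34 56\nEnd Set\n")
os.close(fd)

offsets = line_offsets(demo_path)

# Jump straight back to the third line using its recorded offset
handle = open(demo_path, "rb")
handle.seek(offsets[2])
third_line = handle.readline()
handle.close()
os.remove(demo_path)

print(offsets)
print(repr(third_line))
```

Recording offsets this way is only needed if the file really has to be re-read from a saved position. A single forward pass avoids seek and tell altogether.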
The second problem is that when it prints the data from Set 1, it inserts a blank line between every 2 lines of data values. How can I get rid of it?
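A likely cause of the blank lines is that every line read from a file still carries its own line ending, and print then adds a second one. A minimal sketch of stripping the ending before printing follows. `strip_line_ending` is a hypothetical helper, and the `\r\n` case is included on the assumption that the file is read in binary mode on Windows.

```python
def strip_line_ending(line):
    # Remove any trailing carriage return / newline characters,
    # so that print supplies the only line break
    return line.rstrip("\r\n")


# Example lines as they might come back from readline() or linecache.getline()
raw_lines = ["12 34 56\r\n", "25 67 90\n"]

for raw in raw_lines:
    print(strip_line_ending(raw))
```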
Note: Ignore all references to next_run in the code below. I was trying it out for repositioning the line read, so that subsequent searches for the start string begin from the last position of the end string.
#!C:/Python26 python
# Import necessary modules
import os, glob, string, sys, fileinput, linecache
from goto import goto, label

# Set working path
path = 'C:\\System_Data'

# --------------------
# PARSE DATA MODULE
# --------------------
# Define the search strings for data
start_search = "Set :"
end_search ="End Set"

# For Loop to read the input txt files one by one
for inputfile in glob.glob( os.path.join( path, '*.txt' ) ):
    inputfile_fileHandle = open ( inputfile, 'rb', 0 )
    print( "Current file being read: " +inputfile )
    # start_line initializes to first line
    start_line = 0
    # After first set of data is extracted, next_run will store the position to read the rest of the file
    # next_run = 0
    # start reading the input files, one line by one line
    for line in inputfile:
        line = inputfile_fileHandle.readline()
        start_line += 1
        # next_run+=1
        # If a line matched with the start_search string
        has_match = line.find( start_search )
        if has_match >= 0:
            print ( "Start String found at line number: %d" %( start_line ) )
            # Store the location where the search will be restarted
            # next_run = inputfile_fileHandle.tell() #inputfile_fileHandle.lineno()
            print ("Current Position: %d" % next_run)
            end_line = start_line
            print ( "Start_Line: %d" %start_line )
            print ( "End_Line: %d" %end_line )
            #print(line)
            for line in inputfile:
                line = inputfile_fileHandle.readline()
                #print (line)
                end_line += 1
                has_match = line.find(end_search)
                if has_match >= 0:
                    print 'End String found at line number: %d' % (end_line)
                    # total lines to print:
                    k=0
                    # for loop to print all the lines from start string to end string
                    for j in range(0,end_line-start_line-1):
                        print linecache.getline(inputfile, start_line +1+ j )
                        k+=1
                    print ( "Number of lines Printed: %d " %k )
                    # Using goto to get out of 2 loops at once
                    goto .re_search_start_string
        label .re_search_start_string
        #inputfile_fileHandle.seek(next_run,0)
    inputfile_fileHandle.close ()