File parser - Python

Angelic Devil

I'm building a file parser but I have a problem I'm not sure how to
solve. The files this will parse have the potential to be huge
(multiple GBs). There are distinct sections of the file that I
want to read into separate dictionaries to perform different
operations on. Each section has specific begin and end statements
like the following:

KEYWORD
..
..
..
END KEYWORD

The very first thing I do is read the entire file contents into a
string. I then store the contents in a list, splitting on line ends
as follows:
file_lines = file_contents.split('\n')
Next, I build smaller lists from the different sections using the
begin and end keywords:
begin_index = file_lines.index(begin_keyword)
end_index = file_lines.index(end_keyword)
small_list = file_lines[begin_index + 1 : end_index]
I then plan on parsing each list to build the different dictionaries.
The problem is that one begin statement is a substring of another
begin statement as in the following example:
BAR
END BAR

FOOBAR
END FOOBAR
I can't just look for the line in the list that contains BAR because
FOOBAR might come first in the list. My list would then look like

[foobar_1, foobar_2, ..., foobar_n, ..., bar_1, bar_2, ..., bar_m]

I don't really want to use regular expressions, but I don't see a way
to get around this without doing so. Does anyone have any suggestions
on how to accomplish this? If regexps are the way to go, is there an
efficient way to parse the contents of a potentially large list using
regular expressions?
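
One way to see why regular expressions may not be needed here: comparing whole, stripped lines for equality (rather than testing for a substring) already keeps BAR and FOOBAR apart. The following is only a minimal sketch, assuming each keyword sits alone on its own line; the names `find_section` and `sample` are hypothetical and not part of the original post.

```python
def find_section(lines, keyword):
    """Return the lines strictly between `keyword` and `END keyword`.

    Matching is by whole-line equality after stripping whitespace, so
    'BAR' never matches the line 'FOOBAR'.
    """
    stripped = [line.strip() for line in lines]
    begin = stripped.index(keyword)                    # exact match only
    end = stripped.index('END ' + keyword, begin + 1)  # search after begin
    return stripped[begin + 1:end]


# Hypothetical sample data with FOOBAR appearing before BAR.
sample = """FOOBAR
foobar_1
foobar_2
END FOOBAR

BAR
bar_1
END BAR
""".split('\n')

print(find_section(sample, 'BAR'))     # ['bar_1']
print(find_section(sample, 'FOOBAR'))  # ['foobar_1', 'foobar_2']
```

This still holds every line in memory, so it only illustrates the matching question, not the multi-GB file size concern.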

Any help is appreciated!

Thanks,
Aaron

--
"Tis better to be silent and be thought a fool, than to speak and
remove all doubt."
-- Abraham Lincoln

Aug 30 '05 #1

William Park

Angelic Devil <aa********@gmail.com> wrote:

BAR
END BAR

FOOBAR
END FOOBAR

man csplit

--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/

Aug 30 '05 #2

Rune Strand

It's not clear to me from your posting what possible order the tags may
be in. Assuming you will always END a section before beginning a new
one, e.g. it's always:

A
some A-section lines.
END A

B
some B-section lines.
END B

etc.

And never:

A
some A-section lines.
B
some B-section lines.
END B
END A

etc.

it should be fairly simple. And if the file is several GB, you ought
to use a generator in order to overcome the memory problem.

Something like this:
def make_tag_lookup(begin_tags):
    # create a dict with each {begin_tag : end_tag}
    end_tags = [('END ' + begin_tag) for begin_tag in begin_tags]
    return dict(zip(begin_tags, end_tags))


def return_sections(filepath, lookup):
    # Generator returning each section
    inside_section = False

    # iterate over the file object itself; readlines() would load the
    # whole file into memory at once
    with open(filepath, 'r') as f:
        for line in f:
            line = line.strip()
            if not inside_section:
                if line in lookup:
                    inside_section = True
                    data_section = []
                    section_end_tag = lookup[line]
                    section_begin_tag = line
                    data_section.append(line)  # store section start tag
            else:
                if line == section_end_tag:
                    data_section.append(line)  # store section end tag
                    inside_section = False
                    yield data_section  # yield entire section
                else:
                    data_section.append(line)  # store each line within section


# create the generator yielding each section
# (datafile is the path to the file, list_of_begin_tags e.g. ['BAR', 'FOOBAR'])
sections = return_sections(datafile,
                           make_tag_lookup(list_of_begin_tags))

for section in sections:
    for line in section:
        print(line)
    print('\n')
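
The same idea can be sketched so that it accepts any iterable of lines (an open file object, for instance) and yields each section as a (tag, body) pair, which is convenient when the goal is one dictionary entry per section. This is an illustrative variant, not the code posted above; `iter_sections` and the sample data are hypothetical, and it assumes sections are never nested.

```python
import io


def iter_sections(lines, begin_tags):
    """Yield (begin_tag, body_lines) for each section, one at a time."""
    end_tags = {tag: 'END ' + tag for tag in begin_tags}
    current_tag = None
    body = []
    for raw in lines:              # a file object yields lines lazily
        line = raw.strip()
        if current_tag is None:
            if line in end_tags:   # whole-line match: BAR != FOOBAR
                current_tag = line
                body = []
        elif line == end_tags[current_tag]:
            yield current_tag, body
            current_tag = None
        else:
            body.append(line)


# io.StringIO stands in for a real file opened with open(path).
fake_file = io.StringIO("FOOBAR\nf1\nf2\nEND FOOBAR\n\nBAR\nb1\nEND BAR\n")
sections = dict(iter_sections(fake_file, ['BAR', 'FOOBAR']))
print(sections)
```

Only one section is held in memory at a time while iterating. Collecting everything with `dict(...)`, as done here for brevity, keeps all sections and would overwrite an earlier section if the same tag occurred twice.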

Aug 30 '05 #3

MrJean1

Take a closer look at SimpleParse/mxTextTools

<http://www.python.org/pypi/SimpleParse/2.0.1a3>

We have used these to parse log files of several 100 MB with simple and
complex grammars up to 250+ productions. Highly recommended.

/Jean Brouwers

PS) For an introduction see also this story
<http://www-128.ibm.com/developerworks/linux/library/l-simple.html>

Aug 30 '05 #4

infidel

Some time ago I was toying around with writing a tool in python to
parse our VB6 code (the original idea was to write our own .NET
conversion tool because the Wizard that comes with VS.NET sucks hard on
some things). I tried various parsing tools and EBNF grammars but VB6
isn't really an EBNF-esque syntax in all cases, so I needed something
else. VB6 syntax is similar to what you have, with all kinds of
different "Begin/End" blocks, and some files can be rather big. Also,
when you get to conditionals and looping constructs you can have
seriously nested logic, so the approach I took was to imitate a SAX
parser. I created a class that reads VB6 source line by line, and
calls empty "event handler" methods (just like SAX) such as
self.begin_type or self.begin_procedure and self.end_type or
self.end_procedure. Then I created a subclass that actually
implemented those event handlers by building a sort of tree that
represents the program in a more abstract fashion. I never got to the
point of writing the tree out in a new language, but I had fun hacking
on the project for a while. I think a similar approach could work for
you here.
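
A rough sketch of that SAX-like structure, adapted to the KEYWORD / END KEYWORD format from the original question rather than to VB6. The class and method names (`SectionReader`, `begin_section`, `end_section`, `section_line`, `DictBuilder`) are hypothetical and are not the code described above; the stack is one assumed way of coping with nested blocks.

```python
class SectionReader:
    """Read lines one by one and call handler methods, SAX style."""

    def __init__(self, begin_tags):
        self.end_tags = {tag: 'END ' + tag for tag in begin_tags}
        self.stack = []  # currently open sections, innermost last

    def feed(self, lines):
        for raw in lines:
            line = raw.strip()
            if self.stack and line == self.end_tags[self.stack[-1]]:
                self.end_section(self.stack.pop())
            elif line in self.end_tags:
                self.stack.append(line)
                self.begin_section(line)
            elif self.stack:
                self.section_line(self.stack[-1], line)

    # Empty "event handlers"; subclasses override the ones they need.
    def begin_section(self, tag):
        pass

    def end_section(self, tag):
        pass

    def section_line(self, tag, line):
        pass


class DictBuilder(SectionReader):
    """Collect the lines of each section into a dictionary."""

    def __init__(self, begin_tags):
        super().__init__(begin_tags)
        self.sections = {}

    def begin_section(self, tag):
        self.sections.setdefault(tag, [])

    def section_line(self, tag, line):
        self.sections[tag].append(line)


builder = DictBuilder(['BAR', 'FOOBAR'])
builder.feed(['BAR', 'b1', 'END BAR', '', 'FOOBAR', 'f1', 'END FOOBAR'])
print(builder.sections)
```

Because `feed` takes any iterable of lines, an open file object can be passed directly, so the file is read line by line rather than loaded whole.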

Aug 30 '05 #5

Mike C. Fletcher

infidel wrote:

Angelic Devil wrote:

....
Some time ago I was toying around with writing a tool in python to
parse our VB6 code (the original idea was to write our own .NET
conversion tool because the Wizard that comes with VS.NET sucks hard on
some things). I tried various parsing tools and EBNF grammars but VB6
isn't really an EBNF-esque syntax in all cases, so I needed something
else.

....

You may find this project interesting to play with:
http://vb2py.sourceforge.net/index.html

Have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

Aug 30 '05 #6

Angelic Devil

"Rune Strand" <ru*********@gm ail.com> writes:
Thanks. This shows definite promise. I've already tailored it for
what I need, and it appears to be working.
--
"Society in every state is a blessing, but Government, even in its best
state, is but a necessary evil; in its worst state, an intolerable one."
-- Thomas Paine

Aug 30 '05 #7
