C parsing fun

Hello people!

At work I was given the task of writing a script to analyze C++ code
based on concepts my boss had. To do this I needed to represent C++
code structure in Python somehow. I read the docs for Yapps, pyparsing
and other tools like those, then I came up with a very simple idea. I
realized that bracketed code is almost like a Python list, except I
have to replace curly brackets with square ones and surround the
remaining stuff with quotes. This process involves no recursion or node
objects, only pure string manipulation, so I believe it's really fast.
Finally I can get the resulting list by calling eval() on the
string.
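
For readers who want to see the idea in code, here is a minimal sketch.
It is not the bloppy source (that is only in the zip); the function name
braces_to_list is made up for illustration, and it swaps eval() for
ast.literal_eval(), which only accepts literals and so will not execute
anything that sneaks into the generated string.

```python
import ast
import re


def braces_to_list(source):
    """Turn brace-delimited text into a nested list of strings.

    Hypothetical helper: a sketch of the approach described above,
    not the code shipped in bloppy.
    """
    parts = []
    # Split on braces but keep them as separate tokens.
    for token in re.split(r"([{}])", source):
        if token == "{":
            parts.append("[")
        elif token == "}":
            parts.append("],")
        elif token.strip():
            # repr() adds the quotes and escapes anything awkward.
            parts.append(repr(token.strip()) + ",")
    # Trailing commas are legal inside Python list literals.
    return ast.literal_eval("[" + "".join(parts) + "]")


tree = braces_to_list("class A { int x; void f() { return; } };")
print(tree)
```

Note that this treats every brace as structure, including ones inside
comments and string literals.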

For example when I need to parse a class definition, I only need to
look for a list item containing the pattern "*class*", and the next
item will be the contents of the class as another list.
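
A rough sketch of that lookup, assuming the nested list has the shape
described above (strings for code, sub-lists for bracketed bodies). The
name find_classes and the sample data are invented for illustration.
The walk over the result is recursive even though the parsing step is
not.

```python
from fnmatch import fnmatchcase


def find_classes(tree):
    """Yield (header, body) pairs for items matching "*class*".

    Hypothetical helper; `tree` is a nested list of strings and lists.
    """
    for i, item in enumerate(tree):
        if isinstance(item, list):
            # Descend into bodies to find nested classes.
            yield from find_classes(item)
        elif (
            fnmatchcase(item, "*class*")
            and i + 1 < len(tree)
            and isinstance(tree[i + 1], list)
        ):
            yield item, tree[i + 1]


sample = ["class Outer", ["int x;", "class Inner", ["int y;"], ";"], ";"]
for header, body in find_classes(sample):
    print(header, "->", body)
```

The pattern is as loose as it looks: it also matches any other item
that merely contains the word "class" and is followed by a body.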

You can grab the code at:

http://kiri.csing.hu/stack/python/bloppy-0.1.zip

(test script [test.py] included)

Feb 5 '07 #1
And the great thing is that the algorithm can be used with any
language that structures its code with brackets, like PHP and many
others.

Feb 5 '07 #2
Yes, that's a nice solution.
Sometimes it's not enough, though (it won't work on code obfuscated with
macros).

Anyway, if you need something more sophisticated, then I'd recommend
gccxml or its Python binding:

http://www.language-binding.net/pygccxml/pygccxml.html

Feb 5 '07 #3
Thanks for responding, Szabolcs! I've already tried that, but couldn't
manage to get it to work. The source I tried to parse is a huge MSVC
7.1 solution containing about 38 projects, and I believe the code is
so complex that it has too many different dependencies and GCC just
can't handle them. Btw I'm not deeply familiar with C++ compilers, so
maybe it was because of compiler misconfiguration, but I really don't
know...

Feb 5 '07 #4
karoly.kiripolszky wrote:
> and the great thing is that the algorithm can be used with any
> language that structures the code with brackets, like PHP and many
> others.

But it fails if brackets appear in comments or literal strings.

Ciao,
Marc 'BlackJack' Rintsch

Feb 5 '07 #5

Feb 5 '07 #6
You're right, thank you for the comment! I will look into how to
avoid this.

Feb 5 '07 #7
Károly Kiripolszky wrote:
> You're right, thank you for the comment! I will look after how to
> avoid this.

And after you have resolved this 'small' ;-) detail you will probably
notice that some fully functional and widely used parsers still have
trouble with this ...

Claudio
Feb 5 '07 #8
I've found a brute-force solution. In the preprocessing phase I simply
strip out the comments (things inside comments won't appear in the
result) and replace curly brackets with these symbols: #::OPEN::# and
#::CLOSE::#. After parsing I convert them back. In fact I can exclude
commented lines from the analysis, as I only have to cope with
production code.
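
A possible shape for that preprocessing step, as a sketch only. It
assumes the masking is meant for braces inside string and character
literals, since the structural braces still have to reach the parser.
The names preprocess and restore are invented here, and the regular
expression does not handle raw string literals, digit separators like
1'000, or line comments continued with a backslash.

```python
import re

OPEN, CLOSE = "#::OPEN::#", "#::CLOSE::#"

# One pass over the source: whichever token starts first wins, so a
# "//" inside a string or a quote inside a comment is left alone.
_COMMENT = r"//[^\n]*|/\*.*?\*/"
_LITERAL = r'"(?:\\.|[^"\\])*"' + "|" + r"'(?:\\.|[^'\\])*'"
_TOKEN = re.compile(
    f"(?P<comment>{_COMMENT})|(?P<literal>{_LITERAL})", re.DOTALL
)


def preprocess(source):
    """Drop comments and mask braces inside literals (hypothetical)."""

    def fix(match):
        if match.group("comment") is not None:
            return " "  # keep a separator where the comment was
        literal = match.group("literal")
        return literal.replace("{", OPEN).replace("}", CLOSE)

    return _TOKEN.sub(fix, source)


def restore(text):
    """Convert the placeholders back to braces after parsing."""
    return text.replace(OPEN, "{").replace(CLOSE, "}")


masked = preprocess('void f() { puts("}{"); } // stray {\n')
print(masked)
print(restore(masked))
```

After preprocess() the only braces left in the text are structural
ones, so the bracket-to-list conversion can run on it safely, and
restore() is applied to the strings in the parsed result.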

Feb 5 '07 #9
http://kiri.csing.hu/stack/python/bloppy-0.2.zip

Test data now also contains brackets in literal strings.

Feb 5 '07 #10
Hello again!

When I came up with this idea for parsing C files with ease, I was
at home, and I only have access to the sources in question at the
office. So today I tried the previously posted algorithm on the actual
source and realized that the example data I had originally run the
test with was so simple that the algorithm still failed on some header
files. I had to make some further changes, and by now I am able
to parse 1135 header files in 5 seconds with no errors.

This may be considered spamming, but this package is so small that I
don't want to create a page for it on SF or Google Code. Furthermore,
I want to provide a working solution to people who find this topic, so
here's the latest not-so-elegant-brute-force-but-fast parser:

http://kiri.csing.hu/stack/python/bloppy-0.3.zip


Feb 6 '07 #11
Károly Kiripolszky wrote:
> I've found a brute-force solution. In the preprocessing phase I simply
> strip out the comments (things inside comments won't appear in the
> result) and replace curly brackets with these symbols: #::OPEN::# and
> #::CLOSE::#.

This fails when the code already has the strings "#::OPEN::#" and
"#::CLOSE::#" in it.

--
Roberto Bonvallet
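
One way to sidestep the collision, sketched under assumptions: instead
of fixed strings, pick two placeholder characters that are known not to
occur in the input. This is a different technique from the fixed
#::OPEN::# and #::CLOSE::# markers discussed in the thread. The function
name pick_markers is invented, and the choice of the Unicode Private Use
Area (U+E000 to U+F8FF) is just a convenient pool of characters that
rarely appear in source files.

```python
def pick_markers(source, count=2):
    """Return `count` distinct private-use characters absent from source.

    Hypothetical helper. Single-character markers cannot partially
    overlap with surrounding text, unlike multi-character strings.
    """
    found = []
    for code in range(0xE000, 0xF900):
        char = chr(code)
        if char not in source:
            found.append(char)
            if len(found) == count:
                return found
    raise ValueError("no unused private-use characters left")


open_marker, close_marker = pick_markers('puts("#::OPEN::#{");')
print(repr(open_marker), repr(close_marker))
```

The markers have to be chosen per input file and passed on to whatever
converts them back after parsing.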
Feb 8 '07 #12
Yes, of course. But you can still fine-tune the code for the sources
you want to parse. The C++ header files I needed to analyze contained
no such strings. I believe there are very few real-life .h files out
there containing those. In fact I chose #::OPEN::# and #::CLOSE::#
because they're more foreign to C++ than e.g. ::OPEN or #OPEN would be.
I hope this makes sense. :)

Feb 8 '07 #13
