
Stripping C-style comments using a Python regexp

Hi Folks,

I'm trying to strip C/C++ style comments (/* ... */ or // ) from
source code using Python regexps.

If I don't have to worry about comments embedded in strings, it seems
pretty straightforward (this is what I'm using now):

    import re

    cpp_pat = re.compile(r"""
        /\* .*? \*/   |   # C comments
        // [^\n\r]*       # C++ comments
        """, re.S | re.X)
    with open('myprog.cpp') as f:
        s = f.read()
    s = cpp_pat.sub(' ', s)   # sub() returns the new string; keep the result

However, the sticking point is dealing with tokens like /* embedded
within a string:

const char *mystr = "This is /*trouble*/";
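
To make the failure concrete, here is a minimal sketch that applies the comment-only pattern to that line. The names naive_pat and mangled are just for illustration; the pattern itself is the one shown above.

```python
import re

# The comment-only pattern: it has no notion of string literals.
naive_pat = re.compile(r"""
    /\* .*? \*/   |   # C comments
    // [^\n\r]*       # C++ comments
    """, re.S | re.X)

line = 'const char *mystr = "This is /*trouble*/";'
mangled = naive_pat.sub(' ', line)

# The text inside the string literal is wrongly treated as a comment.
print(mangled)
```

The /*trouble*/ part disappears from inside the string literal, which changes the meaning of the program.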

I've inherited a working Perl script, which I'd like to reimplement in
Python so that I don't have to spawn a new Perl process in my Python
program each time I want to strip comments from a file. The Perl script
looks like this:

    #!/usr/bin/perl -w

    $/ = undef;   # no line delimiter
    $_ = <>;      # read entire file

    s! ((['"]) (?: \\. | .)*? \2) |   # skip quoted strings
       /\* .*? \*/ |                  # delete C comments
       // [^\n\r]*                    # delete C++ comments
     ! $1 || ' '                      # change comments to a single space
     !xseg;                           # ignore white space, treat as single line,
                                      # evaluate result, repeat globally
    print;

The Perl regexp above uses some sort of conditional to deal with this,
by replacing a quoted string with itself if the initial match is a
quoted string. Is there some equivalent feature in Python regexps?

Lorin

Jul 27 '05 #1


> Is there some equivalent feature in Python regexps?

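    import re

    # Group 1: a C comment. Group 2: a double-quoted string
    # (escaped quotes and // comments are not handled here).
    cpp_pat = re.compile(r'(/\*.*?\*/)|(".*?")', re.S)

    def subfunc(match):
        # Keep strings unchanged; drop comments.
        if match.group(2):
            return match.group(2)
        else:
            return ''

    stripped_c_code = cpp_pat.sub(subfunc, c_code)

...I suppose this is what the Perl code might do, but I'm not sure,
since trying to read it hurts my brain...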
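
For comparison, a closer translation of the Perl substitution might look like the sketch below. The names perl_like_pat, strip_comments and src are made up for illustration, and this has only been thought through for simple inputs like the one shown.

```python
import re

# Sketch of a direct translation of the Perl pattern (illustrative names).
perl_like_pat = re.compile(r"""
    ( (['"]) (?: \\. | . )*? \2 )   |   # group 1: quoted string, kept as-is
    /\* .*? \*/                     |   # C comment
    // [^\n\r]*                         # C++ comment
    """, re.S | re.X)

def strip_comments(text):
    # group(1) is None when a comment matched, so "or" falls back to a
    # single space, mirroring Perl's  $1 || ' '
    return perl_like_pat.sub(lambda m: m.group(1) or ' ', text)

src = 'const char *s = "a /*b*/ c"; /* gone */ int x; // also gone\n'
print(strip_comments(src))
```

The function passed to sub() plays the role of Perl's /e modifier: it is called once per match and its return value becomes the replacement.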

Jul 27 '05 #2

#------------------------------------------------------------------------
import re, sys

def q(c):
    """Returns a regular expression that matches a region delimited by c,
    inside which c may be escaped with a backslash"""

    return r"%s(\\.|[^%s])*%s" % (c, c, c)

double_quoted_string = q('"')
single_quoted_string = q("'")
c_comment = r"/\*.*?\*/"
# Stop before the newline, so it stays in the output and a comment on the
# last line of a file without a trailing newline still matches.
cxx_comment = r"//[^\n]*"

rx = re.compile("|".join([double_quoted_string, single_quoted_string,
                          c_comment, cxx_comment]), re.DOTALL)

def replace(x):
    x = x.group(0)
    if x.startswith("/"):
        return ' '    # comment: replace with a single space
    return x          # string or character constant: keep unchanged

result = rx.sub(replace, sys.stdin.read())
sys.stdout.write(result)
#------------------------------------------------------------------------

The regular expression matches ""-strings, ''-character-constants,
C comments, and C++ comments. The replace function returns ' ' (space)
when the matched thing was a comment, or the original thing otherwise.
If line numbers must remain unchanged, replace() should instead return
as many '\n's as the matched comment contains, or ' ' if it contains
none.
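
A minimal sketch of that line-preserving variant follows. The name replace_keep_lines is hypothetical, and the pattern here is a compact restatement (with the C++ comment pattern stopping before the newline), not necessarily the exact one you would want in production.

```python
import re

def q(c):
    # Region delimited by c, inside which c may be escaped with a backslash.
    return r"%s(\\.|[^%s\\])*%s" % (c, c, c)

rx = re.compile("|".join([q('"'), q("'"), r"/\*.*?\*/", r"//[^\n]*"]),
                re.DOTALL)

def replace_keep_lines(m):
    text = m.group(0)
    if text.startswith("/"):
        # Comment: keep only its newlines, or a single space if it has none.
        return "\n" * text.count("\n") or " "
    return text  # string or character constant: unchanged

src = "a /* one\ntwo */ b // c\nd\n"
out = rx.sub(replace_keep_lines, src)
print(out)
```

Because every newline inside a comment is put back, a compiler error reported against the stripped text still points at the right line of the original file.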

Basically, the regular expression is a tokenizer, and replace() chooses
what to do with each recognized token. Things not recognized as tokens
by the regular expression are left unchanged.
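
The tokenizer view can be made explicit with named groups. This is only a sketch: the group names and the classify helper are invented for illustration. Each alternative gets a name, and the match object's lastgroup attribute tells you which kind of token was recognized.

```python
import re

token_rx = re.compile(r"""
    (?P<string>      "(?:\\.|[^"\\])*" )   |   # ""-string
    (?P<char>        '(?:\\.|[^'\\])*' )   |   # ''-character constant
    (?P<c_comment>   /\*.*?\*/ )           |   # C comment
    (?P<cxx_comment> //[^\n]* )                # C++ comment
    """, re.DOTALL | re.VERBOSE)

def classify(source):
    """Return (token kind, token text) pairs for every recognized token."""
    return [(m.lastgroup, m.group(0)) for m in token_rx.finditer(source)]

tokens = classify('x = "a//b"; /* c */ y = \'z\'; // d')
for kind, text in tokens:
    print(kind, repr(text))
```

Anything between the matches (identifiers, operators, whitespace) is simply skipped by finditer(), which corresponds to sub() leaving unrecognized text unchanged.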

Jeff
PS this is the test file I used:
/* ... */ xyzzy;
456 // 123
const char *mystr = "This is /*trouble*/";
/* * */
/* /* */
// /* /* */
/* // /* */
/*
* */


Jul 27 '05 #3

Neat! I didn't realize that re.sub could take a function as an
argument. Thanks.

Lorin

Jul 27 '05 #5
