Stripping C-style comments using a Python regexp

Hi Folks,

I'm trying to strip C/C++ style comments (/* ... */ or // ) from
source code using Python regexps.

If I don't have to worry about comments embedded in strings, it seems
pretty straightforward (this is what I'm using now):

import re

cpp_pat = re.compile(r"""
    /\* .*? \*/   |   # C comments
    // [^\n\r]*       # C++ comments
    """, re.S | re.X)
with open('myprog.cpp') as f:
    s = f.read()
stripped = cpp_pat.sub(' ', s)

However, the sticking point is dealing with tokens like /* embedded
within a string:

const char *mystr = "This is /*trouble*/";
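
To make the failure concrete, here is a minimal sketch that runs the comment-only pattern over that one line. The names `naive_pat`, `line` and `mangled` are made up for this illustration; the pattern itself is the one quoted above.

```python
import re

# The comment-only pattern from above, with no awareness of string literals.
naive_pat = re.compile(r"""
    /\* .*? \*/   |   # C comments
    // [^\n\r]*       # C++ comments
    """, re.S | re.X)

line = 'const char *mystr = "This is /*trouble*/";'

# The text between /* and */ is removed even though it sits inside a
# string literal, so the program's data is silently changed.
mangled = naive_pat.sub(' ', line)
print(mangled)
```

The string contents are treated as a comment because the pattern has no alternative that consumes a quoted string first.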

I've inherited a working Perl script, which I'd like to reimplement in
Python so that I don't have to spawn a new Perl process in my Python
program each time I want to strip comments from a file. The Perl script
looks like this:

#!/usr/bin/perl -w

$/ = undef; # no line delimiter
$_ = <>; # read entire file

s! ((['"]) (?: \\. | .)*? \2) | # skip quoted strings
/\* .*? \*/ | # delete C comments
// [^\n\r]* # delete C++ comments
! $1 || ' ' # change comments to a single space
!xseg; # ignore white space, treat as single line
# evaluate result, repeat globally
print;

The Perl regexp above uses some sort of conditional to deal with this,
by replacing a quoted string with itself if the initial match is a
quoted string. Is there some equivalent feature in Python regexps?

Lorin

Jul 27 '05 #1
> Is there some equivalent feature in Python regexps?

import re

cpp_pat = re.compile(r'(/\*.*?\*/)|(".*?")', re.S)

def subfunc(match):
    if match.group(2):
        return match.group(2)
    else:
        return ''

stripped_c_code = cpp_pat.sub(subfunc, c_code)

...I suppose this is what the Perl code might do, but I'm not sure,
since trying to read it hurts my brain...
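
For comparison, a closer port of the Perl substitution might look like the sketch below. It keeps the Perl pattern's three alternatives and uses a replacement function in place of the `$1 || ' '` expression. The names `perl_style_pat`, `keep_strings` and `strip_comments` are hypothetical, and this has only been reasoned through on small inputs, not on real source trees.

```python
import re

# Port of the Perl pattern: group 1 captures a whole quoted string
# (group 2 is the opening quote, matched again by the backreference \2).
# The two comment alternatives leave group 1 unset.
perl_style_pat = re.compile(r"""
    ((['"]) (?: \\. | .)*? \2)   |   # quoted string, kept as-is
    /\* .*? \*/                  |   # C comment
    // [^\n\r]*                      # C++ comment
    """, re.S | re.X)

def keep_strings(match):
    # Same idea as Perl's  $1 || ' '  in the replacement expression:
    # return the quoted string if there was one, otherwise a single space.
    return match.group(1) or ' '

def strip_comments(source):
    return perl_style_pat.sub(keep_strings, source)
```

Because the quoted-string alternative comes first and handles backslash escapes, a `/*` inside a string or after an escaped quote is consumed as part of the string and never reaches the comment alternatives.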

Jul 27 '05 #2
#------------------------------------------------------------------------
import re, sys

def q(c):
    """Returns a regular expression that matches a region delimited by c,
    inside which c may be escaped with a backslash"""

    # The backslash is excluded from the character class so that an escape
    # is always consumed as a pair by the first alternative.
    return r"%s(\\.|[^%s\\])*%s" % (c, c, c)

double_quoted_string = q('"')
single_quoted_string = q("'")
c_comment = r"/\*.*?\*/"
cxx_comment = r"//[^\n]*"   # stops before the newline, so line breaks survive

rx = re.compile("|".join([double_quoted_string, single_quoted_string,
                          c_comment, cxx_comment]), re.DOTALL)

def replace(x):
    x = x.group(0)
    if x.startswith("/"): return ' '
    return x

result = rx.sub(replace, sys.stdin.read())
sys.stdout.write(result)
#------------------------------------------------------------------------

The regular expression matches ""-strings, ''-character-constants,
c-comments, and c++-comments. The replace function returns ' ' (space)
when the matched thing was a comment, or the original thing otherwise.
Depending on your use for this code, replace() should return as many
'\n's as are in the matched thing, or ' ' otherwise, so that line
numbers remain unchanged.
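
A minimal sketch of that line-preserving variant, assuming the same tokenizer idea. The names `replace_keep_lines` and `strip_comments_keep_lines` are made up for this illustration. Note that the `//` pattern here stops before the end-of-line, so the line break itself is never part of a match.

```python
import re

def q(c):
    # Quoted region delimited by c; backslash escapes are consumed in pairs.
    return r"%s(?:\\.|[^%s\\])*%s" % (c, c, c)

rx = re.compile("|".join([q('"'), q("'"), r"/\*.*?\*/", r"//[^\n]*"]),
                re.DOTALL)

def replace_keep_lines(m):
    text = m.group(0)
    if not text.startswith("/"):
        return text                      # string or character literal: keep
    # Comment: emit its newlines so later lines keep their numbers,
    # or a single space if the comment sat on one line.
    return "\n" * text.count("\n") or " "

def strip_comments_keep_lines(source):
    return rx.sub(replace_keep_lines, source)
```

With this replacement function the output has exactly as many lines as the input, so compiler diagnostics on the stripped file still point at the right line of the original.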

Basically, the regular expression is a tokenizer, and replace() chooses
what to do with each recognized token. Things not recognized as tokens
by the regular expression are left unchanged.

Jeff
PS this is the test file I used:
/* ... */ xyzzy;
456 // 123
const char *mystr = "This is /*trouble*/";
/* * */
/* /* */
// /* /* */
/* // /* */
/*
* */


Jul 27 '05 #3
Neat! I didn't realize that re.sub could take a function as an
argument. Thanks.

Lorin

Jul 27 '05 #5
