re.sub hangs on text from large files.

Atli
5,058 Expert 4TB
Hi everybody.

In an effort to teach myself the basics of Python, I set out to create a script that would read a PHP file, remove any comments and then save it to another location.
I've got it all pretty much worked out now, except...

I have this re.sub regex that is meant to remove /*...*/ comments.
It works fine on small files, but causes Python to hang on larger files.
(By large I mean files over 20 KB, sometimes containing a few thousand lines of code.)
    inFile = open(inPath + cfile)
    outFile = open(outPath + cfile, "w")

    inText = inFile.read()
    outText = re.sub("\/\*(.|\s)*\*\/", "", inText)
    outFile.write(outText)

    inFile.close()
    outFile.close()

Running this causes Python to hang, and when I interrupt it (Ctrl+C), this is what I get:
    Traceback (most recent call last):
      File "./scandir.py", line 60, in <module>
        listdirrec(inPath, outPath)
      File "./scandir.py", line 44, in listdirrec
        listdirrec(inPath + entry +"/", outPath + entry +"/")
      File "./scandir.py", line 53, in listdirrec
        outText = re.sub("\/\*(.|\s)*\*\/", "", inText)
      File "/usr/lib/python2.5/re.py", line 150, in sub
        return _compile(pattern, 0).sub(repl, string, count)
    KeyboardInterrupt

I'm running Python 2.5.2 on Ubuntu 8.04.

Any input would be greatly appreciated.
Thanks
May 21 '08 #1
jlm699
314 100+
OK, so I don't have any large text files to work with; however, here is one piece of advice:

In my experience with the re module, it is almost always a good idea to compile your regular expressions. This should speed up your process and may fix the error that you are seeing.
    import re
    rc = re.compile("\/\*(.|\s)*\*\/")
    rc.sub("", inputText)

On an entirely different note, I always get worried when I see people using string concatenation to construct paths (inPath + myFile, etc.).
I usually use os.path.join(), as it makes things much easier; for example:
    >>> import os
    >>> os.path.join('/usr', 'src', 'bin')
    '/usr/src/bin'
    >>> # On a Windows system:
    >>> os.path.join('C:\\', 'Program Files', 'Python', 'Rules')
    'C:\\Program Files\\Python\\Rules'

That's just my two cents and the method that I always go with; there's nothing wrong with what you've done, though.
May 21 '08 #2
bvdet
2,851 Expert Mod 2GB
Is each comment on one line? If so, try iterating on the file object.
Example:
    f = open(file_name)
    for line in f:
        pass  # process each line here

From Python docs:
"Also note that when in non-blocking mode, less data than what was requested may be returned, even if no size parameter was given."
May 21 '08 #3
Atli
5,058 Expert 4TB
"Is each comment on one line? If so, try iterating on the file object."

No, these comments can (and usually do) span multiple lines.

I did manage to find a solution, though!

After some testing, I found that making the expression non-greedy fixes the problem no matter what combination of whitespace characters I use.

The greedy expression also works, except when \n is coupled with any other whitespace character; then the process somehow gets caught in an endless loop, running at 100% CPU.
The funny thing, though, is that it takes up virtually no memory.
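
A possible explanation, offered as an assumption and not verified against the Python 2.5 source: a space or tab matches both . and \s, so when a backtracking engine has to back off from the end of the file it can retry each such character both ways, and the work grows very quickly with the amount of whitespace after the last */. Backtracking revisits the same text instead of building new data, which would fit the low memory use. Separately, the greedy form also gives the wrong result on a file with more than one comment, because it removes everything from the first /* to the last */. A small sketch using a made-up sample string:

```python
import re

# Made-up PHP snippet with two block comments and code between them.
sample = "<?php\n/* first */\n$a = 1;\n/* second\n   comment */\n$b = 2;\n"

# Same patterns as in the thread, written as raw strings.
greedy = re.compile(r"/\*(.|\s)*\*/")
lazy = re.compile(r"/\*(.|\s)*?\*/")

# Greedy: one match from the first "/*" to the last "*/",
# so the code between the two comments is deleted as well.
greedy_out = greedy.sub("", sample)

# Non-greedy: each match stops at the nearest "*/",
# so only the comments themselves are removed.
lazy_out = lazy.sub("", sample)

print(repr(greedy_out))
print(repr(lazy_out))
```

The sample is deliberately tiny so that the greedy pattern finishes quickly; on a large file the same pattern may run for a very long time.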

Anyhow...
This ended up working for me.
    rc = re.compile("\/\*(.|\s)*?\*\/")
    outText = rc.sub("", inText)

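
For what it's worth, the same non-greedy idea can be written without the alternation by letting . match newlines through the re.DOTALL flag. The sketch below is not from the thread: strip_block_comments and strip_file are hypothetical names, and it assumes that every /* has a closing */ and that no comment markers appear inside PHP string literals (a plain regex cannot tell the difference).

```python
import os
import re

# With re.DOTALL, "." also matches newlines, so no (.|\s) alternation is needed.
# The "?" keeps the match non-greedy, so each comment ends at its own "*/".
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_block_comments(text):
    """Return text with every /* ... */ block removed."""
    return BLOCK_COMMENT.sub("", text)


def strip_file(in_dir, out_dir, name):
    """Copy in_dir/name to out_dir/name without block comments (hypothetical helper)."""
    # os.path.join builds the paths, as suggested earlier in the thread.
    with open(os.path.join(in_dir, name)) as in_file:
        text = in_file.read()
    with open(os.path.join(out_dir, name), "w") as out_file:
        out_file.write(strip_block_comments(text))
```

On Python 2.5 itself, the with statement needs "from __future__ import with_statement" at the top of the script; explicit close() calls, as in the original code, work too.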
Thanks for the help guys!

Oh, and thanks for the os.path.join tip.
That's going to save me a lot of trouble :)
May 21 '08 #4
