Bytes | Software Development & Data Engineering Community
modifying small chunks from long string

Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

Thanks for any help,

Mack

Nov 22 '05 #1
MackS wrote:
This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time.
[...]
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object?


Is that a problem? Are you using string methods?

Some possibilities are:

array.array('c')
UserString.MutableString
cStringIO

depending on exactly what you are trying to do, but if
it is just straightforward replacement (i.e. you
never actually change the length of the string) I'd
guess the list idiom would be simplest.
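The list idiom I mean is roughly this (a sketch; the sizes, positions and
the .lower() edit are stand-ins for the real data and the real edit):

```python
# Plain-list idiom: convert the string once, slice-assign edits in
# place, and join once at the very end.
longstr = "A" * (1024 * 1024)        # stand-in for the real 10-20 MB string
buf = list(longstr)                  # one-time conversion to a mutable list

beg, end = 100, 150
chunk = ''.join(buf[beg:end])        # pull out a small chunk as a string
chunk = chunk.lower()                # ... edit it ...
buf[beg:end] = list(chunk)           # slice-assign the edit back in place

longstr = ''.join(buf)               # one final join when all edits are done
```

Each edit only touches a small slice of the list; the expensive
full-size allocation happens exactly once, at the final join.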

If you have enough memory, just keep two copies of the
string, the original and the modified:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)

This takes 530 seconds (8 minutes) on my
not-especially-powerful system. Ouch. How does that
compare to your code?

If you can operate on moderately sized chunks of text
at a time rather than one character at a time, you'll
see some performance increase.

Also the above code not only has a 20MB string, plus a
20MB list, it also carries around a list of 20+ million
int objects. If I was paying more attention before I
hit the enter key, I would have used xrange instead of
range and saved a lot of memory.

It also has to assemble the new 20MB string before
garbage-collecting the character list. It is possible
that the above test code has well in excess of 100MB of
data at its peak.

With those problems in mind, I try this:

from __future__ import division

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = []
blocksize = 1024
for i in xrange(size//blocksize):
    copy.append(
        original[i*blocksize:(i+1)*blocksize].lower())
copy = ''.join(copy)

This makes a BIG difference: from 8 minutes down to 0.5
seconds. That's a speed-up by a factor of about one
thousand.

I was so amazed at the speed increase I re-wrote the
code with a couple of simple tests, then ran it again.
Same result. Unless I've made some stupid mistake, I
think this is the way to go.

--
Steven.

Nov 22 '05 #2
Replying to myself... the first sign of madness *wink*
Steven D'Aprano wrote:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)
Do you notice the premature optimization? Rather than
appending new data to an initially empty list, I
thought I would "optimize" my code by pre-allocating
all the memory I would need for the list.

You can see how well it didn't work:
This takes 530 seconds (8 minutes)


The more sensible algorithm, without the unneeded
pre-allocation of the copy list, runs 1000 times
faster. That's the difference between optimization by
wild guessing and optimization by actually writing good
code :-)
--
Steven.

Nov 22 '05 #3
On 13 Nov 2005 22:57:50 -0800, "MackS" <ma***********@hotmail.com> wrote:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!
The usual way is to accumulate the edited short pieces of the new version of longstr
in a list and then join them once, if you really need the new longstr in a single piece
for something. I.e., (untested sketch)

chunks_of_new_longstr = []
for chunk in chunker(longstr):
    # edit chunk (your shortstr)
    chunks_of_new_longstr.append(chunk)  # or do several appends of pieces from the editing of a chunk
longstr = ''.join(chunks_of_new_longstr)

But if you don't really need it except to write it to output and the next thing would be
open('longstr.txt','wb').write(longstr) # might want 'w' instead of 'wb' for plain text data

then don't bother joining into a new longstr but do
open('longstr.txt','wb').writelines(chunks_of_new_longstr)

instead. But if you are going to do that, why not just
fout = open('longstr.txt','wb')
before the loop, and
fout.write(chunk)
in place of
chunks_of_new_longstr.append(chunk)

Of course, if longstr is coming from a file, maybe you can have
the chunker operate on a file instead of a longstr in memory.
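For example, a file-based chunker could look like this (untested sketch;
BytesIO stands in for the real input and output files, and .lower() for
the real edit):

```python
from io import BytesIO

def chunker(fileobj, blocksize=4096):
    # Yield successive fixed-size blocks until the file is exhausted.
    while True:
        chunk = fileobj.read(blocksize)
        if not chunk:
            break
        yield chunk

fin = BytesIO(b"A" * 10000)    # stand-in for the real input file
fout = BytesIO()               # stand-in for open('longstr.txt', 'wb')
for chunk in chunker(fin):
    fout.write(chunk.lower())  # stand-in for the real edit
```

That way the 10-20 MB string never has to exist in memory at all; you
stream edited chunks straight from input to output.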
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

I imagine you should be able to change a very few lines to switch between
ways of getting your input stream of editable chunks and accumulating your output.

OTOH, this is all guesswork without more context ;-)

Regards,
Bengt Richter
Nov 22 '05 #4
In article <11*********************@g44g2000cwa.googlegroups.com>,
"MackS" <ma***********@hotmail.com> wrote:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...
A couple of minutes experimenting with array.array at the python command
line indicates that it will work fine for you. Quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
Array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to string often.

MutableString warns that it is very slow. It seems to work by having a
string data item that it keeps replacing. I didn't try it.

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!


Replace this with slice assignment:

longarray = array.array('c', longstr)  # once only at beginning!

shortstr = longarray[beg:end].tostring()  # or just edit an array

# edit shortstr (or shortarray)

longarray[beg:end] = array.array('c', shortstr)

longstr = longarray.tostring()  # if needed
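(Note the 'c' typecode above is Python 2 only; in later Pythons the 'u'
typecode, or a bytearray for byte data, plays the same role. A
self-contained sketch of the same slice-assignment pattern:)

```python
import array

# Same idea, end to end: convert once, edit small slices in place,
# convert back only when the whole string is needed.
longstr = "A" * 1000
longarray = array.array('u', longstr)         # once only at beginning!

beg, end = 10, 20
shortstr = longarray[beg:end].tounicode()     # pull out a small chunk
shortstr = shortstr.lower()                   # ... edit it ...
longarray[beg:end] = array.array('u', shortstr)

longstr = longarray.tounicode()               # if needed
```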
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Nov 22 '05 #5
Thank you all for your great help. One of the few things better than
python is the knowledgeable community around it. : )

Regards,

Mack

Nov 22 '05 #6
Tony Nelson wrote:
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

A couple of minutes experimenting with array.array at the python command
line indicates that it will work fine for you. Quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
Array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to string often.


I have no major objections to using array, but a minor
one: ordinary lists may very well be more than snappy
enough, and they have the advantage of being more
familiar than the array module to many Python programmers.

The time it takes to process a 20MB string will depend
on the details of the processing, but my back of the
envelope test using one large input string and an
intermediate list of strings was *extremely* fast, less
than half a second for a 20MB input. (See my earlier
post for details.)

Given that sort of speed, shifting to the less familiar
array module just to shave the time from 0.49s to 0.45s
is premature optimization. Although, in fairness, if
you could cut the time to 0.04s for 20MB then it would
be worth the extra work to use the array module.
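One way to settle the question for a concrete workload is to time both
idioms directly (a sketch; the 1 MB size, 1 KB block and the .lower()
edit are stand-ins for the real data and the real edit):

```python
import timeit

setup = "data = 'A' * (1024 * 1024)"

# the list-of-chunks idiom
list_idiom = (
    "chunks = []\n"
    "for i in range(0, len(data), 1024):\n"
    "    chunks.append(data[i:i+1024].lower())\n"
    "result = ''.join(chunks)\n"
)

# the array slice-assignment idiom ('u' typecode; 'c' is Python 2 only)
array_idiom = (
    "import array\n"
    "a = array.array('u', data)\n"
    "for i in range(0, len(a), 1024):\n"
    "    a[i:i+1024] = array.array('u', a[i:i+1024].tounicode().lower())\n"
    "result = a.tounicode()\n"
)

t_list = timeit.timeit(list_idiom, setup=setup, number=3)
t_array = timeit.timeit(array_idiom, setup=setup, number=3)
```

Whichever wins on your actual edits is the one worth keeping; guessing
is how the 8-minute version above happened.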
--
Steven.

Nov 22 '05 #7
