Bytes | Software Development & Data Engineering Community
modifying small chunks from long string

Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

Thanks for any help,

Mack

Nov 22 '05 #1
MackS wrote:
This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time.
[...]
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object?


Is that a problem? Are you using string methods?

Some possibilities are:

array.array('c')
UserString.MutableString
cStringIO

depending on exactly what you are trying to do, but if
it is just straightforward replacement (i.e. you
never actually change the length of the string) I'd
guess the list idiom would be simplest.
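The list idiom I mean is roughly this (a sketch; the sizes, positions and
the .lower() edit are stand-ins for the real data and the real edit):

```python
# Plain-list idiom: convert the string once, slice-assign edits in
# place, and join once at the very end.
longstr = "A" * (1024 * 1024)        # stand-in for the real 10-20 MB string
buf = list(longstr)                  # one-time conversion to a mutable list

beg, end = 100, 150
chunk = ''.join(buf[beg:end])        # pull out a small chunk as a string
chunk = chunk.lower()                # ... edit it ...
buf[beg:end] = list(chunk)           # slice-assign the edit back in place

longstr = ''.join(buf)               # one final join when all edits are done
```

Each edit only touches a small slice of the list; the expensive
full-size allocation happens exactly once, at the final join.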

If you have enough memory, just keep two copies of the
string, the original and the modified:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)

This takes 530 seconds (8 minutes) on my
not-especially-powerful system. Ouch. How does that
compare to your code?

If you can operate on moderately sized chunks of text
at a time rather than one character at a time, you'll
see some performance increase.

Also the above code not only has a 20MB string, plus a
20MB list, it also carries around a list of 20+ million
int objects. If I was paying more attention before I
hit the enter key, I would have used xrange instead of
range and saved a lot of memory.

It also has to assemble the new 20MB string before
garbage-collecting the character list. It is possible
that the above test code has well in excess of 100MB of
data at its peak.

With those problems in mind, I try this:

from __future__ import division

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = []
blocksize = 1024
for i in xrange(size//blocksize):
    copy.append(
        original[i*blocksize:(i+1)*blocksize].lower())
copy = ''.join(copy)

This makes a BIG difference: from 8 minutes down to 0.5
seconds. That's a speed-up by a factor of about one
thousand.

I was so amazed at the speed increase I re-wrote the
code with a couple of simple tests, then ran it again.
Same result. Unless I've made some stupid mistake, I
think this is the way to go.

--
Steven.

Nov 22 '05 #2
Replying to myself... the first sign of madness *wink*
Steven D'Aprano wrote:

size = 1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
    copy[i] = original[i].lower()
copy = ''.join(copy)
Do you notice the premature optimization? Rather than
appending new data to an initially empty list, I
thought I would "optimize" my code by pre-allocating
all the memory I would need for the list.

You can see how well it didn't work:
This takes 530 seconds (8 minutes)


The more sensible algorithm, without the unneeded
pre-allocation of the copy list, runs 1000 times
faster. That's the difference between optimization by
wild guessing and optimization by actually writing good
code :-)
--
Steven.

Nov 22 '05 #3
On 13 Nov 2005 22:57:50 -0800, "MackS" <ma***********@hotmail.com> wrote:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!
The usual way is to accumulate the edited short pieces of the new version of longstr
in a list and then join them once, if you really need the new longstr in a single piece
for something. I.e., (untested sketch)

chunks_of_new_longstr = []
for chunk in chunker(longstr):
    # edit chunk (your shortstr)
    chunks_of_new_longstr.append(chunk)  # or do several appends of pieces from the editing of a chunk
longstr = ''.join(chunks_of_new_longstr)

But if you don't really need it except to write it to output and the next thing would be
open('longstr.txt','wb').write(longstr) # might want 'w' instead of 'wb' for plain text data

then don't bother joining into a new longstr but do
open('longstr.txt','wb').writelines(chunks_of_new_longstr)

instead. But if you are going to do that, why not just
fout = open('longstr.txt','wb')
before the loop, and
fout.write(chunk)
in place of
chunks_of_new_longstr.append(chunk)

Of course, if longstr is coming from a file, maybe you can have
the chunker operate on a file instead of a longstr in memory.
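For example, a file-based chunker could look like this (untested sketch;
BytesIO stands in for the real input and output files, and .lower() for
the real edit):

```python
from io import BytesIO

def chunker(fileobj, blocksize=4096):
    # Yield successive fixed-size blocks until the file is exhausted.
    while True:
        chunk = fileobj.read(blocksize)
        if not chunk:
            break
        yield chunk

fin = BytesIO(b"A" * 10000)    # stand-in for the real input file
fout = BytesIO()               # stand-in for open('longstr.txt', 'wb')
for chunk in chunker(fin):
    fout.write(chunk.lower())  # stand-in for the real edit
```

That way the 10-20 MB string never has to exist in memory at all; you
stream edited chunks straight from input to output.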
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

I imagine you should be able to change a very few lines to switch between
ways of getting your input stream of editable chunks and accumulating your output.

OTOH, this is all guesswork without more context ;-)

Regards,
Bengt Richter
Nov 22 '05 #4
In article <11*********************@g44g2000cwa.googlegroups.com>,
"MackS" <ma***********@hotmail.com> wrote:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...
A couple of minutes experimenting with array.array at the python command
line indicates that it will work fine for you. Quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
Array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to string often.

MutableString warns that it is very slow. It seems to work by having a
string data item that it keeps replacing. I didn't try it.

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!


Replace this with slice assignment:

longarray = array.array('c', longstr)  # once only at beginning!

shortstr = longarray[beg:end].tostring()  # or just edit an array

# edit shortstr (or shortarray)

longarray[beg:end] = array.array('c', shortstr)

longstr = longarray.tostring()  # if needed
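(Note the 'c' typecode above is Python 2 only; in later Pythons the 'u'
typecode, or a bytearray for byte data, plays the same role. A
self-contained sketch of the same slice-assignment pattern:)

```python
import array

# Same idea, end to end: convert once, edit small slices in place,
# convert back only when the whole string is needed.
longstr = "A" * 1000
longarray = array.array('u', longstr)         # once only at beginning!

beg, end = 10, 20
shortstr = longarray[beg:end].tounicode()     # pull out a small chunk
shortstr = shortstr.lower()                   # ... edit it ...
longarray[beg:end] = array.array('u', shortstr)

longstr = longarray.tounicode()               # if needed
```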
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Nov 22 '05 #5
Thank you all for your great help. One of the few things better than
python is the knowledgeable community around it. : )

Regards,

Mack

Nov 22 '05 #6
Tony Nelson wrote:
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I thought I was being "smart"
by avoiding editing the long string, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

A couple of minutes experimenting with array.array at the python command
line indicates that it will work fine for you. Quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
Array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to string often.


I have no major objections to using array, but a minor
one: ordinary lists may very well be more than snappy
enough, and they have the advantage of being more
familiar than the array module to many Python programmers.

The time it takes to process a 20MB string will depend
on the details of the processing, but my back of the
envelope test using one large input string and an
intermediate list of strings was *extremely* fast, less
than half a second for a 20MB input. (See my earlier
post for details.)

Given that sort of speed, shifting to the less familiar
array module just to shave the time from 0.49s to 0.45s
is premature optimization. Although, in fairness, if
you could cut the time to 0.04s for 20MB then it would
be worth the extra work to use the array module.
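One way to settle the question for a concrete workload is to time both
idioms directly (a sketch; the 1 MB size, 1 KB block and the .lower()
edit are stand-ins for the real data and the real edit):

```python
import timeit

setup = "data = 'A' * (1024 * 1024)"

# the list-of-chunks idiom
list_idiom = (
    "chunks = []\n"
    "for i in range(0, len(data), 1024):\n"
    "    chunks.append(data[i:i+1024].lower())\n"
    "result = ''.join(chunks)\n"
)

# the array slice-assignment idiom ('u' typecode; 'c' is Python 2 only)
array_idiom = (
    "import array\n"
    "a = array.array('u', data)\n"
    "for i in range(0, len(a), 1024):\n"
    "    a[i:i+1024] = array.array('u', a[i:i+1024].tounicode().lower())\n"
    "result = a.tounicode()\n"
)

t_list = timeit.timeit(list_idiom, setup=setup, number=3)
t_array = timeit.timeit(array_idiom, setup=setup, number=3)
```

Whichever wins on your actual edits is the one worth keeping; guessing
is how the 8-minute version above happened.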
--
Steven.

Nov 22 '05 #7
