473,668 Members | 2,330 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

modifying small chunks from long string

Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I though I was being "smart"
by avoiding editing the long list, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

Thanks for any help,

Mack

Nov 22 '05 #1
6 1551
MackS wrote:
This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time.
....
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object?


Is that a problem? Are you using string methods?

Some possibilities are:

array.array('c' )
UserString.Muta bleString
cStringIO

depending on exactly what you are trying to do, but if
it is just straight forward replacement, (i.e. you
never actually change the length of the string) I'd
guess the list idiom would be simplest.

If you have enough memory, just keep two copies of the
string, the original and the modified:

size = 1024*1024*20 # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
copy[i] = original[i].lower()
copy = ''.join(copy)

This takes 530 seconds (8 minutes) on my
not-especially-powerful system. Ouch. How does that
compare to your code?

If you can operate on moderately sized chunks of text
at a time rather than one character at a time, you'll
see some performance increase.

Also the above code not only has a 20MB string, plus a
20MB list, it also carries around a list of 20+ million
int objects. If I was paying more attention before I
hit the enter key, I would have used xrange instead of
range and saved a lot of memory.

It also has to assemble the new 20MB string before
garbage-collecting the character list. It is possible
that the above test code has well in excess of 100MB of
data at its peak.

With those problems in mind, I try this:

from __future__ import division

size = 1024*1024*20 # 20 MB
original = "A" * size
copy = []
blocksize = 1024
for i in xrange(size//blocksize):
copy.append( \
original[i*blocksize:(i+ 1)*blocksize].lower() )
copy = ''.join(copy)

This makes a BIG difference: from 8 minutes down to 0.5
seconds. That's a speed-up by a factor of about one
thousand.

I was so amazed at the speed increase I re-wrote the
code with a couple of simple tests, then ran it again.
Same result. Unless I've made some stupid mistake, I
think this is the way to go.

--
Steven.

Nov 22 '05 #2
Replying to myself... the first sign of madness *wink*
Steven D'Aprano wrote:

size = 1024*1024*20 # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
copy[i] = original[i].lower()
copy = ''.join(copy)
Do you notice the premature optimization? Rather than
appending new data to an initially empty list, I
thought I would "optimize" my code by pre-allocating
all the memory I would need for the list.

You can see how well it didn't work:
This takes 530 seconds (8 minutes)


The more sensible algorithm, without the unneeded
pre-allocation of the copy list, runs 1000 times
faster. That's the difference between optimization by
wild guessing and optimization by actually writing good
code :-)
--
Steven.

Nov 22 '05 #3
On 13 Nov 2005 22:57:50 -0800, "MackS" <ma***********@ hotmail.com> wrote:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!
The usual way is to accumulate the edited short pieces of the new version of longstr
in a list and then join them once, if you really need the new longstr in a single piece
for something. I.e., (untested sketch)

chunks_of_new_l ongstr = []
for chunk in chunker(longstr ):
#edit chunk (your shortstr)
newlong.append( chunk) # or do several appends of pieces from the editing of a chunk
longstr = ''.join(chunks_ of_new_longstr)

But if you don't really need it except to write it to output and the next thing would be
open('longstr.t xt','wb').write (longstr) # might want 'w' instead of 'wb' for plain text data

then don't bother joining into a new longstr but do
open('longstr.t xt','wb').write lines(chunks_of _new_longstr)

instead. But if you are going to do that, why not just
fout = open('longstr.t xt','wb')
before the loop, and
fout.write(chun k)
in place of
newlong.append( chunk)

Of course, if longstr is coming from a file, maybe you can have
the chunker operate on a file instead of a longstr in memory.
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I though I was being "smart"
by avoiding editing the long list, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

I imagine you should be able to change a very few lines to switch between
ways of getting your input stream of editable chunks and accumulating your output.

OTOH, this is all guesswork without more context ;-)

Regards,
Bengt Richter
Nov 22 '05 #4
In article <11************ *********@g44g2 000cwa.googlegr oups.com>,
"MackS" <ma***********@ hotmail.com> wrote:
Hello everyone

I am faced with the following problem. For the first time I've asked
myself "might this actually be easier to code in C rather than in
python?", and I am not looking at device drivers. : )

This program is meant to process relatively long strings (10-20 MB) by
selectively modifying small chunks one at a time. Eg, it locates
approx. 1000-2000 characters and modifies them. Currently I was doing
this using a string object but it is getting very slow: although I only
modify a tiny bit of the string at a time, a new entire string gets
created whenever I "merge" it with the rest. Eg,

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!

Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I though I was being "smart"
by avoiding editing the long list, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...
A couple of minutes experimenting with array.array at the python command
line indicates that it will work fine for you. Quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
Array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to string often.

MutableString warns that it is very slow. It seems to work by having a
string data item that it keeps replacing. I didn't try it.

shortstr = longstr[beg:end]

# edit shortstr...

longstr = longstr[:beg] + shortstr + longstr[end:] # new huge string is
created!!


Replace this with slice assignment:

longarray = array.array('c' ,longstr) # once only at beginning!

shortstring = longarray[beg:end].tostring() # or just edit an array

# edit shortstring (or shortarray)

longarray[beg:end] = array.array('c' ,shortstr)

longstring = longarray.tostr ing() # if needed
_______________ _______________ _______________ _______________ ____________
TonyN.:' *firstname*nlsn ews@georgea*las tname*.com
' <http://www.georgeanels on.com/>
Nov 22 '05 #5
Thank you all for your great help. One of the few things better than
python is the knowledgeable community around it. : )

Regards,

Mack

Nov 22 '05 #6
Tony Nelson wrote:
Can I get over this performance problem without reimplementing the
whole thing using a barebones list object? I though I was being "smart"
by avoiding editing the long list, but then it struck me that I am
creating a second object of the same size when I put the modified
shorter string in place...

A couple of minutes experimenting with array.array at the python command
line indicates that it will work fine for you. Quite snappy on a 16 MB
array, including a slice assignment of 1 KB near the beginning.
Array.array is probably better than lists for speed, and uses less
memory. It is the way to go if you are going to be randomly editing all
over the place but don't need to convert to string often.


I have no major objections to using array, but a minor
one: ordinary lists may very well be more than snappy
enough, and they have the advantage of being more
familiar than the array module to many Python programmers.

The time it takes to process a 20MB string will depend
on the details of the processing, but my back of the
envelope test using one large input string and an
intermediate list of strings was *extremely* fast, less
than half a second for a 20MB input. (See my earlier
post for details.)

Given that sort of speed, shifting to the less familiar
array module just to shave the time from 0.49s to 0.45s
is premature optimization. Although, in fairness, if
you could cut the time to 0.04s for 20MB then it would
be worth the extra work to use the array module.
--
Steven.

Nov 22 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
8311
by: qwweeeit | last post by:
Hi all, when I was young I programmed in an interpreted language that allowed to modify itself. Also Python can (writing and running a module, in-line): fNew =open("newModule.py",'w') lNew= fNew.writelines(lNew) fNew.close()
10
1776
by: AlexS | last post by:
Hi, I wonder if anybody can comment if what I see is normal in FW 1.1 and how to avoid this. I have .Net assembly, which creates literally thousands of temporary strings and other objects when running. Usually it is something like { string s=some value; some local processing here
10
5666
by: draghuram | last post by:
Hi, Is there any special support for sparse file handling in python? My initial search didn't bring up much (not a thorough search). I wrote the following pice of code: options.size = 6442450944 options.ranges = fd = open("testfile", "w") fd.seek(options.size-1)
12
13574
by: jl_post | last post by:
Dear C++ community, I have a question regarding the size of C++ std::strings. Basically, I compiled the following code under two different compilers: std::string someString = "Hello, world!"; int size1 = sizeof(std::string); int size2 = sizeof(someString); and printed out the values of size1 and size2. size1 and size2 always
1
6262
by: mdefoor | last post by:
I've written the following sample to split a string into chunks. Is this the right approach or is there perhaps a better way to accomplish getting chunks of a string? #include <stdio.h> #include <string.h> #define BUFSIZE 10 int main(int argc, char *argv) {
1
2268
by: ryann18 | last post by:
Can someone please help with this modifying Account problem!?!?! Modify the Account class so that it also permits an account to be opened with just a name and an account number, assuming an initial balance of zero. Modify the main method of the Transactions class to demonstrate this new capability. //******************************************************************** // Account.java Author: Lewis/Loftus // // Represents a...
56
4454
by: subramanian100in | last post by:
The standard allows that we can copy strings onto arg, arg, etc. Why is it allowed ? What can be the maximum length of such a string that is copied ?
8
2740
by: manmit.walia | last post by:
Hello Everyone, Long time ago, I posted a small problem I had about converting a VB6 program to C#. Well with the help with everyone I got it converted. But I overlooked something and don't understand why it is doing this. Below is my code, I would be greatfull if someone can guide me through the right path or even help me solve this issue. Problem: The old tool which was written in VB6 works perfect. But I needed to convert this to C#...
6
3863
by: Arnshea | last post by:
(apologies for the crosspost) I'm working with an MFC based COM object. From C# I'd like to be able to call a method on the COM object that takes a string array and modifies the contents. Is there any way to do this with a variable length array? I've only been able to get it to work with a fixed size array. The relevant code snippets are below. Suggestions are greatly appreciated
0
8890
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8791
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
8575
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8653
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7398
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
4202
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4373
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2784
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
1783
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.