473,499 Members | 1,903 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

speedy Python strings?

Hi,

thanks to previous help my wrapper for an abstract file type with variable
record length is working now.
I did some test runs and its awfully slow:

I didn't want to read in the whole file at once and I didn't like to read it
step by step (contains 32 bit length terminated strings (Delphi) and 32bit
integers), so I read in i.e. 2MB, work on that buffer and if the buffer
goes empty i load some more 2MB, etc.
For this buffer I use ordinary strings:

class myFile(file):
def read(self, *args):
...
self.buffer += file.read(self, *args)
...

and after reading information from the buffer I remove the read part from
it:

text = struct.unpack("L", self.buffer[:4])
self.buffer = self.buffer[4:]

During debugging I saw the program accelerating when the buffersize
<len(self.buffer) > grew smaller, thus my conclusion: since strings are
immutable the string operations are so slow.

I tested different sizes of the buffer and it performed best when I didn't
use that buffering system (which is sad, since I spend so much time on it).

Is there anything else I can use instead of normal strings that'll speed
this up?
What do you suggest how to deal with this situation? Do you suggest I remove
any buffering of the data?

Thanks for any comments
Yours
Uwe
Jul 18 '05 #1
8 1551
On Tue, Jan 20, 2004 at 01:43:24AM +0100, Uwe Mayer wrote:
Hi,

thanks to previous help my wrapper for an abstract file type with variable
record length is working now.
I did some test runs and its awfully slow:

I didn't want to read in the whole file at once and I didn't like to read it
step by step (contains 32 bit length terminated strings (Delphi) and 32bit
integers), so I read in i.e. 2MB, work on that buffer and if the buffer
goes empty i load some more 2MB, etc.
For this buffer I use ordinary strings:

class myFile(file):
def read(self, *args):
...
self.buffer += file.read(self, *args)
...
Doing the above in a loop has quadratic complexity.

and after reading information from the buffer I remove the read part from
it:

text = struct.unpack("L", self.buffer[:4])
self.buffer = self.buffer[4:]


The same is true for this line.

Your program will be much faster if you make fewer string copies.
"self.buffer[4:]" is a string copy, as is "self.buffer += anotherString".

The buffer() builtin is one way to avoid copying, but another way is to
simply keep track of how far into the string you are and use both low and
high indexes when slicing, instead of repeatedly chopping the front off the
string.

Jp

Jul 18 '05 #2
In article <bu**********@news.rz.uni-karlsruhe.de>,
Uwe Mayer <me*****@hadiko.de> wrote:
since strings are
immutable the string operations are so slow.


More to the point, and I suspect what you're running into, string
addition is linear with respect to string length because it has to
create a new string each time and destroy the old one. Since each
addition is linear with string length, a loop which adds strings will
run in quadratic time.

Possibly you want to look at the cStringIO class.

Another possibility is to buffer things in a list, using append to add
to the end and pop(0) to pop off the front (a roll-your-own queue).
Lists do not suffer from the quadratic behavor that string addition does.
Jul 18 '05 #3
On Tue, 20 Jan 2004 01:43:24 +0100, Uwe Mayer wrote:
class myFile(file):
def read(self, *args):
... # self.buffer += file.read(self, *args) # not this
self.buffer = self.buffer[self.pos:] + file.read(self, *args)
self.pos = 0 ...

and after reading information from the buffer I remove the read part from
it:

# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])
self.pos = pos + 4

--
Stuart D. Gathman <st****@bmsi.com>
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

Jul 18 '05 #4
> Another possibility is to buffer things in a list, using append to add
to the end and pop(0) to pop off the front (a roll-your-own queue).
Lists do not suffer from the quadratic behavor that string addition does.


Deleting from the front of a list results in every item being shifted up
a single entry. While this is insignificant for short lists, it is
significant for long lists:
def del_from_front(size): ... a = range(size)
... t = time.time()
... while a: a.pop(0)
... return time.time()-t
... del_from_front(1000) 0.016000032424926758 del_from_front(10000) 0.85899996757507324 del_from_front(20000) 3.2970001697540283 def del_from_end(size): ... a = range(size)
... t = time.time()
... while a: a.pop()
... return time.time()-t
... del_from_end(10000) 0.062999963760375977 del_from_end(100000) 0.51600003242492676 del_from_end(1000000)

5.4530000686645508

In terms of buffering as the parent post asks, file IO is already
buffered by the underlying libraries and operating system. Another
layer of Python buffering doesn't seem to make much sense. I would
suggest writing off the buffering exercise as a learning experience.

- Josiah

Jul 18 '05 #5
Stuart D. Gathman wrote in message ...
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])
self.pos = pos + 4

In this vein, I would also recommend looking at the array module. You
didn't describe the data structure, but if each record is simply a list of
4-byte integers, you could convert the whole record to ints at once like so:

record = array.array('L', stringorlist)
Then, perform your loops on record, which will be a list-like object
supporting item insertion and deletion, and conversion to bytestrings and
other Python base types.

Or, you could simply use array.array() to implement your buffer more
efficiently than continually calling struct. Simply buffer the converted
data a chunk at a time, using array.

However, I don't think buffering helps much in this case. File operations
are already buffered by Python, and most (all?) OSes buffer reads
themselves, too. Reading a file is almost always very fast. I would
concentrate more on presenting a useful abstraction for your data, rather
than worrying about what is essentially an optimization problem.
--
Francis Avila

Jul 18 '05 #6
On Mon, Jan 19, 2004 at 10:53:47PM -0500, Francis Avila wrote:
Stuart D. Gathman wrote in message ...
# text = struct.unpack("L", self.buffer[:4])
# self.buffer = self.buffer[4:]
pos = self.pos
text = struct.unpack("L", self.buffer[pos:pos+4])
self.pos = pos + 4

In this vein, I would also recommend looking at the array module. You
didn't describe the data structure, but if each record is simply a list of
4-byte integers, you could convert the whole record to ints at once like so:

record = array.array('L', stringorlist)
Then, perform your loops on record, which will be a list-like object
supporting item insertion and deletion, and conversion to bytestrings and
other Python base types.


Also note the existence of fromfile() - the array module actually makes it
possible to do this incredibly efficiently:

a = array.array('L')
a.fromfile(openFileObj, count)

Jp

Jul 18 '05 #7
"Josiah Carlson" <jc******@uci.edu> wrote:
Deleting from the front of a list results in every item being
shifted up a single entry. While this is insignificant for
short lists, it is significant for long lists:


Note that there's been some very interesting discussions on python-dev
recently that may end up alleviating this issue somewhat in the future - but
no promises - there doesn't appear to be a consensus on the "right way" ...

Tim Delaney
Jul 18 '05 #8
> Note that there's been some very interesting discussions on python-dev
recently that may end up alleviating this issue somewhat in the future - but
no promises - there doesn't appear to be a consensus on the "right way" ...


I've been following it. My favorite solution is to not change the
structure of the list at all, and to offer a reasonably fast fifo queue
in a standard library module with a bunch of other useful data
structures. Then you don't need to worry about breaking C modules that
rely on the list having that behavior, and get some useful data
structures out of the deal.

- Josiah

Jul 18 '05 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

220
18799
by: Brandon J. Van Every | last post by:
What's better about Ruby than Python? I'm sure there's something. What is it? This is not a troll. I'm language shopping and I want people's answers. I don't know beans about Ruby or have...
16
2402
by: Paul Prescod | last post by:
I skimmed the tutorial and something alarmed me. "Strings are a powerful data type in Prothon. Unlike many languages, they can be of unlimited size (constrained only by memory size) and can hold...
10
3658
by: Andrew Dalke | last post by:
Is there an author index for the new version of the Python cookbook? As a contributor I got my comp version delivered today and my ego wanted some gratification. I couldn't find my entries. ...
0
7134
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
7225
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
1
6901
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
7392
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
1
4920
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
3105
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
1429
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
1
667
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
307
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.