"".join(string_generator()) fails to be magic

I have an application that occasionally is called upon to process
strings that are a substantial portion of the size of memory. For
various reasons, the resultant strings must fit completely in RAM.
Occasionally, I need to join some large strings to build some even
larger strings.

Unfortunately, there's no good way of doing this without using 2x the
amount of memory as the result. You can get most of the way there with
things like cStringIO or mmap objects, but when you want to actually
get the result as a Python string, you run into the copy again.

Thus, it would be nice if there was a way to join the output of a
string generator so that I didn't need to keep the partial strings in
memory. <subject> would be the obvious way to do this, but it of
course converts the generator output to a list first.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Oct 11 '07 #1
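
To make the 2x figure concrete, here is a rough measurement sketch. It assumes a modern CPython 3 with the standard `tracemalloc` module, which the thread predates, and `string_generator`, `CHUNKS` and `CHUNK_SIZE` are hypothetical names picked for illustration. The exact ratio will vary between interpreter versions.

```python
import tracemalloc

CHUNKS = 8            # hypothetical: number of pieces to join
CHUNK_SIZE = 1 << 20  # hypothetical: 1 MiB per piece


def string_generator():
    """Yield distinct large strings one at a time."""
    for i in range(CHUNKS):
        yield chr(ord("a") + i) * CHUNK_SIZE


def measure_join_peak():
    """Return (result length, peak traced bytes) for "".join over the generator."""
    tracemalloc.start()
    result = "".join(string_generator())
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(result), peak


size, peak = measure_join_peak()
print("result: %d bytes, peak: %d bytes, ratio: %.2f" % (size, peak, peak / size))
```

If the pieces are all collected before the result is allocated, the reported peak should sit near twice the result size.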
On Thu, 11 Oct 2007 01:26:04 -0500, Matt Mackal wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.
> Occasionally, I need to join some large strings to build some even
> larger strings.
>
> Unfortunately, there's no good way of doing this without using 2x the
> amount of memory as the result. You can get most of the way there with
> things like cStringIO or mmap objects, but when you want to actually
> get the result as a Python string, you run into the copy again.
>
> Thus, it would be nice if there was a way to join the output of a
> string generator so that I didn't need to keep the partial strings in
> memory. <subject> would be the obvious way to do this, but it of
> course converts the generator output to a list first.

Even if `str.join()` did not convert the generator into a list first,
you would have overallocation. You don't know the final string size
beforehand, so intermediate strings must get moved around in memory while
concatenating. Worst case: all but the last string are already
concatenated and the last one does not fit into the allocated memory
anymore, so a new block is allocated that can hold both strings ->
double the amount of memory needed.

Ciao,
Marc 'BlackJack' Rintsch
Oct 11 '07 #2
On Oct 11, 8:53 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> Even if `str.join()` did not convert the generator into a list first,
> you would have overallocation. You don't know the final string size
> beforehand, so intermediate strings must get moved around in memory while
> concatenating. Worst case: all but the last string are already
> concatenated and the last one does not fit into the allocated memory
> anymore, so a new block is allocated that can hold both strings ->
> double the amount of memory needed.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Perhaps realloc() could be used to avoid this? I'm guessing that's
what cStringIO does, although I'm too lazy to check (I don't have
source on this box). Perhaps a cStringIO.getvalue() implementation
that doesn't copy memory would solve the problem?

-- bjorn

Oct 11 '07 #3
Matt Mackal wrote:
> Thus, it would be nice if there was a way to join the output of a
> string generator so that I didn't need to keep the partial strings in
> memory. <subject> would be the obvious way to do this, but it of
> course converts the generator output to a list first.

You can't build a contiguous string of bytes without copying them.

The question is: what do you need the resulting strings for? Depending
on the use-case, it might be that you could spare yourself the actual
concatenation, but instead use a generator like this:

    def charit(strings):
        for s in strings:
            for c in s:
                yield c

Diez
Oct 11 '07 #4
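
The same idea can be sketched with the standard library: `itertools.chain.from_iterable` walks the characters of each string in turn without ever building the concatenated string. The sample strings below are made up for illustration.

```python
from itertools import chain


def charit(strings):
    """Yield every character of every string, never concatenating them."""
    return chain.from_iterable(strings)


pieces = ["abc", "de", "f"]  # made-up sample input
print(list(charit(pieces)))
```

This only helps when the consumer can work on a stream of characters. Anything that needs one contiguous string still has to pay for the copy.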
On Thu, 11 Oct 2007 07:02:10 +0000, thebjorn wrote:
> Perhaps realloc() could be used to avoid this? I'm guessing that's
> what cStringIO does, although I'm too lazy to check (I don't have
> source on this box). Perhaps a cStringIO.getvalue() implementation
> that doesn't copy memory would solve the problem?

How could `realloc()` solve that problem? Doesn't `realloc()` copy the
memory too if the current memory block can't hold the new size!?

And `StringIO` has the very same problem: even if the `getvalue()`
method doesn't copy, copies still have to be made while writing to the
`StringIO` object whenever the buffer is not large enough.

Ciao,
Marc 'BlackJack' Rintsch
Oct 11 '07 #5
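
For readers on Python 3, a sketch of the "getvalue() that doesn't copy" idea: `io.BytesIO.getbuffer()` returns a `memoryview` over the buffer contents without copying them. This assumes that working with bytes and a view, not a real string object, is acceptable. The helper name `build_buffer` and the sample data are hypothetical, and the objection about reallocation while the buffer grows still applies.

```python
import io


def build_buffer(chunks):
    """Write byte chunks into a BytesIO and return it."""
    buf = io.BytesIO()
    for chunk in chunks:
        # the internal buffer may still be reallocated as it grows
        buf.write(chunk)
    return buf


buf = build_buffer(b"x" * 4 for _ in range(3))  # made-up sample data
view = buf.getbuffer()  # memoryview over the contents, no copy is made
print(len(view), bytes(view[:4]))
view.release()  # the BytesIO cannot be resized while a view is exported
```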
Matt Mackal <mp*@selenic.com> wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.

Do you mean physical RAM, or addressable memory? If the former,
there's an obvious solution....

> Occasionally, I need to join some large strings to build some even
> larger strings.
>
> Unfortunately, there's no good way of doing this without using 2x the
> amount of memory as the result.

I think you can get better than 2x if you've got a reasonable number
of (ideally similarly sized) large strings with something along the
lines of:

    result_string = ""
    for i in range(0, len(list_of_strings), 3):  # tune step
        # slicing copes with a list whose length is not a multiple of 3
        batch = list_of_strings[i:i+3]
        result_string += "".join(batch)
        # drop the references to the pieces just consumed
        list_of_strings[i:i+3] = [""] * len(batch)
        del batch

remembering the recent string concatenation optimisations. Beyond
that, your most reliable solution may be the (c)StringIO approach
but with a real file (see the tempfile module, if you didn't know
about it).

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Oct 11 '07 #6
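
A minimal sketch of the "real file" suggestion, assuming Python 3 and byte chunks: spool the pieces to a temporary file and then map it read-only, so that no second in-memory copy of the whole result is built by the program. The helper name `spooled_view` and the sample data are hypothetical.

```python
import contextlib
import mmap
import tempfile


@contextlib.contextmanager
def spooled_view(chunks):
    """Write byte chunks to a temporary file and yield a read-only mmap of it."""
    with tempfile.TemporaryFile() as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()  # make sure everything is in the file before mapping it
        # note: mapping an empty file raises ValueError
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            yield view


with spooled_view([b"hello, ", b"world"]) as view:  # made-up sample data
    print(len(view), view[:5])
```

The mapping is only valid inside the `with` block, and slicing it produces ordinary `bytes` copies of the requested range.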
On Oct 11, 2:26 am, Matt Mackal <m...@selenic.com> wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.
> Occasionally, I need to join some large strings to build some even
> larger strings.

Do you really need a Python string? Some functions work just fine on
mmap or array objects, for example regular expressions:
>>> import array
>>> import re
>>> a = array.array('c','hello, world')
>>> a
array('c', 'hello, world')
>>> m = re.search('llo',a)
>>> m
<_sre.SRE_Match object at 0x009DCB80>
>>> m.group(0)
array('c', 'llo')

I would look to see if there's a way to use an array or mmap instead.
If you have an upper bound for the total size, then you can reserve
the needed number of bytes.

If you really need a Python string, you might have to resort to a C
solution.
Carl Banks

Oct 11 '07 #7
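
The "reserve the needed number of bytes" idea might look like the sketch below on Python 3. It assumes an upper bound is known in advance and that a `bytearray` (a mutable byte buffer, not a string object) is acceptable to the consumer. The helper name `fill_preallocated` and the sample data are hypothetical.

```python
import re


def fill_preallocated(chunks, upper_bound):
    """Copy byte chunks into one buffer reserved up front; return a view of the used part."""
    buf = bytearray(upper_bound)  # reserved once, zero-filled
    pos = 0
    for chunk in chunks:
        end = pos + len(chunk)
        if end > upper_bound:
            raise ValueError("chunks exceed the reserved upper bound")
        buf[pos:end] = chunk  # same-length slice assignment, so no resize
        pos = end
    return memoryview(buf)[:pos]


view = fill_preallocated([b"hello, ", b"world"], upper_bound=64)  # made-up data
m = re.search(b"llo", view)  # re accepts bytes-like objects with a bytes pattern
print(m.span())
```

Each chunk is copied exactly once into its final place, so peak usage is roughly the reserved buffer plus the chunk currently in flight.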
