"".join(string_generator()) fails to be magic

I have an application that occasionally is called upon to process
strings that are a substantial portion of the size of memory. For
various reasons, the resultant strings must fit completely in RAM.
Occasionally, I need to join some large strings to build some even
larger strings.

Unfortunately, there's no good way of doing this without using 2x the
amount of memory as the result. You can get most of the way there with
things like cStringIO or mmap objects, but when you want to actually
get the result as a Python string, you run into the copy again.

Thus, it would be nice if there was a way to join the output of a
string generator so that I didn't need to keep the partial strings in
memory. <subject> would be the obvious way to do this, but it of
course converts the generator output to a list first.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Oct 11 '07 #1
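The 2x figure can be checked with a rough measurement. The sketch below is written for current Python 3 rather than the Python 2 of the thread, and `string_generator` and `peak_ratio` are hypothetical helper names made up for the illustration. It uses `tracemalloc` to compare the peak traced memory during the join with the size of the result; on CPython the ratio should come out close to 2, because `str.join()` collects every piece before it allocates the result.

```python
import tracemalloc


def string_generator(n_pieces, piece_size):
    """Yield n_pieces distinct ASCII strings of piece_size characters each."""
    for i in range(n_pieces):
        yield chr(ord("a") + i % 26) * piece_size


def peak_ratio(n_pieces=8, piece_size=1_000_000):
    """Return peak traced memory during the join divided by len(result).

    The pieces are pure ASCII, so one character occupies one byte and
    len(result) is a fair stand-in for the size of the result in bytes.
    """
    tracemalloc.start()
    result = "".join(string_generator(n_pieces, piece_size))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / len(result)


if __name__ == "__main__":
    print("peak memory / result size: %.2f" % peak_ratio())
```

The exact number depends on the interpreter and on the piece sizes chosen here, so treat it as an indication rather than a benchmark.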
On Thu, 11 Oct 2007 01:26:04 -0500, Matt Mackal wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.
> Occasionally, I need to join some large strings to build some even
> larger strings.
>
> Unfortunately, there's no good way of doing this without using 2x the
> amount of memory as the result. You can get most of the way there with
> things like cStringIO or mmap objects, but when you want to actually
> get the result as a Python string, you run into the copy again.
>
> Thus, it would be nice if there was a way to join the output of a
> string generator so that I didn't need to keep the partial strings in
> memory. <subject> would be the obvious way to do this, but it of
> course converts the generator output to a list first.

Even if `str.join()` did not convert the generator into a list first,
you would have overallocation. You don't know the final string size
beforehand, so intermediate strings must get moved around in memory while
concatenating. Worst case: all but the last string are already
concatenated and the last one does not fit into the allocated memory
anymore, so new memory is allocated that can hold both strings ->
double the amount of memory needed.

Ciao,
Marc 'BlackJack' Rintsch
Oct 11 '07 #2
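The worst case described above can be made concrete with a small simulation. This is only a model, not how any particular Python version manages string memory: it assumes a buffer that is reallocated to exactly the needed size whenever a piece does not fit, and that the old and the new block both exist while the data is copied. The function name `peak_bytes_when_appending` is invented for the illustration.

```python
def peak_bytes_when_appending(chunk_sizes, initial_capacity=0):
    """Simulate appending chunks to a buffer that grows by copying.

    Returns the largest number of bytes alive at any one time, counting
    both the old and the new block during a copying reallocation.
    """
    capacity = initial_capacity
    used = 0
    peak = capacity
    for size in chunk_sizes:
        needed = used + size
        if needed > capacity:
            # The old block (capacity) and the new block (needed) coexist
            # until the copy has finished.
            peak = max(peak, capacity + needed)
            capacity = needed
        used = needed
    return peak


if __name__ == "__main__":
    # All but the last byte already concatenated, then one more byte arrives.
    print(peak_bytes_when_appending([399, 1]))  # 799, almost 2x the final 400
```

With four equal pieces of 100 bytes the same model peaks at 700 bytes for a 400-byte result, so the overhead depends on how the sizes are distributed.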
On Oct 11, 8:53 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Thu, 11 Oct 2007 01:26:04 -0500, Matt Mackal wrote:
>> I have an application that occasionally is called upon to process
>> strings that are a substantial portion of the size of memory. For
>> various reasons, the resultant strings must fit completely in RAM.
>> Occasionally, I need to join some large strings to build some even
>> larger strings.
>>
>> Unfortunately, there's no good way of doing this without using 2x the
>> amount of memory as the result. You can get most of the way there with
>> things like cStringIO or mmap objects, but when you want to actually
>> get the result as a Python string, you run into the copy again.
>>
>> Thus, it would be nice if there was a way to join the output of a
>> string generator so that I didn't need to keep the partial strings in
>> memory. <subject> would be the obvious way to do this, but it of
>> course converts the generator output to a list first.
>
> Even if `str.join()` did not convert the generator into a list first,
> you would have overallocation. You don't know the final string size
> beforehand, so intermediate strings must get moved around in memory while
> concatenating. Worst case: all but the last string are already
> concatenated and the last one does not fit into the allocated memory
> anymore, so new memory is allocated that can hold both strings ->
> double the amount of memory needed.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Perhaps realloc() could be used to avoid this? I'm guessing that's
what cStringIO does, although I'm too lazy to check (I don't have
source on this box). Perhaps a cStringIO.getvalue() implementation
that doesn't copy memory would solve the problem?

-- bjorn

Oct 11 '07 #3
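For readers on current Python 3, where `cStringIO` has been replaced by `io.BytesIO`, something close to a non-copying `getvalue()` exists for bytes: `BytesIO.getbuffer()` returns a `memoryview` over the internal buffer. The sketch below is an illustration under that assumption, not something available to the posters in 2007, and `build_buffer` is a made-up helper name. Writing into the buffer can still reallocate, as discussed in the reply above.

```python
import io


def build_buffer(chunks):
    """Write all chunks into an in-memory byte buffer and return it."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    return buf


if __name__ == "__main__":
    buf = build_buffer([b"hello, ", b"world"])
    view = buf.getbuffer()        # memoryview, no copy of the contents
    print(bytes(view[:5]))        # copies only the five bytes that are sliced
    print(len(view))
    view.release()                # the buffer cannot be resized while exported
```

A `memoryview` is not a `str` or `bytes` object, so this only helps when the consumer accepts buffer-like objects.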
Matt Mackal wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.
> Occasionally, I need to join some large strings to build some even
> larger strings.
>
> Unfortunately, there's no good way of doing this without using 2x the
> amount of memory as the result. You can get most of the way there with
> things like cStringIO or mmap objects, but when you want to actually
> get the result as a Python string, you run into the copy again.
>
> Thus, it would be nice if there was a way to join the output of a
> string generator so that I didn't need to keep the partial strings in
> memory. <subject> would be the obvious way to do this, but it of
> course converts the generator output to a list first.

You can't build a contiguous string of bytes without copying them.

The question is: what do you need the resulting strings for? Depending
on the use-case, it might be that you could spare yourself the actual
concatenation, but instead use a generator like this:

    def charit(strings):
        for s in strings:
            for c in s:
                yield c

Diez
Oct 11 '07 #4
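The generator above behaves like `itertools.chain.from_iterable` from the standard library, which may be a convenient substitute. A small sketch comparing the two, with `pieces` as made-up sample data:

```python
from itertools import chain


def charit(strings):
    """Yield the characters of each string in turn, without joining them."""
    for s in strings:
        for c in s:
            yield c


if __name__ == "__main__":
    pieces = ["spam", "eggs"]
    # Both iterate lazily over the pieces; neither builds the joined string.
    print(list(charit(pieces)) == list(chain.from_iterable(pieces)))
```

Either form only helps if the consumer can work character by character; per-character iteration in Python is slow for very large inputs.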
On Thu, 11 Oct 2007 07:02:10 +0000, thebjorn wrote:
> On Oct 11, 8:53 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
>> Even if `str.join()` did not convert the generator into a list first,
>> you would have overallocation. You don't know the final string size
>> beforehand, so intermediate strings must get moved around in memory while
>> concatenating. Worst case: all but the last string are already
>> concatenated and the last one does not fit into the allocated memory
>> anymore, so new memory is allocated that can hold both strings ->
>> double the amount of memory needed.
>
> Perhaps realloc() could be used to avoid this? I'm guessing that's
> what cStringIO does, although I'm too lazy to check (I don't have
> source on this box). Perhaps a cStringIO.getvalue() implementation
> that doesn't copy memory would solve the problem?

How could `realloc()` solve that problem? Doesn't `realloc()` copy the
memory too if the current memory block can't hold the new size!?

And `StringIO` has the very same problem: even if the `getvalue()`
method doesn't copy, you still have to make copies while writing to the
`StringIO` object whenever the buffer is not large enough.

Ciao,
Marc 'BlackJack' Rintsch
Oct 11 '07 #5
Matt Mackal <mp*@selenic.com> wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.

Do you mean physical RAM, or addressable memory? If the former,
there's an obvious solution....

> Occasionally, I need to join some large strings to build some even
> larger strings.
>
> Unfortunately, there's no good way of doing this without using 2x the
> amount of memory as the result.

I think you can get better than 2x if you've got a reasonable number
of (ideally similarly sized) large strings with something along the
lines of:

    result_string = ""
    step = 3  # tune step
    for i in range(0, len(list_of_strings), step):
        end = min(i + step, len(list_of_strings))  # last group may be short
        result_string += "".join(list_of_strings[i:end])
        for j in range(i, end):
            list_of_strings[j] = ""  # release the piece just consumed

remembering the recent string concatenation optimisations. Beyond
that, your most reliable solution may be the (c)StringIO approach
but with a real file (see the tempfile module, if you didn't know
about it).

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Oct 11 '07 #6
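The real-file variant suggested above might look like the following in current Python 3. This is a sketch under assumptions rather than the poster's code: `join_via_tempfile` is a hypothetical helper, the pieces are byte strings, and the result is exposed as a read-only `mmap` object instead of a Python string, so the operating system can page it in and out.

```python
import mmap
import tempfile


def join_via_tempfile(chunks):
    """Write the chunks to an anonymous temporary file and map it read-only.

    Returns (file, mapping); close both when done. Mapping an empty file
    raises ValueError, so at least one non-empty chunk is assumed.
    """
    f = tempfile.TemporaryFile()
    for chunk in chunks:
        f.write(chunk)
    f.flush()  # make sure everything has reached the file before mapping it
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mapped


if __name__ == "__main__":
    f, mapped = join_via_tempfile([b"hello, ", b"world"])
    print(mapped[:5])          # slicing copies only the requested bytes
    print(mapped.find(b"wor"))
    mapped.close()
    f.close()
```

Only one piece needs to be in memory at a time while writing, at the cost of disk I/O.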
On Oct 11, 2:26 am, Matt Mackal <m...@selenic.com> wrote:
> I have an application that occasionally is called upon to process
> strings that are a substantial portion of the size of memory. For
> various reasons, the resultant strings must fit completely in RAM.
> Occasionally, I need to join some large strings to build some even
> larger strings.

Do you really need a Python string? Some functions work just fine on
mmap or array objects, for example regular expressions:

>>> import array
>>> import re
>>> a = array.array('c','hello, world')
>>> a
array('c', 'hello, world')
>>> m = re.search('llo',a)
>>> m
<_sre.SRE_Match object at 0x009DCB80>
>>> m.group(0)
array('c', 'llo')

I would look to see if there's a way to use an array or mmap instead.
If you have an upper bound for the total size, then you can reserve
the needed number of bytes.

If you really need a Python string, you might have to resort to a C
solution.
Carl Banks

Oct 11 '07 #7
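The idea of reserving the needed bytes up front can be sketched in current Python 3 with a `bytearray`. The `'c'` typecode shown in the session above no longer exists there, so this is an adaptation, not the poster's code. `join_into_preallocated` and its `upper_bound` parameter are names invented for the illustration, and the pieces are assumed to be byte strings.

```python
import re


def join_into_preallocated(chunks, upper_bound):
    """Copy chunks into a buffer reserved once, then trim the unused tail.

    Raises ValueError if the chunks do not fit into upper_bound bytes.
    """
    buf = bytearray(upper_bound)
    view = memoryview(buf)
    offset = 0
    for chunk in chunks:
        end = offset + len(chunk)
        view[offset:end] = chunk   # copies this one chunk into place
        offset = end
    view.release()                 # required before the bytearray is resized
    del buf[offset:]               # drop the part of the reservation not used
    return buf


if __name__ == "__main__":
    buf = join_into_preallocated([b"hello, ", b"world"], upper_bound=64)
    m = re.search(b"llo", buf)     # bytes patterns accept buffer-like objects
    print(m.group(0))
```

Each piece is copied once into its final position, so the peak is roughly the reservation plus the largest single piece, provided the pieces are released as they are consumed.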
