Splitting strings - by iterators?

Jeremy Sanders
Guest
 
Posts: n/a
#1: Jul 18 '05
I have a large string containing lines of text separated by '\n'. I'm
currently using text.splitlines(True) to break the text into lines, and
I'm iterating over the resulting list.

This is very slow (when using 400000 lines!). Other than dumping the
string to a file, and reading it back using the file iterator, is there a
way to quickly iterate over the lines?

I tried using newpos=text.find('\n', pos), and returning the chopped text
text[pos:newpos+1], but this is much slower than splitlines.

Any ideas?

Thanks

Jeremy


Diez B. Roggisch
Guest
 
Posts: n/a
#2: Jul 18 '05

re: Splitting strings - by iterators?


Jeremy Sanders wrote:
> I have a large string containing lines of text separated by '\n'. I'm
> currently using text.splitlines(True) to break the text into lines, and
> I'm iterating over the resulting list.
>
> This is very slow (when using 400000 lines!). Other than dumping the
> string to a file, and reading it back using the file iterator, is there a
> way to quickly iterate over the lines?
>
> I tried using newpos=text.find('\n', pos), and returning the chopped text
> text[pos:newpos+1], but this is much slower than splitlines.
>
> Any ideas?

Maybe [c]StringIO can be of help. I don't know if its iterator is lazy. But
at least it has one, so you can try and see if it improves performance :)
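
For anyone reading this on Python 3, where the cStringIO module no longer exists, here is a minimal sketch of the same idea using io.StringIO. The helper name iter_lines and the sample text are made up for illustration. Whether this beats splitlines on a given input is something to measure rather than assume.

```python
import io

def iter_lines(text):
    """Return an iterator over the lines of text, line endings kept.

    Hypothetical helper: wraps the string in an in-memory file object
    and relies on that object's own line iterator.
    """
    # With the default newline setting, StringIO splits on '\n' only
    # and does not translate the line endings it hands back.
    return iter(io.StringIO(text))

sample = "first line\nsecond line\nlast line without newline"
for line in iter_lines(sample):
    print(repr(line))
```

The lines come back one at a time, so no list of 400000 strings is built up front.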


--
Regards,

Diez B. Roggisch
Jeremy Sanders
Guest
 
Posts: n/a
#3: Jul 18 '05

re: Splitting strings - by iterators?


On Fri, 25 Feb 2005 17:14:24 +0100, Diez B. Roggisch wrote:
> Maybe [c]StringIO can be of help. I don't know if its iterator is lazy. But
> at least it has one, so you can try and see if it improves performance :)

Excellent! I somehow missed that module. StringIO speeds up the iteration
by a factor of 20!

Thanks

Jeremy
Larry Bates
Guest
 
Posts: n/a
#4: Jul 18 '05

re: Splitting strings - by iterators?


Jeremy,

How did you get the string in memory in the first place?
If you read it from a file, perhaps you should change to
reading it from the file a line at a time and use
the file object itself as your iterator.

fp = file(inputfile, 'r')
for line in fp:
    ...do your processing...

fp.close()

I don't think I would ever read 400,000 lines as a single
string and then split it. Just a suggestion.
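
A rough Python 3 sketch of the same suggestion, assuming the data can live in its own file. The function count_nonblank_lines and the temporary file are placeholders standing in for real processing and real input.

```python
import os
import tempfile

def count_nonblank_lines(path):
    """Hypothetical processing step: count lines that are not blank."""
    count = 0
    # The file object is its own line iterator, so the whole file is
    # never loaded as one string; "with" closes the file on exit.
    with open(path, "r") as fp:
        for line in fp:
            if line.strip():
                count += 1
    return count

# Build a small throwaway input file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n\nbeta\ngamma\n")
    demo_path = tmp.name

print(count_nonblank_lines(demo_path))
os.remove(demo_path)
```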

Larry Bates

Jeremy Sanders wrote:
> I have a large string containing lines of text separated by '\n'. I'm
> currently using text.splitlines(True) to break the text into lines, and
> I'm iterating over the resulting list.
>
> This is very slow (when using 400000 lines!). Other than dumping the
> string to a file, and reading it back using the file iterator, is there a
> way to quickly iterate over the lines?
>
> I tried using newpos=text.find('\n', pos), and returning the chopped text
> text[pos:newpos+1], but this is much slower than splitlines.
>
> Any ideas?
>
> Thanks
>
> Jeremy

Jeremy Sanders
Guest
 
Posts: n/a
#5: Jul 18 '05

re: Splitting strings - by iterators?


On Fri, 25 Feb 2005 10:57:59 -0600, Larry Bates wrote:
> How did you get the string in memory in the first place?

The lines actually come from a generated Python script, acting as a saved
file format, something like:

interpret("""
lots of lines
""")
another_command()

Obviously this isn't the most efficient format, but it's nice to
encapsulate the data and the script into one file.
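
To make the format concrete, here is a hedged Python 3 sketch of how such a saved file might be loaded. Only the names interpret and another_command come from the example above. Their bodies, the received list and the use of exec are assumptions made purely for illustration, not the actual application code.

```python
import io

received = []  # collects lines, standing in for real processing

def interpret(text):
    """Hypothetical handler: walk the embedded data line by line."""
    for line in io.StringIO(text):
        received.append(line.rstrip("\n"))

def another_command():
    """Hypothetical follow-up command found in the saved file."""
    received.append("<another_command>")

# A tiny stand-in for the generated script.
saved_file = 'interpret("""\nline one\nline two\n""")\nanother_command()\n'

# Run the saved file with the two commands supplied as globals.
# Note that exec is not a sandbox; builtins remain available.
exec(saved_file, {"interpret": interpret, "another_command": another_command})
print(received)
```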

Jeremy

Francis Girard
Guest
 
Posts: n/a
#6: Jul 18 '05

re: Splitting strings - by iterators?


Hi,

Using finditer in the re module might help. I'm not sure whether it is
lazy or performant. Here's an example:

=== BEGIN SNAP
import re

reLn = re.compile(r"""[^\n]*(\n|$)""")

sStr = \
"""
This is a test string.
It is supposed to be big.
Oh well.
"""

for oMatch in reLn.finditer(sStr):
    print oMatch.group()
=== END SNAP
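
A possible Python 3 variant of the snippet above, offered as a sketch. The pattern in the post can also match the empty string at the very end of the input, so finditer hands back one extra empty "line". The pattern below is one way to avoid that. The names LINE_RE and iter_lines_re are made up here.

```python
import re

# Match either a run of non-newline characters followed by a newline
# (which also covers an empty line), or a final run with no trailing
# newline. Neither alternative can match the empty string.
LINE_RE = re.compile(r"[^\n]*\n|[^\n]+")

def iter_lines_re(text):
    """Lazily yield each line of text, keeping its line ending."""
    for match in LINE_RE.finditer(text):
        yield match.group()

sample = "\nThis is a test string.\nIt is supposed to be big.\nOh well.\n"
for line in iter_lines_re(sample):
    print(repr(line))
```

finditer returns an iterator, so matches are produced on demand. Speed relative to splitlines would still need timing.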

Regards,

Francis Girard

On Friday 25 February 2005 16:55, Jeremy Sanders wrote:
> I have a large string containing lines of text separated by '\n'. I'm
> currently using text.splitlines(True) to break the text into lines, and
> I'm iterating over the resulting list.
>
> This is very slow (when using 400000 lines!). Other than dumping the
> string to a file, and reading it back using the file iterator, is there a
> way to quickly iterate over the lines?
>
> I tried using newpos=text.find('\n', pos), and returning the chopped text
> text[pos:newpos+1], but this is much slower than splitlines.
>
> Any ideas?
>
> Thanks
>
> Jeremy

Larry Bates
Guest
 
Posts: n/a
#7: Jul 18 '05

re: Splitting strings - by iterators?


By putting them into another file you can just iterate
over the file object to solve your problem. I would
personally find it hard to work on a program that had
400,000 lines of data hard-coded into a structure like
this, but that's me.

-Larry


Jeremy Sanders wrote:
> On Fri, 25 Feb 2005 10:57:59 -0600, Larry Bates wrote:
>
>> How did you get the string in memory in the first place?
>
> They're actually from a generated python script, acting as a saved file
> format, something like:
>
> interpret("""
> lots of lines
> """)
> another_command()
>
> Obviously this isn't the most efficient format, but it's nice to
> encapsulate the data and the script into one file.
>
> Jeremy

John Machin
Guest
 
Posts: n/a
#8: Jul 18 '05

re: Splitting strings - by iterators?



Jeremy Sanders wrote:
> On Fri, 25 Feb 2005 17:14:24 +0100, Diez B. Roggisch wrote:
>
> > Maybe [c]StringIO can be of help. I don't know if its iterator is lazy. But
> > at least it has one, so you can try and see if it improves performance :)
>
> Excellent! I somehow missed that module. StringIO speeds up the iteration
> by a factor of 20!
>

Twenty?? StringIO.StringIO or cStringIO.StringIO???

I did some "timeit" tests using the code below, on 400,000 lines of 53
chars (uppercase + lowercase + '\n').

On my config (Python 2.4, Windows 2000, 1.4 GHz Athlon chip, not short
of memory), cStringIO took 0.18 seconds and the "hard way" took 0.91
seconds. StringIO.StringIO (not shown) took 2.9 seconds. FWIW, hoisting
the attribute look-up out of the loop (sfind = s.find) saves only about
0.1 seconds.

>python -m timeit -s "import itersplitlines as i; d = i.mk_data(400000)" "i.test_csio(d)"
10 loops, best of 3: 1.82e+005 usec per loop

>python -m timeit -s "import itersplitlines as i; d = i.mk_data(400000)" "i.test_gen(d)"
10 loops, best of 3: 9.06e+005 usec per loop

A few questions:
(1) What is your equivalent of the "hard way"? What [c]StringIO code
did you use?
(2) How did you measure the time?
(3) How long does it take to *compile* your 400,000-line Python script?

import cStringIO

def itersplitlines(s):
    # Yield each line of s, keeping the trailing '\n'.
    if not s:
        yield s
        return
    pos = 0
    sfind = s.find  # look the method up once, outside the loop
    epos = len(s)
    while pos < epos:
        newpos = sfind('\n', pos)
        if newpos == -1:
            # Last line has no trailing newline.
            yield s[pos:]
            return
        yield s[pos:newpos+1]
        pos = newpos+1

def test_gen(s):
    for z in itersplitlines(s):
        pass

def test_csio(s):
    for z in cStringIO.StringIO(s):
        pass

def mk_data(n):
    import string
    return (string.lowercase + string.uppercase + '\n') * n
