Bytes IT Community

Splitting strings - by iterators?

I have a large string containing lines of text separated by '\n'. I'm
currently using text.splitlines(True) to break the text into lines, and
I'm iterating over the resulting list.

This is very slow (when using 400000 lines!). Other than dumping the
string to a file, and reading it back using the file iterator, is there a
way to quickly iterate over the lines?

I tried using newpos=text.find('\n', pos), and returning the chopped text
text[pos:newpos+1], but this is much slower than splitlines.
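For concreteness, the two approaches being compared look roughly like this (a minimal sketch in modern syntax; `iter_lines` is an illustrative name, not from the original post):

```python
text = "line one\nline two\nline three\n"

# splitlines(True) keeps the trailing '\n' on each line,
# but builds the entire list in memory up front.
lines = text.splitlines(True)

# The str.find-based alternative scans for each newline
# and yields one line at a time.
def iter_lines(s):
    pos = 0
    while pos < len(s):
        newpos = s.find('\n', pos)
        if newpos == -1:        # final line lacks a trailing '\n'
            yield s[pos:]
            return
        yield s[pos:newpos + 1]
        pos = newpos + 1

assert list(iter_lines(text)) == lines
```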

Any ideas?

Thanks

Jeremy

Jul 18 '05 #1
7 Replies


Jeremy Sanders wrote:
I have a large string containing lines of text separated by '\n'. [...] Other than dumping the string to a file, and reading it back using the file iterator, is there a way to quickly iterate over the lines?


Maybe [c]StringIO can be of help. I don't know if its iterator is lazy, but
at least it has one, so you can try and see if it improves performance :)
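A sketch of that suggestion (in Python 2 the module was StringIO or cStringIO; io.StringIO is the modern equivalent, used here so the snippet runs as-is):

```python
import io

text = "first\nsecond\nthird\n"

# A StringIO object is file-like, so iterating over it yields
# one line at a time without pre-splitting the whole string.
collected = [line for line in io.StringIO(text)]
assert collected == ["first\n", "second\n", "third\n"]
```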
--
Regards,

Diez B. Roggisch
Jul 18 '05 #2

On Fri, 25 Feb 2005 17:14:24 +0100, Diez B. Roggisch wrote:
Maybe [c]StringIO can be of help. I don't know if it's iterator is lazy. But
at least it has one, so you can try and see if it improves performance :)


Excellent! I somehow missed that module. StringIO speeds up the iteration
by a factor of 20!

Thanks

Jeremy
Jul 18 '05 #3

Jeremy,

How did you get the string in memory in the first place?
If you read it from a file, perhaps you should change to
reading it from the file a line at the time and use
file.readline as your iterator.

fp = file(inputfile, 'r')
for line in fp:
    ...do your processing...
fp.close()

I don't think I would ever read 400,000 lines as a single
string and then split it. Just a suggestion.

Larry Bates

Jeremy Sanders wrote:
I have a large string containing lines of text separated by '\n'. [...] Other than dumping the string to a file, and reading it back using the file iterator, is there a way to quickly iterate over the lines?

Jul 18 '05 #4

On Fri, 25 Feb 2005 10:57:59 -0600, Larry Bates wrote:
How did you get the string in memory in the first place?


They're actually from a generated python script, acting as a saved file
format, something like:

interpret("""
lots of lines
""")
another_command()

Obviously this isn't the most efficient format, but it's nice to
encapsulate the data and the script into one file.
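A hedged sketch of how such an `interpret` command could consume the embedded block lazily rather than splitting it up front (`interpret` here is an illustrative stand-in, not the actual implementation; io.StringIO plays the role the thread's [c]StringIO played in Python 2):

```python
import io

def interpret(block):
    # Walk the embedded text line by line via a file-like wrapper
    # instead of materialising a 400,000-element list first.
    processed = []
    for line in io.StringIO(block):
        processed.append(line.rstrip('\n'))   # stand-in for real work
    return processed

assert interpret("lots of lines\nmore lines\n") == ["lots of lines", "more lines"]
```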

Jeremy

Jul 18 '05 #5

Hi,

Using finditer in the re module might help. I'm not sure whether it is lazy or
performant. Here's an example:

=== BEGIN SNAP
import re

reLn = re.compile(r"""[^\n]*(\n|$)""")

sStr = \
"""
This is a test string.
It is supposed to be big.
Oh well.
"""

for oMatch in reLn.finditer(sStr):
    print oMatch.group()
=== END SNAP
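On the laziness question: finditer does return an iterator, not a list, so matches are produced on demand. A quick check, using the same idea with a non-capturing group in place of the unused capture:

```python
import re

reLn = re.compile(r"[^\n]*(?:\n|$)")
matches = reLn.finditer("a\nbb\n")

# finditer returns a lazy iterator object...
assert iter(matches) is matches
# ...that yields one whole line (with its newline) per match.
assert next(matches).group() == "a\n"
assert next(matches).group() == "bb\n"
```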

Regards,

Francis Girard

On Friday 25 February 2005 16:55, Jeremy Sanders wrote:
I have a large string containing lines of text separated by '\n'. [...] Other than dumping the string to a file, and reading it back using the file iterator, is there a way to quickly iterate over the lines?


Jul 18 '05 #6

By putting them into another file you can just use the
.readline iterator on the file object to solve your
problem. I would personally find it hard to work with
a program that had 400,000 lines of data hard-coded
into a structure like this, but that's me.

-Larry
Jeremy Sanders wrote:
On Fri, 25 Feb 2005 10:57:59 -0600, Larry Bates wrote:

How did you get the string in memory in the first place?

They're actually from a generated python script, acting as a saved file
format, something like:

interpret("""
lots of lines
""")
another_command()

Obviously this isn't the most efficient format, but it's nice to
encapsulate the data and the script into one file.

Jeremy

Jul 18 '05 #7


Jeremy Sanders wrote:
On Fri, 25 Feb 2005 17:14:24 +0100, Diez B. Roggisch wrote:
Maybe [c]StringIO can be of help. I don't know if it's iterator is lazy. But at least it has one, so you can try and see if it improves performance :)

Excellent! I somehow missed that module. StringIO speeds up the iteration by a factor of 20!

Twenty?? StringIO.StringIO or cStringIO.StringIO???

I did some "timeit" tests using the code below, on 400,000 lines of 53
chars (uppercase + lowercase + '\n').

On my config (Python 2.4, Windows 2000, 1.4 GHz Athlon chip, not short
of memory), cStringIO took 0.18 seconds and the "hard way" took 0.91
seconds. Stringio (not shown) took 2.9 seconds. FWIW, moving an
attribute look-up in the (sfind = s.find) saves only about 0.1 seconds.
python -m timeit -s "import itersplitlines as i; d = i.mk_data(400000)" "i.test_csio(d)"
10 loops, best of 3: 1.82e+005 usec per loop
python -m timeit -s "import itersplitlines as i; d = i.mk_data(400000)" "i.test_gen(d)"
10 loops, best of 3: 9.06e+005 usec per loop

A few questions:
(1) What is your equivalent of the "hard way"? What [c]StringIO code
did you use?
(2) How did you measure the time?
(3) How long does it take to *compile* your 400,000-line Python script?

import cStringIO

def itersplitlines(s):
    if not s:
        yield s
        return
    pos = 0
    sfind = s.find
    epos = len(s)
    while pos < epos:
        newpos = sfind('\n', pos)
        if newpos == -1:
            yield s[pos:]
            return
        yield s[pos:newpos+1]
        pos = newpos+1

def test_gen(s):
    for z in itersplitlines(s):
        pass

def test_csio(s):
    for z in cStringIO.StringIO(s):
        pass

def mk_data(n):
    import string
    return (string.lowercase + string.uppercase + '\n') * n

Jul 18 '05 #8

This discussion thread is closed.