split large file by string/regex

Martin Dieringer

I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.

Jul 18 '05 #1

Steve Holden

Martin Dieringer wrote:

I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.

Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.
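
For a fixed delimiter (as opposed to a general regex), the overlapping-chunk scan described above can be sketched fairly compactly. This is only a minimal Python 3 sketch under stated assumptions: `find_all`, `chunk_size`, and the carry-over logic are illustrative names and choices, not code from this thread, and the approach relies on the delimiter having a known, fixed length.

```python
import io

def find_all(stream, needle, chunk_size=64 * 1024):
    """Yield the absolute offset of each non-overlapping occurrence of
    `needle` (bytes) in a binary stream, reading chunk_size bytes at a time."""
    if not needle:
        raise ValueError("needle must not be empty")
    keep = len(needle) - 1   # tail carried over so a match spanning two reads is seen
    buf = b""
    base = 0                 # absolute offset of buf[0] in the stream
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        search_from = 0
        while True:
            pos = buf.find(needle, search_from)
            if pos == -1:
                break
            yield base + pos
            search_from = pos + len(needle)
        # Drop everything that can no longer take part in a future match.
        cut = max(search_from, len(buf) - keep)
        base += cut
        buf = buf[cut:]

data = b"01234abc5678abc901234\r\n567ab890abc\r\n"
print(list(find_all(io.BytesIO(data), b"abc", chunk_size=4)))  # [5, 12, 31]
```

The tiny `chunk_size` in the usage line is there only to force matches to straddle read boundaries; a real file would use the default or larger.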

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119

Jul 18 '05 #2

Jason Rennie

On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:

I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?

If the pattern is contained within a single line, do something like this:

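import re
myre = re.compile(r'foo')
fh = open(f)           # f: path of the input file
fh1 = open(f1, 'w')    # f1: path of the first output file
s = fh.readline()
# Copy lines to the first file until a line matches (or EOF is reached).
while s and not myre.search(s):
    fh1.write(s)
    s = fh.readline()
fh1.close()
fh2 = open(f2, 'w')    # f2: path of the second output file (name assumed here)
# Copy the matching line and everything after it to the second file.
while s:
    fh2.write(s)
    s = fh.readline()
fh2.close()
fh.close()

I'm doing this off the top of my head, so this code almost certainly
has bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. Oh, and if there's a
chance that the pattern does not appear in the file, you need the
check for EOF in the first while loop (the "s and" test).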

Jason

Jul 18 '05 #3

Diez B. Roggisch

> Depends on your definition of "simple", I suppose. The problem with

*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.

At least spark operates on whole strings if used as lexer/tokenizer - you
can of course feed it a lazy sequence of tokens by using a generator - but
that's up to you.

--
Regards,

Diez B. Roggisch

Jul 18 '05 #4

Martin Dieringer

Steve Holden <st***@holdenweb.com> writes:

Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.

Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a
sequence of overlapping chunks to make sure that a regex could pick up
all matches. For me that would be more complex than using a lexer,
given the excellent range of modules such as SPARK and PLY, to mention
but two.

Yes, lexing would be the simplest, but PLY also can't read from streams,
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any lib.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...

m.

Jul 18 '05 #5

Martin Dieringer

Jason Rennie <jr*****@csail.mit.edu> writes:

On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?

If the pattern is contained within a single line, do something like this:

Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.

Jul 18 '05 #6

Bengt Richter

On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <di******@zedat.fu-berlin.de> wrote:

Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?

If the pattern is contained within a single line, do something like this:

Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.

Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of

>>> '1231xxx45646xxx45646xxx78'.split('xxx')
['1231', '45646', '45646', '78']

or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

['1231xxx', '45646xxx', '45646xxx', '78']

??
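
For a string small enough to hold in memory, the three result shapes listed above can be produced as follows. This is a small Python 3 illustration only; the variable names are assumptions made for the example, and it says nothing about how to do the same on a file that does not fit in memory.

```python
import re

s = '1231xxx45646xxx45646xxx78'
sep = 'xxx'

# 1. Delimiter dropped.
dropped = s.split(sep)
# ['1231', '45646', '45646', '78']

# 2. Delimiter kept as its own item: a capturing group makes re.split return it.
separate = re.split('(' + re.escape(sep) + ')', s)
# ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

# 3. Delimiter attached to the piece before it: pair each piece with the
#    delimiter that followed it (the last piece has none).
attached = [piece + delim
            for piece, delim in zip(separate[0::2], separate[1::2] + [''])]
# ['1231xxx', '45646xxx', '45646xxx', '78']

print(dropped, separate, attached, sep='\n')
```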

Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):

--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break

    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------

Extent of testing follows :-)

>>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
----------------------------------------
01234abc5678abc901234
567ab890abc
----------------------------------------
>>> import ut.splitfile
>>> ut.splitfile.test('splitfile.txt', 'abc')
'01234'
'abc'
'5678'
'abc'
'901234\r\n567ab890'
'abc'
'\r\n'
>>> ut.splitfile.test('splitfile.txt', '012')
''
'012'
'34abc5678abc9'
'012'
'34\r\n567ab890abc\r\n'
>>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
>>> it.next
<method-wrapper object at 0x02EF1C6C>
>>> it.next()
'01234abc5678abc901234\r\n567'
>>> it.next()
'ab89'
>>> it.next()
'0abc\r\n'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).

Regards,
Bengt Richter

Jul 18 '05 #7

Denis S. Otkidach

On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:

I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.

re module works fine with mmap-ed file, so no need to read it into memory.
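
The suggestion in this post, running the re module directly over an mmap-ed file, might look roughly like this in Python 3. It is a hedged sketch: `match_spans` is a hypothetical helper name, the pattern must be a bytes pattern because the mapping exposes bytes, and the result list grows with the number of matches (not with the file size).

```python
import mmap
import re

def match_spans(path, pattern):
    """Return (start, end) byte offsets of every match of a bytes regex
    in the file at `path`, without reading the file into a string."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return [m.span() for m in re.finditer(pattern, mm)]
        finally:
            mm.close()
```

Called as `match_spans('big.bin', rb'x{3}')` (file name assumed), it returns the offsets between which the pieces lie; the operating system pages the file in and out as the regex engine walks through it.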

--
Denis S. Otkidach
http://www.python.ru/ [ru]

Jul 18 '05 #8

Martin Dieringer

"Denis S. Otkidach" <od*@strana.ru> writes:

On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer, but maybe there is something simpler?
> thanks
> m.

Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all

re module works fine with mmap-ed file, so no need to read it into memory.

thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read
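
A plan along these lines (collect the delimiter offsets with mmap's `find`, then fetch each piece with `seek` and `read`) could be sketched in Python 3 as below. The function names `delimiter_offsets` and `read_pieces` are assumptions for illustration; the thread does not show the code that was actually written.

```python
import mmap

def delimiter_offsets(path, delim):
    """Return the offsets of all non-overlapping occurrences of `delim` (bytes)."""
    if not delim:
        raise ValueError("delim must not be empty")
    offsets = []
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(delim)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(delim, pos + len(delim))
    return offsets

def read_pieces(path, delim):
    """Yield the pieces between delimiters, one at a time, via seek/read."""
    offsets = delimiter_offsets(path, delim)
    with open(path, 'rb') as f:
        start = 0
        for off in offsets:
            f.seek(start)
            yield f.read(off - start)
            start = off + len(delim)
        f.seek(start)
        yield f.read()   # whatever follows the last delimiter
```

Only one piece is held in memory at a time, so this works as long as each individual piece fits in memory.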

m.

Jul 18 '05 #9

William Park

Martin Dieringer <di******@zedat.fu-berlin.de> wrote:

Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?

If the pattern is contained within a single line, do something like this:

Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

man strings (-o option)

--
William Park <op**********@yahoo.ca>
Linux solution for data management and processing.

Jul 18 '05 #10

Martin Dieringer

William Park <op**********@yahoo.ca> writes:

Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
Jason Rennie <jr*****@csail.mit.edu> writes:
> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>> I am trying to split a file by a fixed string.
>> The file is too large to just read it into a string and split this.
>> I could probably use a lexer, but maybe there is something simpler?
>
> If the pattern is contained within a single line, do something like this:

Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

man strings (-o option)

this doesn't make sense at all

m.

Jul 18 '05 #11

Denis S. Otkidach

On Mon, 22 Nov 2004 20:48:16 +0100
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:

"Denis S. Otkidach" <od*@strana.ru> writes:

[...]

re module works fine with mmap-ed file, so no need to read it into
memory.

thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read

mmap-ed files also support subscripting and slicing. I guess
mmfile[start:stop] would be more readable.
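
Using slices of the mapping instead of seek/read, writing each piece to its own file might look like this in Python 3. This is a sketch under assumptions: `split_to_files` and the `partNNNN` output naming are invented for the example, and each slice copies one piece into memory, so individual pieces still need to fit.

```python
import mmap
import os

def split_to_files(path, delim, out_dir):
    """Write each `delim`-separated piece of `path` to out_dir/part0000,
    part0001, ... and return the list of paths written."""
    if not delim:
        raise ValueError("delim must not be empty")
    written = []
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                stop = mm.find(delim, start)
                last = stop == -1
                if last:
                    stop = len(mm)
                out_path = os.path.join(out_dir, 'part%04d' % len(written))
                with open(out_path, 'wb') as out:
                    out.write(mm[start:stop])   # the slice is a bytes copy of one piece
                written.append(out_path)
                if last:
                    break
                start = stop + len(delim)
    return written
```

As with `str.split`, a file that ends with the delimiter produces an empty final piece.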

--
Denis S. Otkidach
http://www.python.ru/ [ru]

Jul 18 '05 #12

Martin Dieringer

"Denis S. Otkidach" <od*@strana.ru> writes:

On Mon, 22 Nov 2004 20:48:16 +0100
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
"Denis S. Otkidach" <od*@strana.ru> writes:

[...]
> re module works fine with mmap-ed file, so no need to read it into
> memory.
>

thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read

mmap-ed files also support subscripting and slicing. I guess
mmfile[start:stop] would be more readable.

yes, even better :-)

m.

Jul 18 '05 #13
