split large file by string/regex


I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.
Jul 18 '05 #1
Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?
thanks
m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.
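
The overlapping-chunk idea can be sketched for a fixed string roughly as follows. This is only an illustration: find_all and its parameters are hypothetical names, not something from SPARK, PLY, or the standard library. Each read is searched together with the last len(pattern) - 1 bytes of the previous data, so a match that straddles a chunk boundary is still seen.

```python
import io

def find_all(stream, pattern, chunksize=64 * 1024):
    """Yield the absolute offset of every occurrence of `pattern` in `stream`.

    `stream` is any binary file-like object and `pattern` a non-empty bytes
    object (both are assumptions of this sketch).
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    keep = len(pattern) - 1   # number of bytes carried over between reads
    tail = b""                # overlap kept from the previous buffer
    base = 0                  # absolute offset of tail[0] in the stream
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        buf = tail + chunk
        pos = buf.find(pattern)
        while pos >= 0:
            yield base + pos
            pos = buf.find(pattern, pos + 1)
        # keep only the bytes that could begin a boundary-spanning match
        cut = max(len(buf) - keep, 0)
        tail = buf[cut:]
        base += cut

data = io.BytesIO(b"01234abc5678abc901234\r\n567ab890abc\r\n")
print(list(find_all(data, b"abc", chunksize=4)))
```

Because the carried-over tail is one byte shorter than the pattern, no match can lie entirely inside it, so nothing is reported twice.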

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
Jul 18 '05 #2
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:

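import re
myre = re.compile(r'foo')
fh = open(f)
fh1 = open(f1,'w')
s = fh.readline()
# copy lines to the first file until one matches the pattern
while not myre.search(s):
    fh1.write(s)
    s = fh.readline()
fh1.close()
# the matching line and everything after it go to a second file (f2)
fh2 = open(f2,'w')
while s:
    fh2.write(s)
    s = fh.readline()
fh2.close()
fh.close()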

I'm doing this off the top of my head, so this code almost certainly
has bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. Oh, if there's a
chance that the pattern does not appear in the file, you'll need to
check for eof in the first while loop.
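
For what it's worth, the same line-by-line idea with the end-of-file case handled might look like the sketch below. It is written for Python 3, and split_at_first_match is a made-up name; the function takes already-open binary file objects so that it can be tried without touching the disk.

```python
import re

def split_at_first_match(src, before, after, pattern):
    """Copy lines of `src` to `before` until a line matches, then to `after`.

    Returns True if the pattern was found. Only one line is held in memory
    at a time. `pattern` must be a bytes pattern because the files are binary.
    """
    regex = re.compile(pattern)
    out = before
    found = False
    for line in src:          # iteration simply stops at EOF
        if not found and regex.search(line):
            found = True
            out = after       # the matching line starts the second file
        out.write(line)
    return found
```

If the pattern never appears, everything ends up in the first file and the function returns False.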

Jason
Jul 18 '05 #3
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.


At least spark operates on whole strings if used as lexer/tokenizer - you
can of course feed it a lazy sequence of tokens by using a generator - but
that's up to you.

--
Regards,

Diez B. Roggisch
Jul 18 '05 #4
Steve Holden <st***@holdenweb.com> writes:
Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?
thanks
m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a
sequence of overlapping chunks to make sure that a regex could pick up
all matches. For me that would be more complex than using a lexer,
given the excellent range of modules such as SPARK and PLY, to mention
but two.


yes lexing would be the simplest, but PLY also can't read from streams
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any lib.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...

m.
Jul 18 '05 #5
Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.
Jul 18 '05 #6
On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.

Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of

>>> '1231xxx45646xxx45646xxx78'.split('xxx')
['1231', '45646', '45646', '78']

or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

['1231xxx', '45646xxx', '45646xxx', '78']

??

Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):

--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break

    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------

Extent of testing follows :-)

>>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
----------------------------------------
01234abc5678abc901234
567ab890abc
----------------------------------------
>>> import ut.splitfile
>>> ut.splitfile.test('splitfile.txt', 'abc')
'01234'
'abc'
'5678'
'abc'
'901234\r\n567ab890'
'abc'
'\r\n'
>>> ut.splitfile.test('splitfile.txt', '012')
''
'012'
'34abc5678abc9'
'012'
'34\r\n567ab890abc\r\n'
>>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
>>> it.next
<method-wrapper object at 0x02EF1C6C>
>>> it.next()
'01234abc5678abc901234\r\n567'
>>> it.next()
'ab89'
>>> it.next()
'0abc\r\n'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
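
For anyone on Python 3, a port of the generator could look roughly like the sketch below. It is an adaptation rather than the posted code: the file is read as bytes, so splitstr has to be a non-empty bytes object, and the file is closed once the generator is exhausted.

```python
def splitfile(path, splitstr, chunksize=64 * 1024):
    """Yield the pieces of the file at `path`, alternating with `splitstr`.

    Sketch of a Python 3 port; `splitstr` is assumed to be non-empty bytes.
    """
    with open(path, "rb") as f:
        buf = b""
        # iter(callable, sentinel) keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunksize), b""):
            buf += chunk
            start = 0
            while True:
                end = buf.find(splitstr, start)
                if end < 0:
                    break
                yield buf[start:end]      # the piece before the delimiter
                yield splitstr            # the delimiter itself
                start = end + len(splitstr)
            buf = buf[start:]             # keep the unmatched remainder
        yield buf                         # whatever follows the last delimiter
```

As in the original, the unmatched remainder is searched again from its beginning after every read, so a very long stretch without a delimiter combined with a small chunksize gets slow.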

Regards,
Bengt Richter
Jul 18 '05 #7
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?
thanks
m.
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.


The re module works fine with an mmap-ed file, so there is no need to read it into memory.


--
Denis S. Otkidach
http://www.python.ru/ [ru]
Jul 18 '05 #8
"Denis S. Otkidach" <od*@strana.ru> writes:
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer but there maybe anything more simple?
> thanks
> m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all


re module works fine with mmap-ed file, so no need to read it into memory.


thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read
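
A minimal sketch of that plan, assuming Python 3 and a non-empty file (mmap refuses to map an empty one). The names find_offsets and read_pieces are invented for the illustration.

```python
import mmap

def find_offsets(path, delimiter):
    """Return the byte offset of every occurrence of `delimiter` in the file."""
    offsets = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(delimiter)
            while pos != -1:
                offsets.append(pos)
                # continue searching after the delimiter just found
                pos = mm.find(delimiter, pos + len(delimiter))
    return offsets

def read_pieces(path, delimiter):
    """Yield the pieces between delimiters using file.seek and file.read."""
    offsets = find_offsets(path, delimiter)
    with open(path, "rb") as f:
        start = 0
        for off in offsets:
            f.seek(start)
            yield f.read(off - start)
            start = off + len(delimiter)
        f.seek(start)
        yield f.read()        # the piece after the last delimiter
```

Only the offsets are kept in memory, and each piece is read from disk when the caller asks for it.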

m.
Jul 18 '05 #9
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...


man strings (-o option)

--
William Park <op**********@yahoo.ca>
Linux solution for data management and processing.
Jul 18 '05 #10
William Park <op**********@yahoo.ca> writes:
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
Jason Rennie <jr*****@csail.mit.edu> writes:
> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>> I am trying to split a file by a fixed string.
>> The file is too large to just read it into a string and split this.
>> I could probably use a lexer but there maybe anything more simple?
>
> If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...


man strings (-o option)

this doesn't make sense at all

m.
Jul 18 '05 #11
On Mon, 22 Nov 2004 20:48:16 +0100
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
"Denis S. Otkidach" <od*@strana.ru> writes:

[...]
re module works fine with mmap-ed file, so no need to read it into
memory.


thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks them via
file.seek and file.read


mmap-ed files also support subscripting and slicing. I guess
mmfile[start:stop] would be more readable.
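
Putting the two hints together (re on an mmap-ed file, then slicing), writing each piece to its own file might be sketched like this in Python 3. The function name and the output naming scheme are assumptions made for the example; the pattern has to be a bytes pattern and the input file must not be empty.

```python
import mmap
import re

def split_to_files(path, pattern, out_template="{}.part{:04d}"):
    """Write each piece between regex matches to its own file; return the names."""
    names = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # (start, end) of every match; the regex scans the mapping directly
        bounds = [(m.start(), m.end()) for m in re.finditer(pattern, mm)]
        bounds.append((len(mm), len(mm)))   # sentinel for the final piece
        start = 0
        for n, (stop, next_start) in enumerate(bounds):
            name = out_template.format(path, n)
            with open(name, "wb") as out:
                out.write(mm[start:stop])   # slicing copies only this piece
            names.append(name)
            start = next_start
    return names
```

Each slice is a bytes copy of a single piece, so the peak memory use is roughly the size of the largest piece rather than of the whole file.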

--
Denis S. Otkidach
http://www.python.ru/ [ru]
Jul 18 '05 #12
"Denis S. Otkidach" <od*@strana.ru> writes:
On Mon, 22 Nov 2004 20:48:16 +0100
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
"Denis S. Otkidach" <od*@strana.ru> writes:

[...]
> re module works fine with mmap-ed file, so no need to read it into
> memory.
>


thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks them via
file.seek and file.read


mmap-ed files also support subscription and slicing. I guess
mmfile[start:stop] would more readable.


yes, even better :-)

m.
Jul 18 '05 #13
