I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split it.
I could probably use a lexer, but is there maybe something simpler?
Thanks,
m.
Martin Dieringer wrote:
> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer but there maybe anything more simple?
> thanks
> m.
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.
regards
Steve
-- http://www.holdenweb.com http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer but there maybe anything more simple?
If the pattern is contained within a single line, do something like this:
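    import re

    myre = re.compile(r'foo')
    fh = open(f)             # f: path of the input file
    fh1 = open(f1, 'w')      # f1: path of the first output file
    s = fh.readline()
    # Copy lines until the first one that matches the pattern.
    # (No EOF check here: this loops forever if nothing matches.)
    while not myre.search(s):
        fh1.write(s)
        s = fh.readline()
    fh1.close()
    fh2 = open(f2, 'w')      # f2: second output file (name assumed; the post reused f1)
    # Copy the matching line and everything after it; readline() returns '' at EOF.
    while s:
        fh2.write(s)
        s = fh.readline()
    fh2.close()
    fh.close()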
I'm doing this off the top of my head, so this code almost certainly
has bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. Oh, if there's a
chance that the pattern does not appear in the file, you'll need to
check for eof in the first while loop.
Jason
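The EOF check mentioned above comes for free if the loop iterates over the file object instead of calling readline(). A minimal Python 3 sketch, assuming the separator fits on one line; the function name, the three path arguments and the bytes pattern are made up for illustration and are not from Jason's post:

```python
import re


def split_at_first_match(src_path, head_path, tail_path, pattern):
    """Write lines before the first matching line to head_path, and the
    matching line plus everything after it to tail_path.

    Returns True if a matching line was found, False otherwise.
    """
    regex = re.compile(pattern)      # pattern is bytes, e.g. rb'foo'
    found = False
    with open(src_path, 'rb') as src, \
         open(head_path, 'wb') as head, \
         open(tail_path, 'wb') as tail:
        out = head
        for line in src:             # only one line in memory; stops at EOF
            if not found and regex.search(line):
                found = True
                out = tail           # switch output at the first match
            out.write(line)
    return found
```

If the pattern never appears, everything ends up in the first output file and the function returns False.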
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all matches. For me that would be more complex than using a lexer, given the excellent range of modules such as SPARK and PLY, to mention but two.
At least spark operates on whole strings if used as lexer/tokenizer - you
can of course feed it a lazy sequence of tokens by using a generator - but
that's up to you.
--
Regards,
Diez B. Roggisch
Steve Holden <st***@holdenweb.com> writes:
> Martin Dieringer wrote:
>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split this. I could probably use a lexer but there maybe anything more simple? thanks m.
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all matches. For me that would be more complex than using a lexer, given the excellent range of modules such as SPARK and PLY, to mention but two.
Yes, lexing would be the simplest, but PLY also can't read from streams,
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any lib.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...
m.
Jason Rennie <jr*****@csail.mit.edu> writes:
> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split this. I could probably use a lexer but there maybe anything more simple?
> If the pattern is contained within a single line, do something like this:
Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...
m.
On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
> Jason Rennie <jr*****@csail.mit.edu> writes:
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split this. I could probably use a lexer but there maybe anything more simple?
>> If the pattern is contained within a single line, do something like this:
> Hmm it's binary data, I can't tell how long lines would be. OTOH a line would certainly contain the pattern as it has no \n in it... and the lines probably wouldn't be too large for memory...
> m.
Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of
 >>> '1231xxx45646xxx45646xxx78'.split('xxx')
 ['1231', '45646', '45646', '78']
or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']
or maybe
['1231xxx', '45646xxx', '45646xxx', '78']
??
Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):
--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break
    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------
Extent of testing follows :-)
 >>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
 ----------------------------------------
 01234abc5678abc901234
 567ab890abc
 ----------------------------------------
 >>> import ut.splitfile
 >>> ut.splitfile.test('splitfile.txt', 'abc')
 '01234'
 'abc'
 '5678'
 'abc'
 '901234\r\n567ab890'
 'abc'
 '\r\n'
 >>> ut.splitfile.test('splitfile.txt', '012')
 ''
 '012'
 '34abc5678abc9'
 '012'
 '34\r\n567ab890abc\r\n'
 >>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
 >>> it.next
 <method-wrapper object at 0x02EF1C6C>
 >>> it.next()
 '01234abc5678abc901234\r\n567'
 >>> it.next()
 'ab89'
 >>> it.next()
 '0abc\r\n'
 >>> it.next()
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 StopIteration
(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
Regards,
Bengt Richter
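For readers on Python 3, a rough port of the generator above might look like this. It is a sketch rather than Bengt's code: it works on bytes, the name splitfile_bytes and the parameter name sep are made up here, and it keeps the same convention of yielding the separator between the pieces. It rescans the buffered tail on every read, so it is only reasonable when separators are not extremely far apart.

```python
def splitfile_bytes(path, sep, chunksize=64 * 1024):
    """Yield the pieces of a binary file split on sep, with sep itself
    yielded between them, reading at most chunksize bytes at a time."""
    if not sep:
        raise ValueError('empty separator')
    buf = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:                # b'' means end of file
                break
            buf += chunk
            start = 0
            while True:
                end = buf.find(sep, start)
                if end < 0:
                    break
                yield buf[start:end]     # piece before the separator
                yield sep
                start = end + len(sep)
            # Keep the unmatched tail; it may end in a partial separator
            # that the next chunk completes.
            buf = buf[start:]
    yield buf                            # whatever follows the last separator
```

Usage is the same as in the session above, except that the separator is given as bytes, e.g. splitfile_bytes('splitfile.txt', b'abc').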
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split this. I could probably use a lexer but there maybe anything more simple? thanks m.
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all
re module works fine with mmap-ed file, so no need to read it into memory.
> matches. For me that would be more complex than using a lexer, given the excellent range of modules such as SPARK and PLY, to mention but two.
--
Denis S. Otkidach http://www.python.ru/ [ru]
"Denis S. Otkidach" <od*@strana.ru> writes: On Mon, 22 Nov 2004 08:53:02 -0500 Steve Holden <st***@holdenweb.com> wrote:
>>> I am trying to split a file by a fixed string.
>>> The file is too large to just read it into a string and split this.
>>> I could probably use a lexer but there maybe anything more simple?
>>> thanks
>>> m.
>> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all
> re module works fine with mmap-ed file, so no need to read it into memory.
Thank you, this is the solution!
Now I can use mmap.find to get all locations and then read the chunks via
file.seek and file.read
m.
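The plan described above could be sketched in Python 3 roughly as follows. The names find_all and read_pieces are hypothetical, the separator is assumed to be a non-empty bytes string, and the file is assumed to be non-empty (mmap refuses to map an empty file) and small enough to fit in the process's address space.

```python
import mmap


def find_all(path, sep):
    """Return the byte offsets of every non-overlapping occurrence of sep."""
    offsets = []
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(sep)
        while pos != -1:
            offsets.append(pos)
            pos = mm.find(sep, pos + len(sep))   # resume after this match
    return offsets


def read_pieces(path, sep):
    """Yield the pieces between occurrences of sep, using seek and read."""
    offsets = find_all(path, sep)
    with open(path, 'rb') as f:
        start = 0
        for pos in offsets:
            f.seek(start)
            yield f.read(pos - start)
            start = pos + len(sep)               # skip the separator itself
        f.seek(start)
        yield f.read()                           # tail after the last separator
```

Only the list of offsets and one piece at a time are held in memory; the mapping itself is paged in by the operating system as it is searched.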
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
> Jason Rennie <jr*****@csail.mit.edu> writes:
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split this. I could probably use a lexer but there maybe anything more simple?
>> If the pattern is contained within a single line, do something like this:
> Hmm it's binary data, I can't tell how long lines would be. OTOH a line would certainly contain the pattern as it has no \n in it... and the lines probably wouldn't be too large for memory...
man strings (-o option)
--
William Park <op**********@yahoo.ca>
Linux solution for data management and processing.
William Park <op**********@yahoo.ca> writes:
> Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
>> Jason Rennie <jr*****@csail.mit.edu> writes:
>>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>>> I am trying to split a file by a fixed string.
>>>> The file is too large to just read it into a string and split this.
>>>> I could probably use a lexer but there maybe anything more simple?
>>> If the pattern is contained within a single line, do something like this:
>> Hmm it's binary data, I can't tell how long lines would be. OTOH a line would certainly contain the pattern as it has no \n in it... and the lines probably wouldn't be too large for memory...
> man strings (-o option)
this doesn't make sense at all
m.
On Mon, 22 Nov 2004 20:48:16 +0100
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
> "Denis S. Otkidach" <od*@strana.ru> writes:
>> [...] re module works fine with mmap-ed file, so no need to read it into memory.
> thank you, this is the solution! Now I can mmap.find all locations and then read the chunks them via file.seek and file.read
mmap-ed files also support subscripting and slicing. I guess
mmfile[start:stop] would be more readable.
--
Denis S. Otkidach http://www.python.ru/ [ru]
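Putting the two suggestions together (re on an mmap-ed file, plus slicing), a Python 3 sketch that writes each piece to its own file might look like this. The function names, the output naming scheme (prefix plus a three-digit counter) and the bytes pattern are assumptions for illustration; the pattern should not be able to match the empty string, and each piece is assumed to fit in memory since a slice of an mmap is copied into a bytes object.

```python
import mmap
import re


def _write_piece(out_prefix, index, data):
    """Write one piece to a file named like 'prefix.000' (hypothetical scheme)."""
    with open('%s.%03d' % (out_prefix, index), 'wb') as out:
        out.write(data)


def split_to_files(path, pattern, out_prefix):
    """Split the file at every match of pattern (bytes) and write the pieces
    to numbered files. Returns the number of files written."""
    regex = re.compile(pattern)
    count = 0
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        for m in regex.finditer(mm):     # re accepts the mmap as a bytes-like object
            _write_piece(out_prefix, count, mm[start:m.start()])
            count += 1
            start = m.end()
        _write_piece(out_prefix, count, mm[start:])   # tail after the last match
        count += 1
    return count
```

For a fixed string, passing re.escape(separator) as the pattern avoids surprises if the separator contains regex metacharacters.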
"Denis S. Otkidach" <od*@strana.ru> writes:
> On Mon, 22 Nov 2004 20:48:16 +0100 Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
>> "Denis S. Otkidach" <od*@strana.ru> writes:
>>> [...] re module works fine with mmap-ed file, so no need to read it into memory.
>> thank you, this is the solution! Now I can mmap.find all locations and then read the chunks them via file.seek and file.read
> mmap-ed files also support subscription and slicing. I guess mmfile[start:stop] would more readable.
yes, even better :-)
m.