split large file by string/regex


I am trying to split a file by a fixed string.
The file is too large to read into a single string and split.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.
Jul 18 '05 #1
Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?
thanks
m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.
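
For a fixed string (as opposed to a general regex), the overlapping-chunk scan can be sketched roughly as below. This is only an illustration in Python 3: the function name find_all and its parameters are made up for this example, and it assumes the pattern is a non-empty bytes object. Each window carries over the last len(pattern) - 1 bytes of the previous one, so a match straddling a chunk boundary is still seen exactly once.

```python
def find_all(path, pattern, chunksize=64 * 1024):
    """Yield the file offset of every occurrence of pattern (bytes).

    Hypothetical helper: overlapping occurrences are reported too.
    """
    if not pattern:
        raise ValueError('pattern must not be empty')
    overlap = len(pattern) - 1   # most bytes of a match that can precede a boundary
    tail = b''                   # carried-over end of the previous window
    base = 0                     # file offset of the first byte of tail
    with open(path, 'rb') as fh:
        while True:
            chunk = fh.read(chunksize)
            if not chunk:
                break
            window = tail + chunk
            pos = window.find(pattern)
            while pos >= 0:
                yield base + pos
                pos = window.find(pattern, pos + 1)
            # keep only the bytes that could still start a match
            keep = min(overlap, len(window))
            base += len(window) - keep
            tail = window[len(window) - keep:]
```

Because the carried-over tail is always shorter than the pattern, no match can lie entirely inside it, so nothing is reported twice. A regex with unbounded match length does not have such a simple overlap size, which is where the extra complexity comes from.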

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
Jul 18 '05 #2
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:

import re

# f is the input path; f1 and f2 are the two output paths (placeholder names)
myre = re.compile(r'foo')
fh = open(f)
fh1 = open(f1, 'w')
s = fh.readline()
# copy lines to the first file until one matches the pattern (or EOF)
while s and not myre.search(s):
    fh1.write(s)
    s = fh.readline()
fh1.close()
fh2 = open(f2, 'w')
# the matching line and everything after it go to the second file
while s:
    fh2.write(s)
    s = fh.readline()
fh2.close()
fh.close()

I'm doing this off the top of my head, so this code may well still
have bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. The "s and" test in
the first while loop checks for eof, in case the pattern does not
appear in the file.

Jason
Jul 18 '05 #3
> Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.


At least spark operates on whole strings if used as lexer/tokenizer - you
can of course feed it a lazy sequence of tokens by using a generator - but
that's up to you.

--
Regards,

Diez B. Roggisch
Jul 18 '05 #4
Steve Holden <st***@holdenweb.com> writes:
Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?
thanks
m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a
sequence of overlapping chunks to make sure that a regex could pick up
all matches. For me that would be more complex than using a lexer,
given the excellent range of modules such as SPARK and PLY, to mention
but two.


Yes, lexing would be the simplest, but PLY also can't read from streams,
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any lib.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...

m.
Jul 18 '05 #5
Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...
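
A rough sketch of that line-based idea in Python 3, under exactly those two assumptions: the delimiter contains no newline byte, and every single line fits in memory. The name split_by_lines is invented for this example. Note that it still builds each piece in memory before yielding it, so it only helps if the pieces themselves are of manageable size.

```python
def split_by_lines(path, delim):
    """Yield the pieces of a binary file that lie between occurrences of delim.

    Hypothetical helper. delim must be non-empty bytes without a newline,
    so no occurrence can straddle two lines.
    """
    if not delim or b'\n' in delim:
        raise ValueError('delim must be non-empty and contain no newline')
    piece = []                        # parts of the piece being collected
    with open(path, 'rb') as fh:
        for line in fh:               # binary files are iterated line by line
            parts = line.split(delim)
            for part in parts[:-1]:   # every part but the last ends at a delimiter
                piece.append(part)
                yield b''.join(piece)
                piece = []
            piece.append(parts[-1])   # may continue on the next line
    yield b''.join(piece)
```

For a file small enough to check, the result should be the same as reading everything and calling split(delim) on it.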

m.
Jul 18 '05 #6
On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.

Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of
>>> '1231xxx45646xxx45646xxx78'.split('xxx')
['1231', '45646', '45646', '78']

or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

['1231xxx', '45646xxx', '45646xxx', '78']

??
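
For a string that does fit in memory, the three variants can be produced as in this small Python 3 illustration (the variable names are just for the example):

```python
import re

s = '1231xxx45646xxx45646xxx78'

# 1. Delimiter dropped: plain str.split.
dropped = s.split('xxx')
# ['1231', '45646', '45646', '78']

# 2. Delimiter kept as its own item: re.split with a capturing group.
separate = re.split('(xxx)', s)
# ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

# 3. Delimiter attached to the end of the preceding piece: pair each
#    piece with the delimiter that follows it (none after the last one).
attached = [a + b for a, b in zip(separate[0::2], separate[1::2] + [''])]
# ['1231xxx', '45646xxx', '45646xxx', '78']
```
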

Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):

--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break

    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------

Extent of testing follows :-)

>>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
----------------------------------------
01234abc5678abc901234
567ab890abc
----------------------------------------
>>> import ut.splitfile
>>> ut.splitfile.test('splitfile.txt', 'abc')
'01234'
'abc'
'5678'
'abc'
'901234\r\n567ab890'
'abc'
'\r\n'
>>> ut.splitfile.test('splitfile.txt', '012')
''
'012'
'34abc5678abc9'
'012'
'34\r\n567ab890abc\r\n'
>>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
>>> it.next
<method-wrapper object at 0x02EF1C6C>
>>> it.next()
'01234abc5678abc901234\r\n567'
>>> it.next()
'ab89'
>>> it.next()
'0abc\r\n'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
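
The generator above is written for the Python of its day (print statement, it.next(), str data from a file opened with 'rb'). A possible Python 3 rendering of the same idea is sketched below; it is an adaptation, not the original code. The name splitfile_py3 is made up, and the delimiter is assumed to be a non-empty bytes object, since a file opened in binary mode yields bytes.

```python
def splitfile_py3(path, splitstr, chunksize=64 * 1024):
    """Yield pieces and delimiters alternately, reading the file in chunks.

    Hypothetical Python 3 adaptation; splitstr must be non-empty bytes.
    """
    if not splitstr:
        raise ValueError('splitstr must not be empty')
    splen = len(splitstr)
    buf = b''
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns b'' at EOF
        for chunk in iter(lambda: f.read(chunksize), b''):
            buf += chunk
            start = 0
            while True:
                end = buf.find(splitstr, start)
                if end < 0:
                    break
                yield buf[start:end]   # piece, not including splitstr
                yield splitstr
                start = end + splen
            buf = buf[start:]          # unfinished piece, kept for the next chunk
    yield buf                          # whatever follows the last delimiter


if __name__ == '__main__':
    import sys
    if len(sys.argv) not in (3, 4):
        raise SystemExit('Usage: python splitfile.py path splitstr [chunksize]')
    size = int(sys.argv[3]) if len(sys.argv) == 4 else 64 * 1024
    for piece in splitfile_py3(sys.argv[1], sys.argv[2].encode(), size):
        print(repr(piece))
```

As with the original, the buffer grows until a delimiter turns up, so memory use is bounded by the longest piece rather than by the file size.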

Regards,
Bengt Richter
Jul 18 '05 #7
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?
thanks
m.
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all


The re module works fine with an mmap-ed file, so there is no need to read the whole file into memory.
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.


--
Denis S. Otkidach
http://www.python.ru/ [ru]
Jul 18 '05 #8
"Denis S. Otkidach" <od*@strana.ru> writes:
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer but there maybe anything more simple?
> thanks
> m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all


re module works fine with mmap-ed file, so no need to read it into memory.


thank you, this is the solution!
Now I can use mmap.find to locate all occurrences and then read the
chunks via file.seek and file.read.

m.
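
A minimal Python 3 sketch of that plan, for illustration only: the function names split_offsets and write_pieces, the output file name template and the copy buffer size are all assumptions, not anything from the posts above. The mmap object's find method locates the delimiter, and the pieces are then copied out with seek and read in bounded blocks.

```python
import mmap
import os


def split_offsets(path, delim):
    """Return (start, end) byte ranges of the pieces between delimiters."""
    if not delim:
        raise ValueError('delim must not be empty')
    if os.path.getsize(path) == 0:
        return [(0, 0)]                # an empty file cannot be mmap-ed
    ranges = []
    with open(path, 'rb') as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            pos = mm.find(delim, start)
            while pos >= 0:
                ranges.append((start, pos))
                start = pos + len(delim)
                pos = mm.find(delim, start)
            ranges.append((start, len(mm)))
    return ranges


def write_pieces(path, delim, out_template='piece%04d.bin', bufsize=1 << 20):
    """Write each piece to its own file; return the list of file names."""
    written = []
    with open(path, 'rb') as fh:
        for n, (start, end) in enumerate(split_offsets(path, delim)):
            name = out_template % n
            fh.seek(start)
            remaining = end - start
            with open(name, 'wb') as out:
                while remaining > 0:   # copy in blocks of at most bufsize
                    block = fh.read(min(bufsize, remaining))
                    if not block:
                        break
                    out.write(block)
                    remaining -= len(block)
            written.append(name)
    return written
```

Only the list of offsets and one copy block are held in memory at a time; the operating system pages the mapped file in and out as find scans it.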
Jul 18 '05 #9
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
Jason Rennie <jr*****@csail.mit.edu> writes:
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split this.
I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...


man strings (-o option)

--
William Park <op**********@yahoo.ca>
Linux solution for data management and processing.
Jul 18 '05 #10
