I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split that.
I could probably use a lexer, but is there maybe something simpler?
thanks
m.
Martin Dieringer wrote:
> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler? thanks m.
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.
regards
Steve
-- http://www.holdenweb.com http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
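A minimal sketch of the overlapping-chunk search described above, written for Python 3 and a fixed byte string rather than a general regex. The function name `find_offsets` and the `chunksize` default are illustrative assumptions, not something from the thread. Each read keeps the last `len(needle) - 1` bytes of the previous window so that a match straddling a chunk boundary is still seen.

```python
def find_offsets(path, needle, chunksize=64 * 1024):
    """Yield the file offset of every non-overlapping occurrence of needle.

    needle must be a non-empty bytes object. Only about one chunk is held
    in memory at a time. (Hypothetical helper, for illustration only.)
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    overlap = len(needle) - 1
    tail = b""      # last `overlap` bytes of the previous window
    base = 0        # file offset of the first byte of `tail`
    next_ok = 0     # matches starting before this offset are ignored
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            window = tail + chunk
            pos = window.find(needle, max(0, next_ok - base))
            while pos != -1:
                yield base + pos
                next_ok = base + pos + len(needle)
                pos = window.find(needle, pos + len(needle))
            # Carry the end of this window over to the next read.
            keep = min(overlap, len(window))
            base += len(window) - keep
            tail = window[len(window) - keep:]
```

The bookkeeping (`base`, `next_ok`) is the price of not using a lexer: it is what keeps reported offsets absolute and prevents a match that ends in the carried-over tail from being counted twice.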
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler?
If the pattern is contained within a single line, do something like this:
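import re

myre = re.compile(r'foo')    # the fixed string to split on
fh = open(f)                 # f: path of the input file
fh1 = open(f1, 'w')          # f1: receives everything before the first match
s = fh.readline()
while not myre.search(s):    # no EOF check here; see the note below
    fh1.write(s)
    s = fh.readline()
fh1.close()
fh2 = open(f2, 'w')          # f2: receives the matching line and the rest
while s:                     # readline() returns '' at EOF
    fh2.write(s)
    s = fh.readline()
fh2.close()
fh.close()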
I'm doing this off the top of my head, so this code almost certainly
has bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. Oh, if there's a
chance that the pattern does not appear in the file, you'll need to
check for eof in the first while loop.
Jason
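As a hedged generalisation of the line-by-line idea above, the following Python 3 sketch starts a new output file at every line containing the pattern, so the file can be cut into more than two parts. The function name `split_on_matching_lines` and the numbered-file naming scheme are assumptions made for illustration. It only works if the pattern never spans a line break, and it still holds one whole line in memory at a time.

```python
def split_on_matching_lines(src, pattern, out_prefix):
    """Copy src into numbered part files, starting a new part at every
    line that contains pattern (a bytes object).

    Returns the list of part file names. (Hypothetical helper.)
    """
    names = []
    out = None
    try:
        with open(src, "rb") as f:
            for line in f:                      # one line in memory at a time
                if out is None or pattern in line:
                    if out is not None:
                        out.close()
                    name = "%s%04d" % (out_prefix, len(names))
                    out = open(name, "wb")
                    names.append(name)
                out.write(line)                 # the matching line opens the new part
    finally:
        if out is not None:
            out.close()
    return names
```

Unlike the two-file version, this needs no separate end-of-file check: the `for` loop simply stops when the input is exhausted.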
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all matches. For me that would be more complex than using a lexer, given the excellent range of modules such as SPARK and PLY, to mention but two.
At least SPARK operates on whole strings when used as a lexer/tokenizer. You
can of course feed it a lazy sequence of tokens by using a generator, but
that's up to you.
--
Regards,
Diez B. Roggisch
Steve Holden <st***@holdenweb.com> writes:
> Martin Dieringer wrote:
>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler? thanks m.
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all matches. For me that would be more complex than using a lexer, given the excellent range of modules such as SPARK and PLY, to mention but two.
Yes, lexing would be the simplest, but PLY also can't read from streams,
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any library.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...
m.
Jason Rennie <jr*****@csail.mit.edu> writes:
> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler?
> If the pattern is contained within a single line, do something like this:
Hmm, it's binary data, so I can't tell how long the lines would be. OTOH the
pattern would certainly be contained within a single line, as it has no \n
in it... and the lines probably wouldn't be too large for memory...
m.
On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
> Jason Rennie <jr*****@csail.mit.edu> writes:
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler?
>> If the pattern is contained within a single line, do something like this:
> Hmm, it's binary data, so I can't tell how long the lines would be. OTOH the pattern would certainly be contained within a single line, as it has no \n in it... and the lines probably wouldn't be too large for memory...
> m.
Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of
 >>> '1231xxx45646xxx45646xxx78'.split('xxx')
 ['1231', '45646', '45646', '78']
or (I chose this for below)
 ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']
or maybe
 ['1231xxx', '45646xxx', '45646xxx', '78']
??
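For reference, the three variants above can be reproduced on a small in-memory string as follows. This is only an illustration in Python 3 of what each result shape means; the variable names are made up here, and none of it addresses the large-file problem yet.

```python
import re

s = "1231xxx45646xxx45646xxx78"
sep = "xxx"

# Variant 1: delimiter dropped.
dropped = s.split(sep)

# Variant 2: delimiter kept as its own item. A capturing group in the
# pattern makes re.split() return the separators as well.
interleaved = re.split("(" + re.escape(sep) + ")", s)

# Variant 3: delimiter attached to the end of the piece it terminates.
parts = s.split(sep)
attached = [p + sep for p in parts[:-1]] + parts[-1:]

print(dropped)
print(interleaved)
print(attached)
```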
Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):
--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break
    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------
Extent of testing follows :-)
 >>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
 ----------------------------------------
 01234abc5678abc901234
 567ab890abc
 ----------------------------------------
 >>> import ut.splitfile
 >>> ut.splitfile.test('splitfile.txt', 'abc')
 '01234'
 'abc'
 '5678'
 'abc'
 '901234\r\n567ab890'
 'abc'
 '\r\n'
 >>> ut.splitfile.test('splitfile.txt', '012')
 ''
 '012'
 '34abc5678abc9'
 '012'
 '34\r\n567ab890abc\r\n'
 >>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
 >>> it.next
 <method-wrapper object at 0x02EF1C6C>
 >>> it.next()
 '01234abc5678abc901234\r\n567'
 >>> it.next()
 'ab89'
 >>> it.next()
 '0abc\r\n'
 >>> it.next()
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 StopIteration
(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
Regards,
Bengt Richter
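The generator above is Python 2 code. A rough Python 3 sketch of the same buffering idea, operating on bytes and yielding only the pieces between delimiters (the first of the three variants), might look like this. The name `split_stream` is an assumption for illustration and this is not a port of the original function.

```python
def split_stream(path, delimiter, chunksize=64 * 1024):
    """Yield the pieces of the file that lie between occurrences of
    delimiter (a non-empty bytes object), without the delimiter itself.

    (Hypothetical helper, for illustration only.)
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buf += chunk
            pieces = buf.split(delimiter)
            # The last piece may end in the middle of a delimiter, so it is
            # held back and extended by the next chunk.
            buf = pieces.pop()
            for piece in pieces:
                yield piece
    yield buf  # whatever follows the final delimiter (possibly empty)
```

Each yielded piece can be written straight to its own output file. Note that, like the original, the buffer grows until a delimiter is found, so memory use is bounded by the largest piece rather than by the chunk size, and a long delimiter-free stretch is rescanned on every read.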
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden <st***@holdenweb.com> wrote:
>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler? thanks m.
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all
The re module works fine with an mmap-ed file, so there is no need to read it into memory.
> matches. For me that would be more complex than using a lexer, given the excellent range of modules such as SPARK and PLY, to mention but two.
--
Denis S. Otkidach http://www.python.ru/ [ru]
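A small Python 3 sketch of that suggestion, assuming a non-empty file and a fixed byte string to search for. The helper name `match_offsets` is made up for illustration. The `re` functions accept an `mmap` object wherever they accept bytes, so the operating system pages the file in as needed instead of the program reading it all.

```python
import mmap
import re


def match_offsets(path, needle):
    """Return the start offsets of all non-overlapping occurrences of
    needle (bytes) in the file, searching through a read-only mmap.

    Assumes the file is not empty; mmap refuses to map an empty file.
    (Hypothetical helper, for illustration only.)
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # re.escape() makes the fixed string safe to use as a pattern.
            return [m.start() for m in re.finditer(re.escape(needle), mm)]
```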
"Denis S. Otkidach" <od*@strana.r u> writes: On Mon, 22 Nov 2004 08:53:02 -0500 Steve Holden <st***@holdenwe b.com> wrote:
> I am trying to split a file by a fixed string. > The file is too large to just read it into a string and split this. > I could probably use a lexer but there maybe anything more simple? > thanks > m.
> Depends on your definition of "simple", I suppose. The problem with *not* using a lexer is that you'd have to examine the file in a sequence of overlapping chunks to make sure that a regex could pick up all
> re module works fine with mmap-ed file, so no need to read it into memory.
Thank you, this is the solution!
Now I can use mmap.find to locate all occurrences and then read the chunks via
file.seek and file.read.
m.
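The plan described above could be sketched in Python 3 roughly as follows: a first pass uses the `find` method of an `mmap` object to collect the delimiter positions, and a second pass copies each chunk to its own file with `seek` and `read`. The function name `split_file`, the numbered output names, and the `blocksize` parameter are assumptions for illustration. The input file is assumed to be non-empty.

```python
import mmap


def split_file(path, delimiter, out_prefix, blocksize=1024 * 1024):
    """Split the file at every occurrence of delimiter (non-empty bytes)
    and write the pieces, without the delimiter, to numbered files.

    Returns the list of output file names. (Hypothetical helper.)
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    names = []
    with open(path, "rb") as f:
        # Pass 1: locate every delimiter through a read-only mmap.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            cuts = []
            pos = mm.find(delimiter)
            while pos != -1:
                cuts.append(pos)
                pos = mm.find(delimiter, pos + len(delimiter))
        # Pass 2: copy each piece in bounded blocks via seek/read.
        start = 0
        for stop in cuts + [size]:
            name = "%s%04d" % (out_prefix, len(names))
            f.seek(start)
            remaining = stop - start
            with open(name, "wb") as out:
                while remaining > 0:
                    block = f.read(min(blocksize, remaining))
                    if not block:
                        break
                    out.write(block)
                    remaining -= len(block)
            names.append(name)
            start = stop + len(delimiter)
    return names
```

Copying in blocks keeps memory use bounded even when a single piece is very large, which slicing the mmap directly would not.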
Martin Dieringer <di******@zedat.fu-berlin.de> wrote:
> Jason Rennie <jr*****@csail.mit.edu> writes:
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string. The file is too large to just read it into a string and split that. I could probably use a lexer, but is there maybe something simpler?
>> If the pattern is contained within a single line, do something like this:
> Hmm, it's binary data, so I can't tell how long the lines would be. OTOH the pattern would certainly be contained within a single line, as it has no \n in it... and the lines probably wouldn't be too large for memory...
man strings (-o option)
--
William Park <op**********@yahoo.ca>
Linux solution for data management and processing.