
split and regexp on textfile

Hi,
I have a problem with the split function and regexps.
I have a file that I want to split, using the date as the separator.
Here is a sample:
-----
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
----

I'm trying to put all the lines into one string and then split it
(it would be better not to delete the \n, if possible):

while 1:
    line = ftoparse.readline()
    if not line: break
    if line[-1]=='\n': line=line[:-1]
    file_str += line
matchobj=re.compile('[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]')
matchobj=matchobj.split(file_str)
print(matchobj)

I have also tried:

matchobj=re.split(r"^[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]",file_str)

and reading everything at once with:

file_str=ftoparse.readlines()

but the split doesn't work. What am I doing wrong?

Apr 13 '07 #1
You're trying to match the date part, right? If re is what you want,
here's one example:

>>> import re
>>> data = open("file").read()
>>> pat = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{,2}\s+\d{,2}:\d{,2}:\d{,2} \d{4}", re.M|re.DOTALL)
>>> print(pat.findall(data))
['Mon Apr 9 22:30:18 2007', 'Mon Apr 9 22:31:10 2007']

Apr 13 '07 #2
On 13 Apr, 10:40, mik3l3...@gmail.com wrote:
> You're trying to match the date part, right? If re is what you want,
> here's one example:

Er, no! I need to get the text block between the two dates, not the
dates themselves! :)

Apr 13 '07 #3
On Apr 13, 4:55 pm, "Flyzone" <flyz...@technologist.com> wrote:
> Er, no! I need to get the text block between the two dates, not the
> dates themselves! :)

Change it to pat.split(data) then.
I get this:

['', '\ntext\ntext\n', '\ntext\ntext ']
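The empty first element appears because `split` also returns whatever precedes the first match, and the text begins with a date. A minimal self-contained sketch of this, in Python 3 syntax; the inline `data` string is an assumed stand-in for the file contents, and the pattern is the one above written as a raw string:

```python
import re

# Assumed stand-in for open("file").read(): the sample from the question.
data = (
    "Mon Apr 9 22:30:18 2007\n"
    "text\n"
    "text\n"
    "Mon Apr 9 22:31:10 2007\n"
    "text\n"
    "text\n"
)

# Same pattern as in the reply above, as a raw string.
pat = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{,2}\s+\d{,2}:\d{,2}:\d{,2} \d{4}")

# parts[0] is the (empty) text that precedes the first date.
parts = pat.split(data)

# Drop empty pieces and trim the newlines left around each block.
blocks = [p.strip("\n") for p in parts if p.strip()]
print(blocks)
```

Filtering out the empty pieces afterwards is just one way to tidy the result; the dates themselves are still discarded by this form of `split`.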

Apr 13 '07 #4
On 13 Apr, 11:14, mik3l3...@gmail.com wrote:
> Change it to pat.split(data) then.

That's what I tried originally, but it's not working. Here is my
result:

["Mon Feb 26 11:25:04 2007\ntext\n text\ntext\nMon Feb 26 11:25:16 2007\ntext\n text\n text\nMon Feb 26 17:06:41 2007\ntext"]

All together :(

Apr 13 '07 #5
Flyzone:
> I have a problem with the split function and regexp.
> I have a file that I want to split using the date as token.

My first try:

data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

section = []
for line in data.splitlines():
    if date_find.search(line):
        # A date line closes the previous section and opens a new one
        if section:
            print("\n" + "-" * 10 + "\n" + "\n".join(section))
        section = [line]
    else:
        if line:
            section.append(line)

# Print the last section too
print("\n" + "-" * 10 + "\n" + "\n".join(section))

itertools.groupby() is suited to splitting sequences like:
1111100011111100011100101011111
as:
11111 000 111111 000 111 00 1 0 1 0 11111
Here, instead, we have a sequence like:
100001000101100001000000010000
that has to be split as:
10000 1000 10 1 10000 10000000 10000
A standard itertool could be added for this quite common situation too.
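The difference between the two kinds of splitting can be checked with a short sketch (Python 3 syntax). `split_before` is a hypothetical helper written for this illustration, not a standard itertool: it starts a new chunk at every item for which the predicate is true.

```python
from itertools import groupby

def split_before(seq, is_start):
    """Yield lists, opening a new list at each item where is_start(item) is true."""
    group = []
    for item in seq:
        if is_start(item) and group:
            yield group
            group = []
        group.append(item)
    if group:
        yield group

# groupby splits into runs of equal items.
runs = ["".join(g) for _, g in groupby("1111100011111100011100101011111")]
print(" ".join(runs))

# split_before cuts just before every "1", whatever follows it.
chunks = ["".join(c) for c in
          split_before("100001000101100001000000010000", lambda ch: ch == "1")]
print(" ".join(chunks))
```

In the log-file case, the "1" items correspond to the date lines and the "0" items to the text lines that follow them.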

Along those lines I have devised this different (and maybe
over-engineered) version:

from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    def __init__(self, predicate):
        self.predicate = predicate
        self.precedent_el = None
        self.state = True
    def __call__(self, el):
        # Flip the key at every date line, so that groupby
        # starts a new group there
        if self.predicate(el):
            self.state = not self.state
        self.precedent_el = el
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)

sections = ("\n".join(g) for h, g in groupby(data.splitlines(),
                                             key=splitter))
for section in sections:
    if section:
        print("\n" + "-" * 10 + "\n" + section)

The Splitter class + the groupby can become a single simpler
generator, as in this version:

def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            # A marker item closes the current group and opens the next
            if group: yield group
            group = [part]
        else:
            group.append(part)
    yield group

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

for section in grouper(data.splitlines(), date_find.search):
    print("\n" + "-" * 10 + "\n" + "\n".join(section))

Maybe that grouper can be modified to manage each group lazily, like
groupby does, instead of building a true list.

Flyzone (seen later):
> Er, no! I need to get the text block between the two dates, not the dates themselves! :)

Then you can modify the code like this:

def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [] # changed: the date line itself is dropped
        else:
            group.append(part)
    yield group

Bye,
bearophile

Apr 13 '07 #6
On 13 Apr, 11:30, "Flyzone" <flyz...@technologist.com> wrote:
> All together :(

Damn, my regexp was wrong:

pat = re.compile("[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:][0-9][0-9]",re.M|re.DOTALL)

Now it's working! :)
Great! Thanks a lot for the help!

One small question: can pat.split split without deleting the date?

Apr 13 '07 #7
On Apr 13, 6:08 pm, "Flyzone" <flyz...@technologist.com> wrote:
> One small question: can pat.split can split without deleting the date?

Not that I know of.
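
For what it's worth, the `re` documentation says that when the pattern contains a capturing group, `split` also returns the text matched by that group, so the dates can be kept. A minimal sketch in Python 3 syntax; the inline `data` and the simplified date pattern are assumptions made for illustration, not the poster's exact regexp:

```python
import re

# Assumed stand-in for the file contents.
data = (
    "Mon Apr 9 22:30:18 2007\ntext\ntext\n"
    "Mon Apr 9 22:31:10 2007\ntext\ntext\n"
)

# The parentheses make split() keep each matched date in the result.
pat = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d\d:\d\d:\d\d \d{4})\n", re.M)

# parts looks like ['', date1, block1, date2, block2].
parts = pat.split(data)

# Pair every date with the block of text that follows it.
pairs = list(zip(parts[1::2], parts[2::2]))
for date, block in pairs:
    print(date, repr(block))
```

The first element of `parts` is still the text before the first date, which is why the pairing starts at index 1.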

Apr 13 '07 #8
On Fri, 13 Apr 2007 07:08:05 -0300, Flyzone <fl*****@technologist.com>
wrote:
> A little question: the pat.split can split without delete the date?

No, but instead of reading the whole file and splitting on dates, you
could iterate over the file and detect block endings:

# fancy_date_regexp is a placeholder for a compiled pattern
# that matches a date line
def split_on_dates(ftoparse):
    block = None
    current_date = None
    for line in ftoparse:
        if fancy_date_regexp.match(line):
            # a new block begins, yield the previous one
            if block is not None:
                yield current_date, block
            current_date = line
            block = []
        elif block is not None:
            # accumulate lines for current block
            # (lines before the first date are skipped)
            block.append(line)
    # don't forget the last block
    if block is not None:
        yield current_date, block

for date, block in split_on_dates(ftoparse):
    pass  # process block
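
The same idea as a self-contained sketch that can be run directly (Python 3 syntax). The `date_line` pattern is a hypothetical stand-in for `fancy_date_regexp`, and an in-memory `io.StringIO` object is assumed in place of the real file; trailing newlines are stripped here only to keep the output short.

```python
import io
import re

# Hypothetical stand-in for fancy_date_regexp: matches a whole line
# such as "Mon Apr 9 22:30:18 2007".
date_line = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d\d:\d\d:\d\d \d{4}$")

def split_on_dates(lines):
    """Yield (date, block) pairs; lines before the first date are ignored."""
    current_date, block = None, None
    for line in lines:
        line = line.rstrip("\n")
        if date_line.match(line):
            # A new block begins, so hand back the previous one.
            if block is not None:
                yield current_date, block
            current_date, block = line, []
        elif block is not None:
            block.append(line)
    # The last block has no date after it to trigger the yield.
    if block is not None:
        yield current_date, block

# Assumed sample input standing in for the open file object.
sample = io.StringIO(
    "error text\n"
    "Mon Apr 9 22:30:18 2007\ntext\ntext\n"
    "Mon Apr 9 22:31:10 2007\ntext\n"
)
result = list(split_on_dates(sample))
for date, block in result:
    print(date, block)
```

Because the generator reads one line at a time, the whole file never has to be held in memory, and each block arrives together with its date.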

--
Gabriel Genellina
Apr 15 '07 #9
