split and regexp on textfile

Flyzone

Hi,
I have a problem with the split function and regexps.
I have a file that I want to split, using the date as the token.
Here is a sample:
-----
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
----

I'm trying to put all the lines into one string and then split it
(it would be better not to delete the \n, if possible):

while 1:
    line = ftoparse.readline()
    if not line: break
    if line[-1]=='\n': line=line[:-1]
    file_str += line
matchobj=re.compile('[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]')
matchobj=matchobj.split(file_str)
print(matchobj)

I have also tried

matchobj=re.split(r"^[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:]",file_str)

and reading everything at once:

file_str=ftoparse.readlines()

but the split doesn't work... where am I going wrong?

Apr 13 '07 #1

mik3l3374

You're trying to match the date part, right? If re is what you want,
here's one example:

>>> import re
>>> data = open("file").read()
>>> pat = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{,2}\s+\d{,2}:\d{,2}:\d{,2} \d{4}", re.M|re.DOTALL)
>>> print(pat.findall(data))
['Mon Apr 9 22:30:18 2007', 'Mon Apr 9 22:31:10 2007']

Apr 13 '07 #2

Flyzone

On 13 Apr, 10:40, mik3l3...@gmail.com wrote:

> you trying to match the date part right? if re is what you desire,
> here's one example:

Umm.. no! I need to get the text block between the two dates, not the
dates! :)

Apr 13 '07 #3

mik3l3374

On Apr 13, 4:55 pm, "Flyzone" <flyz...@technologist.com> wrote:

> On 13 Apr, 10:40, mik3l3...@gmail.com wrote:
>
>> you trying to match the date part right? if re is what you desire,
>> here's one example:
>
> Amm..not! I need to get the text-block between the two data, not the
> data! :)

change to pat.split(data) then.
I get this:

['', '\ntext\ntext\n', '\ntext\ntext ']

Apr 13 '07 #4

Flyzone

On 13 Apr, 11:14, mik3l3...@gmail.com wrote:

> change to pat.split(data) then.

That's what I tried originally, but it is not working. My result is
here:

["Mon Feb 26 11:25:04 2007\ntext\n text\ntext\nMon Feb 26 11:25:16 2007\ntext\n text\n text\nMon Feb 26 17:06:41 2007\ntext"]

all together :(

Apr 13 '07 #5

bearophileHUGS

Flyzone:

> i have a problem with the split function and regexp.
> I have a file that i want to split using the date as token.

My first try:

data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

section = []
for line in data.splitlines():
    if date_find.search(line):
        # A date line closes the previous section and opens a new one
        if section:
            print("\n" + "-" * 10 + "\n" + "\n".join(section))
        section = [line]
    else:
        if line:
            section.append(line)

print("\n" + "-" * 10 + "\n" + "\n".join(section))

itertools.groupby() is suited to splitting sequences like:
1111100011111100011100101011111
into:
11111 000 111111 000 111 00 1 0 1 0 11111
Here, instead, we have a sequence like:
100001000101100001000000010000
that has to be split into:
10000 1000 10 1 10000 10000000 10000
A standard itertool could be added for this quite common situation too.
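
A small sketch of that difference, written for Python 3 (unlike the rest of the thread). groupby collects runs of equal items, while the second sequence needs a new chunk opened at every marker item. The name split_before is a hypothetical helper for illustration only; itertools does not provide it.

```python
from itertools import groupby

def split_before(seq, is_start):
    """Yield lists of items, opening a new list at each item where is_start is true.

    Hypothetical helper for illustration; not part of itertools.
    """
    group = []
    for item in seq:
        if is_start(item) and group:
            yield group
            group = []
        group.append(item)
    if group:
        yield group

# groupby: consecutive equal items end up in the same run
runs = ["".join(g) for _, g in groupby("1111100011111100011100101011111")]

# split_before: every "1" starts a new chunk, the zeros that follow belong to it
chunks = ["".join(c) for c in
          split_before("100001000101100001000000010000", lambda ch: ch == "1")]

print(" ".join(runs))
print(" ".join(chunks))
```

With lines instead of characters and a date test instead of ch == "1", the same helper would give one chunk per dated section.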

Along those lines I have devised this different (and maybe
over-engineered) version:

from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    def __init__(self, predicate):
        self.predicate = predicate
        self.precedent_el = None
        self.state = True
    def __call__(self, el):
        # Flip the key each time a date line is seen, so that groupby
        # starts a new group there
        if self.predicate(el):
            self.state = not self.state
        self.precedent_el = el
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)

sections = ("\n".join(g) for h, g in groupby(data.splitlines(), key=splitter))
for section in sections:
    if section:
        print("\n" + "-" * 10 + "\n" + section)

The Splitter class + the groupby can become a single simpler
generator, like in this version:

def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [part]
        else:
            group.append(part)
    yield group

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")

for section in grouper(data.splitlines(), date_find.search):
    print("\n" + "-" * 10 + "\n" + "\n".join(section))

Maybe that grouper can be modified to manage groups lazily, like
groupby does, instead of building a true list.

Flyzone (seen later):

> Amm..not! I need to get the text-block between the two data, not the data! :)

Then you can modify the code like this:

def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group: yield group
            group = [] # changed
        else:
            group.append(part)
    yield group

Bye,
bearophile

Apr 13 '07 #6

Flyzone

On 13 Apr, 11:30, "Flyzone" <flyz...@technologist.com> wrote:

> all together :(

Damn, my regexp was wrong:

pat = re.compile("[A-Z][a-z][a-z][ ][A-Z][a-z][a-z][ ][0-9| ][0-9][ ][0-9][0-9][:][0-9][0-9]",re.M|re.DOTALL)

Now it's working! :)
Great! Thanks a lot for the help!

A little question: can pat.split split without deleting the date?

Apr 13 '07 #7

mik3l3374

On Apr 13, 6:08 pm, "Flyzone" <flyz...@technologist.com> wrote:

> A little question: the pat.split can split without delete the date?

not that i know of.
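
For what it's worth, the re documentation says that split() also returns the text of the separators when the pattern contains capturing parentheses, so that may be one way to keep the dates. A minimal sketch in Python 3, assuming dates shaped like the sample in the first post; date_pat and the sample string are made up for illustration and would need adapting to the real file.

```python
import re

data = (
    "Mon Apr 9 22:30:18 2007\ntext\ntext\n"
    "Mon Apr 9 22:31:10 2007\ntext\ntext\n"
)

# Hypothetical pattern for the sample above. The outer parentheses form the
# capturing group, which makes split() return the dates as well.
date_pat = re.compile(
    r"^([A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\n",
    re.M,
)

parts = date_pat.split(data)
# parts[0] is whatever precedes the first date; after that, dates and
# text blocks alternate.
pairs = list(zip(parts[1::2], parts[2::2]))
for date, block in pairs:
    print(date, repr(block))
```

The newlines inside each block are preserved, which was the other wish in the first post.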

Apr 13 '07 #8

Gabriel Genellina

On Fri, 13 Apr 2007 07:08:05 -0300, Flyzone <fl*****@technologist.com>
wrote:

> A little question: the pat.split can split without delete the date?

No, but instead of reading the whole file and splitting on dates, you
could iterate over the file and detect block endings:

def split_on_dates(ftoparse):
    current_date = None
    block = None
    for line in ftoparse:
        if fancy_date_regexp.match(line):
            # a new block begins, yield the previous one
            if block is not None:
                yield current_date, block
            current_date = line
            block = []
        elif block is not None:
            # accumulate lines for current block
            # (lines before the first date are skipped)
            block.append(line)
    # don't forget the last block
    if block is not None:
        yield current_date, block

for date, block in split_on_dates(ftoparse):
    pass  # process block
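
To try the generator without a file on disk, here is a rough Python 3 sketch that feeds it an in-memory file. The function is restated so that the block runs on its own. The pattern bound to fancy_date_regexp is only an assumed stand-in for the sample format in the first post, and the sample text is made up.

```python
import io
import re

# Hypothetical stand-in for fancy_date_regexp; tune it to the real log format.
fancy_date_regexp = re.compile(
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$"
)

def split_on_dates(lines):
    """Yield (date_line, block_lines) pairs; lines before the first date are skipped."""
    current_date, block = None, None
    for line in lines:
        if fancy_date_regexp.match(line):
            if block is not None:
                yield current_date, block
            current_date, block = line, []
        elif block is not None:
            block.append(line)
    if block is not None:
        yield current_date, block

# io.StringIO behaves like an open text file when iterated line by line
sample = io.StringIO(
    "Mon Apr 9 22:30:18 2007\ntext\ntext\n"
    "Mon Apr 9 22:31:10 2007\ntext\ntext\n"
)
result = [(date.rstrip("\n"), "".join(block))
          for date, block in split_on_dates(sample)]
for date, text in result:
    print(date, repr(text))
```

Since the file is read line by line, both the date and the newlines of each block stay available, and the whole file never has to sit in one string.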

--
Gabriel Genellina

Apr 15 '07 #9
