How do I parse this? Regexp?

Hello all,

I have this line of numbers:
04242005 18:20:42-0.000002, 271.1748608, [-4.119873046875,
3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625,
-0.960662841796875], [0.01556396484375, 0.01220703125,
0.01068115234375]
repeated several times in a text file, and I would like each element to
be part of a vector. How do I do this? I am not very experienced with
regexps, as you can see.
Thanks in advance,
Jake.

Jul 19 '05 #1
"se*******@gmail.com" <se*******@gmail.com> writes:
> Hello all,
>
> I have this line of numbers:
> 04242005 18:20:42-0.000002, 271.1748608, [-4.119873046875,
> 3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625,
> -0.960662841796875], [0.01556396484375, 0.01220703125,
> 0.01068115234375]
> repeated several times in a text file, and I would like each element to
> be part of a vector. How do I do this? I am not very experienced with
> regexps, as you can see.


You don't need a regexp to do that.

Use the string split method. It splits on whitespace by default. If you want
to keep the values inside "[]" together, remove the spaces before splitting, or
split on the "[" char first and then split the first item using spaces as a
separator.
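
A minimal sketch of the "split on the "[" char first" idea, assuming every record has exactly the layout shown in the question. The names head_fields and groups are made up for illustration, and converting the bracketed values to floats is an extra step not mentioned above:

```python
line = ("04242005 18:20:42-0.000002, 271.1748608, "
        "[-4.119873046875, 3.4332275390625, 105.062255859375], "
        "[0.093780517578125, 0.041015625, -0.960662841796875], "
        "[0.01556396484375, 0.01220703125, 0.01068115234375]")

# Split once on the first "[": head holds the date/time and the scalar
# field, tail holds the three bracketed groups.
head, tail = line.split("[", 1)

# Once the commas are dropped, the head is whitespace separated.
head_fields = head.replace(",", " ").split()

# Every remaining "[" starts a new group; drop "]" and "," before splitting.
groups = []
for chunk in tail.split("["):
    numbers = chunk.replace("]", " ").replace(",", " ").split()
    groups.append([float(n) for n in numbers])
```

The date/time field is left as a string here because "18:20:42-0.000002" is not a plain number.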
Be seeing you,
--
Jorge Godoy <go***@ieee.org>
Jul 19 '05 #2
Hello,

I am not understanding your answer, but I probably asked the wrong
question :-)

I want to remove the commas and the square bracket characters [ and ], and
rewrite this whole line (and all the ones following it in the text file)
so that a space is the only delimiter. How do I do this?

I have tried this:

import re

f = open(name3, 'r')
r = r"\d+\.\d*"
for line in f:
    cols = line.split()
    data1 = re.findall(r, line)

and then I don't know what to do with either cols or data1.
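
For reference, a minimal sketch of one way to get that space-delimited form, assuming commas and square brackets are the only characters that need to go. The function name to_space_delimited is hypothetical, and the sample is a shortened copy of the line above:

```python
import re

def to_space_delimited(line):
    """Replace commas and square brackets with spaces, then collapse whitespace."""
    cleaned = re.sub(r"[\[\],]", " ", line)
    return " ".join(cleaned.split())

sample = ("04242005 18:20:42-0.000002, 271.1748608, "
          "[-4.119873046875, 3.4332275390625, 105.062255859375]")
converted = to_space_delimited(sample)
```

Inside the loop over the file, each line could be passed through the same function and written to an output file. Note that the pattern r"\d+\.\d*" in the attempt above would drop minus signs and skip the date field, which has no decimal point.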

Jake.

Jul 19 '05 #3
On Wed, 27 Apr 2005 07:56:11 -0700, se*******@gmail.com wrote:
> Hello all,
>
> I have this line of numbers:
> 04242005 18:20:42-0.000002, 271.1748608, [-4.119873046875,
> 3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625,
> -0.960662841796875], [0.01556396484375, 0.01220703125, 0.01068115234375]
> repeated several times in a text file, and I would like each element to be
> part of a vector. How do I do this? I am not very experienced with regexps,
> as you can see.


I think, based on the responses you've gotten so far, that perhaps you
aren't being clear enough.

Some starter questions:

* Is that all on one line in your file?
* Are there ever variable numbers of the [] fields?
* What do you mean by "vectors"?

If the line format is stable (no variation in the number of fields), and
especially if that is all one line, then given that you are not familiar
with regexps I wouldn't muck about with one. (Even for me, it would be a
borderline call whether to use one.) Instead, follow along with the session
below and it'll probably help, though as I don't precisely know what you're
asking I can't give a complete solution:

Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "04242005 18:20:42-0.000002, 271.1748608, [-4.119873046875, 3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625, -0.960662841796875], [0.01556396484375, 0.01220703125, 0.01068115234375]"
>>> x.split(',', 2)
['04242005 18:20:42-0.000002', ' 271.1748608', ' [-4.119873046875, 3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625, -0.960662841796875], [0.01556396484375, 0.01220703125, 0.01068115234375]']
>>> splitted = x.split(',', 2)
>>> splitted[2]
' [-4.119873046875, 3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625, -0.960662841796875], [0.01556396484375, 0.01220703125, 0.01068115234375]'
>>> import re
>>> safetyChecker = re.compile(r"^[-\[\]0-9,. ]*$")
>>> if safetyChecker.match(splitted[2]):
...     eval(splitted[2], {}, {})
...
([-4.119873046875, 3.4332275390625, 105.062255859375], [0.093780517578125,
0.041015625, -0.960662841796875], [0.01556396484375, 0.01220703125,
0.01068115234375])
>>> splitted[0].split()
['04242005', '18:20:42-0.000002']
>>> splitted[0].split()[1].split('-')
['18:20:42', '0.000002']

I'd like to STRONGLY EMPHASIZE that "eval" is very dangerous if you can't
trust the source; *any* Python code will be run. That is why I am extra
paranoid and double-check that the expression contains only the characters
listed in that simple regex. (Anyone who can construct a malicious string
out of those characters will get my sincere admiration.) You may do as you
please, of course, but I believe it is not helpful to suggest security holes
on comp.lang.python :-) Still, the coincidence of that part of your data,
which is also the most challenging to parse, exactly matching Python syntax
is too much to pass up.
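
A hedged aside for readers on a newer Python than the 2.3.5 session above: the standard library's ast.literal_eval (not available in 2.3, so this is a swapped-in alternative, not what the session uses) evaluates only Python literals. A minimal sketch, with the variable names bracketed and vectors chosen for illustration:

```python
import ast

bracketed = (" [-4.119873046875, 3.4332275390625, 105.062255859375], "
             "[0.093780517578125, 0.041015625, -0.960662841796875], "
             "[0.01556396484375, 0.01220703125, 0.01068115234375]")

# literal_eval accepts only literals (numbers, strings, lists, tuples, ...)
# and raises ValueError or SyntaxError for anything else, so function calls
# and attribute access in the input are rejected instead of executed.
vectors = ast.literal_eval(bracketed.strip())
```

The result is a tuple of three lists of floats, the same shape that eval produces in the session.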

This should give you some good ideas; if you post more detailed questions
we can probably be of more help.

Jul 19 '05 #4
Jake -

If regexps give you pause, here is a pyparsing version that, while
verbose, is fairly straightforward. I made some guesses at what some
of the data fields might be, but that doesn't matter much.

Note the use of setResultsName() to give different parse fragments
names so that they are directly addressable in the results, instead of
having to count out "the 0'th group is the date, the 1'st group is the
time...". Also, there is a commented-out conversion action, to
automatically convert strings to floats during parsing.

Download pyparsing at http://pyparsing.sourceforge.net.

Good luck,
-- Paul
data = """04242005 18:20:42-0.000002, 271.1748608, [-4.119873046875,
3.4332275390625, 105.062255859375], [0.093780517578125, 0.041015625,
-0.960662841796875], [0.01556396484375, 0.01220703125,
0.01068115234375]"""

from pyparsing import *

COMMA = Literal(",").suppress()
LBRACK = Literal("[").suppress()
RBRACK = Literal("]").suppress()

# define a two-digit integer, we'll need a lot of them
int2 = Word(nums,exact=2)
month = int2
day = int2
yr = Combine("20" + int2)
date = Combine(month + day + yr)

hr = int2
min = int2
sec = int2
tz = oneOf("+ -") + Word(nums) + "." + Word(nums)
time = Combine( hr + ":" + min + ":" + sec + tz )

realNum = Combine( Optional("-") + Word(nums) + "." + Word(nums) )
# uncomment the next line and reals will be converted from strings to
# floats during parsing
#realNum.setParseAction( lambda s,l,t: float(t[0]) )

triplet = Group( LBRACK + realNum + COMMA + realNum + COMMA + realNum +
RBRACK )
entry = Group( date.setResultsName("date") +
time.setResultsName("time") + COMMA +
realNum.setResultsName("temp") + COMMA +
Group( triplet + COMMA + triplet + COMMA + triplet
).setResultsName("coords") )

dataFormat = OneOrMore(entry)
results = dataFormat.parseString(data)

for d in results:
    print(d.date)
    print(d.time)
    print(d.temp)
    print(d.coords[0].asList())
    print(d.coords[1].asList())
    print(d.coords[2].asList())

returns:

04242005
18:20:42-0.000002
271.1748608
['-4.119873046875', '3.4332275390625', '105.062255859375']
['0.093780517578125', '0.041015625', '-0.960662841796875']
['0.01556396484375', '0.01220703125', '0.01068115234375']

Jul 19 '05 #5
> safetyChecker = re.compile(r"^[-\[\]0-9,. ]*$")

Doesn't the dot (.) in your character class mean that you are allowing
EVERYTHING (except newline)?

(You would probably want \. instead.)

/Simon
Jul 19 '05 #6
Simon Dahlbacka wrote:
> > safetyChecker = re.compile(r"^[-\[\]0-9,. ]*$")
>
> ...doesn't the dot (.) in your character class mean that you are allowing
> EVERYTHING (except newline?)

The re docs clearly say this is not the case:

'''
[]
Used to indicate a set of characters. Characters can be listed
individually, or a range of characters can be indicated by giving two
characters and separating them by a "-". Special characters are not
active inside sets.
'''

Note the last sentence in the above quotation...
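
A quick check of that behaviour, as a small sketch using the same pattern from earlier in the thread (the test strings and result names are made up for illustration):

```python
import re

safety_checker = re.compile(r"^[-\[\]0-9,. ]*$")

# Inside [...] the dot is a literal period, so letters are still rejected.
accepts_numbers = safety_checker.match("[0.5, -1.25]") is not None
rejects_letters = safety_checker.match("[0.5, abc]") is None

# Outside a set the dot is a wildcard: it matches any character except newline.
wildcard_matches_letter = re.match(r"^.$", "a") is not None
```

All three names end up True, which matches the documentation quoted above.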

-Peter
Jul 19 '05 #7
On Thu, 28 Apr 2005 20:53:14 -0400, Peter Hansen wrote:
> The re docs clearly say this is not the case:
>
> '''
> []
> Used to indicate a set of characters. Characters can be listed
> individually, or a range of characters can be indicated by giving two
> characters and separating them by a "-". Special characters are not active
> inside sets.
> '''
>
> Note the last sentence in the above quotation...
>
> -Peter


Aren't regexes /fun/?

Also from that passage, Simon, note the "-" right at the front of
[-\[\]0-9,. ]; placed first, it is a literal hyphen rather than a range
separator, another one that's tripped me up more than once.
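
A small sketch of the difference the position of "-" makes, using a made-up test string:

```python
import re

text = "0-5-9"

# Hyphen first in the set: the set is the three characters "-", "0" and "9".
literal_hyphen = re.findall(r"[-09]", text)

# Hyphen between two characters: a range covering "0" through "9".
digit_range = re.findall(r"[0-9]", text)
```

Here literal_hyphen is ['0', '-', '-', '9'] and digit_range is ['0', '5', '9'].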

Wheeee!

"Some people, when confronted with a problem, think ``I know, I'll use
regular expressions.'' Now they have two problems." - jwz
http://www.jwz.org/hacks/marginal.html

Jul 19 '05 #8
