
splitting delimited strings

What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

What's the most efficient way to process this? Failing all
else I will split the string into characters and use a FSM,
but it seems that's not very pythonesque.

@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :@@: :@@@@: @

(this is from a perforce journal file, btw)
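A character-at-a-time scan along those lines might look something like this rough sketch (illustrative only; the function name is made up):

def scan_line(line):
    """Rough sketch of the character-scan fallback: fields are separated
    by spaces, @-quoted fields may contain a doubled @@ for a literal @."""
    fields, buf, i, n = [], [], 0, len(line)
    while i < n:
        c = line[i]
        if c == ' ':                       # outside quotes, a space ends a field
            if buf:
                fields.append(''.join(buf))
                buf = []
            i += 1
        elif c == '@':                     # start of an @-quoted field
            i += 1
            while i < n:
                if line[i] == '@':
                    if i + 1 < n and line[i + 1] == '@':
                        buf.append('@')    # doubled @@ -> literal @
                        i += 2
                        continue
                    i += 1                 # closing @
                    break
                buf.append(line[i])
                i += 1
            fields.append(''.join(buf))
            buf = []
        else:                              # bare, unquoted token
            buf.append(c)
            i += 1
    if buf:
        fields.append(''.join(buf))
    return fields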

Many TIA!
Mark

--
Mark Harrison
Pixar Animation Studios
Jul 19 '05 #1
10 Replies


Mark Harrison wrote:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.
Have you taken a look at the csv module yet? No guarantees, but it may
just work. You'd have to set delimiter to ' ' and quotechar to '@'. You
may need to manually handle the double-@ thing, but why don't you see
how close you can get with csv?
@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :@@: :@@@@: @

(this is from a perforce journal file, btw)
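A minimal sketch of that suggestion (untested; the filename is made up):

import csv

# sketch: space-delimited fields, @ as the quote character; csv's
# doublequote option defaults to True, which turns a doubled @@ inside
# a quoted field back into a single @
for row in csv.reader(open('journal.txt', 'rb'), delimiter=' ', quotechar='@'):
    print row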

--
Paul McNett
http://paulmcnett.com

Jul 19 '05 #2

You could use regular expressions... it's an FSM of some kind, but it's
faster *g*. Check this snippet out:

import re

def ifelse(cond, a, b):
    # helper: call a() or b() depending on cond
    # (stand-in for a pre-2.5 conditional expression)
    if cond:
        return a()
    return b()

def mysplit(s):
    pattern = '((?:"[^"]*")|(?:[^ ]+))'
    tmp = re.split(pattern, s)
    res = [ifelse(i[0] in ('"', "'"), lambda: i[1:-1], lambda: i)
           for i in tmp if i.strip()]
    return res

mysplit('foo bar "baz foo" bar "baz"')

['foo', 'bar', 'baz foo', 'bar', 'baz']
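Adapting the same idea to the @-quoting in the original question, with @@ as the escape, might look something like this (a sketch; the names are made up and it hasn't been run against a real journal file):

import re

# sketch: a field is either an @-quoted string (where @@ means a
# literal @) or a bare run of non-space characters
_field = re.compile(r'@((?:[^@]|@@)*)@|(\S+)')

def split_at_quoted(line):
    out = []
    for quoted, bare in _field.findall(line):
        if bare:
            out.append(bare)
        else:
            out.append(quoted.replace('@@', '@'))
    return out

split_at_quoted('@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44')
# -> ['rv', '2', 'db.locks', '//depot/hello.txt', 'mh', 'mh', '1', '1', '44']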

Jul 19 '05 #3

On Wed, 15 Jun 2005 23:03:55 +0000, Mark Harrison wrote:
What's the most efficient way to process this? Failing all
else I will split the string into characters and use a FSM,
but it seems that's not very pythonesque.


like this ?
>>> s = "@hello@world@@foo@bar"
>>> s.split("@")
['', 'hello', 'world', '', 'foo', 'bar']
>>> s2 = "hello@world@@foo@bar"
>>> s2
'hello@world@@foo@bar'
>>> s2.split("@")
['hello', 'world', '', 'foo', 'bar']


bye
Jul 19 '05 #4

Mark Harrison wrote:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

What's the most efficient way to process this? Failing all
else I will split the string into characters and use a FSM,
but it seems that's not very pythonesque.

@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :@@: :@@@@: @

>>> import csv
>>> list(csv.reader(file('at_quotes.txt', 'rb'), delimiter=' ', quotechar='@'))
[['rv', '2', 'db.locks', '//depot/hello.txt', 'mh', 'mh', '1', '1', '44'],
 ['pv', '0', 'db.changex', '44', '44', 'mh', 'mh', '1118875308', '0', ' :@: :@@: ']]
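(The doubled @@ comes back as a single @ because csv.reader's doublequote option defaults to True; spelling that out as a quick check:)

>>> list(csv.reader(['@a@@b@'], delimiter=' ', quotechar='@', doublequote=True))
[['a@b']]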

Jul 19 '05 #5

Nicola Mingotti wrote:
On Wed, 15 Jun 2005 23:03:55 +0000, Mark Harrison wrote:

What's the most efficient way to process this? Failing all
else I will split the string into characters and use a FSM,
but it seems that's not very pythonesque.

like this ?


No, not like that. The OP said that an embedded @ was doubled.

>>> s = "@hello@world@@foo@bar"
>>> s.split("@")
['', 'hello', 'world', '', 'foo', 'bar']
>>> s2 = "hello@world@@foo@bar"
>>> s2
'hello@world@@foo@bar'
>>> s2.split("@")
['hello', 'world', '', 'foo', 'bar']

bye

Jul 19 '05 #6

Paul McNett <p@ulmcnett.com> wrote:
Mark Harrison wrote:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.


Have you taken a look at the csv module yet? No guarantees, but it may
just work. You'd have to set delimiter to ' ' and quotechar to '@'. You
may need to manually handle the double-@ thing, but why don't you see
how close you can get with csv?


This is great! Everything works perfectly. Even the double-@ thing
is handled by the default quotechar handling.

Thanks again,
Mark

--
Mark Harrison
Pixar Animation Studios
Jul 19 '05 #7

Mark Harrison wrote:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

>>> import re
>>> _at_re = re.compile('(?<!@)@(?!@)')
>>> def split_at_line(line):
...     return [field.replace('@@', '@') for field in _at_re.split(line)]
...
>>> split_at_line('foo@bar@@baz@qux')
['foo', 'bar@baz', 'qux']
Jul 19 '05 #8

Mark -

Let me weigh in with a pyparsing entry to your puzzle. It won't be
blazingly fast, but at least it will give you another data point in
your comparison of approaches. Note that the parser can do the
string-to-int conversion for you during the parsing pass.

If @rv@ and @pv@ are record type markers, then you can use pyparsing to
create more of a parser than just a simple tokenizer, and parse out the
individual record fields into result attributes.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

test1 = "@hello@@world@@foo@bar"
test2 = """@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :@@: :@@@@: @"""

from pyparsing import *

AT = Literal("@")
atQuotedString = (AT.suppress() +
                  Combine(OneOrMore((~AT + SkipTo(AT)) |
                                    (AT + AT).setParseAction(replaceWith("@")))) +
                  AT.suppress())

# extract any @-quoted strings
for test in (test1, test2):
    for toks, s, e in atQuotedString.scanString(test):
        print toks
    print

# parse all tokens (assume either a positive integer or @-quoted string)
def makeInt(s, l, toks):
    return int(toks[0])

entry = OneOrMore(Word(nums).setParseAction(makeInt) | atQuotedString)

for t in test2.split("\n"):
    print entry.parseString(t)

Prints out:

['hello@world@foo']

['rv']
['db.locks']
['//depot/hello.txt']
['mh']
['mh']
['pv']
['db.changex']
['mh']
['mh']
[':@: :@@: ']

['rv', 2, 'db.locks', '//depot/hello.txt', 'mh', 'mh', 1, 1, 44]
['pv', 0, 'db.changex', 44, 44, 'mh', 'mh', 1118875308, 0, ':@: :@@: ']

Jul 19 '05 #9

On Thu, 16 Jun 2005 09:36:56 +1000, John Machin wrote:
like this ?


No, not like that. The OP said that an embedded @ was doubled.


you are right, sorry :)

anyway, if @@ -> @, then what does an empty field map to?
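For what it's worth, with the csv settings suggested earlier a standalone @@ should come through as an empty field, while @@ inside a quoted field is the escaped @ (a quick check, assuming the same delimiter/quotechar):

>>> list(csv.reader(['@@ @x@@y@'], delimiter=' ', quotechar='@'))
[['', 'x@y']]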

Jul 19 '05 #10

Leif K-Brooks wrote:
Mark Harrison wrote:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.


>>> import re
>>> _at_re = re.compile('(?<!@)@(?!@)')
>>> def split_at_line(line):
...     return [field.replace('@@', '@') for field in _at_re.split(line)]
...
>>> split_at_line('foo@bar@@baz@qux')
['foo', 'bar@baz', 'qux']


The plot according to the OP was that the @s were quotes, NOT delimiters.
Jul 19 '05 #11

This discussion thread is closed

Replies have been disabled for this discussion.