Bytes IT Community

extracting a substring

Hi,
I have a bunch of strings like
a53bc_531.txt
a53bc_2285.txt
....
a53bc_359.txt

and I want to extract the numbers 531, 2285, ...,359.

One thing for sure is that these numbers are the ONLY part that is
changing; all the other characters are always fixed.

I know I should use regular expressions, but I'm not familiar with
python, so any quick help would help, such as which commands or idioms
to use. Thanks a lot!

Apr 19 '06 #1
5 Replies


On Tue, 2006-04-18 at 17:25 -0700, b8*******@yahoo.com wrote:
Hi,
I have a bunch of strings like
a53bc_531.txt
a53bc_2285.txt
...
a53bc_359.txt

and I want to extract the numbers 531, 2285, ...,359.


Some ways:

1) Regular expressions, as you said:

>>> from re import compile
>>> find = compile("a53bc_([0-9]+)\\.txt").findall
>>> find('a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt')
['531', '2285', '359']

2) Using ''.split:

>>> [x.split('.')[0].split('_')[1] for x in 'a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt'.splitlines()]
['531', '2285', '359']

3) Using indexes (be careful!):

>>> [x[6:-4] for x in 'a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt'.splitlines()]
['531', '2285', '359']
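
One caveat on digit character classes, in case it helps: a class such as `[1-9]*` leaves out the digit 0, so a name like a53bc_1024.txt would be silently skipped. Here is a minimal sketch of the difference, assuming the names arrive as a Python list. The sample names and the helper `extract_numbers` are made up for illustration, and `fullmatch` needs Python 3.4 or later.

```python
import re

# Hypothetical sample data; 1024 and 70 both contain the digit 0.
filenames = ["a53bc_531.txt", "a53bc_1024.txt", "a53bc_70.txt"]

NARROW = re.compile(r"a53bc_([1-9]*)\.txt")  # class that leaves out the digit 0
WIDER = re.compile(r"a53bc_([0-9]+)\.txt")   # assumption: any run of decimal digits

def extract_numbers(pattern, names):
    """Return the captured group for every name the pattern fully matches."""
    found = []
    for name in names:
        m = pattern.fullmatch(name)
        if m:
            found.append(m.group(1))
    return found

narrow_result = extract_numbers(NARROW, filenames)
wider_result = extract_numbers(WIDER, filenames)
print(narrow_result)
print(wider_result)
```

With these sample names the narrow class only finds 531, while the wider one finds all three numbers.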

Measuring speeds:

$ python2.4 -m timeit -s 'from re import compile; find = compile("a53bc_([1-9]*)\\.txt").findall; s = "a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt"' 'find(s)'
100000 loops, best of 3: 3.03 usec per loop

$ python2.4 -m timeit -s 's = "a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n"[:-1]' "[x.split('.')[0].split('_')[1] for x in s.splitlines()]"
100000 loops, best of 3: 7.64 usec per loop

$ python2.4 -m timeit -s 's = "a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n"[:-1]' "[x[6:-4] for x in s.splitlines()]"
100000 loops, best of 3: 2.47 usec per loop

$ python2.4 -m timeit -s 'from re import compile; find = compile("a53bc_([1-9]*)\\.txt").findall; s = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n"*1000)[:-1]' 'find(s)'
1000 loops, best of 3: 1.95 msec per loop

$ python2.4 -m timeit -s 's = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n" * 1000)[:-1]' "[x.split('.')[0].split('_')[1] for x in s.splitlines()]"
100 loops, best of 3: 6.51 msec per loop

$ python2.4 -m timeit -s 's = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n" * 1000)[:-1]' "[x[6:-4] for x in s.splitlines()]"
1000 loops, best of 3: 1.53 msec per loop

Summary: using indexes is less powerful than regexps, but faster.
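
For anyone who wants to repeat the comparison from inside a script rather than the shell, here is a rough sketch using the standard `timeit` module. The function names `by_regex`, `by_split` and `by_slice` are made up for illustration, the regex uses `[0-9]+` rather than the class in the commands above, and the absolute numbers will differ from the 2006 figures on any modern machine.

```python
import re
import timeit

# Same sample data as the shell commands, repeated 1000 times.
s = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n" * 1000)[:-1]
find = re.compile(r"a53bc_([0-9]+)\.txt").findall

def by_regex():
    return find(s)

def by_split():
    return [x.split(".")[0].split("_")[1] for x in s.splitlines()]

def by_slice():
    return [x[6:-4] for x in s.splitlines()]

# All three must agree before their timings are worth comparing.
assert by_regex() == by_split() == by_slice()

timings = {}
for func in (by_regex, by_split, by_slice):
    # Best of 3 runs of 100 calls each, reported per call in seconds.
    best = min(timeit.repeat(func, number=100, repeat=3))
    timings[func.__name__] = best / 100
    print(func.__name__, timings[func.__name__])
```

Checking that the three approaches return the same list first guards against timing a variant that is fast only because it is wrong.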

HTH,

--
Felipe.

Apr 19 '06 #2

b8*******@yahoo.com wrote:
Hi,
I have a bunch of strings like
a53bc_531.txt
a53bc_2285.txt
...
a53bc_359.txt

and I want to extract the numbers 531, 2285, ...,359.

One thing for sure is that these numbers are the ONLY part that is
changing; all the other characters are always fixed.

I know I should use regular expressions, but I'm not familiar with
python, so any quick help would help, such as which commands or idioms
to use. Thanks a lot!

Try this:

>>> import re
>>> pattern = re.compile(r"a53bc_([0-9]+)\.txt")
>>> s = "a53bc_531.txt"
>>> match = pattern.match(s)
>>> if match:
...     print(int(match.group(1)))
... else:
...     print("No match")
...
531
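
A detail that is easy to miss in patterns like this: an unescaped `.` matches any single character, not just a literal dot. A small sketch of the difference follows; the name `a53bc_531_txt` is a made-up example, and anchoring with `$` is my own addition rather than something from the thread.

```python
import re

loose = re.compile(r"a53bc_([0-9]*).txt")     # "." matches any single character
strict = re.compile(r"a53bc_([0-9]+)\.txt$")  # literal dot, at least one digit, anchored at the end

odd_name = "a53bc_531_txt"  # hypothetical name with no real extension

loose_hit = loose.match(odd_name) is not None
strict_hit = strict.match(odd_name) is not None
good = strict.match("a53bc_531.txt").group(1)
print(loose_hit, strict_hit, good)
```

The loose pattern accepts the odd name because `.` happily matches the underscore; the strict one rejects it and still extracts 531 from the well-formed name.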


Hope that helps,
Gary Herron
Apr 19 '06 #3

You don't need a regex for this. As long as the prefix and suffix are fixed
lengths, the following will do:

>>> "a53bc_531.txt"[6:-4]
'531'
>>> "a53bc_2285.txt"[6:-4]
'2285'
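
If the magic numbers 6 and -4 feel fragile, one option is to derive the slice bounds from the fixed prefix and suffix and to check them first. This is only a sketch: the constants `PREFIX` and `SUFFIX` and the helper `number_part` are names I made up, with the values taken from the file names in the question.

```python
PREFIX = "a53bc_"  # fixed leading part from the question
SUFFIX = ".txt"    # fixed trailing part

def number_part(name):
    """Return the text between PREFIX and SUFFIX, or raise ValueError."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        raise ValueError("unexpected file name: %r" % name)
    # Same slice as name[6:-4], but computed from the actual lengths.
    return name[len(PREFIX):len(name) - len(SUFFIX)]

print(number_part("a53bc_531.txt"))
print(number_part("a53bc_2285.txt"))
```

The check costs a little speed but turns a silently wrong slice into an explicit error when a name does not fit the expected shape.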

b8*******@yahoo.com wrote:
Hi,
I have a bunch of strings like
a53bc_531.txt
a53bc_2285.txt
...
a53bc_359.txt

and I want to extract the numbers 531, 2285, ...,359.

One thing for sure is that these numbers are the ONLY part that is
changing; all the other characters are always fixed.

I know I should use regular expressions, but I'm not familiar with
python, so any quick help would help, such as which commands or idioms
to use. Thanks a lot!


--
Dale Strickland-Clark
Riverhall Systems - www.riverhall.co.uk

Apr 19 '06 #4

b8*******@yahoo.com wrote:
Hi,
I have a bunch of strings like
a53bc_531.txt
a53bc_2285.txt
...
a53bc_359.txt

and I want to extract the numbers 531, 2285, ...,359.

One thing for sure is that these numbers are the ONLY part that is
changing; all the other characters are always fixed.


In that case a fixed slice will do what you want:

In [1]: s='a53bc_531.txt'

In [2]: s[6:-4]
Out[2]: '531'

Kent
Apr 19 '06 #5

rx
and I want to extract the numbers 531, 2285, ...,359.

One thing for sure is that these numbers are the ONLY part that is
changing; all the other characters are always fixed.


I'm not sure what you mean by "always fixed", but I guess it means that
you have n files with one fixed start and a changing ending, m files with
another fixed start and a changing ending, and so on.

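import re
filenames = ['ac99_124.txt', 'ac99_344.txt', 'ac99_445.txt']
# Compile once, outside the loop; escape the dot so it matches literally.
pattern = re.compile(r'[^_]*_(?P<number>[^.]*)\.txt')
numbers = []
for name in filenames:
    numbers.append(int(pattern.match(name).group('number')))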

this sets numbers to: [124, 344, 445]
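
If the list may also contain names that do not fit the scheme, calling `.group()` on a failed match raises AttributeError. Here is a hedged variant that skips such names instead. The function `numbers_from` and the extra name `notes.txt` are made up for illustration, and restricting the group to digits is my assumption about what the numbers look like.

```python
import re

# Compiled once; the named group "number" captures the digits between
# the underscore and the literal ".txt" suffix.
PATTERN = re.compile(r"[^_]*_(?P<number>[0-9]+)\.txt$")

def numbers_from(filenames):
    """Return the numbers from matching names, skipping everything else."""
    result = []
    for name in filenames:
        m = PATTERN.match(name)
        if m is None:
            continue  # name does not follow the prefix_number.txt scheme
        result.append(int(m.group("number")))
    return result

numbers = numbers_from(["ac99_124.txt", "ac99_344.txt", "notes.txt", "ac99_445.txt"])
print(numbers)
```

Whether to skip or to raise on an unexpected name depends on how much you trust the directory contents.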
Apr 19 '06 #6

This discussion thread is closed
