String parsing


The string below is a piece of a longer string of about 20000
characters returned from a web page. I need to isolate the number at
the end of the line containing 'LastUpdated'. I can find
'LastUpdated' with .find but not sure about how to isolate the
number. 'LastUpdated' is guaranteed to occur only once. Would
appreciate it if one of you string parsing whizzes would take a stab
at it.

Thanks,

jh

<input type="hidden" name="RFP" value="-1"/>
<!--<input type="hidden" name="EnteredBy" value="johnxxxx"/>-->
<input type="hidden" name="EnteredBy" value="john"/>
<input type="hidden" name="ServiceIndex" value="1"/>
<input type="hidden" name="LastUpdated" value="1178658863"/>
<input type="hidden" name="NextPage" value="../active/active.php"/>
<input type="hidden" name="ExistingStatus" value="10" ?>
<table width="98%" cellpadding="0" cellspacing="0" border="0"
align="center"

May 9 '07 #1
12 Replies


On Tue, 08 May 2007 22:09:52 -0300, HMS Surprise <jo**@datavoiceint.com>
wrote:
> The string below is a piece of a longer string of about 20000
> characters returned from a web page. I need to isolate the number at
> the end of the line containing 'LastUpdated'. I can find
> 'LastUpdated' with .find but not sure about how to isolate the
> number. 'LastUpdated' is guaranteed to occur only once. Would
> appreciate it if one of you string parsing whizzes would take a stab
> at it.
>
> <input type="hidden" name="RFP" value="-1"/>
> <!--<input type="hidden" name="EnteredBy" value="johnxxxx"/>-->
> <input type="hidden" name="EnteredBy" value="john"/>
> <input type="hidden" name="ServiceIndex" value="1"/>
> <input type="hidden" name="LastUpdated" value="1178658863"/>
> <input type="hidden" name="NextPage" value="../active/active.php"/>
> <input type="hidden" name="ExistingStatus" value="10" ?>
> <table width="98%" cellpadding="0" cellspacing="0" border="0"
> align="center"

You really should use an html parser here. But assuming that the page will
not change a lot its structure you could use a regular expression like
this:

import re

expr = re.compile(r'name\s*=\s*"LastUpdated"\s+value\s*=\s*"(.*?)"',
                  re.IGNORECASE)
number = expr.search(text).group(1)
(Handling of "not found" and "duplicate" cases is left as an exercise for
the reader)

Note that <input value="1178658863" type="hidden" name="LastUpdated" /> is
as valid as your html, but won't match the expression.
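
One way the "exercise for the reader" might be approached is sketched below. It assumes Python 3 (the thread itself uses Python 2 and Jython), and the names find_last_updated, INPUT_TAG and ATTR are made up for illustration. It matches each whole input tag first and then reads the attributes out of it, so the order of name= and value= no longer matters. It still assumes double-quoted attribute values that contain no '>' character, so it is not a substitute for a real parser.

```python
import re

# Match a whole <input ...> tag first, then pull the attributes out of it,
# so the order of name= and value= inside the tag does not matter.
INPUT_TAG = re.compile(r'<input\b[^>]*>', re.IGNORECASE)
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def find_last_updated(text):
    """Return the value of the single LastUpdated input, or raise ValueError."""
    # Drop HTML comments so commented-out inputs are ignored.
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
    values = []
    for tag in INPUT_TAG.findall(text):
        attrs = {key.lower(): value for key, value in ATTR.findall(tag)}
        if attrs.get('name') == 'LastUpdated' and 'value' in attrs:
            values.append(attrs['value'])
    if not values:
        raise ValueError('LastUpdated not found')
    if len(values) > 1:
        raise ValueError('LastUpdated found more than once')
    return values[0]


page = '''<input type="hidden" name="ServiceIndex" value="1"/>
<input value="1178658863" type="hidden" name="LastUpdated" />'''
print(find_last_updated(page))
```

Raising ValueError for both failure cases is only one possible choice; returning None for "not found" would work equally well if the caller prefers to test for it.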

--
Gabriel Genellina

May 9 '07 #2

On 8 May 2007 18:09:52 -0700, HMS Surprise <jo**@datavoiceint.com> wrote:
> The string below is a piece of a longer string of about 20000
> characters returned from a web page. I need to isolate the number at
> the end of the line containing 'LastUpdated'. I can find
> 'LastUpdated' with .find but not sure about how to isolate the
> number. 'LastUpdated' is guaranteed to occur only once. Would
> appreciate it if one of you string parsing whizzes would take a stab
> at it.

Does this help?

In [7]: s = '<input type="hidden" name="LastUpdated" value="1178658863"/>'

In [8]: int(s.split("=")[-1].split('"')[1])
Out[8]: 1178658863

There's probably a hundred different ways of doing this, but this is
the first that came to mind.

Cheers,

Tim
May 9 '07 #3

Thanks for posting. Could you recommend an HTML parser that can be
used with Python or Jython?
john
May 9 '07 #4

Yes it could, after I isolate that one string. Making sure that I
isolate that complete line and only that line is part of the problem.

thanks for posting.

jh

May 9 '07 #5

On May 8, 9:19 pm, HMS Surprise <j...@datavoiceint.com> wrote:
> Yes it could, after I isolate that one string. Making sure that I
> isolate that complete line and only that line is part of the problem.

It comes in as one large string...
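
Since the page arrives as one large string and 'LastUpdated' can already be located with .find, a minimal sketch of isolating the number with plain string methods might look like this. The helper name value_after is invented for illustration, and the sketch assumes that name= comes before value= inside the tag and that both use double quotes, as in the posted sample.

```python
def value_after(text, name):
    """Return the value="..." that follows name="<name>" in text, or None."""
    marker = 'name="%s"' % name
    start = text.find(marker)
    if start == -1:
        return None
    # Stay inside the same tag: stop searching at the next '>'.
    tag_end = text.find('>', start)
    if tag_end == -1:
        tag_end = len(text)
    key = 'value="'
    value_start = text.find(key, start, tag_end)
    if value_start == -1:
        return None
    value_start += len(key)
    value_end = text.find('"', value_start, tag_end)
    if value_end == -1:
        return None
    return text[value_start:value_end]


data = '''<input type="hidden" name="ServiceIndex" value="1"/>
<input type="hidden" name="LastUpdated" value="1178658863"/>
<input type="hidden" name="NextPage" value="../active/active.php"/>'''
print(int(value_after(data, 'LastUpdated')))
```

Because it never splits the text into lines, this does not depend on where the line breaks fall in the 20000-character string.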
May 9 '07 #6

On Tue, 08 May 2007 23:06:14 -0300, HMS Surprise <jo**@datavoiceint.com>
wrote:
> Thanks for posting. Could you recommend an HTML parser that can be
> used with Python or Jython?

Try BeautifulSoup, which handles malformed pages pretty well.

--
Gabriel Genellina

May 9 '07 #7

On 8 May 2007 19:06:14 -0700, HMS Surprise wrote:
> Thanks for posting. Could you recommend an HTML parser that can be
> used with Python or Jython?

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) makes HTML
parsing easy as pie, and sufficiently old versions seem to work with Jython. I
just tested this with Jython 2.2a1 and BeautifulSoup 1.x:

Jython 2.2a1 on java1.5.0_07 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""<input type="hidden" name="LastUpdated"
... value="1178658863"/>""")
>>> print soup.first('input', {'name':'LastUpdated'}).get('value')
1178658863

Hope this helps,

--
Carsten Haese
http://informixdb.sourceforge.net

May 9 '07 #8

Thanks all.

Carsten, you are here early and late. Do you ever sleep? ;^)

May 9 '07 #9

> This looks to be simple HTML (and I'm presuming that's a typo on
> that ? ending). A quick glance at the Python library reference (you do
> have a copy, don't you) reveals at least two HTML parsing modules...

No, that is not a typo, and it bears investigation. Thanks for the find.

I found HTMLParser but had trouble setting it up.

> About five minutes work gave me this:

My effort has been orders of magnitude greater in time.....

Thanks all for all the excellent suggestions.
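
For anyone else who has trouble setting up HTMLParser, a minimal sketch of one possible setup follows. It assumes Python 3, where the module is html.parser (in the Python 2 line discussed in this thread it is the HTMLParser module), and the names InputCollector and hidden_fields are invented for illustration. It is not the HTMLParser-based solution referred to elsewhere in the thread, which is not reproduced here.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name/value pairs of every <input> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag == 'input':
            attrs = dict(attrs)
            if 'name' in attrs:
                self.fields[attrs['name']] = attrs.get('value')


def hidden_fields(html):
    """Parse html and return a dict mapping input names to their values."""
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


sample = '''<input type="hidden" name="RFP" value="-1"/>
<!--<input type="hidden" name="EnteredBy" value="johnxxxx"/>-->
<input type="hidden" name="EnteredBy" value="john"/>
<input type="hidden" name="LastUpdated" value="1178658863"/>'''
print(hidden_fields(sample)['LastUpdated'])
```

The commented-out input is skipped because the parser hands comments to a separate handler, which is the main advantage over searching the raw text.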
jh

May 9 '07 #10

BTW, here's what I used; the other ideas have been squirreled away in
my neat tricks and methods folder.

for el in data.splitlines():
    # find() returns -1 when the substring is absent.
    if el.find('LastUpdated') > -1:
        s = el.split("=")[-1].split('"')[1]
        print 's:', s

Thanks again,

jh

May 9 '07 #11

On 9 May, 06:42, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
> [HTMLParser-based solution]

Here's another approach using libxml2dom [1] in HTML parsing mode:

import libxml2dom

# The text, courtesy of Dennis.
sample = """<input type="hidden" name="RFP" value="-1"/>
<!--<input type="hidden" name="EnteredBy" value="johnxxxx"/>-->
<input type="hidden" name="EnteredBy" value="john"/>
<input type="hidden" name="ServiceIndex" value="1"/>
<input type="hidden" name="LastUpdated" value="1178658863"/>
<input type="hidden" name="NextPage" value="../active/active.php"/>
<input type="hidden" name="ExistingStatus" value="10" />
<table width="98%" cellpadding="0" cellspacing="0" border="0"
align="center" >"""

# Parse the string in HTML mode.
d = libxml2dom.parseString(sample, html=1)

# For all input fields having the name 'LastUpdated',
# get the value attribute.
last_updated_fields = d.xpath("//input[@name='LastUpdated']/@value")

# Assuming we find one, print the contents of the value attribute.
print last_updated_fields[0].nodeValue

Paul

[1] http://www.python.org/pypi/libxml2dom

May 9 '07 #12

Dennis Lee Bieber wrote:
> I was trying to stay with a solution that should have been available
> in the version of Python equivalent to the Jython being used by the
> original poster. HTMLParser, according to the documents, was 2.2 level.

I guess I should read the whole thread before posting. ;-) I'll have
to look into libxml2 availability for Java, though, as it appears
(from various accounts) that some Java platform users struggle with
HTML parsing or have a really limited selection of decent and
performant parsers in that area.

Another thing for the "to do" list...

Paul

May 9 '07 #13
