Bytes IT Community

BeautifulSoup to get string inner 'p' and 'a' tags

I'm trying to get the 'FOO' string, but the problem is that inside the 'p'
tag there is another tag, 'a'. So:
>>> from BeautifulSoup import BeautifulSoup
>>> s = '<td width="88%" valign="TOP"><p class="contentBody">FOO <a name="f"></a></p></td>'
>>> tree = BeautifulSoup(s)
>>> print tree.first('p')
<p class="contentBody">FOO <a name="f"></a></p>

So if I run 'print tree.first('p').string' to get the 'FOO' string, it
returns a Null value because of the 'a' tag:
>>> print tree.first('p').string
Null

Any solution?

Jul 24 '06 #1
3 Replies


GinTon wrote:
> So if I run 'print tree.first('p').string' to get the 'FOO' string, it
> returns a Null value because of the 'a' tag:
> >>> print tree.first('p').string
> Null
>
> Any solution?

In [53]: print tree.first('p').contents[0]
FOO

Ciao,
Marc 'BlackJack' Rintsch
Jul 24 '06 #2
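
For readers on a current setup, here is a minimal sketch of the same idea. It assumes Python 3 and the newer `bs4` package rather than the old `BeautifulSoup` module and `first()` method used in this thread; the variable names are only illustrative. It shows why `.string` comes back empty (the 'p' tag has two children, a text node and the 'a' tag) and how `contents[0]` picks out the text node.

```python
# Assumption: the bs4 package (Beautiful Soup 4) on Python 3,
# not the older BeautifulSoup module used in the thread.
from bs4 import BeautifulSoup

html = '<td width="88%" valign="TOP"><p class="contentBody">FOO <a name="f"></a></p></td>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# <p> has two children (a text node and an <a> tag), so .string is None.
print(p.string)            # None

# The first child is the text node; note the trailing space.
first_text = str(p.contents[0])
print(repr(first_text))    # 'FOO '

# get_text() collects all text under <p>, including any text inside <a>.
all_text = p.get_text(strip=True)
print(all_text)            # FOO
```

`contents[0]` only works when the wanted text really is the first child; `get_text()` is the safer choice if the text inside nested tags should be included as well.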


Marc 'BlackJack' Rintsch wrote:
In [53]: print tree.first('p').contents[0]
FOO
Thanks! I was going crazy with this.

Jul 24 '06 #3

Quick-n-dirty way:
After you get your whole p string: <p class="contentBody">FOO <a
name="f"></a></p>
remove any tags delimited by '<' and '>' with a regex. Your short
example doesn't show whether there might be something between the <a>
and </a> tags, so I assume either there won't be anything or, if there
is something, you also want it included in the final text. As in
'<p class="contentBody">FOO <a name="f">URLNAME</a></p>' == 'FOO
URLNAME'

For the regex, start with something simple like <.*?> and see if it
works, then improve it. Use kiki or kodos, which are Python visual regex
helpers.

Hope this helps,
Nick V.
Jul 24 '06 #4
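
The regex suggestion in the last reply can be sketched roughly as follows. This is only an illustration of the "start with <.*?>" idea using the standard `re` module; the helper name `strip_tags` and the pattern name `TAG_RE` are made up for this example, and the pattern is deliberately naive (it would misfire on a '>' inside an attribute value or on HTML comments).

```python
import re

# Naive pattern from the reply: anything from '<' to the next '>'.
TAG_RE = re.compile(r"<.*?>")

def strip_tags(fragment):
    """Remove everything that looks like a tag, keeping the text between tags."""
    return TAG_RE.sub("", fragment)

empty_link = '<p class="contentBody">FOO <a name="f"></a></p>'
named_link = '<p class="contentBody">FOO <a name="f">URLNAME</a></p>'

print(repr(strip_tags(empty_link)))   # 'FOO '
print(repr(strip_tags(named_link)))   # 'FOO URLNAME'
```

As the reply notes, text inside the 'a' tag ends up in the result, so this suits the case where that text is wanted; for anything beyond short, well-formed fragments a real parser is usually the more robust route.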
