Quick-n-dirty way:
After you get your whole p string: <p class="contentBody">FOO <a
name="f"></a></p>
remove any tags delimited by '<' and '>' with a regex. Your short
example doesn't show whether there might be something between the <a>
and </a> tags, so I assume either there won't be anything, or if there
is, you also want it included in the final text. As in
'<p class="contentBody">FOO <a name="f">URLNAME</a></p>' becomes 'FOO
URLNAME'.
For the regex, start with something simple like <.*?> and see if it
works, then improve it. Use kiki or kodos - python visual regex
helpers.
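
A minimal sketch of the idea above, assuming the p string is already in hand as plain text. The names TAG_RE and strip_tags are made up for illustration, and a pattern this simple will trip over a '>' inside an attribute value or a comment, so treat it as a starting point only:

```python
import re

# Non-greedy: match '<', then as little as possible, up to the next '>'.
TAG_RE = re.compile(r'<.*?>')

def strip_tags(html):
    """Drop anything that looks like a tag and trim the leftover whitespace."""
    return TAG_RE.sub('', html).strip()

p1 = '<p class="contentBody">FOO <a name="f"></a></p>'
p2 = '<p class="contentBody">FOO <a name="f">URLNAME</a></p>'

print(strip_tags(p1))
print(strip_tags(p2))
```

The non-greedy .*? matters here: a greedy <.*> would match from the first '<' to the last '>' and swallow the text in between.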
Hope this helps,
Nick V.
GinTon wrote:
I'm trying to get the 'FOO' string, but the problem is that inside the
'p' tag there is another tag, 'a'. So:
from BeautifulSoup import BeautifulSoup
s = '<td width="88%" valign="TOP"><p class="contentBody">FOO <a name="f"></a></p></td>'
tree = BeautifulSoup(s)
print tree.first('p')
<p class="contentBody">FOO <a name="f"></a></p>
So if I run 'print tree.first('p').string' to get the 'FOO' string, it
shows a Null value because of the 'a' tag:
print tree.first('p').string
Null
Any solution?