  #1  
May 16th, 2006, 10:15 PM
rh0dium
Beautiful parse joy - Oh what fun

Hi all,

I am trying to parse an HTML table into a dictionary and I am having
all kinds of fun. Can someone please help me out?

What I want is this:

dic={'Division Code':'SALS','Employee':'LOO ABLE'}

Here is what I have..

html="""<table> <tr valign="top"><td width="24"><img
src="/icons/ecblank.gif" border="0" height="1" width="1" alt=""
/></td><td width="129"><b><font size="2" face="Arial">Division Code:
</font></b></td><td width="693"><font size="2"
face="Arial">SALS</font></td></tr> <tr valign="top"><td width="24"><img
src="/icons/ecblank.gif" border="0" height="1" width="1" alt="" /> <td
width="129"><b><font size="2" face="Arial">Employee:
</font></b></td> <td width="693"><font size="2"
face="Arial">LOO</font><b><font size="2" face="Arial"> </font></b><font
size="2" face="Arial">ABLE</font></td></tr></table> """


from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup()
soup.feed(html)

dic = {}
for row in soup('table')[0]('tr'):
    column = row('td')
    print column[1].findNext('font').string.strip(), \
        column[2].findNext('font').string.strip()
    dic[column[1].findNext('font').string.strip()] = \
        column[2].findNext('font').string.strip()

for key in dic.keys():
    print key, dic[key]

The problem is that I am missing the last name, ABLE. How can I get
"ALL" of the text? Clearly I have something wrong with my font
string, but I am not sure what it is.

Please and thanks!!

  #2  
May 16th, 2006, 10:45 PM
Larry Bates
Re: Beautiful parse joy - Oh what fun

In the last row you have three <font> tags. The first one
contains LOO, the second one contains only a space, and the
third one contains ABLE.

<td width="693"><font size="2" face="Arial">LOO</font><b>
<font size="2" face="Arial"> </font></b>
<font size="2" face="Arial">ABLE</font></td>

Your code reads only the first <font> tag in each cell, so it is
not expecting the second (blank) tag and never reaches the third.
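
To see the three fragments for yourself, here is a minimal sketch that lists every run of text inside that cell. It is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup, which is an assumption made only so the snippet runs without extra installs. The class name FragmentCollector is hypothetical.

```python
from html.parser import HTMLParser

# The value cell from the Employee row, as quoted above.
cell = ('<td width="693"><font size="2" face="Arial">LOO</font><b>'
        '<font size="2" face="Arial"> </font></b>'
        '<font size="2" face="Arial">ABLE</font></td>')


class FragmentCollector(HTMLParser):
    """Hypothetical helper: records every run of text between tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # Called once for each run of text found between two tags.
        self.fragments.append(data)


collector = FragmentCollector()
collector.feed(cell)
collector.close()

fragments = collector.fragments
# Drop whitespace-only fragments and join the rest with single spaces.
full_text = ' '.join(f.strip() for f in fragments if f.strip())
print(fragments)
print(full_text)
```

Reading only the first fragment gives LOO; joining all the non-blank ones gives the full name.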

-Larry Bates
  #3  
May 17th, 2006, 06:15 PM
KvS
Re: Beautiful parse joy - Oh what fun

Maybe a more robust approach is just to walk through the string,
keeping a counter that goes up at each "<" and down at each ">". All
the relevant text occurs right after a ">" has occurred that sets
your counter to 0 (meaning you're at the "highest level"). There's no
relevant text if the next character is again a "<".
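
One way this bracket-counting idea might look in Python 3 is sketched below. The function name text_outside_tags and the shortened sample string are made up for illustration. The sketch assumes that "<" and ">" never appear inside attribute values, comments, or scripts, which a real HTML parser would not need to assume.

```python
def text_outside_tags(markup):
    """Return the non-blank runs of text that lie outside any <...> pair."""
    depth = 0      # greater than 0 while we are inside a tag
    current = []   # characters of the run being collected
    runs = []
    for ch in markup:
        if ch == '<':
            # A tag is opening: close off the run collected so far.
            if depth == 0 and current:
                runs.append(''.join(current))
                current = []
            depth += 1
        elif ch == '>':
            depth = max(depth - 1, 0)
        elif depth == 0:
            current.append(ch)
    if current:
        runs.append(''.join(current))
    # Discard whitespace-only runs and trim the rest.
    return [run.strip() for run in runs if run.strip()]


# Hypothetical shortened version of the Employee value cell.
sample = ('<td><font size="2">LOO</font><b><font size="2"> </font></b>'
          '<font size="2">ABLE</font></td>')
runs = text_outside_tags(sample)
print(runs)
```

Note that this returns a flat list of text runs and knows nothing about rows or cells, so pairing field names with values would still need extra bookkeeping.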

  #4  
May 17th, 2006, 06:55 PM
George Sakkis
Re: Beautiful parse joy - Oh what fun

Here's one way to do it:

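import re
from BeautifulSoup import BeautifulSoup

# Matches any non-empty text node.
_any_re = re.compile('.+')

d = {}
for row in BeautifulSoup(html).fetch('tr'):
    columns = row.fetch('td')
    # Field name: first text in the second cell, minus the trailing colon.
    field = columns[1].firstText(_any_re).rstrip(' \t\n:')
    # Value: every non-blank text fragment in the third cell, joined by spaces.
    value = ' '.join(text.strip()
                     for text in columns[2].fetchText(_any_re)
                     if text.strip())
    d[field] = value
print d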
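
For readers without the old BeautifulSoup module, the same row-by-row idea can be sketched in Python 3 with only the standard library's html.parser. This is a substitute technique, not the code from this thread, and the class name TableToDict is hypothetical. It assumes, as the loop above does, that the field name is in the second cell of each row and the value in the third.

```python
from html.parser import HTMLParser

html = """<table> <tr valign="top"><td width="24"><img
src="/icons/ecblank.gif" border="0" height="1" width="1" alt=""
/></td><td width="129"><b><font size="2" face="Arial">Division Code:
</font></b></td><td width="693"><font size="2"
face="Arial">SALS</font></td></tr> <tr valign="top"><td width="24"><img
src="/icons/ecblank.gif" border="0" height="1" width="1" alt="" /> <td
width="129"><b><font size="2" face="Arial">Employee:
</font></b></td> <td width="693"><font size="2"
face="Arial">LOO</font><b><font size="2" face="Arial"> </font></b><font
size="2" face="Arial">ABLE</font></td></tr></table> """


class TableToDict(HTMLParser):
    """Hypothetical helper: maps the 2nd cell of each row to the 3rd."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self.cells = None  # text fragments per cell of the current row

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cells = []
        elif tag == 'td' and self.cells is not None:
            # A new <td> starts a new cell even if the previous one was
            # never closed, as happens in the second row of the sample.
            self.cells.append([])

    def handle_data(self, data):
        # Keep only non-blank text that falls inside a cell.
        if self.cells and data.strip():
            self.cells[-1].append(data.strip())

    def handle_endtag(self, tag):
        if tag == 'tr' and self.cells is not None:
            if len(self.cells) >= 3:
                field = ' '.join(self.cells[1]).rstrip(' :')
                self.result[field] = ' '.join(self.cells[2])
            self.cells = None


parser = TableToDict()
parser.feed(html)
parser.close()
d = parser.result
print(d)
```

Starting a new cell on every opening td tag, rather than waiting for a closing one, is what lets the sketch cope with the missing closing tag in the second row.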

George

 
