Extract information from HTML table

Hello,

I'm trying to extract data from an HTML table. Here is part of
the HTML source:
"""
<tr>
<td class="tdn" valign="top">
<input name="x44553130" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628033">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top">
<input name="x44553032" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:14:35</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628007">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top">
<input name="x44552991" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn">Vous avez bien accompli votre
tâche de Gardien de Cimetière et vous vous
voyez remis votre salaire comme récompense.
Vous recevez 320
<img src="messages-bite_fichiers/res2.gif"
alt="Or" align="absmiddle" border="0">
et collectez 3 d'expérience !</td>
</tr>
"""

I would like to transform this into the following:

Date : Sat, 31.03.2007 - 20:24:00
ContainType : Link
LinkText : Vous avez tendu une embuscade à votre victime !
LinkURL : http://s2.bitefight.fr/bite/bericht....p;beid=2628033

Date : Sat, 31.03.2007 - 20:14:35
ContainType : Link
LinkText : Vous avez tendu une embuscade à votre victime !
LinkURL : http://s2.bitefight.fr/bite/bericht....p;beid=2628007

Date : Sat, 31.03.2007 - 20:11:39
ContainType : Text
Contain : Vous avez bien accompli votre tâche de Gardien de Cimetière
et vous vous
voyez remis votre salaire comme récompense.
Vous recevez 320 et collectez 3 d'expérience !

.....

Do you know a way to do this?

Thanks

Apr 1 '07 #1
On Apr 1, 10:13 pm, "Ulysse" <maxim...@gmail.com> wrote:
> Hello,
>
> I'm trying to extract data from an HTML table. Here is part of
> the HTML source:
>
> ....
>
> Do you know a way to do this?
You can use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

See this page for how to search for tags and then retrieve their
contents:

http://www.crummy.com/software/Beaut...20Parse%20Tree

Cheers

Apr 1 '07 #2
On Apr 1, 3:13 pm, "Ulysse" <maxim...@gmail.com> wrote:
> Hello,
>
> I'm trying to extract data from an HTML table. Here is part of
> the HTML source:
>
> ....
>
> Do you know a way to do this?
Beautiful Soup is an easy way to parse HTML, even when it is broken:
http://www.crummy.com/software/BeautifulSoup/

Here's the start of a parser for your HTML:

    from bs4 import BeautifulSoup  # Beautiful Soup 4 package name

    soup = BeautifulSoup(txt, 'html.parser')
    for tr in soup('tr'):
        # Skip the first cell (the checkbox); unpack the date and message cells.
        dateTd, textTd = tr('td')[1:]
        print('Date :', dateTd.contents[0].strip())
        print(textTd)  # element still needs parsing

where txt is the string in your message.

Apr 1 '07 #3
On Apr 1, 2:52 pm, irs...@gmail.com wrote:
> On Apr 1, 3:13 pm, "Ulysse" <maxim...@gmail.com> wrote:
> > Hello,
> > I'm trying to extract data from an HTML table. Here is part of
> > the HTML source:
> > ....
> > Do you know a way to do this?
>
> Beautiful Soup is an easy way to parse HTML, even when it is broken:
> http://www.crummy.com/software/BeautifulSoup/
>
> Here's the start of a parser for your HTML:
>
>     from bs4 import BeautifulSoup  # Beautiful Soup 4 package name
>
>     soup = BeautifulSoup(txt, 'html.parser')
>     for tr in soup('tr'):
>         # Skip the first cell (the checkbox); unpack the date and message cells.
>         dateTd, textTd = tr('td')[1:]
>         print('Date :', dateTd.contents[0].strip())
>         print(textTd)  # element still needs parsing
>
> where txt is the string in your message.
I have looked at the Beautiful Soup online help and tried to apply it to
my problem, but it seems a bit hard. I would rather try to
do this with regular expressions...

Apr 1 '07 #4
On 1 Apr 2007 07:56:04 -0700, Ulysse <ma******@gmail.com> wrote:
> I have looked at the Beautiful Soup online help and tried to apply it to
> my problem, but it seems a bit hard. I would rather try to
> do this with regular expressions...
If you think that Beautiful Soup is difficult, then wait till you try
to do this with regexes. Granted, knowing the exact format of the HTML
you are scraping will help, but if you ever need to parse HTML from an
unknown source, then Beautiful Soup is the only way to go. Not all HTML
authors close their td and tr tags, and sometimes those tags carry
attributes. If you plan on ever reusing the code, or if the format of
the HTML may change, then you are best off sticking with Beautiful
Soup.
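
To make the point about attributes and unclosed tags concrete, here is a minimal sketch using only the standard `re` module. The pattern and the three sample cells are made up for illustration (they are not taken from the real page); the sketch only shows how a regex tuned to one exact layout silently stops matching when the markup shifts.

```python
import re

# Hypothetical pattern tuned to one exact layout: <td class="tdn">...</td>
CELL = re.compile(r'<td class="tdn">(.*?)</td>', re.DOTALL)

# The layout the pattern was written for.
exact = '<td class="tdn">Sat, 31.03.2007 - 20:24:00</td>'
# Same cell, but the author added an attribute.
with_attr = '<td class="tdn" valign="top">Sat, 31.03.2007 - 20:24:00</td>'
# Same cell, but the closing </td> was left out.
unclosed = '<td class="tdn">Sat, 31.03.2007 - 20:24:00<td class="tdn">next'

print(CELL.findall(exact))      # one match: the date text
print(CELL.findall(with_attr))  # empty list: the extra attribute breaks the match
print(CELL.findall(unclosed))   # empty list: there is no </td> to anchor on
```

Neither failing case raises an error; the data just goes missing, which is what makes this approach hard to maintain.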

Dotan Cohen

http://lyricslist.com/
http://what-is-what.com/
Apr 1 '07 #5
On Apr 2, 12:54 am, "Dotan Cohen" <dotanco...@gmail.com> wrote:
> On 1 Apr 2007 07:56:04 -0700, Ulysse <maxim...@gmail.com> wrote:
> > I have looked at the Beautiful Soup online help and tried to apply it to
> > my problem, but it seems a bit hard. I would rather try to
> > do this with regular expressions...
>
> If you think that Beautiful Soup is difficult, then wait till you try
> to do this with regexes. Granted, knowing the exact format of the HTML
> you are scraping will help, but if you ever need to parse HTML from an
> unknown source, then Beautiful Soup is the only way to go. Not all HTML
> authors close their td and tr tags, and sometimes those tags carry
> attributes. If you plan on ever reusing the code, or if the format of
> the HTML may change, then you are best off sticking with Beautiful
> Soup.
>
> Dotan Cohen
>
> http://lyricslist.com/
> http://what-is-what.com/

Have you tried HTMLParser? It can do the task you want to perform:
http://docs.python.org/lib/module-HTMLParser.html
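
As a rough sketch of what that could look like (not a definitive implementation): the example below uses `html.parser`, which is the Python 3 name of the HTMLParser module. The class name `RowParser`, the shortened sample rows, and the `example.invalid` URL are assumptions made up for illustration; the output keys follow the format asked for in the first post.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class RowParser(HTMLParser):
    """Collect one dict per <tr>: the date cell plus a link or plain text."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self.cells = None  # text chunks per <td> of the current row
        self.link = None   # href of the <a> in the current row, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells, self.link = [], None
        elif tag == "td" and self.cells is not None:
            self.cells.append([])
        elif tag == "a" and self.cells:
            # Attribute values arrive with entities such as &amp; already decoded.
            self.link = dict(attrs).get("href")

    def handle_data(self, data):
        if self.cells:  # ignore text outside any cell
            self.cells[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self.cells is not None:
            # Collapse runs of whitespace inside each cell.
            texts = [" ".join("".join(c).split()) for c in self.cells]
            if len(texts) >= 3:  # checkbox cell, date cell, message cell
                row = {"Date": texts[1]}
                if self.link:
                    row.update(ContainType="Link", LinkText=texts[2],
                               LinkURL=self.link)
                else:
                    row.update(ContainType="Text", Contain=texts[2])
                self.rows.append(row)
            self.cells = None


# Shortened, made-up sample in the same shape as the rows in the first post.
sample = """
<tr>
<td class="tdn" valign="top"><input name="x1" value="y" type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://example.invalid/report?q=1&amp;id=2">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top"><input name="x2" value="y" type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn">Vous recevez 320
<img src="res2.gif" alt="Or" align="absmiddle" border="0">
et collectez 3 d'expérience !</td>
</tr>
"""

parser = RowParser()
parser.feed(sample)
parser.close()
for row in parser.rows:
    for key, value in row.items():
        print(key, ":", value)
    print()
```

The parser does not care about attribute order or line breaks inside a cell, so it keeps working when the row layout shifts slightly.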

-anjesh

Apr 2 '07 #6
In article <11*********************@n59g2000hsh.googlegroups.com>,
anjesh <an************@gmail.com> wrote:
> On Apr 2, 12:54 am, "Dotan Cohen" <dotanco...@gmail.com> wrote:
> > On 1 Apr 2007 07:56:04 -0700, Ulysse <maxim...@gmail.com> wrote:
> > > I have looked at the Beautiful Soup online help and tried to apply it to
> > > my problem, but it seems a bit hard. I would rather try to
> > > do this with regular expressions...
> >
> > If you think that Beautiful Soup is difficult, then wait till you try
> > to do this with regexes. Granted, knowing the exact format of the HTML
> > you are scraping will help, but if you ever need to parse HTML from an
> > unknown source, then Beautiful Soup is the only way to go. Not all HTML
> > authors close their td and tr tags, and sometimes those tags carry
> > attributes. If you plan on ever reusing the code, or if the format of
> > the HTML may change, then you are best off sticking with Beautiful
> > Soup.
> >
> > Dotan Cohen
> >
> > http://lyricslist.com/
> > http://what-is-what.com/
>
> Have you tried HTMLParser? It can do the task you want to perform:
> http://docs.python.org/lib/module-HTMLParser.html
>
> -anjesh

Yes, except that these last two follow-ups UNDERstate the difficulty--in
fact, the impossibility--of achieving adequate results on this problem
with regular expressions. We'll help with the documentation for HTMLParser
and BeautifulSoup. REs are an invitation to madness.

<URL: http://www.unixreview.com/documents/s=10121/ur0702e/ > might amuse
those who want to think more about REs.
Apr 2 '07 #7
On Apr 2, 9:28 pm, cla...@lairds.us (Cameron Laird) wrote:
> In article <1175503135.234560.51...@n59g2000hsh.googlegroups.com>,
> anjesh <anjeshtulad...@gmail.com> wrote:
> > On Apr 2, 12:54 am, "Dotan Cohen" <dotanco...@gmail.com> wrote:
> > > On 1 Apr 2007 07:56:04 -0700, Ulysse <maxim...@gmail.com> wrote:
> > > > I have looked at the Beautiful Soup online help and tried to apply it to
> > > > my problem, but it seems a bit hard. I would rather try to
> > > > do this with regular expressions...
> > > If you think that Beautiful Soup is difficult, then wait till you try
> > > to do this with regexes. Granted, knowing the exact format of the HTML
> > > you are scraping will help, but if you ever need to parse HTML from an
> > > unknown source, then Beautiful Soup is the only way to go. Not all HTML
> > > authors close their td and tr tags, and sometimes those tags carry
> > > attributes. If you plan on ever reusing the code, or if the format of
> > > the HTML may change, then you are best off sticking with Beautiful
> > > Soup.
> > > Dotan Cohen
> > > http://lyricslist.com/
> > > http://what-is-what.com/
> > Have you tried HTMLParser? It can do the task you want to perform:
> > http://docs.python.org/lib/module-HTMLParser.html
> > -anjesh
>
> Yes, except that these last two follow-ups UNDERstate the difficulty--in
> fact, the impossibility--of achieving adequate results on this problem
> with regular expressions. We'll help with the documentation for HTMLParser
> and BeautifulSoup. REs are an invitation to madness.
>
> <URL: http://www.unixreview.com/documents/s=10121/ur0702e/ > might amuse
> those who want to think more about REs.
r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?<a href="(.*?)">(.*?)</a>.*?</td>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2}).*?player\.php.*?>(.*?)</a>.*?<textarea.*?>(.*?)</textarea>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?Message au clan de :([a-zA-Z0-9_\-]+?)\W*<br>(.*?)</th>'

These three REs extract all the data I need. They do not apply exactly to the
given string.
I read the article, but I didn't understand why REs are an invitation to
madness...
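
For reference, a minimal sketch of how the first of these patterns could be applied with the standard `re` module. The `re.DOTALL` flag, the shortened sample row, and the `example.invalid` URL are assumptions for illustration; the flags actually used were not stated. Note that the regex returns the raw attribute text, so the `&amp;` in the link still has to be decoded separately with `html.unescape`.

```python
import html
import re

# First pattern from above, split across two string literals for readability.
LINK_ROW = re.compile(
    r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">'
    r'\W*?<a href="(.*?)">(.*?)</a>.*?</td>',
    re.DOTALL,  # assumption: lets (.*?) span the line break inside the link text
)

# Shortened, made-up sample in the same shape as a link row from the first post.
sample = """<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://example.invalid/report?q=1&amp;id=2">Vous
avez tendu une embuscade à votre victime !</a></td>"""

for date, url, text in LINK_ROW.findall(sample):
    print("Date :", date)
    print("ContainType : Link")
    print("LinkText :", " ".join(text.split()))  # collapse the line break
    print("LinkURL :", html.unescape(url))       # decode &amp; by hand
```

The pattern captures the date without the leading weekday, and it only matches while the second cell is written exactly as `<td class="tdn">`.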

Apr 2 '07 #8
