Extract information from HTML table

Ulysse

Hello,

I'm trying to extract data from an HTML table. Here is part of
the HTML source:
"""
<tr>
<td class="tdn" valign="top">
<input name="x44553130" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/
bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628033">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top">
<input name="x44553032" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:14:35</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/
bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628007">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top">
<input name="x44552991" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn">Vous avez bien accompli votre
tâche de Gardien de Cimetière et vous vous
voyez remis votre salaire comme récompense.
Vous recevez 320
<img src="messages-bite_fichiers/res2.gif"
alt="Or" align="absmiddle" border="0">
et collectez 3 d'expérience !</td>
</tr>
"""

I would like to transform this into the following:

Date : Sat, 31.03.2007 - 20:24:00
ContainType : Link
LinkText : Vous avez tendu une embuscade à votre victime !
LinkURL : http://s2.bitefight.fr/bite/bericht....p;beid=2628033

Date : Sat, 31.03.2007 - 20:14:35
ContainType : Link
LinkText : Vous avez tendu une embuscade à votre victime !
LinkURL : http://s2.bitefight.fr/bite/bericht....p;beid=2628007

Date : Sat, 31.03.2007 - 20:11:39
ContainType : Text
Contain : Vous avez bien accompli votre tâche de Gardien de Cimetière
et vous vous
voyez remis votre salaire comme récompense.
Vous recevez 320 et collectez 3 d'expérience !

.....

Do you know a way to do this?

Thanks

Apr 1 '07 #1

placid

On Apr 1, 10:13 pm, "Ulysse" <maxim...@gmail.com> wrote:

> Hello,
>
> I'm trying to extract data from an HTML table. Here is part of
> the HTML source:
>
> ....
>
> Do you know a way to do this?

You can use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

See this page for how to search for tags and then retrieve their
contents:

http://www.crummy.com/software/Beaut...20Parse%20Tree

Cheers

Apr 1 '07 #2

irstas

On Apr 1, 3:13 pm, "Ulysse" <maxim...@gmail.com> wrote:

> Hello,
>
> I'm trying to extract data from an HTML table. Here is part of
> the HTML source:
>
> ....
>
> Do you know a way to do this?

Beautiful Soup is an easy way to parse HTML (that may be broken).
http://www.crummy.com/software/BeautifulSoup/

Here's a start of a parser for your HTML:

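from bs4 import BeautifulSoup  # Beautiful Soup 4 (older releases: from BeautifulSoup import BeautifulSoup)

soup = BeautifulSoup(txt, 'html.parser')
for tr in soup('tr'):
    # Skip the first cell (the checkbox); keep the date and message cells.
    dateTd, textTd = tr('td')[1:]
    print('Date :', dateTd.contents[0].strip())
    print(textTd)  # element still needs parsing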

where txt is the string in your message.

Apr 1 '07 #3

Ulysse

On Apr 1, 2:52 pm, irs...@gmail.com wrote:

> Beautiful Soup is an easy way to parse HTML (that may be broken).
> http://www.crummy.com/software/BeautifulSoup/
>
> Here's a start of a parser for your HTML:
>
> ....

I have read the Beautiful Soup online help and tried to apply it to
my problem, but it seems a bit hard. I would rather try to
do this with regular expressions...

Apr 1 '07 #4

Dotan Cohen

On 1 Apr 2007 07:56:04 -0700, Ulysse <ma******@gmail.com> wrote:

> I have read the Beautiful Soup online help and tried to apply it to
> my problem, but it seems a bit hard. I would rather try to
> do this with regular expressions...

If you think that Beautiful Soup is difficult, then wait till you try
to do this with regexes. Granted, knowing the exact format of the HTML
you are scraping will help, but if you ever need to parse HTML from an
unknown source, then Beautiful Soup is the only way to go. Not all HTML
authors close their td and tr tags, and sometimes those tags carry
attributes. If you plan on ever reusing the code, or if the format of
the HTML may change, then you are best off sticking with Beautiful
Soup.

Dotan Cohen

http://lyricslist.com/
http://what-is-what.com/
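
To make the point about unclosed tags and extra attributes concrete, here is a minimal sketch that uses only the standard library (html.parser is the Python 3 name of the HTMLParser module). The sample string, the naive pattern, and the names CellCollector and collect_cells are invented for illustration; they are not from the page being scraped.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: cells carry attributes, and </td> and </tr> are omitted.
SAMPLE = '<table><tr><td class="tdn">first<td valign="top">second<tr><td>third</table>'

# A naive pattern that assumes bare, properly closed cells.
NAIVE_TD = re.compile(r'<td>(.*?)</td>', re.S)


class CellCollector(HTMLParser):
    """Collects the text of each <td>, whether or not it is closed."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            # A new <td> implicitly ends the previous one.
            self.cells.append('')
            self.in_cell = True
        elif tag == 'tr':
            self.in_cell = False

    def handle_endtag(self, tag):
        if tag in ('td', 'tr', 'table'):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def collect_cells(html_text):
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    return [cell.strip() for cell in parser.cells]


print(NAIVE_TD.findall(SAMPLE))  # the naive pattern finds no cells here
print(collect_cells(SAMPLE))     # the parser still sees all three cells
```

The parser reports every start tag with its attributes already split out, so the code never has to guess how a tag was written.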

Apr 1 '07 #5

anjesh

On Apr 2, 12:54 am, "Dotan Cohen" <dotanco...@gmail.com> wrote:

> If you think that Beautiful Soup is difficult, then wait till you try
> to do this with regexes.
>
> ....

Have you tried HTMLParser? It can do the task you want to perform:
http://docs.python.org/lib/module-HTMLParser.html

-anjesh
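
A rough sketch of how the HTMLParser suggestion might look for the rows in the question, assuming Python 3 (where the module is html.parser) and assuming every message row has three cells: checkbox, date, message. The class name MessageTableParser and the helper parse_messages are made up for illustration, and the sample is trimmed to two rows with the link URL joined onto one line.

```python
from html.parser import HTMLParser

# Two rows in the shape of the HTML from the question.
SAMPLE = """
<tr>
<td class="tdn" valign="top"><input name="x44553130" value="y" type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628033">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top"><input name="x44552991" value="y" type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn">Vous recevez 320
<img src="messages-bite_fichiers/res2.gif" alt="Or" align="absmiddle" border="0">
et collectez 3 d'expérience !</td>
</tr>
"""


class MessageTableParser(HTMLParser):
    """Turns each <tr> into a dict: the date plus either a link or plain text."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self.cells = None    # text of each <td> in the current row
        self.href = None     # first link URL seen in the current row
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cells, self.href = [], None
        elif tag == 'td' and self.cells is not None:
            self.cells.append('')
            self.in_cell = True
        elif tag == 'a' and self.in_cell and self.href is None:
            # Attribute values arrive with entities such as &amp; already decoded.
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'tr' and self.cells is not None:
            self._finish_row()

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

    def _finish_row(self):
        # Collapse runs of whitespace and newlines into single spaces.
        cells = [' '.join(c.split()) for c in self.cells]
        self.cells = None
        if len(cells) < 3:
            return  # not a message row
        row = {'Date': cells[1]}
        if self.href:
            row.update(ContainType='Link', LinkText=cells[2], LinkURL=self.href)
        else:
            row.update(ContainType='Text', Contain=cells[2])
        self.rows.append(row)


def parse_messages(html_text):
    parser = MessageTableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


for row in parse_messages(SAMPLE):
    for key, value in row.items():
        print(key, ':', value)
    print()
```

Because the img tag produces no text, the second row comes out as plain text with the image simply dropped, which matches the output asked for in the question.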

Apr 2 '07 #6

Cameron Laird

In article <11*********************@n59g2000hsh.googlegroups.com>,
anjesh <an************@gmail.com> wrote:

> On Apr 2, 12:54 am, "Dotan Cohen" <dotanco...@gmail.com> wrote:
>> If you think that Beautiful Soup is difficult, then wait till you try
>> to do this with regexes.
>> ....
>
> Have you tried HTMLParser? It can do the task you want to perform:
> http://docs.python.org/lib/module-HTMLParser.html

Yes, except that these last two follow-ups UNDERstate the difficulty--in
fact, the impossibility--of achieving adequate results on this problem
with regular expressions. We'll help with the documentation for HTMLParser
and BeautifulSoup. REs are an invitation to madness.

<URL: http://www.unixreview.com/documents/s=10121/ur0702e/ > might amuse
those who want to think more about REs.

Apr 2 '07 #7

Ulysse

On Apr 2, 9:28 pm, cla...@lairds.us (Cameron Laird) wrote:

> Yes, except that these last two follow-ups UNDERstate the difficulty--in
> fact, the impossibility--of achieving adequate results on this problem
> with regular expressions. We'll help with the documentation for HTMLParser
> and BeautifulSoup. REs are an invitation to madness.
>
> <URL: http://www.unixreview.com/documents/s=10121/ur0702e/ > might amuse
> those who want to think more about REs.

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?<a href="(.*?)">(.*?)</a>.*?</td>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2}).*?player\.php.*?>(.*?)</a>.*?<textarea.*?>(.*?)</textarea>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?Message au clan de :([a-zA-Z0-9_\-]+?)\W*<br>(.*?)</th>'

These three REs extract all the data I need, although they do not apply
exactly to the string I posted.
I read the article, but I didn't understand why REs are an invitation to
madness...
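
One way to see the concern is to run the first pattern above against a few rows and count what comes back. The sketch below makes several assumptions: the re.S flag is added so the link text can span a line break, the sample is trimmed from the question, and the third row is an invented variation in which the message cell lists its attributes in a different order.

```python
import html
import re

# First pattern from the post above, joined back onto one line.
# re.S is an assumption: without it (.*?) cannot cross the line break in the link text.
LINK_ROW = re.compile(
    r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">'
    r'\W*?<a href="(.*?)">(.*?)</a>.*?</td>',
    re.S,
)

# Row 1: link row. Row 2: plain-text row. Row 3: hypothetical link row
# whose message cell has its attributes in a different order.
SAMPLE = """
<tr>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628033">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn">Vous recevez 320 et collectez 3 d'expérience !</td>
</tr>
<tr>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:14:35</td>
<td valign="top" class="tdn">
<a href="http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628007">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
"""

matches = LINK_ROW.findall(SAMPLE)
print(len(matches), 'of 3 rows matched')

for date, url, text in matches:
    # Captures are raw markup: entities and line breaks must be cleaned by hand.
    print('Date :', date)
    print('LinkText :', ' '.join(text.split()))
    print('LinkURL :', html.unescape(url))
```

Rows that do not fit the pattern raise no error; they are just absent from the result. The captured date also starts at the digits, so the weekday is lost, and the URL still contains the &amp; entity until it is unescaped. Each of these can be patched, but every patch is another pattern to maintain when the page changes.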

Apr 2 '07 #8
