  #1  
August 2nd, 2005, 08:35 PM
florent
trying to parse non valid html documents with HTMLParser

I'm trying to parse HTML documents from the web using the HTMLParser
class of the HTMLParser module (Python 2.3), but some web documents are
not fully valid. When the parser finds an invalid tag, it raises an
exception, and it then seems impossible to resume parsing just after
the point where the exception was raised. I'd like to continue parsing an
HTML document even if an invalid tag is found. Is this possible?

Here is a small invalid HTML document:
----------
<html>
<body>
<a href="""">bogus link</a>
</body>
</html>
----------
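
For reference, here is a minimal sketch of feeding the sample document above to the parser. It assumes a current Python 3, where the class lives in `html.parser`, rather than the Python 2.3 module discussed in this thread. The `TagLogger` class and `parse_sample` function are names made up for illustration. On recent Python 3 releases the parser is lenient and is not expected to raise on this input, although how it reports the malformed attribute may vary between versions.

```python
from html.parser import HTMLParser

# The invalid sample document from the post above.
SAMPLE = '''<html>
<body>
<a href="""">bogus link</a>
</body>
</html>
'''


class TagLogger(HTMLParser):
    """Hypothetical helper: records start tags and non-blank text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored here; its exact content for the bad
        # attribute is version-dependent.
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def parse_sample(html=SAMPLE):
    """Parse html and return (start tags, text fragments)."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return parser.tags, parser.text


if __name__ == "__main__":
    print(parse_sample())
```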
  #2  
August 2nd, 2005, 09:15 PM
Benjamin Niemann
Re: trying to parse non valid html documents with HTMLParser

florent wrote:
> I'm trying to parse HTML documents from the web using the HTMLParser
> class of the HTMLParser module (Python 2.3), but some web documents are
> not fully valid.

Some?? Most of them :(
> When the parser finds an invalid tag, it raises an
> exception, and it then seems impossible to resume parsing just after
> the point where the exception was raised. I'd like to continue parsing an
> HTML document even if an invalid tag is found. Is this possible?

AFAIK, not with HTMLParser or htmllib. You might try htmllib (if you
haven't already) and see which parser is more forgiving.

You might pipe the document through an external tool like HTML Tidy
<http://www.w3.org/People/Raggett/tidy/> before you feed it into
HTMLParser.
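
A rough sketch of that piping step, written for a current Python 3 and assuming the `tidy` command-line tool is installed and on the PATH. The function name `tidy_html` is made up for illustration, and returning the input unchanged when `tidy` is missing is a choice of this sketch, not something Tidy does itself.

```python
import shutil
import subprocess


def tidy_html(html, tidy_cmd="tidy"):
    """Return html cleaned up by HTML Tidy.

    If the tidy executable cannot be found, the input is returned
    unchanged (an assumption of this sketch).
    """
    if shutil.which(tidy_cmd) is None:
        return html

    # -q: quiet, -asxhtml: emit XHTML, -utf8: UTF-8 in and out,
    # --force-output yes: still produce output when errors are found.
    result = subprocess.run(
        [tidy_cmd, "-q", "-asxhtml", "-utf8", "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,  # tidy exits non-zero on warnings and errors
    )
    cleaned = result.stdout.decode("utf-8", errors="replace")
    return cleaned or html
```

The cleaned string can then be passed to the parser's `feed()` method in place of the raw page.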


--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
  #3  
August 2nd, 2005, 09:35 PM
Benji York
Re: trying to parse non valid html documents with HTMLParser

florent wrote:
> I'm trying to parse HTML documents from the web using the HTMLParser
> class of the HTMLParser module (Python 2.3), but some web documents are
> not fully valid.

From http://www.crummy.com/software/BeautifulSoup/:

You didn't write that awful page. You're just trying to get
some data out of it. Right now, you don't really care what
HTML is supposed to look like.

Neither does this parser.
--
Benji York

  #4  
August 3rd, 2005, 10:55 AM
florent
Re: trying to parse non valid html documents with HTMLParser

> AFAIK, not with HTMLParser or htmllib. You might try htmllib (if you
> haven't already) and see which parser is more forgiving.

Thanks, I'll try htmllib.
Otherwise, I have found a workaround: feeding data to the HTMLParser in
chunks extracted from the string with string.split("<") means I lose
only one tag at a time when an exception is raised!
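
The chunking idea above could look roughly like this in a current Python 3. Since the modern `html.parser` no longer raises on bad markup, the `PickyParser` class below is a made-up stand-in that raises `ValueError` on a `<bogus>` tag, and `feed_in_chunks` is a hypothetical helper name. With the Python 2.3 module the exception to catch would be its own parse error instead.

```python
from html.parser import HTMLParser


class PickyParser(HTMLParser):
    """Stand-in for a strict parser: collects text, rejects <bogus>."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "bogus":
            raise ValueError("invalid tag: %s" % self.get_starttag_text())

    def handle_data(self, data):
        self.text.append(data)


def feed_in_chunks(parser, html):
    """Feed html one '<'-delimited chunk at a time.

    Returns the list of chunks that were skipped because the parser
    raised while handling them.
    """
    skipped = []
    head, *rest = html.split("<")
    # Put back the "<" that split() removed from every later piece.
    chunks = ([head] if head else []) + ["<" + piece for piece in rest]
    for chunk in chunks:
        try:
            parser.feed(chunk)
        except ValueError:
            parser.reset()  # discard the buffered bad chunk
            skipped.append(chunk)
    parser.close()
    return skipped


if __name__ == "__main__":
    p = PickyParser()
    print(feed_in_chunks(p, "<p>one</p><bogus>lost<p>two</p>"))
    print(p.text)
```

Note that any text following the bad tag in the same chunk is dropped along with it, and a literal "<" inside an attribute value or a script would be split in the wrong place, so this is a workaround, not a general fix.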
  #5  
August 3rd, 2005, 11:15 AM
florent
Re: trying to parse non valid html documents with HTMLParser

> From http://www.crummy.com/software/BeautifulSoup/:
>
> You didn't write that awful page. You're just trying to get
> some data out of it. Right now, you don't really care what
> HTML is supposed to look like.
>
> Neither does this parser.

True, I just want to extract some data from HTML documents. But the
problem is the same: the parser loses its position in the string
when it encounters a bad tag.
  #6  
August 3rd, 2005, 02:15 PM
Benji York
Re: trying to parse non valid html documents with HTMLParser

florent wrote:
> True, I just want to extract some data from HTML documents. But the
> problem is the same: the parser loses its position in the string
> when it encounters a bad tag.

Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
sure the author would like an example so he can "fix" it.
--
Benji York


  #7  
August 3rd, 2005, 04:45 PM
florent
Re: trying to parse non valid html documents with HTMLParser

> AFAIK, not with HTMLParser or htmllib. You might try htmllib (if you
> haven't already) and see which parser is more forgiving.

You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags!
  #8  
August 3rd, 2005, 04:55 PM
florent
Re: trying to parse non valid html documents with HTMLParser

> Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
> sure the author would like an example so he can "fix" it.

In the end I used the htmllib module, which is more permissive than the
HTMLParser module when parsing bad HTML documents.
Anyway, where can I find the author's contact information?
  #9  
August 3rd, 2005, 05:35 PM
Steve M
Re: trying to parse non valid html documents with HTMLParser

> You were right, the HTMLParser of htmllib is more permissive. He just
> ignores the bad tags!

The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...

  #10  
August 3rd, 2005, 08:25 PM
Benjamin Niemann
Re: trying to parse non valid html documents with HTMLParser

Steve M wrote:
>> You were right, the HTMLParser of htmllib is more permissive. He just
>> ignores the bad tags!
>
> The HTMLParser on my distribution is a she. But then again, I am using
> ActivePython on Windows...

Although building parsers is for some strange reason one of my favourite
programming adventures, I do not have such a personal relationship with my
classes ;)

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
 
