
Parsing HTML

Hello!

I want to extract some information from some specific HTML pages: Microsoft's
International Word Lists (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and create
a dictionary, so that I can look up "About" and get "Om" as the answer.

What is the best way to do this?

Please help!

// Anders
Jul 18 '05 #1


On Thu, 23 Sep 2004 08:42:08 +0200, Anders Eriksson wrote:
Hello!

I want to extract some information from some specific HTML pages: Microsoft's
International Word Lists (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and create
a dictionary, so that I can look up "About" and get "Om" as the answer.

What is the best way to do this?


Hi,

If you only want to parse one page, I would use the re module.

If you want to parse many HTML pages, you can use tidy to create
xml and then use an xml parser. There are too many ways HTML can be
broken.
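
For example, a minimal sketch of that pipeline (this assumes the
command-line tidy tool and Fredrik Lundh's elementtree package are
installed; the file names are placeholders):

import os
from elementtree import ElementTree

# convert the saved page to well-formed XHTML with command-line tidy
os.system("tidy -asxml -numeric -quiet --force-output yes page.html > page.xml")

# parse the XHTML and print the text of each table cell, row by row
# (col.text only picks up direct text; bold entries need flattening)
XHTML = "{http://www.w3.org/1999/xhtml}%s"
tree = ElementTree.parse("page.xml")
for row in tree.getiterator(XHTML % "tr"):
    print [col.text for col in row.findall(XHTML % "td")]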

HTH,
Thomas

Jul 18 '05 #2

I want to extract some information from some specific HTML pages: Microsoft's
International Word Lists (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and create
a dictionary, so that I can look up "About" and get "Om" as the answer.


BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is perfect for
this job:

import urllib2, pprint
from BeautifulSoup import BeautifulSoup

def cellToWord(cell):
    """Given a table cell, return the word in that cell."""
    # Some words are in bold.
    if cell('b'):
        return cell.first('b').string.strip()  # Return the bold piece.
    else:
        return cell.string.split('.')[1].strip()  # Remove the number.

def parse(url):
    """Parse the given URL and return a dictionary mapping US words to
    foreign words."""

    # Read the URL and pass it to BeautifulSoup.
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup()
    soup.feed(html)

    # Read the main table, extracting the words from the table cells.
    USToForeign = {}
    mainTable = soup.first('table')
    rows = mainTable('tr')
    for row in rows[1:]:  # Exclude the first (headings) row.
        cells = row('td')
        if len(cells) == 3:  # Some rows have a single colspan="3" cell.
            US = cellToWord(cells[0])
            foreign = cellToWord(cells[1])
            USToForeign[US] = foreign

    return USToForeign

if __name__ == '__main__':
    url = 'http://msdn.microsoft.com/library/en-us/dnwue/html/FRE_word_list.htm'
    USToForeign = parse(url)
    pairs = USToForeign.items()
    pairs.sort(lambda a, b: cmp(a[0].lower(), b[0].lower()))  # Web page order
    pprint.pprint(pairs)
--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #3


[Richie]
BeautifulSoup is perfect for this job:


Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
And I don't know whether you'd consider it correct to extract only the bold
text from the entries that have bold text. But it gives you a place to start.
8-)
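
One possible workaround (just a sketch, untested against the real pages):
strip the empty bold pairs out of the HTML before handing it to
BeautifulSoup, so cellToWord never sees a <b> tag with nothing in it:

import re
html = re.sub(r'<b>\s*</b>', '', html)  # drop empty <b></b> pairs

Anders's final script later in the thread takes a similar approach and
strips all the <b> tags up front.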

--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #4

Anders Eriksson wrote:
Hello!

I want to extract some information from some specific HTML pages: Microsoft's
International Word Lists (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and create
a dictionary, so that I can look up "About" and get "Om" as the answer.

What is the best way to do this?

Please help!

// Anders


hi,
try this:

###############################################
import re, urllib2

# get the page
s = urllib2.urlopen('http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm').read()

regex = re.compile(r'<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')
myresult = regex.findall(s)
#print myresult

# map consecutive pairs in the list to key:value entries in a dict
mydict = {}
for i in range(0, len(myresult) - 1, 2):
    mydict[myresult[i]] = myresult[i + 1]

#print mydict

# try some words
print mydict['wizard']
print mydict['Web site']
print mydict['unavailable']

##############################

which outputs:
guide
webbplats
inte tillgänglig

Chris
Jul 18 '05 #5

Richie Hindle wrote:
[Richie]
BeautifulSoup is perfect for this job:

Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
And I don't know whether you'd consider it correct to extract only the bold
text from the entries that have bold text. But it gives you a place to start.
8-)


Another option might be the HTML parser from libxml2 (www.xmlsoft.org):

>>> import libxml2
>>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid
element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
^
>>> doc.serialize()
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...
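
Extracting the word pairs could then look something like this (a sketch
using libxml2's XPath binding, untested against the MSDN page; the leading
"nnn." numbers would still need stripping):

import libxml2

url = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
doc = libxml2.htmlParseFile(url, None)
ctxt = doc.xpathNewContext()
words = {}
for row in ctxt.xpathEval("//tr"):
    ctxt.setContextNode(row)   # make the td lookup relative to this row
    cells = ctxt.xpathEval("td")
    if len(cells) == 3:        # skip the single colspan="3" rows
        words[cells[0].content.strip()] = cells[1].content.strip()
ctxt.xpathFreeContext()
doc.freeDoc()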

Bye,
Walter Dörwald
Jul 18 '05 #6

Thomas Guettler wrote:
If you want to parse many HTML pages, you can use tidy to create
xml and then use an xml parser. There are too many ways HTML can be
broken.


including the page Anders pointed to, which is too broken for tidy's
default settings:

line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

you can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not careful,
you may end up with unusable swedish text).

I've attached a script based on my ElementTidy binding for tidy. see
alternative 1 below. usage:

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"

wordlist = parse_microsoft_wordlist(URL)

for item in wordlist:
    print item

the wordlist contains (english word, swedish word), using Unicode where
appropriate.

you can get elementtree and elementtidy via

http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm

on the other hand, for this specific case, a regular expression-based approach
is probably easier. see alternative 2 below for one way to do it.

</F>

# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach

from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re

def parse_microsoft_wordlist(url):

    text = urlopen(url).read()

    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)

    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")

    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml.*?>", "", text) # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text)  # <mshelp> stuff

    # now, let's process it
    tree = parse(StringIO(text))

    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist

# helpers

def XHTML(tag):
    # map a tag to its XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag

def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub(r"^\d+\.\s+", "", word)
    return word

def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text

# --------------------------------------------------------------------
# alternative 2: using regular expressions

import re
from urllib import urlopen

def parse_microsoft_wordlist(url):

    text = urlopen(url).read()

    text = unicode(text, "utf-8")

    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"

    def fixword(word):
        # get rid of leading nnn.
        word = re.sub(r"^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word

    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))

    return wordlist

# --------------------------------------------------------------------

Jul 18 '05 #7

I would like to thank everyone who has helped with this!

The solution I settled on uses BeautifulSoup and a script that Mr.
Leonard Richardson sent me.

Now to the next part of the problem: how to manage Unicode....
// Anders
--
To promote the usage of BeautifulSoup, here is the script by Mr. Leonard
Richardson:

import urllib
import re
from BeautifulSoup import BeautifulSoup

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
text = urllib.urlopen(URL).read()

# remove all <b> and </b>
p = re.compile('<b>|</b>')
text = p.sub('', text)

# soupify it
soup = BeautifulSoup(text)

def unmunge(value):
    """Use this method to turn, eg "74. <b>Help</b> menu" into "Help menu",
    probably using a regular expression."""
    return value[value.find('.') + 2:]

d = []
cols = soup.fetch('td', {'width': '33%'})
for i in range(0, len(cols)):
    if i % 3 != 2:  # Every third column is a note, which we ignore.
        value = unmunge(cols[i].renderContents())
        if not d or len(d[-1]) == 2:
            # English term
            d.append([value])
        else:
            # Swedish term
            d[-1].append(value)

d = dict(d)
for key, val in d.items():
    print "%s = %s" % (key, val)
Jul 18 '05 #8

Anders Eriksson <an*************@morateknikutveckling.se> wrote in message news:<jm***************@morateknikutveckling.se>...
Hello!

I want to extract some information from some specific HTML pages: Microsoft's
International Word Lists (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and create
a dictionary, so that I can look up "About" and get "Om" as the answer.

What is the best way to do this?


http://www.xml.com/pub/a/2004/09/08/pyxml.html

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com

A hands-on introduction to ISO Schematron -
http://www-106.ibm.com/developerwork...ematron-i.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Principles of XML design: Considering container elements -
http://www-106.ibm.com/developerwork...x-contain.html
Hacking XML Hacks - http://www-106.ibm.com/developerwork...x-think26.html
A survey of XML standards -
http://www-106.ibm.com/developerwork...rary/x-stand4/
Jul 18 '05 #9
