
Parsing HTML

Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?

Please help!

// Anders
Jul 18 '05 #1
On Thu, 23 Sep 2004 08:42:08 +0200, Anders Eriksson wrote:
Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?


Hi,

If you only want to parse one page, I would use the re module.

If you want to parse many HTML pages, you can use tidy to create
xml and then use an xml parser. There are too many ways HTML can be
broken.
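
For the single-page case, a minimal sketch of the re approach might look
like this (the <td> pattern is a guess on my part, so adjust it to the
actual markup of the page):

import re, urllib2

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
html = urllib2.urlopen(URL).read()
# grab the text of every table cell; pair them up afterwards
cells = re.findall(r'<td[^>]*>(.*?)</td>', html, re.S)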

HTH,
Thomas

Jul 18 '05 #2
I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.


BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is perfect for
this job:

import urllib2, pprint
from BeautifulSoup import BeautifulSoup

def cellToWord(cell):
    """Given a table cell, return the word in that cell."""
    # Some words are in bold.
    if cell('b'):
        return cell.first('b').string.strip()  # Return the bold piece.
    else:
        return cell.string.split('.')[1].strip()  # Remove the number.

def parse(url):
    """Parse the given URL and return a dictionary mapping US words to
    foreign words."""

    # Read the URL and pass it to BeautifulSoup.
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup()
    soup.feed(html)

    # Read the main table, extracting the words from the table cells.
    USToForeign = {}
    mainTable = soup.first('table')
    rows = mainTable('tr')
    for row in rows[1:]:  # Exclude the first (headings) row.
        cells = row('td')
        if len(cells) == 3:  # Some rows have a single colspan="3" cell.
            US = cellToWord(cells[0])
            foreign = cellToWord(cells[1])
            USToForeign[US] = foreign

    return USToForeign

if __name__ == '__main__':
    url = 'http://msdn.microsoft.com/library/en-us/dnwue/html/FRE_word_list.htm'
    USToForeign = parse(url)
    pairs = USToForeign.items()
    pairs.sort(lambda a, b: cmp(a[0].lower(), b[0].lower()))  # Web page order
    pprint.pprint(pairs)
--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #3

[Richie]
BeautifulSoup is perfect for this job:


Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
And I don't know whether you'd consider it correct to extract only the bold
text from the entries that have bold text. But it gives you a place to start.
8-)
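
For the record, one rough way around the empty "<b></b>" tags would be to
work on each cell's raw contents instead. This is only a sketch, untested
against the live page, and it assumes every cell still looks like "12. word":

import re

def cellToWord(cell):
    """Given a table cell, return the word in that cell."""
    text = cell.renderContents()            # raw contents of the <td>
    text = re.sub(r'</?b>', '', text)       # drop <b> and </b>, empty or not
    return re.sub(r'^\d+\.\s*', '', text).strip()  # strip the leading number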

--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #4
Anders Eriksson wrote:
Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?

Please help!

// Anders


hi,
try this:

###############################################
import re, urllib2

# get page
s = urllib2.urlopen('http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm').read()

# each cell looks like "<td>12. <b>word</b></td>" or "<td>12. word</td>"
regex = re.compile(r'<td.*?>\d*\. (?:<b>)?(.*?)(?:</b>)?</td>')
myresult = regex.findall(s)
#print myresult

# map pairs in list to key:value in dict
mydict = {}
for i in range(0, len(myresult) - 1, 2):
    mydict[myresult[i]] = myresult[i + 1]

#print mydict

# try some words
print mydict['wizard']
print mydict['Web site']
print mydict['unavailable']

###############################################

which outputs:
guide
webbplats
inte tillgänglig

Chris
Jul 18 '05 #5
Richie Hindle wrote:
[Richie]
BeautifulSoup is perfect for this job:

Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
And I don't know whether you'd consider it correct to extract only the bold
text from the entries that have bold text. But it gives you a place to start.
8-)


Another option might be the HTML parser from libxml2 (www.xmlsoft.org):
>>> import libxml2
>>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid
element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
^
>>> doc.serialize()
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...

Bye,
Walter Dörwald
Jul 18 '05 #6
Thomas Guettler wrote:
If you want to parse many HTML pages, you can use tidy to create
xml and then use an xml parser. There are too many ways HTML can be
broken.


including the page Anders pointed to, which is too broken for tidy's
default settings:

line 1 column 1 - Warning: specified input encoding (iso-8859-1) does
not match actual input encoding (utf-8)
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <html>
line 9 column 1 - Error: <xml> is not recognized!
... snip ...
260 warnings, 14 errors were found! Not all warnings/errors were shown.

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

you can fix this either by tweaking the tidy settings, or by fixing up the
document before you parse it (note the first warning: if you're not careful,
you may end up with unusable swedish text).
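
if you want to try the "tweak the settings" route instead, something like
this might do; the option names are taken from tidy's documentation, so
check them against your tidy version:

import os
os.system("tidy -q -asxml -utf8 --force-output yes "
          "swe_word_list.htm > swe_word_list.xml")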

I've attached a script based on my ElementTidy binding for tidy. see
alternative 1 below. usage:

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"

wordlist = parse_microsoft_wordlist(URL)

for item in wordlist:
    print item

the wordlist contains (english word, swedish word), using Unicode where
appropriate.

you can get elementtree and elementtidy via

http://effbot.org/zone/element.htm
http://effbot.org/zone/element-tidylib.htm

on the other hand, for this specific case, a regular expression-based approach
is probably easier. see alternative 2 below for one way to do it.

</F>

# --------------------------------------------------------------------
# alternative 1: using the TIDY->XML approach

from elementtidy.TidyHTMLTreeBuilder import parse
from urllib import urlopen
from StringIO import StringIO
import re

def parse_microsoft_wordlist(url):

    text = urlopen(url).read()

    # get rid of BOM crud
    text = re.sub("^[^<]*", "", text)  # bom crud

    # the page seems to be UTF-8 encoded, but it doesn't say so;
    # convert it to Latin 1 to simplify further processing
    text = unicode(text, "utf-8").encode("iso-8859-1")

    # get rid of things that Tidy doesn't like
    text = re.sub("(?i)</?xml*?>", "", text)  # embedded <xml>
    text = re.sub("(?i)</?ms.*?>", "", text)  # <mshelp> stuff

    # now, let's process it
    tree = parse(StringIO(text))

    # look for TR tags, and pick out the text from the first two TDs
    wordlist = []
    for row in tree.getiterator(XHTML("tr")):
        cols = row.findall(XHTML("td"))
        if len(cols) == 3:
            wordlist.append((fixword(cols[0]), fixword(cols[1])))
    return wordlist

# helpers

def XHTML(tag):
    # map a tag to its XHTML name
    return "{http://www.w3.org/1999/xhtml}" + tag

def fixword(column):
    # get text from TD and subelements
    word = flatten(column)
    # get rid of leading number and whitespace
    word = re.sub(r"^\d+\.\s+", "", word)
    return word

def flatten(node):
    # get text from an element and all its subelements
    text = ""
    if node.text:
        text += node.text
    for subnode in node:
        text += flatten(subnode)
        if subnode.tail:
            text += subnode.tail
    return text

# --------------------------------------------------------------------
# alternative 2: using regular expressions

import re
from urllib import urlopen

def parse_microsoft_wordlist(url):

    text = urlopen(url).read()

    text = unicode(text, "utf-8")

    pattern = r"(?s)<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>"

    def fixword(word):
        # get rid of leading nnn.
        word = re.sub(r"^\d+\.\s+", "", word)
        # get rid of embedded tags
        word = re.sub("<[^>]+>", "", word)
        return word

    wordlist = []
    for w1, w2 in re.findall(pattern, text):
        wordlist.append((fixword(w1), fixword(w2)))

    return wordlist

# --------------------------------------------------------------------

Jul 18 '05 #7
I would like to thank everyone who has helped with this!

The solution I settled on was using BeautifulSoup and a script that Mr.
Leonard Richardson sent me.

Now to the next part of the problem: how to manage Unicode... (see the
sketch after the script below).
// Anders
--
To promote the usage of BeautifulSoup, here is the script by Mr. Leonard
Richardson:

import urllib
import re
from BeautifulSoup import BeautifulSoup

URL = "http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm"
text = urllib.urlopen(URL).read()

# remove all <b> and </b>
p = re.compile(r'<b>|</b>')
text = p.sub('', text)

# soupify it
soup = BeautifulSoup(text)

def unmunge(value):
    """Turn, e.g., "74. Help menu" into "Help menu" (the <b> tags have
    already been stripped above)."""
    return value[value.find('.') + 2:]

d = []
cols = soup.fetch('td', {'width': '33%'})
for i in range(0, len(cols)):
    if i % 3 != 2:  # Every third column is a note, which we ignore.
        value = unmunge(cols[i].renderContents())
        if not d or len(d[-1]) == 2:
            # English term
            d.append([value])
        else:
            # Swedish term
            d[-1].append(value)
d = dict(d)
for key, val in d.items():
    print "%s = %s" % (key, val)
Jul 18 '05 #8
Anders Eriksson <an*************@morateknikutveckling.se> wrote in message news:<jm***************@morateknikutveckling.se>...
Hello!

I want to extract some info from some specific HTML pages, Microsoft's
International Word List (e.g.
http://msdn.microsoft.com/library/en...word_list.htm). I
want to take all the words, both English and the other language, and
create a dictionary, so that I can look up About and get Om as the answer.

What is the best way to do this?


http://www.xml.com/pub/a/2004/09/08/pyxml.html

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com

A hands-on introduction to ISO Schematron -
http://www-106.ibm.com/developerwork...ematron-i.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Principles of XML design: Considering container elements -
http://www-106.ibm.com/developerwork...x-contain.html
Hacking XML Hacks - http://www-106.ibm.com/developerwork...x-think26.html
A survey of XML standards -
http://www-106.ibm.com/developerwork...rary/x-stand4/
Jul 18 '05 #9
