
Output of HTML parsing

Hi, all,

I want to get the information of the professors (name,title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/"

Ideally, I'd like to have an output file where each line is one professor,
with name and title. In practice, I use the csv module.

The following is my program:
--------------- Program ----------------------------------------------------

import urllib,re,csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)</a>')
titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')

name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title = []
for item in title_temp:
    # Suppress the spaces between 'title' and </td>
    item_new = " ".join(item.split())
    title.extend([item_new])
output = []
for i in range(len(name)):
    # Generate a list of [name, title]
    output.insert(i, [name[i], title[i]])

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)  # output CSV file

-------------- End of Program ----------------------------------------------

My questions are:

1. The code above assumes that each professor has a title. If any one of
them does not, the names and titles will be mismatched. How can I allow
the title to be empty?

2. Is there an easier way to get the data I want than using lists?

3. Should I close the opened CSV file ("professor.csv")? How do I close
it?

Thanks!

Jackie

Jun 15 '07 #1
4 Replies


Jackie <ja*********@gmail.com> wrote:
> 1. The code above assumes that each professor has a title. If any one of
> them does not, the names and titles will be mismatched. How can I allow
> the title to be empty?
>
> 2. Is there an easier way to get the data I want than using lists?

Use BeautifulSoup.

> 3. Should I close the opened CSV file ("professor.csv")? How do I close
> it?

Assign the file object to a separate name (e.g. stream) and then invoke its
close method after writing all the CSV data to it.
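
A minimal sketch of that advice, written for Python 3 (so the file is
opened with "w" and newline="" rather than the "wb" used in the original
Python 2 program). The sample rows and the temporary directory are made up
for illustration; only the open/close pattern is the point.

```python
import csv
import os
import tempfile

# Invented sample rows; the second professor has no title.
rows = [["Jane Doe", "Professor"], ["John Roe", ""]]

# Write into a throwaway directory so the example has no side effects.
path = os.path.join(tempfile.mkdtemp(), "professor.csv")

# Keep a name for the file object so it can be closed explicitly.
stream = open(path, "w", newline="")
try:
    writer = csv.writer(stream)
    writer.writerows(rows)
finally:
    stream.close()

# Same result, with the file closed automatically when the block ends.
with open(path, "w", newline="") as stream:
    csv.writer(stream).writerows(rows)
```

Closing matters because buffered data is only guaranteed to reach the disk
once the file object has been flushed or closed.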

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)


Jun 15 '07 #2

Jackie wrote:
> I want to get the information of the professors (name,title) from the
> following link:
>
> "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

http://codespeak.net/lxml

> Ideally, I'd like to have an output file where each line is one professor,
> with name and title. In practice, I use the csv module.
> ----------------------------------------------------
>
> import urllib,re,csv
>
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
>
> sock = urllib.urlopen(url)
> htmlSource = sock.read()
> sock.close()

import lxml.etree as et
url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)

> namePattern = re.compile(r'class="name">(.*)</a>')
> titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')
>
> name = namePattern.findall(htmlSource)
> title_temp = titlePattern.findall(htmlSource)
> title = []
> for item in title_temp:
>     # Suppress the spaces between 'title' and </td>
>     item_new = " ".join(item.split())
>     title.extend([item_new])
> output = []
> for i in range(len(name)):
>     # Generate a list of [name, title]
>     output.insert(i, [name[i], title[i]])

# untested
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
    # Pad with empty strings so every row has exactly three fields
    name_list.append(
        tuple(get_name_text(name_row).split(",", 3) + ["","",""])[:3] )

> writer = csv.writer(open("professor.csv", "wb"))
> writer.writerows(output)  # output CSV file

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list)  # output CSV file

> -------------- End of Program
> ----------------------------------------------
>
> 3. Should I close the opened CSV file ("professor.csv")? How do I close
> it?

I guess it has a "close()" function?
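
For question 1, the padding with ["","",""] followed by [:3] in the
untested snippet is the step that keeps a professor without a title from
shifting the columns. A minimal sketch of just that step, in Python 3 and on
made-up strings (the names and titles are invented, not taken from the
faculty page, and split_padded is a hypothetical helper name). Unlike the
snippet, it also strips the whitespace around each field.

```python
def split_padded(text, width=3):
    """Split comma-separated text into exactly `width` stripped fields."""
    parts = [part.strip() for part in text.split(",", width)]
    # Pad with empty strings, then cut to the fixed width, so a missing
    # title becomes "" instead of a missing column.
    return tuple((parts + [""] * width)[:width])


rows = [
    split_padded("Jane Doe, Professor, Associate Chair"),
    split_padded("John Roe"),  # no title at all
]
```

Because each row is built from a single piece of text, a name can never be
paired with the title of a different person, which is the failure mode of
the two separate findall() lists.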

Stefan
Jun 15 '07 #3

On Jun 15, 2:01, Stefan Behnel <stefan.behnel-n05...@web.de> wrote:
> Jackie wrote:
> [...]
> import lxml.etree as et
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
> tree = et.parse(url)

Thank you. But when I tried to run the above part, the following
message showed up:

Traceback (most recent call last):
  File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
    tree = et.parse(url)
  File "etree.pyx", line 1845, in etree.parse
  File "parser.pxi", line 928, in etree._parseDocument
  File "parser.pxi", line 932, in etree._parseDocumentFromURL
  File "parser.pxi", line 849, in etree._parseDocFromFile
  File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 631, in etree._handleParseResult
  File "parser.pxi", line 602, in etree._raiseParseError
etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8

Could you please tell me what went wrong?

Thank you

Jackie

Jun 19 '07 #4

Jackie wrote:
> On Jun 15, 2:01, Stefan Behnel <stefan.behnel-n05...@web.de> wrote:
>> Jackie wrote:
>> [...]
>> import lxml.etree as et
>> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
>> tree = et.parse(url)
>
> Thank you. But when I tried to run the above part, the following
> message showed up:
>
> Traceback (most recent call last):
>   File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
>     tree = et.parse(url)
>   File "etree.pyx", line 1845, in etree.parse
>   File "parser.pxi", line 928, in etree._parseDocument
>   File "parser.pxi", line 932, in etree._parseDocumentFromURL
>   File "parser.pxi", line 849, in etree._parseDocFromFile
>   File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 631, in etree._handleParseResult
>   File "parser.pxi", line 602, in etree._raiseParseError
> etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8
>
> Could you please tell me what went wrong?

Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom
instead:

parser = et.HTMLParser()
tree = et.parse(url, parser)
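
The difference between the two kinds of parser can be reproduced without
the network. The sketch below uses only the Python 3 standard library as a
stand-in: xml.etree.ElementTree plays the strict XML parser and
html.parser.HTMLParser the lenient one. It is not lxml, and the broken
markup and the NameCollector class are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented sample: <html>, <table>, <tr> and <td> are never closed.
broken = '<html><table><tr><td><a class="name">Jane Doe</a>, Professor'

# A strict XML parser refuses the document outright.
try:
    ET.fromstring(broken)
    strict_ok = True
except ET.ParseError:
    strict_ok = False


class NameCollector(HTMLParser):
    """Collect the text of every <a class="name"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "name":
            self._in_name = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.names.append(data.strip())


# A lenient HTML parser works through the same input without complaint.
collector = NameCollector()
collector.feed(broken)
collector.close()
```

Here strict_ok ends up False while collector.names holds the one name,
which mirrors the XMLSyntaxError from et.parse(url) going away once an HTML
parser is passed in.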

Stefan
Jun 19 '07 #5
