Output of HTML parsing

Jackie

Hi, all,

I want to get the information of the professors (name,title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/"

Ideally, I'd like to have an output file where each line is one Prof,
including his or her name and title. In practice, I use the csv module.

The following is my program:
--------------- Program ---------------

import urllib, re, csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)</a>')
titlePattern = re.compile(r'</a>, (.*)\s*</td>')

name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title = []
for item in title_temp:
    item_new = " ".join(item.split())  # Suppress the spaces between 'title' and </td>
    title.extend([item_new])
output = []
for i in range(len(name)):
    output.insert(i, [name[i], title[i]])  # Generate a list of [name, title]

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)  # output CSV file

-------------- End of Program --------------

My questions are:

1. The code above assumes that each Prof has a title. If any one of them
does not, the names and titles will be mismatched. How can I program it
to allow the title to be empty?

2. Is there an easier way to get the data I want other than using
lists?

3. Should I close the opened csv file ("professor.csv")? How do I close
it?

Thanks!

Jackie

Jun 15 '07 #1


Sebastian Wiesner

[ Jackie <ja*********@gmail.com> ]

> 1. The code above assumes that each Prof has a title. If any one of them
> does not, the names and titles will be mismatched. How can I program it
> to allow the title to be empty?
>
> 2. Is there an easier way to get the data I want other than using
> lists?

Use BeautifulSoup.

> 3. Should I close the opened csv file ("professor.csv")? How do I close
> it?

Assign the file object to a separate name (e.g. stream) and then invoke its
close method after writing all csv data to it.
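
A minimal sketch of that advice, assuming Python 3 (the code in this thread is Python 2, where the file would be opened with "wb" instead). The rows and the temporary path are made-up placeholders, not data from the page:

```python
import csv
import os
import tempfile

# Placeholder rows; in the thread these would be the scraped [name, title] pairs.
rows = [["A. Example", "Professor"], ["B. Example", ""]]

path = os.path.join(tempfile.mkdtemp(), "professor.csv")

# Keep the file object under its own name so it can be closed explicitly.
stream = open(path, "w", newline="")
try:
    writer = csv.writer(stream)
    writer.writerows(rows)
finally:
    stream.close()  # flushes any buffered rows to disk

# Same result with a with-block, which closes the file automatically.
with open(path, "w", newline="") as stream2:
    csv.writer(stream2).writerows(rows)
```

The with-block form is usually the easier one to get right, since the file is closed even if writing raises an exception.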

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)


Jun 15 '07 #2

Stefan Behnel

Jackie wrote:

> I want to get the information of the professors (name,title) from the
> following link:
>
> "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

http://codespeak.net/lxml

> Ideally, I'd like to have an output file where each line is one Prof,
> including his name and title. In practice, I use the CSV module.
> ----------------------------------------------------
>
> import urllib,re,csv
>
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
>
> sock = urllib.urlopen(url)
> htmlSource = sock.read()
> sock.close()

import lxml.etree as et
url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)

> namePattern = re.compile(r'class="name">(.*)</a>')
> titlePattern = re.compile(r'</a>, (.*)\s*</td>')
>
> name = namePattern.findall(htmlSource)
> title_temp = titlePattern.findall(htmlSource)
> title = []
> for item in title_temp:
>     item_new = " ".join(item.split())  # Suppress the spaces between 'title' and </td>
>     title.extend([item_new])
> output = []
> for i in range(len(name)):
>     output.insert(i, [name[i], title[i]])  # Generate a list of [name, title]

# untested
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
    name_list.append(
        tuple(get_name_text(name_row).split(",", 3) + ["", "", ""])[:3])

> writer = csv.writer(open("professor.csv", "wb"))
> writer.writerows(output) #output CSV file

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list)  # output CSV file

-------------- End of Program
----------------------------------------------
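
The padding in the untested code above is what lets a title be missing: the split result is extended with empty strings and then cut back to a fixed width, so every row has the same number of fields. A small sketch of that idea on plain strings, assuming the row text has already been whitespace-normalized. The helper split_row and the sample rows are hypothetical, not taken from the page:

```python
def split_row(text, width=3):
    """Split 'name, title, ...' into exactly `width` fields, padding with ''."""
    parts = [part.strip() for part in text.split(",", width - 1)]
    return tuple(parts + [""] * width)[:width]


# Made-up sample rows: three fields, two fields, and a name with no title.
rows = ["Jane Roe, Professor, Emeritus", "John Doe, Associate Professor", "Sam Poe"]
name_list = [split_row(row) for row in rows]
```

Unlike the code above, this variant limits the split to width - 1 commas, so any extra commas stay inside the last field instead of being cut off.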

> 3. Should I close the opened csv file ("professor.csv")? How do I close
> it?

I guess it has a "close()" function?

Stefan

Jun 15 '07 #3

Jackie

On Jun 15, 2:01, Stefan Behnel <stefan.behnel-n05...@web.de> wrote:

> import lxml.etree as et
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
> tree = et.parse(url)

Thank you. But when I tried to run the above part, the following
message showed up:

Traceback (most recent call last):
  File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
    tree = et.parse(url)
  File "etree.pyx", line 1845, in etree.parse
  File "parser.pxi", line 928, in etree._parseDocument
  File "parser.pxi", line 932, in etree._parseDocumentFromURL
  File "parser.pxi", line 849, in etree._parseDocFromFile
  File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 631, in etree._handleParseResult
  File "parser.pxi", line 602, in etree._raiseParseError
etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8

Could you please tell me what went wrong?

Thank you

Jackie

Jun 19 '07 #4

Stefan Behnel

Jackie wrote:

> tree = et.parse(url)
>
> etree.XMLSyntaxError: line 2845: Premature end of data in tag html
> line 8
>
> Could you please tell me what went wrong?

Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom
instead:

parser = et.HTMLParser()
tree = et.parse(url, parser)
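
To illustrate why the parser choice matters, here is a standard-library sketch (not lxml itself, and assuming Python 3) on a made-up snippet whose tags are never closed: a strict XML parser rejects it, while a lenient HTML parser still reports the tags it finds. The snippet and the NameCollector class are illustrative assumptions, and html.parser stands in for lxml's HTMLParser only conceptually:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up broken markup: <html>, <table>, <tr> and <td> are never closed.
broken = '<html><table><tr><td><a class="name">Jane Roe</a>, Professor'

# A strict XML parser refuses a document that ends before its tags are closed.
try:
    ET.fromstring(broken)
    xml_ok = True
except ET.ParseError:
    xml_ok = False


class NameCollector(HTMLParser):
    """Collect the text of every <a class="name"> element."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "name":
            self.in_name = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names[-1] += data


# The lenient HTML parser accepts the same input without raising.
collector = NameCollector()
collector.feed(broken)
collector.close()
```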

Stefan

Jun 19 '07 #5
