extract news article from web

Hello,
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/ and display it
in a Tk listbox. All I want is the title and URL of each news item. Since
each news site has a different layout, I think I need some
template-based technique to build a news extractor for each site,
ignoring things such as tables, images, advertisements and Flash that I'm
not interested in.

So far I have built a simple GUI using Tkinter and a link extractor
that uses htmllib to pull HREFs out of a web page. But I really have no idea
how to extract news from a web site. Is anyone aware of general
techniques for extracting web news, or can anyone point me to some similar
projects?
I have seen some search engines doing this, for
example http://news.ithaki.net/, but I do not know the technique used.
Any tips?

Thanks in advance,

Zhang Le

Jul 18 '05 #1
Zhang Le wrote:
[...]

Well, for Python-related news I suck stuff from O'Reilly's Meerkat
service using XML-RPC. Once upon a time I used to update
www.holdenweb.com every four hours, but until my current hosting
situation changes I can't be arsed.

However, the code to extract the news is pretty simple. Here's the whole
program, modulo newsreader wrapping. It would be shorter if I weren't
stashing the extracted links in a relational database:

#!/usr/bin/python
#
# mkcheck.py: Get a list of article categories from the O'Reilly Network
#             and update the appropriate section database
#
import xmlrpclib
server = xmlrpclib.Server("http://www.oreillynet.com/meerkat/xml-rpc/server.php")

from db import conn, pmark
import mx.DateTime as dt
curs = conn.cursor()

pyitems = server.meerkat.getItems(
    {'search':'/[Pp]ython/','num_items':10,'descriptions':100})

sqlinsert = ("INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
             "VALUES(%s, %s, %s)" % (pmark, pmark, pmark))
for itm in pyitems:
    description = itm['description'] or itm['title']
    if itm['link'] and not ("<" in description):
        curs.execute("""SELECT COUNT(*) FROM PyLink
                        WHERE pylURL=%s""" % pmark, (itm['link'], ))
        newlink = curs.fetchone()[0] == 0
        if newlink:
            print "Adding", itm['link']
            curs.execute(sqlinsert,
                (dt.DateTimeFromTicks(int(dt.now())), itm['link'], description))

conn.commit()
conn.close()

Similar techniques can be used on many other sites, and you will find
that (some) RSS feeds are a fruitful source of news.
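
For anyone trying the RSS route today, here is a minimal sketch of pulling titles and links out of an RSS 2.0 feed. It is written for Python 3 and uses only the standard library's xml.etree.ElementTree, which is a swapped-in stand-in rather than anything used in this thread. The sample feed and the name rss_items are made up for illustration, and real feeds come in more dialects than this handles.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""

def rss_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> that has a link."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            items.append((title, link))
    return items

for title, link in rss_items(SAMPLE_RSS):
    print(title, link)
```

The (title, link) pairs are exactly the two fields the original question asks for, so they could be fed straight into a listbox.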

regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119
Jul 18 '05 #2
Steve Holden wrote:

[...]
However, the code to extract the news is pretty simple. Here's the whole
program, modulo newsreader wrapping. It would be shorter if I weren't
stashing the extracted links in a relational database:

[...]

I see that, as is so often the case, I only told half the story, and you
will be wondering what the "db" module does. The main answer is that it
adapts the same logic to two different database modules in an attempt to build
a little portability into the system (which may one day be open sourced).

The point is that MySQLdb requires a "%s" in queries to mark a
substitutable parameter, whereas mxODBC requires a "?". In order to work
around this difference the db module is imported by anything that uses
the database. This makes it easier to migrate between different database
technologies, though still far from painless, and allows testing by
accessing a MySQL database directly and via ODBC as another option.

Significant strings have been modified to protect the innocent.
--------
#
# db.py: establish a database connection with
#        the appropriate parameter style
#
try:
    import MySQLdb as db
    conn = db.connect(host="****", db="****",
                      user="****", passwd="****")
    pmark = "%s"
    print "Using MySQL"
except ImportError:
    import mx.ODBC.Windows as db
    conn = db.connect("****", user="****", password="****")
    pmark = "?"
    print "Using ODBC"
--------
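
The same placeholder trick can be tried without MySQL or ODBC. The sketch below is an illustration only, written for Python 3 against the standard library's sqlite3 module, which is not one of the two modules discussed above. It picks the marker from the module's DB-API paramstyle attribute instead of a try/except on the import. The helper name add_link_if_new and the cut-down two-column table are assumptions; only the PyLink, pylURL and pylDescription names come from the mkcheck.py listing.

```python
import sqlite3

# DB-API modules advertise their placeholder style in `paramstyle`.
# Only the two styles mentioned in this thread are mapped here.
MARKERS = {"qmark": "?", "format": "%s"}

def add_link_if_new(conn, pmark, url, description):
    """Insert (url, description) unless the URL is already stored.

    Returns True when a row was added. The schema is an assumption.
    """
    curs = conn.cursor()
    # Only the placeholder marker is interpolated into the SQL text;
    # the values themselves are always passed as parameters.
    curs.execute("SELECT COUNT(*) FROM PyLink WHERE pylURL=%s" % pmark, (url,))
    if curs.fetchone()[0]:
        return False
    curs.execute(
        "INSERT INTO PyLink (pylURL, pylDescription) VALUES(%s, %s)"
        % (pmark, pmark),
        (url, description),
    )
    conn.commit()
    return True

pmark = MARKERS[sqlite3.paramstyle]   # sqlite3 reports "qmark"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PyLink (pylURL TEXT, pylDescription TEXT)")
print(add_link_if_new(conn, pmark, "http://example.com/1", "First"))
print(add_link_if_new(conn, pmark, "http://example.com/1", "First again"))
```

The second call finds the URL already present and adds nothing, which mirrors the SELECT COUNT(*) check in mkcheck.py.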
regards
Steve
Jul 18 '05 #3
Thanks for the hint. The xml-rpc service is great, but I want some
general techniques to parse news information in the usual html pages.

Currently I'm looking at a script-based approach found at:
http://www.namo.com/products/handsto...ual/hsceditor/
Users can write simple templates to extract certain fields from a
web page. Unfortunately, it is not open source, so I cannot look
inside the black box. :-(

Zhang Le

Jul 18 '05 #4
Zhang Le wrote:
[...]

That's a very large topic, and not one that I could claim to be expert
on, so let's hope that others will pitch in with their favorite
techniques. Otherwise it's down to providing individual parsers for each
service you want to scan, and maintaining the parsers as each group of
designers modifies their pages.

You might want to look at BeautifulSoup, which is a module for extracting
stuff from (possibly) irregularly-formed HTML.

regards
Steve
Jul 18 '05 #5
If you have a reliably structured page, then you can write a custom
parser. As Steve points out, BeautifulSoup would be a very good place
to start.
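
To make "custom parser" concrete, here is a rough Python 3 sketch that uses the standard library's html.parser instead of BeautifulSoup, purely to stay dependency-free. It assumes, hypothetically, that a site marks its headline links with a CSS class; the class name "headline", the sample HTML and the names HeadlineParser and extract_headlines are all invented for the example. Each site would need its own rule, and the rule breaks whenever the designers change the page.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect (title, url) for <a> tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.headlines = []
        self._href = None   # href of the matching <a> we are inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace left over from nested markup.
            title = " ".join("".join(self._text).split())
            self.headlines.append((title, self._href))
            self._href = None

def extract_headlines(html, css_class):
    parser = HeadlineParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.headlines

# Made-up page fragment: one advert link to ignore, two headline links.
SAMPLE_HTML = """
<table><tr><td><img src="ad.gif"><a href="/ads/1">Buy now</a></td></tr></table>
<div><a class="headline" href="/news/1">First <b>big</b> story</a></div>
<div><a class="headline" href="/news/2">Second story</a>
"""

print(extract_headlines(SAMPLE_HTML, "headline"))
```

Keeping the per-site knowledge down to one argument (here the class name) is one way to get the "template" feel asked for earlier in the thread: adding a site means adding a rule, not a new parser.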

This is the problem that RSS was designed to solve. Many news sites will
supply exactly the information you want as an RSS feed. You should then
use Universal Feed Parser to process the feed.

The module you need for fetching the web pages (in case you didn't know)
is urllib2. There is a great article on fetching web pages in the
current issue of pyzine. See http://www.pyzine.com :-)
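
For readers on Python 3, where urllib2's functionality lives in urllib.request, a fetch helper might look roughly like the sketch below. The function name fetch, the User-Agent string and the UTF-8 fallback are assumptions made for the example, not anything prescribed in this thread.

```python
import urllib.request

def fetch(url, timeout=10):
    """Return the body of `url` decoded to text."""
    # The User-Agent value is a made-up example for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "news-listbox/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Fall back to UTF-8 when no charset is declared (an assumption).
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# A data: URL keeps the demonstration offline; a real call would pass an
# http:// or https:// address instead.
print(fetch("data:text/plain;charset=utf-8,hello%20world"))
```

The returned text can then be handed to whichever parser suits the site, whether that is a feed parser or a hand-written HTML one.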
Regards,

Fuzzy
http://www.voidspace.org.uk/python/index.shtml

Jul 18 '05 #6
On 22 Dec 2004 09:22:15 -0800, Zhang Le <si*******@sneakemail.com> wrote:
Hello,
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/, and display them
in a TK listbox. All I want are news title and url information.


Well, the BBC publishes an RSS feed[1], as do most sites like it. You
can read an RSS feed with Mark Pilgrim's Feed Parser[2].

Granted, you can't read *every* site like this. But I daresay that
*most* news related sites publish feeds of some kind these days. Where
they do, using the feed is a *far* better idea than trying to parse
the HTML.

--
Cheers,
Simon B,
si***@brunningonline.net,
http://www.brunningonline.net/simon/blog/
[1] http://news.bbc.co.uk/2/hi/help/3223484.stm
[2] http://feedparser.org/
Jul 18 '05 #7