Hands-on HTML Table Parser/Matrix?

Often I want to extract the contents of some web tables. The formats
are mostly static, with simple text & numbers in the cells; other tags
are to be stripped off. So a simple & fast approach would be OK.

Which of the different modules around is easiest to use, stable,
up-to-date, and offers iterator access or, best, matrix access (without
the need for callback functions, classes, ... for basic tasks)?
Robert
Jul 6 '08 #1
There are a couple of HTML examples using Pyparsing here:

http://pyparsing.wikispaces.com/Examples
--Tim

On Sun, 2008-07-06 at 14:40 +0200, robert wrote:
> Often I want to extract the contents of some web tables. The formats
> are mostly static, with simple text & numbers in the cells; other tags
> are to be stripped off. So a simple & fast approach would be OK.
>
> Which of the different modules around is easiest to use, stable,
> up-to-date, and offers iterator access or, best, matrix access (without
> the need for callback functions, classes, ... for basic tasks)?
> Robert
--
Timothy Cook, MSc
Health Informatics Research & Development Services
Jul 6 '08 #2
Tim Cook wrote:
> On Sun, 2008-07-06 at 14:40 +0200, robert wrote:
>> Often I want to extract the contents of some web tables. The formats
>> are mostly static, with simple text & numbers in the cells; other tags
>> are to be stripped off. So a simple & fast approach would be OK.
>>
>> Which of the different modules around is easiest to use, stable,
>> up-to-date, and offers iterator access or, best, matrix access (without
>> the need for callback functions, classes, ... for basic tasks)?
> There are a couple of HTML examples using Pyparsing here:
>
> http://pyparsing.wikispaces.com/Examples

Hm, nothing specific to HTML tables there.

Meanwhile:

I dislike "ClientTable" (file-centric, too many parsing errors in
the real world).

"TableParse" works. It is a very simple & fast 70-liner: regexp -> matrix,
plus strip/clean/HTML-entity conversion. Quick hands-on success.
It deliberately does not separate nested tables or handle similar
complexities, but it works for simple hands-on tasks in the real world.
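
To illustrate the regexp -> matrix idea described above, here is a minimal sketch using only the standard library. It is not TableParse's actual code; the names `ROW_RE`, `CELL_RE`, `TAG_RE`, `clean_cell` and `table_to_matrix` are made up for this example, and like the approach described it makes no attempt to handle nested tables.

```python
import re
from html import unescape

# Hypothetical patterns and helper names; this is NOT TableParse's code.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr\s*>", re.I | re.S)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]\s*>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def clean_cell(fragment):
    """Strip inner tags, decode HTML entities, collapse whitespace."""
    text = unescape(TAG_RE.sub("", fragment))
    return " ".join(text.split())


def table_to_matrix(html_text):
    """Return a list of rows, each a list of cleaned cell strings."""
    return [
        [clean_cell(cell) for cell in CELL_RE.findall(row)]
        for row in ROW_RE.findall(html_text)
    ]


# Made-up sample table for illustration.
sample = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td><b>Tea</b> &amp; milk</td><td> 1.50 </td></tr>
</table>
"""
matrix = table_to_matrix(sample)
print(matrix)
```

This gives plain matrix access (`matrix[1][0]`) with no callbacks or classes, at the cost of breaking on nested tables and on markup with unclosed cell tags.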
Robert
Jul 6 '08 #3
robert <no*****@no-spam-no-spam.invalid>:
> Often I want to extract the contents of some web tables. The formats
> are mostly static, with simple text & numbers in the cells; other tags
> are to be stripped off. So a simple & fast approach would be OK.
>
> Which of the different modules around is easiest to use, stable,
> up-to-date, and offers iterator access or, best, matrix access (without
> the need for callback functions, classes, ... for basic tasks)?
Not more than a handful of lines with lxml.html:

def htmltable2matrix(table):
    """Converts an HTML table to a matrix.

    :param table: The HTML table element
    :type table: An lxml element
    """
    matrix = []
    # Each child of the table is taken to be a row, each child of a row a cell.
    for row in table:
        matrix.append([e.text_content() for e in row])
    return matrix
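
A usage sketch for the function above, assuming lxml is installed. The sample page and the `//table` XPath are made up for illustration; the function is repeated in compact form so the snippet runs on its own.

```python
import lxml.html


def htmltable2matrix(table):
    """Convert an lxml table element to a list of rows of cell text."""
    return [[cell.text_content() for cell in row] for row in table]


# Hypothetical sample page; a real one would come from a file or a URL.
page = """
<html><body>
<table id="prices">
<tr><th>Item</th><th>Price</th></tr>
<tr><td><b>Tea</b></td><td>1.50</td></tr>
</table>
</body></html>
"""
doc = lxml.html.fromstring(page)
table = doc.xpath("//table")[0]  # first table in the document
matrix = htmltable2matrix(table)
print(matrix)
```

Note that this treats every direct child of the table as a row. If the markup wraps its rows in `<thead>` or `<tbody>`, iterating over `table.iter("tr")` instead is one way around that, though it would also pick up rows of nested tables.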

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 6 '08 #4
