What is the best way to process HTML Data?

ink

Hi all,

I am trying to pull some financial data off an HTML web page so that I
can store it in a database for sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it, but I am not experienced enough with this sort of thing to
make the best decision, so any advice would be great.

The data is in a number of different nested tables within the HTML, and on
a number of different pages, and each page is laid out differently.

The common factors are that each table is well formed and has a heading in
the first row of the table, or has a separate heading table just above
the data table.
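
To make the two heading layouts concrete, here is a rough sketch in Python (not .NET, just to keep it short and runnable). It assumes the tables have already been extracted as lists of rows of cell text. The rule "a one-row table followed by another table is a heading table" is only an assumption for illustration, and `tables_to_records` and the sample data are made-up names and values.

```python
from typing import Dict, List

Table = List[List[str]]  # a table is a list of rows; a row is a list of cell texts


def tables_to_records(tables: List[Table]) -> List[Dict[str, str]]:
    """Turn extracted tables into records keyed by column heading.

    Assumption (not taken from the real pages): a table with exactly one
    row that is followed by another table is a separate heading table.
    Otherwise the first row of the table is taken as the heading.
    """
    records: List[Dict[str, str]] = []
    i = 0
    while i < len(tables):
        table = tables[i]
        if not table:
            i += 1  # skip empty tables
            continue
        if len(table) == 1 and i + 1 < len(tables):
            header, data_rows = table[0], tables[i + 1]
            i += 2  # consume the heading table and its data table
        else:
            header, data_rows = table[0], table[1:]
            i += 1
        for row in data_rows:
            records.append(dict(zip(header, row)))
    return records


# Made-up sample data covering both layouts.
inline_heading = [["Ticker", "Price"], ["ABC", "101.5"]]
separate_heading = [["Ticker", "Yield"]]
separate_data = [["XYZ", "4.2"], ["DEF", "3.1"]]

records = tables_to_records([inline_heading, separate_heading, separate_data])
for record in records:
    print(record)
```

Once every row is a heading-to-value mapping, the later steps do not need to care which layout a page used.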

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.
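
A minimal sketch of the first two steps of this idea, in Python rather than .NET. The standard library's tolerant `html.parser` stands in for SGMLReader here; it does not produce XHTML, it just copes with unclosed tags. The `Quote` class, the `RowCollector` helper and the sample markup are hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List


@dataclass
class Quote:
    """Hypothetical record type; the real fields depend on the page."""
    ticker: str
    price: float


class RowCollector(HTMLParser):
    """Collects cell text grouped by <tr>, tolerating unclosed tags.

    Nested tables are flattened into one list of rows, which is a
    simplification for this sketch.
    """

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Made-up page fragment with unclosed <th>, <td> and <tr> tags.
SLOPPY_HTML = """
<table>
  <tr><th>Ticker<th>Price
  <tr><td>ABC<td>101.5
  <tr><td>XYZ<td>99
</table>
"""

collector = RowCollector()
collector.feed(SLOPPY_HTML)
collector.close()

rows = [[cell.strip() for cell in row] for row in collector.rows]
header, *data_rows = rows
quotes = [Quote(ticker=row[0], price=float(row[1])) for row in data_rows]
print(header)
print(quotes)
```

Mapping by column position like this is fragile if a page reorders its columns; looking values up by heading text would be sturdier.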

Another idea is:

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.
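
A rough sketch of the XPath step, again in Python. It assumes the page has already been tidied into well-formed XHTML with no XML namespace on the elements. `xml.etree.ElementTree` only supports a subset of XPath, which is enough for this made-up page; the `prices` id is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, already-tidied page: a data table nested inside a layout table.
XHTML = """
<html><body>
  <table id="layout"><tr><td>
    <table id="prices">
      <tr><th>Ticker</th><th>Price</th></tr>
      <tr><td>ABC</td><td>101.5</td></tr>
      <tr><td>XYZ</td><td>99</td></tr>
    </table>
  </td></tr></table>
</body></html>
"""

root = ET.fromstring(XHTML)

# Select only the rows of the nested data table, skipping the layout table.
rows = root.findall(".//table[@id='prices']/tr")

# The first row holds the headings; the remaining rows hold the data.
header = [th.text for th in rows[0].findall("th")]
records = [
    dict(zip(header, (td.text for td in row.findall("td"))))
    for row in rows[1:]
]
print(records)
```

Compared with regular expressions, the path expression ignores whitespace and attribute order, so it tends to survive cosmetic changes to the markup.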

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
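
The last step, objects into the database, could look something like this sketch. It uses Python's built-in `sqlite3` purely as a stand-in for whatever database is actually used; the `Quote` class, the `quotes` table and its columns are hypothetical.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import List


@dataclass
class Quote:
    """Hypothetical record type; the real fields depend on the page."""
    ticker: str
    price: float


def save_quotes(conn: sqlite3.Connection, quotes: List[Quote]) -> None:
    """Store the objects, replacing any existing row for the same ticker."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quotes (ticker TEXT PRIMARY KEY, price REAL)"
    )
    # INSERT OR REPLACE lets the same page be re-read without duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO quotes (ticker, price) VALUES (?, ?)",
        [astuple(q) for q in quotes],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
save_quotes(conn, [Quote("ABC", 101.5), Quote("XYZ", 99.0)])
save_quotes(conn, [Quote("ABC", 102.0)])  # a later read of the same page

# Sorting and filtering then become plain SQL.
rows = conn.execute(
    "SELECT ticker, price FROM quotes ORDER BY price DESC"
).fetchall()
print(rows)
```

Keeping the objects as a thin layer between extraction and storage means a change in page layout only touches the extraction code.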

I am not sure which will be the simplest to write and maintain in the future
should things change on the HTML pages. I am new to both these types of
development: I have done each of them at least once, but on a much smaller
scale, so I sort of know how they work but not what the potential pitfalls are
or whether it is possible to use these sorts of things for such complex HTML.
One good thing I can see about XPath is that I could store it in a
config file, and if the web page changed I could change where it reads the
data from without too much work. I am not even sure that I would be able to
deserialize such a complex HTML model, or whether I can just deserialize the tables.
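
The config-file idea might be sketched like this, in Python. The JSON layout, the page name `prices_page` and the column names are all made up; in practice the JSON would sit in its own file, and it is inlined here only so the sketch runs as-is. The paths use the XPath subset that `xml.etree.ElementTree` supports.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical config: one entry per page, holding the row path and a
# per-column path relative to each row.
CONFIG_JSON = """
{
  "prices_page": {
    "rows": ".//table[@id='prices']/tr",
    "columns": {"ticker": "td[1]", "price": "td[2]"}
  }
}
"""

# Made-up, already-tidied page.
PAGE = """
<html><body>
  <table id="prices">
    <tr><th>Ticker</th><th>Price</th></tr>
    <tr><td>ABC</td><td>101.5</td></tr>
    <tr><td>XYZ</td><td>99</td></tr>
  </table>
</body></html>
"""


def extract(page_xml: str, page_config: dict) -> list:
    """Pull one record per matching row, using only paths from the config."""
    root = ET.fromstring(page_xml)
    records = []
    for row in root.findall(page_config["rows"]):
        cells = {
            name: row.find(path)
            for name, path in page_config["columns"].items()
        }
        if any(cell is None for cell in cells.values()):
            continue  # heading rows have <th> cells, so the td paths find nothing
        records.append(
            {name: (cell.text or "").strip() for name, cell in cells.items()}
        )
    return records


config = json.loads(CONFIG_JSON)
records = extract(PAGE, config["prices_page"])
print(records)
```

If the site moves a column or renames the table, only the JSON entry for that page needs to change, not the code.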

This is the kind of thing that I really want to get as close to correct the
first time as I can, so any ideas would be great. As you can see, I am
struggling with a lot of new-boy questions.

Thanks,
ink

Nov 8 '07 #1

Jesse Houwing

Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl

Nov 8 '07 #2

ink

Thanks Jesse,

I will give this a test tonight.

ink

Nov 8 '07 #3

Ashot Geodakov

There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of around
$10 per month, and they'll send you spreadsheets with your morning coffee!
:)

Nov 8 '07 #4

ink

Really, I have no problem paying, but I can't find any that are selling UK data
to private investors.

They seem to think everyone is a large broker with £500 a month to blow
on data.

Nov 9 '07 #5
