Bytes | Software Development & Data Engineering Community

What is the best way to process HTML Data?

ink
Hi all,

I am trying to pull some financial data off an HTML web page so that I can
store it in a database for sorting and filtering.
I have been thinking about this for some time, trying to find the best way to
do it, but I am just not experienced enough with this sort of thing to make
the best decision, so any advice would be great.

The data is spread across a number of nested tables within the HTML, and
across a number of different pages, and each page is laid out differently.

The common factors are that each table is well formed and either has a heading
within the first row of the table, or has a separate heading table just above
the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
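For what it's worth, the second option (tidy, then query) only takes a few lines once the tidy step is done. The sketch below is Python for brevity (the same shape applies in .NET with SGMLReader feeding an XPathNavigator), and it assumes SGMLReader has already produced well-formed XHTML; the table contents are invented for illustration.

```python
# Sketch of "tidy, then query": pull headers and rows out of a data table,
# assuming the HTML has already been cleaned into well-formed XHTML.
# ElementTree supports a small XPath subset, enough for this job.
import xml.etree.ElementTree as ET

xhtml = """
<html><body>
  <table>
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>ABC</td><td>1.23</td></tr>
    <tr><td>XYZ</td><td>4.56</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(xhtml)
table = root.find(".//table")      # first data table on the page
rows = table.findall("tr")

# Per the layout described above, the first row holds the headings.
headers = [th.text for th in rows[0]]
records = [dict(zip(headers, (td.text for td in row))) for row in rows[1:]]
```

Each dict in `records` then maps straight onto an INSERT statement, which covers step 3.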

I am not sure what will be the simplest to write and maintain in the future
should things change on the HTML pages. As I am new to both these types of
development, having done them all at least once but on a much smaller scale, I
sort of know how they work but not what the potential pitfalls are, or whether
it is possible to use these sorts of things for such complex HTML.
One thing I can think of that is good about XPath is that I could store it in
a config file, and if the web page changed I could change where it reads the
data from without too much work. I am not even sure that I would be able to
deserialize such a complex HTML model, or whether I can just deserialize the
tables.
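The config-file idea is sound, and it can be sketched in a few lines. Again this is Python for brevity; the section and key names are made up for illustration, and a .NET equivalent would read the expressions from app.config instead.

```python
# Minimal sketch of keeping table locations in config, so a page redesign
# only means editing the config file, not the code. Names are illustrative.
import configparser
import xml.etree.ElementTree as ET

config_text = """
[broker_summary_page]
table = .//table
"""

config = configparser.ConfigParser()
config.read_string(config_text)

xhtml = ("<html><body><table><tr><th>Symbol</th></tr>"
         "<tr><td>ABC</td></tr></table></body></html>")
root = ET.fromstring(xhtml)

# Look the expression up by page name instead of hard-coding it.
table_path = config["broker_summary_page"]["table"]
table = root.find(table_path)
first_heading = table.find("tr/th").text
```

If a page moves its data table, only the `table` entry for that page's section changes.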

This is the kind of thing that I really want to get as close to correct the
first time as I can, so any ideas would be great. As you can see, I am
struggling with a lot of new-boy questions.

Thanks,
ink

Nov 8 '07 #1
Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack
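The Agility Pack itself is a .NET library that parses real-world, non-well-formed HTML into a DOM you can query with XPath, so it replaces the separate tidy step. As a rough stdlib illustration of the same tolerant-parsing idea (in Python, since this is only a sketch), note that the markup below never closes its `<tr>` or `<td>` tags, yet the cells still come out:

```python
# A tolerant parser shrugs off unclosed tags instead of rejecting the page.
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell it sees."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.cells.append(data.strip())

messy = "<table><tr><td>ABC<td>1.23<tr><td>XYZ<td>4.56</table>"
parser = CellCollector()
parser.feed(messy)
```

The Agility Pack gives you the same forgiveness plus a full XPath-queryable tree, which is why it fits this scraping job well.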

Jesse
--
Jesse Houwing
jesse.houwing at sogeti.nl
Nov 8 '07 #2
ink
Thanks Jesse,

I will give this a test tonight.

ink

"Jesse Houwing" <je***********@newsgroup.nospamwrote in message
news:21**************************@news.microsoft.c om...
Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack

Jesse
>Hi all,

I am trying to pull some financial data off of an HTML web page so
that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the
best
way to do it but I am just not experienced enough with this sort of
thing to
make the best decision, so any advice would be great.
The data is on a number of different nested tables with in the HTML,
and on a number of different pages, and each page is laid out
differently.

The common factors are that each Table is well formed and has a
heading with in the first row of the table, or has a separate heading
table just above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.
Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from
the
tables.
3. Store the data into the database.
Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data
tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both
these types of development having done them all at least once but on a
much smaller scale I sort of know how they work but not what the
potential pit falls are and weather it is possible to use these sorts
of things for such complex HTML. One thing I can think that is good
about XPath is that I could store it in a config file and if the web
page changed I could change where it read the data from with out to
much work. I am not even sure that I would be able to Deserialize such
a complex HTML model or can I just Deserialize the tables.

This is the kind of thing that I really want to get as close to
correct the first time as I can. So any ideas would be great. As you
can see I am struggling with a lot of new boy questions.

Thanks,
ink
--
Jesse Houwing
jesse.houwing at sogeti.nl

Nov 8 '07 #3
There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of around
$10 per month, and they'll send you spreadsheets with your morning coffee!
:)

"ink" <in*@notmyemail.comwrote in message
news:eo**************@TK2MSFTNGP03.phx.gbl...
Hi all,

I am trying to pull some financial data off of an HTML web page so that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it but I am just not experienced enough with this sort of thing
to make the best decision, so any advice would be great.

The data is on a number of different nested tables with in the HTML, and
on a number of different pages, and each page is laid out differently.

The common factors are that each Table is well formed and has a heading
with in the first row of the table, or has a separate heading table just
above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.

I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both these
types of development having done them all at least once but on a much
smaller scale I sort of know how they work but not what the potential pit
falls are and weather it is possible to use these sorts of things for such
complex HTML. One thing I can think that is good about XPath is that I
could store it in a config file and if the web page changed I could change
where it read the data from with out to much work. I am not even sure that
I would be able to Deserialize such a complex HTML model or can I just
Deserialize the tables.

This is the kind of thing that I really want to get as close to correct
the first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink

Nov 8 '07 #4
ink
Really, I have no problem paying, but I can't find any that are selling UK
data to private investors.

They seem to think everyone is a large broker and has £500 a month to blow on
data.

"Ashot Geodakov" <a_********@nospam.hotmail.comwrote in message
news:OB**************@TK2MSFTNGP02.phx.gbl...
There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of
around $10 per month, and they'll send you spreadsheets with your morning
coffee! :)

"ink" <in*@notmyemail.comwrote in message
news:eo**************@TK2MSFTNGP03.phx.gbl...
>Hi all,

I am trying to pull some financial data off of an HTML web page so that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it but I am just not experienced enough with this sort of thing
to make the best decision, so any advice would be great.

The data is on a number of different nested tables with in the HTML, and
on a number of different pages, and each page is laid out differently.

The common factors are that each Table is well formed and has a heading
with in the first row of the table, or has a separate heading table just
above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from
the tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.

I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both these
types of development having done them all at least once but on a much
smaller scale I sort of know how they work but not what the potential pit
falls are and weather it is possible to use these sorts of things for
such complex HTML. One thing I can think that is good about XPath is that
I could store it in a config file and if the web page changed I could
change where it read the data from with out to much work. I am not even
sure that I would be able to Deserialize such a complex HTML model or can
I just Deserialize the tables.

This is the kind of thing that I really want to get as close to correct
the first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink

Nov 9 '07 #5

This thread has been closed and replies have been disabled. Please start a new discussion.
