Bytes | Software Development & Data Engineering Community

What is the best way to process HTML Data?

ink
Hi all,

I am trying to pull some financial data off an HTML web page so that I can
store it in a database for sorting and filtering.
I have been thinking about this for some time, trying to find the best way to
do it, but I am just not experienced enough with this sort of thing to make
the best decision, so any advice would be great.

The data is spread across a number of nested tables within the HTML, and
across a number of different pages, and each page is laid out differently.

The common factors are that each table is well formed and either has a heading
within the first row of the table, or has a separate heading table just above
the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
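For what it's worth, the second option (tidy, then query) only takes a few lines once the tidy step is done. The sketch below is Python for brevity (the same shape applies in .NET with SGMLReader feeding an XPathNavigator), and it assumes SGMLReader has already produced well-formed XHTML; the table contents are invented for illustration.

```python
# Sketch of "tidy, then query": pull headers and rows out of a data table,
# assuming the HTML has already been cleaned into well-formed XHTML.
# ElementTree supports a small XPath subset, enough for this job.
import xml.etree.ElementTree as ET

xhtml = """
<html><body>
  <table>
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>ABC</td><td>1.23</td></tr>
    <tr><td>XYZ</td><td>4.56</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(xhtml)
table = root.find(".//table")      # first data table on the page
rows = table.findall("tr")

# Per the layout described above, the first row holds the headings.
headers = [th.text for th in rows[0]]
records = [dict(zip(headers, (td.text for td in row))) for row in rows[1:]]
```

Each dict in `records` then maps straight onto an INSERT statement, which covers step 3.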

I am not sure what will be the simplest to write and maintain in the future
should things change on the HTML pages. As I am new to both these types of
development, having done them all at least once but on a much smaller scale, I
sort of know how they work but not what the potential pitfalls are, or whether
it is possible to use these sorts of things for such complex HTML.
One thing I can think of that is good about XPath is that I could store it in
a config file, and if the web page changed I could change where it reads the
data from without too much work. I am not even sure that I would be able to
deserialize such a complex HTML model, or whether I can just deserialize the
tables.
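The config-file idea is sound, and it can be sketched in a few lines. Again this is Python for brevity; the section and key names are made up for illustration, and a .NET equivalent would read the expressions from app.config instead.

```python
# Minimal sketch of keeping table locations in config, so a page redesign
# only means editing the config file, not the code. Names are illustrative.
import configparser
import xml.etree.ElementTree as ET

config_text = """
[broker_summary_page]
table = .//table
"""

config = configparser.ConfigParser()
config.read_string(config_text)

xhtml = ("<html><body><table><tr><th>Symbol</th></tr>"
         "<tr><td>ABC</td></tr></table></body></html>")
root = ET.fromstring(xhtml)

# Look the expression up by page name instead of hard-coding it.
table_path = config["broker_summary_page"]["table"]
table = root.find(table_path)
first_heading = table.find("tr/th").text
```

If a page moves its data table, only the `table` entry for that page's section changes.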

This is the kind of thing that I really want to get as close to correct the
first time as I can, so any ideas would be great. As you can see, I am
struggling with a lot of new-boy questions.

Thanks,
ink

Nov 8 '07 #1
Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack
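The Agility Pack itself is a .NET library that parses real-world, non-well-formed HTML into a DOM you can query with XPath, so it replaces the separate tidy step. As a rough stdlib illustration of the same tolerant-parsing idea (in Python, since this is only a sketch), note that the markup below never closes its `<tr>` or `<td>` tags, yet the cells still come out:

```python
# A tolerant parser shrugs off unclosed tags instead of rejecting the page.
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell it sees."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.cells.append(data.strip())

messy = "<table><tr><td>ABC<td>1.23<tr><td>XYZ<td>4.56</table>"
parser = CellCollector()
parser.feed(messy)
```

The Agility Pack gives you the same forgiveness plus a full XPath-queryable tree, which is why it fits this scraping job well.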

Jesse
--
Jesse Houwing
jesse.houwing at sogeti.nl
Nov 8 '07 #2
ink
Thanks Jesse,

I will give this a test tonight.

ink

"Jesse Houwing" <je***********@newsgroup.nospamwrote in message
news:21**************************@news.microsoft.c om...
Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack

Jesse
>Hi all,

I am trying to pull some financial data off of an HTML web page so
that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the
best
way to do it but I am just not experienced enough with this sort of
thing to
make the best decision, so any advice would be great.
The data is on a number of different nested tables with in the HTML,
and on a number of different pages, and each page is laid out
differently.

The common factors are that each Table is well formed and has a
heading with in the first row of the table, or has a separate heading
table just above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.
Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from
the
tables.
3. Store the data into the database.
Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data
tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both
these types of development having done them all at least once but on a
much smaller scale I sort of know how they work but not what the
potential pit falls are and weather it is possible to use these sorts
of things for such complex HTML. One thing I can think that is good
about XPath is that I could store it in a config file and if the web
page changed I could change where it read the data from with out to
much work. I am not even sure that I would be able to Deserialize such
a complex HTML model or can I just Deserialize the tables.

This is the kind of thing that I really want to get as close to
correct the first time as I can. So any ideas would be great. As you
can see I am struggling with a lot of new boy questions.

Thanks,
ink
--
Jesse Houwing
jesse.houwing at sogeti.nl

Nov 8 '07 #3
There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of around
$10 per month, and they'll send you spreadsheets with your morning coffee!
:)

"ink" <in*@notmyemail.comwrote in message
news:eo**************@TK2MSFTNGP03.phx.gbl...
Hi all,

I am trying to pull some financial data off of an HTML web page so that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it but I am just not experienced enough with this sort of thing
to make the best decision, so any advice would be great.

The data is on a number of different nested tables with in the HTML, and
on a number of different pages, and each page is laid out differently.

The common factors are that each Table is well formed and has a heading
with in the first row of the table, or has a separate heading table just
above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.

I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both these
types of development having done them all at least once but on a much
smaller scale I sort of know how they work but not what the potential pit
falls are and weather it is possible to use these sorts of things for such
complex HTML. One thing I can think that is good about XPath is that I
could store it in a config file and if the web page changed I could change
where it read the data from with out to much work. I am not even sure that
I would be able to Deserialize such a complex HTML model or can I just
Deserialize the tables.

This is the kind of thing that I really want to get as close to correct
the first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink

Nov 8 '07 #4
ink
Really, I have no problem paying, but I can't find any that are selling UK
data to private investors.

They seem to think everyone is a large broker and has £500 a month to blow on
data.

"Ashot Geodakov" <a_********@nospam.hotmail.comwrote in message
news:OB**************@TK2MSFTNGP02.phx.gbl...
There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of
around $10 per month, and they'll send you spreadsheets with your morning
coffee! :)

"ink" <in*@notmyemail.comwrote in message
news:eo**************@TK2MSFTNGP03.phx.gbl...
>Hi all,

I am trying to pull some financial data off of an HTML web page so that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it but I am just not experienced enough with this sort of thing
to make the best decision, so any advice would be great.

The data is on a number of different nested tables with in the HTML, and
on a number of different pages, and each page is laid out differently.

The common factors are that each Table is well formed and has a heading
with in the first row of the table, or has a separate heading table just
above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from
the tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.

I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both these
types of development having done them all at least once but on a much
smaller scale I sort of know how they work but not what the potential pit
falls are and weather it is possible to use these sorts of things for
such complex HTML. One thing I can think that is good about XPath is that
I could store it in a config file and if the web page changed I could
change where it read the data from with out to much work. I am not even
sure that I would be able to Deserialize such a complex HTML model or can
I just Deserialize the tables.

This is the kind of thing that I really want to get as close to correct
the first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink

Nov 9 '07 #5

This thread has been closed and replies have been disabled. Please start a new discussion.
