Bytes IT Community

What is the best way to process HTML Data?

ink
Hi all,

I am trying to pull some financial data off an HTML web page so that I can
store it in a database for sorting and filtering.
I have been thinking about this for some time, trying to find the best way
to do it, but I am just not experienced enough with this sort of thing to
make the best decision, so any advice would be great.

The data sits in a number of nested tables within the HTML, across a number
of different pages, and each page is laid out differently.

The common factors are that each table is well formed and either has a
heading within the first row of the table, or has a separate heading table
just above the data table.

This is the kind of thing I was thinking of doing:

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea:

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.

Or maybe a combination of both:

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
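For what it's worth, the second option can be sketched in a few lines. The example below uses Python rather than .NET just to keep it short; the table layout, column names, and the `quotes` schema are all made up, and it assumes the tidy step has already produced well-formed XHTML:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied XHTML fragment with a headed data table.
xhtml = """
<html><body>
  <table>
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>ABC</td><td>1.23</td></tr>
    <tr><td>XYZ</td><td>4.56</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(xhtml)
table = root.find(".//table")   # ElementTree supports a small XPath subset
rows = table.findall("tr")
headers = [th.text for th in rows[0].findall("th")]

# Step 3: store the extracted cells in a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (symbol TEXT, price REAL)")
for tr in rows[1:]:
    cells = [td.text for td in tr.findall("td")]
    conn.execute("INSERT INTO quotes VALUES (?, ?)", (cells[0], float(cells[1])))
conn.commit()

print(conn.execute("SELECT symbol, price FROM quotes ORDER BY symbol").fetchall())
# [('ABC', 1.23), ('XYZ', 4.56)]
```

The same pipeline in .NET would swap `ElementTree` for an `XPathNavigator` over the SGMLReader output.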

I am not sure which will be the simplest to write and to maintain in the
future should things change on the HTML pages. I am new to both these types
of development: I have done each at least once, but on a much smaller scale,
so I roughly know how they work, but not what the potential pitfalls are or
whether it is even possible to use these techniques on such complex HTML.
One thing in XPath's favour is that I could store the expressions in a
config file, so if a web page changed I could change where the data is read
from without too much work. I am not even sure whether I would be able to
deserialize such a complex HTML model, or whether I can deserialize just the
tables.
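The config-file idea is roughly this: keep one XPath per page in an external file, so a layout change means editing a string, not recompiling. A minimal sketch (the page name, file format, and path are all hypothetical):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical config, one XPath expression per scraped page.
# In practice this would be loaded from a file next to the app.
config = json.loads('{"summary_page": ".//table/tr"}')

xhtml = "<html><body><table><tr><td>42</td></tr></table></body></html>"
root = ET.fromstring(xhtml)
rows = root.findall(config["summary_page"])
print(len(rows))  # 1
```

If the site moves the table, only the `"summary_page"` entry changes.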

This is the kind of thing that I really want to get as close to correct the
first time as I can, so any ideas would be great. As you can see, I am
struggling with a lot of new-boy questions.

Thanks,
ink

Nov 8 '07 #1
4 Replies


Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack
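(For readers outside .NET: the Agility Pack's main selling point is that, unlike a strict XML parser, it tolerates real-world malformed HTML. Python's stdlib `html.parser` behaves the same way; the snippet below only illustrates that tolerant-parsing idea, not the Agility Pack API.)

```python
from html.parser import HTMLParser

# A tolerant parser copes with unclosed tags; this one just
# collects the text of every <td> cell it encounters.
class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

p = CellCollector()
p.feed("<table><tr><td>ABC<td>1.23</table>")  # unclosed tags on purpose
print(p.cells)  # ['ABC', '1.23']
```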

Jesse
--
Jesse Houwing
jesse.houwing at sogeti.nl
Nov 8 '07 #2

ink
Thanks Jesse,

I will give this a test tonight.

ink

"Jesse Houwing" <je***********@newsgroup.nospamwrote in message
news:21**************************@news.microsoft.c om...
Hello ink,

Have a look at the HTML Agility Pack, which does all you need.

http://www.codeplex.com/htmlagilitypack

Jesse
>Hi all,

I am trying to pull some financial data off of an HTML web page so
that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the
best
way to do it but I am just not experienced enough with this sort of
thing to
make the best decision, so any advice would be great.
The data is on a number of different nested tables with in the HTML,
and on a number of different pages, and each page is laid out
differently.

The common factors are that each Table is well formed and has a
heading with in the first row of the table, or has a separate heading
table just above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.
Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from
the
tables.
3. Store the data into the database.
Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data
tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both
these types of development having done them all at least once but on a
much smaller scale I sort of know how they work but not what the
potential pit falls are and weather it is possible to use these sorts
of things for such complex HTML. One thing I can think that is good
about XPath is that I could store it in a config file and if the web
page changed I could change where it read the data from with out to
much work. I am not even sure that I would be able to Deserialize such
a complex HTML model or can I just Deserialize the tables.

This is the kind of thing that I really want to get as close to
correct the first time as I can. So any ideas would be great. As you
can see I am struggling with a lot of new boy questions.

Thanks,
ink
--
Jesse Houwing
jesse.houwing at sogeti.nl

Nov 8 '07 #3

There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of around
$10 per month, and they'll send you spreadsheets with your morning coffee!
:)

"ink" <in*@notmyemail.comwrote in message
news:eo**************@TK2MSFTNGP03.phx.gbl...
Hi all,

I am trying to pull some financial data off of an HTML web page so that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it but I am just not experienced enough with this sort of thing
to make the best decision, so any advice would be great.

The data is on a number of different nested tables with in the HTML, and
on a number of different pages, and each page is laid out differently.

The common factors are that each Table is well formed and has a heading
with in the first row of the table, or has a separate heading table just
above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.

I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both these
types of development having done them all at least once but on a much
smaller scale I sort of know how they work but not what the potential pit
falls are and weather it is possible to use these sorts of things for such
complex HTML. One thing I can think that is good about XPath is that I
could store it in a config file and if the web page changed I could change
where it read the data from with out to much work. I am not even sure that
I would be able to Deserialize such a complex HTML model or can I just
Deserialize the tables.

This is the kind of thing that I really want to get as close to correct
the first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink

Nov 8 '07 #4

ink
Really, I have no problem paying, but I can't find any that sell UK data to
private investors.

They seem to think everyone is a large broker with 500 a month to blow on
data.


"Ashot Geodakov" <a_********@nospam.hotmail.comwrote in message
news:OB**************@TK2MSFTNGP02.phx.gbl...
There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of
around $10 per month, and they'll send you spreadsheets with your morning
coffee! :)

"ink" <in*@notmyemail.comwrote in message
news:eo**************@TK2MSFTNGP03.phx.gbl...
>Hi all,

I am trying to pull some financial data off of an HTML web page so that I
can store it in a Database for Sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it but I am just not experienced enough with this sort of thing
to make the best decision, so any advice would be great.

The data is on a number of different nested tables with in the HTML, and
on a number of different pages, and each page is laid out differently.

The common factors are that each Table is well formed and has a heading
with in the first row of the table, or has a separate heading table just
above the data table.

This is the kind of thing that I was thinking about doing.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Or another idea is.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from
the tables.
3. Store the data into the database.

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.

I am not sure what will be the simplest to write and maintain in the
future should things change on the HTML pages. As I am new to both these
types of development having done them all at least once but on a much
smaller scale I sort of know how they work but not what the potential pit
falls are and weather it is possible to use these sorts of things for
such complex HTML. One thing I can think that is good about XPath is that
I could store it in a config file and if the web page changed I could
change where it read the data from with out to much work. I am not even
sure that I would be able to Deserialize such a complex HTML model or can
I just Deserialize the tables.

This is the kind of thing that I really want to get as close to correct
the first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink

Nov 9 '07 #5

This discussion thread is closed; replies have been disabled.