HTML Scraping using XmlTextReader

Greetings.

Just wondering if it is possible to use XmlTextReader to
read an HTML doc:

e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");

where test.xml contains the following:

<table cellspacing="1" cellpadding="1" width="100%">
<tr valign="top">
<td class="head" width="20%">test heading1</td>
<td class="head" width="10%">test heading2</td>
</tr>
<tr valign="top">
<td class="content" width="20%">content1</td>
<td class="content" width="10%">
<table cellspacing="0" width="100%">
<tr>
<td align="left">test</td>
<td nowarp align="right">
<nobr>0.123456</nobr>
</td>
</tr>
</table>
</td>
</tr>
</table>

It seems to work for the first few seconds, and then it
crashes my Windows app when the XmlTextReader comes across
a certain situation during XmlTextReader.Read(). Is
it to do with the well-formedness (is there such a word??) of
this HTML doc? Also, is there a way to detect and convert
&nbsp; to the #1390 (can't remember if this is right, but I
am trying to say the equivalent special character) on the
fly (i.e. without saving the HTML to disk)?

Any thoughts will be appreciated.
Nov 11 '05 #1
Daniel wrote:
> Just wondering if it is possible to use XmlTextReader to
> read off a html doc:

Not really, because HTML is not XML. Some HTML docs might be well-formed, so
they can be read by XmlTextReader, but in general a single <br> tag or
an &nbsp; entity, ubiquitous in HTML, will stop reading.

> e.g. XmlTextReader tr = new XmlTextReader
> ("http://localhost/test.xml");
>
> where test.xml contains the following:
>
> <table cellspacing="1" cellpadding="1" width="100%">
> <tr valign="top">
> <td class="head" width="20%">test heading1</td>
> <td class="head" width="10%">test heading2</td>
> </tr>
> <tr valign="top">
> <td class="content" width="20%">content1</td>
> <td class="content" width="10%">
> <table cellspacing="0" width="100%">
> <tr>
> <td align="left">test</td>
> <td nowarp align="right">

Watch out for nowrap: it's a so-called boolean attribute (an attribute
written without a value), and XML doesn't support that.
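
As a rough illustration of the two problems named above, the sketch below feeds a few small fragments to XmlTextReader and reports whether each one parses. The class name, the TryRead helper and the sample strings are made up for this example. An unhandled XmlException like the ones caught here is a likely cause of the crash described in the question.

```csharp
using System;
using System.IO;
using System.Xml;

class WellFormednessDemo
{
    // Hypothetical helper: reads the whole fragment and returns null on
    // success, or the parser's error message if it is not well-formed XML.
    static string TryRead(string markup)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(markup));
        try
        {
            while (reader.Read()) { }   // pull every node until end of input
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;          // line/position details are in the message
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main()
    {
        string[] samples =
        {
            "<td align=\"left\">test</td>",       // well-formed
            "<td nowrap align=\"right\">x</td>",  // attribute without a value
            "<td>a<br>b</td>"                     // <br> is never closed
        };

        foreach (string sample in samples)
        {
            string error = TryRead(sample);
            Console.WriteLine("{0} => {1}", sample, error ?? "OK");
        }
    }
}
```

Wrapping the Read() loop in a try/catch like this does not make the HTML readable, but it lets the application report where parsing stopped instead of going down with the exception.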

Try SGMLReader instead of XmlTextReader
http://www.gotdotnet.com/Community/U...4-C3BD760564BC
--
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

Nov 11 '05 #2
Thanks Oleg,

The url you provided looks very interesting. And looking
at the replies the sgmlreader has got, people are
definitely finding it useful. And I will definitely
download it and have a play with it.

However, I do want to learn more about reading html using
the XmlTextReader. Do you (or anybody out there) know of a
good url to get me started?

Cheers.

Nov 11 '05 #3
Daniel wrote:
> However, I do want to learn more about reading html using
> the XmlTextReader. Do you (or anybody out there) know of a
> good url to get me started?

Not really. It's simply not technically possible to read HTML with XmlTextReader
without some sort of preprocessing (i.e. converting the HTML to XML or
XHTML). Tidy is often used for that too. Google for "HTML Tidy".
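
For the narrower &nbsp; part of the original question, a minimal sketch of in-memory preprocessing might look like the following. It assumes the markup is otherwise well-formed and already held in a string (for example, downloaded with WebClient.DownloadString rather than saved to disk). The class name, the ReplaceNbsp helper and the sample markup are invented for illustration. The numeric character reference for a non-breaking space is &#160; (U+00A0).

```csharp
using System;
using System.IO;
using System.Xml;

class NbspPreprocessDemo
{
    // Hypothetical helper: swap the HTML-only named entity for its numeric
    // character reference, which an XML parser accepts without a DTD.
    static string ReplaceNbsp(string html)
    {
        return html.Replace("&nbsp;", "&#160;");
    }

    static void Main()
    {
        // Assumed input; in a real app this string could come from
        // new System.Net.WebClient().DownloadString(url).
        string html = "<td>0.123456&nbsp;units</td>";

        // Parse straight from memory via StringReader; nothing touches disk.
        XmlTextReader reader = new XmlTextReader(new StringReader(ReplaceNbsp(html)));
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    // Show the non-breaking space as a plain space for display.
                    Console.WriteLine(reader.Value.Replace('\u00A0', ' '));
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A plain string replace only handles this one entity. Unclosed tags and valueless attributes still need a real HTML-to-XML converter such as SgmlReader or Tidy.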
--
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

Nov 11 '05 #4
