HTML Scraping using XmlTextReader

Greetings.

Just wondering if it is possible to use XmlTextReader to
read an HTML doc:

e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");

where test.xml contains the following:

<table cellspacing="1" cellpadding="1" width="100%">
<tr valign="top">
<td class="head" width="20%">test heading1</td>
<td class="head" width="10%">test heading2</td>
</tr>
<tr valign="top">
<td class="content" width="20%">content1</td>
<td class="content" width="10%">
<table cellspacing="0" width="100%">
<tr>
<td align="left">test</td>
<td nowarp align="right">
<nobr>0.123456</nobr>
</td>
</tr>
</table>
</td>
</tr>
</table>

It seems to work for the first few seconds, and then it
crashes my Windows app when XmlTextReader.Read() comes
across certain content. Is this to do with the
well-formedness (is there such a word??) of this HTML doc?
Also, is there a way to detect &nbsp; and convert it to
#1390 (I can't remember if this is right, but I mean the
equivalent special character) on the fly (i.e. without
saving the HTML to disk)?

Any thought will be appreciated.
Nov 11 '05 #1
Daniel wrote:
Just wondering if it is possible to use XmlTextReader to
read an HTML doc:

Not really, because HTML is not XML. Some HTML docs might be well-formed, so
they can be read by XmlTextReader, but in general a single <br> tag or an
&nbsp; entity (ubiquitous in HTML) will stop the reading.

Daniel wrote:
e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");

where test.xml contains the following:

<table cellspacing="1" cellpadding="1" width="100%">
<tr valign="top">
<td class="head" width="20%">test heading1</td>
<td class="head" width="10%">test heading2</td>
</tr>
<tr valign="top">
<td class="content" width="20%">content1</td>
<td class="content" width="10%">
<table cellspacing="0" width="100%">
<tr>
<td align="left">test</td>
<td nowarp align="right">


Watch nowrap: it is a so-called boolean attribute (an attribute written
without a value), and XML doesn't support that.

Try SgmlReader instead of XmlTextReader:
http://www.gotdotnet.com/Community/U...4-C3BD760564BC
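
The failure described above can be reproduced without a web server. The following is only a minimal sketch: the class name `WellFormednessProbe` and the method `IsWellFormedXml` are made up for illustration, and the markup is fed from a string rather than from the URL in the question. It reads the markup to the end with XmlTextReader and catches the XmlException that would otherwise take the application down.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original thread.
public static class WellFormednessProbe
{
    // Reads the markup to the end with XmlTextReader.
    // Returns true when it parses as well-formed XML; otherwise returns
    // false and puts the parser's message (with line and position) in error.
    public static bool IsWellFormedXml(string markup, out string error)
    {
        error = "";
        using (var reader = new XmlTextReader(new StringReader(markup)))
        {
            try
            {
                while (reader.Read()) { }
                return true;
            }
            catch (XmlException ex)
            {
                // An uncaught XmlException is what crashes the application.
                error = ex.Message;
                return false;
            }
        }
    }
}
```

With this sketch, `<td align="right">x</td>` should be accepted, while `<td nowrap align="right">x</td>` (an attribute without a value) and `<p>a<br>b</p>` (an unclosed tag) should be rejected with a message pointing at the offending position.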
--
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

Nov 11 '05 #2
Thanks Oleg,

The URL you provided looks very interesting, and judging
by the replies SgmlReader has received, people are
definitely finding it useful. I will download it and
have a play with it.

However, I do want to learn more about reading HTML using
XmlTextReader. Do you (or anybody out there) know of a
good URL to get me started?

Cheers.

Nov 11 '05 #3
Daniel wrote:
However, I do want to learn more about reading html using
the XmlTextReader. Do you (or anybody out there) know of a
good url to get me started?

Not really. It's just technically impossible to read HTML with XmlTextReader
without some sort of preprocessing of the HTML (i.e. converting the HTML to
XML or XHTML). Tidy is often used for that too. Google for "HTML Tidy".
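
For the narrower &nbsp; question, the numeric character reference for a non-breaking space is &#160; (U+00A0), which XML accepts without any entity declaration. The sketch below is only an illustration of in-memory preprocessing: it uses a plain string replacement as a stand-in for a real converter such as Tidy, the names `NbspPreprocessor`, `ReplaceNbsp` and `ReadText` are hypothetical, and it assumes the rest of the markup is already well-formed.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original thread.
public static class NbspPreprocessor
{
    // Swap the HTML-only named entity for the numeric reference XML understands.
    public static string ReplaceNbsp(string markup)
    {
        return markup.Replace("&nbsp;", "&#160;");
    }

    // Concatenate the text nodes of markup held in memory (nothing is saved to disk).
    // An XmlException still propagates if the remaining markup is not well-formed.
    public static string ReadText(string markup)
    {
        var text = new StringBuilder();
        using (var reader = new XmlTextReader(new StringReader(ReplaceNbsp(markup))))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }
}
```

The replacement is blind, so it would also rewrite &nbsp; inside comments or CDATA sections, and it does nothing about unclosed tags or valueless attributes such as nowrap; those still need a real HTML-to-XML conversion step.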
--
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

Nov 11 '05 #4
