HTML Scraping using XmlTextReader

Daniel

Greetings.

Just wondering if it is possible to use XmlTextReader to
read an HTML document:

e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");

where test.xml contains the following:

<table cellspacing="1" cellpadding="1" width="100%">
<tr valign="top">
<td class="head" width="20%">test heading1</td>
<td class="head" width="10%">test heading2</td>
</tr>
<tr valign="top">
<td class="content" width="20%">content1</td>
<td class="content" width="10%">
<table cellspacing="0" width="100%">
<tr>
<td align="left">test</td>
<td nowarp align="right">
<nobr>0.123456</nobr>
</td>
</tr>
</table>
</td>
</tr>
</table>

It seems to work for the first few seconds, and then it
crashes my Windows app when the XmlTextReader comes across
certain content during a call to XmlTextReader.Read(). Is
this to do with the well-formedness (is there such a word?) of
this HTML doc? Also, is there a way to detect &nbsp; and convert
it to #1390 (I can't remember if this is right, but I
am trying to say the equivalent special character) on the
fly (i.e. without saving the HTML to disk)?

Any thoughts would be appreciated.

Nov 11 '05 #1


Oleg Tkachenko

Daniel wrote:

Just wondering if it is possible to use XmlTextReader to
read off a html doc:

Not really, because HTML is not XML. Some HTML docs might be well-formed, so
they can be read by XmlTextReader, but in general a single <br> tag or an
&nbsp; entity, ubiquitous in HTML, will stop reading.

e.g. XmlTextReader tr = new XmlTextReader
("http://localhost/test.xml");

where test.xml contains the following:

<table cellspacing="1" cellpadding="1" width="100%">
<tr valign="top">
<td class="head" width="20%">test heading1</td>
<td class="head" width="10%">test heading2</td>
</tr>
<tr valign="top">
<td class="content" width="20%">content1</td>
<td class="content" width="10%">
<table cellspacing="0" width="100%">
<tr>
<td align="left">test</td>
<td nowarp align="right">

Watch nowrap: it's a so-called boolean attribute (an attribute written without a value), and XML doesn't support that.
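To see both problems in isolation, here is a minimal sketch that feeds small fragments to XmlTextReader and reports whether it gets through them. The class name WellFormednessCheck, the helper TryRead, and the sample fragments are made up for illustration; only XmlTextReader, StringReader, and XmlException come from the framework.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormednessCheck
{
    // Hypothetical helper: returns null if XmlTextReader reads the whole
    // fragment, otherwise the message of the XmlException that stopped it.
    public static string TryRead(string markup)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(markup));
        try
        {
            while (reader.Read()) { }   // walk every node until the end
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main()
    {
        string[] samples =
        {
            "<td align=\"left\">test</td>",        // well-formed
            "<p>line one<br>line two</p>",         // <br> is never closed
            "<td nowrap align=\"right\">x</td>"    // attribute without a value
        };

        foreach (string sample in samples)
        {
            string error = TryRead(sample);
            Console.WriteLine(sample);
            Console.WriteLine(error == null ? "  read to the end" : "  stopped: " + error);
        }
    }
}
```

Catching XmlException around the Read() loop is also what keeps the application from crashing when the reader hits markup like this.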

Try SGMLReader instead of XmlTextReader
http://www.gotdotnet.com/Community/U...4-C3BD760564BC
--
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

Nov 11 '05 #2

Daniel

Thanks Oleg,

The URL you provided looks very interesting. Judging by
the replies SgmlReader has received, people are
definitely finding it useful. I will definitely
download it and have a play with it.

However, I do want to learn more about reading html using
the XmlTextReader. Do you (or anybody out there) know of a
good url to get me started?

Cheers.


Nov 11 '05 #3

Oleg Tkachenko

Daniel wrote:

However, I do want to learn more about reading html using
the XmlTextReader. Do you (or anybody out there) know of a
good url to get me started?

Not really. It's just technically impossible to read HTML with XmlTextReader
without some sort of preprocessing of the HTML (i.e. converting the HTML to XML or
XHTML). HTML Tidy is often used for that too. Google for "HTML Tidy".
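As a rough illustration of what the smallest possible preprocessing step might look like, the sketch below rewrites &nbsp; to the numeric character reference &#160; (the non-breaking space) entirely in memory and then hands the result to XmlTextReader. It assumes the HTML has already been fetched into a string; the class EntityPreprocessor and its methods are hypothetical names. It handles only that one entity, so unclosed tags and boolean attributes would still need Tidy or SgmlReader.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class EntityPreprocessor
{
    // Replace the HTML-only entity with its numeric character reference,
    // which XML understands without any DTD.
    public static string ReplaceNbsp(string html)
    {
        return html.Replace("&nbsp;", "&#160;");
    }

    // Preprocess the markup in memory, then collect all text nodes.
    // Nothing is written to disk.
    public static string ReadText(string html)
    {
        StringBuilder text = new StringBuilder();
        XmlTextReader reader = new XmlTextReader(new StringReader(ReplaceNbsp(html)));
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Sample fragment; assumed to be otherwise well-formed.
        string html = "<td>content1&nbsp;<nobr>0.123456</nobr></td>";
        string text = ReadText(html);

        // Show the non-breaking space as '_' so it is visible on the console.
        Console.WriteLine(text.Replace('\u00A0', '_'));
    }
}
```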
--
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

Nov 11 '05 #4
