Reading attributes with no quotes using XmlTextReader

apiringmvp

All,

I am creating a function that gets a short blurb of HTML from a
blog. I would like to retain all HTML formatting and images. The code
below works well, with one exception.

My issue:
---------------------
When a blog's HTML has attributes with no quotes, I get an exception.

Here's the example of the blog I am dealing with.
Some text from the blog.

Questions:
----------------------
Is there a way to get the XmlTextReader to allow attributes without
quotes?

If not, would you use regular expressions to add the missing quotes?

If so, does anyone know a regular expression that could do this replacement?

Code:
----------------------
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static string GetContentShortBlurb(string content, int len)
{
    try
    {
        using (System.IO.MemoryStream ms = new System.IO.MemoryStream())
        {
            if (!content.TrimStart(' ', '\r', '\n').StartsWith("<"))
                content = "" + content + "";

            // Wrap the fragment in a single root element before parsing
            byte[] cb = System.Text.Encoding.UTF8.GetBytes("<doc>" + content + "</doc>");
            ms.Write(cb, 0, cb.Length);
            ms.Position = 0;

            // create Reader for parsing
            XmlTextReader xr = new XmlTextReader(ms);

            // Create Writer for output
            System.Text.StringBuilder sb = new System.Text.StringBuilder();
            XmlWriterSettings xws = new XmlWriterSettings();
            xws.ConformanceLevel = ConformanceLevel.Fragment;
            xws.Encoding = new System.Text.UTF8Encoding(false);
            XmlWriter xw = XmlWriter.Create(sb, xws);

            // Skip the opening <doc> wrapper
            xr.Read();

            int strCount = 0;   // characters of text written so far
            int nodesToEnd = 0; // elements opened but not yet closed
            while (strCount < len)
            {
                xr.Read();

                if (xr.NodeType == XmlNodeType.EndElement)
                {
                    if (xr.Name == "doc") break;

                    xw.WriteEndElement();
                    nodesToEnd--;
                }

                if (xr.NodeType == XmlNodeType.Element)
                {
                    xw.WriteStartElement(xr.Name);

                    nodesToEnd++;

                    // write attributes
                    while (xr.MoveToNextAttribute())
                    {
                        xw.WriteAttributeString(xr.Name, xr.Value);
                    }
                }

                if (xr.NodeType == XmlNodeType.Text)
                {
                    string inner = xr.Value;
                    if (inner.Length + strCount > len)
                    {
                        // Cut at the last space inside the remaining budget, then stop;
                        // if there is no space, Substring throws and the catch block takes over
                        inner = inner.Substring(0, inner.LastIndexOf(' ', len - strCount)) + " ...";
                        xw.WriteString(inner);
                        break;
                    }
                    xw.WriteString(inner);
                    strCount += inner.Length;
                }
            }

            // Close any elements that are still open
            for (int i = 0; i < nodesToEnd; i++)
                xw.WriteEndElement();

            xr.Close();
            xw.Close();
            return Regex.Replace(sb.ToString(), "<\\?xml\\b[^>]*>", "");
        }
    }
    catch (Exception)
    {
        // Just do the standard old string trim
        string stripHtmlEx = "</?([A-Z][A-Z0-9]*)\\b[^>]*>";
        // IgnoreCase so that lowercase tags are stripped as well
        string output = Regex.Replace(content, stripHtmlEx, "", RegexOptions.IgnoreCase);
        if (output.Length > len)
            output = "" + output.Substring(0, output.LastIndexOf(' ', len)).Replace("\r\n", "\r\n") + " ...";
        return output;
    }
}

Nov 28 '06 #1

Karl Seguin

Your problem, which you might already know, is that you are trying to use
an XmlTextReader to read non-XML content. XML strictly requires every
attribute value to be enclosed in quotes (single or double). HTML is based
on SGML, which has no such requirement. XHTML, on the other hand, is based
on XML, so you shouldn't have any problems with it.

All this to say that there probably isn't a way to make XmlTextReader work
without quotes. If it did, it wouldn't be an XML reader. Unfortunately, the
framework doesn't include an SgmlTextReader, which is really what you should
be using.

You could try to use regular expressions to turn your content into valid
XML, but I think you'll keep running into new issues with this: first it'll
be missing quotes, then missing closing tags...
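For the quoting step alone, a rough sketch might look like the following. The class and method names (AttributeQuoter, QuoteAttributes) are made up for illustration, and the patterns assume that attribute values never contain a '>' character. It only adds quotes; unclosed tags such as <img> or <br> and unescaped '&' characters would still make XmlTextReader throw, which is exactly the "new issues" problem described above.

```csharp
using System.Text.RegularExpressions;

public static class AttributeQuoter
{
    // One opening tag: '<', a tag name, then everything up to the next '>'.
    private static readonly Regex StartTag =
        new Regex(@"<[A-Za-z][A-Za-z0-9]*\b[^<>]*>");

    // One attribute inside a tag. Already-quoted values are matched whole so
    // their contents are skipped; group 2 is set only for an unquoted value.
    private static readonly Regex AttributePattern = new Regex(
        @"([A-Za-z_:][\w:.\-]*)\s*=\s*(?:""[^""]*""|'[^']*'|([^\s""'<>]+?)(?=\s|/?>))");

    public static string QuoteAttributes(string html)
    {
        return StartTag.Replace(html, tag =>
            AttributePattern.Replace(tag.Value, attr =>
                attr.Groups[2].Success
                    // Unquoted value: rewrite as name="value"
                    ? attr.Groups[1].Value + "=\"" + attr.Groups[2].Value + "\""
                    // Quoted value: leave untouched
                    : attr.Value));
    }
}
```

Calling AttributeQuoter.QuoteAttributes("<img src=pic.jpg width=100>") gives <img src="pic.jpg" width="100">, which is quoted but still not well-formed XML because the img element is never closed.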

Using a regular expression or even just string manipulation (index of and
substrings) is probably the right way to go...

Karl
--
http://www.openmymind.net/
http://www.fuelindustries.com/

Nov 28 '06 #2

John Timney (MVP)

You're stuck with string manipulation, and it's not likely to be the
easiest task.

I have to ask: if it's from a blog, why can't you syndicate the RSS feed
and consume that?
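As a rough illustration of the RSS route, the sketch below pulls the description of each item out of an RSS 2.0 document. It assumes the blog publishes plain RSS 2.0 (no namespaces on item or description); the FeedBlurbs class and ReadDescriptions method are hypothetical names. The feed itself is well-formed XML, so it parses without trouble, but the description text can still contain HTML with unquoted attributes, so this replaces the download step rather than the truncation step.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class FeedBlurbs
{
    // Returns the decoded <description> text of every <item> in an RSS 2.0 document.
    public static List<string> ReadDescriptions(string rssXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(rssXml);

        List<string> descriptions = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("/rss/channel/item/description"))
        {
            // InnerText undoes entity escaping and CDATA wrapping
            descriptions.Add(node.InnerText);
        }
        return descriptions;
    }
}
```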

--
Regards

John Timney (MVP)
VISIT MY WEBSITE:
http://www.johntimney.com
http://www.johntimney.com/blog

Nov 28 '06 #3

Rad [Visual C# MVP]

You are going to run into very serious problems using an XmlTextReader
to operate on HTML. HTML is almost always NOT valid XML.

You'd be better off using regular expressions to manipulate the text.
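One way the regular-expression approach could look is sketched below: split the HTML into tags and text runs, copy tags through untouched (so unquoted attributes are never parsed), count only text characters against the limit, and close whatever is still open at the cut. The HtmlBlurb class, the Truncate method and the short list of void tags are assumptions for this sketch, and it will be confused by a '>' inside an attribute value or a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlBlurb
{
    // A token is either a whole tag or a run of text between tags.
    private static readonly Regex Token = new Regex(@"<[^>]+>|[^<]+");
    private static readonly Regex TagName = new Regex(@"^</?\s*([A-Za-z][A-Za-z0-9]*)");

    // Tags that never take a closing tag (deliberately incomplete list).
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    public static string Truncate(string html, int maxTextLength)
    {
        StringBuilder output = new StringBuilder();
        Stack<string> open = new Stack<string>();
        int textLength = 0;

        foreach (Match m in Token.Matches(html))
        {
            string token = m.Value;
            if (token[0] == '<')
            {
                // Tags are copied verbatim and do not count toward the limit
                output.Append(token);
                Match name = TagName.Match(token);
                if (!name.Success) continue; // comment, doctype, etc.

                if (token[1] == '/')
                {
                    if (open.Count > 0) open.Pop();
                }
                else if (!token.EndsWith("/>", StringComparison.Ordinal)
                         && !VoidTags.Contains(name.Groups[1].Value))
                {
                    open.Push(name.Groups[1].Value);
                }
                continue;
            }

            int remaining = maxTextLength - textLength;
            if (token.Length <= remaining)
            {
                output.Append(token);
                textLength += token.Length;
                continue;
            }

            // Cut at the last space within the budget, or hard-cut if there is none
            int cut = token.LastIndexOf(' ', remaining);
            output.Append(token.Substring(0, cut < 0 ? remaining : cut)).Append(" ...");
            break;
        }

        // Close every element that is still open, innermost first
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');

        return output.ToString();
    }
}
```

For example, HtmlBlurb.Truncate("<p class=x>Hello <b>big wide</b> world</p>", 12) returns <p class=x>Hello <b>big ...</b></p>, with the unquoted attribute passed through as-is.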

--

Bits.Bytes.
http://bytes.thinkersroom.com

Nov 28 '06 #4
