reading attributes with no quotes using XmlTextReader

All,

I am writing a function that extracts a short blurb of HTML from a
blog post. I would like to retain all HTML formatting and images. The code
below works well, with one exception.

My issue:
---------------------
When a blog's HTML has attributes without quotes, I get an exception.

Here's an example of the blog HTML I am dealing with:
<p align=center>Some text from the blog.</p>

Questions:
----------------------
Is there a way to get the XmlTextReader to allow attributes without
quotes?

If not, would you use a regex to add the missing quotes?

If so, does anyone know a regex that could do this replace?
Code:
----------------------
// Requires: using System; using System.Text.RegularExpressions; using System.Xml;
public static string GetContentShortBlurb(string content, int len)
{
    try
    {
        using (System.IO.MemoryStream ms = new System.IO.MemoryStream())
        {
            // Wrap bare text in a paragraph so the fragment starts with an element
            if (!content.TrimStart(' ', '\r', '\n').StartsWith("<"))
                content = "<p>" + content + "</p>";

            // Wrap everything in a single root element so the reader sees one document
            byte[] cb = System.Text.Encoding.UTF8.GetBytes("<doc>" + content + "</doc>");
            ms.Write(cb, 0, cb.Length);
            ms.Position = 0;

            // Create Reader for parsing
            XmlTextReader xr = new XmlTextReader(ms);

            // Create Writer for output
            System.Text.StringBuilder sb = new System.Text.StringBuilder();
            XmlWriterSettings xws = new XmlWriterSettings();
            xws.ConformanceLevel = ConformanceLevel.Fragment;
            xws.Encoding = new System.Text.UTF8Encoding(false);
            XmlWriter xw = XmlWriter.Create(sb, xws);

            // Skip the <doc> wrapper element
            xr.Read();

            int strCount = 0;   // characters of text written so far
            int nodesToEnd = 0; // elements opened but not yet closed
            while (strCount < len)
            {
                xr.Read();

                if (xr.NodeType == XmlNodeType.EndElement)
                {
                    if (xr.Name == "doc") break;

                    xw.WriteEndElement();
                    nodesToEnd--;
                }

                if (xr.NodeType == XmlNodeType.Element)
                {
                    // Check before moving to the attributes: empty elements
                    // such as <img ... /> never produce an EndElement node
                    bool isEmpty = xr.IsEmptyElement;

                    xw.WriteStartElement(xr.Name);

                    // write attributes
                    while (xr.MoveToNextAttribute())
                    {
                        xw.WriteAttributeString(xr.Name, xr.Value);
                    }

                    if (isEmpty)
                        xw.WriteEndElement();
                    else
                        nodesToEnd++;
                }

                if (xr.NodeType == XmlNodeType.Text)
                {
                    string inner = xr.Value;
                    if (inner.Length + strCount > len)
                    {
                        // Cut at the last space that fits; fall back to a hard cut
                        int cut = inner.LastIndexOf(' ', len - strCount);
                        if (cut < 0) cut = len - strCount;
                        inner = inner.Substring(0, cut) + " ...";
                    }
                    xw.WriteString(inner);
                    strCount += inner.Length;
                }
            }

            // Close any elements still open at the cut-off point
            for (int i = 0; i < nodesToEnd; i++)
                xw.WriteEndElement();

            xr.Close();
            xw.Close();
            return Regex.Replace(sb.ToString(), "<\\?xml\\b[^>]*>", "");
        }
    }
    catch (Exception)
    {
        // Just do the standard old string trim
        string stripHtmlEx = "</?([A-Z][A-Z0-9]*)\\b[^>]*>";
        string output = Regex.Replace(content, stripHtmlEx, "", RegexOptions.IgnoreCase);
        if (output.Length > len)
        {
            int cut = output.LastIndexOf(' ', len);
            if (cut < 0) cut = len;
            output = "<p>" + output.Substring(0, cut).Replace("\r\n", "</p>\r\n<p>") + " ...</p>";
        }
        return output;
    }
}

Nov 28 '06 #1
Your problem, which you might already know, is that you are trying to use
an XML text reader to read non-XML content. XML strictly requires every
attribute value to be enclosed in quotes (single or double). HTML is based
on SGML, which has no such requirement. XHTML, on the other hand, is based
on XML, so you shouldn't have any problems with it.
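
To see the quoting rule in action, here is a minimal sketch. The class and method names (QuoteDemo, ParsesAsXml) are made up for illustration; the only framework calls are XmlReader.Create and Read. It wraps a fragment in a <doc> root, as the original code does, and reports whether the reader gets through it.

```csharp
using System;
using System.IO;
using System.Xml;

public static class QuoteDemo
{
    // Returns true when the fragment, wrapped in a <doc> root element,
    // can be read to the end; false when the reader rejects it.
    public static bool ParsesAsXml(string fragment)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader("<doc>" + fragment + "</doc>")))
            {
                // Reading every node forces the parser over the whole input
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

ParsesAsXml("<p align=center>text</p>") returns false, while the same fragment with align="center" or align='center' returns true.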

All this to say that there probably isn't a way to make XmlTextReader work
without quotes; if it did, it wouldn't be an XML reader. Unfortunately,
there isn't an SgmlTextReader, which is really what you should be using.

You could try to use regular expressions to turn your content into valid
XML, but I think you'll keep running into new issues with this: first it'll
be missing quotes, then missing closing tags, and so on.

Working on the text directly with a regular expression, or even just string
manipulation (IndexOf and Substring), is probably the right way to go.
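
For the narrow case asked about (adding quotes around bare attribute values), a regex pass could look roughly like the sketch below. AttributeQuoter and QuoteBareAttributes are hypothetical names, and the patterns are an assumption about what "typical" blog HTML looks like, not a general HTML parser: attribute values that themselves contain '<' or '>' are not handled, and missing closing tags are left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeQuoter
{
    // A start tag: '<', a letter, then anything up to the next '>'.
    // Assumes no '<' or '>' appears inside an attribute value.
    private static readonly Regex StartTag = new Regex(@"<[A-Za-z][^<>]*>");

    // name=value where value is double-quoted, single-quoted, or bare.
    // Quoted values are matched so they get skipped over as a unit;
    // the "bare" group only succeeds for the unquoted form.
    private static readonly Regex Attribute = new Regex(
        @"(?<name>[\w:.-]+)\s*=\s*(?:""[^""]*""|'[^']*'|(?<bare>[^\s""'=<>]+?)(?=\s|/?>))");

    public static string QuoteBareAttributes(string html)
    {
        // Only rewrite text inside start tags, never the text between tags
        return StartTag.Replace(html, tag =>
            Attribute.Replace(tag.Value, attr =>
                attr.Groups["bare"].Success
                    ? attr.Groups["name"].Value + "=\"" + attr.Groups["bare"].Value + "\""
                    : attr.Value));
    }
}
```

With the sample from the question, QuoteBareAttributes("<p align=center>Some text from the blog.</p>") gives <p align="center">Some text from the blog.</p>. Attributes that are already quoted are left alone.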

Karl
--
http://www.openmymind.net/
http://www.fuelindustries.com/
Nov 28 '06 #2
You're stuck with string manipulation, and it's not likely to be the
easiest task.

I have to ask: if it's from a blog, why can't you syndicate the RSS feed and
consume that instead?
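
As a rough sketch of the RSS route, assuming the blog publishes a plain RSS 2.0 feed with the usual rss/channel/item/description layout (RssBlurbs and GetDescriptions are made-up names for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RssBlurbs
{
    // Returns the <description> text of every <item> in an RSS 2.0 document.
    public static List<string> GetDescriptions(string rssXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(rssXml);

        List<string> descriptions = new List<string>();
        foreach (XmlNode item in doc.SelectNodes("/rss/channel/item"))
        {
            XmlNode description = item.SelectSingleNode("description");
            if (description != null)
                descriptions.Add(description.InnerText); // entities are already decoded here
        }
        return descriptions;
    }
}
```

The feed itself is well-formed XML, so it loads without trouble. The description payload is usually escaped HTML, though, so whatever comes out may still contain unquoted attributes and need the same care before it goes anywhere near an XML reader.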

--
--
Regards

John Timney (MVP)
VISIT MY WEBSITE:
http://www.johntimney.com
http://www.johntimney.com/blog

Nov 28 '06 #3
You are going to run into very serious problems using an XmlTextReader
to operate on HTML. HTML is almost never valid XML.

You'd be better off using regular expressions to manipulate the text.
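
A minimal sketch of the pure-regex approach, along the lines of the fallback in the original catch block: strip the tags, then cut at the last space that fits. PlainBlurb and Shorten are hypothetical names, and maxLength is assumed to be positive. Note that this throws the formatting and images away, so it only helps if a plain-text blurb is acceptable.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainBlurb
{
    // Anything that looks like an opening or closing tag
    private static readonly Regex Tag = new Regex(@"</?[A-Za-z][^>]*>");

    public static string Shorten(string html, int maxLength)
    {
        string text = Tag.Replace(html, "");
        if (text.Length <= maxLength)
            return text;

        // Search backwards from maxLength for a space so no word is split;
        // fall back to a hard cut when there is no usable space
        int cut = text.LastIndexOf(' ', maxLength);
        if (cut <= 0) cut = maxLength;
        return text.Substring(0, cut) + " ...";
    }
}
```

For example, Shorten("<p align=center>Some text from the blog.</p>", 12) returns "Some text ...".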

--

Bits.Bytes.
http://bytes.thinkersroom.com
Nov 28 '06 #4
