Valid Characters

I'm trying to ensure that all the characters in my XML document are
characters specified in this document:
http://www.w3.org/TR/2000/REC-xml-20001006#charsets

Would a function like this work:

private static string formatXMLString(string n)
{
    if (string.IsNullOrEmpty(n)) return n;
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    char[] chrs = n.ToCharArray();
    char c;
    int x, j = chrs.Length;
    for (x = 0; x < j; x++)
    {
        c = chrs[x];
        if (c == 0x9 || c == 0xA || c == 0xD ||
            (c > 0x20 && c < 0xd7ff) ||
            (c > 0xe000 && c < 0xffd) ||
            (c > 0x10000 && c < 0x10ffff))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

I've never compared characters like this before (0x9, 0xffd, etc.).
I'm not trying to be lazy and avoid testing it myself; I just don't know if
this type of character comparison is the correct logic for the results I'm
looking for.

Any input?
Nov 30 '06 #1
"preport" <pr*****@newsgroups.nospam> wrote in message
news:#C**************@TK2MSFTNGP02.phx.gbl...

<snip>

> Any input?

Sure. Don't be lazy.

And a char is a 2-byte type, so your literals should all be 2-byte literals,
and should be cast to char for comparison.

eg
char space = (char)0x0020;

David

Dec 1 '06 #2
Hello,

The data type "char" in C# is a 16-bit Unicode character, and its range
is from U+0000 to U+ffff. Therefore, the following line may not be
necessary in your code:

(c > 0x10000 && c < 0x10ffff))

It is beyond the range of a C# char, and we won't get such a value in a
C# application.

When strings or files are loaded into an XmlDocument, the characters are
validated and an exception is thrown if there are any invalid characters.
Your function checks the string before this happens. I think this is a good
approach, since you control the validation. However, is it possible that some
data will be lost if you just remove the invalid characters? How about
throwing an exception instead?

Sincerely,

Luke Zhang

Microsoft Online Community Support

Dec 1 '06 #3
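The "throw instead of strip" idea from #3 could be sketched roughly as below. This assumes a framework version that has XmlConvert.VerifyXmlChars (.NET Framework 4.0 or later, so newer than this thread); the class and method names XmlCharGuard and EnsureValidXmlChars are made up for illustration.

```csharp
using System.Xml;

public static class XmlCharGuard
{
    // Returns the input unchanged when every character (including surrogate
    // pairs) is a legal XML 1.0 character; otherwise XmlException is thrown
    // by XmlConvert.VerifyXmlChars.
    public static string EnsureValidXmlChars(string input)
    {
        // VerifyXmlChars rejects null, so null/empty is passed through here.
        if (string.IsNullOrEmpty(input)) return input;
        return XmlConvert.VerifyXmlChars(input);
    }
}
```

Calling EnsureValidXmlChars on each string while populating the object would surface bad database values on the server rather than on the clients; whether that is preferable to silently stripping depends on how much you trust the data.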
> The data type "char" in C# is a 16-bit Unicode character, and its range
> is from U+0000 to U+ffff. Therefore, the following line may not be
> necessary in your code:
>
> (c > 0x10000 && c < 0x10ffff))
>
> It is beyond the range of a C# char, and we won't get such a value in a
> C# application.

C# uses UTF-16, so it can cover the whole Unicode range (up to U+10FFFF)
using surrogates.

See http://www.unicode.org/faq/utf_bom.html#UTF16
and http://mailman.ic.ac.uk/pipermail/xm...er/014933.html

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Dec 1 '06 #4


"Mihai N." <nm**************@yahoo.com> wrote in message
news:Xn********************@207.46.248.16...
>> The data type "char" in C# is a 16-bit Unicode character, and its range
>> is from U+0000 to U+ffff. Therefore, the following line may not be
>> necessary in your code:
>>
>> (c > 0x10000 && c < 0x10ffff))
>>
>> It is beyond the range of a C# char, and we won't get such a value in a
>> C# application.
>
> C# uses UTF-16, so it can cover the whole Unicode range (up to U+10FFFF)
> using surrogates.
>
> See http://www.unicode.org/faq/utf_bom.html#UTF16
> and http://mailman.ic.ac.uk/pipermail/xm...er/014933.html

Yes, but a surrogate pair will occupy two chars. A char is not a Unicode
character; it's an unsigned 16-bit integer, and the range of a char is not
U+0000 to U+ffff, it's 0x0000 to 0xffff.

David

Dec 1 '06 #5
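To make the point in #4 and #5 concrete, here is a small sketch that counts 16-bit char values versus Unicode code points in a string. The class and method names (SurrogateDemo, CodeUnitCount, CodePointCount) are invented for this example; char.IsSurrogatePair and char.ConvertFromUtf32 are standard System.Char members.

```csharp
using System;

public static class SurrogateDemo
{
    // Number of 16-bit char values (UTF-16 code units) in the string.
    public static int CodeUnitCount(string s)
    {
        return s.Length;
    }

    // Number of Unicode code points: each well-formed surrogate pair
    // counts once, every other char counts once.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++; // skip the low half of the pair
            count++;
        }
        return count;
    }
}
```

For example, char.ConvertFromUtf32(0x1D11E) (a character above U+FFFF) yields a string for which CodeUnitCount is 2 and CodePointCount is 1. No single char in that string is ever greater than 0xFFFF, which is why the last condition in the original function can never be true.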
My problem is that I have a simple object that I return from a webservice.
Sometimes this object gets populated with bad data (from a database), but it
gets serialized just fine (no exception) and sent down to the clients.

The clients error out because of the bad data though. So I thought as I
populate these objects I can strip out all the "illegal characters". I'm
not worried about losing data, because it shouldn't be there in the first
place.

So, except for the last condition (0x10000..), do you think this is a valid
option? Again, I don't know enough about these character ranges and I don't
want to accidentally strip out "random" valid characters.

Thanks for your input so far.
Dec 1 '06 #6
I've tested this function in our system and it did exactly what I was
afraid of. It solved the "clients blowing up" problem, but it is stripping
out all the spaces.

Why is this? I know this sounds stupid... because it is... but what the hell
am I stripping out? What are these character ranges? I'm familiar with the
ASCII tables and am used to checking against normal integers, but what the
heck are these character ranges? Can someone point me to a reference where
I can educate myself a little?
Dec 1 '06 #7
preport <pr*****@newsgroups.nospam> wrote:
> I've tested this function in our system and it did exactly what I was
> afraid of. It solved the "clients blowing up" problem, but it is stripping
> out all the spaces.
>
> Why is this? I know this sounds stupid... because it is... but what the hell
> am I stripping out? What are these character ranges? I'm familiar with the
> ASCII tables and am used to checking against normal integers, but what the
> heck are these character ranges? Can someone point me to a reference where
> I can educate myself a little?

You're stripping out spaces because of this condition:

(c > 0x20 && c < 0xd7ff)

That (and all the other ranges) should use inclusive comparisons, not
exclusive:

(c >= 0x20 && c <= 0xd7ff)

Note that the condition (c > 0xe000 && c < 0xffd) should use 0xfffd as
the top part, not 0xffd.

As for what you're stripping out:
1) some "control" characters (U+0000 to U+001F except tab, carriage
return and line feed)
2) U+FFFE (a noncharacter: the byte-swapped form of the byte order
mark, U+FEFF)
3) U+FFFF (also a noncharacter)
4) The surrogate block (U+D800 to U+DFFF, used for representing
characters U+10000-U+10FFFF)

However, if you want to be able to represent characters >= U+10000,
you'll need to only strip "rogue" characters from the surrogate block,
leaving "valid" surrogate pairs alone. This unfortunately makes the
code significantly messier - if you don't care about representing those
characters, you could leave the code stripping anything from the
surrogate block: do document this omission though!

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 1 '06 #8
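Putting the corrections from #8 together, one possible version of the function looks like the sketch below: inclusive comparisons, 0xFFFD as the upper bound, and well-formed surrogate pairs kept while lone ("rogue") surrogates are dropped. The names XmlCharFilter and StripInvalidXmlChars are placeholders, and this is only one way to handle the surrogate case, not necessarily what anyone in the thread ended up using.

```csharp
using System.Text;

public static class XmlCharFilter
{
    // Returns a copy of the input containing only characters allowed by the
    // XML 1.0 Char production. Valid surrogate pairs are kept; lone
    // surrogates are removed.
    public static string StripInvalidXmlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        StringBuilder sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == 0x9 || c == 0xA || c == 0xD ||
                (c >= 0x20 && c <= 0xD7FF) ||
                (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            else if (char.IsHighSurrogate(c) &&
                     i + 1 < input.Length &&
                     char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed pair encodes U+10000..U+10FFFF: keep both halves.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            // Everything else (other control characters, U+FFFE, U+FFFF,
            // lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

With this version, StripInvalidXmlChars("a b\u0001c") returns "a bc": the space survives and only the control character is removed.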
"David Browne" <davidbaxterbrowne no potted me**@hotmail.com> wrote:

<snip>

> Sure. Don't be lazy.
>
> And a char is a 2-byte type, so your literals should all be 2-byte literals,
> and should be cast to char for comparison.
>
> eg
> char space = (char)0x0020;

While I agree that the "high order" comparisons are invalid, where do
you see the benefit in converting to char for comparison?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 1 '06 #9
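On the cast question raised in #2 and #9, a small sketch may help. In C#, a char is implicitly widened to int when compared with an int literal, so both forms below give the same result for in-range values. The class and method names (CharCompareDemo and friends) are invented for illustration only.

```csharp
public static class CharCompareDemo
{
    // Compares against an int literal; c is implicitly converted to int.
    public static bool IsSpaceByInt(char c)
    {
        return c == 0x20;
    }

    // Compares against an explicitly cast char value.
    public static bool IsSpaceByChar(char c)
    {
        return c == (char)0x0020;
    }

    // The "high order" check: false for every possible char, because a
    // char widened to int is never greater than 0xFFFF.
    public static bool IsAboveBmp(char c)
    {
        int code = c;
        return code >= 0x10000 && code <= 0x10FFFF;
    }
}
```

IsSpaceByInt and IsSpaceByChar agree for every input, so the cast is a matter of style here; IsAboveBmp shows why the 0x10000 to 0x10ffff condition in the original function is dead code.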
OK, let me redefine my question...
I get the 0x9, 0xa, 0xd, and 0x20, but what are:

0xd7ff and
the range 0xe000 to 0xffd?

I don't understand what those characters are.

Dec 1 '06 #10
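For the question in #10: 0xD7FF is the last value before the surrogate block (0xD800 to 0xDFFF), and 0xE000 to 0xFFFD is the rest of the 16-bit range above that block, leaving out U+FFFE and U+FFFF (0xffd in the original code is the typo for 0xfffd noted in #8). A rough sketch that labels where a single char falls, with the class name, method name and label strings all made up for illustration:

```csharp
public static class XmlCharRanges
{
    // Describes where one 16-bit char value falls relative to the
    // XML 1.0 Char ranges.
    public static string Describe(char c)
    {
        if (c == 0x9 || c == 0xA || c == 0xD) return "allowed whitespace control";
        if (c < 0x20) return "disallowed control character";
        if (c <= 0xD7FF) return "allowed";
        if (c <= 0xDFFF) return "surrogate code unit";
        if (c <= 0xFFFD) return "allowed";
        return "noncharacter (U+FFFE or U+FFFF)";
    }
}
```

So Describe('\uD7FF') and Describe('\uE000') both return "allowed", while everything between them (for example '\uD800') returns "surrogate code unit". The gap in the ranges exists because those values are never characters on their own, only halves of a pair.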
