HTMLEncode: low surrogate char Error

>>Thanks for the response....

>>
I don't even know how I got on that post, but I have been
contributing for a while. I responded to it probably because it
came up as a search result for something else and the question was
still unanswered.

Right.

>>As per my post, it probably has nothing to do with the data. If
the user inserts binary data using the Windows code page into a
database, non-standard UTF chars will throw this exception when
written with a StreamWriter.

>But that's the point - using arbitrary binary data as if it were
real text data is *wrong*. The data is effectively "dodgy" - just as
if you'd tried to edit a jpeg as if it were a text file.

Point taken, but this is not the case. If a person writes a text
file on her or his computer and does not save it as Unicode, the
current code page is used. If this file is given to someone with a
different current code page, the file is not displayed correctly. Simply
converting the file to Unicode will make the data display properly.
During the encoding process the encoder will escape incorrect
characters instead of attempting to interpret them. During the
encode/decode process you may see conversions like Ãœ = Ü, â„¢ =
™, Ã = Á, Î© = Ω, etc. Eventually, characters from the default
Windows code page that are not valid UTF will throw an error. By
passing a System.Text.Encoding to the StreamWriter, you will avoid
throwing the exception.
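
Pairs like Ãœ = Ü are what appears when UTF-8 bytes are decoded with the Windows-1252 code page. Here is a minimal sketch of that round trip. It is written in Python rather than .NET, purely as an illustration, and the helper names `mojibake` and `repair` are made up for this example:

```python
def mojibake(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo mojibake(): recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


examples = {ch: mojibake(ch) for ch in ("Ü", "™", "Ω")}
# {'Ü': 'Ãœ', '™': 'â„¢', 'Ω': 'Î©'}

restored = repair(mojibake("Ü™Ω"))  # back to 'Ü™Ω'
```

Á is left out of the sample because the second byte of its UTF-8 form (0x81) has no mapping in Python's cp1252 codec, so the wrong decode raises an error instead of producing a visible character.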

Additionally, that data could also be URL-encoded, e.g. %20 = space. The
percent sign means that the next two characters are the hexadecimal value
of the character: hex 20 is decimal 32, i.e. Chr(32). Injection attackers
will use %00 for null-byte injection attacks, or %0A and %0D for Chr(10)
and Chr(13), etc.
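
To make the percent-encoding arithmetic concrete, here is a small Python sketch (again an illustration, not part of the .NET code in this thread) using the standard library's `urllib.parse`:

```python
from urllib.parse import quote, unquote

space = unquote("%20")        # hex 20 = decimal 32, the space character
crlf = unquote("%0D%0A")      # hex 0D = 13 (CR), hex 0A = 10 (LF)
nul = unquote("%00")          # the NUL character used in null-byte injection

encoded = quote("a b<c>")     # "a%20b%3Cc%3E"
```

Decoding is mechanical, so any value that arrives percent-encoded should be decoded once and then validated before it is used.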

Considering all of the above, there are plenty of cases where you will
have data that is clean but is represented by different characters in
different encodings. Different operating systems have different newline
definitions. While Windows uses CRLF (carriage return plus line
feed), UNIX uses only LF. Additionally, you may see some encoders
convert <BR> to line feeds and vice versa.
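
As an illustration of the newline point, a Python sketch that normalizes Windows (CRLF), classic Mac (CR) and UNIX (LF) line endings to LF. The function name is made up for this example:

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so the CR half is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "windows\r\nold mac\runix\n"
normalized = normalize_newlines(mixed)   # "windows\nold mac\nunix\n"
```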

To reproduce this issue....

Copy this into a text file in a Visual Studio project and save it as
"Read_Me.txt".

==========Begin Read_Me.txt

1) Create New Web project and copy the entire contents of this folder
into the projects root folder. Select yes to all prompts.

2) Browse to the Cms folder, right-click and choose Exclude from
Project. Right-click the solution and choose "Add Existing Project".
Browse to the Cms folder and choose CMS.vbproj, then add a reference
to the CMS project to your Web project.

3) Add a reference to the freeTextBox.dll in the /framework1.1 folder.

4) Browse to /admin/install.aspx, right-click and choose View in
Browser. Follow the setup instructions.
============end Read_Me.txt

Now right-click the file and choose Properties, then select Build
Action and choose Embedded Resource. Create a new class named
Resources.vb and add this code.

Imports System.IO
Imports System.Reflection
Imports System.Xml

Public Class Resources

    Dim _textStreamReader As StreamReader
    Dim _assembly As [Assembly]

    Sub New()
    End Sub

    ' Returns the text of an embedded resource, decoded with
    ' StreamReader's default encoding (UTF-8).
    Function GetResource(ByVal ResourceName As String) As String

        _assembly = [Assembly].GetExecutingAssembly()
        If _assembly Is Nothing Then
            Throw New Exception("assembly is nothing")
        End If

        ' "AssemblyName." is a placeholder for the resource name prefix
        ' (typically the project's root namespace).
        Dim stream As IO.Stream = _
            _assembly.GetManifestResourceStream("AssemblyName." & ResourceName)

        If stream Is Nothing Then
            Throw New Exception("stream is nothing")
        End If

        _textStreamReader = New StreamReader(stream)
        Return Me._textStreamReader.ReadToEnd
    End Function

End Class

Now open a web page and, in the Page_Load sub, add the following code:

Dim resources As New Resources
Dim code As String = Nothing

Try
    code = resources.GetResource(ResourceName)
Catch ex As Exception
    ' log, LogFile, ResourceName and FileName are assumed to be
    ' defined elsewhere in the page.
    log("Resource : " & ResourceName & " is nothing", LogFile)
End Try

If Not code Is Nothing Then
    Dim Sw As New IO.StreamWriter(FileName, False)
    Sw.Write(code)
    Sw.Close()
End If

When you execute this code the surrogate error is thrown. Why? Because
the text file was embedded using the Windows code page. The fix:

If Not code Is Nothing Then
    ' Write with Windows-1252 instead of the default UTF-8 encoding.
    Dim Sw As New IO.StreamWriter(FileName, False, _
        System.Text.Encoding.GetEncoding(1252))
    Sw.Write(code)
    Sw.Close()
End If

Clearly you'll see the data is written to the text file in its
original format, with no funky characters and no data corruption.
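
Whichever diagnosis is right in this thread, the mechanics of naming the encoding explicitly can be sketched as follows. This is Python, not the VB.NET above. The sample text, the temp file and the helper names are assumptions made for the example. It shows the read side: a file saved in Windows-1252 only comes back intact when it is decoded as Windows-1252.

```python
import os
import tempfile

SAMPLE = "Café ™ €"  # every character here exists in Windows-1252


def write_bytes(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)


def read_text(path: str, encoding: str) -> str:
    # newline="" keeps line endings exactly as stored in the file
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "read_me.txt")
    write_bytes(path, SAMPLE.encode("cp1252"))  # file saved in the Windows code page

    try:
        read_text(path, "utf-8")                # wrong guess about the encoding
        decoded_as_utf8 = True
    except UnicodeDecodeError:
        decoded_as_utf8 = False

    roundtrip = read_text(path, "cp1252")       # matching encoding: text is intact
```

Once the text has been decoded correctly it can be written back out in any encoding that covers its characters, UTF-8 included.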

Hope this helps give you a better understanding of the process.

Alex Higgins
http://alexanderhiggins.com

>Any time that you've read in text data with the wrong encoding, your
string has the wrong data in it, and therefore the data is dodgy.

Do you see what I mean?

Jon

--------------------------------------------------------------------------------
Subject: Re: HTMLEncode: low surrogate char Error?
Date: Fri, 27 Jul 2007 19:03:52 +0100

Hello,

I'm using C# to write an HTML-based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function
to prevent my HTML from acting funny. Today the following error popped
up: "An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."
Any ideas? Is there any way to do an HtmlEncode with UTF-8?
Here is the affected code...
bResult = CommonUtil.EncodeForHTML(strKeywords, ref strConvert);
if (bResult) strKeywords = strConvert;
if (strKeywords.Length > 1)
{
    strDetail += "<TR><TH><DIV class=HF> Keywords </DIV></TH>\r\n";
    strDetail += "<TD colspan=7><DIV class=DF>" + strKeywords +
        "</DIV></TD></TR>\r\n";
}
fReport.WriteLine(strDetail); // <<< WHERE ERROR OCCURS

public static bool EncodeForHTML(string strIn, ref string strOut)
{
    try
    {
        if (strIn.Length < 1) return false;
        strOut = HttpUtility.HtmlEncode(strIn);
        return true;
    }
    catch
    {
        return false;
    }
}
Thank you,
Marta
Marta Pia

>I'm using C# to write an HTML-based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
prevent my HTML from acting funny. Today the following error popped
up: "An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."

If you're getting an exception like that, it suggests you've got some
very dodgy data to start with. Have you examined it to look at the
character being complained about?
Marta Pia <clio...@hotmail.com> wrote:

>>If you're getting an exception like that, it suggests you've got some
very dodgy data to start with. Have you examined it to look at the
character being complained about?

>Oh yes, the characters are dodgy. I am trying to work out which one
actually tripped up the WriteLine/encode. I might need to strip all
non-printing characters out of the string before writing it to the
file (although, previous to this one, the presence of non-printing
characters didn't cause an exception). Is there a .NET function to
strip out non-printing characters or should I write a function to go
through the string character by character?

Well, you could do that. I would think the first port of call should
be working out how you got dodgy data to start with though.
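
If stripping really is the chosen route, a character-by-character pass is short. The sketch below is Python, offered as an analogy rather than a .NET recipe: a Python string holds code points, not UTF-16 code units, so any surrogate code point found in it is unencodable by definition. The function name and the choice of U+FFFD as the replacement are assumptions for the example.

```python
def replace_lone_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with `replacement`."""
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


cleaned = replace_lone_surrogates("ab\udc00c")   # "ab\ufffdc"
encoded = cleaned.encode("utf-8")                # no longer raises
```

This hides the symptom only. The replaced character is lost, which is why tracing where the bad data came from is still the better first step.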

>That aside, why does the character save into a string and encode
without error, but when I try to write it, it fails... ?

Chars are just 16-bit numbers, and a lot of routines will just treat
them as such, whether they're surrogates or not. I suspect that when
the string is written out, it is the process of encoding it to a
byte array for transmission over the wire that notices the problem.
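
The same "fine in memory, fails on encode" behaviour can be reproduced in Python, used here only as an analogy for the .NET case. The sample string is made up, and `html.escape` stands in for HtmlEncode:

```python
import html

bad = "keywords \udc00 more"          # contains a lone low surrogate

length = len(bad)                     # fine: just counts the characters
escaped = html.escape(bad + " <b>")   # fine: only &, <, > and quotes are rewritten

try:
    bad.encode("utf-8")               # the encoder is the first step that validates
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```

String-level operations never look at whether the surrogates pair up. The first component that has to turn the string into bytes is the one that complains.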

Jul 27 '07 #1


Jon Skeet [C# MVP]

Alexander Higgins <al************@hotmail.com> wrote:

>>But that's the point - using arbitrary binary data as if it were
real text data is *wrong*. The data is effectively "dodgy" - just as
if you'd tried to edit a jpeg as if it were a text file.

>Point taken, but this is not the case. If a person writes a text
file on her or his computer and does not save it as Unicode, the
current code page is used. If this file is given to someone with a
different current code page, the file is not displayed correctly. Simply
converting the file to Unicode will make the data display properly.

Yes - that means the *original* data is correct. That's fine - but the
data in the form loaded with the incorrect code page is invalid.

I can have a perfectly valid image file on disk, but if I load it and
throw away the high bit of every byte, the loaded version will be
"dodgy" will it not?

I believe that any string which contains only half of a surrogate pair
either comes from bad data to start with, or has been loaded
inappropriately, resulting in bad data in memory.
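
To find which index holds the unpaired half, the UTF-16 code units can be scanned directly. Here is a minimal sketch in Python (not .NET), with made-up helper names. It reports the position of every high surrogate that has no low surrogate after it and every low surrogate that has no high surrogate before it:

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text, keeping any lone surrogates."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def find_lone_surrogates(units) -> list:
    """Return the indexes of surrogate code units that are not part of a valid pair."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:          # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                        # valid pair: skip both halves
                continue
            bad.append(i)                     # high with no low after it
        elif 0xDC00 <= unit <= 0xDFFF:
            bad.append(i)                     # low with no high before it
        i += 1
    return bad


valid = find_lone_surrogates(utf16_units("a\U00010400b"))   # []
broken = find_lone_surrogates(utf16_units("ab\udc00"))      # [2]
```

Knowing the index makes it possible to look at the neighbouring characters in the source data and work out which load step produced the bad value.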

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Jul 27 '07 #2
