>> I don't even know how I got on that post, but I have been
>> contributing for a while. I responded to it probably because it
>> came up as a search result for something else and the question was
>> still unanswered.
> Right.
>> As per my post, it probably has nothing to do with the data. If
>> the user inserts binary data using the Windows code page into a
>> database, non-standard UTF chars will throw this exception using
>> a stream writer.
> But that's the point - using arbitrary binary data as if it were
> real text data is *wrong*. The data is effectively "dodgy" - just as
> if you'd tried to edit a jpeg as if it were a text file.
Point taken, but this is not the case. Thus, if a person writes a text
file on her or his computer and does not use Unicode to save it, the
current code page is used. If this file is given to someone with some
other current code page, the file is not displayed correctly. Simply
converting the file to Unicode, reading it with the code page it was
saved in, will make the data display properly.
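As a rough illustration of the code page point (a sketch, not code from this thread): the same byte decodes to different characters depending on the encoding you assume. The class and helper names are made up for the example, and ISO-8859-1 stands in for the Windows code page here because it is available in every .NET version without extra setup.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Hypothetical helper: decodes bytes using the named encoding.
    public static string Decode(byte[] data, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(data);
    }

    public static void Main()
    {
        // "Ü" saved under a single-byte Western code page is the one byte 0xDC.
        byte[] saved = { 0xDC };

        string right = Decode(saved, "iso-8859-1"); // U+00DC, "Ü"
        string wrong = Decode(saved, "utf-8");      // U+FFFD, the replacement character

        Console.WriteLine((int)right[0]); // 220
        Console.WriteLine((int)wrong[0]); // 65533

        // Once decoded with the right code page, the text can be saved as UTF-8.
        byte[] utf8 = Encoding.UTF8.GetBytes(right);
        Console.WriteLine(BitConverter.ToString(utf8)); // C3-9C
    }
}
```

Note that the wrong guess does not fail here; it silently produces a different character, which is one way "dodgy" data ends up in a string.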
When performing the encoding process, the encoder will escape
incorrect characters instead of attempting to interpret them. During
the encode/decode process you may see conversions like Ü = &Uuml;,
™ = &trade;, Á = &Aacute;, Ω = &Omega;, etc. Eventually you will have
non-UTF characters that are part of the default Windows code page
throw an error. By specifying the System.Text.Encoding as part of the
StreamWriter, you will avoid throwing the exception.
Additionally, that data could also be URL-encoded, e.g. %20 = "space".
The percent sign indicates that the next two characters are the
hexadecimal code of the character, so %20 is Chr(32). Injection
hackers will use %00 for null injection attacks, or %0D%0A for
Chr(13) & Chr(10), etc.
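To make the hex arithmetic concrete, here is a small sketch (the class and method names are invented for illustration). The two digits after the percent sign are hexadecimal, so %20 is character 32 (space), and carriage return and line feed are %0D and %0A.

```csharp
using System;

public static class UrlEscapeDemo
{
    // Returns the character code that a two-digit %XX escape stands for.
    public static int EscapeToCode(string escape)
    {
        // Skip the leading '%' and parse the remaining digits as hexadecimal.
        return Convert.ToInt32(escape.Substring(1), 16);
    }

    public static void Main()
    {
        Console.WriteLine(EscapeToCode("%20")); // 32 (space)
        Console.WriteLine(EscapeToCode("%0D")); // 13 (carriage return)
        Console.WriteLine(EscapeToCode("%0A")); // 10 (line feed)

        // The framework's own decoder agrees on the space.
        Console.WriteLine(Uri.UnescapeDataString("Read%20Me") == "Read Me"); // True
    }
}
```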
Considering all of the above, there are plenty of cases where you will
have data that is clean but is represented by different characters in
different encodings. Different operating systems have different
newline definitions. While Windows uses CRLF (Carriage Return plus Line
Feed), UNIX uses only LF. Additionally, you may see some encoders
convert <BR> to line feeds and vice versa.
To reproduce this issue....
Copy this into a text file in a Visual Studio project and save it as
"Read_Me.txt".
==========Begin Read_Me.txt
1) Create New Web project and copy the entire contents of this folder
into the projects root folder. Select yes to all prompts.
2) Browse to the Cms Folder, Right click and choose Exclude from
Project. Right Click The solution and choose "Add existing Project".
Browse to the Cms Folder and Choose CMS.vbproj, then add a reference
to the CMS Project to your Web Project.
3) Add a reference to the freeTextBox.dll in the /framework1.1 folder.
4) Browse to /admin/install.aspx, right click and choose view in
browser. Follow the set up instructions.
============end Read_Me.txt
Now right click the file and choose Properties, then select Build
Action and choose Embedded Resource. Create a new class named
Resources.vb and add this code.
Imports System.IO
Imports System.Reflection
Imports System.Xml

Public Class Resources
    Dim _textStreamReader As StreamReader
    Dim _assembly As [Assembly]

    Sub New()
    End Sub

    ' Reads an embedded resource and returns its contents as a string.
    Function GetResource(ByVal ResourceName As String) As String
        _assembly = [Assembly].GetExecutingAssembly()
        If _assembly Is Nothing Then
            Throw New Exception("assembly is nothing")
        End If
        ' "AssemblyName." is a placeholder for the prefix of your resource names.
        Dim stream As IO.Stream = _assembly.GetManifestResourceStream("AssemblyName." & ResourceName)
        If stream Is Nothing Then
            Throw New Exception("stream is nothing")
        End If
        ' No encoding is passed, so StreamReader assumes UTF-8.
        _textStreamReader = New StreamReader(stream)
        Return Me._textStreamReader.ReadToEnd
    End Function
End Class
Now open a web page and, in the Page_Load sub, add the following code:
Dim resources As New Resources
Dim code As String = Nothing
Try
    code = resources.GetResource(ResourceName)
Catch ex As Exception
    log("Resource : " & ResourceName & " is nothing", LogFile)
End Try
If Not code Is Nothing Then
    ' No encoding argument: StreamWriter uses its default (UTF-8).
    Dim Sw As New IO.StreamWriter(FileName, False)
    Sw.Write(code)
    Sw.Close()
End If
When you execute this code, the surrogate error is thrown. Why? Because
the text file was embedded using the Windows code page. The fix:
If Not code Is Nothing Then
    ' Write with the Windows-1252 code page instead of the default UTF-8.
    Dim Sw As New IO.StreamWriter(FileName, False, System.Text.Encoding.GetEncoding(1252))
    Sw.Write(code)
    Sw.Close()
End If
Clearly you'll see the data is written to the text file in its
original format, with no funky characters and no data corruption.
Hope this helps give you a better understanding of the process.
Alex Higgins
http://alexanderhiggins.com
> Any time that you've read in text data with the wrong encoding, your
> string has the wrong data in it, and therefore the data is dodgy.
> Do you see what I mean?
> Jon
--------------------------------------------------------------------------------
Subject: Re: HTMLEncode: low surrogate char Error?
Date: Fri, 27 Jul 2007 19:03:52 +0100
alex higgins wrote:
> Thanks for the response....
> I don't even know how I got on that post, but I have been
> contributing for a while. I responded to it probably because it
> came up as a search result for something else and the question was
> still unanswered.
Right.
> As per my post, it probably has nothing to do with the data. If
> the user inserts binary data using the Windows code page into a
> database, non-standard UTF chars will throw this exception using
> a stream writer.
But that's the point - using arbitrary binary data as if it were
real text data is *wrong*. The data is effectively "dodgy" - just as
if you'd tried to edit a jpeg as if it were a text file.
Any time that you've read in text data with the wrong encoding, your
string has the wrong data in it, and therefore the data is dodgy.
Do you see what I mean?
Jon
Hello,
I'm using C# to write an HTML-based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
prevent my HTML from acting funny. Today the following error popped
up: "An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."
Any ideas? Is there any way to do an HtmlEncode with UTF-8?
Here is the affected code...
bResult = CommonUtil.EncodeForHTML(strKeywords, ref strConvert);
if (bResult) strKeywords = strConvert;
if (strKeywords.Length > 1)
{
    strDetail += "<TR><TH><DIV class=HF> Keywords </DIV></TH>\r\n";
    strDetail += "<TD colspan=7><DIV class=DF>" + strKeywords +
        "</DIV></TD></TR>\r\n";
}
fReport.WriteLine(strDetail); // <<< WHERE ERROR OCCURS

public static bool EncodeForHTML(string strIn, ref string strOut)
{
    try
    {
        if (strIn.Length < 1) return false;
        strOut = HttpUtility.HtmlEncode(strIn);
        return true;
    }
    catch
    {
        return false;
    }
}
Thank you,
Marta
Marta Pia wrote:
> I'm using C# to write an HTML-based report using keywords stored in a
> database whose input I don't control. Before sending the strings to
> HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
> prevent my HTML from acting funny. Today the following error popped
> up: "An unexpected exception occurred
> System.ArgumentException: Found a low surrogate char without a
> preceding high surrogate at index: 640. The input may not be in this
> encoding, or may not contain valid Unicode (UTF-16) characters."
If you're getting an exception like that, it suggests you've got some
very dodgy data to start with. Have you examined it to look at the
character being complained about?
Marta Pia <clio...@hotmail.com> wrote:
>> If you're getting an exception like that, it suggests you've got some
>> very dodgy data to start with. Have you examined it to look at the
>> character being complained about?
> Oh yes, the characters are dodgy. I am trying to decode which one
> actually tripped up the writeline/encode. I might need to strip all
> non-printing characters out of the string before writing it to the
> file (although, previous to this one, the presence of non-printing
> characters didn't cause an exception). Is there a .NET function to
> strip out non-printing characters or should I write a function to go
> through the string character by character?
Well, you could do that. I would think the first port of call should
be working out how you got dodgy data to start with though.
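For finding the character being complained about, and for the character-by-character clean-up asked about above, a loop along these lines would probably do. This is only a sketch: the class and method names are invented for illustration, and which control characters to keep is an assumption you would adjust.

```csharp
using System;
using System.Text;

public static class SurrogateCheck
{
    // Returns the index of the first unpaired surrogate in s, or -1 if there is none.
    public static int FirstLoneSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                if (i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    i++; // valid pair, skip the low half
                    continue;
                }
                return i; // high surrogate with no low surrogate after it
            }
            if (char.IsLowSurrogate(s[i])) return i; // low surrogate with no high before it
        }
        return -1;
    }

    // Returns a copy of s without unpaired surrogates and without control
    // characters other than tab, CR and LF.
    public static string Clean(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                sb.Append(c).Append(s[i + 1]); // keep valid pairs intact
                i++;
            }
            else if (char.IsSurrogate(c))
            {
                // unpaired surrogate: drop it
            }
            else if (!char.IsControl(c) || c == '\t' || c == '\r' || c == '\n')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string bad = "key\uDC00word\u0001";
        Console.WriteLine(FirstLoneSurrogate(bad)); // 3
        Console.WriteLine(Clean(bad));              // keyword
    }
}
```

Stripping only hides the symptom; the index from FirstLoneSurrogate is more useful for tracing where the bad data came from.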
> That aside, why does the character save into a string and encode
> without error, but when I try to write it, it fails... ?
Chars are just 16-bit numbers, and a lot of routines will just treat
them as such, whether they're surrogates or not. I suspect that it's
only when the string is written out, in the process of encoding it to
a byte array for transmission over the wire, that the problem is
noticed.
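A small sketch of that behaviour, under the assumption that the writer uses a strict UTF-8 encoder (as far as I know StreamWriter's default encoder is of this kind, but check your framework version). The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class EncodeDemo
{
    // Returns true if s can be encoded as UTF-8 by a strict encoder,
    // false if the encoder rejects it (for example because of a lone surrogate).
    public static bool CanEncodeStrict(string s)
    {
        // Second argument true: throw on invalid input instead of substituting U+FFFD.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetBytes(s);
            return true;
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException derives from ArgumentException.
            return false;
        }
    }

    public static void Main()
    {
        string bad = "abc\uDC00def"; // lone low surrogate at index 3

        // The string itself holds the lone surrogate without complaint.
        Console.WriteLine(bad.Length);                // 7
        // Only the conversion to bytes notices the problem.
        Console.WriteLine(CanEncodeStrict(bad));      // False
        Console.WriteLine(CanEncodeStrict("abcdef")); // True
    }
}
```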