WebClient and Encoding

Is it possible to tell the WebClient to use an "automatic" encoding when
doing DownloadString? The encoding of the connection is written in the
header, so the WebClient should be able to sense it, but I wasn't able to
find the option. I can only use a fixed Encoding (UTF8 for example) and hope
the site uses it.

--- bye
May 27 '07 #1
8 Replies


Hello MaxMax,

See HttpResponse.Charset and HttpResponse.ContentEncoding

---
WBR, Michael Nemtsev [.NET/C# MVP].
My blog: http://spaces.live.com/laflour
Team blog: http://devkids.blogspot.com/

"The greatest danger for most of us is not that our aim is too high and we
miss it, but that it is too low and we reach it" (c) Michelangelo

May 27 '07 #2

> See HttpResponse.Charset and HttpResponse.ContentEncoding

In the "classical" example of DownloadString from MSDN:

{
    WebClient client = new WebClient ();
    string reply = client.DownloadString (address);

    Console.WriteLine (reply);
}

I can't use the HttpResponse before I make the query. And if I use it
afterwards it's useless: DownloadString has already decoded the stream
(possibly using the wrong code page) into a string.

--- bye
May 27 '07 #3

On Sun, 27 May 2007 13:45:54 +0200, MaxMax <no**@none.com> wrote:
> I can't use the HttpResponse before I make the query. And if I use it
> afterwards it's useless: DownloadString has already decoded the stream
> (possibly using the wrong code page) into a string.

WebClient.DownloadString uses the encoding specified in the WebClient object when it converts the downloaded data to a string. If you know the encoding in advance, you can set WebClient.Encoding to the proper encoding; otherwise it will use Encoding.Default, which is the code page used by your operating system.

If you don't know the encoding in advance, you should probably take a closer look at the HttpWebRequest/HttpWebResponse classes. The trick is to download the data as a byte[], then use the information provided by the headers to convert it to a string with the proper encoding.
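
A minimal sketch of the byte[] approach described above, staying with WebClient instead of switching to HttpWebRequest. It is one possible way to do it, not a definitive implementation. The class name AutoEncodingDownloader, both method names, and the choice of UTF-8 as the fallback are assumptions made for illustration. The parsing is deliberately simple: it only recognizes a parameter written exactly as charset=NAME in the Content-Type header.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper (name is illustrative, not part of the framework).
public static class AutoEncodingDownloader
{
    // Picks the encoding named by the charset parameter of a Content-Type
    // value such as "text/html; charset=ISO-8859-1". Returns the fallback
    // when the header is missing, has no charset, or names an unknown one.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
            return fallback;

        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (!p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                continue;

            // Strip optional quotes around the charset name.
            string name = p.Substring("charset=".Length).Trim().Trim('"', '\'');
            try
            {
                return Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                // Unknown or empty charset name.
                return fallback;
            }
        }
        return fallback;
    }

    // Downloads the raw bytes first, then decodes them with the encoding
    // announced in the response headers (assumed fallback: UTF-8).
    public static string DownloadStringAuto(string address)
    {
        using (WebClient client = new WebClient())
        {
            byte[] raw = client.DownloadData(address);
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType];
            Encoding encoding = EncodingFromContentType(contentType, Encoding.UTF8);
            return encoding.GetString(raw);
        }
    }
}
```

A call would look like string page = AutoEncodingDownloader.DownloadStringAuto(address); in place of client.DownloadString(address). Pages that declare their charset only inside the HTML are not covered by this sketch.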

--
Happy coding!
Morten Wennevik [C# MVP]
May 27 '07 #4

WebClient internally uses a WebRequest to do the downloading, and it
searches WebRequest.ContentType for a "charset" value to use as the
encoding. If the ContentType/charset header doesn't exist or names an
invalid charset, WebClient.Encoding is used (Encoding.Default unless you
assign it beforehand). Be aware, however, that WebClient.Encoding is only a
fallback: if the response contains a valid encoding, that encoding is
always used to decode the returned data.

For an HttpWebRequest, the ContentType comes from the HttpWebResponse. You can
use Fiddler (http://www.fiddlertool.com/) to trace the HTTP headers and
see whether WebClient used the correct Encoding to return the string.

Regards,
Walter Wang (wa****@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.
May 27 '07 #5

"Walter Wang [MSFT]" <wa****@online.microsoft.com> wrote in message
news:Oj**************@TK2MSFTNGHUB02.phx.gbl...
> WebClient internally uses a WebRequest to do the downloading; and it will
> use WebRequest.ContentType to search for "charset" header as the encoding.
> If the ContentType/charset header doesn't exist or contains invalid
> charset, WebClient.Encoding is used (which is Encoding.Default by default
> or you can assign it beforehand); however you should be aware that
> WebClient.Encoding is used as a fallback, if the response contains a valid
> encoding, it's always used to decode the returned data.

I'm pretty sure it isn't so. If I set Encoding to (for example) UTF32 the
WebClient throws an exception. And if I have a page with a UTF-8 character
(a page that in the WebRequest IS correctly shown as a UTF-8 page) and I
don't set the Encoding I receive a wrong string.
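
The "wrong string" symptom is what you would expect when UTF-8 bytes are decoded with a single-byte code page. The sketch below reproduces it without any network access. The class and method names are made up for illustration, and ISO-8859-1 is only an assumed stand-in for whatever Encoding.Default happens to be on the client machine.

```csharp
using System;
using System.Text;

// Hypothetical demo class (name is illustrative).
public static class WrongEncodingDemo
{
    // Encodes the text as UTF-8, as a server would send it, then decodes
    // the bytes with the given encoding, as a client would receive it.
    public static string DecodeUtf8BytesAs(string text, Encoding clientEncoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return clientEncoding.GetString(utf8Bytes);
    }
}
```

Passing "\u00e9" (the single character é) with Encoding.GetEncoding("iso-8859-1") yields a two-character string, because each of the two UTF-8 bytes is mapped to its own character. Passing Encoding.UTF8 returns the original text.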

--- bye
May 28 '07 #6

On Mon, 28 May 2007 08:19:21 +0200, MaxMax <no**@none.com> wrote:
> I'm pretty sure it isn't so. If I set Encoding to (for example) UTF32 the
> WebClient throws an exception. And if I have a page with a UTF-8 character
> (a page that in the WebRequest IS correctly shown as a UTF-8 page) and I
> don't set the Encoding I receive a wrong string.

Try this code. It attempts to get the character set in various ways and falls back to UTF-8. Checking ContentEncoding may not be necessary, as I have yet to see it specified. The code is a bit of cut and paste, and you may have to tweak it to get it running.

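using System;
using System.IO;
using System.Net;
using System.Text;
using System.Windows.Forms;

// The class name and the CurLength/Status members are assumptions that stand
// in for the surrounding class, which was not posted.
public class PageDownloader
{
    public long CurLength;      // number of bytes read so far
    public string Status = "";

    public string DownloadPage(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            byte[] buffer;
            using (Stream s = resp.GetResponseStream())
            {
                buffer = ReadStream(s);
            }

            // Start with the charset announced in the response headers.
            string pageEncoding = "";
            Encoding e = Encoding.UTF8;
            if (!string.IsNullOrEmpty(resp.CharacterSet))
                pageEncoding = resp.CharacterSet;
            else if (!string.IsNullOrEmpty(resp.ContentType))
                pageEncoding = GetCharacterSet(resp.ContentType);

            // Next, look for a charset declaration inside the page itself.
            if (pageEncoding == "")
                pageEncoding = GetCharacterSet(buffer);

            // Content-Encoding normally names a compression scheme such as
            // gzip rather than a charset, so it is only a last resort.
            if (pageEncoding == "" && !string.IsNullOrEmpty(resp.ContentEncoding))
                pageEncoding = resp.ContentEncoding;

            if (pageEncoding != "")
            {
                try
                {
                    e = Encoding.GetEncoding(pageEncoding);
                }
                catch (ArgumentException)
                {
                    // Unknown name: keep the UTF-8 fallback.
                    MessageBox.Show("Invalid encoding: " + pageEncoding);
                }
            }

            string data = e.GetString(buffer);

            Status = "";

            return data;
        }
    }

    // Returns the text that follows the last "charset=" in s, or "" if
    // there is none. The result is upper case, which GetEncoding accepts.
    private string GetCharacterSet(string s)
    {
        s = s.ToUpperInvariant();
        int start = s.LastIndexOf("CHARSET", StringComparison.Ordinal);
        if (start == -1)
            return "";

        start = s.IndexOf('=', start);
        if (start == -1)
            return "";

        start++;
        s = s.Substring(start).Trim();
        int end = s.Length;

        // The name ends at the first ';', quote or '/' after the '='.
        int i = s.IndexOf(';');
        if (i != -1)
            end = i;
        i = s.IndexOf('"');
        if (i != -1 && i < end)
            end = i;
        i = s.IndexOf('\'');
        if (i != -1 && i < end)
            end = i;
        i = s.IndexOf('/');
        if (i != -1 && i < end)
            end = i;

        return s.Substring(0, end).Trim();
    }

    // Decodes the raw page with the system code page, which is good enough
    // to find an ASCII "charset=" declaration in the markup.
    private string GetCharacterSet(byte[] data)
    {
        string s = Encoding.Default.GetString(data);
        return GetCharacterSet(s);
    }

    // Reads the whole stream into memory. I/O errors propagate to the caller
    // instead of producing a null buffer that would fail later in GetString.
    private byte[] ReadStream(Stream s)
    {
        byte[] buffer = new byte[8096];
        using (MemoryStream ms = new MemoryStream())
        {
            while (true)
            {
                int read = s.Read(buffer, 0, buffer.Length);
                if (read <= 0)
                {
                    CurLength = 0;
                    return ms.ToArray();
                }
                ms.Write(buffer, 0, read);
                CurLength = ms.Length;
            }
        }
    }
}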
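
One limitation worth knowing about: GetCharacterSet above stops at the first quote after the '=' sign, so it returns an empty string for a declaration written as charset="utf-8" with the name itself in quotes. As an alternative (not the code posted in this thread), a regular expression can cover both forms. The class and method names below are assumptions for illustration, and unlike the code above this sketch returns the first match rather than the last one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name is illustrative).
public static class CharsetSniffer
{
    // Matches charset=NAME, charset="NAME" and charset='NAME',
    // ignoring case and optional whitespace around the '=' sign.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_.:\-]+)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the first charset name found in the text, or "" if none.
    public static string FindCharset(string text)
    {
        if (string.IsNullOrEmpty(text))
            return "";

        Match m = CharsetPattern.Match(text);
        return m.Success ? m.Groups[1].Value : "";
    }
}
```

The returned name can be passed to Encoding.GetEncoding in the same way as before, with the same handling for names the runtime does not recognize.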

--
Happy coding!
Morten Wennevik [C# MVP]
May 28 '07 #7

Hi MaxMax,

I've done some tests and it seems my previous comment isn't correct. Sorry
about that.

Please use Morten's posted code to detect the encoding and read the text
correctly.

I will raise this question on our internal discussion list to see if
this is a known issue.

Regards,
Walter Wang (wa****@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

May 29 '07 #8

We have confirmed this is an issue in WebClient. I've filed an internal bug
for it.

Thanks for the feedback!

Regards,
Walter Wang (wa****@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

May 30 '07 #9
