Is this HttpWebRequest correct?

Nightcrawler

I am currently using the HttpWebRequest and HttpWebResponse to pull
webpages down from a few urls.

string url = "some url";
HttpWebRequest httpWebRequest =
    (HttpWebRequest)WebRequest.Create(url);

using (HttpWebResponse httpWebResponse =
    (HttpWebResponse)httpWebRequest.GetResponse())
{
    string html = string.Empty;

    StreamReader responseReader = new
        StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF7);
    html = responseReader.ReadToEnd();
}

My code works, but my question is: am I doing it the right way
(especially the encoding part)? Some of the websites I pull content
from have characters in them that do not exist in the English
alphabet, and currently the only way for these to be read correctly by
my StreamReader is if I use UTF7 encoding. Is this really the
only way?

Before I move forward in the project I would like to understand whether
this is indeed the way to do it or whether I am missing anything.

Any help is appreciated.

Thanks

Oct 3 '08 #1


Martin Honnen

Nightcrawler wrote:

> I am currently using the HttpWebRequest and HttpWebResponse to pull
> webpages down from a few urls.
>
> string url = "some url";
> HttpWebRequest httpWebRequest =
>     (HttpWebRequest)WebRequest.Create(url);
>
> using (HttpWebResponse httpWebResponse =
>     (HttpWebResponse)httpWebRequest.GetResponse())
> {
>     string html = string.Empty;
>
>     StreamReader responseReader = new
>         StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF7);
>     html = responseReader.ReadToEnd();
> }
>
> My code works, but my question is: am I doing it the right way
> (especially the encoding part)? Some of the websites I pull content
> from have characters in them that do not exist in the English
> alphabet, and currently the only way for these to be read correctly by
> my StreamReader is if I use UTF7 encoding. Is this really the
> only way?

You should check the HTTP response header Content-Type for a charset
parameter and use that to create the stream reader. So for instance if
the server sends a header
    Content-Type: text/html; charset=Windows-1252
then you would use
    new StreamReader(httpWebResponse.GetResponseStream(),
        Encoding.GetEncoding("Windows-1252"))

On the other hand, on the wild wild web the server often does not send a
charset parameter and the author of the HTML document only includes the
charset in a meta element, e.g.
    <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
Therefore user agents like browsers put in a lot of effort to try to
read enough of the document to find and parse that meta element to then
be able to decode the rest of the document.
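
The header check described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the class name CharsetFromHeader, the method FromContentType and the fallback parameter are hypothetical names chosen for this example. HttpWebResponse also exposes a CharacterSet property, but parsing the header by hand makes the "no charset sent" case explicit.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the thread): picks an Encoding from a
// Content-Type header value such as "text/html; charset=Windows-1252".
public static class CharsetFromHeader
{
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // The value may be quoted, e.g. charset="utf-8".
                string name = p.Substring("charset=".Length).Trim().Trim('"', '\'');
                try
                {
                    return Encoding.GetEncoding(name);
                }
                catch (ArgumentException)
                {
                    // Unknown or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        // The server sent no charset parameter at all.
        return fallback;
    }
}
```

With a response in hand, the reader could then be created as new StreamReader(resp.GetResponseStream(), CharsetFromHeader.FromContentType(resp.ContentType, Encoding.UTF8)), where UTF-8 as the fallback is itself an assumption.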
--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/

Oct 3 '08 #2

Nightcrawler

So what you are basically saying is that my best bet is to look for
the meta tags in the page to determine the encoding to use, and not to
rely on the HTTP response header.

Most of the sites I read using the StreamReader say:
<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
but there are a few that do not have that meta tag included in their code.
How should I approach those? Is there a way for the StreamReader to detect
what encoding the page is using?

Thanks for your help!

Oct 3 '08 #3

Nightcrawler

What is even more annoying is that one of the websites I read
states that it's using UTF-8, and my StreamReader still does not decode
the characters correctly. I get little square boxes instead of the
characters.

Oct 3 '08 #4

Peter Duniho

On Fri, 03 Oct 2008 10:28:21 -0700, Nightcrawler
<th************@gmail.com> wrote:

> What is even more annoying is that one of the websites I read
> states that it's using UTF-8, and my StreamReader still does not decode
> the characters correctly. I get little square boxes instead of the
> characters.

"little square boxes" might, but does not necessarily, mean that the
characters are being decoded incorrectly. It may simply be that the
characters are not displaying with whatever font you're using to show them.

How are you determining that the StreamReader doesn't correctly decode the
characters? How are you specifying, if at all, that the encoding used by
the StreamReader is UTF-8?
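
One way to answer that question without depending on any font is to print the numeric values of the decoded characters instead of the characters themselves. A minimal sketch, where the class and method names (DecodeCheck, DumpCodeUnits) are hypothetical and chosen only for this example:

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the thread): shows what a string really
// contains, independent of the font used to display it.
public static class DecodeCheck
{
    // Returns the UTF-16 code units of s in hex, e.g. "U+00F9 U+006E".
    public static string DumpCodeUnits(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

If the dump of the decoded page shows U+FFFD (the Unicode replacement character) where an accented letter should be, the decoder rejected the bytes, which points at the wrong encoding and not at the font.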

Pete

Oct 3 '08 #5

Nightcrawler

If I view the very same page in my browser it shows up correctly.

The meta tag states it's using UTF-8 but when I use:

StreamReader responseReader = new
StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF8);

The characters are still unreadable. However, if I use UTF7 instead,
the characters show up correctly, BUT when I try to convert the page
to XML I get an error saying "hexadecimal value 0xD85E, is an invalid
character". I am very confused by all this. It seems a little like the
wild wild west.

Any further help is highly appreciated.

Thanks

Oct 3 '08 #6

Nightcrawler

I guess another interesting point is that when I change the code to
use "ISO-8859-1" instead of UTF-8 as the website claims it uses, it
seems that it actually reads the characters correctly AND the
string converts to XML without any issues. Why? I have no idea, and
I wish I understood it better. Again, any insight into this problem is
appreciated.

Thanks

Oct 3 '08 #7

Peter Duniho

On Fri, 03 Oct 2008 10:43:19 -0700, Nightcrawler
<th************@gmail.com> wrote:

> If I view the very same page in my browser it shows up correctly.

Unless your own code is using the same fonts to display the text that the
browser uses, that's not a relevant test.

As for the other behaviors you've noticed, it does sound to me as though
it's possible that the page is not encoded in UTF-8, but rather
ISO-8859-1. But it's hard to know for sure, since we don't have the
actual data to look at.

Pete

Oct 3 '08 #8

Nightcrawler

Pete,

You can see the page if you go to the link below. It's the iTunes
linkmaker page:

http://ax.phobos.apple.com.edgesuite...ss&media=music

As you can see, they claim they use UTF-8, but when you read it using a
StreamReader with that encoding, it does not read "foreign"
characters correctly. However, when I tried the ISO-8859-1 encoding
it seemed to work.

Thanks

Oct 3 '08 #9

Peter Duniho

On Fri, 03 Oct 2008 13:47:43 -0700, Nightcrawler
<th************@gmail.com> wrote:

> Pete,
>
> You can see the page if you go to the link below. It's the iTunes
> linkmaker page:
>
> http://ax.phobos.apple.com.edgesuite...ss&media=music
>
> As you can see, they claim they use UTF-8, but when you read it using a
> StreamReader with that encoding, it does not read "foreign"
> characters correctly. However, when I tried the ISO-8859-1 encoding
> it seemed to work.

What data in the page are you having trouble with? Can you be more
specific about what's not being shown correctly?

I haven't spent a huge amount of time with the file. But a cursory look
at it shows that it appears, at least to me, to have ISO-8859-1 data
embedded within the page itself, in certain URLs.

It seems possible to me that the page encoding is technically UTF-8, but
using only the subset of UTF-8 that is the same as ISO-8859-1, and that
the page also has data that's not supposed to be interpreted as text
within the HTML, but rather should be decoded as ISO-8859-1.

That would explain why the page claims to be encoded as UTF-8 but there
are still characters that don't display correctly unless you read the HTML
as ISO-8859-1.

Or maybe the meta tag really is wrong. I'm not completely sure. :)

Pete

Oct 3 '08 #10

Arne Vajhøj

Nightcrawler wrote:

> I am currently using the HttpWebRequest and HttpWebResponse to pull
> webpages down from a few urls.
>
> string url = "some url";
> HttpWebRequest httpWebRequest =
>     (HttpWebRequest)WebRequest.Create(url);
>
> using (HttpWebResponse httpWebResponse =
>     (HttpWebResponse)httpWebRequest.GetResponse())
> {
>     string html = string.Empty;
>
>     StreamReader responseReader = new
>         StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF7);
>     html = responseReader.ReadToEnd();
> }
>
> My code works, but my question is: am I doing it the right way
> (especially the encoding part)? Some of the websites I pull content
> from have characters in them that do not exist in the English
> alphabet, and currently the only way for these to be read correctly by
> my StreamReader is if I use UTF7 encoding. Is this really the
> only way?

I am a bit surprised by the UTF-7; that is a rare encoding, at least
where I surf.

Otherwise, Martin Honnen is correct: you need to look at both the HTTP
header and the HTML META tag.

See the code attached below for a starting point.

Arne

=========================================================

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class HttpDownloadCharset
{
    // Matches the charset parameter of a Content-Type value (optionally quoted).
    private static Regex encpat = new Regex(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    private static string ParseContentType(string contenttype, string defenc)
    {
        Match m = encpat.Match(contenttype ?? "");
        return m.Success ? m.Groups[1].Value : defenc;
    }

    // Matches <meta http-equiv="Content-Type" content="...">, also when it
    // is closed XHTML-style with " />".
    private static Regex metaencpat = new Regex(
        "<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']\\s*/?>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    private static string ParseMetaContentType(string html, string defenc)
    {
        Match m = metaencpat.Match(html);
        return m.Success ? ParseContentType(m.Groups[1].Value, defenc) : defenc;
    }

    public static string Download(string urlstr)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            if (resp.StatusCode != HttpStatusCode.OK)
            {
                throw new ArgumentException("URL " + urlstr + " returned " +
                    resp.StatusDescription);
            }
            // Start with the HTTP header; assume ISO-8859-1 if it names no charset.
            string enc = ParseContentType(resp.ContentType, "ISO-8859-1");
            // Read the whole body as bytes. ContentLength may be -1, so the
            // data is collected in a growing MemoryStream.
            MemoryStream ms = new MemoryStream();
            using (Stream stm = resp.GetResponseStream())
            {
                byte[] buf = new byte[8192];
                int n;
                while ((n = stm.Read(buf, 0, buf.Length)) > 0)
                {
                    ms.Write(buf, 0, n);
                }
            }
            byte[] data = ms.ToArray();
            // Decode as ASCII just to find the META element; a charset found
            // there replaces the one from the header.
            string temp = Encoding.ASCII.GetString(data);
            enc = ParseMetaContentType(temp, enc);
            return Encoding.GetEncoding(enc).GetString(data);
        }
    }
}

Oct 5 '08 #11

Nightcrawler

Peter,

Thanks for your feedback. One example of data that I was having
trouble with would be the 6th row from the bottom (Love & Happiness
(Yemaya y Ochùn) [12' Club Mix]). The special "u" character in the
word Ochùn was coming out wrong when I used UTF-8 encoding. Once I
changed it to ISO-8859-1 I was able to parse it out correctly.

I really would like to understand encodings and why I was running into
this problem. Are there any articles or websites you can recommend
that will allow me to learn a bit more about this? I hate "solving" a
problem and moving on without really knowing why it works.

Thanks again.

Oct 6 '08 #12

Nightcrawler

Arne,

Thanks for the code. I will give this a try.

Oct 6 '08 #13

Jon Skeet [C# MVP]

On Oct 6, 3:15 pm, Nightcrawler <thomas.zale...@gmail.com> wrote:

> Thanks for your feedback. One example of data that I was having
> trouble with would be the 6th row from the bottom (Love & Happiness
> (Yemaya y Ochùn) [12' Club Mix]). The special "u" character in the
> word Ochùn was coming out wrong when I used UTF-8 encoding. Once I
> changed it to ISO-8859-1 I was able to parse it out correctly.
>
> I really would like to understand encodings and why I was running into
> this problem. Are there any articles or websites you can recommend
> that will allow me to learn a bit more about this? I hate "solving" a
> problem and moving on without really knowing why it works.

I have an article on Unicode at http://pobox.com/~skeet/csharp/unicode.html
Whether it contains anything you don't already know is a different
matter...

Jon

Oct 6 '08 #14

Nightcrawler

Jon,

Thanks, I will check this out.

Oct 6 '08 #15

Peter Duniho

On Mon, 06 Oct 2008 07:15:01 -0700, Nightcrawler
<th************@gmail.com> wrote:

> Peter,
>
> Thanks for your feedback. One example of data that I was having
> trouble with would be the 6th row from the bottom (Love & Happiness
> (Yemaya y Ochùn) [12' Club Mix]). The special "u" character in the
> word Ochùn was coming out wrong when I used UTF-8 encoding. Once I
> changed it to ISO-8859-1 I was able to parse it out correctly.

Well, I'm no character encoding expert, but it does look to me as though
that page is not a valid UTF-8 document. In particular, the character
you're concerned with is coming across as value 0xf9 (249), while UTF-8 as
far as I know allows single-byte characters only for the first 128
characters.

In other words, it probably is in fact encoded using ISO-8859-1, as your
investigation suggested.
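
The point about the byte 0xf9 can be reproduced in a few lines. This is an illustrative sketch only: the byte array stands in for the word "Ochùn" as it would arrive from an ISO-8859-1 page, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical demo (not from the thread): the same bytes decoded two ways.
public static class ByteF9Demo
{
    // "Och", then the single byte 0xF9, then "n".
    public static readonly byte[] Sample = { 0x4F, 0x63, 0x68, 0xF9, 0x6E };

    // ISO-8859-1 maps every byte directly to the code point with the
    // same value, so 0xF9 becomes U+00F9 (u with grave accent).
    public static string AsLatin1()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Sample);
    }

    // In UTF-8 a lone 0xF9 is not a valid sequence; the default decoder
    // substitutes the replacement character U+FFFD for it.
    public static string AsUtf8()
    {
        return Encoding.UTF8.GetString(Sample);
    }

    // A genuine UTF-8 page would carry U+00F9 as the two bytes C3 B9.
    public static string Utf8BytesOfUGrave()
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes("\u00F9"));
    }
}
```

Here AsLatin1() yields "Ochùn", AsUtf8() yields the same word with U+FFFD in place of the accented letter, and Utf8BytesOfUGrave() yields "C3-B9".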

As for how to deal with that situation, I'm not really sure. I don't know
of any truly reliable way to detect the character encoding. If the
provider lies to you, you may be left having to leave it up to the user to
override whatever information was given to you by the provider.
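
One partial workaround, offered only as a heuristic sketch and not as a reliable detector, is to decode strictly as UTF-8 first and fall back to ISO-8859-1 when the bytes are not valid UTF-8. The class name LenientDecoder and the choice of ISO-8859-1 as the fallback are assumptions made for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the thread): trust a UTF-8 claim only
// if the bytes really are valid UTF-8.
public static class LenientDecoder
{
    public static string Decode(byte[] data)
    {
        try
        {
            // The second argument (throwOnInvalidBytes: true) makes the
            // decoder throw instead of silently inserting U+FFFD.
            return new UTF8Encoding(false, true).GetString(data);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8. ISO-8859-1 accepts any byte sequence, so
            // this never fails, but it may still be the wrong guess.
            return Encoding.GetEncoding("ISO-8859-1").GetString(data);
        }
    }
}
```

This catches a page like the one discussed here, but text in another single-byte encoding would be silently misread as ISO-8859-1, so a user override remains useful.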

> I really would like to understand encodings and why I was running into
> this problem. Are there any articles or websites you can recommend
> that will allow me to learn a bit more about this? I hate "solving" a
> problem and moving on without really knowing why it works.

Unfortunately, I don't know of any good resources first-hand. They're
probably out there, but I've just muddled by with looking at various
specifications and implementations I find on Google. That said, Jon's
article might be helpful, and you might find the Wikipedia discussion of
UTF-8 useful:
http://en.wikipedia.org/wiki/UTF-8

In particular, while it's not a discussion of character encodings
generally, seeing how UTF-8 works may give you a little better
understanding of the issues involved.

Pete

Oct 6 '08 #16
