File Read Spanish characters

There is surprisingly little information on the various encoding options for
reading a text file. I have what seems to be a very basic issue: I'm reading
a text file that includes Spanish characters such as "ñ". When I read the
file into a string, that character is missing. Encoding seems to be the
culprit. File writers SHOULD begin a file with the BOM (Byte Order Mark) to
let us know what encoding to read the file with, but most software doesn't
do this, so we are left with BOMless files. So how can we reliably read these
files without knowing what encoding they were written with?

Through trial and error I have found that using UTF-7 picks up these Spanish
characters, along with the English.
    Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

Since I am clueless on matters of encoding, my question is: am I safe using
UTF-7 if I only care about English and Spanish? What is the downside? I
won't be able to read Romanian? Japanese?

Is there a way to programmatically find the correct encoding without the BOM?

Chip
Dec 9 '05 #1
Chip wrote:
> There is surprisingly little information on the various encoding
> options for reading a text file. I have what seems to be a very basic
> issue: I'm reading a text file that includes Spanish characters such
> as "ñ". When I read the file into a string, that character is
> missing. Encoding seems to be the culprit. File writers SHOULD begin
> a file with the BOM (Byte Order Mark) to let us know what encoding to
> read the file with, but most software doesn't do this so we are left
> with BOMless files.

Remember that these are byte order marks, which are intended to be used
for identifying whether an encoding uses Big Endian or Little Endian
representation. The fact that some encodings can be identified by their
BOM is just a nice side effect.

> So how can we reliably read these files without
> knowing what encoding it was written with?

Only through application-specific metadata (like HTTP headers).
There's no grand universal scheme to tell a file's character encoding.

> Through trial and error I have found that using UTF-7 picks up these
> Spanish characters, along with the English.
>     Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

That's quite likely not what you want. Try Encoding.Default.

> Since I am clueless on matters of encoding, my question is: am I safe
> using UTF-7 if I only care about English and Spanish? What is the
> downside? I won't be able to read Romanian? Japanese?

Depends on the input. UTF-7 is only (and rarely?) used for e-mail. I
guess the chance of finding a true UTF-7 encoded file is pretty much zero.

> Is there a way to programmatically find the correct encoding without
> the BOM?


As I said, in general no. If the range of possible encodings is
limited, you may be able to create a proper detection algorithm, though.
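
As an illustration of that last point, here is a minimal sketch of a detector for a deliberately small candidate list: the UTF-8 and UTF-16 BOMs, then BOMless UTF-8, then ISO-8859-1 as the fallback. The names GuessEncoding and StartsWith are hypothetical helpers, not framework APIs, and the choice of ISO-8859-1 as the fallback is an assumption about the input; it is a guess, not a proof. UTF-32 BOMs are left out to keep the sketch short.

```vb
Imports System
Imports System.IO
Imports System.Text

Module EncodingGuess
    ' Hypothetical helper: picks an encoding from a short candidate list.
    Function GuessEncoding(ByVal data As Byte()) As Encoding
        ' 1. A BOM, when present, settles the question.
        If StartsWith(data, New Byte() {&HEF, &HBB, &HBF}) Then Return Encoding.UTF8
        If StartsWith(data, New Byte() {&HFF, &HFE}) Then Return Encoding.Unicode          ' UTF-16 LE
        If StartsWith(data, New Byte() {&HFE, &HFF}) Then Return Encoding.BigEndianUnicode ' UTF-16 BE

        ' 2. No BOM: a strict UTF-8 decoder throws on invalid byte sequences,
        '    and Latin-1 text with accented letters is almost never valid UTF-8.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            strictUtf8.GetString(data)
            Return strictUtf8
        Catch ex As ArgumentException
            ' 3. Assumed fallback: ISO-8859-1 accepts every byte value.
            Return Encoding.GetEncoding("iso-8859-1")
        End Try
    End Function

    ' Hypothetical helper: True if data begins with the given prefix bytes.
    Private Function StartsWith(ByVal data As Byte(), ByVal prefix As Byte()) As Boolean
        If data.Length < prefix.Length Then Return False
        For i As Integer = 0 To prefix.Length - 1
            If data(i) <> prefix(i) Then Return False
        Next
        Return True
    End Function

    Sub Main()
        ' "año" encoded as ISO-8859-1 and as BOMless UTF-8.
        Dim latin1Bytes As Byte() = {&H61, &HF1, &H6F}
        Dim utf8Bytes As Byte() = {&H61, &HC3, &HB1, &H6F}

        Console.WriteLine(GuessEncoding(latin1Bytes).WebName) ' iso-8859-1
        Console.WriteLine(GuessEncoding(utf8Bytes).WebName)   ' utf-8

        ' StreamReader strips a BOM if one is present.
        Using reader As New StreamReader(New MemoryStream(utf8Bytes), GuessEncoding(utf8Bytes))
            Console.WriteLine(reader.ReadToEnd().Length)      ' 3
        End Using
    End Sub
End Module
```

Pure ASCII input passes the UTF-8 check, which is harmless because ASCII decodes identically under all three candidates. The sketch cannot tell ISO-8859-1 apart from other single-byte code pages; that still needs outside knowledge about where the files come from.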

Cheers,
--
http://www.joergjooss.de
mailto:ne********@joergjooss.de
Dec 12 '05 #2
If you only care about English and Spanish,
you'll be safe using iso-8859-1.
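
A minimal sketch of what that could look like, assuming the files really were written as ISO-8859-1. ReadLatin1 is a hypothetical helper name, and the temporary file exists only to make the example self-contained.

```vb
Imports System
Imports System.IO
Imports System.Text

Module Latin1Reader
    ' Hypothetical helper: reads a whole file as ISO-8859-1 (Latin-1).
    Function ReadLatin1(ByVal fileName As String) As String
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Using reader As New StreamReader(fileName, latin1)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' Write the three Latin-1 bytes for "año" to a temp file, then read them back.
        Dim fileName As String = Path.GetTempFileName()
        File.WriteAllBytes(fileName, New Byte() {&H61, &HF1, &H6F})

        Dim expected As String = "a" & ChrW(&HF1) & "o"
        Console.WriteLine(ReadLatin1(fileName) = expected) ' True

        File.Delete(fileName)
    End Sub
End Module
```

Note that ISO-8859-1 maps every byte to a character, so this reader never fails; if a file is actually UTF-8, the accented letters come out as two wrong characters instead of raising an error.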

Juan T. Llibre
ASP.NET MVP
Dec 14 '05 #4
