Determining possible encodings of a given text

How do I efficiently determine which possible encoding(s) a given text
is in? Can I use the iconv.h api somehow?

Thanks in advance,
Nordlöw
Jun 27 '08 #1
I don't know.
I hope an expert can answer.

"Nordlöw" <pe*********@gmail.com> wrote in message
news:e0**********************************@x41g2000hsb.googlegroups.com...
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
>
> Thanks in advance,
> Nordlöw
Jun 27 '08 #2
In comp.lang.c Nordloew <pe*********@gmail.com> wrote:
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
Sorry, but that's not a question related to the C programming
language but about some specific task and libraries (that may
be written in C, but that doesn't make it on-topic). The basic
question would remain the same if you would use C, C++, Perl
or any other programming language.

So just a few hints: figuring out which encoding is used for a
file is probably a very difficult task since it would require
that the program understands something about the content of
the file. It's probably possible to make some well-educated
guess if the file is long enough, but a method that gets it
always right is, as far as I can see, impossible. And libiconv
isn't going to be of any help since it's for converting from an
already known encoding to another; it doesn't try to guess the
source encoding (except in the most trivial way, using the
locale-dependent character encoding when no source encoding
has been specified).

If you're interested in a more in-depth discussion it probably
would make sense to post to comp.programming instead.

Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
Jun 27 '08 #3
In article <e0**********************************@x41g2000hsb.googlegroups.com>,
Nordlöw <pe*********@gmail.com> wrote:
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
What do you need to know?

If it doesn't contain any bytes above 127, it's probably ASCII. If it
contains lots of zeros in the even or odd positions it's probably
UTF-16. If it contains bytes above 127 *and* they're consistent with
UTF-8, then it's almost certainly UTF-8. If it contains a small
proportion of bytes above 127, it's quite likely some ISO-Latin-N
encoding. I don't know much about Far Eastern encodings.
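
The rules of thumb above could be sketched in C roughly as follows. This is only an illustration, not a reliable detector: the names `guess_encoding` and `is_valid_utf8`, the `enum guess` labels, and both 25% thresholds are assumptions made up for the example. The UTF-16 test runs first because UTF-16 text made of ASCII characters has no bytes above 127 either.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for this sketch; they come from no library. */
enum guess { GUESS_ASCII, GUESS_UTF16, GUESS_UTF8, GUESS_LATIN, GUESS_UNKNOWN };

/* Returns 1 if buf[0..len) is well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0, k, extra;
    unsigned long cp;

    while (i < len) {
        unsigned char c = buf[i];
        if (c < 0x80) { i++; continue; }                      /* plain ASCII byte */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0)     { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return 0;                   /* stray continuation or invalid lead byte */

        if (len - i <= extra) return 0;  /* sequence is cut off at the end */
        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;        /* not 10xxxxxx */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return 0;               /* overlong form */
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;           /* surrogate range */
        i += extra + 1;
    }
    return 1;
}

/* Applies the heuristics in order; the thresholds are arbitrary assumptions. */
enum guess guess_encoding(const unsigned char *buf, size_t len)
{
    size_t high = 0, zero_even = 0, zero_odd = 0, i;

    assert(buf != NULL || len == 0);
    if (len == 0) return GUESS_UNKNOWN;

    for (i = 0; i < len; i++) {
        if (buf[i] > 127) high++;
        if (buf[i] == 0) {
            if (i % 2 == 0) zero_even++; else zero_odd++;
        }
    }
    /* "Lots of zeros": at least half of the even or of the odd positions. */
    if (zero_even * 4 >= len || zero_odd * 4 >= len) return GUESS_UTF16;
    if (high == 0) return GUESS_ASCII;
    if (is_valid_utf8(buf, len)) return GUESS_UTF8;
    /* "Small proportion": at most a quarter of the bytes are above 127. */
    if (high * 4 <= len) return GUESS_LATIN;
    return GUESS_UNKNOWN;
}
```

Calling `guess_encoding` on the bytes of a file read in binary mode gives one of the labels. It cannot tell the ISO-Latin-N variants apart, and it says nothing useful about multi-byte Far Eastern encodings.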

You might look at http://jchardet.sourceforge.net/

-- Richard
--
:wq
Jun 27 '08 #4
"Nordlöw" <pe*********@gmail.com> wrote in message
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
It can't be done efficiently. You need to run through the common encodings
and check whether each one decodes the text to plausible plaintext.
If you don't have a set of candidate encodings it is even more difficult.
Look up Markov modelling.
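
The first half of "run through the common encodings" might be sketched with the iconv.h API like this. iconv does not guess anything here; it is only used to rule out candidates whose decoding fails. The function name `decodes_as` is hypothetical, and the encoding names passed to `iconv_open` are assumed to be ones the local iconv implementation knows (the accepted names vary between implementations).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if in[0..len) decodes cleanly as `candidate`,
 * 0 if the bytes are invalid in that encoding, and -1 if iconv does not
 * support the candidate name. */
int decodes_as(const char *candidate, const char *in, size_t len)
{
    char out[4096];               /* scratch buffer; the converted text is discarded */
    char *inp = (char *)in;       /* iconv takes char **, but does not modify the input bytes */
    size_t inleft = len;
    int ok = 1;
    iconv_t cd;

    assert(candidate != NULL && (in != NULL || len == 0));

    cd = iconv_open("UTF-8", candidate);   /* arguments are (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    while (inleft > 0) {
        char *outp = out;
        size_t outleft = sizeof out;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
            && errno != E2BIG) {
            /* EILSEQ: invalid byte sequence; EINVAL: input ends mid-sequence */
            ok = 0;
            break;
        }
        /* E2BIG only means the scratch buffer filled up; loop with a fresh one */
    }
    iconv_close(cd);
    return ok;
}
```

Looping over a list of candidate names and keeping those for which `decodes_as` returns 1 gives the set of encodings the text could be in. Single-byte encodings such as ISO-8859-1 accept every byte sequence, so they are never eliminated this way; telling those apart still needs a plausibility check on the decoded text, which is where statistical models come in.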
--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Jun 27 '08 #5
