Determining possible encodings of a given text

How do I efficiently determine which possible encoding(s) a given text
is in? Can I use the iconv.h api somehow?

Thanks in advance,
Nordlöw
Jun 27 '08 #1
I don't know.
I hope an expert can answer.

"Nordlöw" <pe*********@gmail.com> wrote in message
news:e0**********************************@x41g2000hsb.googlegroups.com...
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
>
> Thanks in advance,
> Nordlöw
Jun 27 '08 #2
In comp.lang.c Nordloew <pe*********@gmail.com> wrote:
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
Sorry, but that's not a question about the C programming
language; it's about a specific task and libraries (which may
be written in C, but that doesn't make it on-topic). The basic
question would remain the same whether you used C, C++, Perl
or any other programming language.

So just a few hints: figuring out which encoding is used for a
file is probably a very difficult task, since it would require
that the program understands something about the content of
the file. It's probably possible to make a well-educated
guess if the file is long enough, but a method that always
gets it right is, as far as I can see, impossible. And libiconv
isn't going to be of any help, since it's for converting from an
already known encoding to another; it doesn't try to guess the
source encoding (except in the most trivial way, using the
locale-dependent character encoding when no source encoding
has been specified).

If you're interested in a more in-depth discussion it probably
would make sense to post to comp.programming instead.

Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
Jun 27 '08 #3
In article <e0**********************************@x41g2000hsb.googlegroups.com>,
Nordlöw <pe*********@gmail.com> wrote:
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
What do you need to know?

If it doesn't contain any bytes above 127, it's probably ASCII. If it
contains lots of zeros in the even or odd positions, it's probably
UTF-16. If it contains bytes above 127 *and* they're consistent with
UTF-8, then it's almost certainly UTF-8. If it contains a small
proportion of bytes above 127, it's quite likely some ISO-Latin-N
encoding. I don't know much about Far Eastern encodings.
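
A minimal C sketch of the heuristics above, checked in an order that works (UTF-16 first, because UTF-16 text made of ASCII characters also has no bytes above 127). The names `guess_encoding`, `is_utf8` and the `GUESS_*` labels are hypothetical, and the 25% thresholds for "lots of zeros" and "a small proportion" are assumptions picked for illustration, not values from this thread. It only produces a guess; it cannot tell the ISO-Latin-N variants apart.

```c
#include <stddef.h>

/* Hypothetical labels for this sketch; they come from no library. */
enum guess { GUESS_ASCII, GUESS_UTF16, GUESS_UTF8, GUESS_LATIN, GUESS_UNKNOWN };

/* Return 1 if buf is well-formed UTF-8 (rejecting overlong forms,
   surrogates and code points above U+10FFFF), otherwise 0. */
static int is_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80)                { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return 0;   /* sequence cut off at the end */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}

enum guess guess_encoding(const unsigned char *buf, size_t len)
{
    size_t high = 0, zero_even = 0, zero_odd = 0;

    if (len == 0) return GUESS_UNKNOWN;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 127) high++;
        if (buf[i] == 0) { if (i % 2 == 0) zero_even++; else zero_odd++; }
    }

    /* "Lots of zeros in the even or odd positions": assumed to mean
       more than a quarter of all bytes. */
    if (zero_even * 4 > len || zero_odd * 4 > len) return GUESS_UTF16;
    if (high == 0) return GUESS_ASCII;
    if (is_utf8(buf, len)) return GUESS_UTF8;
    /* "A small proportion of bytes above 127": assumed to mean under 25%. */
    if (high * 4 < len) return GUESS_LATIN;
    return GUESS_UNKNOWN;
}
```

For example, the seven bytes of `"Nordl\xf6w"` (the name in ISO-8859-1) are not valid UTF-8 and contain one high byte, so the sketch returns `GUESS_LATIN`; the eight bytes of `"Nordl\xc3\xb6w"` return `GUESS_UTF8`.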

You might look at http://jchardet.sourceforge.net/

-- Richard
--
:wq
Jun 27 '08 #4
"Nordlöw" <pe*********@gmail.com> wrote in message
> How do I efficiently determine which possible encoding(s) a given text
> is in? Can I use the iconv.h api somehow?
It can't be done efficiently. You need to run through the common encodings
and check whether each one yields plausible plain text.
If you don't have a set of candidate encodings, it is even more difficult.
Look up Markov modelling.
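
One way to run through a set of candidate encodings is to let iconv try to decode the text as each one and drop the candidates it rejects. The sketch below assumes a POSIX `iconv` (for example glibc's); the function name `decodes_as` is hypothetical. Note that this only rules encodings out: single-byte encodings such as ISO-8859-1 typically accept every byte sequence, so a pass says nothing about whether the result is meaningful text.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Return 1 if iconv can decode buf as `encoding`, 0 if it hits an invalid
   or truncated sequence, -1 if iconv does not know the encoding name. */
int decodes_as(const char *encoding, const char *buf, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", encoding);   /* arguments: (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)buf;   /* iconv() takes char **; the input is only read */
    size_t inleft = len;
    int ok = 1;

    while (inleft > 0) {
        char out[256];        /* scratch space; the converted text is discarded */
        char *outp = out;
        size_t outleft = sizeof out;

        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1
            && errno != E2BIG) {   /* E2BIG only means the scratch buffer filled up */
            ok = 0;                /* EILSEQ (invalid) or EINVAL (truncated) */
            break;
        }
    }

    iconv_close(cd);
    return ok;
}
```

Calling `decodes_as` for each name in a list such as `"UTF-8"`, `"UTF-16LE"`, `"ISO-8859-1"` leaves the encodings that are at least possible; choosing among the survivors still needs a plausibility check on the decoded text.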
--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Jun 27 '08 #5
