Identifying extended ASCII subset

Hi,

I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.
Opening the file in Windows' Notepad or in DOS, all accented letters
and symbols are wrong.
Any idea how to identify the subset used?
Is there some text editor which can easily cycle through all known
subsets, or even better: cycle through subsets automatically until it
finds a given test string with some accents and symbols?
If someone knows a solution which involves VB, C++, XML or whatever,
please don't hesitate to share it with me.

TIA,
K

Nov 7 '05 #1
kr********@matt.es wrote:
Hi,

I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.
Opening the file in Windows' Notepad or in DOS, all accented letters
and symbols are wrong.
Any idea how to identify the subset used?
Is there some text editor which can easily cycle through all known
subsets, or even better: cycle through subsets automatically until it
finds a given test string with some accents and symbols?

If you expect a computer to do this for you, you're probably dreaming. Since the actual character codes don't change, only the visual representations, someone has to look at the result to make a judgement.

If you have OCR code that will work on a memory bitmap, you could conceivably draw out the characters using a given code page and try to OCR the result, but even then I don't see any way to tell one 'close' result from another.

What is it you need to do to the text, that requires you to know what the codes represent?

--

Jim Mack
MicroDexterity Inc
www.microdexterity.com

Nov 7 '05 #2
On Mon, 07 Nov 2005 05:08:37 -0800, kristofvdw wrote:
I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.
Files contain bytes. Bytes are numerical values. There are no ASCII sets
or extended ASCII sets, as far as files are concerned. It's all in _our_ minds.
To make your program understand and tell one set from another, you need to
basically *teach* it the same "algorithm" _you_ are using to differentiate
those sets.
[...]


And avoid cross-posting to too many newsgroups at once. It makes your post
that much more irrelevant in many newsgroups.

V
Nov 7 '05 #3
In article <11********************@o13g2000cwo.googlegroups.com>,
<kr********@matt.es> wrote:
I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.
Opening the file in Windows' Notepad or in DOS, all accented letters
and symbols are wrong.
Any idea how to identify the subset used?


You can get Mozilla's character set guesser:

http://www.mozilla.org/projects/intl/chardet.html

There's a Java version too:

http://jchardet.sourceforge.net/

-- Richard
Nov 7 '05 #4
kr********@matt.es wrote:
Hi,

I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.
Opening the file in Windows' Notepad or in DOS, all accented letters
and symbols are wrong.
Any idea how to identify the subset used?
Is there some text editor which can easily cycle through all known
subsets, or even better: cycle through subsets automatically until it
finds a given test string with some accents and symbols?
If someone knows a solution which involves VB, C++, XML or whatever,
please don't hesitate to share it with me.


Open the file in a hexadecimal editor, pick some of the characters,
and use the Unicode charts (www.unicode.org) to identify which
encoding they are in.
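
If no hex editor is at hand, the same inspection can be roughed out in a few lines of Python. This is only a sketch: the function names and the sample bytes are made up for illustration. It lists every byte value of 0x80 or above with its count, plus a bit of surrounding ASCII text so you can guess which letter belongs there.

```python
from collections import Counter


def high_byte_report(data: bytes) -> list:
    """Return (byte value, count) pairs for bytes >= 0x80, most common first."""
    counts = Counter(b for b in data if b >= 0x80)
    return counts.most_common()


def context(data: bytes, value: int, width: int = 8) -> str:
    """Show the ASCII text around the first occurrence of a byte value."""
    pos = data.find(bytes([value]))
    if pos < 0:
        return ""
    chunk = data[max(0, pos - width): pos + width + 1]
    # Anything outside printable ASCII becomes '?' so the neighbours stay readable.
    return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in chunk)


if __name__ == "__main__":
    # Hypothetical sample record; a real run would read the file in binary mode.
    sample = b"GAR\xc7ON;JOS\xc9\r\n"
    for value, count in high_byte_report(sample):
        print(f"0x{value:02X} ({value}) x{count}: {context(sample, value)}")
```

The byte values it reports are what you then look up in the code page tables.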

Or just ask whoever created it.

///Peter

Nov 7 '05 #5
Hmm, you're right there; automating would be quite difficult and
would probably even take longer than browsing the sets manually... do
you know of any tool to do so?

The data are our clients', obtained through legacy software. Now I'm
putting the data in an Oracle DB, but it's impossible to get
information on which encoding the program uses. Lots of names and
addresses have accents in them, which we can't afford to lose.

Nov 8 '05 #6
Thanks for the suggestion, I'll look into that.
Unfortunately, the universal_charset_detector isn't built yet, and
doesn't support rare sets, so I don't have much hope...

Nov 8 '05 #7
kr********@matt.es wrote:
Hmm, you're right there; automating would be quite difficult and
would probably even take longer than browsing the sets manually... do
you know of any tool to do so?

The data are our clients', obtained through legacy software. Now I'm
putting the data in an Oracle DB, but it's impossible to get
information on which encoding the program uses. Lots of names and
addresses have accents in them, which we can't afford to lose.


Do you know for sure that there is more than one character-set encoding in use? And what would you change these to, once you knew what they represented?

Is this something you have to do just once, or is there a continuing need? For a one-time use, manually cycling through your choices may not be that painful.

If this is truly an 'extended ASCII' file, which might be a legacy DOS file, you could try an OEM character set. There are several OEM code pages, but CP 437 is the most common. Just viewing the file in an OEM font (like Ms Terminal or FoxPrint) will reveal whether this is the case. If it is, then calling the Win32 API function OemToCharBuff will translate the text into the current code page.
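
For a one-time job, the manual cycling can also be scripted outside the Win32 API. Below is a minimal Python sketch, assuming you know at least one name that must appear in the file. The candidate list and the function name are illustrative choices, not an exhaustive set of code pages.

```python
CANDIDATES = ["cp437", "cp850", "cp1252", "latin-1", "iso8859-15",
              "mac_roman", "cp037", "cp500"]


def find_encodings(data: bytes, expected: str, candidates=CANDIDATES) -> list:
    """Return the candidate encodings under which `expected` appears in the text."""
    hits = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this code page has no character for some byte in the data
        if expected in text:
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Hypothetical record; real use would pass open(path, "rb").read() instead.
    data = "FRANÇOIS".encode("cp850")
    print(find_encodings(data, "FRANÇOIS"))
```

Several code pages often agree on the common accented letters, so one test string may match more than one candidate. Testing a few different names (ideally with rarer characters) narrows the list.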

--
Jim
Nov 8 '05 #8
Apparently, the problem is worse than expected.
As Peter suggested, I took a look at the hex-codes.
I discovered that some characters which should be extended were
exported as basic ASCII codes!
For example, a name with "Ç" (code 199/hex C7) got exported as "G"
(code 71/hex 47).
So, when exporting from what appears to be an extended ASCII set, the
program uses a basic 7-bit ASCII set, and the extended codes wrap
around at 128 (for the example: 199-128=71).
What a moron! The programmer who managed to achieve this!
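
The arithmetic above is what you get when bit 7 of each byte is cleared. Here is a small Python sketch of that effect, assuming the 199 reported for "Ç" is its value in the source system (199/hex C7 is where ISO 8859-1 and Windows-1252 put "Ç"; CP437 and CP850 put it at hex 80). The function names are made up for illustration.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, as a 7-bit-only export would."""
    return bytes(b & 0x7F for b in data)


def candidates_for(exported: int) -> tuple:
    """The two original byte values that could have produced a 7-bit value."""
    return (exported, exported | 0x80)


original = "FRANÇOIS".encode("latin-1")   # Ç is 0xC7 (199) in Latin-1
exported = strip_high_bit(original)
print(exported)                            # the Ç comes out as G (0x47, 71)
print([hex(v) for v in candidates_for(ord("G"))])
```

The damage cannot be undone from the bytes alone: every "G" in the export could have been either "G" or "Ç", so recovery would need a dictionary of plausible names or a cleaner export.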

Thanks all for your contributions, I now have to search for the
original programmer and kill him...

Nov 8 '05 #9
On Tue, 8 Nov 2005, Jim Mack wrote, seen in comp.text.xml:
If this is truly an 'extended ASCII' file, which might be a legacy
DOS file, you could try an OEM character set. There are several OEM
code pages, but CP 437 is the most common.


In the USA, perhaps; but CP850 is the DOS codepage for a multinational
situation, at least in basically latin-1 usage - and has been for
quite some time.
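
To see where the two DOS code pages actually disagree, the upper half of each table can be compared directly. This is a sketch using Python's built-in cp437 and cp850 codecs; the helper name is made up.

```python
def differing_bytes(enc_a: str, enc_b: str) -> dict:
    """Map each byte 0x80-0xFF to (char in enc_a, char in enc_b) where they differ."""
    diffs = {}
    for value in range(0x80, 0x100):
        raw = bytes([value])
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs[value] = (a, b)
    return diffs


diffs = differing_bytes("cp437", "cp850")
for value, (a, b) in sorted(diffs.items()):
    print(f"0x{value:02X}: cp437={a!r} cp850={b!r}")
```

Bytes that do not show up in the output (0x80 for "Ç", for example) mean the same thing in both code pages, so only the listed values help tell the two apart.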

[f'ups proposed]
Nov 8 '05 #10
On 7 Nov 2005 05:08:37 -0800, kr********@matt.es wrote:
Hi,

I have to treat a given text file, but haven't got a clue which
extended ASCII set it is using.


The .es in your name is interesting

How much do you know about where this 'legacy' data came from ?

Was it Windows, was it DOS ... or maybe something mainframe-ish ?

What is the 'context' - for example a Turkish directory printed in
Spain ?
Nov 8 '05 #11
I suspect the original is from an IBM mainframe in EBCDIC, but we only
get a flat text file export.
Additionally, we have a tough time getting through to the original
programmers, so we'd have to work with what they provide us...

Nov 8 '05 #12
On 8 Nov 2005 05:14:52 -0800, kr********@matt.es wrote:
I suspect the original is from an IBM mainframe in EBCDIC, but we only
get a flat text file export.
Additionally, we have a tough time getting through to the original
programmers, so we'd have to work with what they provide us...


The original programmers will just mislead you

- you need to look into 'inferential logic'

Like re-inventing the rules that make sense of the mess

BTW - this sounds like a classic case of data transfer sabotage

Nov 8 '05 #13
In article <If******************************@comcast.com>,
Jim Mack <jm***@mdxi.nospam.com> wrote:
If you expect a computer to do this for you, you're probably dreaming.
Since the actual character codes don't change, only the visual
representations, someone has to look at the result to make a judgement.


It's not that bad. By comparing the frequencies of individual
characters, and pairs and triples and so on, against those found in
known documents, it should be possible to achieve good enough accuracy
for many purposes.

If the data is really random, not even a human will be able to
answer the question.
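
A very small sketch of the frequency idea in Python, using character pairs only. It assumes you have some reference text in the expected language. The reference sentence, the candidate list and all function names here are invented for illustration, and a real model would need far more text.

```python
import math
from collections import Counter


def bigrams(text: str) -> list:
    """Split text into overlapping lowercase character pairs."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(reference: str) -> Counter:
    """Count character pairs in text known to be in the expected language."""
    return Counter(bigrams(reference))


def score(text: str, model: Counter) -> float:
    """Average log-probability of the text's pairs, with add-one smoothing."""
    pairs = bigrams(text)
    if not pairs:
        return float("-inf")
    total = sum(model.values())
    vocab = len(model) + 1
    return sum(math.log((model[p] + 1) / (total + vocab)) for p in pairs) / len(pairs)


def best_encoding(data: bytes, candidates, model: Counter) -> str:
    """Pick the candidate whose decoding looks most like the reference text."""
    scored = []
    for name in candidates:
        try:
            scored.append((score(data.decode(name), model), name))
        except UnicodeDecodeError:
            continue  # candidate cannot decode these bytes at all
    if not scored:
        raise ValueError("no candidate could decode the data")
    return max(scored)[1]


# Hypothetical reference text and sample data.
reference = "Señor Muñoz vive en la calle María de Molina, Málaga, España."
model = train(reference)
data = "Señor María".encode("cp850")
print(best_encoding(data, ["cp850", "latin-1", "mac_roman"], model))
```

Code pages that agree on every byte present in the data will tie, so the result is a ranking to check by eye rather than a verdict.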

-- Richard
Nov 8 '05 #14
