Interpreting non-ascii characters.


Hello everybody,

I want to create a script which reads files in the
current directory and renames them according to some
scheme. The file names are in Russian - sometimes
the names are encoded as win-1251, sometimes as koi8-r etc.
I want to read in a file name and convert it to a list for
further processing. The problem is that Python treats
non-ascii characters as multibyte characters - for
example, the hex code for "Small Character A" in koi8-r is
0xc1, but Python interprets it as a sequence of
\xd0, \xb1 bytes.

What can I do so that Python interprets non-ascii
characters correctly?
Jul 17 '07 #1
On 18/07/2007 4:11 AM, ddtl wrote:
> Hello everybody,
>
> I want to create a script which reads files in the
> current directory and renames them according to some
> scheme. The file names are in Russian - sometimes
> the names are encoded as win-1251, sometimes as koi8-r etc.
You have a file system with 8-bit file names with no indication of
'codepage' or 'encoding', either globally or per file? Which operating
system are you using?
> I want to read in a file name and convert it to a list for
> further processing.
Read the file name from a text file? Or do you mean using e.g. glob.glob()
or os.listdir()?

What do you mean by "convert it to list"? Do you mean 'foo.txt' -> ['f',
'o', ....etc]??? Why?
> The problem is that Python treats
> non-ascii characters as multibyte characters - for
> example, hex code for "Small Character A" in koi8-r is
> 0xc1, but Python interprets it as a sequence of
> \xd0, \xb1 bytes.
Python is very unlikely to do that all by itself. Please show us the
script or whatever evidence you have. I strongly suggest that
immediately after "reading" a file name, you do
    print repr(file_name)
NOT
    print file_name
so that you can see *exactly* what you've got.
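
A minimal sketch of the same advice, assuming Python 3 (the thread itself uses Python 2, where print is a statement). The helper name describe() is hypothetical; it only wraps the built-in repr() and ascii() so that a name is shown with escapes instead of being reinterpreted by the terminal.

```python
def describe(name):
    """Return an unambiguous, ASCII-only rendering of a file name."""
    if isinstance(name, bytes):
        return repr(name)   # raw bytes are shown as \xNN escapes
    return ascii(name)      # text is shown with \uNNNN escapes


print(describe(b"\xc1.txt"))     # an undecoded (bytes) file name
print(describe("\u0430.txt"))    # an already decoded (str) file name
```

Seeing the escapes tells you whether you are holding raw bytes in some 8-bit encoding or text that has already been decoded.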

Are you sure about the \xb1??? Consider this:
>>> '\xc1'.decode('koi8-r')
u'\u0430'
>>> '\xc1'.decode('koi8-r').encode('utf8')
'\xd0\xb0'
>>>
Also:
>>> import sys; sys.stdout.encoding
'cp850' # Win XP Pro, command prompt
>>>
What do you get when you do that?
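
For readers on Python 3, the same check might look like the sketch below. It assumes Python 3 byte-string literals (b'...') and adds one extra step that is not in the session above: decoding the reported \xd0\xb1 pair as UTF-8 to see which letter it actually is.

```python
import sys

# koi8-r byte 0xc1 is CYRILLIC SMALL LETTER A (U+0430)
small_a = b"\xc1".decode("koi8-r")
print(ascii(small_a), small_a.encode("utf-8"))

# The pair reported in the question, read as UTF-8, is U+0431 (small BE),
# which is byte 0xc2 in koi8-r, not 0xc1.
reported = b"\xd0\xb1".decode("utf-8")
print(ascii(reported), reported.encode("koi8-r"))

# What the interpreter believes the console encoding to be
print(sys.stdout.encoding)
```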
> What can I do so that Python interprets non-ascii
> characters correctly?
Know how your non-ascii characters are encoded. Tell Python what to do
with them.

Read this:
http://www.amk.ca/python/howto/unicode

Hope this helps,
John
Jul 17 '07 #2
> I want to create a script which reads files in the
> current directory and renames them according to some
> scheme. The file names are in Russian - sometimes
> the names are encoded as win-1251, sometimes as koi8-r etc.
> I want to read in a file name and convert it to a list for
> further processing. The problem is that Python treats
Apparently os.listdir returns a list of Unicode objects if the pathname
you give it is a Unicode object. So, Python should then convert the
Russian filenames to Unicode, using whatever encoding necessary. (I
don't know, however, how Python would know what to do if the filenames
are in a bunch of different encodings, as you say.)

If you can get the filenames into Unicode, then you can manipulate them
however you like.
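
A small sketch of that behaviour, assuming Python 3, where the distinction is between str (text) and bytes paths rather than Python 2's unicode and str. The helper name list_both_ways() is hypothetical, and the temporary directory exists only so the example is self-contained.

```python
import os
import tempfile


def list_both_ways(directory):
    """List a directory once as text names and once as raw byte names."""
    as_text = os.listdir(directory)                # str path in -> str names out
    as_bytes = os.listdir(os.fsencode(directory))  # bytes path in -> bytes names out
    return sorted(as_text), sorted(as_bytes)


with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "foo.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)
    print(text_names, byte_names)
```

When the names on disk are in mixed or unknown encodings, asking for bytes and decoding each name yourself avoids relying on a single guessed encoding.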
--
For help at any time, press *H.
Jul 17 '07 #3
On Wed, 18 Jul 2007 08:29:58 +1000, John Machin <sj******@lexicon.net> wrote:
...
I have a bunch of directories and files from different systems
(each directory contains files from the same system) which are
encoded differently (though all of them are in Russian), so the
following encodings are present: koi8-r, win-1251, utf-8 etc.,
and I want to transliterate them into regular ASCII so that they
would be readable regardless of the system. Personally I use both
Linux and Windows. So what I do is read a file name using os.listdir,
convert it to a list ('foo.txt' => ['f', 'o', ... , 't'], except that
file names are in Russian), transliterate (some letters in Russian
have to be transliterated into 2 or even 3 Latin letters),
and then rename the file.
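
The workflow described above could be sketched roughly as follows, assuming Python 3 and assuming the encoding of each directory is already known. The table is an illustrative romanization chosen for this example, not the scheme actually used in the thread, and the names TABLE, transliterate() and ascii_name() are hypothetical.

```python
# Illustrative table (an assumption): one Cyrillic letter may map to
# zero, one, or several Latin letters.
TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Replace Russian letters with Latin ones, leaving other characters alone."""
    out = []
    for ch in text:                      # a str iterates per character
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)               # not a Russian letter: keep as is
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


def ascii_name(raw_name, encoding):
    """Decode a raw bytes file name with a known encoding, then transliterate."""
    return transliterate(raw_name.decode(encoding))


print(ascii_name("привет.txt".encode("koi8-r"), "koi8-r"))
```

Iterating over the decoded string character by character makes an explicit conversion to a list unnecessary. The raw names could come from os.listdir() called with a bytes path, and the result could then be passed to os.rename().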

It seems, though, that I have solved the problem after all. I thought
that my Windows (2000) used win-1251 and Linux used koi8-r, and
because of that I couldn't understand the strange codes I got while
experimenting with locally created Cyrillic file names. In fact
Linux uses utf-8 and Windows uses cp866, so once I realized that
and read the article you suggested, I solved the problem.
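
To illustrate why the codes looked strange, here is a small sketch (assuming Python 3) that encodes the same three letters in the four encodings mentioned in this thread, and then decodes koi8-r bytes with the wrong codec. The sample text is an arbitrary choice for illustration.

```python
sample = "\u0430\u0431\u0432"   # the first three lowercase Russian letters

# The same text produces different bytes under each encoding.
encoded = {codec: sample.encode(codec)
           for codec in ("koi8-r", "cp1251", "cp866", "utf-8")}
for codec, raw in encoded.items():
    print(codec, raw)

# Decoding with the wrong codec does not fail; it silently yields other letters.
wrong = encoded["koi8-r"].decode("cp1251")
print(ascii(wrong))
```

Because the 8-bit Cyrillic codecs accept almost any byte sequence, a wrong guess usually produces plausible-looking but incorrect letters instead of an error.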

Thanks.

Jul 18 '07 #4
