Where is the ucs-32 codec?

beni.cherniavsky

Python seems to be missing a UCS-32 codec, even in wide builds (not
that the build should matter).
Is there some deep reason or should I just contribute a patch?

If it's just a bug, should I call the codec 'ucs-32' or 'utf-32'? Or
both (aliased)?
There should be '-le' and '-be' variants, I suppose. Should there be a
variant without explicit endianness, using a BOM to decide (like
'utf-16')?
And it should combine surrogates into valid characters (on all builds),
like the 'utf-8' codec does, right?

--
Beni Cherniavsky <cb**@users.sf.net>, who can only read email on
weekends.

Jun 4 '06 #1

Erik Max Francis

be**************@gmail.com wrote:

Python seems to be missing a UCS-32 codec, even in wide builds (not
that the build should matter).
Is there some deep reason or should I just contribute a patch?

If it's just a bug, should I call the codec 'ucs-32' or 'utf-32'? Or
both (aliased)?
There should be '-le' and '-be' variants, I suppose. Should there be a
variant without explicit endianness, using a BOM to decide (like
'utf-16')?
And it should combine surrogates into valid characters (on all builds),
like the 'utf-8' codec does, right?

Note that UTF-32 is UCS-4. UCS-32 ("Universal Character Set in 32
octets") wouldn't make much sense.

Not that Python has a UCS-4 encoding available either. I'm really not
sure why.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
Could it be / That we need loving to survive
-- Neneh Cherry

Jun 4 '06 #2

Méta-MCI

Hi!

Look at: http://cjkpython.berlios.de (iconvcodec)

(Serge Orlov has built a version for Python 2.4 "special for me"; thanks to
him).
@-salutations
--
Michel Claveau

Jun 4 '06 #3

Martin v. Löwis

be**************@gmail.com wrote:

Python seems to be missing a UCS-32 codec, even in wide builds (not
that the build should matter).
Is there some deep reason or should I just contribute a patch?

The only reason is that nobody has needed one so far, and because
it is quite some work to do if done correctly. Why do you need it?

There should be '-le' and '-be' variants, I suppose. Should there be a
variant without explicit endianness, using a BOM to decide (like
'utf-16')?

Right.

And it should combine surrogates into valid characters (on all builds),
like the 'utf-8' codec does, right?

Right.
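
For illustration only: later Python releases (2.6 and 3.x) did gain UTF-32 codecs, so the variants discussed above can be observed directly. The snippet below is a sketch of that behaviour in Python 3, not the patch discussed in this thread; the variable names are made up for the example.

```python
import codecs

text = "A\U0001F600"  # one BMP character and one character outside the BMP

le = text.encode("utf-32-le")   # explicit little-endian, no BOM written
be = text.encode("utf-32-be")   # explicit big-endian, no BOM written
auto = text.encode("utf-32")    # native byte order, prefixed with a BOM

# The plain 'utf-32' encoder starts its output with a byte order mark.
has_bom = auto[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# The plain 'utf-32' decoder uses a leading BOM to pick the byte order.
decoded_from_be_bom = (codecs.BOM_UTF32_BE + be).decode("utf-32")
```

Every character takes exactly four bytes, so `le` and `be` are 8 bytes long here and `auto` is 12 (the BOM plus two characters).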

Also, it should support the incremental interface (as any multi-byte
codec should).
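
As a rough sketch of what "incremental" means here, assuming a Python 3 interpreter with its built-in 'utf-32-le' codec: the decoder is fed byte chunks that deliberately do not line up with the 4-byte character boundaries, and it has to buffer the leftovers between calls. The chunk size of 3 is an arbitrary choice for the example.

```python
import codecs

original = "h\u00e9llo \U0001F600"
data = original.encode("utf-32-le")

# Create a stateful decoder that keeps incomplete characters between calls.
decoder = codecs.getincrementaldecoder("utf-32-le")()

pieces = []
for start in range(0, len(data), 3):      # 3-byte chunks split every character
    pieces.append(decoder.decode(data[start:start + 3]))
pieces.append(decoder.decode(b"", final=True))  # flush; fails on leftover bytes

result = "".join(pieces)
```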

If you want it complete, it should also support line-oriented input.
Notice that .readline/.readlines is particularly difficult to implement,
as you can't rely on the underlying stream's .readline implementation
to provide meaningful results.
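
A small sketch of why the byte stream's own .readline cannot be trusted, assuming Python 3 and its 'utf-32-le' codec: the byte 0x0A can occur inside a character (U+010A encodes as 0A 01 00 00), and even a real newline is four bytes long, so splitting at the first 0x0A byte cuts characters apart. Decoding first and splitting afterwards (here through io.TextIOWrapper) avoids that.

```python
import io

text = "\u010a first\nsecond\n"      # U+010A contains the byte 0x0A when encoded
data = text.encode("utf-32-le")

# Byte-level readline stops at the first 0x0A byte, in the middle of U+010A.
raw_line = io.BytesIO(data).readline()

# Decoding first, then splitting on decoded newlines, gives the real lines.
wrapped = io.TextIOWrapper(io.BytesIO(data), encoding="utf-32-le", newline="")
lines = wrapped.readlines()
```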

While we are discussing problems: there also is the issue whether
.readline/.readlines should take the additional Unicode linebreak
characters into account (e.g. U+2028, U+2029), and if so, whether
that should be restricted to "universal newlines" mode.
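
To make the question concrete, here is how present-day Python 3 behaves (an illustration, not a statement about what the codec under discussion should do): str.splitlines treats U+2028 and U+2029 as line boundaries, while readline on a text stream only stops at "\n".

```python
import io

text = "one\u2028two\u2029three\nfour"

# splitlines() honours the Unicode line and paragraph separators.
split_all = text.splitlines()

# readline() on a text stream only recognises the conventional newline.
first_line = io.StringIO(text).readline()
```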

Regards,
Martin

Jun 5 '06 #4

Erik Max Francis

Martin v. Löwis wrote:

The only reason is that nobody has needed one so far, and because
it is quite some work to do if done correctly. Why do you need it?

Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
pretty straightforward, and UTF-16 is already supported.
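
The core of that conversion, combining surrogate pairs into single code points, can be sketched as follows. This is a minimal illustration in Python 3; `combine_surrogates` is a hypothetical helper written for this example, not part of any codec, and it does none of the stream or error-handling work a real codec needs.

```python
import struct


def combine_surrogates(units):
    """Turn a sequence of UTF-16 code units into a list of code points."""
    code_points = []
    iterator = iter(units)
    for unit in iterator:
        if 0xD800 <= unit <= 0xDBFF:                 # high (lead) surrogate
            low = next(iterator, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        elif 0xDC00 <= unit <= 0xDFFF:               # low surrogate on its own
            raise ValueError("unpaired low surrogate")
        else:                                        # ordinary BMP code unit
            code_points.append(unit)
    return code_points


# 'H' followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
utf16 = "H\U0001F600".encode("utf-16-le")
units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)
points = combine_surrogates(units)
```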

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
Democritus may have come from Abdera, but he was no dummy.
-- Carl Sagan

Jun 5 '06 #5

Martin v. Löwis

Erik Max Francis wrote:

The only reason is that nobody has needed one so far, and because
it is quite some work to do if done correctly. Why do you need it?

Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
pretty straightforward, and UTF-16 is already supported.

I would like to see it correct, unlike the current UTF-16 codec. Perhaps
whoever contributes a UTF-32 codec could also deal with the defects of
the UTF-16 codec.

Regards,
Martin

Jun 5 '06 #6

cben

Méta-MCI wrote:

Hi!

Look at: http://cjkpython.berlios.de (iconvcodec)

(Serge Orlov has built a version for Python 2.4 "special for me"; thanks to
him).

Thanks for the pointer.
iconvcodec should do the job, but I still want a native implementation
to be included with every Python.

Jun 9 '06 #7

cben

Martin v. Löwis wrote:

Erik Max Francis wrote:
The only reason is that nobody has needed one so far, and because
it is quite some work to do if done correctly. Why do you need it?

Somebody asked me about generating UTF-32 (he had no choice of
output format).
I was about to propose the obvious ``u.encode('utf-32')`` but
discovered it's missing.
Someone proposed 'unicode-internal', but it depends on the build and is
an ugly answer.
Next time, I want Guido's Time Machine to just work, so I have to fix
this ;-).

Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
pretty straightforward, and UTF-16 is already supported.

I would like to see it correct, unlike the current UTF-16 codec. Perhaps
whoever contributes a UTF-32 codec could also deal with the defects of
the UTF-16 codec.

Now this is interesting, as I hoped to base my code on UTF-16 (and
perhaps UTF-8 for combining surrogates)... Can you elaborate?

I could attempt to fix UTF-16 as well but I don't have the expertise to
choose the right behaviour,
so you'll have to specify precisely what it should do (that it doesn't
do now).

Jun 9 '06 #8

Fredrik Lundh

cb**@users.sf.net wrote:

Somebody asked me about generating UTF-32 (he didn't have choice of the
output format). I was about to propose the obvious ``u.encode('utf-32')``
but discovered it's missing.

hint 1:

>>> import array
>>> u = u"Hello"
>>> a = array.array("I", map(ord, u))
>>> a.tostring()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'
>>> a.byteswap()
>>> a.tostring()
'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o'

hint 2:

>>> import sys
>>> sys.byteorder
'little'
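
The two hints can be combined into one small function. The version below is a sketch for Python 3 that uses struct instead of array so that the item size is always four bytes and the byte order is explicit; `encode_utf32` is a hypothetical name chosen for this example, and the function writes no BOM.

```python
import struct
import sys


def encode_utf32(text, byteorder=sys.byteorder):
    """Pack each code point of text as a 4-byte unsigned integer (no BOM)."""
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")
    prefix = "<" if byteorder == "little" else ">"
    # e.g. "<5I" packs five little-endian 32-bit unsigned integers
    return struct.pack("%s%dI" % (prefix, len(text)), *map(ord, text))


little = encode_utf32("Hello", "little")
big = encode_utf32("Hello", "big")
```

Note that array's "I" typecode only guarantees a minimum size, which is one reason the sketch uses struct with an explicit byte-order prefix.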

</F>

Jun 9 '06 #9

Martin v. Löwis

cb**@users.sf.net wrote:

I would like to see it correct, unlike the current UTF-16 codec. Perhaps
whoever contributes a UTF-32 codec could also deal with the defects of
the UTF-16 codec.

Now this is interesting, as I hoped to base my code on UTF-16 (and
perhaps UTF-8 for combining surrogates)... Can you elaborate?

The codec doesn't do line-oriented input correctly (i.e. readline);
it raises NotImplementedError.

Regards,
Martin

Jun 9 '06 #10
