UTF16 codec doesn't round-trip?

(My Python uses UTF16 natively; can someone with UTF32 Python let me
know if that behaves differently?)

>>> import codecs
>>> u'\ud800' # part of surrogate pair
u'\ud800'
>>> codecs.utf_16_be_encode(_)[0]
'\xd8\x00'
>>> codecs.utf_16_be_decode(_)[0]
Traceback (most recent call last):
  File "<input>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

If the ascii can't be recognized as UTF16, then surely the codec
shouldn't have allowed it to be encoded in the first place? I could
understand if it was trying to decode ascii into (native) UTF32.

On a similar note, if you are using UTF32 natively, are you allowed to
have raw surrogate escape sequences (paired or otherwise) in unicode
literals?

Thanks

John

Jul 19 '05 #1


Martin v. Löwis

John Perks and Sarah Mount wrote:

> If the ascii can't be recognized as UTF16, then surely the codec
> shouldn't have allowed it to be encoded in the first place? I could
> understand if it was trying to decode ascii into (native) UTF32.

Please don't call the thing you are trying to decode "ascii". ASCII
is the name of the American Standard Code for Information Interchange;
it is a 7-bit code, and what you are trying to decode certainly isn't
ascii. Call it "bytes" instead.

So you are trying to decode bytes as UTF-16. The bytes you have
definitely are not UTF-16 - the specific sequence of bytes is invalid
in UTF-16. Therefore, the codec is right to reject it when decoding.

It might be considered as a bug that the codec encoded the characters
in the first place.
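
For comparison, here is a minimal sketch of how the same lone surrogate behaves under Python 3. It assumes Python 3.4 or later, which is not the interpreter used in this thread, so treat it as an illustration rather than a description of the behaviour reported above. There the strict encoder refuses the character, and the 'surrogatepass' error handler is the explicit opt-in for letting it through in both directions.

```python
lone = "\ud800"  # unpaired high surrogate, as in the original post

# Strict encoding refuses the lone surrogate rather than emit invalid UTF-16.
try:
    lone.encode("utf-16-be")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# 'surrogatepass' deliberately lets the surrogate through when encoding.
raw = lone.encode("utf-16-be", "surrogatepass")

# Strict decoding of those two bytes fails, matching the traceback in the post.
try:
    raw.decode("utf-16-be")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With the same handler on the decoding side, the value round-trips.
round_tripped = raw.decode("utf-16-be", "surrogatepass")

print(strict_encode_ok, raw, strict_decode_ok, round_tripped == lone)
```

The design point is that the strict codec is symmetric: it neither produces nor accepts the invalid byte sequence, so a round trip only succeeds when both sides ask for it.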

> On a similar note, if you are using UTF32 natively, are you allowed to
> have raw surrogate escape sequences (paired or otherwise) in unicode
> literals?

Python accepts such literals.
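
As a rough illustration of what accepting such literals means, the sketch below again assumes Python 3.4 or later rather than the wide Python 2 build asked about. In that setting a hand-written surrogate pair stays as two separate code points in the string, and it only becomes the single character it stands for after a trip through UTF-16 bytes.

```python
pair = "\ud800\udc00"    # two surrogate escapes written out by hand
astral = "\U00010000"    # the code point that this pair encodes in UTF-16

# The literal is accepted, but it is kept as two code points, not one.
print(len(pair), len(astral))
print(pair == astral)

# Encoded with 'surrogatepass', the pair yields well-formed UTF-16 bytes,
# which a strict decoder then reads back as the single astral character.
raw = pair.encode("utf-16-be", "surrogatepass")
print(raw.decode("utf-16-be") == astral)
```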

Regards,
Martin

Jul 19 '05 #2
