UTF16 codec doesn't round-trip?

(My Python uses UTF16 natively; can someone with UTF32 Python let me
know if that behaves differently?)
>>> import codecs
>>> u'\ud800' # part of surrogate pair
u'\ud800'
>>> codecs.utf_16_be_encode(_)[0]
'\xd8\x00'
>>> codecs.utf_16_be_decode(_)[0]
Traceback (most recent call last):
  File "<input>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1:
unexpected end of data

If the ascii can't be recognized as UTF16, then surely the codec
shouldn't have allowed it to be encoded in the first place? I could
understand if it was trying to decode ascii into (native) UTF32.

On a similar note, if you are using UTF32 natively, are you allowed to
have raw surrogate escape sequences (paired or otherwise) in unicode
literals?

Thanks

John
Jul 19 '05 #1
John Perks and Sarah Mount wrote:
> If the ascii can't be recognized as UTF16, then surely the codec
> shouldn't have allowed it to be encoded in the first place? I could
> understand if it was trying to decode ascii into (native) UTF32.

Please don't call the thing you are trying to decode "ascii". ASCII
is the name of the American Standard Code for Information Interchange;
it is a 7-bit code, and what you are trying to decode certainly isn't
ascii. Call it "bytes" instead.

So you are trying to decode bytes as UTF-16. The bytes you have
definitely are not UTF-16 - the specific sequence of bytes is invalid
in UTF-16. Therefore, the codec is right to reject it when decoding.

It might be considered a bug that the codec encoded the characters
in the first place.
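
For comparison, here is a minimal sketch of the same round trip on a current interpreter. It assumes Python 3.4 or later, which is not what the thread was written against; the helper names `try_encode` and `try_decode` are made up for illustration. Under that assumption, the strict codec refuses the lone surrogate in both directions, and the `surrogatepass` error handler is the explicit opt-in for the old behaviour.

```python
# Sketch assuming Python 3.4+; the thread itself is about Python 2.
# try_encode / try_decode are hypothetical helpers, not part of any library.

lone = '\ud800'  # unpaired high surrogate, as in the original post


def try_encode(text, errors='strict'):
    """Return the UTF-16-BE bytes, or None if the codec refuses."""
    try:
        return text.encode('utf-16-be', errors)
    except UnicodeEncodeError:
        return None


def try_decode(data, errors='strict'):
    """Return the decoded text, or None if the codec refuses."""
    try:
        return data.decode('utf-16-be', errors)
    except UnicodeDecodeError:
        return None


# Strict encoding rejects the lone surrogate.
strict_bytes = try_encode(lone)

# 'surrogatepass' writes the surrogate's code unit out unchanged.
raw_bytes = try_encode(lone, 'surrogatepass')

# Strict decoding rejects those bytes, as in the traceback in the question.
strict_text = try_decode(raw_bytes)

# Decoding with 'surrogatepass' restores the original string.
round_trip = try_decode(raw_bytes, 'surrogatepass')
```

With these assumptions, `strict_bytes` and `strict_text` are both `None`, `raw_bytes` is `b'\xd8\x00'`, and `round_trip` equals `lone`, so the codec round-trips only when both sides opt in.
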
> On a similar note, if you are using UTF32 natively, are you allowed to
> have raw surrogate escape sequences (paired or otherwise) in unicode
> literals?


Python accepts such literals.
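
As a rough illustration, the following sketch shows how such literals behave on Python 3.3 or later. That version is an assumption: those releases no longer distinguish UTF-16 and UTF-32 builds, so this does not reproduce the Python 2 builds discussed above. The variable names are illustrative only.

```python
# Sketch assuming Python 3.3+, where strings are sequences of code points.

lone = '\ud800'          # a single unpaired surrogate escape
pair = '\ud800\udc00'    # high and low surrogate escapes written side by side
astral = '\U00010000'    # the code point that pair would denote in UTF-16

# Both literals are accepted by the compiler.
lone_length = len(lone)
pair_length = len(pair)

# The two escapes stay separate code points; they are not merged.
same_string = (pair == astral)

# Written out as UTF-16 code units, the pair matches the real character.
pair_bytes = pair.encode('utf-16-be', 'surrogatepass')
astral_bytes = astral.encode('utf-16-be')
```

Here `lone_length` is 1, `pair_length` is 2, `same_string` is `False`, and `pair_bytes` equals `astral_bytes` (`b'\xd8\x00\xdc\x00'`), so the distinction exists only at the string level and vanishes once the data is serialized.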

Regards,
Martin
Jul 19 '05 #2
