a unicode question?

Hello,
I have a unicode string that I want to convert to an ansi string, but
the conversion raises an exception.
Could you help me?

## I want to change s1 to s2.

s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

Apr 10 '06 #1
What do you mean by "ansi string"?

Here is a superficially not-unreasonable answer to your more specific
question:

# >>> s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s3 = s1.encode('latin1')
# >>> s2 == s3
# True

But what are you really trying to achieve? Where does your Unicode data
come from? What ranges of characters do you expect it to contain? You
need to crunch it into an 8-bit representation because ... what?
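
For readers on Python 3, the session above can be sketched roughly as follows. The thread itself uses Python 2, where `u'...'` is text and `'...'` is a byte string; the `b'...'` literal here is an assumption standing in for the Python 2 `str`. Latin-1 maps each code point from U+0000 to U+00FF to the single byte with the same value, which is why the encode reproduces s2.

```python
# Python 3 sketch of the session above (the thread itself uses Python 2).
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '    # text (Python 2: unicode)
s2 = b'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '   # bytes (Python 2: str)

# Latin-1 turns each code point below 256 into the byte of the same value.
s3 = s1.encode('latin-1')
print(s3 == s2)

# Every character in s1 has a code point below 256, so the encode cannot fail.
print(all(ord(ch) < 256 for ch in s1))
```

This only shows that the round trip is mechanical. It says nothing about whether latin1 is the right interpretation of the data.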

Apr 10 '06 #2
Mr. John Machin, Thank you very much!

Apr 10 '06 #3
Mr. John Machin

This question comes from the following code. I use PyXML to build a DOM
tree.

from xml.dom.ext.reader import HtmlLib
doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
title_string = title_elem.firstChild.data
print title_string

# the title_string is unicode, but it is not "latin1" code, so I want to change it.

Apr 10 '06 #4

zd****@xinces.com wrote:
Mr. John Machin

This question comes from the following code. I use PyXML to build a DOM
tree.

from xml.dom.ext.reader import HtmlLib
doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
title_string = title_elem.firstChild.data
print title_string

# the title_string is unicode, but it is not "latin1" code, so I want to change it.


Errr, but the title of the page is written in Chinese and it is not
supposed to be crammed into latin1 encoding. What are you trying to do
with the string after you squeezed Chinese into latin1?
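
To illustrate the point, here is a minimal Python 3 sketch of what happens when correctly decoded Chinese text is pushed into latin1. The variable names are hypothetical, and the two characters used (U+4E2D U+56FD, "China") are just a stand-in for a real title.

```python
# Hypothetical, correctly decoded Chinese text: U+4E2D U+56FD ("China").
title = '\u4e2d\u56fd'

# Latin-1 only covers code points below 256, so a strict encode fails.
try:
    title.encode('latin-1')
    outcome = 'encoded'
except UnicodeEncodeError as exc:
    outcome = 'failed: ' + exc.reason
print(outcome)

# errors='replace' avoids the exception but destroys the text.
lossy = title.encode('latin-1', errors='replace')
print(lossy)

# An encoding that actually covers Chinese keeps the information.
gb = title.encode('gb2312')
print(gb)
```

The strict encode raises, the lossy one yields only question marks, and gb2312 produces two bytes per character.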

Apr 10 '06 #5
Errrrrrrr, it gets worse: not only is the title written in Chinese, it
is encoded as gb2312 -- here is the repr() of the first few chunks:

"<html>\n<head>\n <title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : \xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - \xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

and here is what you get after that_guff.decode('gb2312'):

u"<html>\n<head>\n <title>\u4e2d\u56fd\u77f3\u5316(600028) : \u5185\u90e8\u4eba\u5458\u6301\u80a1 - \u641c\u72d0\u80a1\u7968</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

The first 2 characters of the title are recognisable both visually on
the browser title and in the unicode as "zhong guo" i.e. China.

BUT the OP's first message is interpreting that gb2312-encoded stuff as
Unicode:
s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

*SOMEBODY* is seriously deluded, and it ain't me, and it ain't Serge
:-)
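
If the goal is to recover the real title from a string like s1, one possible repair is to undo the wrong interpretation and then decode with the charset the page declares. This is a Python 3 sketch under the assumption that each character of s1 holds exactly one original gb2312 byte; the names `raw` and `title` are hypothetical.

```python
# s1 as posted: gb2312 bytes that were misread as one character per byte.
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

# Step 1: latin-1 maps each code point below 256 back to the same byte value,
# which recovers the original byte string.
raw = s1.encode('latin-1')

# Step 2: decode those bytes with the charset the page declares.
title = raw.decode('gb2312')

# ascii() shows the result as escapes, for comparison with the decoded
# repr quoted earlier in this post.
print(ascii(title))
```

A more robust fix would be to decode the page bytes as gb2312 before they ever reach the DOM builder, so that the mis-decoded string never exists.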

.... and yes Peter, info also travels faster from China than it does
from Armenia :-())

Apr 10 '06 #6
John Machin wrote:
... and yes Peter, info also travels faster from China than it does
from Armenia :-())


Q: Can info travel faster from Armenia than from China?
Radio Yerevan: In principle, yes. Just make sure that it doesn't go the
other way round the globe or meet some friends on the way...
Apr 11 '06 #7
