How to Split Chinese Character with backslash representation?


Hi all,

I was trying to split a string that
represents the Chinese characters below:

>>> str = '\xc5\xeb\xc7\xd5\xbc'
>>> print str2,
???
>>> fields2 = split(r'\\',str)
>>> print fields2,
['\xc5\xeb\xc7\xd5\xbc']

But why doesn't the split function here seem
to do the job of producing the desired result:

['\xc5','\xeb','\xc7','\xd5','\xbc']

Regards,
-- Edward WIJAYA
SINGAPORE

------------ Institute For Infocomm Research - Disclaimer -------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
--------------------------------------------------------
Oct 27 '06 #1
Wijaya Edward wrote:
Hi all,

I was trying to split a string that
represents the Chinese characters below:

>>> str = '\xc5\xeb\xc7\xd5\xbc'
>>> print str2,
???
>>> fields2 = split(r'\\',str)
>>> print fields2,
['\xc5\xeb\xc7\xd5\xbc']

But why doesn't the split function here seem
to do the job of producing the desired result:

['\xc5','\xeb','\xc7','\xd5','\xbc']

Depends on what you want to do with them:

>>> string = '\xc5\xeb\xc7\xd5\xbc'
>>> for char in string:
...     print char
...
Å
ë
Ç
Õ
¼
>>> list_of_characters = list(string)
>>> list_of_characters
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc']
>>> for char in string:
...     char
...
'\xc5'
'\xeb'
'\xc7'
'\xd5'
'\xbc'
>>> for char in list_of_characters:
...     print char
...
Å
ë
Ç
Õ
¼
>>> string[3]
'\xd5'
>>> string[1:3]
'\xeb\xc7'

Basically, your characters are already separated into a list of
characters; that's effectively what a string is (but with a few more
methods applicable only to lists of characters, not to other lists).
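
For readers on Python 3 (an assumption, since the session above is Python 2), the same point can be sketched with a bytes object. Iterating over bytes there yields integers rather than one-character strings, so this sketch slices instead. The names `data` and `pieces` are made up for illustration.

```python
# Python 3 sketch: a bytes object is already a sequence of single bytes.
data = b"\xc5\xeb\xc7\xd5\xbc"

# Slicing one byte at a time keeps every element a bytes object of length 1.
pieces = [data[i:i + 1] for i in range(len(data))]

print(pieces)      # one element per byte, no splitting function needed
print(data[3:4])   # a single byte, as a bytes object
print(data[1:3])   # a two-byte slice
```

No regex or split call is involved; indexing and slicing are enough when every byte should become its own element.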
Oct 27 '06 #2

Thanks, but my intention is to strictly use regex,
since there are separators I need to treat as delimiters,
especially in a case like this:
>>> str = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>> field = list(str)
>>> print field
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

What's the best way to do it?

-- Edward WIJAYA
SINGAPORE

Oct 27 '06 #3
On 10/27/06, Wijaya Edward <ew*****@i2r.a-star.edu.sg> wrote:
Thanks, but my intention is to strictly use regex,
since there are separators I need to treat as delimiters,
especially in a case like this:
>>> str = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>> field = list(str)
>>> print field
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

What's the best way to do it?

If the case is very simple, why not just replace '-' with '', for example:

str.replace('-', '')

--
I like python!
UliPad <<The Python Editor>>: http://wiki.woodpecker.org.cn/moin/UliPad
My Blog: http://www.donews.net/limodou
Oct 27 '06 #4
limodou wrote:
If the case is very simple, why not just replace '-' with '', for example:

str.replace('-', '')
Except he appears to want the Chinese characters as elements of the
list, and English words as elements of the list. Note carefully the
last two elements in his desired list. I'm still puzzling this one...
Oct 27 '06 #5
Wijaya Edward wrote:
Since there are separators I need to treat as delimiters,
especially in a case like this:
>>> str = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>> field = list(str)
>>> print field
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-', 'B', 'A', 'R']

What we want as the output is this instead:
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

>>> import re
>>> s = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
>>> re.findall("(?i)[a-z]+|[\xA0-\xFF]", s)
['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

the RE matches either a sequence of latin characters, *or* a single
non-ASCII character.

you may want to adjust the character ranges to match the encoding you're
using, and your definition of non-chinese words.
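
A rough Python 3 equivalent of the findall call above, assuming the data is held as bytes (the original session is Python 2, where str is a byte string). The helper name `split_mixed` is hypothetical, and the byte range is the same one used above, so it would need the same adjustment for other encodings.

```python
import re

# Either a run of ASCII letters, or one single byte in the range 0xA0-0xFF.
# The pattern is a bytes pattern, so it can only be applied to bytes input.
TOKEN = re.compile(rb"[A-Za-z]+|[\xa0-\xff]")


def split_mixed(data):
    """Return ASCII words and individual high bytes, dropping everything else."""
    return TOKEN.findall(data)


tokens = split_mixed(b"\xc5\xeb\xc7\xd5\xbc--FOO--BAR")
print(tokens)
```

The dashes never match either alternative, so they act as delimiters without being named in the pattern.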

</F>

Oct 27 '06 #6
On 10/27/06, Cameron Walsh <ca***********@gmail.com> wrote:
Except he appears to want the Chinese characters as elements of the
list, and English words as elements of the list. Note carefully the
last two elements in his desired list. I'm still puzzling this one...
Oh, I see. I made a mistake.

Oct 27 '06 #7
"Wijaya Edward" <ew*****@i2r.a-star.edu.sg> wrote in message
news:ma******** *************** *************** *@python.org...
Hi all,

I was trying to split a string that
represents the Chinese characters below:

>>> str = '\xc5\xeb\xc7\xd5\xbc'
>>> print str2,
???
>>> fields2 = split(r'\\',str)
>>> print fields2,
['\xc5\xeb\xc7\xd5\xbc']

But why doesn't the split function here seem
to do the job of producing the desired result:

['\xc5','\xeb','\xc7','\xd5','\xbc']

There are no backslash characters in the string str, so split finds nothing
to split on. I know it looks like there are, but the backslashes shown are
part of the \x escape sequence for defining characters when you can't or
don't want to use plain ASCII characters (such as in your example in which
the characters are all in the range 0x80 to 0xff). Look at this example:
>>> s = "\x40"
>>> print s
@

I defined s using the escaped \x notation, but s does not contain any
backslashes, it contains the '@' character, whose ordinal character value is
64, or 40hex.
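
The same point as a small Python 3 sketch (the thread itself uses Python 2, and the use of re.split here is an assumption about which split the original post called). The names `raw`, `text` and `parts` are illustrative only.

```python
import re

s = "\x40"     # escape sequence: one character, '@'
raw = r"\x40"  # raw string: four characters, starting with a real backslash

print(len(s), "\\" in s)      # 1 False
print(len(raw), "\\" in raw)  # 4 True

# Splitting on a literal backslash finds nothing in an escaped string,
# so the whole string comes back as the only element.
text = "\xc5\xeb\xc7\xd5\xbc"
parts = re.split(r"\\", text)
print(len(parts))             # 1
```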

Also, str is not the best name for a string variable, since this masks the
built-in str type.

-- Paul
Oct 27 '06 #8
Paul McGuire wrote:
"Wijaya Edward" <ew*****@i2r.a-star.edu.sg> wrote in message
news:ma******** *************** *************** *@python.org...
Hi all,

I was trying to split a string that
represents the Chinese characters below:

>>> str = '\xc5\xeb\xc7\xd5\xbc'
>>> fields2 = split(r'\\',str)

There are no backslash characters in the string str, so split finds nothing
to split on. I know it looks like there are, but the backslashes shown are
part of the \x escape sequence for defining characters when you can't or
don't want to use plain ASCII characters (such as in your example in which
the characters are all in the range 0x80 to 0xff).

Moreover, look at what you are splitting on: r'\\' is a raw string that
holds two backslash characters, which a regex split (assuming split here
is re.split) reads as a pattern for one literal backslash. Either way,
there is no backslash in str to match. It looks like you want to treat
str as a raw string to get at the slashes, but it isn't a raw string and
I don't think you can directly convert it to one.
If you want the numeric values of each byte, you can do the following:

Py>>> char_values = [ord(c) for c in str]
Py>>> char_values
[197, 235, 199, 213, 188]

Note that those numbers are decimal equivalents of the hex values given
in your string, but are now in integer format.
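
In Python 3 (an assumption; the session above is Python 2) no ord() call is needed, because a bytes object already yields integers. A minimal sketch, with `data` and `hex_values` as made-up names:

```python
data = b"\xc5\xeb\xc7\xd5\xbc"

# Converting bytes to a list gives the integer value of every byte.
char_values = list(data)
print(char_values)  # [197, 235, 199, 213, 188]

# Formatting each integer in hex recovers the digits of the \x escapes.
hex_values = [format(value, "02x") for value in char_values]
print(hex_values)   # ['c5', 'eb', 'c7', 'd5', 'bc']
```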

On the other hand, you may want to use str.decode('gbk') (or whatever
your encoding is) so that you're actually dealing with characters rather
than bytes:

Py>>> str.decode('gbk')

Traceback (most recent call last):
  File "<pyshell#29>", line 1, in -toplevel-
    str.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbc in position 4:
incomplete multibyte sequence
Py>>> str[0:4].decode('gbk')
u'\u70f9\u94a6'

Py>>> print str[0:4].decode('gbk')
烹钦
Py>>> print str[0:4]
ÅëÇÕ

OK, so gbk choked on the odd character at the end. Maybe you need a
different encoding, or maybe your string got truncated somewhere along
the line....
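
One way to handle a possibly truncated string is to decode as much as the codec accepts and keep the rest aside. This is a Python 3 sketch that assumes the data really is GBK, which the thread does not confirm; `text` and `leftover` are illustrative names.

```python
data = b"\xc5\xeb\xc7\xd5\xbc"

try:
    text = data.decode("gbk")
    leftover = b""
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte the codec could not decode.
    text = data[:exc.start].decode("gbk")
    leftover = data[exc.start:]

# ascii() keeps the output printable on terminals without Chinese fonts.
print(ascii(text))
print(leftover)
```

If more bytes arrive later, `leftover` can be prepended to them before decoding again.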

Cheers,
Cliff
Oct 27 '06 #9
