problems with character

P: n/a
I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

These strings are coming out as type 'str' not 'unicode' so I tried to
just

record[4].replace('Â', '')

but this does nothing. However the following code works

#!/usr/bin/python

s = 'aaaaa Âaaa'
print type(s)
print s
print s.find('Â')

This returns
<type 'str'>
aaaaa Âaaa
6

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?

Jul 18 '05 #1
12 Replies


P: n/a
jdonnell wrote:
I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.


use the "hammer" recipe. i'm using it to create URL-friendly
fragments from latin-1 album titles:

<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871>
(check the last comment, "a cleaner solution"
for a better implementation).

it basically hammers accented chars down to the
nearest ASCII representation.

since you receive string data from mysql as str
objects, first convert them to unicode with:

u = unicode('', 'latin-1')

then feed u to the hammer function (the fix_unicode at the
end).
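
A minimal sketch of the same idea in Python 3 syntax, where the thread's `str` corresponds to `bytes`. This is not the recipe's actual code: it decodes the bytes, decomposes accented characters with `unicodedata.normalize`, and drops whatever has no ASCII equivalent. The name `to_ascii` and the sample bytes are made up for illustration.

```python
import unicodedata


def to_ascii(raw, encoding="latin-1"):
    """Decode raw bytes, then reduce the text to plain ASCII (hypothetical helper)."""
    text = raw.decode(encoding)
    # NFKD splits an accented letter into base letter + combining mark
    # and maps a no-break space to an ordinary space.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (combining marks, symbols such as ») is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(b"caf\xe9 \xbb album"))
```

Note that characters without a decomposition, such as », simply disappear, so surrounding spaces are left behind.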

HTH,
deelan

--
"Still, it's nice to know that, in these ruthless times, at least
one mystery survives: Afef Jnifen's age." -- dagospia.com
Jul 18 '05 #2

P: n/a
s = 'aaaaa Âaaa'
What am I doing wrong?


First get rid of characters not allowed
in Python code.
Replace Â with the appropriate escape
sequence: /x## where ## is the
hexadecimal code of the ASCII
character.

Claudio
Jul 18 '05 #3

P: n/a

"Claudio Grondi" <cl************@freenet.de> wrote in message
news:3a*************@individual.net...
s = 'aaaaa Âaaa'
What am I doing wrong?


First get rid of characters not allowed
in Python code.
Replace Â with the appropriate escape
sequence: /x## where ## is the (should be \x##)
hexadecimal code of the ASCII
character.

Claudio


i.e. probably instead of 'aaaaa Âaaa' use
'aaaaa \xC2 aaa'
In my ASCII table 'Â' is '\xC2'
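
A small sketch of the escape-sequence approach in Python 3 syntax, where the thread's `str` corresponds to `bytes`. It assumes the character is stored as the single byte 0xC2; the variable names are illustrative only.

```python
# Spell the non-ASCII byte as an escape so the source file stays pure ASCII.
s = b"aaaaa \xc2aaa"

position = s.find(b"\xc2")         # index of the first 0xC2 byte
cleaned = s.replace(b"\xc2", b"")  # replace() returns a new object; s is unchanged

print(position)
print(cleaned)
```

Because `replace` never modifies the original object, its result has to be assigned to something before it can be used.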

Claudio
Jul 18 '05 #4

P: n/a
'aaaaa Âaaa'
 0123456
It's OK

Jul 18 '05 #5

P: n/a
And this runs OK for me:

s = 'aaaaa Âaaa'
print s
print s.replace('Â', '')

Jul 18 '05 #6

P: n/a
In <11*********************@o13g2000cwo.googlegroups.com>, jdonnell wrote:
I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

[...]

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?


Is it possible that your DB stores its strings UTF-8 encoded? The
byte sequence '\xc2\xa0', which displays as 'Â ' when read as latin-1, is
the UTF-8 encoding of a no-break space character.
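
A short sketch of that observation in Python 3 syntax (where undecoded data is `bytes`): the same two bytes give two characters under latin-1 but a single character under UTF-8. The variable names are illustrative.

```python
raw = b"\xc2\xa0"  # the byte sequence in question

as_latin1 = raw.decode("latin-1")  # each byte becomes one character: 'Â' + U+00A0
as_utf8 = raw.decode("utf-8")      # both bytes together form one character: U+00A0

print(len(as_latin1), repr(as_latin1))
print(len(as_utf8), repr(as_utf8))
```

If the data really is UTF-8, searching the raw bytes for a lone 'Â' only ever matches half of a character.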

Ciao,
Marc 'BlackJack' Rintsch

Jul 18 '05 #7

P: n/a
I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

Jul 18 '05 #8

P: n/a
On Tue, 22 Mar 2005 20:09:55 -0600, "John Roth" <ne********@jhrothjr.com> wrote:
I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

What encodings are involved?

This is from idle on windows, which seems to display latin-1 source ok:
----
>>> "Latin-1:Â»\n".decode('latin-1')
u'Latin-1:\xc2\xbb\n'
>>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'replace')
'Latin-1:?\xaf\n'
>>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'ignore')
'Latin-1:\xaf\n'
>>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
'Latin-1:?\xaf\n'
----
Now this is in an NT4 console window with code page 437:

----
>>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
'Latin-1:?\xaf\n'
>>> import sys
>>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','replace'))
Latin-1:?»
----

Notice that the interactive output does a repr that creates the \xaf, but
the character is available and can be written non-repr'd via sys.stdout.write.

For the heck of it:
>>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','xmlcharrefreplace'))
Latin-1:&#194;»

I don't know if this is going to get through to your screen ;-)
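
The error handlers used above can be compared side by side. This is a minimal sketch in Python 3 syntax, so the literal is a `str` and each result is `bytes`; the variable names are illustrative.

```python
text = "Latin-1:\xc2\xbb\n"  # 'Â' (U+00C2) followed by '»' (U+00BB)

# cp437 has a slot for '»' (0xAF) but none for 'Â',
# so only 'Â' is passed to the error handler.
replaced = text.encode("cp437", "replace")           # unencodable -> b'?'
ignored = text.encode("cp437", "ignore")             # unencodable -> dropped
as_refs = text.encode("cp437", "xmlcharrefreplace")  # unencodable -> &#NNN;

for result in (replaced, ignored, as_refs):
    print(result)
```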

Regards,
Bengt Richter
Jul 18 '05 #9

P: n/a
Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q...=rss&FORM=ZZRE

The string ' all domain name extensions » Good' is where I have a
problem. The
' »' shows up as ' Â Â Â»' when I write it to a file or stick
it in mysql. I did a hex dump and this is what I see.

jay@localhost:~/scripts> cat test.txt
extensions » Good
jay@localhost:~/scripts> xxd test.txt
0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0 extensions .. ..
0000010: 20c2 bb20 476f 6f64 0a .. Good

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.

Jul 18 '05 #10

P: n/a
In <11*********************@g14g2000cwa.googlegroups.com>, jdonnell wrote:
Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q...=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

The string ' all domain name extensions » Good' is where I have a
problem. The
' »' shows up as ' Â Â Â»' when I write it to a file or stick
it in mysql. I did a hex dump and this is what I see.

jay@localhost:~/scripts> cat test.txt
extensions » Good
jay@localhost:~/scripts> xxd test.txt
0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0 extensions .. ..
0000010: 20c2 bb20 476f 6f64 0a .. Good

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.


Those are two no-break spaces and a '»' character::

In [42]: import unicodedata

In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
Out[43]: 'NO-BREAK SPACE'

In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'
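
For readers on Python 3, a hedged equivalent of the session above. It decodes the bytes from the xxd dump that sit between "extensions" and "Good" and names every character; the variable names are illustrative.

```python
import unicodedata

# Bytes copied from the hex dump, between "extensions" and "Good".
raw = bytes.fromhex("20c2a020c2a020c2bb20")

# Decode as UTF-8, then look up the Unicode name of each character.
names = [unicodedata.name(ch) for ch in raw.decode("utf-8")]

for name in names:
    print(name)
```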

Ciao,
Marc 'BlackJack' Rintsch
Jul 18 '05 #11

P: n/a
Thanks everyone, I got it working earlier this morning using deelan's
suggestion. I modified the code in his link so that it removes rather
than replaces the characters.
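
A rough sketch of what "removes rather than replaces" could look like, in Python 3 syntax. This is not the modified recipe itself: it assumes the data is UTF-8, as suggested earlier in the thread, and the names `UNWANTED` and `strip_unwanted` are hypothetical.

```python
UNWANTED = "\u00a0\u00bb"  # no-break space and », the two characters found in the dump


def strip_unwanted(raw, encoding="utf-8"):
    """Decode the bytes, delete the unwanted characters, re-encode (hypothetical helper)."""
    text = raw.decode(encoding)
    table = {ord(ch): None for ch in UNWANTED}  # mapping to None deletes the character
    return text.translate(table).encode(encoding)


print(strip_unwanted(b"extensions \xc2\xa0 \xc2\xa0 \xc2\xbb Good"))
```

The ordinary spaces around the removed characters are kept, so a follow-up step may be needed to collapse runs of whitespace.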

Also, this was my first experience with unicode and what confused me is
that I was thinking of a unicode object as an encoding, but it's not.
It's just a series of bytes and you later tell it to use a specific
encoding like utf-8 or latin-1. Thanks again for all the help.

Jul 18 '05 #12

P: n/a
On Tue, 22 Mar 2005 21:39:30 -0000, "Claudio Grondi"
<cl************@freenet.de> wrote:

In my ASCII table 'Â' is '\xC2'


You've got an *ASCII* table that includes that??

I hope you paid for it in Confederate dollars or czarist roubles --
that's about what such a table would be worth.

Jul 18 '05 #13
