problems with Â character

I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

These strings are coming out as type 'str' not 'unicode' so I tried to
just

record[4].replace('Â', '')

but this does nothing. However the following code works

#!/usr/bin/python

s = 'aaaaa Â aaa'
print type(s)
print s
print s.find('Â')

This returns
<type 'str'>
aaaaa Â aaa
6

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?

Jul 18 '05 #1
jdonnell wrote:
I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.


use the "hammer" recipe. i'm using it to create URL-friendly
fragments from latin-1 album titles:

<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871>
(check the last comment, "a cleaner solution"
for a better implementation).

it basically hammers down accented chars like à and Â
to the nearest ASCII representation.

since you receive string data as str objects from mysql,
first convert them to unicode with:

u = unicode('Â', 'latin-1')

then feed u to the hammer function (the fix_unicode at the
end).
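
As a rough illustration of the "hammering" idea (this is not the code from the linked recipe), here is a minimal sketch in Python 3 syntax, where `str` plays the role of Python 2's `unicode`. It assumes that decomposing characters with NFKD and dropping whatever is left outside ASCII is acceptable for your data; `to_ascii` is a hypothetical helper name.

```python
import unicodedata


def to_ascii(text):
    """Approximate text with plain ASCII (hypothetical helper, not the recipe's code)."""
    # NFKD splits accented letters into a base letter plus combining marks
    # and maps compatibility characters (e.g. NO-BREAK SPACE) to plain ones.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks and any
    # character that has no ASCII decomposition, such as '»'.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# The database hands back bytes, so decode them first. 'latin-1' follows the
# suggestion above; use whichever encoding the data is really stored in.
raw = b"\xe0 \xc2"  # 'à Â' encoded as latin-1
print(to_ascii(raw.decode("latin-1")))
```

Characters without any ASCII decomposition are silently removed by this approach, so it only suits cases where losing them is fine.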

HTH,
deelan

--
"But it's nice to know that, in these ruthless times, at least
one mystery survives: the age of Afef Jnifen." -- dagospia.com
Jul 18 '05 #2
s = 'aaaaa Â aaa'
What am I doing wrong?


First get rid of characters not allowed
in Python code.
Replace Â with appropriate escape
sequence: /x## where ## is the
hexadecimal code of the ASCII
character.

Claudio
Jul 18 '05 #3

"Claudio Grondi" <cl************@freenet.de> wrote in message
news:3a*************@individual.net...
s = 'aaaaa Â aaa'
What am I doing wrong?


First get rid of characters not allowed
in Python code.
Replace Â with appropriate escape
sequence: /x## where ## is the (should be \x##)
hexadecimal code of the ASCII
character.

Claudio


i.e. probably instead of 'aaaaa Â aaa'
'aaaaa \xC2 aaa'
In my ASCII table 'Â' is '\xC2'

Claudio
Jul 18 '05 #4
aaaaa Â aaa
0123456
It's OK

Jul 18 '05 #5
And this runs OK for me:

s = 'aaaaa Â aaa'
print s
print s.replace('Â', '')

Jul 18 '05 #6
In <11*********************@o13g2000cwo.googlegroups.com>, jdonnell wrote:
I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

[...]

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?


Is it possible that your DB stores strings UTF-8 encoded? The
byte sequence '\xc2\xa0', which displays as 'Â ' in latin-1 encoding, is the
UTF-8 encoding of a no-break space character.
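
To see why the same two bytes can look so different, here is a small sketch that decodes them both ways. It uses Python 3 syntax (the thread itself uses Python 2), and the variable names are only illustrative.

```python
import unicodedata

raw = b"\xc2\xa0"  # the byte sequence mentioned above

# Read as latin-1, every byte becomes one character: 'Â' plus a no-break space.
as_latin1 = raw.decode("latin-1")
# Read as UTF-8, the two bytes together form a single character.
as_utf8 = raw.decode("utf-8")

print([unicodedata.name(ch) for ch in as_latin1])
print([unicodedata.name(ch) for ch in as_utf8])
```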

Ciao,
Marc 'BlackJack' Rintsch

Jul 18 '05 #7
I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

Jul 18 '05 #8
On Tue, 22 Mar 2005 20:09:55 -0600, "John Roth" <ne********@jhrothjr.com> wrote:
I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

What encodings are involved?

This is from idle on windows, which seems to display latin-1 source ok:
----
>>> "Latin-1:Â»\n".decode('latin-1')
u'Latin-1:\xc2\xbb\n'
>>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'replace')
'Latin-1:?\xaf\n'
>>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'ignore')
'Latin-1:\xaf\n'
>>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
'Latin-1:?\xaf\n'
----
Now this is in an NT4 console window with code page 437:

----
>>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
'Latin-1:?\xaf\n'
>>> import sys
>>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','replace'))
Latin-1:?»
----

Notice that the interactive output does a repr that creates the \xaf, but
the character is available and can be written non-repr'd via sys.stdout.write.

For the heck of it:
>>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','xmlcharrefreplace'))
Latin-1:&#194;»

I don't know if this is going to get through to your screen ;-)
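
For comparison, the three error handlers used above can be sketched side by side. This is Python 3 syntax, where `str` corresponds to Python 2's `unicode` and encoding produces `bytes`; the variable names are illustrative only.

```python
text = "Latin-1:\u00c2\u00bb\n"  # 'Â' followed by '»'

# cp437 has a byte for '»' (0xAF) but none for 'Â', so the error handler
# decides what happens to the 'Â'.
replaced = text.encode("cp437", "replace")            # substitutes '?'
ignored = text.encode("cp437", "ignore")              # drops the character
xml_refs = text.encode("cp437", "xmlcharrefreplace")  # writes '&#194;'

for encoded in (replaced, ignored, xml_refs):
    print(repr(encoded))
```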

Regards,
Bengt Richter
Jul 18 '05 #9
Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q...=rss&FORM=ZZRE

The string ' all domain name extensions » Good' is where I have a
problem. The ' »' shows up as ' Â Â Â»' when I write it to a file or
stick it in mysql. I did a hex dump and this is what I see.

jay@localhost:~/scripts> cat test.txt
extensions Â Â Â» Good
jay@localhost:~/scripts> xxd test.txt
0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0 extensions .. ..
0000010: 20c2 bb20 476f 6f64 0a .. Good

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.
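
The bytes in the hex dump can be decoded directly to check what they contain. This is a minimal sketch in Python 3 syntax that assumes the data is UTF-8 and that removing the no-break spaces and the '»' outright is the desired result; the variable names are illustrative.

```python
# Hex bytes copied from the xxd output above.
dump = (
    "6578 7465 6e73 696f 6e73 20c2 a020 c2a0 "
    "20c2 bb20 476f 6f64 0a"
)
raw = bytes.fromhex(dump.replace(" ", ""))

# Decoding as UTF-8 turns each c2a0 / c2bb pair into a single character.
text = raw.decode("utf-8")
print(repr(text))

# Remove NO-BREAK SPACE (U+00A0) and '»' (U+00BB), then collapse the
# leftover runs of ordinary whitespace.
cleaned = text.replace("\u00a0", "").replace("\u00bb", "")
cleaned = " ".join(cleaned.split())
print(repr(cleaned))
```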

Jul 18 '05 #10
In <11*********************@g14g2000cwa.googlegroups.com>, jdonnell wrote:
Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q...=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.


Those are two no-break spaces and a '»' character::

In [42]: import unicodedata

In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
Out[43]: 'NO-BREAK SPACE'

In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'

Ciao,
Marc 'BlackJack' Rintsch
Jul 18 '05 #11
Thanks everyone, I got it working earlier this morning using deelan's
suggestion. I modified the code in his link so that it removes rather
than replaces the characters.

Also, this was my first experience with unicode and what confused me is
that I was thinking of a unicode object as an encoding, but it's not.
It's just a series of characters (code points), and you only pick a
specific encoding like utf-8 or latin-1 when you convert it to or from
bytes. Thanks again for all the help.
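
The difference between text and its encoded bytes can be sketched as follows. This uses Python 3 syntax, where `str` is what Python 2 calls `unicode` and `bytes` is what Python 2 calls `str`; the variable names are illustrative.

```python
text = "\u00bb"  # one character: RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

# The same character becomes different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")      # two bytes
latin1_bytes = text.encode("latin-1")  # one byte
print(utf8_bytes, latin1_bytes)

# Decoding with the matching codec gives back the same text either way.
print(utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1"))
```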

Jul 18 '05 #12
On Tue, 22 Mar 2005 21:39:30 -0000, "Claudio Grondi"
<cl************@freenet.de> wrote:

In my ASCII table 'Â' is '\xC2'


You've got an *ASCII* table that includes that??

I hope you paid for it in Confederate dollars or czarist roubles --
that's about what such a table would be worth.

Jul 18 '05 #13
