problems with Â character

jdonnell

I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

These strings are coming out as type 'str' not 'unicode' so I tried to
just

record[4].replace('Â', '')

but this does nothing. However the following code works

#!/usr/bin/python

s = 'aaaaa Â aaa'
print type(s)
print s
print s.find('Â')

This returns
<type 'str'>
aaaaa Â aaa
6

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?

Jul 18 '05 #1

deelan

jdonnell wrote:

I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

use the "hammer" recipe. i'm using it to create URL-friendly
fragment from latin-1 album titles:

<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871>
(check the last comment, "a cleaner solution"
for a better implementation).

it basically hammers down accented chars like à and Â
to the nearest ASCII representation.

since you receive string data as str objects from mysql,
first convert them to unicode with:

u = unicode('Â', 'latin-1')

then feed u to the hammer function (the fix_unicode at the
end).
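
A minimal sketch of the same idea, not the code from the linked recipe. It assumes Python 3 (the thread itself uses Python 2), and the helper name `to_ascii` is hypothetical. It decomposes each character with `unicodedata.normalize` and drops whatever has no ASCII form:

```python
import unicodedata

def to_ascii(text):
    """Reduce accented characters to their closest ASCII form (hypothetical helper)."""
    # NFKD splits an accented letter into the base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks and any
    # other character that has no ASCII equivalent.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Python 3 counterpart of unicode('Â', 'latin-1'): decode the raw byte as latin-1.
u = b'\xc2'.decode('latin-1')
print(to_ascii(u))                   # A
print(to_ascii('\u00e0 la carte'))   # a la carte
```

Characters with no decomposition at all, such as '»', are simply removed by the `'ignore'` handler, which is closer to deleting than to replacing.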

HTH,
deelan

--
"Però è bello sapere che, di questi tempi spietati, almeno
un mistero sopravvive: l'età di Afef Jnifen." -- dagospia.com

Jul 18 '05 #2

Claudio Grondi

s = 'aaaaa Â aaa'

What am I doing wrong?

First get rid of characters not allowed
in Python code.
Replace Â with appropriate escape
sequence: /x## where ## is the
hexadecimal code of the ASCII
character.

Claudio

Jul 18 '05 #3

Claudio Grondi

"Claudio Grondi" <cl************@freenet.de> wrote in message
news:3a*************@individual.net...

s = 'aaaaa Â aaa'
What am I doing wrong?

First get rid of characters not allowed
in Python code.
Replace Â with appropriate escape
sequence: /x## where ## is the (should be \x##)
hexadecimal code of the ASCII
character.

Claudio

i.e. probably instead of 'aaaaa Â aaa'
'aaaaa \xC2 aaa'
In my ASCII table 'Â' is '\xC2'

Claudio

Jul 18 '05 #4

Do Re Mi chel La Si Do

'aaaaa Â aaa'
 0123456
It's OK

Jul 18 '05 #5

Do Re Mi chel La Si Do

And this runs OK for me:

s = 'aaaaa Â aaa'
print s
print s.replace('Â', '')

Jul 18 '05 #6

Marc 'BlackJack' Rintsch

In <11*********************@o13g2000cwo.googlegroups.com>, jdonnell wrote:

I have a mysql database with characters like Â Â Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

[...]

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?

Is it possible that your DB stores strings UTF-8 encoded? The
byte sequence '\xc2\xa0', which displays as 'Â ' in latin-1 encoding, is a
no-break space character in UTF-8.
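
That guess can be checked with a short sketch. It is written for Python 3, which is an assumption (the thread uses Python 2), and the variable names are only illustrative:

```python
# U+00A0 NO-BREAK SPACE encoded as UTF-8 is the two bytes C2 A0.
raw = '\u00a0'.encode('utf-8')
print(raw)                               # b'\xc2\xa0'

# Misreading those bytes as latin-1 turns each byte into its own character:
# 0xC2 becomes 'Â' and 0xA0 becomes a no-break space.
misread = raw.decode('latin-1')
print(len(misread), misread[0])          # 2 Â

# Decoding with the right codec gives back the single original character.
print(raw.decode('utf-8') == '\u00a0')   # True
```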

Ciao,
Marc 'BlackJack' Rintsch

Jul 18 '05 #7

John Roth

I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

Jul 18 '05 #8

Bengt Richter

On Tue, 22 Mar 2005 20:09:55 -0600, "John Roth" <ne********@jhrothjr.com> wrote:

I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

What encodings are involved?

This is from idle on windows, which seems to display latin-1 source ok:
----
>>> "Latin-1:Â»\n".decode('latin-1')
u'Latin-1:\xc2\xbb\n'
>>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'replace')
'Latin-1:?\xaf\n'
>>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'ignore')
'Latin-1:\xaf\n'
>>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
'Latin-1:?\xaf\n'
----
Now this is in an NT4 console window with code page 437:

----
>>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
'Latin-1:?\xaf\n'
>>> import sys
>>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','replace'))
Latin-1:?»
----

Notice that the interactive output does a repr that creates the \xaf, but
the character is available and can be written non-repr'd via sys.stdout.write.

For the heck of it:
>>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','xmlcharrefreplace'))
Latin-1:&#194;»

I don't know if this is going to get through to your screen ;-)
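
The three error handlers used above can be compared side by side. This sketch assumes Python 3 rather than the Python 2 of the session, so encoding yields `bytes` objects:

```python
text = 'Latin-1:\u00c2\u00bb\n'   # 'Â' (U+00C2) followed by '»' (U+00BB)

# cp437 has '»' (as byte 0xAF) but no 'Â', so only 'Â' triggers the handler.
replaced = text.encode('cp437', 'replace')
ignored = text.encode('cp437', 'ignore')
charref = text.encode('cp437', 'xmlcharrefreplace')

print(replaced)   # b'Latin-1:?\xaf\n'
print(ignored)    # b'Latin-1:\xaf\n'
print(charref)    # b'Latin-1:&#194;\xaf\n'
```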

Regards,
Bengt Richter

Jul 18 '05 #9

jdonnell

Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q...=rss&FORM=ZZRE

The string ' all domain name extensions » Good' is where I have a
problem. The
' »' shows up as 'Â Â Â»' when I write it to a file or stick
it in mysql. I did a hex dump and this is what I see.

jay@localhost:~/scripts> cat test.txt
extensions » Good
jay@localhost:~/scripts> xxd test.txt
0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0 extensions .. ..
0000010: 20c2 bb20 476f 6f64 0a .. Good

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.

Jul 18 '05 #10

Marc 'BlackJack' Rintsch

In <11*********************@g14g2000cwa.googlegroups.com>, jdonnell wrote:

Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q...=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

The string ' all domain name extensions » Good' is where I have a
problem. The
' »' shows up as 'Â Â Â»' when I write it to a file or stick
it in mysql. I did a hex dump and this is what I see.

jay@localhost:~/scripts> cat test.txt
extensions » Good
jay@localhost:~/scripts> xxd test.txt
0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0 extensions .. ..
0000010: 20c2 bb20 476f 6f64 0a .. Good

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.

Those are two no-break spaces and a '»' character::

In [42]: import unicodedata

In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
Out[43]: 'NO-BREAK SPACE'

In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'
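
Once the bytes are known to be UTF-8, removing the unwanted characters comes down to decoding first and replacing afterwards. A minimal sketch, assuming Python 3 (the session above is Python 2); the bytes are copied from the hex dump quoted above, and which characters to drop is an illustrative choice:

```python
import unicodedata

# Bytes taken from the xxd dump of test.txt.
raw = bytes.fromhex('657874656e73696f6e7320c2a020c2a020c2bb20476f6f640a')

# Decode as UTF-8 so each two-byte sequence becomes one character.
text = raw.decode('utf-8')

# List every non-ASCII character together with its Unicode name.
for ch in sorted(set(text)):
    if ord(ch) > 127:
        print(hex(ord(ch)), unicodedata.name(ch))

# Drop the no-break spaces and the angle quote, then collapse leftover spaces.
cleaned = text.replace('\u00a0', '').replace('\u00bb', '')
cleaned = ' '.join(cleaned.split())
print(cleaned)    # extensions Good
```

Doing the replacement on the decoded text avoids having to match the 'Â' byte separately, because that byte never exists as a character of its own.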

Ciao,
Marc 'BlackJack' Rintsch

Jul 18 '05 #11

jdonnell

Thanks everyone, I got it working earlier this morning using deelan's
suggestion. I modified the code in his link so that it removes rather
than replaces the characters.

Also, this was my first experience with unicode and what confused me is
that I was thinking of a unicode object as an encoding, but it's not.
It's just a sequence of characters, and you pick a specific encoding
like utf-8 or latin-1 only when you convert it to or from bytes.
Thanks again for all the help.
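
The difference between text and its encoded bytes can be made concrete with a short sketch. It assumes Python 3, where the Python 2 `unicode` type is called `str` and the Python 2 `str` type corresponds to `bytes`:

```python
text = '\u00bb'                    # one character: '»'

# The same character becomes different byte sequences under different encodings.
as_utf8 = text.encode('utf-8')
as_latin1 = text.encode('latin-1')
print(as_utf8)                     # b'\xc2\xbb'
print(as_latin1)                   # b'\xbb'

# Decoding must use the encoding the bytes were written in;
# the wrong one yields the stray 'Â' from this thread.
print(as_utf8.decode('utf-8'))     # »
print(as_utf8.decode('latin-1'))   # Â»
```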

Jul 18 '05 #12

John Machin

On Tue, 22 Mar 2005 21:39:30 -0000, "Claudio Grondi"
<cl************@freenet.de> wrote:

In my ASCII table 'Â' is '\xC2'

You've got an *ASCII* table that includes that??

I hope you paid for it in Confederate dollars or czarist roubles --
that's about what such a table would be worth.

Jul 18 '05 #13
