Help with character encodings

A_H
Help!

I've scraped a PDF file for text and all the minus signs come back as
u'\xad'.

Is there any easy way I can change them all to plain old ASCII '-' ???

str.replace complained about a missing codec.

Hints?


Jun 27 '08 #1
Encoding it into a 'latin1' encoded string seems to work:
>>> print u'\xad'.encode('latin1')
-

Jun 27 '08 #2
On Tue, 2008-05-20 at 08:28 -0700, Gary Herron wrote:
> Encoding it into a 'latin1' encoded string seems to work:
> >>> print u'\xad'.encode('latin1')
> -

Here's what I've found:
>>> x = u'\xad'
>>> x.replace('\xad','-')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
>>> x.replace(u'\xad','-')
u'-'

If you replace the *string* '\xad' in the first argument to replace with
the *unicode object* u'\xad', Python won't complain anymore. (Mind you,
you weren't using str.replace. You were using unicode.replace. Slight
difference, but important.) When a unicode method is handed a plain
string, Python first decodes that string with the default 'ascii' codec,
and byte 0xad is outside the ASCII range, hence the error. If you do the
replace on a plain string, it doesn't have to convert anything, so you
don't get a UnicodeDecodeError.
>>> x = x.encode('latin1')
>>> x
'\xad'
>>> # Note the lack of a u before the ' above.
>>> x.replace('\xad','-')
'-'
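
For readers on Python 3, here is a minimal sketch of the same pitfall. It assumes Python 3 semantics, where text (str) and bytes are never converted implicitly, so mixing them raises a TypeError rather than a UnicodeDecodeError. The variable names are illustrative only.

```python
# Python 3 sketch (an assumption: the thread itself uses Python 2).
text = "\u00ad"  # SOFT HYPHEN as a text string

error = None
try:
    # Mixing str and bytes is rejected outright in Python 3.
    text.replace(b"\xad", b"-")
except TypeError as exc:
    error = exc

# Keeping both sides as text works.
fixed_text = text.replace("\u00ad", "-")

# Keeping both sides as bytes also works, after an explicit encode.
raw = text.encode("latin-1")
fixed_bytes = raw.replace(b"\xad", b"-")

print(type(error).__name__, repr(fixed_text), repr(fixed_bytes))
```

The rule is the same in both versions: keep the object being searched and the search arguments of the same type.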
Cheers,
Cliff
Jun 27 '08 #3
Gary Herron wrote:
A_H wrote:
>Help!

I've scraped a PDF file for text and all the minus signs come back as
u'\xad'.

Is there any easy way I can change them all to plain old ASCII '-' ???

str.replace complained about a missing codec.

Hints?

> Encoding it into a 'latin1' encoded string seems to work:
> >>> print u'\xad'.encode('latin1')
> -
That might be what you want, but really, it was not a very well
thought-out answer. Here's a better answer:

Using the unicodedata module, I see that the character you have, u'\xad', is

SOFT HYPHEN (codepoint 173=0xad)

If you want to replace that with the more familiar HYPHEN-MINUS
(codepoint 45), you can use the string replace method, but stick with
unicode values throughout so you don't provoke an implicit conversion
through the ascii codec:
>>> print u'ABC\xadDEF'.replace(u'\xad','-')
ABC-DEF
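
The lookup described above can be reproduced with the standard unicodedata module. This is a small sketch written in Python 3 syntax (an assumption, since the sessions in this thread are Python 2); the variable names are illustrative.

```python
import unicodedata

soft_hyphen = "\u00ad"

# Ask the Unicode database what each character is called.
soft_name = unicodedata.name(soft_hyphen)
minus_name = unicodedata.name("-")
print(soft_name, ord(soft_hyphen))
print(minus_name, ord("-"))

# Replace using text on both sides, as recommended above.
cleaned = "ABC\u00adDEF".replace(soft_hyphen, "-")
print(cleaned)
```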

But does this really solve your problem? If there is the possibility
for other unicode characters in your data, this is heading down the
wrong track, and the question (which I can't answer) becomes: What are
you going to do with the string?

If you are going to display it via a GUI that understands UTF-8, then
encode the string as utf8 and display it -- no need to convert the
hyphens.

If you are trying to display it somewhere that is not unicode (or UTF-8)
aware, then you'll have to convert it. In that case, encoding it as
latin1 is probably a good choice, but beware: That does not convert the
u'\xad' to a chr(45) (the usual HYPHEN-MINUS), but instead to chr(173),
which (in latin1-aware applications) will display as the usual hyphen.
In any case, it won't be ascii (in the strict sense that ascii is chr(0)
through chr(127)). If you *really* *really* wanted straight strict
ascii, replace chr(173) with chr(45).
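
If strict ASCII really is the goal, one possible approach is a translation table followed by an explicit encode. The sketch below is in Python 3 syntax and rests on assumptions: the DASH_LIKE mapping and both function names are hypothetical, and which characters should count as a "minus sign" depends on your data.

```python
# Hypothetical mapping of hyphen-like code points to ASCII HYPHEN-MINUS.
# Which characters belong here is an assumption; adjust for your data.
DASH_LIKE = {
    0x00AD: "-",  # SOFT HYPHEN
    0x2010: "-",  # HYPHEN
    0x2212: "-",  # MINUS SIGN
}


def to_ascii_hyphens(text):
    """Return text with the mapped characters replaced by '-'."""
    return text.translate(DASH_LIKE)


def to_strict_ascii(text):
    """Replace mapped characters, then encode; anything else non-ASCII becomes '?'."""
    return to_ascii_hyphens(text).encode("ascii", "replace")


print(to_ascii_hyphens("1 \u00ad 2"))
print(to_strict_ascii("a\u00adb\u00e9"))
```

The "replace" error handler makes any remaining non-ASCII character visible as "?" instead of raising, which ties back to the question above of what other characters the data may contain.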

Gary Herron


Jun 27 '08 #4
