Help with character encodings

A_H
Help!

I've scraped a PDF file for text and all the minus signs come back as
u'\xad'.

Is there any easy way I can change them all to plain old ASCII '-' ???

str.replace complained about a missing codec.

Hints?


Jun 27 '08 #1
A_H wrote:
> Help!
>
> I've scraped a PDF file for text and all the minus signs come back as
> u'\xad'.
>
> Is there any easy way I can change them all to plain old ASCII '-' ???
>
> str.replace complained about a missing codec.
>
> Hints?

Encoding it into a 'latin1' encoded string seems to work:
>>> print u'\xad'.encode('latin1')
-

--
http://mail.python.org/mailman/listinfo/python-list
Jun 27 '08 #2
On Tue, 2008-05-20 at 08:28 -0700, Gary Herron wrote:
> A_H wrote:
> > Help!
> >
> > I've scraped a PDF file for text and all the minus signs come back as
> > u'\xad'.
> >
> > Is there any easy way I can change them all to plain old ASCII '-' ???
> >
> > str.replace complained about a missing codec.
> >
> > Hints?
>
> Encoding it into a 'latin1' encoded string seems to work:
> >>> print u'\xad'.encode('latin1')
> -

Here's what I've found:
>>> x = u'\xad'
>>> x.replace('\xad','-')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
>>> x.replace(u'\xad','-')
u'-'

If you replace the *string* '\xad' in the first argument to replace with
the *unicode object* u'\xad', Python won't complain anymore. With a plain
string there, Python first has to decode it using the default 'ascii'
codec, and byte 0xad is outside that range, hence the error. (Mind you,
you weren't using str.replace. You were using unicode.replace. Slight
difference, but important.) If you do the replace on a plain string, it
doesn't have to convert anything, so you don't get a UnicodeDecodeError.
>>> x = x.encode('latin1')
>>> x
'\xad'
>>> # Note the lack of a u before the ' above.
>>> x.replace('\xad','-')
'-'
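
For anyone reading this on Python 3, where every str is already Unicode, the same mismatch surfaces as a TypeError instead of a UnicodeDecodeError, because text and bytes are never converted implicitly. A minimal sketch, assuming Python 3 (the session above is Python 2); the variable names are only illustrative:

```python
# Assumes Python 3; the thread itself uses Python 2.
text = u'\xad'                 # a one-character Unicode string (str in Python 3)

# Text replaced with text: works, no codec is involved.
fixed = text.replace(u'\xad', u'-')
print(repr(fixed))

# Mixing text and bytes is rejected outright rather than decoded implicitly.
try:
    text.replace(b'\xad', b'-')
except TypeError as exc:
    mixed_error = exc
    print('TypeError:', exc)

# Bytes replaced with bytes: also works, mirroring the plain-string case above.
raw = text.encode('latin1')
raw_fixed = raw.replace(b'\xad', b'-')
print(repr(raw_fixed))
```

Either way the rule is the same: keep both sides of the replace the same type.
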
Cheers,
Cliff
Jun 27 '08 #3
Gary Herron wrote:
> A_H wrote:
> > Help!
> >
> > I've scraped a PDF file for text and all the minus signs come back as
> > u'\xad'.
> >
> > Is there any easy way I can change them all to plain old ASCII '-' ???
> >
> > str.replace complained about a missing codec.
> >
> > Hints?
>
> Encoding it into a 'latin1' encoded string seems to work:
> >>> print u'\xad'.encode('latin1')
> -

That might be what you want, but really, it was not a very well
thought-out answer. Here's a better answer:

Using the unicodedata module, I see that the character you have, u'\xad', is

SOFT HYPHEN (codepoint 173=0xad)

If you want to replace that with the more familiar HYPHEN-MINUS
(codepoint 45) you can use the string replace, but stick with all
unicode values so you don't provoke a conversion to an ascii encoded string:
>>> print u'ABC\xadDEF'.replace(u'\xad','-')
ABC-DEF
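
The lookup and the replacement described above could be sketched roughly as follows. This assumes Python 3 print syntax (the thread uses Python 2), and the variable names are made up for illustration:

```python
import unicodedata

ch = u'\xad'

# Look up the official Unicode name and code point of the scraped character.
name = unicodedata.name(ch)
print(name, ord(ch), hex(ord(ch)))

# The replacement character is HYPHEN-MINUS, code point 45.
hyphen = u'-'
print(unicodedata.name(hyphen), ord(hyphen))

# Replace using Unicode strings on both sides, so no codec gets involved.
result = u'ABC\xadDEF'.replace(ch, hyphen)
print(result)
```
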

But does this really solve your problem? If there is the possibility
for other unicode characters in your data, this is heading down the
wrong track, and the question (which I can't answer) becomes: What are
you going to do with the string?

If you are going to display it via a GUI that understands UTF-8, then
encode the string as utf8 and display it -- no need to convert the
hyphens.

If you are trying to display it somewhere that is not unicode (or UTF-8)
aware, then you'll have to convert it. In that case, encoding it as
latin1 is probably a good choice, but beware: That does not convert the
u'\xad' to chr(45) (the usual HYPHEN-MINUS), but instead to chr(173)
which (on latin1 aware applications) will display as the usual hyphen.
In any case, it won't be ascii (in the strict sense that ascii is chr(0)
through chr(127)). If you *really* *really* wanted straight strict
ascii, replace chr(173) with chr(45).
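
To make the three cases concrete, here is a rough sketch of what each encoding produces for the same text. It assumes Python 3 (the thread uses Python 2), and it does the hyphen replacement on the Unicode string before encoding rather than on the encoded bytes, which is a small variation on the chr(173)/chr(45) suggestion above:

```python
text = u'ABC\xadDEF'

# UTF-8 for a Unicode-aware destination: the soft hyphen becomes two bytes.
utf8_bytes = text.encode('utf-8')

# Latin-1 keeps the soft hyphen as the single byte 0xAD (173), not 0x2D (45).
latin1_bytes = text.encode('latin1')

# Strict ASCII refuses the soft hyphen unless it has been replaced first.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

ascii_bytes = text.replace(u'\xad', u'-').encode('ascii')

print(repr(utf8_bytes), repr(latin1_bytes), ascii_failed, repr(ascii_bytes))
```
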

Gary Herron


Jun 27 '08 #4
