Removing Unicode from Python?

Paradox

In general I love Python for text manipulation, but at our company we
need to manipulate large text values stored in either a SQL Server
database or text files. This data is stored in a "text" field type and
is definitely not Unicode, though it is often very strange text since
it comes from either OCR or some kind of electronic file extraction.
Unfortunately, when it is retrieved into a string in Python it is
invariably a unicode string. The best I can do is try to encode it to
'latin-1', but that will often throw an error; if I use the ignore
parameter, it whacks my data with a bunch of "?". I just don't
understand why Python thinks this stuff is Unicode and why the
conversion fails. There is no way that a byte can fail to be between 0
and 255, right? This problem can be so haunting that I start to wish I
had coded the solution in VB, where at least a string is a string is a
string. Is there a way to modify Python so that all strings are always
single-byte strings, since we have no need for Unicode support? Any
solutions or suggestions for my biggest Python annoyance would be
greatly appreciated.

Thanks Joey

Jul 18 '05 #1
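
A minimal sketch of why the 'latin-1' conversion described above can fail, assuming Python 3 (where `str` is the Unicode type and `bytes` is the byte string) and a made-up sample containing a curly apostrophe, a character that OCR output and Windows-1252 exports commonly contain. Every byte is indeed between 0 and 255, but a Unicode string holds code points rather than bytes, and a code point above 255 has no Latin-1 byte. Note also that it is the "replace" error handler that produces "?"; "ignore" drops the character silently.

```python
# Hypothetical sample: U+2019 (right single quotation mark) is typical of
# text that originated in a Windows-1252 source.
sample = "Joey\u2019s OCR text"

# Code points are not bytes: this one is far above 255.
apostrophe_code_point = ord("\u2019")

# Strict encoding to Latin-1 fails, because Latin-1 only covers
# U+0000 through U+00FF.
try:
    sample.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes "?"; errors="ignore" drops the character.
replaced = sample.encode("latin-1", errors="replace")
ignored = sample.encode("latin-1", errors="ignore")

# Windows-1252 does have a byte (0x92) for this character, so the text
# survives a round trip through that code page.
as_cp1252 = sample.encode("cp1252")
```

If the data really came from a Windows code page, encoding to that code page rather than to Latin-1 may avoid both the error and the data loss; whether that applies here depends on the actual strings, which the thread does not show.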

Martin v. Löwis

Paradox wrote:

In general I love Python for text manipulation, but at our company we
need to manipulate large text values stored in either a SQL Server
database or text files. This data is stored in a "text" field type and
is definitely not Unicode, though it is often very strange text since
it comes from either OCR or some kind of electronic file extraction.
Unfortunately, when it is retrieved into a string in Python it is
invariably a unicode string. The best I can do is try to encode it to
'latin-1', but that will often throw an error; if I use the ignore
parameter, it whacks my data with a bunch of "?".

Can you give an example of such a string? Reporting its repr() would help.

If you want to encode arbitrary Unicode strings into byte strings, you
can use "utf-8" as the encoding.

Regards,
Martin

Jul 18 '05 #2
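
A short sketch of both suggestions, assuming Python 3 and a hypothetical value standing in for a string fetched from the database. In Python 3, `ascii()` plays the role that `repr()` played for Python 2 unicode objects: it shows every non-ASCII character as an escape, which makes the troublesome code points visible.

```python
# Hypothetical value standing in for a string fetched from the database.
value = "caf\u00e9 \u2013 na\u00efve \u2019quoted\u2019"

# ascii() escapes every non-ASCII character, so each code point is visible.
escaped = ascii(value)

# List the code points that Latin-1 cannot represent (anything above U+00FF).
outside_latin1 = sorted({f"U+{ord(ch):04X}" for ch in value if ord(ch) > 0xFF})

# UTF-8 can encode any Unicode string, and decoding restores it exactly.
utf8_bytes = value.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```

The trade-off is that UTF-8 is a multi-byte encoding, so the resulting byte string is only useful to consumers that expect UTF-8.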

Neil Hodgson

Paradox:

I just don't understand why Python thinks this stuff is Unicode and
why the conversion fails. There is no way that a byte can fail to be
between 0 and 255, right? This problem can be so haunting that I start
to wish I had coded the solution in VB, where at least a string is a
string is a string.

In VB a string is a BSTR is a Unicode string.
http://msdn.microsoft.com/library/de...lprocedure.asp

Neil

Jul 18 '05 #3

George Kinney

There is no way that a byte can fail to be between 0 and 255, right?
This problem can be so haunting that I start to wish I had coded the
solution in VB, where at least a string is a string is a string. Is
there a way to modify Python so that all strings are always
single-byte strings, since we have no need for Unicode support? Any
solutions or suggestions for my biggest Python annoyance would be
greatly appreciated.

All MS products use Unicode strings. All the time. It's integral to
the OS and all its libraries.

VB and other MS offspring allow you to ignore that fact, but they
don't make it go away.

Python is just doing what it should do: handle unicode strings as unicode
strings.

Jul 18 '05 #4

Tim Roberts

Brian Quinlan <br***@sweetapp.com> wrote:

All MS products use Unicode strings. All the time. It's integral to
the OS and all its libraries.

This statement is obviously false.

Not really. The core Windows 2000 and XP operating systems are exclusively
Unicode. When you call one of the ASCII APIs, it converts every string to
Unicode, calls the Unicode API which does the real work, converts any
output parameters back to ASCII, and returns them to you.

As you might imagine, all of those conversions cost time. Thus,
Microsoft's application products work natively in Unicode and use the
Unicode APIs when they are available.

But the SQL Server "text" type is not a Unicode type.

And that means, among other things, that it cannot handle international
character sets reasonably. There is no agreement as to what the character
0xBF is, whereas there IS standards-based agreement on the meaning of the
Unicode code point U+00BF.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.

Jul 18 '05 #5
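
The point about 0xBF can be illustrated with a small sketch, assuming Python 3 and three code pages picked purely as examples (the thread does not say which code page the database uses). The same byte decodes to a different character under each one, while the code point U+00BF has a single standardized name.

```python
import unicodedata

raw = b"\xbf"

# The same byte means different characters under different code pages.
as_latin1 = raw.decode("latin-1")   # inverted question mark, U+00BF
as_cp437 = raw.decode("cp437")      # a box-drawing character, U+2510
as_cp1251 = raw.decode("cp1251")    # a Cyrillic letter

# The Unicode code point U+00BF, by contrast, always names one character.
name = unicodedata.name("\u00bf")
```

This is why a byte string on its own is ambiguous: its meaning depends on an encoding that has to be known from somewhere else.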

jack

On 29 Oct 2003 23:12:39 -0800, Paradox wrote:

In general I love Python for text manipulation, but at our company we
need to manipulate large text values stored in either a SQL Server
database or text files. This data is stored in a "text" field type and
is definitely not Unicode, though it is often very strange text since
it comes from either OCR or some kind of electronic file extraction.
Unfortunately, when it is retrieved into a string in Python it is
invariably a unicode string. The best I can do is try to encode it to
'latin-1', but that will often throw an error; if I use the ignore
parameter, it whacks my data with a bunch of "?". I just don't
understand why Python thinks this stuff is Unicode and why the
conversion fails. There is no way that a byte can fail to be between 0
and 255, right? This problem can be so haunting that I start to wish I
had coded the solution in VB, where at least a string is a string is a
string. Is there a way to modify Python so that all strings are always
single-byte strings, since we have no need for Unicode support? Any
solutions or suggestions for my biggest Python annoyance would be
greatly appreciated.

Thanks Joey

i had a similar problem with SQL Server. my solution was to create a
sitecustomize.py file containing:

# Python 2 only: setdefaultencoding is removed from sys after startup
import sys
sys.setdefaultencoding("utf-8")

this works for me and turns off unicode for everything. i was unable to
find any other solution that i could understand. (i'm not a programmer and
have only just started with python).

jack
sidelined in order to prevent discrimination on the gender front

Jul 18 '05 #6
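
A note on the sitecustomize.py approach: `sys.setdefaultencoding` exists only in Python 2, and it changes the codec used for implicit conversions between byte strings and unicode strings rather than switching Unicode off. A sketch of the explicit alternative follows, written for Python 3. It assumes the data is Windows-1252, which the thread never confirms, and the helper names `read_text` and `write_text` are hypothetical.

```python
import os
import tempfile

SOURCE_ENCODING = "cp1252"  # assumption: replace with the data's real code page


def read_text(path, encoding=SOURCE_ENCODING):
    """Read a file, decoding its bytes explicitly instead of relying on a default."""
    with open(path, "r", encoding=encoding, newline="") as handle:
        return handle.read()


def write_text(path, text, encoding=SOURCE_ENCODING):
    """Write text back in the same encoding; unencodable characters raise an error."""
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)


# Usage: round-trip a small file through the explicit helpers.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as handle:
        handle.write(b"Joey\x92s text \xbf")
    text = read_text(path)
    write_text(path, text.upper())
    with open(path, "rb") as handle:
        result_bytes = handle.read()
```

Decoding on the way in and encoding on the way out keeps every conversion visible, so an unexpected character fails at a known place instead of somewhere inside an implicit conversion.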
