Removing Unicode from Python?

In general I love Python for text manipulation but at our company we
have the need to manipulate large text values stored in either a SQL
Server database or text files. This data is stored in a "text" field
type and is definitely not unicode though it is often very strange
text since it is either OCR or some kinda electronic file extraction.
Unfortunately when it is retrieved into a string type in python it is
invariably a unicode type string. The best I can do is try to encode
it to 'latin-1', but that will often throw an error; if I use the
ignore parameter, then it will whack my data with a bunch of "?". I am
just not understanding why python is thinking stuff is unicode and why
it is failing on conversion. There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string. Is there a way to modify Python so that all
strings will always be single byte strings since we have no need for
Unicode support? Any solutions or suggestions to my biggest Python
annoyance would be greatly appreciated.
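
To illustrate why encoding to 'latin-1' can fail: a unicode string holds code points, not bytes, and a code point can be larger than 255. The sketch below assumes (a guess, nothing in the post confirms it) that the data contains Windows-1252 punctuation such as curly quotes, which often turns up in OCR and extracted text. The sample string is made up, and the code is written for Python 3, where str is the unicode type.

```python
# Hypothetical sample: text with a right single quote (U+2019) and an
# en dash (U+2013), both common in OCR or document-extraction output.
text = "Joey\u2019s file \u2013 page 1"

# A unicode string stores code points, not bytes, so values above 255 occur.
too_big = [hex(ord(ch)) for ch in text if ord(ch) > 255]
print(too_big)

# Strict encoding fails because latin-1 only covers code points 0..255.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops the offending characters; 'replace' substitutes "?".
ignored = text.encode("latin-1", errors="ignore")
replaced = text.encode("latin-1", errors="replace")

# cp1252 (the usual Windows code page in Western locales) has both characters.
as_cp1252 = text.encode("cp1252")
print(ignored, replaced, as_cp1252)
```

If the data really is Windows-1252, encoding with "cp1252" instead of "latin-1" may avoid both the error and the lost characters, but that depends on what the driver actually returned.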

Thanks Joey
Jul 18 '05 #1
Paradox wrote:
In general I love Python for text manipulation but at our company we
have the need to manipulate large text values stored in either a SQL
Server database or text files. This data is stored in a "text" field
type and is definitely not unicode though it is often very strange
text since it is either OCR or some kinda electronic file extraction.
Unfortunately when it is retrieved into a string type in python it is
invariably a unicode type string. The best I can do is try to encode
it to 'latin-1', but that will often throw an error; if I use the
ignore parameter, then it will whack my data with a bunch of "?".


Can you give an example of such a string? Reporting its repr() would help.

If you want to encode arbitrary Unicode strings into byte strings, you
can use "utf-8" as the encoding.
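
A minimal sketch of both suggestions, written for Python 3, where ascii() shows the escapes that repr() showed for unicode strings in the Python 2 of this thread. The sample value is hypothetical.

```python
# Hypothetical value as it might come back from the database driver.
value = "caf\u00e9 \u2022 na\u00efve"

# ascii() escapes every non-ASCII character, so the exact code points are visible.
escaped = ascii(value)
print(escaped)

# UTF-8 can encode any string of valid code points (lone surrogates aside).
encoded = value.encode("utf-8")
print(encoded)

# Decoding with the same codec restores the original string exactly.
round_trip = encoded.decode("utf-8")
```

Posting the escaped form makes it possible to tell which characters fall outside the target codec.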

Regards,
Martin

Jul 18 '05 #2
Paradox:
I am
just not understanding why python is thinking stuff is unicode and why
it is failing on conversion. There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string.


In VB a string is a BSTR is a Unicode string.
http://msdn.microsoft.com/library/de...lprocedure.asp

Neil
Jul 18 '05 #3
There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string. Is there a way to modify Python so that all
strings will always be single byte strings since we have no need for
Unicode support? Any solutions or suggestions to my biggest Python
annoyance would be greatly appreciated.


All MS products use unicode strings. All the time. It's integral to
the OS and all its libraries.

VB and other MS offspring allow you to ignore that fact, but they
don't make it go away.

Python is just doing what it should do: handle unicode strings as unicode
strings.
Jul 18 '05 #4
Brian Quinlan <br***@sweetapp.com> wrote:
All MS products use unicode strings. All the time. It's integral to
the OS and all its libraries.
This statement is obviously false.


Not really. The core Windows 2000 and XP operating systems are exclusively
Unicode. When you call one of the ASCII APIs, it converts every string to
Unicode, calls the Unicode API which does the real work, converts any
output parameters back to ASCII, and returns them to you.

As you might imagine, all of those conversions cost time. Thus,
Microsoft's application products work natively in Unicode and use the
Unicode APIs when they are available.
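
The conversion layer described above can be sketched in plain Python. This is only a simulation of the idea: the function names and the cp1252 code page are hypothetical, and no real Windows API is called.

```python
def get_greeting_w(name: str) -> str:
    """Stand-in for a Unicode ('W') API: does the real work on Unicode text."""
    return "Hello, " + name


def get_greeting_a(name: bytes, codepage: str = "cp1252") -> bytes:
    """Stand-in for a byte-string ('A') API: a thin wrapper around the W version."""
    unicode_in = name.decode(codepage)        # convert the input to Unicode
    unicode_out = get_greeting_w(unicode_in)  # the real work happens here
    return unicode_out.encode(codepage)       # convert the output back to bytes


result = get_greeting_a(b"Jos\xe9")
print(result)
```

Every call through the byte-string wrapper pays for one decode and one encode, which is the overhead the post refers to.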
But the SQL Server "text" type is not a Unicode type.


And that means, among other things, that it cannot handle international
character sets reasonably. There is no agreement as to what the character
0xBF is, whereas there IS standards-based agreement on the meaning of the
Unicode code point U+00BF.
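
As a small illustration of that point, the sketch below decodes the single byte 0xBF under three code pages chosen arbitrarily for the example, then looks up the standard name of the Unicode code point.

```python
import unicodedata

raw = b"\xbf"  # one byte, with no encoding attached

# The same byte means different characters depending on the code page assumed.
interpretations = {
    codec: raw.decode(codec) for codec in ("latin-1", "cp437", "cp1251")
}
for codec, ch in interpretations.items():
    print(codec, "U+%04X" % ord(ch), unicodedata.name(ch))

# The Unicode code point, by contrast, has exactly one standard meaning.
name_of_00bf = unicodedata.name("\u00bf")
print(name_of_00bf)
```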
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Jul 18 '05 #5
On 29 Oct 2003 23:12:39 -0800, Paradox wrote:
In general I love Python for text manipulation but at our company we
have the need to manipulate large text values stored in either a SQL
Server database or text files. This data is stored in a "text" field
type and is definitely not unicode though it is often very strange
text since it is either OCR or some kinda electronic file extraction.
Unfortunately when it is retrieved into a string type in python it is
invariably a unicode type string. The best I can do is try to encode
it to 'latin-1', but that will often throw an error; if I use the
ignore parameter, then it will whack my data with a bunch of "?". I am
just not understanding why python is thinking stuff is unicode and why
it is failing on conversion. There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string. Is there a way to modify Python so that all
strings will always be single byte strings since we have no need for
Unicode support? Any solutions or suggestions to my biggest Python
annoyance would be greatly appreciated.

Thanks Joey


i had a similar problem with SQL Server. my solution was to create a
sitecustomize.py file containing:

import sys
sys.setdefaultencoding("utf-8")
this works for me and turns off unicode for everything. i was unable to
find any other solution that i could understand. (i'm not a programmer and
have only just started with python).
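
A note on the approach above: sys.setdefaultencoding exists only in Python 2, and there it is removed from the sys module once startup finishes, which is why the call has to live in sitecustomize.py. It changes the codec used for implicit conversions between byte strings and unicode strings; it does not switch Unicode off. A hedged alternative sketch for Python 3 is to name the encoding explicitly wherever text enters or leaves the program. The file name and the cp1252 input encoding below are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical input: a text file whose bytes are cp1252 (assumed, not known).
path = os.path.join(tempfile.mkdtemp(), "extract.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9 \x93quoted\x94\n")

# Decode explicitly on the way in...
with open(path, encoding="cp1252") as f:
    text = f.read()

# ...work with str in between...
text = text.upper()

# ...and encode explicitly on the way out.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

with open(path, "rb") as f:
    out_bytes = f.read()
print(out_bytes)
```

Naming the codec at each boundary keeps the behaviour local to the script instead of changing it for every program that uses the same interpreter.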

jack
sidelined in order to prevent discrimination on the gender front
Jul 18 '05 #6
