csv module and unicode, when or workaround?

Chris

hi,
to convert Excel files via CSV to XML or whatever, I frequently use the
csv module, which is really nice for quick scripts. The problem, of course,
is non-ASCII characters like German umlauts, the euro currency symbol, etc.

The docs say the current csv module cannot handle Unicode. Is there any
workaround, or is Unicode support planned for the near future? In most
cases support for characters in iso-8859-1(5) would be OK for my
purposes, but of course full Unicode support would be great...

Obviously I am not a Python pro; I did not even find the .py source for
the module. It seems to me it is a C-based module? Is this also the
reason for the Unicode unawareness?

thanks
chris

Jul 18 '05 #1


Skip Montanaro

Chris> the current csv module cannot handle unicode the docs say, is
Chris> there any workaround or is unicode support planned for the near
Chris> future?

True, it can't.

Chris> obviously I am not a python pro, i did not even find the py
Chris> source for the module, it seemed to me it is a C based
Chris> module?. is this also the reason for the unicode unawareness?

Look in Modules/_csv.c and Lib/csv.py. The C-ness of the underlying module
is the main issue as far as I understand. If you have some C+Unicode-fu
(this goes for anyone reading this, not just Chris), feel free to try
writing a patch. Also, check out the csv mailing list:

http://orca.mojam.com/mailman/listinfo/csv

Skip

Jul 18 '05 #2

Skip Montanaro

Chris> the current csv module cannot handle unicode the docs say, is
Chris> there any workaround or is unicode support planned for the near
Chris> future?

Skip> True, it can't.

Hmmm... I think the following should be a reasonable workaround in most
situations:

#!/usr/bin/env python

import csv

class UnicodeReader:
    # Wraps csv.reader and decodes each field from `encoding` to unicode.
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.reader = csv.reader(f, dialect=dialect, **kwds)
        self.encoding = encoding

    def next(self):
        row = self.reader.next()
        return [unicode(s, self.encoding) for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    # Wraps csv.writer and encodes each unicode field to `encoding`.
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.writer = csv.writer(f, dialect=dialect, **kwds)
        self.encoding = encoding

    def writerow(self, row):
        # Use the configured encoding rather than a hard-coded "utf-8".
        self.writer.writerow([s.encode(self.encoding) for s in row])

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

if __name__ == "__main__":
    try:
        oldurow = [u'\u65E5\u672C\u8A9E',
                   u'Hi Mom -\u263a-!',
                   u'A\u2262\u0391.']
        writer = UnicodeWriter(open("uni.csv", "wb"))
        writer.writerow(oldurow)
        del writer

        reader = UnicodeReader(open("uni.csv", "rb"))
        newurow = reader.next()
        print "trivial test", newurow == oldurow and "passed" or "failed"
    finally:
        import os
        os.unlink("uni.csv")

If people don't find any egregious flaws with the concept I'll at least add
it as an example to the csv module docs. Maybe they would even work as
additions to the csv.py module, assuming the api is palatable.
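
For anyone reading this on Python 3 (an assumption on my part; the code above targets Python 2), here is a minimal sketch of the same encode-on-write, decode-on-read round trip. In Python 3 the csv module works on text, so the encoding goes to open() instead of a wrapper class. The helper name roundtrip and the temporary file are purely illustrative.

```python
import csv
import os
import tempfile


def roundtrip(row, encoding="utf-8"):
    """Write one row to a temporary CSV file and read it back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="", encoding=encoding) as f:
            csv.writer(f).writerow(row)
        with open(path, newline="", encoding=encoding) as f:
            return next(csv.reader(f))
    finally:
        os.unlink(path)


oldurow = ["\u65e5\u672c\u8a9e", "Hi Mom -\u263a-!", "A\u2262\u0391."]
newurow = roundtrip(oldurow)
print("trivial test", "passed" if newurow == oldurow else "failed")
```

The test row is the same one used in the Python 2 script, so the two versions can be compared directly.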

Skip

Jul 18 '05 #3

Max M

Chris wrote:

> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...

It doesn't support unicode, but you should not have a problem
importing/exporting encoded strings.

I have imported utf-8 encoded strings with no trouble. But I might just
have been lucky that they were inside the latin-1 range?

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

Jul 18 '05 #4

John Machin

Chris wrote:

> hi,
> to convert excel files via csv to xml or whatever I frequently use the
> csv module which is really nice for quick scripts. problem are of course
> non ascii characters like german umlauts, EURO currency symbol etc.

The umlauted characters should not be a problem, they're all in the
first 256 characters. What makes you say they are a problem "of
course"?

> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...

Here's a perambulation through some of the alternatives:

A. If you save the file from Excel as "Unicode text", you can pretty
much DIY:

>>> buff = file('csvtest.txt', 'rb').read()
>>> lines = buff.decode('utf16').split(u'\r\n')
>>> lines
[u'M\xfcller\t"\u20ac1234,56"', u'M\xf6ller\t"\u20ac9876,54"',
u'Kawasaki\t\xa53456.78', u'']
>>> for line in lines:
...     print line.split(u'\t')
...
[u'M\xfcller', u'"\u20ac1234,56"']
[u'M\xf6ller', u'"\u20ac9876,54"']
[u'Kawasaki', u'\xa53456.78']
[u'']

All you have to do is (1) handle Excel's unnecessary quoting of the
comma in the money amounts [see first two lines above; what it quotes
is probably locale-dependent], (2) handle doubled quotes inside quoted
fields [no example given], and (3) ignore the empty "line" introduced
by split().

Problem (3) is easy: if not lines[-1]: del lines[-1]

Hmmm ... by the time you finish this (and generalise it) you will have
done the Unicode extension to the csv module ...
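
As a rough sketch of alternative A for Python 3 readers (an assumption; the session above is Python 2): there the csv module accepts decoded text, so it can take care of points (1) to (3) itself once the UTF-16 bytes are decoded. The sample bytes and the helper name parse_unicode_text are made up for illustration and simply mirror the rows shown above.

```python
import csv
import io

# Stand-in for Excel's "Unicode text" export: UTF-16 with BOM, tab-separated.
text = ('M\xfcller\t"\u20ac1234,56"\r\n'
        'M\xf6ller\t"\u20ac9876,54"\r\n'
        'Kawasaki\t\xa53456.78\r\n')
buff = text.encode("utf-16")


def parse_unicode_text(data):
    """Decode UTF-16 bytes and parse them as tab-separated CSV."""
    decoded = data.decode("utf-16")  # the BOM selects the byte order
    # newline="" passes \r\n through untouched so csv can interpret it.
    stream = io.StringIO(decoded, newline="")
    return list(csv.reader(stream, delimiter="\t"))


rows = parse_unicode_text(buff)
for row in rows:
    print(row)
```

The quotes around the money amounts are stripped by the reader, and no empty trailing row appears.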

Alternative B: you can do ODBC access to Excel spreadsheets; hmmm ...
yuk ... no better than CSV i.e. you get the data in your current code
page, not in Unicode:

[('M\xfcller', '\x801234,56'), ('M\xf6ller', '\x809876,54'),
('Kawasaki', '\xa53456.78')]

Alternative C: why not save your file as local-code-page .csv, use the
csv module, and DIY decode:

>>> rdr = csv.reader(file('csvtest.csv', 'rb'))
>>> for row in rdr:
...     print row
...     urow = [x.decode('cp1252') for x in row]
...     print urow
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xfcller', '\x801234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xf6ller', '\x809876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xa53456.78']
[u'Kawasaki', u'\xa53456.78']

Looks good to me, including the euro sign.
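
The euro sign only comes out right because the decode uses cp1252 rather than plain latin-1. A small Python 3 sketch of that difference (Python 3 is an assumption here, and read_csv_bytes and the sample bytes are illustrative names and data, not anything from the csv module):

```python
import csv
import io

# Bytes as a cp1252-encoded .csv might contain them (0x80 is the euro sign).
raw = b'Name,Amount\r\nM\xfcller,"\x801234,56"\r\nKawasaki,\xa53456.78\r\n'


def read_csv_bytes(data, encoding):
    """Decode a byte string with the given code page and parse it as CSV."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(stream))


rows_cp1252 = read_csv_bytes(raw, "cp1252")
rows_latin1 = read_csv_bytes(raw, "latin-1")

print(rows_cp1252[1])  # euro sign decoded as U+20AC
print(rows_latin1[1])  # byte 0x80 becomes the control character U+0080
```

The umlauts are identical in both encodings; only the byte 0x80 differs.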

HTH,

John

Jul 18 '05 #5

Chris

hi,
thanks for all the replies, I'll see if I can at least get the work done.

I guess my problem mainly was the rather mindflexing (at least for me)
encoding/decoding of strings...

But I guess it would be really helpful to put the UnicodeReader/Writer
in the docs

thanks a lot
chris

Jul 18 '05 #6

John Machin

Chris wrote:

> hi,
> thanks for all replies, I try if I can at least get the work done.
>
> I guess my problem mainly was the rather mindflexing (at least for me)
> coding/decoding of strings...
>
> But I guess it would be really helpful to put the UnicodeReader/Writer
> in the docs

UNFORTUNATELY the solution of saving the Excel .XLS to a .CSV doesn't
work if you have Unicode characters that are not in your Windows
code-page. Nor would it work in a CJK environment if the file was saved
in an MBCS encoding (e.g. Big5). A work-around appears possible, with
some more effort:

I have extended the previous sample XLS; there is now a last line with
IVANOV in Cyrillic letters [pardon my spelling etc etc if necessary].
My code-page is cp1252, which sure don't grok Russki :-)

I've saved it as CSV [no complaint from Excel] and as "Unicode text".

>>> buffc = file('csvtest2.csv', 'rb').read()
>>> buffc
'Name,Amount\r\nM\xfcller,"\x801234,56"\r\nM\xf6ller,"\x809876,54"\r\nKawasaki,\xa53456.78\r\n??????,"?5678,90"\r\n'
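
The question marks are what you would expect from a lossy encode. Python's own cp1252 codec with errors="replace" produces the same substitution; whether Excel does this the same way internally is an assumption on my part. A Python 3 sketch:

```python
name = "\u0418\u0412\u0410\u041d\u041e\u0412"  # IVANOV in Cyrillic
amount = "\u04205678,90"                       # Cyrillic letter used as currency sign

# errors="replace" substitutes "?" for anything cp1252 cannot represent.
lossy_name = name.encode("cp1252", errors="replace")
lossy_amount = amount.encode("cp1252", errors="replace")
print(lossy_name, lossy_amount)

# The default (errors="strict") raises instead of silently losing data.
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("strict encode succeeded:", strict_ok)
```

Once the data has been through this step the original characters cannot be recovered, which is why the CSV export is a dead end here.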

>>> buffu16 = file('csvtest2.txt', 'rb').read()
>>> buffu16
'\xff\xfeN\x00a\x00m\x00e\x00\t\x00A\x00m\x00o\x00u\x00n\x00t\x00\r\x00\n\x00
[snip] \x18\x04\x12\x04\x10\x04\x1d\x04\x1e\x04\x12\x04\t\x00"\x00 \x045\x006\x007\x008\x00,\x009\x000\x00"\x00\r\x00\n\x00'
>>> buffu = buffu16.decode('utf16')
>>> buffu
u'Name\tAmount\r\nM\xfcller\t"\u20ac1234,56"\r\nM\xf6ller\t"\u20ac9876,54"\r\nKawasaki\t\xa53456.78\r\n\u0418\u0412\u0410\u041d\u041e\u0412\t"\u04205678,90"\r\n'

Aside: this has removed the BOM. I understood (possibly incorrectly)
from a recent thread that Python codecs left the BOM in there, but hey
I'm not complaining :-)
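
On the BOM aside: the difference is between the generic 'utf-16' codec, which reads the BOM to pick a byte order and then drops it, and the explicit 'utf-16-le' / 'utf-16-be' codecs, which treat it as an ordinary character. A small Python 3 sketch (Python 3 assumed; the sample data is made up):

```python
import codecs

# Little-endian UTF-16 with a leading byte order mark (FF FE).
data = codecs.BOM_UTF16_LE + "Name".encode("utf-16-le")

with_generic = data.decode("utf-16")      # BOM consumed by the codec
with_explicit = data.decode("utf-16-le")  # BOM kept as U+FEFF

print(repr(with_generic))
print(repr(with_explicit))
```

So a leftover U+FEFF at the start of the first field usually means the data was decoded with an explicit-endian codec.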

As expected, this looks OK. The extra step required in the work-around
is to convert the utf16 file to utf8 and feed that to the csv reader.
Why utf8? (1) Every Unicode character can be represented, not just the
ones that are in your code-page (2) ASCII characters can't appear as part
of the representation of any other character -- i.e. ones that are
significant to csv (tab, comma, quote, \r, \n) can't cause errors by
showing up as part of another character e.g. CJK characters.
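
Point (2) can be checked directly. A sketch in Python 3 (an assumption; the function name utf8_is_csv_safe is invented for this example): every byte that UTF-8 uses for a non-ASCII character is 0x80 or above, whereas UTF-16 happily produces bytes that look like a comma.

```python
def utf8_is_csv_safe(text):
    """True if no non-ASCII character in text yields a UTF-8 byte below 0x80."""
    return all(
        byte >= 0x80
        for ch in text if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


sample = "M\xfcller \u20ac \u0418\u0412\u0410\u041d\u041e\u0412 \u65e5\u672c\u8a9e"
print(utf8_is_csv_safe(sample))

# Contrast: U+012C in little-endian UTF-16 is the bytes 2C 01,
# and 0x2C is the ASCII comma.
utf16_bytes = "\u012c".encode("utf-16-le")
print(b"," in utf16_bytes)
```

That is why the utf16 buffer cannot be fed to a byte-oriented csv reader as it stands, while the utf8 version can.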

>>> buffu8 = buffu.encode('utf8')
>>> buffu8
'Name\tAmount\r\nM\xc3\xbcller\t"\xe2\x82\xac1234,56"\r\nM\xc3\xb6ller\t"\xe2\x82\xac9876,54"\r\nKawasaki\t\xc2\xa53456.78\r\n\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92\t"\xd0\xa05678,90"\r\n'
>>> x = file('csvtest2.u8', 'wb')
>>> x.write(buffu8)
>>> x.close()
>>> import csv
>>> rdr = csv.reader(file('csvtest2.u8', 'rb'), delimiter='\t')
>>> for row in rdr:
...     print row
...     print [x.decode('utf8') for x in row]
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xc3\xbcller', '\xe2\x82\xac1234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xc3\xb6ller', '\xe2\x82\xac9876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xc2\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
['\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92', '\xd0\xa05678,90']
[u'\u0418\u0412\u0410\u041d\u041e\u0412', u'\u04205678,90']

Howzat?

Cheers,
John

Jul 18 '05 #7
