csv module and unicode, when or workaround?

hi,
to convert excel files via csv to xml or whatever I frequently use the
csv module which is really nice for quick scripts. problem are of course
non ascii characters like german umlauts, EURO currency symbol etc.

the current csv module cannot handle unicode the docs say, is there any
workaround or is unicode support planned for the near future? in most
cases support for characters in iso-8859-1(5) would be ok for my
purposes but of course full unicode support would be great...

obviously I am not a python pro, i did not even find the py source for
the module, it seemed to me it is a C based module?. is this also the
reason for the unicode unawareness?

thanks
chris
Jul 18 '05 #1

Chris> the current csv module cannot handle unicode the docs say, is
Chris> there any workaround or is unicode support planned for the near
Chris> future?

True, it can't.

Chris> obviously I am not a python pro, i did not even find the py
Chris> source for the module, it seemed to me it is a C based
Chris> module?. is this also the reason for the unicode unawareness?

Look in Modules/_csv.c and Lib/csv.py. The C-ness of the underlying module
is the main issue as far as I understand. If you have some C+Unicode-fu
(this goes for anyone reading this, not just Chris), feel free to try
writing a patch. Also, check out the csv mailing list:

http://orca.mojam.com/mailman/listinfo/csv

Skip
Jul 18 '05 #2

Chris> the current csv module cannot handle unicode the docs say, is
Chris> there any workaround or is unicode support planned for the near
Chris> future?

Skip> True, it can't.

Hmmm... I think the following should be a reasonable workaround in most
situations:

#!/usr/bin/env python
# Python 2 code: csv works on byte strings here, so each cell is
# decoded after reading and encoded before writing.

import csv

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.reader = csv.reader(f, dialect=dialect, **kwds)
        self.encoding = encoding

    def next(self):
        # Decode every cell of the next row into a unicode object.
        row = self.reader.next()
        return [unicode(s, self.encoding) for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.writer = csv.writer(f, dialect=dialect, **kwds)
        self.encoding = encoding

    def writerow(self, row):
        # Encode with the configured encoding, not a hard-coded "utf-8".
        self.writer.writerow([s.encode(self.encoding) for s in row])

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

if __name__ == "__main__":
    try:
        oldurow = [u'\u65E5\u672C\u8A9E',
                   u'Hi Mom -\u263a-!',
                   u'A\u2262\u0391.']
        writer = UnicodeWriter(open("uni.csv", "wb"))
        writer.writerow(oldurow)
        del writer  # drop the last reference so the file is flushed and closed

        reader = UnicodeReader(open("uni.csv", "rb"))
        newurow = reader.next()
        print "trivial test", newurow == oldurow and "passed" or "failed"
    finally:
        import os
        os.unlink("uni.csv")

If people don't find any egregious flaws with the concept I'll at least add
it as an example to the csv module docs. Maybe they would even work as
additions to the csv.py module, assuming the api is palatable.
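
For readers on Python 3, the same decode-on-read / encode-on-write idea can be sketched roughly as follows. This is only an illustration under assumptions: it targets Python 3 (the code above is Python 2), and the helper names read_unicode_rows and write_unicode_rows are made up for this sketch, not part of the csv module.

```python
import csv
import io

def read_unicode_rows(raw, encoding="utf-8", **kwds):
    """Yield rows of text from a binary stream, decoding with `encoding`."""
    # newline="" lets the csv module see the original line endings.
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    for row in csv.reader(text, **kwds):
        yield row

def write_unicode_rows(raw, rows, encoding="utf-8", **kwds):
    """Write rows of text to a binary stream, encoding with `encoding`."""
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    csv.writer(text, **kwds).writerows(rows)
    text.flush()
    text.detach()  # leave `raw` open for the caller

# Round-trip the same test row used in the script above.
rows = [["\u65e5\u672c\u8a9e", "Hi Mom -\u263a-!", "A\u2262\u0391."]]
buf = io.BytesIO()
write_unicode_rows(buf, rows)
buf.seek(0)
round_tripped = list(read_unicode_rows(buf))
print("trivial test", "passed" if round_tripped == rows else "failed")
```

The encoding is applied to the whole stream here rather than cell by cell, which is a design difference from the wrapper classes above.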

Skip
Jul 18 '05 #3
Chris wrote:
> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...


It doesn't support unicode, but you should not have a problem
importing/exporting encoded strings.

I have imported utf-8 encoded strings with no trouble. But I might just
have been lucky that they are inside the latin-1 range?

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
Jul 18 '05 #4

Chris wrote:
> hi,
> to convert excel files via csv to xml or whatever I frequently use the
> csv module which is really nice for quick scripts. problem are of course
> non ascii characters like german umlauts, EURO currency symbol etc.

The umlauted characters should not be a problem, they're all in the
first 256 characters. What makes you say they are a problem "of
course"?

> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...


Here's a perambulation through some of the alternatives:

A. If you save the file from Excel as "Unicode text", you can pretty
much DIY:
>>> buff = file('csvtest.txt', 'rb').read()
>>> lines = buff.decode('utf16').split(u'\r\n')
>>> lines
[u'M\xfcller\t"\u20ac1234,56"', u'M\xf6ller\t"\u20ac9876,54"',
u'Kawasaki\t\xa53456.78', u'']
>>> for line in lines:
...     print line.split(u'\t')
...
[u'M\xfcller', u'"\u20ac1234,56"']
[u'M\xf6ller', u'"\u20ac9876,54"']
[u'Kawasaki', u'\xa53456.78']
[u'']
All you have to do is handle (1) Excel's unnecessary quoting of the
comma in the money amounts [see first two lines above; what it quotes
is probably locale-dependent] (2) double quoting any quotes [no example
given] (3) ignore the empty "line" introduced by split().

Problem (3) is easy: if lines and not lines[-1]: del lines[-1]

Hmmm ... by the time you finish this (and generalise it) you will have
done the Unicode extension to the csv module ...
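
As a rough sketch of where that DIY route ends up, assuming Python 3 (where the csv module accepts decoded text), the quoting and trailing-line issues listed above can be left to csv itself. The raw bytes below are a stand-in for the Excel "Unicode text" export, built from the sample values in the session above.

```python
import csv
import io

# Stand-in for the bytes of an Excel "Unicode text" export: UTF-16 with a
# BOM, tab-separated, with the money amounts quoted as in the session above.
raw = 'M\u00fcller\t"\u20ac1234,56"\r\nKawasaki\t\u00a53456.78\r\n'.encode("utf-16")

text = raw.decode("utf-16")  # the generic codec consumes the BOM
reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
rows = list(reader)          # quotes are stripped, no trailing empty row

for row in rows:
    print(row)
```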

Alternative B: you can do ODBC access to Excel spreadsheets; hmmm ...
yuk ... no better than CSV i.e. you get the data in your current code
page, not in Unicode:

[('M\xfcller', '\x801234,56'), ('M\xf6ller', '\x809876,54'),
('Kawasaki', '\xa53456.78')]

Alternative C: why not save your file as local-code-page .csv, use the
csv module, and DIY decode:
>>> rdr = csv.reader(file('csvtest.csv', 'rb'))
>>> for row in rdr:
...     print row
...     urow = [x.decode('cp1252') for x in row]
...     print urow
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xfcller', '\x801234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xf6ller', '\x809876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xa53456.78']
[u'Kawasaki', u'\xa53456.78']

Looks good to me, including the euro sign.
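
The euro sign is the case where the choice of codec matters: byte 0x80 is the euro in cp1252, but latin-1 maps it to a control character. A small sketch, assuming Python 3 bytes/str semantics and using the Amount cell from the output above:

```python
cell = b"\x801234,56"  # the Amount cell as read from the cp1252 .csv

as_cp1252 = cell.decode("cp1252")   # 0x80 becomes U+20AC EURO SIGN
as_latin1 = cell.decode("latin-1")  # 0x80 becomes U+0080, a control character

print(repr(as_cp1252))
print(repr(as_latin1))
print(as_cp1252 == as_latin1)
```

So a file saved by Excel in a Windows code page should be decoded with that code page (cp1252 here), not with iso-8859-1.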

HTH,

John

Jul 18 '05 #5
hi,
thanks for all replies, I try if I can at least get the work done.

I guess my problem mainly was the rather mindflexing (at least for me)
coding/decoding of strings...

But I guess it would be really helpful to put the UnicodeReader/Writer
in the docs

thanks a lot
chris
Jul 18 '05 #6

Chris wrote:
> hi,
> thanks for all replies, I try if I can at least get the work done.
>
> I guess my problem mainly was the rather mindflexing (at least for me)
> coding/decoding of strings...
>
> But I guess it would be really helpful to put the UnicodeReader/Writer
> in the docs


UNFORTUNATELY the solution of saving the Excel .XLS to a .CSV doesn't
work if you have Unicode characters that are not in your Windows
code-page. Nor would it work in a CJK environment if the file was saved
in an MBCS encoding (e.g. Big5). A work-around appears possible, with
some more effort:

I have extended the previous sample XLS; there is now a last line with
IVANOV in Cyrillic letters [pardon my spelling etc etc if necessary].
My code-page is cp1252, which sure don't grok Russki :-)

I've saved it as CSV [no complaint from Excel] and as "Unicode text".
>>> buffc = file('csvtest2.csv', 'rb').read()
>>> buffc
'Name,Amount\r\nM\xfcller,"\x801234,56"\r\nM\xf6ller,"\x809876,54"\r\nKawasaki,\xa53456.78\r\n??????,"?5678,90"\r\n'

>>> buffu16 = file('csvtest2.txt', 'rb').read()
>>> buffu16
'\xff\xfeN\x00a\x00m\x00e\x00\t\x00A\x00m\x00o\x00u\x00n\x00t\x00\r\x00\n\x00
[snip] \x18\x04\x12\x04
\x10\x04\x1d\x04\x1e\x04\x12\x04\t\x00"\x00 \x045\x006\x007\x008\x00,\x009\x000\x00"\x00\r\x00\n\x00'
>>> buffu = buffu16.decode('utf16')
>>> buffu
u'Name\tAmount\r\nM\xfcller\t"\u20ac1234,56"\r\nM\xf6ller\t"\u20ac9876,54"\r\nKawasaki\t\xa53456.78\r\n\u0418\u0412\u0410\u041d\u041e\u0412\t"\u04205678,90"\r\n'

Aside: this has removed the BOM. I understood (possibly incorrectly)
from a recent thread that Python codecs left the BOM in there, but hey
I'm not complaining :-)
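
On the BOM aside: as far as I know, the difference is between the generic codec and the endian-specific ones. A small sketch, assuming Python 3 and the standard codecs module, with made-up sample data:

```python
import codecs

# Little-endian UTF-16 data preceded by a BOM, like the '\xff\xfe' above.
data = codecs.BOM_UTF16_LE + "Name".encode("utf-16-le")

with_generic = data.decode("utf-16")      # BOM is used to pick the byte order, then dropped
with_explicit = data.decode("utf-16-le")  # BOM is kept as U+FEFF

print(repr(with_generic))
print(repr(with_explicit))
```

That would explain both observations: 'utf16' strips the BOM, while a codec that names the byte order leaves it in.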

As expected, this looks OK. The extra step required in the work-around
is to convert the utf16 file to utf8 and feed that to the csv reader.
Why utf8? (1) Every Unicode character can be represented, not just ones
that are in your code-page (2) ASCII characters can't appear as part
of the representation of any other character -- i.e. ones that are
significant to csv (tab, comma, quote, \r, \n) can't cause errors by
showing up as part of another character e.g. CJK characters.
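
Point (2) can be checked directly. The sketch below assumes Python 3; the helper name ascii_bytes_in_multibyte is made up for illustration. It collects any ASCII-range bytes that occur inside the encoded form of non-ASCII characters, and contrasts UTF-8 with Big5.

```python
def ascii_bytes_in_multibyte(text, encoding):
    """Return the ASCII-range bytes found inside the encodings of the
    non-ASCII characters of `text`."""
    found = set()
    for ch in text:
        if ord(ch) < 0x80:
            continue  # plain ASCII characters are not of interest here
        found.update(b for b in ch.encode(encoding) if b < 0x80)
    return found

sample = "\u529f\u0418"  # one CJK character and one Cyrillic letter

utf8_hits = ascii_bytes_in_multibyte(sample, "utf-8")
# Big5 cannot encode the Cyrillic letter in general, so test the CJK one only.
big5_hits = ascii_bytes_in_multibyte("\u529f", "big5")

print(utf8_hits)                        # empty: UTF-8 never reuses ASCII bytes
print([hex(b) for b in big5_hits])      # the trail byte here is in the ASCII range
```

In UTF-8 every byte of a multi-byte character is 0x80 or above, so a byte-oriented parser cannot mistake part of a character for a delimiter or quote.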
>>> buffu8 = buffu.encode('utf8')
>>> buffu8
'Name\tAmount\r\nM\xc3\xbcller\t"\xe2\x82\xac1234,56"\r\nM\xc3\xb6ller\t"\xe2\x82\xac9876,54"\r\nKawasaki\t\xc2\xa53456.78\r\n\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92\t"\xd0\xa05678,90"\r\n'
>>> x = file('csvtest2.u8', 'wb')
>>> x.write(buffu8)
>>> x.close()
>>> import csv
>>> rdr = csv.reader(file('csvtest2.u8', 'rb'), delimiter='\t')
>>> for row in rdr:
...     print row
...     print [x.decode('utf8') for x in row]
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xc3\xbcller', '\xe2\x82\xac1234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xc3\xb6ller', '\xe2\x82\xac9876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xc2\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
['\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92', '\xd0\xa05678,90']
[u'\u0418\u0412\u0410\u041d\u041e\u0412', u'\u04205678,90']


Howzat?

Cheers,
John

Jul 18 '05 #7
