getting data with proper encoding to the finish | | |
Hi,
I have a little problem with encoding. Was hoping maybe anyone can
help me to solve it.
There is some amount of data in a database (PG) that must be inserted
into Excel sheet and emailed. Nothing special, everything works.
Except that non-ascii characters are not displayed properly.
The data is stored as XML into a text field. When I use pgsql it's
displayed good in the terminal. Now I run my script and print data
with "print" statement, still goed. Then I use pyXLWriter to write the
sheet, and Python email package to email it... and the resulting sheet
is not good:
Г is displayed instead of ü (for example)
The most important question that I'd like to ask, is what is the
proper way to debug it?! How can I determine where it goes wrong
(maybe by calculating character code or something?) I know very little
about how encoding works, I use it, but I don't really understand
it... any dummy-proof pointers are also appreciated :)
Thanks...
--
Ksenia | | | | re: getting data with proper encoding to the finish
Ksenia Marasanova wrote:[color=blue]
> Hi,
>
> I have a little problem with encoding. Was hoping maybe anyone can
> help me to solve it.
>
> There is some amount of data in a database (PG) that must be inserted
> into Excel sheet and emailed. Nothing special, everything works.
> Except that non-ascii characters are not displayed properly.
> The data is stored as XML into a text field.[/color]
This sentence doesn't make much sense. Explain.
[color=blue]
> When I use pgsql it's
> displayed good in the terminal. Now I run my script and print data
> with "print" statement, still goed.[/color]
Instead of "print data", do "print repr(data)" and show us what you
get. What *you* see on the screen is not much use for diagnosis; it's
the values of the bytes in the file that matter.
[color=blue]
> Then I use pyXLWriter to write the
> sheet,[/color]
Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
a Notepad window, save the Notepad file as Unicode type named (say)
"junk.u16" then at the Python interactive prompt do this:
file("junk.u16", "rb").read().decode("utf16")
and show us what you get.
[color=blue]
> and Python email package to email it... and the resulting sheet
> is not good:[/color]
E-mailed how? To whom? [I.e. what country / what cultural background /
on what machine / what operating system / viewed using what software]
[color=blue]
>
> Г is displayed instead of ü (for example)[/color]
You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
displayed instead of U+00FC (Latin small letter U with diaeresis).
OK, we'd already guessed your background from your name :-)
However, what you see isn't necessarily what you've got. How do you
know it's not U+0393 (Greek capital letter GAMMA) or something else
that looks the same? Could even be from a line-drawing set (top left
corner of a box). What you need to do is find out the ordinal of the
character being displayed.
This type of problem arises when a character is written in one encoding
and viewed using another. I've had a quick look through various
likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
encodings that would reproduce anything like your "umlauted-u becomes
gamma-or-similar" problem. Please supply more than 1 example. | | | | re: getting data with proper encoding to the finish
> > There is some amount of data in a database (PG) that must be inserted[color=blue][color=green]
> > into Excel sheet and emailed. Nothing special, everything works.
> > Except that non-ascii characters are not displayed properly.
> > The data is stored as XML into a text field.[/color]
>
> This sentence doesn't make much sense. Explain.[/color]
Sorry, I meant: I use field of the type 'text' in a Postgres table to
store my data. The data is a XML string.
[color=blue]
> Instead of "print data", do "print repr(data)" and show us what you
> get. What *you* see on the screen is not much use for diagnosis; it's
> the values of the bytes in the file that matter.[/color]
Thanks for this valuable tip. I take letter "é" as an example.
"print repr(data)" shows this:
u'\xe9'
[color=blue]
> Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
> a Notepad window, save the Notepad file as Unicode type named (say)
> "junk.u16" then at the Python interactive prompt do this:
>
> file("junk.u16", "rb").read().decode("utf16")
>
> and show us what you get.[/color]
(I am on a Mac so I used Textedit to create a UTF-16 encoded file, right?)
The result from Python is:
u'\u0439'
In Excel sheet it is shown as: й
(Russian again?!)
[color=blue]
> [color=green]
> > and Python email package to email it... and the resulting sheet
> > is not good:[/color]
>
> E-mailed how? To whom? [I.e. what country / what cultural background /
> on what machine / what operating system / viewed using what software][/color]
Emailed with Python, please see the code at the end of the message.
The receiving system is OS X with languages priority: Dutch, English,
German, Russian and Hebrew. Viewer: MS Office 2004.
[color=blue]
> You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
> displayed instead of U+00FC (Latin small letter U with diaeresis).
>
> OK, we'd already guessed your background from your name :-)[/color]
:-) [color=blue]
>
> However, what you see isn't necessarily what you've got. How do you
> know it's not U+0393 (Greek capital letter GAMMA) or something else
> that looks the same? Could even be from a line-drawing set (top left
> corner of a box). What you need to do is find out the ordinal of the
> character being displayed.
>
> This type of problem arises when a character is written in one encoding
> and viewed using another. I've had a quick look through various
> likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
> cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
> encodings that would reproduce anything like your "umlauted-u becomes
> gamma-or-similar" problem. Please supply more than 1 example.[/color]
Thank you very much for your help and explanation! The "é" letter is
what I could find till now, see examples above..
The following code fragment is used for creating Excel sheet and sending email:
# ##################
# Create Excel sheet
#
f = StringIO()
workbook = xl.Writer(f)
worksheet = workbook.add_worksheet()
bold = workbook.add_format(bold=1)
border = workbook.add_format(border=1)
worksheet.write_row('A1', ['First Name', 'Last Name'], border)
i = 1
for row in result:
print repr(row['firstname'])
datarow = [row.get('firstname'), row.get('surname')]
i += 1
worksheet.write_row('A%s' % i, datarow)
workbook.close()
####################
# Create email message
# Create the container (outer) email message.
msg = MIMEMultipart()
# Attach Excel sheet
xls = MIMEBase('application', 'vnd.ms-excel')
xls.set_payload(f.getvalue())
Encoders.encode_base64(xls)
xls.add_header('Content-Disposition', 'attachment', filename='some
file name %s-%s-%s.xls' % (today.day, today.month, today.year))
msg.attach(xls)
msg['Subject'] = subject
msg['From'] = fromaddr
msg['To'] = toaddr
# Guarantees the message ends in a newline
msg.epilogue = ''
# Send message
s = smtplib.SMTP(smtp_host)
s.sendmail(fromaddr, to_list, msg.as_string())
--
Ksenia | | | | re: getting data with proper encoding to the finish
Ksenia Marasanova wrote:[color=blue][color=green][color=darkred]
> > > There is some amount of data in a database (PG) that must be[/color][/color][/color]
inserted[color=blue][color=green][color=darkred]
> > > into Excel sheet and emailed. Nothing special, everything works.
> > > Except that non-ascii characters are not displayed properly.
> > > The data is stored as XML into a text field.[/color]
> >
> > This sentence doesn't make much sense. Explain.[/color]
>
> Sorry, I meant: I use field of the type 'text' in a Postgres table to
> store my data. The data is a XML string.
>[color=green]
> > Instead of "print data", do "print repr(data)" and show us what you
> > get. What *you* see on the screen is not much use for diagnosis;[/color][/color]
it's[color=blue][color=green]
> > the values of the bytes in the file that matter.[/color]
>
> Thanks for this valuable tip. I take letter "é" as an example.
>
> "print repr(data)" shows this:
> u'\xe9'
>
>[color=green]
> > Open the spreadsheet with Microsoft Excel, copy-and-paste some data[/color][/color]
to[color=blue][color=green]
> > a Notepad window, save the Notepad file as Unicode type named (say)
> > "junk.u16" then at the Python interactive prompt do this:
> >
> > file("junk.u16", "rb").read().decode("utf16")
> >
> > and show us what you get.[/color]
>
> (I am on a Mac so I used Textedit to create a UTF-16 encoded file,[/color]
right?)[color=blue]
> The result from Python is:
>
> u'\u0439'[/color]
So Unicode U+00E9 has become U+0439? Magic! I suspect that there is a
conversion step or two in there that you haven't mentioned. Are you
talking about the spreadsheet after it is created by your script on the
machine that created it [which is what I asked], or are you talking
about the spreadsheet on the recipient's machine, or are you talking
about the spreadsheet after the recipient has e-mailed it back to you,
hopefully untouched?
[color=blue]
>
> In Excel sheet it is shown as: й
>
> (Russian again?!)[/color]
This is probably indicative that the Latin-1 e-acute (0xE9) is being
converted to Unicode U+0439 by something that thinks it is actually in
an 8-bit Cyrillic encoding (0xE9 is "Cyrillic small letter short I" in
some 8-bit encodings) but the u-umlaut becoming GHE example doesn't fit
this story.
Please do a test where you put several different accented Latin letters
in the one field in your database. No, put ALL the non-ASCII characters
that you expect to be transmitted unchanged into test fields -- this
will make you think about what language(s)/locale(s) your database is
designed for and what language(s)/locale(s) your e-mail targets use.
Having this test data will be useful in the future for verifying that
your system works. Repeat all the above steps. Tell us what you see in
Excel on your machine and on the recipient's machine.
[color=blue]
>[color=green]
> >[color=darkred]
> > > and Python email package to email it... and the resulting sheet
> > > is not good:[/color]
> >
> > E-mailed how? To whom? [I.e. what country / what cultural[/color][/color]
background /[color=blue][color=green]
> > on what machine / what operating system / viewed using what[/color][/color]
software][color=blue]
>
> Emailed with Python, please see the code at the end of the message.
> The receiving system is OS X with languages priority: Dutch,[/color]
English,[color=blue]
> German, Russian and Hebrew. Viewer: MS Office 2004.[/color]
Sending system is ...?
[color=blue]
>
>[color=green]
> > You are saying (in effect) U+0413 (Cyrillic upper case letter GHE)[/color][/color]
is[color=blue][color=green]
> > displayed instead of U+00FC (Latin small letter U with diaeresis).
> >
> > OK, we'd already guessed your background from your name :-)[/color]
>
> :-)[color=green]
> >
> > However, what you see isn't necessarily what you've got. How do you
> > know it's not U+0393 (Greek capital letter GAMMA) or something else
> > that looks the same? Could even be from a line-drawing set (top[/color][/color]
left[color=blue][color=green]
> > corner of a box). What you need to do is find out the ordinal of[/color][/color]
the[color=blue][color=green]
> > character being displayed.
> >
> > This type of problem arises when a character is written in one[/color][/color]
encoding[color=blue][color=green]
> > and viewed using another. I've had a quick look through various
> > likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
> > cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
> > encodings that would reproduce anything like your "umlauted-u[/color][/color]
becomes[color=blue][color=green]
> > gamma-or-similar" problem. Please supply more than 1 example.[/color]
>
> Thank you very much for your help and explanation! The "é" letter is
> what I could find till now, see examples above..
>
>
>
>
> The following code fragment is used for creating Excel sheet and[/color]
sending email:[color=blue]
> # ##################
> # Create Excel sheet
> #
> f = StringIO()
> workbook = xl.Writer(f)
> worksheet = workbook.add_worksheet()
> bold = workbook.add_format(bold=1)
> border = workbook.add_format(border=1)
>
> worksheet.write_row('A1', ['First Name', 'Last Name'], border)
> i = 1
> for row in result:
> print repr(row['firstname'])
> datarow = [row.get('firstname'), row.get('surname')][/color]
You print the repr() of row['firstname'] but pass row.get('firstname')
to the XLS writer -- do they have the same value? For believability,
print what you pass!!
[color=blue]
> i += 1
> worksheet.write_row('A%s' % i, datarow)[/color]
Check the documentation for the XLS writer module to see if it is doing
an implicit conversion here.
[color=blue]
> workbook.close()
> ####################
> # Create email message
>
> # Create the container (outer) email message.
> msg = MIMEMultipart()
> # Attach Excel sheet
> xls = MIMEBase('application', 'vnd.ms-excel')
> xls.set_payload(f.getvalue())
> Encoders.encode_base64(xls)
> xls.add_header('Content-Disposition', 'attachment', filename='some
> file name %s-%s-%s.xls' % (today.day, today.month, today.year))
> msg.attach(xls)
>
> msg['Subject'] = subject
> msg['From'] = fromaddr
> msg['To'] = toaddr
> # Guarantees the message ends in a newline
> msg.epilogue = ''
>
> # Send message
> s = smtplib.SMTP(smtp_host)
> s.sendmail(fromaddr, to_list, msg.as_string())
>
> --
> Ksenia[/color] | | | | re: getting data with proper encoding to the finish
Ksenia Marasanova wrote:[color=blue]
> Sorry, I meant: I use field of the type 'text' in a Postgres table to
> store my data. The data is a XML string.
>[color=green]
> > Instead of "print data", do "print repr(data)" and show us what you
> > get. What *you* see on the screen is not much use for diagnosis;[/color][/color]
it's[color=blue][color=green]
> > the values of the bytes in the file that matter.[/color]
>
> Thanks for this valuable tip. I take letter "" as an example.
>
> "print repr(data)" shows this:
> u'\xe9'[/color]
That doesn't look like an "XML string" to me. Show the WHOLE contents
of the field.
Have you read the docs of the Perl module of which pyXLWrtiter is a
docs-free port? Right down the end it mutters something about XML
parsers returning UTF8 which will jam up the works if fed into an Excel
spreadsheet ... | | | | re: getting data with proper encoding to the finish
John Machin wrote:[color=blue]
> Ksenia Marasanova wrote:[color=green]
>> Sorry, I meant: I use field of the type 'text' in a Postgres table to
>> store my data. The data is a XML string.
>>[color=darkred]
>>> Instead of "print data", do "print repr(data)" and show us what you
>>> get. What *you* see on the screen is not much use for diagnosis;
>>> it's the values of the bytes in the file that matter.[/color]
>>
>> Thanks for this valuable tip. I take letter "" as an example.
>>
>> "print repr(data)" shows this:
>> u'\xe9'[/color]
>
> That doesn't look like an "XML string" to me. Show the WHOLE contents
> of the field.
>
> Have you read the docs of the Perl module of which pyXLWrtiter is a
> docs-free port? Right down the end it mutters something about XML
> parsers returning UTF8 which will jam up the works if fed into an
> Excel spreadsheet ...[/color]
Looking at the following function in pyXLWriter
def _asc2ucs(s):
"""Convert ascii string to unicode."""
return "\x00".join(s) + "\x00"
I can guess several things:
a) pyXLWriter author is an ascii guy :)
b) unicode strings are not supported by pyXLWriter
c) excel keeps unicode text in utf-16le
Ksenia, try encoding unicode strings in utf-16le before passing them to
pyXLWriter . If that doesn't work that means pyXLWriter requires
changes to support unicode strings.
Serge. | | | | re: getting data with proper encoding to the finish
Serge Orlov wrote:[color=blue]
> Looking at the following function in pyXLWriter
> def _asc2ucs(s):
> """Convert ascii string to unicode."""
> return "\x00".join(s) + "\x00"
>
> I can guess several things:
> a) pyXLWriter author is an ascii guy :)[/color]
Shrewd guess :-)
[color=blue]
> b) unicode strings are not supported by pyXLWriter[/color]
But that _asc2ucs() is used ONLY in write_url* methods ... so there's
hope yet.
[color=blue]
> c) excel keeps unicode text in utf-16le[/color]
Uh-huh. MS-everything is LE.
[color=blue]
>
> Ksenia, try encoding unicode strings in utf-16le before passing them[/color]
to[color=blue]
> pyXLWriter . If that doesn't work that means pyXLWriter requires
> changes to support unicode strings.
>
> Serge.[/color] | | | | re: getting data with proper encoding to the finish
John, Serge, thanks for your help!
utf-16le encoding didn't help. I had however to solve it yesterday,
so I used csv module to create CSV file and then import it in Excel.
Excel still had troubles with accented characters, but this is another
story: it seems that Office 2004 Excel (for Mac, but I assume the PC
version is no better) cannot import UTF-8 encoded text files. Encoding
CSV file with Latin1 encoding finally did work.
Now back to the Excel story, I also think that there is something
wrong with pyExcelWriter or the way I use it. CSV file generation was
okay, so I think there is nothing wrong with my data, or XML parser.
I will resume in a few days with pyExcelWriter and will post the
results here, but anyway, many thanks for your time and explanation!
--
Ksenia | | | | re: getting data with proper encoding to the finish
Ksenia Marasanova wrote:[color=blue]
> John, Serge, thanks for your help!
>
> utf-16le encoding didn't help. I had however to solve it yesterday,
> so I used csv module to create CSV file and then import it in Excel.
> Excel still had troubles with accented characters, but this is[/color]
another[color=blue]
> story: it seems that Office 2004 Excel (for Mac, but I assume the PC
> version is no better) cannot import UTF-8 encoded text files.[/color]
Right, I tried on windows xp, utf-8 csv file is imported as garbadge.
However, csv file saved in utf-16 encoding is imported correctly.
[color=blue]
> Encoding
> CSV file with Latin1 encoding finally did work.
>
> Now back to the Excel story, I also think that there is something
> wrong with pyExcelWriter or the way I use it. CSV file generation was
> okay, so I think there is nothing wrong with my data, or XML parser.
>
> I will resume in a few days with pyExcelWriter and will post the
> results here, but anyway, many thanks for your time and explanation![/color]
I believe Microsoft Office has gone through byte strings to unicode
strings transformation between 1995 and 1997. I still remember times
when you could receive Microsoft Office file and couldn't view it.
I suspect pyExcelWriter writes strings in that old format so utf-16le
trick didn't work. You can try to contact pyExcelWriter author
and ask him about unicode support.
Serge. | | | | re: getting data with proper encoding to the finish
Ksenia Marasanova wrote:[color=blue]
> John, Serge, thanks for your help![/color]
Thank *you* for having interesting problems :-)
[color=blue]
>
> utf-16le encoding didn't help. I had however to solve it yesterday,
> so I used csv module to create CSV file and then import it in Excel.
> Excel still had troubles with accented characters, but this is[/color]
another[color=blue]
> story: it seems that Office 2004 Excel (for Mac, but I assume the PC
> version is no better) cannot import UTF-8 encoded text files.[/color]
Encoding[color=blue]
> CSV file with Latin1 encoding finally did work.[/color]
Yes, Excel appears not to understand UTF-8. It interprets CSV files
according to the current locale / codepage / whatever -- the "old bad
way" that Unicode is meant to save us from.
An alternative, if you need to represent more than one codepage, or
want a "new good way" of doing it: Excel allows "Save As" to "Unicode
Text" format. It uses Unicode tab u'\t' as delimiter. It quotes tabs,
quotes quotes by doubling them, and [weirdly] also quotes cells which
have only a comma [maybe locale-dependent] in them. It quite happily
opens such files without data loss. You should be able to make such
files easily with Python.
Here's a dump of such a file created by Excel 2002 on Windows -- pls
pardon any Cyrillic spelling errors :-)
[color=blue][color=green][color=darkred]
>>> file('C:/junk/km_u16.txt', 'rb').read().decode('utf16')[/color][/color][/color]
u'\u041c\u0430\u0440\u0430\u0441\u0430\u043d\u043e \u0432\u0430\t\u041a\u0441\u0435\u043d\u044f\r\n"c omma,
comma, comma"\t\r\n"quote ""Hello UniWorld""
unquote"\t\r\n"tab\ttab"\t\r\n'
[color=blue][color=green][color=darkred]
>>> print file('C:/junk/km_u16.txt', 'rb').read().decode('utf16')[/color][/color][/color]
Марасанова Ксеня
"comma, comma, comma"
"quote ""Hello UniWorld"" unquote"
"tab tab"
To make such a file, you would need a quoter function something like
this; you would apply it to each field:
[color=blue][color=green][color=darkred]
>>> def unitextquoter(s, quote=u'"', alt_delim=u','):[/color][/color][/color]
! if quote in s:
! return quote + s.replace(quote, quote+quote) + quote
! if alt_delim in s or u'\t' in s:
! return quote + s + quote
! return s
[color=blue][color=green][color=darkred]
>>> unitextquoter(u'comma, comma, comma')[/color][/color][/color]
u'"comma, comma, comma"'[color=blue][color=green][color=darkred]
>>> unitextquoter(u'tab\ttab')[/color][/color][/color]
u'"tab\ttab"'[color=blue][color=green][color=darkred]
>>> unitextquoter(u'quote "Hello UniWorld" unquote')[/color][/color][/color]
u'"quote ""Hello UniWorld"" unquote"'[color=blue][color=green][color=darkred]
>>>[/color][/color][/color]
Then you would do u'\t'.join(fields) , add on u'\r\n' [or whatever is
needed in your environment], .encode('utf16') and .write() to your 'wb'
file.
[color=blue]
>
> Now back to the Excel story, I also think that there is something
> wrong with pyExcelWriter or the way I use it. CSV file generation was
> okay, so I think there is nothing wrong with my data, or XML parser.
>
> I will resume in a few days with pyExcelWriter and will post the
> results here, but anyway, many thanks for your time and explanation![/color]
I've been reading the source and looking at the Excel file specs
[available from openoffice.org if you're very short of reading
material!]. Apparently pyXLWriter doesn't handle Unicode at all.
Although Unicode came in with Excel 1997 (BIFF8 format file),
pyXLWriter appears to support only Excel 5(?) (BIFF5 format file). As
Serge suggested, appeal to the porter to appeal to the author of the
Perl module it's ported from; but don't hold your breath in the
meantime.
Cheers,
John |  | | | | /bytes/about
We are a network of experts and professionals in IT and software development that help one another with answers to tough questions and share insights.
Get the best answers to your questions from over 226,223 network members.
|