
ascii to unicode line endings

The code:

import codecs

udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")

udlUNI.write(udlASCII.read())

udlUNI.close()
udlASCII.close()

This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as
0x0D/0x0A.

I have tried various 2-byte Unicode encodings, but it doesn't seem to
make a difference. I have also tried modifying the code to read and
convert a line at a time, but that didn't make any difference either.

I have tried to understand the Unicode docs, but nothing seems to
indicate why a seemingly incorrect conversion is being done.
Obviously I am missing something blindingly obvious here; any help
much appreciated.

Dom

May 2 '07 #1
5 Replies


On 2 May, 17:29, Jean-Paul Calderone <exar...@divmod.com> wrote:

Consider this simple example:
>>> import codecs
>>> f = codecs.open('test-newlines-file', 'w', 'utf16')
>>> f.write('\r\n')
>>> f.close()
>>> f = file('test-newlines-file')
>>> f.read()
'\xff\xfe\r\x00\n\x00'
>>>

And how it differs from your example. Are you sure you're examining
the resulting output properly?

By the way, "\r\0\n\0" isn't a "unicode line ending", it's just the UTF-16
encoding of "\r\n".
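
To make the point about UTF-16 concrete, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2) and only uses the built-in `str.encode`; the variable names are illustrative. Each character of "\r\n" simply becomes two bytes, and the plain "utf-16" codec additionally prepends a byte order mark.

```python
# Python 3 sketch; the thread itself uses Python 2.
text = "\r\n"

# "utf-16" prepends a byte order mark (BOM) and uses the machine's byte order.
with_bom = text.encode("utf-16")

# "utf-16-le" and "utf-16-be" fix the byte order and write no BOM.
little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")

print(little_endian)  # b'\r\x00\n\x00'
print(big_endian)     # b'\x00\r\x00\n'
print(len(with_bom))  # 6: two bytes of BOM plus two bytes per character
```

On a little-endian machine the first result matches the bytes in the interpreter session above, minus the leading '\xff\xfe' BOM.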

Jean-Paul

I am not sure what you are driving at here, since I started with an
ASCII file, whereas you just write a Unicode file to start with. I
guess the direct question is "is there a simple way to convert my
ASCII file to a UTF-16 file?". I thought either string.encode() or
writing to a UTF-16 file would do the trick, but it probably isn't that
simple!

I used a binary file editor I have used a great deal for all sorts of
things to get the hex values.

Dom

May 3 '07 #2

That code (using my own local files, of course) basically works for me.

If I open my input file with mode 'r', as you did above, my '\r\n'
pairs get transformed to '\n' when I read them in and are written to
my output file as 0x00 0x0A. If I open the input file in binary mode
'rb' then my output file shows the expected sequence of 0x00 0x0D 0x00
0x0A.
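
The difference between the two read modes can be reproduced with a small self-contained sketch. It assumes Python 3, where text mode applies universal newlines on every platform (in Python 2 on Windows, as in this thread, the translation is done for files opened with 'r'). The scratch file and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file containing Windows-style line endings.
path = os.path.join(tempfile.mkdtemp(), "crlf.txt")
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\r\n")

# Text mode: universal newlines collapse "\r\n" to "\n" on read.
with open(path, "r", encoding="ascii") as f:
    text_mode = f.read()

# Binary mode: the bytes come back exactly as stored on disk.
with open(path, "rb") as f:
    binary_mode = f.read()

print(repr(text_mode))    # 'one\ntwo\n'
print(repr(binary_mode))  # b'one\r\ntwo\r\n'
```

Anything encoded from `text_mode` has already lost its carriage returns, which is consistent with the 0x0A-only output described in this thread.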

Perhaps there's a quirk of your version of python or your platform? I'm running
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32

--
Jerry
May 3 '07 #3

On 3 May, 13:00, Jean-Paul Calderone <exar...@divmod.com> wrote:

There's no such thing as a Unicode file. The only difference between
the code you posted and the code I posted is that mine is self-contained
and demonstrates that the functionality works as you expected it to work,
whereas the code you posted requires external resources which are not
available to run and produces external results which are not available to
be checked for correctness.

So what I'm driving at is that both your example and mine are doing it
correctly (because they are doing the same thing), and mine demonstrates
that it is correct, but we have to take your word on the fact that yours
doesn't work. ;)

Jean-Paul

Thanks for the advice. I cannot prove what is going on. The following
code seems to work fine as far as console output goes, but the actual
bit patterns of the files on disk are not what I am expecting (or
expected as input by the ultimate user of the converted file). Which I
can't prove of course.
>>> import codecs
>>> testASCII = file("c:\\temp\\test1.txt",'w')
>>> testASCII.write("\n")
>>> testASCII.close()
>>> testASCII = file("c:\\temp\\test1.txt",'r')
>>> testASCII.read()
'\n'
Bit pattern on disk: \0x0D\0x0A
>>> testASCII.seek(0)
>>> testUNI = codecs.open("c:\\temp\\test2.txt",'w','utf16')
>>> testUNI.write(testASCII.read())
>>> testUNI.close()
>>> testUNI = file("c:\\temp\\test2.txt",'r')
>>> testUNI.read()
'\xff\xfe\n\x00'
Bit pattern on disk: \0xff\0xfe\0x0a\0x00
Bit pattern I was expecting: \0xff\0xfe\0x0d\0x00\0x0a\0x00
>>> testUNI.close()
Dom

May 3 '07 #4

Thanks very much! Not sure if you intended to fix my whole problem,
but changing the read mode to 'rb' has done the trick :)
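
For reference, the whole conversion with the binary-mode fix might look like the sketch below. It assumes Python 3 rather than the Python 2 used in this thread; the function name `ascii_to_utf16`, the scratch paths, and the sample contents are hypothetical.

```python
import os
import tempfile


def ascii_to_utf16(src_path, dst_path):
    """Re-encode an ASCII file as UTF-16, leaving line endings untouched."""
    # Read raw bytes so CR LF pairs are not collapsed to a bare LF.
    with open(src_path, "rb") as src:
        text = src.read().decode("ascii")
    # Write raw bytes as well, so nothing is translated on the way out.
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-16"))


# Usage with hypothetical scratch files.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")
with open(src_path, "wb") as f:
    f.write(b"first line\r\nsecond line\r\n")

ascii_to_utf16(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()
print(converted[:2])  # the byte order mark written by the "utf-16" codec
```

Both files are handled as bytes here, so the only transformation applied is the change of encoding.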

Dom

May 3 '07 #5

Files opened with `codecs.open()` are always opened in binary mode. So if
you want '\n' to be translated into a platform specific character sequence
you have to do it yourself.
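
Doing the translation by hand could look like the following sketch. It assumes Python 3 and a Windows-style CR LF target; the helper name `to_crlf` is hypothetical and the result is encoded in memory rather than written through `codecs.open()`.

```python
def to_crlf(text):
    # Normalize first so an existing CR LF pair does not become CR CR LF,
    # then expand every bare LF into CR LF.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


# Mixed input: one bare LF and one CR LF pair.
sample = "first\nsecond\r\n"
translated = to_crlf(sample)
encoded = translated.encode("utf-16-le")

print(repr(translated))  # 'first\r\nsecond\r\n'
```

The translated string can then be written to a file opened in binary mode, or through a codec writer, without any further newline handling.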

Ciao,
Marc 'BlackJack' Rintsch
May 3 '07 #6
