
ascii to unicode line endings

The code:

import codecs

udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")

udlUNI.write(udlASCII.read())

udlUNI.close()
udlASCII.close()

This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as 0x0D/0x0A.

I have tried various 2 byte unicode encodings but it doesn't seem to
make a difference. I have also tried modifying the code to read and
convert a line at a time, but that didn't make any difference either.

I have tried to understand the unicode docs but nothing seems to
indicate why a seemingly incorrect conversion is being done.
Obviously I am missing something blindingly obvious here, any help
much appreciated.

Dom

May 2 '07 #1
On 2 May, 17:29, Jean-Paul Calderone <exar...@divmod.com> wrote:
On 2 May 2007 09:19:25 -0700, f...@clara.co.uk wrote:
The code:
import codecs
udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")
udlUNI.write(udlASCII.read())
udlUNI.close()
udlASCII.close()
This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as 0x0D/0x0A.
I have tried various 2 byte unicode encodings but it doesn't seem to
make a difference. I have also tried modifying the code to read and
convert a line at a time, but that didn't make any difference either.
I have tried to understand the unicode docs but nothing seems to
indicate why a seemingly incorrect conversion is being done.
Obviously I am missing something blindingly obvious here, any help
much appreciated.

Consider this simple example:
>>> import codecs
>>> f = codecs.open('test-newlines-file', 'w', 'utf16')
>>> f.write('\r\n')
>>> f.close()
>>> f = file('test-newlines-file')
>>> f.read()
'\xff\xfe\r\x00\n\x00'
>>>

And how it differs from your example. Are you sure you're examining
the resulting output properly?

By the way, "\r\0\n\0" isn't a "unicode line ending", it's just the UTF-16
encoding of "\r\n".

Jean-Paul
I am not sure what you are driving at here, since I started with an
ascii file, whereas you just write a unicode file to start with. I
guess the direct question is "is there a simple way to convert my
ascii file to a utf16 file?". I thought either string.encode() or
writing to a utf16 file would do the trick but it probably isn't that
simple!

I used a binary file editor, one I have used a great deal for all sorts
of things, to get the hex values.
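
(For reference, the same hex check can be done from Python itself rather than an external editor; a minimal sketch, assuming the output path used above:)

# Dump the first bytes of the converted file as hex (Python 2).
raw = open("c:\\temp\\CSVDB2.udl", 'rb').read()
print ' '.join('%02X' % ord(b) for b in raw[:16])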

Dom

May 3 '07 #2
On 2 May 2007 09:19:25 -0700, fi***@clara.co.uk <fi***@clara.co.uk> wrote:
The code:

import codecs

udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")
udlUNI.write(udlASCII.read())
udlUNI.close()
udlASCII.close()

This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as 0x0D/0x0A.
That code (using my own local files, of course) basically works for me.

If I open my input file with mode 'r', as you did above, my '\r\n'
pairs get transformed to '\n' when I read them in and are written to
my output file as 0x00 0x0A. If I open the input file in binary mode
'rb' then my output file shows the expected sequence of 0x00 0x0D 0x00
0x0A.
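
A minimal sketch of that difference, using the input path from the original post (Python 2 on Windows assumed):

# Text mode collapses '\r\n' to '\n' on read; binary mode keeps both bytes.
text_mode = file("c:\\temp\\CSVDB.udl", 'r').read()
bin_mode = file("c:\\temp\\CSVDB.udl", 'rb').read()
print text_mode.count('\r\n')   # 0: the pairs are already gone
print bin_mode.count('\r\n')    # one per CRLF line ending in the file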

Perhaps there's a quirk of your version of python or your platform? I'm running
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32

--
Jerry
May 3 '07 #3
On 3 May, 13:00, Jean-Paul Calderone <exar...@divmod.com> wrote:
On 3 May 2007 04:30:37 -0700, f...@clara.co.uk wrote:
On 2 May, 17:29, Jean-Paul Calderone <exar...@divmod.com> wrote:
On 2 May 2007 09:19:25 -0700, f...@clara.co.uk wrote:
The code:
import codecs
udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")
udlUNI.write(udlASCII.read())
udlUNI.close()
udlASCII.close()
This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as 0x0D/0x0A.
I have tried various 2 byte unicode encodings but it doesn't seem to
make a difference. I have also tried modifying the code to read and
convert a line at a time, but that didn't make any difference either.
I have tried to understand the unicode docs but nothing seems to
indicate why a seemingly incorrect conversion is being done.
Obviously I am missing something blindingly obvious here, any help
much appreciated.
Consider this simple example:
>>> import codecs
>>> f = codecs.open('test-newlines-file', 'w', 'utf16')
>>> f.write('\r\n')
>>> f.close()
>>> f = file('test-newlines-file')
>>> f.read()
'\xff\xfe\r\x00\n\x00'
And how it differs from your example. Are you sure you're examining
the resulting output properly?
By the way, "\r\0\n\0" isn't a "unicode line ending", it's just the UTF-16
encoding of "\r\n".
Jean-Paul
I am not sure what you are driving at here, since I started with an
ascii file, whereas you just write a unicode file to start with. I
guess the direct question is "is there a simple way to convert my
ascii file to a utf16 file?". I thought either string.encode() or
writing to a utf16 file would do the trick but it probably isn't that
simple!

There's no such thing as a unicode file. The only difference between
the code you posted and the code I posted is that mine is self-contained
and demonstrates that the functionality works as you expected it to work,
whereas the code you posted requires external resources which are not
available to run and produces external results which are not available to
be checked regarding their correctness.

So what I'm driving at is that both your example and mine are doing it
correctly (because they are doing the same thing), and mine demonstrates
that it is correct, but we have to take your word on the fact that yours
doesn't work. ;)

Jean-Paul
Thanks for the advice. I cannot prove what is going on. The following
code seems to work fine as far as console output goes, but the actual
bit patterns of the files on disk are not what I am expecting (or what
is expected as input by the ultimate user of the converted file), which
I can't prove of course.
>>> import codecs
>>> testASCII = file("c:\\temp\\test1.txt",'w')
>>> testASCII.write("\n")
>>> testASCII.close()
>>> testASCII = file("c:\\temp\\test1.txt",'r')
>>> testASCII.read()
'\n'
Bit pattern on disk: 0x0D/0x0A
>>> testASCII.seek(0)
>>> testUNI = codecs.open("c:\\temp\\test2.txt",'w','utf16')
>>> testUNI.write(testASCII.read())
>>> testUNI.close()
>>> testUNI = file("c:\\temp\\test2.txt",'r')
>>> testUNI.read()
'\xff\xfe\n\x00'
Bit pattern on disk: 0xFF/0xFE/0x0A/0x00
Bit pattern I was expecting: 0xFF/0xFE/0x0D/0x00/0x0A/0x00
>>> testUNI.close()
Dom

May 3 '07 #4
On 3 May, 13:39, "Jerry Hill" <malaclyp...@gmail.com> wrote:
On 2 May 2007 09:19:25 -0700, f...@clara.co.uk <f...@clara.co.uk> wrote:
The code:
import codecs
udlASCII = file("c:\\temp\\CSVDB.udl",'r')
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16")
udlUNI.write(udlASCII.read())
udlUNI.close()
udlASCII.close()
This doesn't seem to generate the correct line endings. Instead of
converting 0x0D/0x0A to 0x0D/0x00/0x0A/0x00, it leaves it as 0x0D/0x0A.

That code (using my own local files, of course) basically works for me.

If I open my input file with mode 'r', as you did above, my '\r\n'
pairs get transformed to '\n' when I read them in and are written to
my output file as 0x00 0x0A. If I open the input file in binary mode
'rb' then my output file shows the expected sequence of 0x00 0x0D 0x00
0x0A.

Perhaps there's a quirk of your version of python or your platform? I'm running
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32

--
Jerry
Thanks very much! Not sure if you intended to fix my whole problem,
but changing the read mode to 'rb' has done the trick :)

Dom
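
For reference, a sketch of the original script with that one change applied (the input is opened in binary mode so the '\r\n' pairs reach the UTF-16 encoder intact):

import codecs

udlASCII = file("c:\\temp\\CSVDB.udl", 'rb')   # 'rb' preserves the 0x0D/0x0A pairs
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl", 'w', "utf_16")

udlUNI.write(udlASCII.read())   # ASCII bytes are decoded implicitly, then written as UTF-16

udlUNI.close()
udlASCII.close()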

May 3 '07 #5
In <11**********************@p77g2000hsh.googlegroups.com>, fidtz wrote:
>>> import codecs
>>> testASCII = file("c:\\temp\\test1.txt",'w')
>>> testASCII.write("\n")
>>> testASCII.close()
>>> testASCII = file("c:\\temp\\test1.txt",'r')
>>> testASCII.read()
'\n'
Bit pattern on disk: 0x0D/0x0A
>>> testASCII.seek(0)
>>> testUNI = codecs.open("c:\\temp\\test2.txt",'w','utf16')
>>> testUNI.write(testASCII.read())
>>> testUNI.close()
>>> testUNI = file("c:\\temp\\test2.txt",'r')
>>> testUNI.read()
'\xff\xfe\n\x00'
Bit pattern on disk: 0xFF/0xFE/0x0A/0x00
Bit pattern I was expecting: 0xFF/0xFE/0x0D/0x00/0x0A/0x00
>>> testUNI.close()
Files opened with `codecs.open()` are always opened in binary mode. So if
you want '\n' to be translated into a platform specific character sequence
you have to do it yourself.
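
A minimal sketch of that manual translation (paths as in the original post; os.linesep is '\r\n' on Windows):

import codecs
import os

# Read in text mode ('\r\n' comes back as '\n'), then restore the
# platform-specific line endings before the UTF-16 encoder sees the text.
text = file("c:\\temp\\CSVDB.udl", 'r').read()
udlUNI = codecs.open("c:\\temp\\CSVDB2.udl", 'w', "utf_16")
udlUNI.write(text.replace('\n', os.linesep))
udlUNI.close()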

Ciao,
Marc 'BlackJack' Rintsch
May 3 '07 #6
