converting to and from octal escaped UTF-8

Michael Goerz

Hi,

I am writing unicode strings into a special text file that requires
non-ASCII characters to be stored as octal-escaped UTF-8 codes.

For example, the letter "Í" (Latin capital I with acute, code point 205)
would come out as "\303\215".
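
To make the arithmetic in this example concrete, here is a minimal sketch that traces the letter from code point to escaped form. It assumes Python 3 (the thread itself uses Python 2), and the variable names are only illustrative.

```python
# Python 3 sketch: trace "Í" from code point to octal-escaped UTF-8.
ch = "\u00cd"                     # LATIN CAPITAL LETTER I WITH ACUTE
code_point = ord(ch)              # 205
utf8_bytes = ch.encode("utf-8")   # two bytes: 0xC3 0x8D (195 and 141)

# Each byte becomes a backslash followed by three octal digits.
octal = "".join("\\%03o" % b for b in utf8_bytes)

print(code_point, list(utf8_bytes), octal)  # 205 [195, 141] \303\215
```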

I will also have to read back from the file later on and convert the
escaped characters back into a unicode string.

Does anyone have any suggestions on how to go from "Í" to "\303\215" and
vice versa?

I know I can get the code point by doing

>>> "Í".decode('utf-8').encode('unicode_escape')

but there doesn't seem to be any similar method for getting the octal
escaped version.

Thanks,
Michael

Dec 3 '07 #1

Michael Goerz

I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?

Michael
_________

import binascii

def escape(s):
    # Walk the hex dump two digits (one byte) at a time.
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                # Not an octal escape: keep the backslash and rescan the rest.
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')
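
For readers on Python 3, the same character-by-character unescape idea might look like the sketch below. It is not the code from this post. It walks the string with an index instead of re-slicing it, returns `bytes`, and treats a backslash as an escape only when exactly three octal digits with a value below 256 follow. That last rule is an assumption made for the sketch.

```python
def unescape(text: str) -> bytes:
    """Turn each backslash + three octal digits back into one byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        triple = text[i + 1:i + 4]
        is_escape = (
            text[i] == "\\"
            and len(triple) == 3
            and all(c in "01234567" for c in triple)
            and int(triple, 8) < 256
        )
        if is_escape:
            out.append(int(triple, 8))
            i += 4
        else:
            # Anything else (including a stray backslash) is copied through.
            out.extend(text[i].encode("utf-8"))
            i += 1
    return bytes(out)

print(unescape("adf\\303\\215adf"))  # b'adf\xc3\x8dadf'
```

Calling `.decode("utf-8")` on the result gives back the text string.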

Dec 3 '07 #2

MonkeeSage

Looks like escape() can be a bit simpler...

def escape(s):
    result = []
    for char in s:
        result.append("\%o" % ord(char))
    return ''.join(result)
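
One detail worth seeing in isolation: `%o` does not pad, so bytes below 64 come out with fewer than three digits, which matters if the reader expects exactly three. A small Python 3 sketch of the difference (the sample data is made up for illustration):

```python
# Newline (byte 10) followed by the two UTF-8 bytes of "Í".
data = "\n\u00cd".encode("utf-8")

unpadded = "".join("\\%o" % b for b in data)    # variable width
padded = "".join("\\%03o" % b for b in data)    # always three digits

print(unpadded)  # \12\303\215
print(padded)    # \012\303\215
```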

Regards,
Jordan

Dec 3 '07 #3

Michael Goerz

Very neat! Thanks a lot...
Michael

Dec 3 '07 #4

Michael Spencer

Perhaps something along the lines of:

>>> def encode(source):
...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í
>>>
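
A rough Python 3 equivalent of this pair, offered as a sketch rather than a drop-in replacement: iterating over `bytes` already yields integers in Python 3, so `ord()` and `chr()` drop out. As in the session above, every character is escaped, and `decode` expects a string made up of escapes only.

```python
def encode(source: str) -> str:
    """Escape every UTF-8 byte of source as backslash + octal digits."""
    return "".join("\\%o" % b for b in source.encode("utf-8"))

def decode(encoded: str) -> str:
    """Inverse of encode(); assumes the input consists solely of escapes."""
    raw = bytes(int(chunk, 8) for chunk in encoded.split("\\")[1:])
    return raw.decode("utf-8")

print(encode("\u00cd"))                      # \303\215
print(decode(encode("\u00cd")) == "\u00cd")  # True
```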

HTH
Michael

Dec 3 '07 #5

MonkeeSage

Nice one. :) If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...

import re

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".

Regards,
Jordan

Dec 3 '07 #6

MonkeeSage

err...

def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

Dec 3 '07 #7

Michael Goerz

Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ASCII and non-printable characters, and
uses unicode strings for both input and output. Low ASCII values now also
encode into a 3-digit octal sequence, so that decode can catch them
properly.

Thanks a lot,
Michael

____________

import re

def encode(source):
    encoded = ""
    for character in source:
        # Escape control characters and everything above 128.
        if (ord(character) < 32) or (ord(character) > 128):
            for byte in character.encode('utf8'):
                encoded += ("\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

orig = u"blaÍblub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec

Dec 3 '07 #8

Piet van Oostrum

>>>>> Michael Goerz <an*******@8439.e4ward.com> (MG) wrote:

>MG> if (ord(character) < 32) or (ord(character) > 128):

If you encode chars < 32 it seems more appropriate to also encode 127.

Moreover your code is quadratic in the size of the string so if you use
long strings it would be better to use join.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org
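
Both suggestions above (also escaping 127, and collecting pieces for a single `join`) could be combined roughly as follows. This is a sketch assuming Python 3. The cutoff `>= 127` is one reading of the advice, chosen so that DEL and every non-ASCII character are escaped.

```python
def encode(source: str) -> str:
    """Escape control characters, DEL and non-ASCII as octal UTF-8 bytes."""
    parts = []
    for ch in source:
        cp = ord(ch)
        if cp < 32 or cp >= 127:
            # One backslash + three octal digits per UTF-8 byte.
            parts.append("".join("\\%03o" % b for b in ch.encode("utf-8")))
        else:
            parts.append(ch)
    # A single join avoids the repeated copying of += on a growing string.
    return "".join(parts)

print(encode("bla\u00cdblub\n"))  # bla\303\215blub\012
print(encode("\x7f"))             # \177
```

A literal backslash in the input is passed through unchanged here, as in the thread's code, so text that already looks like an escape would be ambiguous when read back.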

Dec 4 '07 #9

MonkeeSage

An optimization...in decode() store matches as keys in a dict, so you
only do the string replacement once for each unique character...

def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

Untested...
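
A different route to the same goal, sketched for Python 3: `re.sub` with a replacement function handles each escape exactly once in a single left-to-right pass, so no bookkeeping of unique matches is needed and already-decoded output is never rescanned. The pattern `[0-3][0-7]{2}` is an assumption of this sketch. It is stricter than `\d{3}` so that every match is valid octal and fits in one byte.

```python
import re

# Backslash followed by three octal digits with a value of at most 0o377.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")

def decode(encoded: str) -> str:
    """Replace octal escapes with their bytes, then decode as UTF-8."""
    raw = encoded.encode("utf-8")
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")

print(decode("adf\\303\\215adf") == "adf\u00cdadf")  # True
```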

Regards,
Jordan

Dec 4 '07 #10
