
Converting to and from octal-escaped UTF-8

Hi,

I am writing unicode strings into a special text file that requires
non-ASCII characters to be written as octal-escaped UTF-8 codes.

For example, the letter "Í" (latin capital I with acute, code point 205)
would come out as "\303\215".

I will also have to read back from the file later on and convert the
escaped characters back into a unicode string.

Does anyone have any suggestions on how to go from "Í" to "\303\215" and
vice versa?

I know I can get the code point by doing
>>> "Í".decode('utf-8').encode('unicode_escape')
but there doesn't seem to be any similar method for getting the octal
escaped version.

Thanks,
Michael
Dec 3 '07 #1
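
For readers wondering where "\303\215" comes from: the two numbers are the UTF-8 bytes of "Í" written in octal. A minimal check, written for Python 3 (the thread itself uses Python 2) and with purely illustrative variable names:

```python
# "Í" is code point 205 (U+00CD); UTF-8 encodes it as two bytes.
ch = "\u00cd"
utf8_bytes = ch.encode("utf-8")

print(ord(ch))                            # the code point
print([hex(b) for b in utf8_bytes])       # the UTF-8 bytes in hex
print(["%03o" % b for b in utf8_bytes])   # the same bytes in octal

# Prefix each octal byte with a backslash to get the escaped form.
octal_escaped = "".join("\\%03o" % b for b in utf8_bytes)
print(octal_escaped)
```

So the task is really two steps: encode the text as UTF-8, then write each byte as a backslash followed by three octal digits.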
Michael Goerz wrote:
> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> vice versa?

I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?

Michael
_________

import binascii

def escape(s):
    # Walk the hex dump two digits (one byte) at a time.
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                # Not an octal escape: keep the backslash and rescan the rest.
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')
Dec 3 '07 #2
On Dec 2, 8:38 pm, Michael Goerz <answer...@8439.e4ward.com> wrote:
> I've come up with the following solution. It's not very pretty, but it
> works (no bugs, I hope). Can anyone think of a better way to do it?

Looks like escape() can be a bit simpler...

def escape(s):
    result = []
    for char in s:
        result.append("\\%o" % ord(char))
    return ''.join(result)

Regards,
Jordan
Dec 3 '07 #3
MonkeeSage wrote:
> Looks like escape() can be a bit simpler...

Very neat! Thanks a lot...
Michael
Dec 3 '07 #4
Michael Goerz wrote:
> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> vice versa?

Perhaps something along the lines of:
>>> def encode(source):
...     return "".join("\\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í
>>>
HTH
Michael

Dec 3 '07 #5
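
The session above is Python 2, where iterating over an encoded string yields one-character strings. As a rough Python 3 equivalent (a sketch, not the poster's code; the names encode_all and decode_all are made up here), iterating over bytes yields integers, so ord() and chr() drop out. Like the original, it assumes that every byte is escaped and that the input to the decoder contains nothing but escapes:

```python
def encode_all(source):
    """Escape every UTF-8 byte of `source` as a backslash plus octal digits."""
    return "".join("\\%o" % b for b in source.encode("utf-8"))


def decode_all(encoded):
    """Inverse of encode_all; expects a string made only of octal escapes."""
    # split("\\") yields an empty first item because the string starts
    # with a backslash, hence the [1:].
    raw = bytes(int(chunk, 8) for chunk in encoded.split("\\")[1:])
    return raw.decode("utf-8")


print(encode_all("\u00cd"))
print(decode_all(encode_all("\u00cd")) == "\u00cd")
```

Because the decoder splits on the backslash, the escapes do not need a fixed width here, but any unescaped plain text in the input would make int() fail.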
On Dec 2, 11:46 pm, Michael Spencer <m...@telcopartners.com> wrote:
> Perhaps something along the lines of:
> >>> def decode(encoded):
> ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> ...     return bytes.decode('utf8')

Nice one. :) If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...

import re

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".

Regards,
Jordan
Dec 3 '07 #6
On Dec 3, 1:31 am, MonkeeSage <MonkeeS...@gmail.com> wrote:
> def decode(encoded):
>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return encoded.decode('utf8')

err...

import re

def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')
Dec 3 '07 #7
MonkeeSage wrote:
> def decode(encoded):
>     for octc in re.findall(r'\\(\d{3})', encoded):
>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return encoded.decode('utf8')

Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ASCII and non-printable characters,
and uses unicode strings for both input and output. Low ASCII values now
also encode into a 3-digit octal sequence, so that decode can catch them
properly.

Thanks a lot,
Michael

____________

import re

def encode(source):
    encoded = ""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 128):
            # Escape each UTF-8 byte as a zero-padded 3-digit octal number.
            for byte in character.encode('utf8'):
                encoded += ("\\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

orig = u"blablub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec

Dec 3 '07 #8
>>>>> Michael Goerz <an*******@8439.e4ward.com> (MG) wrote:
>MG> if (ord(character) < 32) or (ord(character) > 128):
If you encode chars < 32 it seems more appropriate to also encode 127.

Moreover your code is quadratic in the size of the string so if you use
long strings it would be better to use join.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org
Dec 4 '07 #9
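
Both suggestions can be sketched together. The version below is written for Python 3 and is only one possible reading of the advice: it collects the pieces in a list and joins them once, and it escapes 127 (DEL) along with the control characters. The thread does not say whether code point 128 itself should be escaped; this sketch assumes everything from 127 upward should be.

```python
def encode(text):
    """Octal-escape control characters, DEL and all non-ASCII characters."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 32 or cp >= 127:
            # One three-digit escape per UTF-8 byte of the character.
            parts.extend("\\%03o" % b for b in ch.encode("utf-8"))
        else:
            parts.append(ch)
    # A single join avoids repeatedly copying a growing string.
    return "".join(parts)


print(encode("blablub\n"))
print(encode("\u00cd"))
```

Note that a literal backslash in the input is passed through unchanged, so text that already looks like an escape cannot be told apart from a real one when reading the file back.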
On Dec 3, 8:10 am, Michael Goerz <answer...@8439.e4ward.com> wrote:
> def decode(encoded):
>     decoded = encoded.encode('utf-8')
>     for octc in re.findall(r'\\(\d{3})', decoded):
>         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return decoded.decode('utf8')

An optimization: in decode(), store the matches as keys in a dict, so you
only do the string replacement once for each unique escape sequence...

def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

Untested...

Regards,
Jordan
Dec 4 '07 #10
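
An alternative to repeated replace() calls is a single pass with re.sub and a callback. This is a Python 3 sketch rather than anything posted in the thread, and the name OCTAL_ESCAPE is made up here. It sidesteps two things the replace-based versions may trip over: \d{3} also accepts the digits 8 and 9, and each replace() rescans text that was already decoded, so a decoded backslash followed by digits can be mistaken for a fresh escape.

```python
import re

# Backslash followed by exactly three octal digits, limited to one byte
# (000-377).
OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode(encoded):
    """Replace each octal escape with its byte, then decode as UTF-8."""
    raw = encoded.encode("utf-8")
    # re.sub never rescans its own replacements, so each escape in the
    # input is decoded exactly once.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


print(decode("adf\\303\\215adf"))
print(decode("\\134101\\101"))
```

The second call shows the difference: \134 decodes to a backslash, and the "101" that follows it stays plain text instead of being turned into another character.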
