Hi,
I am writing unicode strings into a special text file that requires
non-ASCII characters to be written as octal-escaped UTF-8 codes.
For example, the letter "Í" (Latin capital I with acute, code point 205)
would come out as "\303\215".
I will also have to read back from the file later on and convert the
escaped characters back into a unicode string.
Does anyone have any suggestions on how to go from "Í" to "\303\215" and
vice versa?
I know I can get the code point by doing
>>> "Í".decode('utf-8').encode('unicode_escape')
but there doesn't seem to be any similar method for getting the octal
escaped version.
Thanks,
Michael
I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?
Michael
_________
import binascii

def escape(s):
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')
Looks like escape() can be a bit simpler...
def escape(s):
    result = []
    for char in s:
        result.append("\%o" % ord(char))
    return ''.join(result)
Regards,
Jordan
Very neat! Thanks a lot...
Michael
Perhaps something along the lines of:
>>> def encode(source):
...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í
>>>
HTH
Michael
Nice one. :) If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...
import re

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".
Regards,
Jordan
err...
def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')
Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ascii and non-printables, and stays
in unicode strings for both input and output. Also, low ascii values now
encode into a 3-digit octal sequence also, so that decode can catch them
properly.
Thanks a lot,
Michael
____________
import re

def encode(source):
    encoded = ""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 128):
            for byte in character.encode('utf8'):
                encoded += ("\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

orig = u"blablub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec
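The point about 3-digit sequences can be illustrated with a small sketch. It is written for Python 3 (the thread itself uses Python 2), and the names `unpadded`, `padded`, `ambiguous`, `safe` and `OCTAL_ESCAPE` are made up for this illustration only.

```python
import re

# Same pattern the thread's decode() uses: a backslash plus three digits.
OCTAL_ESCAPE = re.compile(r"\\(\d{3})")

newline = ord("\n")              # 10, which is 12 in octal

unpadded = "\\%o" % newline      # backslash followed by "12"
padded = "\\%03o" % newline      # backslash followed by "012"

# If a digit happens to follow the escape, the unpadded form is misread:
# backslash-1-2 followed by "3" looks like the single escape \123.
ambiguous = unpadded + "3"
safe = padded + "3"

print(OCTAL_ESCAPE.findall(ambiguous))  # ['123']
print(OCTAL_ESCAPE.findall(safe))       # ['012']
```

With zero-padding, every escape is exactly four characters long, so the decoder can find its end without looking at what follows.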
>>>>> Michael Goerz <an*******@8439.e4ward.com> (MG) wrote:
>MG> if (ord(character) < 32) or (ord(character) > 128):
If you encode chars < 32 it seems more appropriate to also encode 127.
Moreover your code is quadratic in the size of the string so if you use
long strings it would be better to use join.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org
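Both suggestions (also escaping 127, and building the result with join instead of repeated string concatenation) might look roughly like the sketch below. It is an assumption-laden Python 3 rendering, not code from the thread: the threshold `>= 127` is this sketch's choice so that DEL and every non-ASCII character are escaped, and a literal backslash in the input is still passed through unescaped, as in the posted version.

```python
def encode(source: str) -> str:
    """Escape control characters, DEL and non-ASCII characters as
    zero-padded octal UTF-8 bytes; leave printable ASCII untouched."""
    pieces = []  # collected fragments, joined once at the end
    for character in source:
        code = ord(character)
        if code < 32 or code >= 127:
            # In Python 3, iterating over bytes yields integers.
            pieces.extend("\\%03o" % byte for byte in character.encode("utf-8"))
        else:
            pieces.append(character)
    return "".join(pieces)


print(encode("Í"))           # \303\215
print(encode("blablub\n"))   # blablub\012
```

Appending to a list and joining once keeps the work proportional to the length of the input, whereas repeated `+=` on a string may copy the accumulated result on every step.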
An optimization... in decode() store matches as keys in a dict, so you
only do the string replacement once for each unique character...
def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')
Untested...
Regards,
Jordan
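Going one step further than replacing once per unique escape, the whole string can be handled in a single pass by giving `re.sub` a callback. The following is only a sketch in Python 3, not something posted in the thread; the name `OCTAL_ESCAPE` and the tighter pattern (first digit 0-3, so the value always fits in one byte) are assumptions of this example.

```python
import re

# A backslash followed by an octal number from 000 to 377 (0 to 255).
OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode(encoded: str) -> str:
    """Turn octal escapes back into bytes, then decode the result as UTF-8."""
    raw = encoded.encode("utf-8")
    # Each match is replaced by the single byte it names; int() accepts bytes.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


print(decode("adf\\303\\215adf"))  # adfÍadf
```

Because `re.sub` never rescans the text it has already substituted, a byte produced by one escape cannot be mistaken for the start of another.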