Detecting Unicode encodings

Hi.

Is it possible to decode a UTF-8 (with or without a BOM), UTF-16 (BE or
LE with a BOM), or UTF-32 (BE or LE with a BOM) byte stream without
knowing what encoding the stream is in?

I know how to use the codecs module to get StreamReader classes that can
decode a specific encoding, but I have to know what that encoding is
beforehand.

If I read up to four bytes from the byte stream, I can figure out what
encoding the stream is in but that has problems for UTF-8 streams
without BOMs--I would have just eaten one or more bytes that might need
to be decoded by the StreamReader. I could seek back to the beginning of
the stream but what if the file-like object I was reading from didn't
support seeking?

Thanks.

-- Jason
Jul 18 '05 #1
On Sat, 21 Aug 2004 10:57:34 -0700, rumours say that Jason Diamond
<ja***@injektilo.org> might have written:
> If I read up to four bytes from the byte stream, I can figure out what
> encoding the stream is in but that has problems for UTF-8 streams
> without BOMs--I would have just eaten one or more bytes that might need
> to be decoded by the StreamReader. I could seek back to the beginning of
> the stream but what if the file-like object I was reading from didn't
> support seeking?


Two options pop up instantly:

1. "Programmers do it byte by byte" (mainly a joke, so go to option 2 :)

2. Wrap your file-like object in a custom object that implements a
pushback method and whose read method returns data from the push-back
buffer first. If you read data that you shouldn't have, push it back and
give your custom object to the StreamReader.
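
Option 2 could be sketched roughly as follows. This is a minimal illustration, not code from the thread: the class name `PushbackFile` and its `pushback` method are hypothetical, and it is written for Python 3 byte streams (the thread itself predates Python 3).

```python
import codecs
import io


class PushbackFile:
    """Hypothetical wrapper: read() drains pushed-back bytes first."""

    def __init__(self, source):
        self.source = source
        self.pushed = b""

    def pushback(self, data):
        # Bytes pushed back most recently are returned first.
        self.pushed = data + self.pushed

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything: pushed-back bytes, then the rest of the source.
            data, self.pushed = self.pushed, b""
            return data + self.source.read()
        data, self.pushed = self.pushed[:size], self.pushed[size:]
        if len(data) < size:
            # The push-back buffer ran dry; take the remainder from the source.
            data += self.source.read(size - len(data))
        return data


# Usage: sniff four bytes, push them back, then decode from the start.
raw = PushbackFile(io.BytesIO("naïve".encode("utf-8")))
head = raw.read(4)
raw.pushback(head)
reader = codecs.getreader("utf-8")(raw)
text = reader.read()
```

Because the wrapper only ever calls `read` on the underlying object, it should also work for sources that cannot seek.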
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
Jul 18 '05 #2
Christos TZOTZIOY Georgiou wrote:
> 2. wrap your file-like object in a custom object, which implements a
> pushback method and its read method returns first from the push-back
> buffer. If you read data that you shouldn't, push them back and give
> your custom object to the StreamReader.


Thanks for the suggestion.

Instead of a pushback method, I added a peek method. Below is what I
came up with.

-- Jason

class PeekableFile:

    def __init__(self, source):
        self.source = source
        self.buffer = None

    def peek(self, size):
        if self.buffer:
            n = len(self.buffer)
            if size > n:
                self.buffer += self.source.read(size - n)
        else:
            self.buffer = self.source.read(size)
        return self.buffer[:size]

    def read(self, size=-1):
        if self.buffer:
            if size >= 0:
                n = len(self.buffer)
                if size < n:
                    s = self.buffer[:size]
                    self.buffer = self.buffer[size:]
                elif size == n:
                    s = self.buffer
                    self.buffer = None
                else:
                    s = self.buffer + self.source.read(size - n)
                    self.buffer = None
            else:
                s = self.buffer + self.source.read()
                self.buffer = None
        else:
            s = self.source.read(size)
        return s

def main():

    import StringIO
    import unittest

    class PeekableFileTests(unittest.TestCase):

        def setUp(self):
            f = StringIO.StringIO('abc')
            self.pf = PeekableFile(f)

        def testPeek0(self):
            self.failUnlessEqual(self.pf.peek(0), '')

        def testPeek1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')

        def testPeek1Read1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'a')

        def testPeek1Read2(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(2), 'ab')

        def testPeek1ReadAll(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(), 'abc')

        def testPeek1Read1Read1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'b')

        def testPeek1Read1ReadAll(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.read(), 'bc')

        def testPeek1Peek1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.peek(1), 'a')

        def testPeek1Peek2(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.peek(2), 'ab')

        def testPeek2Peek1(self):
            self.failUnlessEqual(self.pf.peek(2), 'ab')
            self.failUnlessEqual(self.pf.peek(1), 'a')

        def testPeek2Read1Peek1(self):
            self.failUnlessEqual(self.pf.peek(2), 'ab')
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.peek(1), 'b')

        def testRead0(self):
            self.failUnlessEqual(self.pf.read(0), '')

        def testRead1(self):
            self.failUnlessEqual(self.pf.read(1), 'a')

        def testReadAll(self):
            self.failUnlessEqual(self.pf.read(), 'abc')

        def testRead1Peek1(self):
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.peek(1), 'b')

        def testReadAllPeek1(self):
            self.failUnlessEqual(self.pf.read(), 'abc')
            self.failUnlessEqual(self.pf.peek(1), '')

    unittest.TextTestRunner().run(unittest.makeSuite(PeekableFileTests))

if __name__ == '__main__':
    main()
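
To connect the peek approach back to the original question, here is one way the peeked bytes might be used to choose a StreamReader. It is a sketch under stated assumptions rather than anything posted in the thread: it targets Python 3 byte streams, `Peekable` is a compact stand-in for the PeekableFile class above, and `open_unicode` and the `BOMS` table are hypothetical names. Only the `codecs` BOM constants and `codecs.getreader` come from the standard library. A stream without a BOM is assumed to be UTF-8.

```python
import codecs
import io

# Test the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because the
# UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


class Peekable:
    """Compact Python 3 stand-in for the PeekableFile class above."""

    def __init__(self, source):
        self.source = source
        self.buffer = b""

    def peek(self, size):
        # Top up the buffer only if it holds fewer bytes than requested.
        if size > len(self.buffer):
            self.buffer += self.source.read(size - len(self.buffer))
        return self.buffer[:size]

    def read(self, size=-1):
        if size is None or size < 0:
            data, self.buffer = self.buffer + self.source.read(), b""
            return data
        data = self.peek(size)
        self.buffer = self.buffer[size:]
        return data


def open_unicode(source, default="utf-8"):
    """Hypothetical helper: wrap a byte stream in a decoding StreamReader."""
    stream = Peekable(source)
    head = stream.peek(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            stream.read(len(bom))  # consume the BOM so it is not decoded
            return codecs.getreader(name)(stream)
    return codecs.getreader(default)(stream)
```

One caveat: a UTF-16-LE stream whose first character after the BOM is U+0000 is indistinguishable from UTF-32-LE by its first four bytes, so this check is a heuristic for that case.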
Jul 18 '05 #3
