Pure python implementation of string-like class


Hi all.

I would like to ask how I can implement a string-like class using a
tuple or list. Does anyone know of example code for a pure Python
implementation of a string-like class?

I am trying to use Python for text processing that involves a large
character set. As the character set is wider than the range UTF-16 can
encode (up to U+10FFFF), I can't use Python's native unicode string class.

So I want to prepare my own string class, which provides convenient
string methods such as split, join, find and others like the usual string
class, but uses a sequence of integers as its internal representation
instead of a native string. Obviously, subclassing str doesn't
help.

The implementations of the string methods in the Python source
tree (stringobject.c) are far from Python code, so I have started from
scratch, as below:

def startswith(self, prefix, start=-1, end=-1):
    assert start < 0, "not implemented"
    assert end < 0, "not implemented"
    if isinstance(prefix, (str, unicode)):
        prefix = MyString(prefix)
    n = len(prefix)
    return self[0:n] == prefix

but I found it is not a trivial task for me to achieve correctness
and completeness. It also smells like "reinventing the wheel", though I
can't find any hints on Google or in the Python Cookbook.
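
For readers who want a starting point, here is a minimal sketch of the kind of class described above. It is an illustration only, not the poster's code: the name MyString is taken from the snippet above, but the `_cps` attribute and every method body are assumptions, and it is written for Python 3 rather than the Python 2 used in this thread. It stores code points as a tuple of ints and uses `slice(start, end).indices()` so that start/end are clamped the way built-in slicing clamps them. It does not try to match every edge case of the built-in str methods.

```python
class MyString:
    """Hypothetical string-like class backed by a tuple of integer code points."""

    def __init__(self, data=()):
        if isinstance(data, MyString):
            self._cps = data._cps
        elif isinstance(data, str):
            self._cps = tuple(ord(ch) for ch in data)
        else:
            # Any iterable of ints; values above 0x10FFFF are allowed.
            self._cps = tuple(int(cp) for cp in data)

    def __len__(self):
        return len(self._cps)

    def __eq__(self, other):
        if isinstance(other, str):
            other = MyString(other)
        if not isinstance(other, MyString):
            return NotImplemented
        return self._cps == other._cps

    def __hash__(self):
        return hash(self._cps)

    def __getitem__(self, index):
        # Slices return a MyString, single indexes return an int.
        if isinstance(index, slice):
            return MyString(self._cps[index])
        return self._cps[index]

    def __repr__(self):
        return "MyString(%r)" % (list(self._cps),)

    def startswith(self, prefix, start=None, end=None):
        prefix = MyString(prefix)
        lo, hi, _ = slice(start, end).indices(len(self))
        n = len(prefix)
        return hi - lo >= n and self._cps[lo:lo + n] == prefix._cps

    def find(self, sub, start=None, end=None):
        # Naive scan; returns the first matching index or -1.
        sub = MyString(sub)
        lo, hi, _ = slice(start, end).indices(len(self))
        n = len(sub)
        for i in range(lo, hi - n + 1):
            if self._cps[i:i + n] == sub._cps:
                return i
        return -1

    def split(self, sep):
        # Only the explicit-separator form of split is sketched here.
        sep = MyString(sep)
        if len(sep) == 0:
            raise ValueError("empty separator")
        parts, pos = [], 0
        while True:
            hit = self.find(sep, pos)
            if hit < 0:
                parts.append(self[pos:])
                return parts
            parts.append(self[pos:hit])
            pos = hit + len(sep)

    def join(self, parts):
        out = []
        for i, part in enumerate(parts):
            if i:
                out.extend(self._cps)
            out.extend(MyString(part)._cps)
        return MyString(out)


# Usage: code points beyond U+10FFFF mixed with ordinary characters.
s = MyString([0x110000, ord(","), 0x200000, ord(","), ord("A")])
fields = s.split(",")
rejoined = MyString(",").join(fields)
```

Keeping the data in an immutable tuple makes the class hashable like a native string; a list would work equally well if in-place mutation is wanted.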

I don't care about efficiency as a starting point. Any comments are welcome.
Thanks.

-- kayama
Feb 25 '06 #1
Maybe you can create your class using an array of 'L' with the array
standard module.
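
A minimal sketch of that suggestion, written for Python 3 and assuming nothing beyond the standard array module: an array with typecode 'L' holds unsigned integers of at least 4 bytes each, so it can store values above 0x10FFFF, and it supports slicing, equality and index() much like a list. The helper name starts_with is hypothetical.

```python
from array import array


def starts_with(text, prefix):
    """Return True if the array `text` begins with the array `prefix`."""
    return text[:len(prefix)] == prefix


# Code points stored as unsigned longs; 0x110000 is outside the Unicode range.
text = array("L", [0x110000, 0x41, 0x42, 0x110001])
prefix = array("L", [0x110000, 0x41])

has_prefix = starts_with(text, prefix)
position_of_b = text.index(0x42)
```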

Bye,
bearophile

Feb 25 '06 #2
Akihiro KAYAMA wrote:
> As the character set is wider than UTF-16(U+10FFFF), I can't use
> Python's native unicode string class.


Have you tried using Python compiled in Wide Unicode mode
(--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
which should be enough for most purposes.

--
And Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/

Feb 25 '06 #3
Hi bearophile.

In article <11**********************@i40g2000cwc.googlegroups .com>,
be************@lycos.com writes:

bearophileHUGS> Maybe you can create your class using an array of 'L' with the array
bearophileHUGS> standard module.

Thanks for your suggestion. I'm currently using an ordinary list as the
internal representation. As I understand it, compared to a list, the
array module offers efficiency but no convenient functions for
implementing the various string methods. As Python's list is already
fast enough, I want to speed up my coding work first.

-- kayama
Feb 25 '06 #4
Hi And.

In article <11**********************@u72g2000cwu.googlegroups .com>,
an********@doxdesk.com writes:

and-google> Akihiro KAYAMA wrote:
and-google> > As the character set is wider than UTF-16(U+10FFFF), I can't use
and-google> > Python's native unicode string class.
and-google>
and-google> Have you tried using Python compiled in Wide Unicode mode
and-google> (--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
and-google> which should be enough for most purposes.
From my quick survey, Python's Unicode support is intentionally restricted to the UTF-16 range (U+0000...U+10FFFF), regardless of
the --enable-unicode=ucs4 option.
Python 2.4.1 (#2, Sep 3 2005, 22:35:47)
[GCC 2.95.4 20020320 [FreeBSD]] on freebsd4
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\U0010FFFF"
u'\U0010ffff'
>>> len(u"\U0010FFFF")
1
>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character


A simple patch to unicodeobject.c that disables the Unicode range checking
could solve this, but I don't want to maintain a specialized Python
binary for my project.
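
For comparison, the same limit can be checked on a current interpreter. This is a small sketch for Python 3.3 or later (not the Python 2.4.1 used in the session above), where sys.maxunicode is always 0x10FFFF and chr() rejects anything larger. The helper name fits_native_str is hypothetical.

```python
import sys


def fits_native_str(code_point):
    """Return True if chr() accepts the code point on this interpreter."""
    try:
        chr(code_point)
    except (ValueError, OverflowError):
        # chr() refuses values outside range(0x110000).
        return False
    return True


upper_limit = sys.maxunicode            # 0x10FFFF on Python 3.3+
inside = fits_native_str(0x10FFFF)      # last valid code point
outside = fits_native_str(0x110000)     # first value past the limit
```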

-- kayama
Feb 25 '06 #5
