
Prothon should not borrow Python strings!

I skimmed the tutorial and something alarmed me.

"Strings are a powerful data type in Prothon. Unlike many languages,
they can be of unlimited size (constrained only by memory size) and can
hold any arbitrary data, even binary data such as photos and movies. They
are of course also good for their traditional role of storing and
manipulating text."

This view of strings is about a decade out of date with modern
programming practice. From the programmer's point of view, a string
should be a list of characters. Characters are logical objects that have
properties defined by Unicode. This is the model used by Java,
Javascript, XML and C#.

Characters are an extremely important logical concept for human beings
(computers are supposed to serve human beings!) and they need
first-class representation. It is an accident of history that the
language you grew up with has so few characters that they can have a
one-to-one correspondence with bytes.

I can understand why you might be afraid to tackle all of Unicode for
version 1.0. Don't bother. All you need to do today to avoid the dead
end is DO NOT ALLOW BINARY DATA IN STRINGS. Have a binary data type.
Have a character string type. Give them a common "prototype" if you
wish. Let them share methods. But keep them separate in your code. The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string. These are as different as the bits
that represent an integer in a particular file format and a logical integer.
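
To make the distinction concrete, here is a minimal sketch in Python 2
terms (an analogy only, not Prothon syntax; the file names are
placeholders):

# Reading a file yields a byte string: arbitrary binary data.
raw = open("photo.gif", "rb").read()

# A character string only exists after an explicit decode step.
text = open("note.txt", "rb").read().decode("utf-8")

# raw and text should be distinct types, even if they share methods.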

Even if your character data type is today limited to characters between
0 and 255, you can easily extend that later. But once you have megabytes
of code that makes no distinction between characters and bytes it will
be too late. It would be like trying to tease apart integers and floats
after having treated them as indistinguishable (which brings me to my
next post).

Paul Prescod

Jul 18 '05 #1
"Paul Prescod" <pa**@prescod.n et> wrote
I can understand why you might be afraid to tackle all of Unicode for
version 1.0. Don't bother. All you need to do today to avoid the dead
end is DO NOT ALLOW BINARY DATA IN STRINGS. Have a binary data type.
Have a character string type. Give them a common "prototype" if you
wish. Let them share methods. But keep them separate in your code. The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string. These are as different as the bits
that represent an integer in a particular file format and a logical integer.

This is very timely. I would like to resolve issues like this by July and
that deadline is coming up very fast.

We have had discussions on the Prothon mailing list about how to handle
Unicode properly but no one pointed this out. It makes perfect sense to me.

Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.
Jul 18 '05 #2
Mark Hahn wrote:
"Paul Prescod" <pa**@prescod.n et> wrote


...

Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.


I don't consider myself an expert: there are just some big mistakes that
I can recognize. But I'll give you as much guidance as I can.

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

Summary:

"""It does not make sense to have a string without knowing what encoding
it uses. You can no longer stick your head in the sand and pretend that
"plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you
have to know what encoding it is in or you cannot interpret it or
display it to users correctly."""

One thing I should have told you is that it is just as important to get
your internal APIs right as your syntax. If you embed the "ASCII
assumption" into your APIs you will have a huge legacy of third party
modules that expect all characters to be <255 and you'll be stuck in the
same cul de sac as Python.

I would define macros like

#define PROTHON_CHAR int

and functions like

Prothon_String_As_UTF8
Prothon_String_As_ASCII // raises error if there are high characters

Obviously I can't think through the whole API. Look at Python,
JavaScript and JNI, I guess.

http://java.sun.com/docs/books/jni/h...ypes.html#4001

The gist is that extensions should not poke into the character string
data structure expecting the data to be a "char *" of ASCII bytes.
Rather, they should ask you to decode the data into a new buffer. Maybe you
could do some tricky buffer reuse if the encoding they ask for happens
to be the same as your internal structure (look at the Java "isCopy"
stuff). But if you promise users the ability to directly fiddle with the
internal data then you may have to break that promise one day.

To get from a Prothon string to a C string requires encoding because
_there ain't no such thing as a plain string_. If the C programmer
doesn't tell you how they want the data encoded, how will you know?
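
The Python-level analogue of that rule looks like this (a sketch only;
the Prothon API names above are just illustrations):

s = u"na\u00efve"                # a character string containing a non-ASCII char
utf8_bytes = s.encode("utf-8")   # fine: the caller named an encoding
s.encode("ascii")                # raises UnicodeEncodeError: no way to guess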

If you get the APIs right, it will be much easier to handle everything
else later.

Choosing an internal encoding is actually pretty tricky because there
are space versus time tradeoffs and you need to make some guesses about
how often particular characters are likely to be useful to your users.

==

On the question of types: there are two models that seem to work okay in
practice. Python's split between byte strings and Unicode strings is
actually not bad except that the default string literal is a BYTE string
(for historical reasons) rather than a character string.
a = "a \u1234"
b = u"ab\u1234"
a 'a \\u1234' b u'ab\u1234' len(a) 8 len(b) 3

Here's what Javascript does (i.e. better):

<script>
str = "a \u1234"
alert(str.length) // 3
</script>

===

By the way, if you have the courage to distance yourself from every
other language under the sun, I would propose that you throw an
exception on unknown escape sequences. It is very easy in Python to
accidentally use an escape sequence that is incorrect, as above. Plus,
it is near impossible to add new escape sequences to Python because they
may break some code somewhere. I don't understand why this case is
special enough to break the usual Python commitment to "not guess" what
programmers mean in the face of ambiguity. This is another one of those
things you have to get right at the beginning because it is tough to
change later! Also, I totally hate how character numbers are not
delimited. It should be \u{1} or \u{1234} or \u{12345}. I find Python
totally weird:
u"\1" u'\x01' u"\12" u'\n' u"\123" u'S' u"\1234" u'S4' u"\u1234" u'\u1234' u"\u123" UnicodeDecodeEr ror: 'unicodeescape' codec can't decode bytes in position
0-4: end of string in escape sequence
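
A small sketch of the stricter literal handling being proposed here --
delimited \u{...} escapes, and an exception on anything unrecognized
(hypothetical rules, implemented in Python 2 just for illustration):

import re

SIMPLE_ESCAPES = {"n": u"\n", "t": u"\t", "\\": u"\\", '"': u'"'}

def decode_escapes(literal):
    # \u must be followed by braces around hex digits; any other
    # unknown escape raises instead of being silently passed through.
    def replace(match):
        body = match.group(1)
        if body.startswith("u{") and body.endswith("}"):
            return unichr(int(body[2:-1], 16))
        if body in SIMPLE_ESCAPES:
            return SIMPLE_ESCAPES[body]
        raise ValueError("unknown escape sequence: \\" + body)
    return re.sub(r'\\(u\{[0-9A-Fa-f]+\}|.)', replace, literal)

decode_escapes(u"a\\u{1234}")   # -> u'a\u1234'
decode_escapes(u"a\\q")         # raises ValueError
decode_escapes(u"a\\u1234")     # raises ValueError: braces are required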

====

So anyhow, the Python model is that there is a distinction between
character strings (which Python calls "unicode strings") and byte
strings (called 8-bit strings). If you want to decode data you are
reading from a file, you can just:

file("filename" ).read().decode ("ascii")

or

file("filename" ).read().decode ("utf-8")

Here's an illustration of a clean split between character strings and
byte strings:
file("filename" ).read() <bytestring ['a', 'b', 'c'...]> file("filename" ).read().decode ("ascii")

"abc"

Now the Javascript model, which also seems to work, is a little bit
different. There is only one string type, but each character can take
values up to 2^16 (more on this number later).

http://www.mozilla.org/js/language/e...on.html#string

If you read binary data in JavaScript, the implementations seem to just
map each byte to a corresponding Unicode code point (another way of
saying that is that they default to the latin-1 encoding). This should
work in most browsers:

<SCRIPT language = "Javascript ">
datafile = "http://www.python.org/pics/pythonHi.gif"

httpconn = new XMLHttpRequest( );
httpconn.open(" GET",datafile,f alse);
httpconn.send(n ull);
alert(httpconn. responseText);
</SCRIPT>
<BODY></BODY>
</HTML>

(ignore the reference to "Xml" above. For some reason Microsoft decided
to conflate XML and HTTP in their APIs. In this case we are doing
nothing with XML whatsoever)

I was going to write that Javascript also has a function that allows you
to explicitly decode. That would be logical. You could imagine that you
could do as many levels of decoding as you like:

objXml.decode(" utf-8").decode("lat in-1").decode(" utf-8").decode("koi 8-r")

This model is a little bit "simpler" in that there is only one string
object and the programmer just keeps straight in their head whether it
has been decoded already (or how many times it has been decoded, if for
some strange reason it were double or triple-encoded).

But it turns out that I can't find a Javascript Unicode decoding
function through Google. More evidence that Javascript is brain-dead I
suppose.

Anyhow, that describes two models: one where byte (0-255) and character
(0-2**16 or 2**32) strings are strictly separated and one where byte
strings are just treated as a subset of character strings. What you
absolutely do not want is to leave character handling totally in the
domain of the application programmer, as C and early versions of Python did.

On to character ranges. Strictly speaking, the Unicode cap is about 2^20
characters (the code space ends at U+10FFFF). You'll notice that this is
just beyond 2^16, which is a much more convenient (and space-efficient)
number. There are three basic ways
of dealing with this situation.

1. You can use two bytes per character and simply ignore the issue.
"Those characters are not available. Deal with it!" That isn't as crazy
as it sounds because the high characters are not in common use yet.

2. You could directly use 3 (or more likely 4) bytes per character.
"Memory is cheap. Deal with it!"

3. You could do tricks where you sort of page switch from two-byte to
four-byte mode using "surrogates".[1] This is actually not that far from
"1" if you leave the manipulation of the surrogates entirely in
application code. I believe this is the strategy used by Java[2] and
Javascript.[3]

[1] http://www.i18nguy.com/surrogates.html

[2] "The methods that only accept a char value cannot support
supplementary characters. They treat char values from the surrogate
ranges as undefined characters."

http://java.sun.com/j2se/1.5.0/docs/...Character.html

"Characters are single Unicode 16-bit code points. We write them
enclosed in single quotes ‘ and ’. There are exactly 65536 characters:
‘«u0000»’, ‘«u0001»’, ...,‘A’, ‘B’, ‘C’, ...,‘«uFFFF»’ (see also
notation for non-ASCII characters). Unicode surrogates are considered to
be pairs of characters for the purpose of this specification."

[3] http://www.mozilla.org/js/language/j.../notation.html

From a correctness point of view, 4-byte chars is obviously
Unicode-correct. From a performance point of view, most language
designers have chosen to sweep the issue under the table and hope
that 16 bits per char will continue to be enough "most of the time" and that
those who care about more will explicitly write their own code to deal
with high characters.
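
For reference, the surrogate trick in option 3 is just this bit of
arithmetic (a Python sketch; U+10384 is an arbitrary example above U+FFFF):

def to_surrogates(cp):
    # Split a code point above U+FFFF into a UTF-16 surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogates(high, low):
    # Recombine a surrogate pair into the original code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10384)          # -> (0xD800, 0xDF84)
assert from_surrogates(high, low) == 0x10384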

Paul Prescod

Jul 18 '05 #3
Mark Hahn wrote:
Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.


Java's bytes being signed also caused no end of annoyance for me.
In our protocol marshalling code (thankfully mostly auto generated)
there was lots of code just to turn the signed bytes back into
unsigned bytes.

(I also *very* strongly agree with Paul.)

Roger
Jul 18 '05 #4
> Choosing an internal encoding is actually pretty tricky because there
are space versus time tradeoffs and you need to make some guesses about
how often particular characters are likely to be useful to your users.
There are two ways to deal with it. One is to convert to an internal
"Unicode" representation such as UTF-8, or arrays of 16- or 32-bit integers.

You also have to decide if you are going to normalise the string.
For example, you can have a character followed by a combining accent.
On display they are one character, but often there is a single
codepoint for the character combined with the accent, so you could
reduce the two down to one. There are also other characters, such as
those that specify the direction of the following text, which are
considered noise in some contexts.
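
In Python terms (Prothon would need its own equivalent of the
unicodedata module):

import unicodedata

s = u"e\u0301"                       # 'e' followed by a combining acute accent
n = unicodedata.normalize("NFC", s)  # compose to the single accented codepoint
len(s), len(n)                       # -> (2, 1); n == u'\xe9'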

The other way of dealing with things is to keep the text as it
was given, and not do any conversion or normalisation on it.
This is generally more future-proof, but does burden other code
with having to deal with conversion issues (for example, NT/2K/XP
only use 16 bits per codepoint, which is less than the full
range now).

If you want to score extra bonus points, you should also store
the locale of the string along with the encoding. I won't elaborate
here why.

Another design to consider is to allow tags that cover character
ranges and then assign properties to those tags (such as locale,
encoding), but importantly allow multiple tags per character.
(If you have used the Tk text widget you'll understand what I
am thinking of).
By the way, if you have the courage to distance yourself from every
other language under the sun, I would propose that you throw an
exception on unknown escape sequences.


Perl did that first :-) It didn't distinguish between arrays of
bytes and arrays of characters, so you easily end up with humongous
amounts of warnings about invalid UTF-8 stuff when dealing with
bytes. (I have no idea what goes on under the hood - you just
see it when installing Perl stuff like SpamAssassin).

In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan. Bonus points for
Tamil :-)

Just to make life even more interesting, you should realise that
there is more than one system of digits. You can see how Java
handles the issue here:

http://java.sun.com/j2se/1.4.2/docs/...ricShaper.html

Since you are doing new language design, I also think there would
be great value in forcing things so that you do not have
strings embedded in the program, and they have to come from
external resource files. This also gives you the opportunity to
deal with string interpolation issues and get them right.
(It also means that "Hello, World" remains one line, but also
requires an external file with the message, or some other
mechanism).
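
As a rough picture of what that looks like with today's tools, here is
the gettext-style version in Python (the catalog name and directory are
made up):

import gettext

# "Hello, World" lives in an external catalog such as
# locale/fr/LC_MESSAGES/prothon_demo.mo rather than in the source file.
t = gettext.translation("prothon_demo", localedir="locale",
                        languages=["fr"], fallback=True)
_ = t.gettext
print(_("Hello, World"))   # the translated entry, or the key if no catalog exists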

The other Java i18n pages make for interesting reading:

http://java.sun.com/j2se/corejava/intl/index.jsp

Roger
Jul 18 '05 #5
Paul Prescod wrote:
The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string.


What if the file you're reading is a text file?

--
Greg Ewing, Computer Science Dept,
University of Canterbury,
Christchurch, New Zealand
http://www.cosc.canterbury.ac.nz/~greg

Jul 18 '05 #6
Roger Binns wrote:
In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan. Bonus points for
Tamil :-)
I sure hope you are kidding. If not you are scaring me away from doing
anything.

I want to do the best thing. I want someone who knows what's best and that
I can trust to help out and tell me what to do. I want to develop Prothon,
not become an expert on glyphs and international character coding.
Since you are doing new language design, I also think there would
be great value in forcing things so that you do not have
strings embedded in the program, and they have to come from
external resource files. This also gives you the opportunity to
deal with string interpolation issues and get them right.
(It also means that "Hello, World" remains one line, but also
requires an external file with the message, or some other
mechanism).


Do you mean for the interpreter or some enabling tool for the Prothon
programs? Doing this for the interpreter is on the to-do list.

Thanks for all the tips. I'd like for you and Paul to help out with Prothon
in this area if you could. At least let me bounce the plans off you two as
I go.
Jul 18 '05 #7
Greg Ewing wrote:
What if the file you're reading is a text file?


if you don't know what encoding a file uses, "text files" contain chunks of
binary data separated by newlines (and/or carriage return characters).

http://www.python.org/peps/pep-0320.html mentions a textfile(filename,
mode, encoding) constructor that hides the ugly "U" flag, and sets up proper
codecs, if necessary.

since you don't always know the encoding until you've looked inside the
file (cf. emacs encoding directive, Python, XML, etc), it would also be nice
to have a "setencodin g" method (or a writable "encoding" attribute). but
adding that to existing file-like objects may turn out to be a lot of work;
easy-to-find-and-use stream wrappers are probably a better idea.
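
for reference, the stream-wrapper approach is already available via the
codecs module (a sketch; the file name is a placeholder):

import codecs

# wrap a binary file object with a reader that decodes on the fly
f = codecs.getreader("utf-8")(open("document.xml", "rb"))
line = f.readline()        # a unicode string, not bytes

# or let codecs.open set up the wrapper in one step
g = codecs.open("document.xml", "r", encoding="utf-8")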

</F>


Jul 18 '05 #8
Greg Ewing wrote:
What if the file you're reading is a text file?


On Windows, Linux and Mac (and most other operating systems)
it is stored as a sequence of bytes. To convert the bytes
to a sequence of characters (ie text) you have to know
what the encoding was that produced the sequence of
bytes.

This can be non-trivial, but pretending that the issue
doesn't exist leads you down the path to the issues present
today in Python and several other languages.

Roger
Jul 18 '05 #9
On Mon, 24 May 2004 21:43:12 -0700, "Mark Hahn" <ma**@prothon.org> wrote:
Roger Binns wrote:
In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan. Bonus points for
Tamil :-)


I sure hope you are kidding. If not you are scaring me away from doing
anything.


Sorry if it's kind of OT, but a huge thread about this appeared in
comp.lang.ruby some time ago.
Quoting a little for you:

"""
|As far as I can see, currently 20 bits are sufficient :-)
|http://www.unicode.org/charts/
|
|And anything after "Special" looks really quite special to me. At least
|western languages as well as Kanji, Hiragana and Katakana are supported.
|IMHO pragmatically 16 bits are good enough.

I assume you're saying that there's no more than 65536 characters on
earth in daily use, even including Asian ideograms (Kanjis).

You are right, if we can live in the idealistic world.

The problems are:

* Japan, China, Korea and Taiwan have characters from same origin,
but with different glyph (appearance). Due to Han unification,
Unicode assigns same character code number to those characters.
We used to use encodings to switch country information (script) in
internationaliz ed applications. Unicode does not allow this
approach. We need to implement another layer to switch script.

* Due to historical reason and unification, some characters do not
round trip through conversion from/to Unicode. Sometimes we lose
information by implicit Unicode conversion.

* Asian people have used multibyte encoding (EUC-JP for example) for
long time. We have gigabytes of legacy encoding files. The cost
of code conversion is not negligible. We also have to care about
the round trip problem.

* There are some huge set of characters little known to western
world. For example, the TRON code contains 170,000 characters.
They are important to researchers, novelists, and people who care
about characters.
"""
Jul 18 '05 #10
