Proposal: require 7-bit source str's

Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.

An environment variable or command line option to set this for all
files would also be very useful (and -*- str7bit:False -*- to override
it), so one can easily check someone else's code for trouble spots.

Possibly an s'' syntax or something would also be useful for non-
Unicode strings that intentionally contain national characters.

I dislike the '7bit' part of the name - it's misleading both because
one can get 8-bit strings e.g. with the '\x<hex>' notation (a feature,
not a bug) and because some 'valid' characters will be 8bit in
character sets like EBCDIC. However, I can't think of a better name.

Comments?
Has it been discussed before?

--
Hallvard
Jul 18 '05 #1
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.


Could

# -*- coding: ascii -*-

be sufficient? Why would you reintroduce ambiguity with your s-prefixed
strings? The long-term goal would be unicode throughout, IMHO.

Peter
Jul 18 '05 #2

"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote in message
news:HB**************@bombur.uio.no...
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.

An environment variable or command line option to set this for all
files would also be very useful (and -*- str7bit:False -*- to override
it), so one can easily check someone else's code for trouble spots.

Possibly an s'' syntax or something would also be useful for non-
Unicode strings that intentionally contain national characters.

I dislike the '7bit' part of the name - it's misleading both because
one can get 8-bit strings e.g. with the '\x<hex>' notation (a feature,
not a bug) and because some 'valid' characters will be 8bit in
character sets like EBCDIC. However, I can't think of a better name.

Comments?
Has it been discussed before?

Is this even an issue? If you specify utf-8 as the character
set, I can't see how non-unicode strings could have
anything other than 7-bit ascii, for the simple reason that
the interpreter wouldn't know which encoding to use.
(of course, hex escapes would still be legal, as well as
constructed strings and strings read in and so forth.)

On the other hand, I don't know that it actually does it this
way, and PEP 263 seems to be completely uninformative
on the issue.

John Roth

Jul 18 '05 #3
John Roth wrote:
"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote in message
news:HB**************@bombur.uio.no...
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.
(...)


Is this even an issue? If you specify utf-8 as the character
set, I can't see how non-unicode strings could have
anything other than 7-bit ascii, for the simple reason that
the interpreter wouldn't know which encoding to use.


Sorry, I should have included an example.

# -*- coding:iso-8859-1; str7bit:True; -*-

A = u'hør' # ok
B = 'hør' # error because of str7bit.
print B

The 'coding' directive ensures this source code is translated correctly
to Unicode. However, string B is then translated back to the source
character set so it can be stored as a str object and not a unicode
object.

The print statement just outputs the bytes in B, it doesn't do any
character set handling. So if your terminal uses latin-2, it will
output the 'ø' as Latin small letter r with caron.

coding:utf-8 wouldn't help. B would remain a plain string, not a
Unicode string. The raw utf-8 bytes would be output.
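To make the byte-level behaviour concrete, here is a minimal sketch. It assumes Python 3, where the plain string B has to be modelled as an explicit bytes object, and the variable names are purely illustrative.

```python
# Sketch (Python 3): model the Python 2 plain string B as raw bytes.
text = "hør"

# With coding:iso-8859-1, the plain str literal holds the Latin-1 bytes.
b_latin1 = text.encode("iso-8859-1")        # b'h\xf8r'

# A Latin-2 terminal interprets byte 0xF8 as U+0159,
# Latin small letter r with caron.
seen_on_latin2 = b_latin1.decode("iso-8859-2")   # 'h\u0159r'

# With coding:utf-8, the plain literal holds the raw UTF-8 bytes instead.
b_utf8 = text.encode("utf-8")               # b'h\xc3\xb8r'
```

In both cases the bytes carry no record of their encoding, which is why the output depends on whatever the terminal assumes.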

--
Hallvard
Jul 18 '05 #4
Peter Otten wrote:
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

Could

# -*- coding: ascii -*-

be sufficient?


No. It would be used together with coding: <non-ascii charset>. The
point is to ensure that all non-ASCII strings are u'' strings instead
of plain strings.

Why would you reintroduce ambiguity with your s-prefixed
strings?

For programs that work with non-Unicode output devices or files and
know which character set they use. Which is quite a lot of programs.

The long-term goal would be unicode throughout, IMHO.


Whose long-term goal for what? For things like Internet communication,
fine. But there are a lot of less 'global' applications where other
character encodings make more sense.

In any case, a language's both short-term and long-term goals should be
to support current programming, not programming like it 'should be done'
some day in the future.

--
Hallvard
Jul 18 '05 #5
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.


I doubt this helps as much as you'd like. You will need to change every
source file with that annotation. While you are at it, you could just
as well check every source file directly.

So if anything, I think this should be a global option. Or, better yet,
external checkers like pychecker could check for that.
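As a rough illustration of what such an external check might look like, here is a minimal sketch built on the standard tokenize module. It is not pychecker code; the function name is hypothetical, and it assumes the checker runs under Python 3 against source text that has already been decoded. Python-2-only prefixes such as ur'' may not be recognised by a Python 3 tokenizer.

```python
import io
import tokenize

def find_non_ascii_plain_strings(source):
    """Return (line, literal) pairs for non-u'' string literals
    that contain non-ASCII source characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        literal = tok.string
        # The prefix is everything before the first quote, e.g. '', 'u', 'r'.
        quote_pos = min(i for i in (literal.find("'"), literal.find('"'))
                        if i != -1)
        prefix = literal[:quote_pos].lower()
        if "u" in prefix:
            continue  # Unicode literal: non-ASCII is allowed
        if any(ord(ch) > 127 for ch in literal):
            hits.append((tok.start[0], literal))
    return hits

# Escapes such as \xf8 are plain ASCII in the source, so C is not flagged.
sample = "A = u'hør'\nB = 'hør'\nC = 'h\\xf8r'\n"
hits = find_non_ascii_plain_strings(sample)   # [(2, "'hør'")]
```

Comments never reach the check because the tokenizer reports them as COMMENT tokens, so non-ASCII text in comments stays legal.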

Regards,
Martin
Jul 18 '05 #6
Peter Otten wrote:
Could

# -*- coding: ascii -*-

be sufficient?


No. He still wants to allow non-ASCII in Unicode literals and
comments.

Regards,
Martin
Jul 18 '05 #7
Hallvard B Furuseth wrote:
Peter Otten wrote:
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

Could

# -*- coding: ascii -*-

be sufficient?


No. It would be used together with coding: <non-ascii charset>. The
point is to ensure that all non-ASCII strings are u'' strings instead
of plain strings.


OK.

Why would you reintroduce ambiguity with your s-prefixed
strings?


For programs that work with non-Unicode output devices or files and
know which character set they use. Which is quite a lot of programs.


I'd say a lot of programs work with non-unicode, but many don't know what
they are doing, i.e. you cannot move them into an environment with a
different encoding (if you do they won't notice).

The long-term goal would be unicode throughout, IMHO.


Whose long-term goal for what? For things like Internet communication,
fine. But there are a lot of less 'global' applications where other
character encodings make more sense.


Here we disagree. Showing the right image for a character should be the job
of the OS and should safely work cross-platform. Why shouldn't I be able to
store a file with a Greek or Chinese name? I wasn't able to quote Martin's
surname correctly for the Python-URL. That's a mess that should be cleaned
up once per OS rather than once per user. I don't see how that can happen
without unicode (only). Even NASA blunders when they have to deal with
meters and inches.

In any case, a language's both short-term and long-term goals should be
to support current programming, not programming like it 'should be done'
some day in the future.


Well, Python's integers already work like they 'should be done'. I'm no
expert, but I think Java is closer to the 'real thing' concerning strings.
Perl 6 is going for unicode, if only to overcome the limitations of their
operator set (they want the yen symbol as a zipping operator because it
looks like a zipper :-).

You have to make compromises and I think an external checker would be the
way to go in your case. If I were to add a switch to Python's string
handling it would be "all-unicode". But it may well be that I would curse
it after the first real-world use...

Peter

Jul 18 '05 #8
Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.

I doubt this helps as much as you'd like. You will need to change every
source file with that annotation.


perl -i.bak -pe '
/\bstr7bit\b/ or
s/^(\s*#.*?-\*-.*?coding[=:]\s*[\w.-]+)(?=[;\s])/$1;str7bit:True/
' `find . -name '*.py' | xargs grep -l 'coding[=:]'`

While you are at it, you could just
as well check every source file directly.

True at first pass, but if Python catches it, a file will stay
clean once it has been cleaned up and marked as str7bit. That's
particularly useful when several people are working on the source.

A fix to your objection would be to instead warn about the
offending strings _unless_ the file is marked with str7bit:False,
but I figure that's a bit too drastic for the time being :-)

So if anything, I think this should be a global option.

-W::str7bitWarning?

Come to think of it, that would also make it possible for a Python
program to reject add-ons (modules, execfile etc) which contain
unmarked 8-bit strings.

Or, better yet,
external checkers like pychecker could check for that.


Well, I don't think that's better, but if it's rejected for Python
that'll be my next stop.

--
Hallvard
Jul 18 '05 #9
Hallvard B Furuseth wrote:
Well, I don't think that's better, but if it's rejected for Python
that'll be my next stop.


I can see how the global warning is provided (which then can be
configured into an error through the warnings module). However,
I don't want to introduce additional magic comments. I already
dislike the coding declarations for being comments, and would
have preferred if they had been spelled as

directive encoding "utf-8"

The coding declaration was only acceptable because
- a statement would have to go before the doc string, in
which case it would not have been a docstring anymore, and
- there was prior art (Emacs and VIM) for declaring encodings
to editors, inside comments

Your proposed annotation has no prior art. As it has
effects on the syntax of the language, it should not be in
a comment.

Regards,
Martin
Jul 18 '05 #10
