PEP 358 and operations on bytes

Gerrit Holl

Hi,

In Python 3, reading from a file gives bytes rather than characters.
Some operations currently performed on strings also make sense when
performed on bytes, either if it's binary data or if it's text of
unknown or mixed encoding. Those include of course slicing and other
operators that exist in lists, but also other operations that aren't
currently defined in PEP 358, like:

- str methods endswith, find, partition, replace, split(lines),
startswith,
- Regular expressions

I think those can be useful on a bytes type. Perhaps bytes and str could
share a common parent class? They certainly share a lot of properties
and possible operations one might want to perform.

kind regards,
Gerrit Holl.

--
My first English-language post ever was made to this newsgroup:
http://groups.google.com/group/comp....57acf785ddfb71 :)

Oct 3 '06 #1
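
For illustration, here is a minimal sketch of the str-style operations listed in the post above, applied to bytes. It assumes the bytes type as it was eventually released in Python 3, not the PEP 358 draft under discussion in this thread, and the sample request data is made up.

```python
# Illustrative only: bytes as released in Python 3, not the PEP 358 draft.
data = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"

starts = data.startswith(b"GET ")          # prefix test
ends = data.endswith(b"\r\n\r\n")          # suffix test
pos = data.find(b"HTTP/")                  # index of first match, -1 if absent
head, sep, rest = data.partition(b"\r\n")  # split around the first CRLF
lines = data.splitlines()                  # split on line boundaries
lowered = data.replace(b"GET", b"get")     # returns a new bytes object
fields = head.split(b" ")                  # split the request line on spaces
```

Every argument is itself a bytes object; no decoding to text takes place, so this also works on data of unknown or mixed encoding.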

John Machin

Gerrit Holl wrote:

> Hi,
>
> In Python 3, reading from a file gives bytes rather than characters.
> Some operations currently performed on strings also make sense when
> performed on bytes, either if it's binary data or if it's text of
> unknown or mixed encoding. Those include of course slicing and other
> operators that exist in lists, but also other operations that aren't
> currently defined in PEP 358, like:
>
> - str methods endswith, find, partition, replace, split(lines),
> startswith,
> - Regular expressions
>
> I think those can be useful on a bytes type. Perhaps bytes and str could
> share a common parent class? They certainly share a lot of properties
> and possible operations one might want to perform.

I look at it this way::
Processing text? Use unicode.
Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
bytes.
Nostalgic for confused mixed-use? Don't upgrade.

IMHO, core dev time would be better used on:

* making /relevant/ modules (e.g. struct) work with bytes -- this topic
  is not mentioned in the PEP.
* ensuring it covers everything that array.array('B', ...) does.
* being able to initialise a bytes array to (typically) all zeroes
  without having to instantiate an initialiser, e.g.
  record = bytes(size=996, fill=0) instead of record = bytes(996 * [0])

than on starts(ends)with etc, and regexes.

Cheers,
John

Oct 4 '06 #2
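
As a hedged sketch of the three wishes above: the `size=` and `fill=` keywords are John's proposal and are not used here. The code instead assumes the bytes, bytearray, and struct behaviour of released Python 3, which is not necessarily what the PEP 358 draft specified.

```python
import struct
from array import array

# Zero-filled buffers without building a list of zeroes first.
record = bytes(996)        # immutable, 996 zero bytes
scratch = bytearray(996)   # mutable counterpart

# struct packs to, and unpacks from, bytes objects.
packed = struct.pack("<HI", 0x0102, 7)  # little-endian unsigned short + unsigned int
kind, count = struct.unpack("<HI", packed)

# Overlap with array.array('B', ...): both hold integers in range(256).
as_array = array("B", [1, 2, 3])
same_bytes = bytes(as_array) == bytes([1, 2, 3])
```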

Gerrit Holl

On 2006-10-04 05:10:32 +0200, John Machin wrote:

>> - str methods endswith, find, partition, replace, split(lines),
>> startswith,
>> - Regular expressions
>>
>> I think those can be useful on a bytes type. Perhaps bytes and str could
>> share a common parent class? They certainly share a lot of properties
>> and possible operations one might want to perform.
>
> I look at it this way::
> Processing text? Use unicode.
> Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
> bytes.

But can I use regular expressions on bytes?
Regular expressions are not limited to text.

Gerrit.

Oct 4 '06 #3
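
To make the question concrete, here is a small sketch of a regular expression applied to binary data. It assumes the `re` module as shipped with released Python 3, where a bytes pattern can be matched against a bytes subject; the sample blob is invented.

```python
import re

# A bytes pattern applied to binary data.
blob = b"\x00\x01\x01\x02PNG\x89\x00"
m = re.search(rb"\x01+", blob)
span = m.span() if m else None

# Pattern and subject must be the same kind: bytes with bytes, str with str.
try:
    re.search(r"\x01+", blob)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Mixing a str pattern with a bytes subject raises TypeError, which keeps text and binary matching separate.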

John Machin

Gerrit Holl wrote:

> On 2006-10-04 05:10:32 +0200, John Machin wrote:
>
>>> - str methods endswith, find, partition, replace, split(lines),
>>> startswith,
>>> - Regular expressions
>>>
>>> I think those can be useful on a bytes type. Perhaps bytes and str could
>>> share a common parent class? They certainly share a lot of properties
>>> and possible operations one might want to perform.
>>
>> I look at it this way::
>> Processing text? Use unicode.
>> Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
>> bytes.
>
> But can I use regular expressions on bytes?
> Regular expressions are not limited to text.

So why haven't you been campaigning for regular expression support for
sequences of int, and for various array.array subtypes?

Oct 4 '06 #4

Paul Rubin

"John Machin" <sj******@lexicon.net> writes:

> So why haven't you been campaigning for regular expression support for
> sequences of int, and for various array.array subtypes?

regexps work on byte arrays.

Oct 4 '06 #5

John Machin

Paul Rubin wrote:

> "John Machin" <sj******@lexicon.net> writes:
>> So why haven't you been campaigning for regular expression support for
>> sequences of int, and for various array.array subtypes?
>
> regexps work on byte arrays.

But not on other integer subtypes. If regexps should not be restricted
to text, they should work on domains whose number of symbols is greater
than 256, shouldn't they?

Oct 4 '06 #6

Paul Rubin

"John Machin" <sj******@lexicon.net> writes:

> But not on other integer subtypes. If regexps should not be restricted
> to text, they should work on domains whose number of symbols is greater
> than 256, shouldn't they?

I think the underlying regexp C library isn't written that way. I can
see reasons to want a higher-level regexp library that works on
arbitrary sequences, calling a user-supplied function to classify
sequence elements, the way current regexps use the character code to
classify characters.

Oct 4 '06 #7
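
A minimal sketch of the idea in the post above: a matcher over arbitrary sequences where each pattern item is a user-supplied classifier function plus a quantifier. The names `match_at` and `search`, and the `(classifier, quantifier)` pattern format, are hypothetical and invented for this illustration; this is a toy backtracking matcher, not an existing library.

```python
from typing import Callable, Optional, Sequence, Tuple

# One pattern item: (classifier, quantifier), quantifier is one of "1", "?", "*", "+".
Item = Tuple[Callable[[object], bool], str]

def match_at(pattern: Sequence[Item], seq: Sequence, pos: int) -> Optional[int]:
    """Return the end index of a greedy match starting at pos, or None."""
    if not pattern:
        return pos
    pred, quant = pattern[0]
    rest = pattern[1:]
    lo = 1 if quant in ("1", "+") else 0
    hi = 1 if quant in ("1", "?") else len(seq) - pos
    # Count how many consecutive elements the classifier accepts, up to hi.
    n = 0
    while n < hi and pos + n < len(seq) and pred(seq[pos + n]):
        n += 1
    # Greedy with backtracking: try the longest run first, then shorter ones.
    while n >= lo:
        end = match_at(rest, seq, pos + n)
        if end is not None:
            return end
        n -= 1
    return None

def search(pattern: Sequence[Item], seq: Sequence) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the leftmost match, or None."""
    for start in range(len(seq) + 1):
        end = match_at(pattern, seq, start)
        if end is not None:
            return start, end
    return None

data = [0, 1, 1, 2, 257, 257, 258]
ones = search([(lambda x: x == 1, "+")], data)                              # (1, 3)
wide = search([(lambda x: x == 257, "+"), (lambda x: x > 257, "1")], data)  # (4, 7)
```

Because elements are classified by a function rather than by character code, the same matcher works on lists of integers, tokens, or records, at the cost of running in pure Python.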

bearophileHUGS

Paul Rubin:

> I think the underlying regexp C library isn't written that way. I can
> see reasons to want a higher-level regexp library that works on
> arbitrary sequences, calling a user-supplied function to classify
> sequence elements, the way current regexps use the character code to
> classify characters.

To begin with something concrete: some days ago I started writing
a simple RE engine that works on lists/tuples/arrays and uses Psyco in
a good way (but then I stopped developing it). Once, and only once,
some good uses have been found, someone can translate the code to
C, if necessary.
It seems an interesting thing, but can you find some uses for it?

Bye,
bearophile

Oct 4 '06 #8

Paul Rubin

be************@lycos.com writes:

>> I think the underlying regexp C library isn't written that way. I can
>> see reasons to want a higher-level regexp library that works on
>> arbitrary sequences, calling a user-supplied function to classify
>> sequence elements, the way current regexps use the character code to
>> classify characters.
> ...It seems an interesting thing, but can you find some uses for it?

Yes, I want something like that all the time for file scanning without
having to resort to parser modules or hand coded automata.

Oct 4 '06 #9

Fredrik Lundh

John Machin wrote:

> But not on other integer subtypes. If regexps should not be restricted
> to text, they should work on domains whose number of symbols is greater
> than 256, shouldn't they?

they do:

import re, array

data = [0, 1, 1, 2]

array_type = "IH"[re.sre_compile.MAXCODE == 0xffff]

a = array.array(array_type, data)

m = re.search(r"\x01+", a)

if m:
    print m.span()
    print m.group()

</F>

Oct 4 '06 #10

bearophileHUGS

A simple RE engine written in Python can be short, this is a toy:
http://paste.lisp.org/display/24849
If you can't live without the usual syntax:
http://paste.lisp.org/display/24872

Paul Rubin:

> Yes, I want something like that all the time for file scanning without
> having to resort to parser modules or hand coded automata.

Once read, a file is a string or unicode, and you can use normal
REs on those. If you need list-REs, you have probably split the data
into parts. Can you show one or more examples where you think simple
list-REs can be useful?

Bye,
bearophile

Oct 4 '06 #11

John Machin

Fredrik Lundh wrote:

> John Machin wrote:
>
>> But not on other integer subtypes. If regexps should not be restricted
>> to text, they should work on domains whose number of symbols is greater
>> than 256, shouldn't they?
>
> they do:
>
> import re, array
>
> data = [0, 1, 1, 2]
>
> array_type = "IH"[re.sre_compile.MAXCODE == 0xffff]
>
> a = array.array(array_type, data)
>
> m = re.search(r"\x01+", a)
>
> if m:
>     print m.span()
>     print m.group()

Very minor nit: re.sre_compile doesn't exist before Python 2.5.
Presumably sys.maxunicode can substitute for re.sre_compile.MAXCODE.

That aside, I'd like to nominate myself as UGPOTM (utterly gobsmacked
poster of the month). Not only does that work, but so does this, all
the way back to 2.1 at least:

import re, array
data = [0, 1, 1, 2, 257, 257, 258]
# array_type = "IH"[re.sre_compile.MAXCODE == 0xffff] # Python 2.5
array_type = "H"
a = array.array(array_type, data)
for q in (r"\x01+", ur"\u0101+"):
    m = re.search(q, a)
    if m:
        print m.span()
        print m.group()

produces:

(1, 3)
array('H', [1, 1])
(4, 6)
array('H', [257, 257])

Now, scurrying back towards Gerrit's original point: this feature is
not documented, even for array.array('B', ...). Should it be left as a
happy accident of duck-typing, accessible only to those who stumble
over it, or should it be supported? Should it be included in Python 3?

Cheers,
John

Oct 4 '06 #12
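
For comparison, a hedged sketch of how the same searches could be written against released Python 3. It assumes that `re` accepts any object exposing the buffer protocol as the subject of a bytes pattern, and it swaps in a different technique for the wider alphabet: mapping each integer to a code point and searching the resulting str. This is a workaround chosen for illustration, not the duck-typed array matching shown in the Python 2 code above.

```python
import re
from array import array

# Symbols that fit in one byte: search the array's buffer with a bytes pattern.
small = array("B", [0, 1, 1, 2])
m_small = re.search(rb"\x01+", small)
small_span = m_small.span() if m_small else None   # (1, 3)

# Larger alphabets: map each integer to a code point and search as str.
data = [0, 1, 1, 2, 257, 257, 258]
text = "".join(chr(n) for n in data)   # requires every n in range(0x110000)
spans = []
for q in ("\x01+", "\u0101+"):
    m = re.search(q, text)
    if m:
        spans.append(m.span())         # collects (1, 3) then (4, 6)
```

With a wider typecode such as 'H', a bytes pattern would run over the raw buffer bytes rather than over array elements, which is why the second half converts to str instead.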
