Bytes IT Community

interpreter vs. compiled

I'm curious about some of the details of the internals of the Python
interpreter:

I understand C from a hardware perspective.

x= y+ 1;

Move value from place y into a register
Add 1 to the value in the register
Move the addition's result to place x

The Python disassembly is baffling though.
>>> y = 3
>>> dis.dis('x=y+1')
0 SETUP_LOOP 31037 (to 31040)
3 STORE_SLICE+3
4 <49>

What are SETUP_LOOP and STORE_SLICE? What are these instructions?
Jul 17 '08 #1
40 Replies


I V
On Thu, 17 Jul 2008 15:08:17 -0700, castironpi wrote:
The Python disassembly is baffling though.
>>>y= 3
dis.dis('x=y+1')
You can't disassemble strings of python source (well, you can, but, as
you've seen, the results are not meaningful). You need to compile the
source first:
>>> code = compile('y=x+1', '-', 'single')
>>> dis.dis(code)
1 0 LOAD_NAME 0 (x)
3 LOAD_CONST 0 (1)
6 BINARY_ADD
7 STORE_NAME 1 (y)
10 LOAD_CONST 1 (None)
13 RETURN_VALUE

You may well find these byte codes more meaningful. Note that there is a
list of opcodes at http://docs.python.org/lib/bytecodes.html
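For anyone following along, here is a runnable version of the two-step form above. (In Python 3, `dis.dis` will also compile a source string itself, so the raw-bytecode garbage the original poster saw no longer occurs; the explicit compile step still works and makes the process clear.)

```python
import dis

# Compile the source to a code object first, then disassemble it.
code = compile('y = x + 1', '<string>', 'single')
dis.dis(code)
```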
Jul 17 '08 #2

On Jul 17, 5:37 pm, I V <ivle...@gmail.com> wrote:
> [quoted disassembly and explanation snipped]
Oh. How is the stack represented? Does it keep track of which stack
positions (TOS, TOS1, etc.) are in what registers? Does stack
manipulation consume processor cycles? Here is what I'm thinking:

LOAD_NAME: stack= [ x ]
reg0: x
tos: reg0
LOAD_CONST: stack= [ 1, x ]
reg0: x
reg1: 1
tos: reg1
BINARY_ADD: stack= [ x+ 1, x ]
reg0: x
reg1: x+ 1
tos: reg1
STORE_NAME: y= [ x+ 1], stack= same
reg0: x
reg1: x+ 1
tos: reg1

I may be totally off.
Jul 17 '08 #3

castironpi wrote:
> Oh. How is the stack represented?
As usual, as successive locations in memory.
I have the impression that CPython uses the same stack C does.
While conceptually, CPython may put objects on the stack, I am pretty
sure it actually stacks references (C pointers) to objects in heap memory.
> Does it keep track of which stack
> positions (TOS, TOS1, etc.) are in what registers?
I am sure they are not in registers, just normal memory.
The C code that implements bytecodes to act on stack values will use
registers just like any other C code. So using registers for the stack
would get in the way. Of course, the C code might load pointers on the
stack into address registers when actually needed. But this depends on
the address scheme of a particular processor and how the C code is
compiled to its object code.
> Does stack manipulation consume processor cycles?
Of course. For much more, you should peruse the CPython source.

Jul 18 '08 #4

On 18 Jul., 01:15, castironpi <castiro...@gmail.com> wrote:
> [quoted disassembly snipped]

> Oh. How is the stack represented?
As a pointer to a pointer of PyObject structs.
> Does it keep track of which stack
> positions (TOS, TOS1, etc.) are in what registers? Does stack
> manipulation consume processor cycles?
Python does not store values in registers. It stores locals in arrays
and accesses them by position ( you can see the positional index in
the disassembly right after the opcode name ) and globals / object
attributes in dicts.

For more information you might just download the source distribution
and look for src/Python/ceval.c. This file contains the main
interpreter loop.
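Kay's point about positional access to locals can be seen directly in a disassembly; a minimal sketch (any recent CPython):

```python
import dis

def f(x):
    y = x + 1   # x and y are locals, kept in a fixed-size array
    return y

# Locals are fetched by index (LOAD_FAST/STORE_FAST <n>), while globals
# and attributes go through dict lookups (LOAD_GLOBAL, LOAD_ATTR).
dis.dis(f)
```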

Jul 18 '08 #5

On Jul 17, 11:39 pm, Kay Schluehr <kay.schlu...@gmx.net> wrote:
> [quoted disassembly snipped]
>
> Python does not store values in registers. It stores locals in arrays
> and accesses them by position ( you can see the positional index in
> the disassembly right after the opcode name ) and globals / object
> attributes in dicts.
>
> For more information you might just download the source distribution
> and look for src/Python/ceval.c. This file contains the main
> interpreter loop.
Oh. I was interpreting, no pun, that the column of numbers to the
left indicated how many processor cycles were consumed in each
operation. It doesn't quite make sense, unless BINARY_ADD can refer
to memory outside of the registers, which I doubt on the basis that
two addresses would have to fit into a single operation, plus the
architecture opcode. Given that, what does that column indicate?

I'm intimidated by the source but I may look.
Jul 18 '08 #6

On Jul 18, 2:31 pm, castironpi <castiro...@gmail.com> wrote:
> Given that, what does that column indicate?
Offset.
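That is, the left-hand column is the byte offset of each instruction within the code object's bytecode, not a cycle count. A quick sketch to confirm:

```python
import dis

code = compile('x = y + 1', '<string>', 'exec')
for instr in dis.get_instructions(code):
    # Offsets count bytes from the start of the bytecode and only increase.
    print(instr.offset, instr.opname)
```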
Jul 18 '08 #7

On Jul 17, 11:39 pm, Kay Schluehr <kay.schlu...@gmx.net> wrote:
> [quoted disassembly snipped]
>
> For more information you might just download the source distribution
> and look for src/Python/ceval.c. This file contains the main
> interpreter loop.
Ah, found it. The parts that are making sense are:

register PyObject **stack_pointer;
#define TOP()        (stack_pointer[-1])
#define BASIC_POP()  (*--stack_pointer)

...(line 1159)...
w = POP();
v = TOP();
if (PyInt_CheckExact(v) && PyInt_CheckExact(w)) {
    /* INLINE: int + int */
    register long a, b, i;
    a = PyInt_AS_LONG(v);
    b = PyInt_AS_LONG(w);
    i = a + b;
    if ((i^a) < 0 && (i^b) < 0)
        goto slow_add;
    x = PyInt_FromLong(i);

... Which is more than I was picturing was involved. I understand it
is also specific to CPython. Thanks for the pointer to the code.

My basic question was, what is the difference between compilers and
interpreters, and why are interpreters slow? I'm looking at some of
the answer right now in "case BINARY_ADD:".
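The overhead under discussion can be sketched in Python itself as a toy bytecode loop; the opcode names below are invented for illustration, not CPython's:

```python
def run(program):
    """Toy stack-machine interpreter. Every operation pays for the
    opcode-dispatch tests and the stack traffic -- the per-instruction
    overhead that compiled native code would not have."""
    stack, env, pc = [], {}, 0
    while pc < len(program):
        op = program[pc]
        pc += 1
        if op == 'PUSH':                # push a constant
            stack.append(program[pc]); pc += 1
        elif op == 'LOAD':              # push a variable's value
            stack.append(env[program[pc]]); pc += 1
        elif op == 'ADD':               # pop two operands, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'STORE':             # pop one value into a variable
            env[program[pc]] = stack.pop(); pc += 1
    return env

# y = 3; x = y + 1
env = run(['PUSH', 3, 'STORE', 'y', 'LOAD', 'y', 'PUSH', 1, 'ADD', 'STORE', 'x'])
```

Every iteration pays for the dispatch chain and the stack pushes and pops; a compiler emitting native code for `x = y + 1` collapses all of that into a few machine instructions.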
Jul 18 '08 #8

Dan
On Jul 18, 2:17 pm, castironpi <castiro...@gmail.com> wrote:
> [quoted ceval.c discussion snipped]
>
> My basic question was, what is the difference between compilers and
> interpreters, and why are interpreters slow? I'm looking at some of
> the answer right now in "case BINARY_ADD:".
The basic difference between a (traditional) compiler and an
interpreter is that a compiler emits (assembly) code for a specific
machine. Therefore it must know the specifics of the machine (how many
registers, memory addressing modes, etc), whereas interpreters
normally define themselves by their conceptual state, that is, a
virtual machine. The instructions (bytecode) of the virtual machine
are generally more high-level than real machine instructions, and the
semantics of the bytecode are implemented by the interpreter, usually
in a sort-of high level language like C. This means the interpreter
can run without detailed knowledge of the machine as long as a C
compiler exists. However, the trade off is that the interpreter
semantics are not optimized for that machine.

This all gets a little more hairy when you start talking about JITs,
runtime optimizations, and the like. For a real in-depth look at the
general topic of interpretation and virtual machines, I'd recommend
Virtual Machines by Smith and Nair (ISBN 1-55860-910-5).

-Dan
Jul 18 '08 #9

On Jul 18, 2:13 pm, Dan <thermos...@gmail.com> wrote:
> [quoted discussion snipped]
>
> The basic difference between a (traditional) compiler and an
> interpreter is that a compiler emits (assembly) code for a specific
> machine. [...]
>
> This all gets a little more hairy when you start talking about JITs,
> runtime optimizations, and the like.
You're saying the VM can't compile code. That makes sense, it's not a
compiler. Do I understand correctly that JIT does compile to native
code in some cases?

Python: x= y+ 1
Python VM: push, push, add, store
Assembly: load, load, add, store

Except, the assembly doesn't contain the type-checking that
PyInt_AS_LONG does. But that's not the only thing that stops python
from precompiling to assembly directly. GNU doesn't come with
Python. What sorts of minimal information would be necessary to take
from the GNU libs for the user's specific processor, (the one they're
downloading their version of Python for), to move Python to the
further step of outputting the machine code?
Jul 21 '08 #10

castironpi <ca********@gmail.com> wrote:
> You're saying the VM can't compile code. That makes sense, it's not a
> compiler.
I wouldn't say "can't". The current CPython VM does not compile code. It
COULD. The C#/.NET VM does. IronPython, for example, is an implementation
of Python that uses .NET. In that case, the code *IS* JIT compiled to
assembly when the program starts.
> Do I understand correctly that JIT does compile to native
> code in some cases?
VMs that use JIT do, yes.
> But that's not the only thing that stops python
> from precompiling to assembly directly. GNU doesn't come with
> Python.
Do you mean Linux?
> What sorts of minimal information would be necessary to take
> from the GNU libs for the user's specific processor, (the one they're
> downloading their version of Python for), to move Python to the
> further step of outputting the machine code?
I don't know why you think GNU has anything to do with this. There's
nothing that prevents the Python run-time from JIT compiling the code.
IronPython does this. CPython does not. It's an implementation decision.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Jul 23 '08 #11

>> Oh. How is the stack represented?
> As usual, as successive locations in memory.
> I have the impression that CPython uses the same stack C does.
Actually, it doesn't (at least not for the evaluation stack).

In CPython, when a Python function starts, the maximum depth of the
evaluation stack is known, but it depends on the specific function
(of course). So Python needs to allocate an array for the evaluation
stack with a known size, but can't do so on the C stack (at least not
portably), since you can't allocate a dynamically-sized array as
a local variable in C.

So instead, pymalloc is used to allocate the evaluation stack, and
it is part of the frame object (so the entire frame object is allocated
in one chunk, and then split up into local variables and evaluation
stack).
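The compile-time-known maximum depth Martin mentions is exposed on the code object as `co_stacksize`; a quick sketch:

```python
def f(x):
    y = x + 1
    return y

# Maximum evaluation-stack depth, computed when f was compiled, which is
# what lets the frame (locals + evaluation stack) be allocated in one chunk.
print(f.__code__.co_stacksize)
```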
> While conceptually, CPython may put objects on the stack, I am pretty
> sure it actually stacks references (C pointers) to objects in heap memory.
Correct.
>> Does it keep track of which stack
>> positions (TOS, TOS1, etc.) are in what registers?
> I am sure they are not in registers, just normal memory.
Correct. As discussed above, they are located on the heap (making
Python's frame stack a spaghetti stack).

Regards,
Martin
Jul 23 '08 #12

On Jul 23, 12:08 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> [explanation of the heap-allocated evaluation stack snipped]
Martin,

I am curious and pursuing it as an independent project. I'd like to
write a specialized function to allocate memory from a memory-mapped
file instead of the heap. On Windows, to use CreateFileMapping and
MapViewOfFile. The companion function, premalloc, would re-open an
existing Python object from a handle. (It would need a name or index -
offset look-up.)
svn.python.org is down, so I can't tell if Python already implements
its own memory management, and if so how that would extrapolate to a
byte-array allocated specifically.
Jul 24 '08 #13

Tim Roberts wrote:
> And I'm saying you are wrong. There is NOTHING inherent in Python that
> dictates that it be either compiled or interpreted. That is simply an
> implementation decision. The CPython implementation happens to interpret.
> The IronPython implementation compiles the intermediate language to native
> machine language.
Well, actually there are some Python language features which make
static compilation to machine code difficult. PyPy and Shed Skin
have to impose some restrictions on dynamism to make efficient
compilation feasible. The big problem is "hidden dynamism", where the code
looks static, but at run time, some external piece of code replaces a
function or adds an unexpected attribute to what looked like a simple object
or function in the defining module.

In CPython, everything is a general object internally, and all the
names are resolved over and over again at run time by dictionary lookup.
This is simple, but there's a sizable speed penalty.
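The "hidden dynamism" described above is easy to demonstrate: code that looks static can have its meaning changed from outside, which is why each name must be re-resolved at run time (the function here is just an illustration):

```python
import math

def circumference(r):
    # 'math' and then 'math.pi' are looked up by name on every call,
    # so a compiler cannot safely freeze them into constants.
    return 2 * math.pi * r

print(circumference(1.0))

math.pi = 3                  # hidden dynamism: external code rebinds the attribute
print(circumference(1.0))    # the "same" code now computes something else
```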

John Nagle
Jul 25 '08 #14

castironpi <ca********@gmail.com> wrote:
> In CPython yes. In IronPython yes: the parts that are compiled into
> machine code are the interpreter, *not user's code*.
WRONG! You are WRONG. At "compile" time, the Python code is compiled to
an intermediate language. At "run" time, the intermediate language (which
is still the user's code, just in another representation) is compiled into
machine language. It is the user's program, not the interpreter.

It's the exact same process that occurs in a C compiler. Most C compilers
translate the C program to an intermediate form before finally converting
it to machine language. The only difference is that, in a C compiler, both
steps occur within the compiler. In IronPython, the two steps are
separated in time. There is no other difference.
> Without that
> step, the interpreter would be running on an interpreter, but that
> doesn't get the user's statement 'a= b+ 1' into registers-- it gets
> 'push, push, add, pop' into registers.
You have a fundamental misunderstanding of the compilation process.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Jul 29 '08 #15

On Jul 29, 2:21 pm, castironpi <castiro...@gmail.com> wrote:
> On Jul 28, 5:58 pm, Fuzzyman <fuzzy...@gmail.com> wrote:
> > Well - in IronPython user code gets compiled to in memory assemblies
> > which can be JIT'ed.
>
> I don't believe so.
Uh, you're questioning someone who is not only co-author of a book on
IronPython, but also a developer on one of the first IronPython-based
commercial applications.

I know authorship isn't always a guarantee of correctness, but what
experience do you have with IronPython that makes you so unwilling to
accept the opinion of someone with substantial knowledge of the
subject?

Jul 29 '08 #16

On Jul 29, 7:39 am, alex23 <wuwe...@gmail.com> wrote:
> I know authorship isn't always a guarantee of correctness, but what
> experience do you have with IronPython that makes you so unwilling to
> accept the opinion of someone with substantial knowledge of the
> subject?
None, no experience, no authority, only the stated premises &
classifications, which I am generally tending to misinterpret. I'm
overstepping my bounds and trying to do it politely. (Some might call
it learning, which yes, though uncustomary, *requires questioning
authorities*, or reinventing.)

Evidently, I have a "fundamental misunderstanding of the compilation
process", which I'm trying to correct by stating what I believe. I'm
trying to elaborate, and I'm meeting with increasingly much detail.
So, perhaps I'll learn something out of this. Until then...

What I know I have is two conflicting, contradictory, inconsistent
beliefs. Maybe I've spent too much time in Python to imagine how a
dynamic language can compile.

This is from 7/22/08, same author:
> I wouldn't say "can't". The current CPython VM does not compile
> code. It COULD. The C#/.NET VM does.
Three big claims here that I breezed right over and didn't believe.
It COULD.
I'm evidently assuming that if it could, it would.
The current CPython VM does not compile code.
Therefore it couldn't, or the assumption is wrong. Tim says it is.
And the glaring one--

WHY NOT? Why doesn't CPython do it?

From 7/18/08, my own post:
> [the ceval.c BINARY_ADD snippet quoted earlier]

I am imagining that every Python implementation has something like
it. If IronPython does not, in particular, not have the 'POP();
TOP();' sequence, then it isn't running on a stack machine. Is the
IronPython code open source, and can someone link to it? I'm not
wading through it from scratch. What does it have instead? Does
dynamic typing still work?

<closing hostile remark>
If you're bluffing, bluff harder; I call. If you're not, I apologize;
teach me something. If you can ask better, teach me that too.
</hostile>
Jul 30 '08 #17

On Jul 29, 1:46 am, Tim Roberts <t...@probo.com> wrote:
> It's the exact same process that occurs in a C compiler. Most C compilers
> translate the C program to an intermediate form before finally converting
> it to machine language. The only difference is that, in a C compiler, both
> steps occur within the compiler. In IronPython, the two steps are
> separated in time. There is no other difference.
In C, we have:

int x, y;
x= 10;
y= x+ 1;

It translates as, roughly:
8000 .data
7996 ffffffff #x
7992 ffffffff #y
7988 .end data
7984 loadi reg0 7996
7980 loadi reg1 7992
7976 loadi reg2 10
7972 loadi reg3 1
7968 storv reg2 reg0
7964 add reg0 reg1 reg2
7960 storv reg3 reg1
You are telling me that the same thing happens in IronPython. By the
time the instruction pointer gets to 'x= 10', the next 7 instructions
are the ones shown here compiled from C.

CMIIW, but the CPython implementation -does- -not-. Instead, it has,

push 10
stor x
push 1
add
stor y

each of which amounts to roughly 5-10 machine
instructions. push 10, for example, with instruction_pointer in reg0:

loadi reg1 4 #add 4 to stack pointer (one word)
add reg0 reg1 reg2
load reg0 reg2
loadi reg2 10 #load ten
stor reg0 reg2 #store at top of stack

And this is all not to mention (i) the extra comparisons in
intobject.h,

#define PyInt_CheckExact(op) ((op)->ob_type == &PyInt_Type)

(ii) the huge case statement just to evaluate add, OR (iii)

a = PyInt_AS_LONG(v);
b = PyInt_AS_LONG(w);

because CPython -does- -not- -know- ahead of time which op it will be
executing, or what addresses (remember __coerce__), it will be
performing the op on. Does not know EVER, not until it gets there.

My point is, CPython takes more than seven steps. My question is,
does IronPython?
Jul 30 '08 #18

P: n/a
IronPython doesn't have an interpreter loop and therefore has no POP / TOP / etc... Instead what IronPython has is a method called Int32Ops.Add which looks like:

public static object Add(Int32 x, Int32 y) {
    long result = (long) x + y;
    if (Int32.MinValue <= result && result <= Int32.MaxValue) {
        return Microsoft.Scripting.Runtime.RuntimeHelpers.Int32ToObject((Int32)(result));
    }
    return BigIntegerOps.Add((BigInteger)x, (BigInteger)y);
}

This is the implementation of int.__add__. Note that calling int.__add__ can actually return NotImplemented and that's handled by the method binder looking at the strong typing defined on Add's signature here - and then automatically generating the NotImplemented result when the arguments aren't ints. So that's why you don't see that here even though it's the full implementation of int.__add__.

Ok, next if you define a function like:

def adder(a, b):
    return a + b

this turns into a .NET method, which will get JITed, which in C# would look something like:

static object adder(object a, object b) {
    return $addSite.Invoke(a, b);
}

where $addSite is a dynamically updated call site.

$addSite knows that it's performing addition and knows how to do nothing other than update the call site the 1st time it's invoked. $addSite is local to the function so if you define another function doing addition it'll have its own site instance.

So the 1st thing the call site does is a call back into the IronPython runtime which starts looking at a & b to figure out what to do. Python defines that as try __add__, maybe try __radd__, handle coercion, etc... So we go looking through finding the __add__ method - if that can return NotImplemented then we find the __radd__ method, etc... In this case we're just adding two integers and we know that the implementation of Add() won't return NotImplemented - so there's no need to call __radd__. We know we don't have to worry about NotImplemented because the Add method doesn't have the .NET attribute indicating it can return NotImplemented.

At this point we need to do two things. We need to generate the test which is going to see if future arguments are applicable to what we just figured out, and then we need to generate the code which is actually going to handle this. That gets combined together into the new call site delegate and it'll look something like:

static object CallSiteStub(CallSite site, object a, object b) {
    if (a != null && a.GetType() == typeof(int) && b != null && b.GetType() == typeof(int)) {
        return IntOps.Add((int)a, (int)b);
    }
    return site.UpdateBindingAndInvoke(a, b);
}

That gets compiled down as a lightweight dynamic method which also gets JITed. The next time through, the call site's Invoke body will be this method and things will go really fast if we have ints again. Also notice this is looking an awful lot like the inlined/fast-path(?) code dealing with ints that you quoted. If everything was awesome (currently it's not for a couple of reasons) the JIT would even inline the IntOps.Add call and it'd probably be near identical. And everything would be running native on the CPU.

So that's how 2 + 2 works... Finally if it's a user type then we'd generate a more complicated test like (and getting more and more pseudo code to keep things simple):

if (PythonOps.CheckTypeVersion(a, 42) && PythonOps.CheckTypeVersion(b, 42)){
return $callSite.Invoke(__cachedAddSlot__.__get__(a), b);
}

Here $callSite is another stub which will handle doing optimal dispatch to whatever __add__.__get__ will return. It could be a Python type, it could be a user defined function, it could be the Python built-in sum function, etc... so that's the reason for the extra dynamic dispatch.

So in summary: everything is compiled to IL. At runtime we have lots of stubs all over the place which do the work to figure out the dynamic operation and then cache the result of that calculation.

Also what I've just described is how IronPython 2.0 works. IronPython 1.0 is basically the same but mostly w/o the stubs and where we use stub methods they're much less sophisticated.

Also, IronPython is open source - www.codeplex.com/IronPython
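
For readers following along in Python terms, the self-updating call site described above can be mimicked with a toy sketch. This is purely illustrative - the CallSite class and bind_add binder below are invented for the example and are not IronPython/DLR API:

```python
# Toy model of a DLR-style call site. The first invocation "binds":
# it asks the fallback binder for an implementation, then installs a
# stub that tests the argument types and runs the cached fast path.
class CallSite:
    def __init__(self, binder):
        self.binder = binder                 # slow, fully general lookup
        self.stub = self._update_and_invoke  # no specialization yet

    def _update_and_invoke(self, a, b):
        ta, tb = type(a), type(b)
        target = self.binder(ta, tb)         # e.g. __add__/__radd__ search
        def stub(a, b):
            # the generated "test": do the cached types still apply?
            if type(a) is ta and type(b) is tb:
                return target(a, b)          # fast path
            return self._update_and_invoke(a, b)  # types changed: rebind
        self.stub = stub
        return target(a, b)

    def invoke(self, a, b):
        return self.stub(a, b)

def bind_add(ta, tb):
    # stand-in for the binder's full __add__/__radd__/coercion logic
    return lambda a, b: a + b

add_site = CallSite(bind_add)
print(add_site.invoke(2, 2))      # binds for (int, int) -> 4
print(add_site.invoke(3, 4))      # reuses the int/int stub -> 7
print(add_site.invoke("a", "b"))  # rebinds for (str, str) -> ab
```

The real thing compiles the stub to IL and lets the JIT turn it into machine code; here the "compiled" stub is just a closure, but the caching shape is the same.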

-----Original Message-----
From: py*************************************@python.org [mailto:py*************************************@python.org] On Behalf Of castironpi
Sent: Tuesday, July 29, 2008 9:20 PM
To: py*********@python.org
Subject: Re: interpreter vs. compiled

On Jul 29, 7:39 am, alex23 <wuwe...@gmail.com> wrote:
On Jul 29, 2:21 pm, castironpi <castiro...@gmail.com> wrote:
On Jul 28, 5:58 pm, Fuzzyman <fuzzy...@gmail.com> wrote:
Well - in IronPython user code gets compiled to in memory assemblies
which can be JIT'ed.
I don't believe so.

Uh, you're questioning someone who is not only co-author of a book on
IronPython, but also a developer on one of the first IronPython-based
commercial applications.

I know authorship isn't always a guarantee of correctness, but what
experience do you have with IronPython that makes you so unwilling to
accept the opinion of someone with substantial knowledge of the
subject?
None, no experience, no authority, only the stated premises &
classifications, which I am generally tending to misinterpret. I'm
overstepping my bounds and trying to do it politely. (Some might call
it learning, which yes, though uncustomary, *requires questioning
authorities*, or reinventing.)

Evidently, I have a "fundamental misunderstanding of the compilation
process", which I'm trying to correct by stating what I believe. I'm
trying to elaborate, and I'm meeting with increasingly much detail.
So, perhaps I'll learn something out of this. Until then...

What I know I have is two conflicting, contradictory, inconsistent
beliefs. Maybe I've spent too much time in Python to imagine how a
dynamic language can compile.

This is from 7/22/08, same author:
I wouldn't say "can't". The current CPython VM does not compile
code. It COULD. The C#/.NET VM does.
Three big claims here that I breezed right over and didn't believe.
It COULD.
I'm evidently assuming that if it could, it would.
The current CPython VM does not compile code.
Therefore it couldn't, or the assumption is wrong. Tim says it is.
And the glaring one--

WHY NOT? Why doesn't CPython do it?
From 7/18/08, own author:
>>
#define TOP() (stack_pointer[-1])
#define BASIC_POP() (*--stack_pointer)

....(line 1159)...
w = POP();
v = TOP();
if (PyInt_CheckExact(v) && PyInt_CheckExact(w)) {
/* INLINE: int + int */
register long a, b, i;
a = PyInt_AS_LONG(v);
b = PyInt_AS_LONG(w);
i = a + b;
<<

I am imagining that every Python implementation has something like
it. If IronPython does not, in particular, not have the 'POP();
TOP();' sequence, then it isn't running on a stack machine. Is the
IronPython code open source, and can someone link to it? I'm not
wading through it from scratch. What does it have instead? Does
dynamic typing still work?

<closing hostile remark>
If you're bluffing, bluff harder; I call. If you're not, I apologize;
teach me something. If you can ask better, teach me that too.
</hostile>
--
http://mail.python.org/mailman/listinfo/python-list

Jul 30 '08 #19

P: n/a
I note that IronPython and Python's pickle.dumps do not return the
same value. Perhaps this relates to the absence of an interpreter loop.
>>p.dumps( { 'a': True, 'b': set( ) } )
IPy: '(dp0\nVb\np1\nc__builtin__\nset\np3\n((lp4\ntp5\nRp2\nsVa\np6\nI01\ns.'
CPy: "(dp0\nS'a'\np1\nI01\nsS'b'\np2\nc__builtin__\nset\np3\n((lp4\ntp5\nRp6\ns."
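
As a sanity check that the two pickles describe the same object - runnable in CPython 3, whose loader maps __builtin__ to builtins for old protocol-0 pickles; the byte strings are re-typed from the session above:

```python
import pickle

ipy = b"(dp0\nVb\np1\nc__builtin__\nset\np3\n((lp4\ntp5\nRp2\nsVa\np6\nI01\ns."
cpy = b"(dp0\nS'a'\np1\nI01\nsS'b'\np2\nc__builtin__\nset\np3\n((lp4\ntp5\nRp6\ns."

# Only the string opcodes (V = unicode vs S = str) and the memo
# numbering differ; the object graph is identical.
print(pickle.loads(ipy) == pickle.loads(cpy) == {'a': True, 'b': set()})  # -> True
```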

You make me think of a more elaborate example.

for k in range( 100 ):
    i= j()
    g= h+ i
    e= f+ g
    c= d+ e
    a= b+ c

Here, j creates a new class dynamically, and returns an instance of
it. Addition is defined on it but the return type from it varies.
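
The post doesn't show j; a hypothetical definition with the stated properties (a fresh class per call, and an __add__ whose result type varies) might look like:

```python
_count = 0

def j():
    # Each call manufactures a brand-new class, so every instance has
    # a distinct type and the add sites above can never settle on one
    # cached type pair.
    global _count
    _count += 1
    n = _count
    class C:
        def __add__(self, other):
            # the result type varies too: str on odd calls, int on even
            return str(n) if n % 2 else n
        __radd__ = __add__
    return C()

a, b = j(), j()
print(type(a) is type(b))  # False: two distinct classes
print(5 + a, 5 + b)        # a str and an int
```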

If I read you correctly, IPy can leave hundreds of different addition
stubs laying around at the end of the for-loop, each of which only
gets executed once or twice, each of which was compiled for the exact
combination of types it was called for.

I might construe this to be a degenerate case, and the majority of
times, you'll reexecute stubs enough to outweigh the length of time
the compilation step takes. If you still do the bounds checking, it
takes extra instructions (C doesn't), but operation switch-case
BINARY_ADD, (PyInt_CheckExact(v) && PyInt_CheckExact(w)), and POP and
TOP, are all handled by the selection of stubs from $addSite.

I'm reading from last April:
>>The most interesting cases to me are the 5 tests where CPython is more than 3x faster than IronPython and the other 5 tests where IronPython is more than 3x faster than CPython. CPython's strongest performance is in dictionaries with integer and string keys, list slicing, small tuples and code that actually throws and catches exceptions. IronPython's strongest performance is in calling builtin functions, if/then/else blocks, calling python functions, deep recursion, and try/except blocks that don't actually catch an exception.
<<< http://lists.ironpython.com/pipermai...il/004773.html

It's interesting that CPython can make those gains still by using a
stack implementation.
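
That stack discipline is easy to see from CPython itself (a sketch; exact opcode names vary by version - newer CPythons fold BINARY_ADD into a generic BINARY_OP):

```python
import dis

def adder(a, b):
    return a + b

# a and b are pushed on the value stack; one add instruction then pops
# both and pushes the result -- the POP()/TOP() dance quoted earlier.
dis.dis(adder)
```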

I'll observe that IronPython has the additional dependency of the
full .NET runtime. (It was my point on 7/18 about incorporating the
GNU libs: to compile to machine-native code, as a JIT does, you need
the instruction set of the machine.) CPython, by contrast, can
disregard that, having already been compiled for the machine.

I think what I was looking for is that IronPython employs the .NET to
compile to machine instructions, once it's known what the values of
the variables are that are the operands. The trade-off is compilation
time + type checks + stub look-up.

What I want to know is, if __add__ performs an attribute look-up, is
that optimized in any way, after the IP is already in compiled code?

After all that, I don't feel so guilty about stepping on Tim's toes.

On Jul 30, 12:12 am, Dino Viehland <di...@exchange.microsoft.com>
wrote:
IronPython doesn't have an interpreter loop and therefore has no POP / TOP / etc... Instead what IronPython has is a method call Int32Ops.Add which looks like:

    public static object Add(Int32 x, Int32 y) {
        long result = (long) x + y;
        if (Int32.MinValue <= result && result <= Int32.MaxValue) {
            return Microsoft.Scripting.Runtime.RuntimeHelpers.Int32ToObject((Int32)(result));
        }
        return BigIntegerOps.Add((BigInteger)x, (BigInteger)y);
    }

This is the implementation of int.__add__. Note that calling int.__add__ can actually return NotImplemented and that's handled by the method binder looking at the strong typing defined on Add's signature here - and then automatically generating the NotImplemented result when the arguments aren't ints. So that's why you don't see that here even though it's the full implementation of int.__add__.

Ok, next if you define a function like:

    def adder(a, b):
        return a + b

this turns into a .NET method, which will get JITed, which in C# would look something like:

    static object adder(object a, object b) {
        return $addSite.Invoke(a, b)
    }

where $addSite is a dynamically updated call site.


Jul 30 '08 #20

P: n/a
It looks like the pickle differences are due to two issues. First IronPython doesn't have ASCII strings so it serializes strings as Unicode. Second there are dictionary ordering differences. If you just do:

{ 'a': True, 'b': set( ) }

Cpy prints: {'a': True, 'b': set([])}
Ipy prints: {'b': set([]), 'a': True}

The important thing is that we interop - and indeed you can send either pickle string to either implementation and the correct results are deserialized (modulo getting Unicode strings).

For your more elaborate example you're right that there could be a problem here. But the DLR actually recognizes this sort of pattern and optimizes for it. All of the additions in your code are what I've been calling serially monomorphic call sites. That is they see the same types for a while, maybe even just once as in your example, and then they switch to a new type - never to return to the old one. When IronPython gives the DLR the code for the call site the DLR can detect when the code only differs by constants - in this case type version checks. It will then re-write the code turning the changing constants into variables. The next time through when it sees the same code again it'll re-use the existing compiled code with the new sets of constants.

That's still slower than we were in 1.x so we'll need to push on this more in the future - for example producing a general rule instead of a type-specific rule. But for the time being having the DLR automatically handle this has been working good enough for these situations.
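
In Python terms, the rewrite described above amounts to caching one compiled template per code *shape*, with the formerly-constant type versions passed in as arguments. The sketch below is invented for illustration and is not DLR code:

```python
rule_cache = {}

def get_rule(shape):
    # 'shape' names the stub's structure with its constants abstracted
    # away, e.g. the version-pair check from the generated code:
    #   if (CheckTypeVersion(a, V1) && CheckTypeVersion(b, V2)) ...
    if shape not in rule_cache:
        def rule(v1, v2, a, b, target):
            if getattr(a, "_version", None) == v1 and \
               getattr(b, "_version", None) == v2:
                return target(a, b)
            raise LookupError("rule does not apply; rebind")
        rule_cache[shape] = rule  # "compiled" once, shared by all sites
    return rule_cache[shape]

# A hundred sites that differ only in their version constants now share
# one rule object instead of compiling a hundred stubs:
r1 = get_rule("version-pair-check")
r2 = get_rule("version-pair-check")
print(r1 is r2)  # -> True
```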


Jul 30 '08 #21

P: n/a


castironpi wrote:
>The current CPython VM does not compile code.
CPython compiles Python code to bytecode for its CPython *hardware
independent* VM using standard compiler methods and tools (lexer,
parser, code generator, and optimizer). That VM (interpreter) is
written in portable-as-possible C, with machine/OS #ifdefs added as
needed.
WHY NOT? Why doesn't CPython do it?
1. Portability: The Microsoft C# JIT compiler runs under Windows .NET on
x86/amd64 and maybe ia64 and what else? Just porting .NET to run on
Linux on the same processors was/is a big task. Does MONO have a JIT also?

There is a JIT for Python: Psyco. It originally only worked on x86. I
am not sure what else. It originated as a PhD project, working with
CPython, and was developed further as part of PyPy, but I do not know if
there is any current progress.

Python VM runs on numerous platforms.

2. Money: C#, its JIT, and IronPython were and are funded by MS.
Getting JIT right is hard and tedious.

CPython is mostly a volunteer project. It is also the Python development
platform. So it has to be simple enough for volunteers to pick up on
its innards and for experimentation to be possible. Give the PSF more
resources and

tjr
Jul 30 '08 #22

P: n/a
castironpi <ca********@gmail.com> wrote:
>
In C, we have:

int x, y;
x= 10;
y= x+ 1;

It translates as, roughly:
>>
8000 .data
7996 ffffffff #x
7992 ffffffff #y
7988 .end data
7984 loadi reg0 7996
7980 loadi reg1 7992
7976 loadi reg2 10
7972 loadi reg3 1
7968 storv reg2 reg0
7964 add reg0 reg1 reg2
7960 storv reg3 reg1
I don't recognize that assembly language. Is that another intermediate
language?
>You are telling me that the same thing happens in IronPython.
Yes, the same process happens.
>By the
time the instruction pointer gets to 'x= 10', the next 7 instructions
are the ones shown here compiled from C.
I most certainly did NOT say that, as you well know. Different C compilers
produce different instruction sequences for a given chunk of code. Indeed,
a single C compiler will produce different instruction sequences based on
the different command-line options. It's unreasonable to expect a Python
compiler to produce exactly the same code as a C compiler.

However, that does not disqualify the Python processor as a "compiler".
>CMIIW, but the CPython implementation -does- -not-.
And again, I never said that it did. CPython is an interpreter. The
user's code is never translated into machine language.
>My point is, CPython takes more than seven steps. My question is,
does IronPython?
So, if compiler B isn't as good at optimization as compiler A, does that
mean in your mind that compiler B is not a "compiler"?
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Jul 31 '08 #23

P: n/a
in 76135 20080731 090911 Dennis Lee Bieber <wl*****@ix.netcom.com> wrote:
>On Thu, 31 Jul 2008 06:17:59 GMT, Tim Roberts <ti**@probo.com> declaimed
the following in comp.lang.python:

>And again, I never said that it did. CPython is an interpreter. the
user's code is never translated into machine language.
Using that definition, the UCSD P-code Pascal and Java are also not
"compilers" -- all three create files containing instructions for a
non-hardware virtual machine.

The only difference between Python, UCSD Pascal, and Java is that
Python foregoes the explicit "compiler" pass.

BASIC (classical microcomputer implementations -- like the one M$
supplied for TRS-80s) is an interpreter -- the pre-scan of the source
merely translated BASIC keywords into a byte index, not into opcodes for
any virtual machine.
You are confusing languages with implementations, as I pointed out earlier.
Java is a language.
I have used at least 2 Java compilers, ie they compiled Java source to native
machine language.
Jul 31 '08 #24

P: n/a
Terry Reedy <tj*****@udel.edu> wrote:
1. Portability: The Microsoft C# JIT compiler runs under Windows .NET
on x86/amd64 and maybe ia64 and what else? Just porting .NET to run
on Linux on the same processors was/is a big task. Does MONO have a
JIT also?
Technically there is no such thing as a Microsoft C# JIT compiler: the C#
compiler targets IL and the JIT compilers convert IL to the native machine,
but C# is just one of the frontend compilers you could use.

Microsoft do JIT compilers for .Net Compact Framework that target ARM,
MIPS, SHx and x86. The Mono JIT supports:

s390, s390x (32 and 64 bits) Linux
SPARC(32) Solaris, Linux
PowerPC Linux, Mac OSX
x86 Linux, FreeBSD, OpenBSD, NetBSD, Microsoft Windows, Solaris, OS X
x86-64: AMD64 and EM64T (64 bit) Linux, Solaris
IA64 Itanium2 (64 bit) Linux
ARM: little and big endian Linux (both the old and the new ABI)
Alpha Linux
MIPS Linux
HPPA Linux

(from http://www.mono-project.com/Supported_Platforms)
So I'd say .Net scores pretty highly on the portability stakes. (Although
of course code written for .Net might not do so well).

--
Duncan Booth http://kupuguy.blogspot.com
Jul 31 '08 #25

P: n/a


Duncan Booth wrote:
Terry Reedy <tj*****@udel.edu> wrote:
>1. Portability: The Microsoft C# JIT compiler runs under Windows .NET
on x86/amd64 and maybe ia64 and what else? Just porting .NET to run
on Linux on the same processors was/is a big task. Does MONO have a
JIT also?

Technically there is no such thing as a Microsoft C# JIT compiler: the C#
compiler targets IL and the JIT compilers convert IL to the native machine,
but C# is just one of the frontend compilers you could use.

Microsoft do JIT compilers for .Net Compact Framework that target ARM,
MIPS, SHx and x86. The Mono JIT supports:

s390, s390x (32 and 64 bits) Linux
SPARC(32) Solaris, Linux
PowerPC Linux, Mac OSX
x86 Linux, FreeBSD, OpenBSD, NetBSD, Microsoft Windows, Solaris, OS X
x86-64: AMD64 and EM64T (64 bit) Linux, Solaris
IA64 Itanium2 (64 bit) Linux
ARM: little and big endian Linux (both the old and the new ABI)
Alpha Linux
MIPS Linux
HPPA Linux

(from http://www.mono-project.com/Supported_Platforms)
So I'd say .Net scores pretty highly on the portability stakes. (Although
of course code written for .Net might not do so well).
Did you mean IL scores highly? In any case, scratch 1. portability as a
reason why CPython lacks JIT. That leaves 2. $ in the multimillions.

More
3. Design difference 1: I suspect that IL was designed with JIT
compilation in mind, whereas PyCode was certainly not.

4. Design difference 2: The first 'killer app' for Python was driving
compiled Fortran/C functions (early Numeric). If a 'Python' program
spends 95% of its time in compiled-to-machine-code extensions, reducing
the other 5% to nothing gains little. CPython *was* and has been
designed for this.

The process continues. The relatively new itertools module was designed
and tested in Python (see itertools in the Library reference). But the
delivered module is compiled C.
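
The Library reference's pure-Python equivalents make that point concrete; e.g. count, adapted here from the itertools docs, behaves identically to the shipped C version:

```python
from itertools import count as c_count, islice

def py_count(firstval=0, step=1):
    # pure-Python equivalent from the itertools docs; the delivered
    # module implements the same generator in compiled C
    x = firstval
    while True:
        yield x
        x += step

print(list(islice(py_count(10), 3)))  # -> [10, 11, 12]
print(list(islice(c_count(10), 3)))   # -> [10, 11, 12]
```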

tjr

Aug 1 '08 #26

P: n/a
On Jul 31, 1:17 am, Tim Roberts <t...@probo.com> wrote:
castironpi <castiro...@gmail.com> wrote:
In C, we have:
int x, y;
x= 10;
y= x+ 1;
It translates as, roughly:
8000 .data
7996 ffffffff #x
7992 ffffffff #y
7988 .end data
7984 loadi reg0 7996
7980 loadi reg1 7992
7976 loadi reg2 10
7972 loadi reg3 1
7968 storv reg2 reg0
7964 add reg0 reg1 reg2
7960 storv reg3 reg1

I don't recognize that assembly language. Is that another intermediate
language?
I'm looking at a system word of 1's and 0's that gets executed on a
per-cycle basis in the processor. Could easily be that the designs
are tuned to JIT's these days and I'm out of date, what with
pipelining and lookahead branching and all, but no, it's what I
remember from system architecture class.
You are telling me that the same thing happens in IronPython.

Yes, the same process happens.
By the
time the instruction pointer gets to 'x= 10', the next 7 instructions
are the ones shown here compiled from C.

I most certainly did NOT say that, as you well know. Different C compilers
produce different instruction sequences for a given chunk of code. Indeed,
a single C compiler will produce different instruction sequences based on
the different command-line options. It's unreasonable to expect a Python
compiler to produce exactly the same code as a C compiler.

However, that does not disqualify the Python processor as a "compiler".
CMIIW, but the CPython implementation -does- -not-.

And again, I never said that it did. CPython is an interpreter. The
user's code is never translated into machine language.
My point is, CPython takes more than seven steps. My question is,
does IronPython?

So, if compiler B isn't as good at optimization as compiler A, does that
mean in your mind that compiler B is not a "compiler"?
--
Tim Roberts, t...@probo.com
Providenza & Boekelheide, Inc.
You can translate C code to machine code, for any given C code, for
any given machine.

You can translate Python code to machine code, for some Python code,
for any given machine.

Given the restrictions (or rather, freedoms) of Python, does there
exist code that necessarily cannot translate to machine code? In
other words, can you translate all Python code to machine code?
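
One concrete obstacle, sketched below: source that only comes into existence at run time cannot have been translated ahead of time - though it can be compiled *at* run time, which is exactly the niche a JIT fills:

```python
# This string could just as well arrive from input(), a file, or a
# socket -- no ahead-of-time compiler can have seen it.
src = "def f(x):\n    return x * 2\n"
ns = {}
exec(compile(src, "<runtime>", "exec"), ns)
print(ns["f"](21))  # -> 42
```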
Similarly, I take it that the decision to make CPython a stack machine
+ VM was a design decision, not a necessity, favoring internal
simplicity over the extra 5%.

The output of the program is determined from the input by the Python
specification, regardless of implementation, but the answer still
isn't necessarily yes. But I think the only counterexample that comes
to me is the case of a dynamic grammar, not merely dynamic data type,
so for Python maybe it is. And furthermore, I think I'm getting
confused about what exactly constitutes an interpreter: it is whether
there is a process that runs product instructions, or the product
instructions can run standalone. I would take 'compiler' to mean,
something that outputs an .EXE executable binary file, and I don't
just mean bundling up the python.exe executable with a file. Python
needs to be present on the machine you run target code on, not so with
C binaries. Can .NET bring its targets to such a state, or are those
run-times required? Are they just DLLs, or are they, for lack of a
better word, driving?

(Of course, for futuristic abstract hardware designs, CPython may be
mostly native instructions, and the output of its processor on
function definition, can be stored and run directly, such as for a
stack machine architecture.)

But I don't want to restrict my question to a question of
optimization. Your compiler could output something like this:

read variables into registers
reorganize variable dictionaries
perform addition
do laundry
write variables into system memory
clean sink

and run when you write 'C:\>silly.exe', and so on. And still be
compiled, even if the live IronPython session, which you invoke with,
'C:\>ironpy.exe silly.py', outputs the same 7 MIPS instructions.
Aug 1 '08 #27

P: n/a
On 1 Aug, 07:11, castironpi <castiro...@gmail.com> wrote:
>
Given the restrictions (or rather, freedoms) of Python, does there
exist code that necessarily cannot translate to machine code? In
other words, can you translate all Python code to machine code?
Given that all valid Python code can be executed somehow and that
execution takes place as the processor performs instructions which "it
gets from somewhere", meaning that those instructions can belong
either to a general interpreter or to specific code generated for a
given user program (or a combination of these things), I think that
you have to refine your question. What you seem to be asking is this:
can you translate Python code to machine code which encodes the
behaviour of the user program in a way nearing the efficiency of code
generated from other programming languages? Rephrased, the question is
this: can Python code be efficiently represented using low-level
machine instructions?

I think you've already touched upon this when thinking about integer
operations. The apparently simple case of integer addition in Python
is not completely encoded by a few machine instructions. In other
words...

a + b # in Python

...is not sufficiently represented by...

ldr r1, a
ldr r2, b
add r3, r1, r2

...in some assembly language (and the resulting machine code), mostly
because the semantics of Python addition are more complicated. Of
course, you can generate code for those semantics, which would lead to
quite a few more machine instructions than those suggested above, but
then it might be interesting to bundle those instructions in some kind
of subroutine, and we could call this subroutine BINARY_ADD. At this
point, you'd almost be back at the stage where you're writing a
bytecode interpreter again.
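The bytecode-interpreter idea above can be sketched in a few lines of
Python. This is a toy model for illustration only: the opcode names
mirror CPython's, but nothing here is CPython's actual implementation,
which is written in C.

```python
# Toy stack machine: a minimal model of how an interpreter loop
# dispatches virtual instructions such as BINARY_ADD.

def run(code, consts):
    stack = []
    for op, arg in code:
        if op == "LOAD_CONST":
            stack.append(consts[arg])
        elif op == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            # the full semantics of Python '+' hide inside this line
            stack.append(left + right)
        elif op == "RETURN_VALUE":
            return stack.pop()

# roughly what 'x + 1' compiles to, with x bound to 3
program = [("LOAD_CONST", 0), ("LOAD_CONST", 1),
           ("BINARY_ADD", None), ("RETURN_VALUE", None)]
print(run(program, [3, 1]))
```

The point is that the short burst of useful work sits inside one
branch of a dispatch loop, which is exactly the structure being
described.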

Of course, it's worth considering something in between these
situations (the verbose expansion of the user program vs. a bytecode
interpreter which examines virtual instructions and jumps to
subroutines), and there are apparently a few techniques which make
virtual machines more efficient (so that the processor isn't jumping
around too much in the interpreter code, for example), and there are
also going to be techniques which permit the simplification of any
verbose machine code representation (most likely by not generating
code which is never going to be executed, due to various properties of
the program).

Obviously, CPython isn't oriented towards investigating these matters
in great depth, but that doesn't mean that other implementations can't
pursue other approaches.
Similarly, I take it that the decision to make CPython a stack machine
+ VM was a design decision, not a necessity, favoring internal
simplicity over the extra 5%.
Probably: it simplifies code generation somewhat.

Paul
Aug 1 '08 #28

P: n/a
castironpi wrote:
Similarly, I take it that the decision to make CPython a stack machine
+ VM was a design decision, not a necessity, favoring internal
simplicity over the extra 5%.
Years ago, someone once started a project to write a register-based
virtual machine for (C)Python. I suspect it was abandoned for some
combination of lack of time and preliminary results showing little
speedup for the increased complication. But I never saw any 'final
report'.
And furthermore, I think I'm getting
confused about what exactly constitutes an interpreter: is it that a
separate process runs the product instructions, or that the product
instructions cannot run standalone? I would take 'compiler' to mean
something that outputs an .EXE executable binary file,
This is way too restrictive. Does *nix have no compilers? In any case,
the CPython compiler uses standard compiler components: lexer, parser,
syntax tree, code generator, and peephole optimizer. The result is a
binary file (.pyc for Python compiled) executable on a PyCode machine.
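This is easy to see from the interpreter itself: compile() runs that
whole pipeline and hands back a binary code object. (Shown here for
Python 3; the 2.x output elsewhere in this thread differs only
cosmetically.)

```python
import dis

# compile() exercises the standard pipeline described above:
# lexer -> parser -> syntax tree -> code generator.
code = compile("x = y + 1", "<example>", "exec")

print(type(code))           # a real code object
print(len(code.co_code))    # containing raw bytecode bytes
dis.dis(code)               # human-readable listing of those bytes
```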

Aug 1 '08 #29

P: n/a
On Aug 1, 5:24 am, Paul Boddie <p...@boddie.org.uk> wrote:
On 1 Aug, 07:11, castironpi <castiro...@gmail.com> wrote:
Given the restrictions (or rather, freedoms) of Python, does there
exist code that necessarily cannot translate to machine code? In
other words, can you translate all Python code to machine code?

Given that all valid Python code can be executed somehow and that
execution takes place as the processor performs instructions which "it
gets from somewhere", meaning that those instructions can belong
either to a general interpreter or to specific code generated for a
given user program (or a combination of these things), I think that
you have to refine your question. What you seem to be asking is this:
can you translate Python code to machine code which encodes the
behaviour of the user program in a way nearing the efficiency of code
generated from other programming languages? Rephrased, the question is
this: can Python code be efficiently represented using low-level
machine instructions?

I think you've already touched upon this when thinking about integer
operations. The apparently simple case of integer addition in Python
is not completely encoded by a few machine instructions. In other
words...

  a + b # in Python

...is not sufficiently represented by...

  ldr r1, a
  ldr r2, b
  add r3, r1, r2

...in some assembly language (and the resulting machine code), mostly
because the semantics of Python addition are more complicated.
No, it is not sufficiently represented. Python runs type checks before
the operation and an overflow check after.

test safeinteger a
test safeinteger b
ldr r1, a
ldr r2, b
add r3, r1, r2
test not overflow

However, no implementation of Python can do better, given Python's
specification.
Of
course, you can generate code for those semantics, which would lead to
quite a few more machine instructions than those suggested above, but
then it might be interesting to bundle those instructions in some kind
of subroutine, and we could call this subroutine BINARY_ADD. At this
point, you'd almost be back at the stage where you're writing a
bytecode interpreter again.
This isn't the bytecode interpreter returning, it's bounds checking,
which is part and parcel of Python, I hold.

Another factor: in a given C context, a and b are known to be, and
always are, integers.

int a, b;
...
a + b

The C compilation process outputs:

ldr r1, a
ldr r2, b
add r3, r1, r2

and you are correct. However, for:

string a, b;
a + b

performs a concatenation which is not that simple. The point is, C
compilation runs, and you actually have -ldr, ldr, add- lying around
in a file somewhere, which can run as three consecutive instructions
on a processor. It's already in the interpreter in Python, and you
have the -test, test, ldr, ldr, add, test- sequence somewhere in
Python.exe, specifically wherever the object code for ceval.c is
going.

Incidentally, I find only 2 bytes out of 16K differ between a simple C
program that merely executes a + b and one that executes a - b. It is
not, in this practical case, a three-word difference (12 bytes,
ldr-ldr-add vs. ldr-ldr-sub), though it's not clear what entry points
the OS requires.
Of course, it's worth considering something in between these
situations (the verbose expansion of the user program vs. a bytecode
interpreter which examines virtual instructions and jumps to
subroutines), and there are apparently a few techniques which make
virtual machines more efficient (so that the processor isn't jumping
around too much in the interpreter code, for example), and there are
also going to be techniques which permit the simplification of any
verbose machine code representation (most likely by not generating
code which is never going to be executed, due to various properties of
the program).
I think it's relevant and fair to consider -consecutive- language
instructions at this point. Working example:

int a, b, c, d;
...
a + b
c + d

The C compilation process outputs:

ldr r1, a
ldr r2, b
add r3, r1, r2
ldr r1, c
ldr r2, d
add r3, r1, r2

Whereas there is no equivalent in CPython. The actual code that runs
on the processor (summary) is:

:loop
...
:addition_sign
test safeinteger a
test safeinteger b
ldr r1, a
ldr r2, b
add r3, r1, r2
test not overflow
:goto loop

as opposed to two duplicate sections being outputted back-to-back,
even though they are -run- effectively back-to-back. For a different
type, say list concatenation, the disassembly looks like:
>>> def f():
...     []+[]
...
>>> f()
>>> dis.dis(f)
  2           0 BUILD_LIST               0
              3 BUILD_LIST               0
              6 BINARY_ADD
              7 POP_TOP
              8 LOAD_CONST               0 (None)
             11 RETURN_VALUE

the meaning of which I am not finding in ceval.c. Anyone?
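For what it's worth, the BINARY_ADD case in ceval.c (CPython 2.x)
roughly does the following, rendered here as a Python sketch rather
than the actual C; the real code also special-cases str + str and
tests for int overflow on the fast path.

```python
import operator

# Rough Python rendering of ceval.c's BINARY_ADD case: a fast path
# for exact ints, otherwise fall back to the generic protocol
# (PyNumber_Add), which handles lists, strings, __add__, and so on.
def binary_add(v, w):
    if type(v) is int and type(w) is int:
        return v + w           # fast path; the C code also tests overflow
    return operator.add(v, w)  # generic PyNumber_Add equivalent

print(binary_add(2, 3))
print(binary_add([1], [2]))
```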

Regardless, the JIT compilation process allocates a new executable
block of memory:

:loop
...
:addition_sign
output( 'ldr r1, a; ldr r1, a; ldr r2, b' )
:goto loop

which in this stage executes twice, yielding

ldr r1, a
ldr r2, b
add r3, r1, r2
ldr r1, c
ldr r2, d
add r3, r1, r2

somewhere in memory, same as C. Then it runs the block. It also has
to have already ascertained that 'a' and 'b' are necessarily integers
by the time it makes the output( ) statement. I remain unclear on why
JIT is not called compiling, and why it doesn't output an executable
binary, unless the only differences are the practical ones of saving
developer time and producing cross-platform object files, or mere
terminology.
Aug 2 '08 #30

P: n/a
Dennis Lee Bieber <wl*****@ix.netcom.com> wrote:
>On Thu, 31 Jul 2008 06:17:59 GMT, Tim Roberts <ti**@probo.com> declaimed
the following in comp.lang.python:
>And again, I never said that it did. CPython is an interpreter. The
user's code is never translated into machine language.

Using that definition, the UCSD P-code Pascal and Java are also not
"compilers" -- all three create files containing instructions for a
non-hardware virtual machine.
Right. UCSD p-code Pascal was almost always implemented as an interpreter.
I would be surprised if anyone argued that it was a compiler.

However, I thought Java was usually JIT compiled to machine language. Am I
mistaken?
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Aug 2 '08 #31

P: n/a
castironpi <ca********@gmail.com> wrote:
>
And furthermore, I think I'm getting
confused about what exactly constitutes an interpreter: it is whether
there is a process that runs product instructions, or the product
instructions can run standalone. I would take 'compiler' to mean,
something that outputs an .EXE executable binary file, and I don't
just mean bundling up the python.exe executable with a file.
OK, let me give MY definition. I freely grant that my definition might be
different from anyone else's, but perhaps this will help you understand the
basis for my arguments.

If I run three different CPython programs, the bytes of machine language
that get executed come from the same place: python24.dll. My user
programs are just data. That, in my mind, makes the CPython implementation
an interpreter.

If I compile and run three different C programs, the bytes of machine
language will come from three different places. That, in my mind, makes
my C implementation a compiler.

If I compile and run three different C# programs, the JIT compiler makes
new machine language for each one. The bytes of machine language will come
from three different places. That, in my mind, makes the C# implementation
a compiler.

If I compile and run three different IronPython programs, the JIT compiler
makes new machine language for each one. The bytes of machine language
will come from three different places. That, in my mind, makes the
IronPython implementation a compiler.

All four of those scenarios require run-time library support. Even the C
program does not run on its own. Execution starts in the run-time library,
which sets up an environment before jumping to "main". The C# and
IronPython situations are the same; it's just that there's more processing
going on before jumping to "main".
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Aug 2 '08 #32

P: n/a
On 2 Aug, 08:33, castironpi <castiro...@gmail.com> wrote:
On Aug 1, 5:24 am, Paul Boddie <p...@boddie.org.uk> wrote:
a + b # in Python

...is not sufficiently represented by...

ldr r1, a
ldr r2, b
add r3, r1, r2

...in some assembly language (and the resulting machine code), mostly
because the semantics of Python addition are more complicated.

No, it is not sufficiently represented. Python runs checks before and
after, to check for overflows.
It does more than this...
test safeinteger a
test safeinteger b
ldr r1, a
ldr r2, b
add r3, r1, r2
test not overflow

However, no implementation of Python can do better, given Python's
specification.
In fact, as was probably mentioned before, it does something more like
this:

get method __add__ on a (if possible) or jump to (1)
populate an invocation frame with a and b
call the method
test the result against the special NotImplemented value
if result is not NotImplemented then jump to (3)
(1) get method __radd__ on b (if possible) or jump to (2)
populate an invocation frame with b and a
call the method
test the result against the special NotImplemented value
if result is not NotImplemented then jump to (3)
(2) raise a TypeError exception
(3) provide the result to whatever gets evaluated next
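The dispatch sketched above can be observed from pure Python. A
minimal demonstration of the __add__/__radd__/NotImplemented protocol:

```python
# Demonstrates the steps above: Left declines the addition, so Python
# tries the right operand's __radd__; with no handler on either side,
# a TypeError is raised, as in step (2).

class Left(object):
    def __add__(self, other):
        return NotImplemented   # decline: interpreter tries __radd__

class Right(object):
    def __radd__(self, other):
        return "Right.__radd__ handled it"

print(Left() + Right())         # resolved via __radd__, step (1)

try:
    Left() + object()           # nobody handles it
except TypeError:
    print("TypeError, as in step (2)")
```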

The instructions that you list, which is based on the really simple
case which I mentioned, happens in the method that gets called (eg.
__add__), and then only for integers. Note that seemingly trivial
things like getting the methods can be quite a few instructions in any
low-level code.

[...]
Another factor, a and b are known to be and are always integers, in a
given C context.

int a, b;
...
a + b

The C compilation process outputs:

ldr r1, a
ldr r2, b
add r3, r1, r2
Right. That's why some people want to have type declarations in
Python, and others want to be able to generate specialised code at run-
time for such cases.
and you are correct. However, for:

string a, b;
a + b

performs a concatenation which is not that simple. The point is, C
compilation runs, and you actually have -ldr, ldr, add- lying around
in a file somewhere, which can run as three consecutive instructions
on a processor. It's already in the interpreter in Python, and you
have the -test, test, ldr, ldr, add, test- sequence somewhere in
Python.exe, specifically wherever the object code for ceval.c is
going.
Right. The way to think about this is that due to the mechanics of
working out what kind of operations should be performed (should it be
an integer addition, a string concatentation, something else?),
there's a lot of code executed which is really "dancing around" the
actual work, and then for short bursts of instructions, the work
actually gets done. It's like having to jet around the world, sampling
drinks in different locations, rather than just lining the different
drinks up at home.

Paul
Aug 2 '08 #33

P: n/a
On Aug 2, 2:02 pm, Tim Roberts <t...@probo.com> wrote:
castironpi <castiro...@gmail.com> wrote:
And furthermore, I think I'm getting
confused about what exactly constitutes an interpreter: is it that a
separate process runs the product instructions, or that the product
instructions cannot run standalone? I would take 'compiler' to mean
something that outputs an .EXE executable binary file, and I don't
just mean bundling up the python.exe executable with a file.

OK, let me give MY definition. I freely grant that my definition might be
different from anyone else's, but perhaps this will help you understand the
basis for my arguments.
I understand that we're having a disagreement about terminology. I
further don't understand exactly what JIT languages are, so I can't
agree on that either.

I will observe that a certain amount of corporate hype, and of worker
morale, rides on the notion that JIT technology compiles code. I
suspect it's an exaggeration, not outright false, but I can't prove it
until I can tell you what instructions run, one right after another,
on a concrete architecture I've held in my hand, like the x86 die.
Nor can I thoroughly believe it's true, either, until its creators
have told me what those instructions are. So I'll proclaim ignorance
and await facts... or consistent stories about them.
If I run three different CPython programs, the bytes of machine language
that get executed come from the same place: python24.dll. My user
programs are just data. That, in my mind, makes the CPython implementation
an interpreter.

If I compile and run three different C programs, the bytes of machine
language will come from three different places. That, in my mind, makes
my C implementation a compiler.
True. I agree on the facts and the terms.
If I compile and run three different C# programs, the JIT compiler makes
new machine language for each one. The bytes of machine language will come
from three different places. That, in my mind, makes the C# implementation
a compiler.

If I compile and run three different IronPython programs, the JIT compiler
makes new machine language for each one. The bytes of machine language
will come from three different places. That, in my mind, makes the
IronPython implementation a compiler.
I don't know enough to attest to these for a fact, and you haven't
given enough details to corroborate them as facts. But when you do,
I'll be able to take and learn your terms for them (not that I will,
of course, but I can).
All four of those scenarios require run-time library support. Even the C
program does not run on its own.
I disagree with this, if the C program is statically linked -- the OS
copies the binary (.EXE) from disk into memory, then jumps to a
specific offset in that block / address space. It runs all its own
bytes, then jumps back to an OS-specified point of return of control.
For the other three, though, this is true.
Execution starts in the run-time library,
which sets up an environment before jumping to "main". The C# and
IronPython situations are the same; it's just that there's more processing
going on before jumping to "main".
I want to give a concrete example of 'generating machine code' per se
(as such).

I run this program: <fiction>

bin= open( 'abinary.exe', 'w' )
bin.write( '\x09\x0f\x00\x00' )
for x in range( 10 ):
bin.write( '\x04\xA0' + chr( x ) + '\x00' )
bin.write( '\x01\x20\x00\x00' )

It outputs to 'abinary.exe':

\x09\x0f\x00\x00
\x04\xa0\x00\x00
\x04\xa0\x01\x00
\x04\xa0\x02\x00
\x04\xa0\x03\x00
\x04\xa0\x04\x00
\x04\xa0\x05\x00
\x04\xa0\x06\x00
\x04\xa0\x07\x00
\x04\xa0\x08\x00
\x04\xa0\x09\x00
\x01\x20\x00\x00

Which is 48 bytes (twelve 4-byte words) long and runs in a millisecond. What it does is set
a memory address to successive integers 0..9, then yields. Due to the
nature of program flow control, while it runs its first steps on any
x86 machine, the yield only succeeds if on Windows 98+, and crashes
the machine, or otherwise loses control if not. (That part depends on
those OSses.)

I can try something similar dynamically.

char* mem= alloc( 48 )
setpermission( mem, EXECUTE )
memcpy( mem+ 0, "\x09\x0f\x00\x00", 4 )
for( int x= 0; x< 10; ++x ) {
   memcpy( mem+ 4*( x+ 1 ), "\x04\xA0\x00\x00", 4 )
   mem[ 4*( x+ 1 )+ 3 ]= (char) x
}
memcpy( mem+ 44, "\x01\x20\x00\x01", 4 )
setjump
goto mem

Which with some imagination produces the contents of 'abinary.exe'
above (one difference, last word) in a memory block, at address 'mem',
then jumps to it, which then jumps back, and then exits. </fiction>

I'll compare a C compilation to the first example, 'abinary.exe', and a
JIT compilation to the second example, 'char* mem'. If the comparison
isn't accurate, say how, because these are places I can start from...
(yes, that is, instead of just repeating the claims).

When does a JIT do this, and what does it do in the meantime?
Aug 2 '08 #34

P: n/a
On Aug 3, 1:26 am, castironpi <castiro...@gmail.com> wrote:
[...]
When does a JIT do this, and what does it do in the meantime?
The JIT works like an assembler/linker that writes to memory. It will
load the file(s) containing the bytecode and generate the required
assembly instructions into memory.

In the case there are dependencies to other modules, they will be
loaded as well, and compiled. Then the linker will take care that
cross references between modules are correct, like memory addresses
and branch targets.

A clever JIT might add instrumentation points, so that it can rewrite
the code using profile-guided optimizations; this means generating
optimized code using the program's behaviour as input.

This usually makes JIT code faster than normal compiled code. Although
normal native code is able to start executing faster, it only targets
a specific set of processors.

JIT code is independent of the processor, and a good JIT
implementation is able to exploit the processor better than a direct
native compiler. There is, however, a time penalty on program
startup.
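As a loose analogy in Python itself: a real JIT emits native
instructions into executable memory, not Python source, and the type
test below stands in for the profiling a JIT performs. The function
names here are illustrative, not any library's API.

```python
# Miniature "JIT": once the operand types have been observed, generate
# and compile specialised code at run time instead of interpreting
# generically. The generated "code" here is Python source fed to
# compile(); a real JIT writes processor instructions to memory.

def specialise_add(sample_a, sample_b):
    if isinstance(sample_a, int) and isinstance(sample_b, int):
        src = "def add(a, b):\n    return a + b  # int fast path\n"
    else:
        src = "def add(a, b):\n    return a.__add__(b)\n"
    namespace = {}
    exec(compile(src, "<jit>", "exec"), namespace)
    return namespace["add"]

add_ints = specialise_add(1, 2)   # the "profile" says: ints
print(add_ints(40, 2))
```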

--
Paulo
Aug 5 '08 #35

P: n/a
On Aug 5, 9:21 am, "paulo.jpi...@gmail.com" <paulo.jpi...@gmail.com> wrote:
On Aug 3, 1:26 am, castironpi <castiro...@gmail.com> wrote:

[...]
When does a JIT do this, and what does it do in the meantime?

The JIT works like an assembler/linker that writes to memory. It will
load the file(s) containing the bytecode and generate the required
assembly instructions into memory.

In the case there are dependencies to other modules, they will be
loaded as well, and compiled. Then the linker will take care that
cross references between modules are correct, like memory addresses
and branch targets.
So far this is the same as any compilation, except that the first half
is already done, and the output goes to memory rather than to a file,
which is not any bottleneck.
>A clever JIT might add instrumentation points, so that it can rewrite
the code using profile-guided optimizations, this means generating
optimized code using as input the program behaviour.

This makes JIT code usually faster than normal compiled code.
Here you need an example. You are suggesting that a compiler can make
better optimizations if it knows what functions are going to carry
what loads, run how many times, etc., and it can use profile
statistics as a partial indicator to do that.
>Although normal native code is able to start executing faster, it
only targets a specific set of processors.

JIT code is independent of the processor, and a good JIT
implementation is able to explore the processor better than a direct
native compiler. There is however the time penalty on program
startup.
Once again, you are asserting that knowing what the program has done
so far, say in the first 5 seconds ( or .5 ), can improve
performance. In this case it can make a better choice of which
instructions to use on the CPU. I need an example.
Aug 5 '08 #36

P: n/a
On Tue, Aug 5, 2008 at 12:12 PM, castironpi <ca********@gmail.com> wrote:
On Aug 5, 9:21 am, "paulo.jpi...@gmail.com" <paulo.jpi...@gmail.com> wrote:
>On Aug 3, 1:26 am, castironpi <castiro...@gmail.com> wrote:

[...]
Once again, you are asserting that knowing what the program has done
so far, say in the first 5 seconds ( or .5 ), can improve
performance. In this case it can make better use of what instructions
to use on the CPU. I need an example.
Is there a reason why you're expecting c.l.p to be your personal tutor
for Introduction to Compilers?

It's not that I want to dissuade you in your quest for
self-betterment, but you're being extremely confrontational as well as
vastly ignorant about terminology. Defining your own terms that don't
agree with formal definitions and then demanding (not even politely
asking, for goodness sake) that people justify to you, in excruciating
detail, why simple concepts are true is simply an inexcusable way to
behave. There seems to be something of a rash of this on c.l.p lately.

JIT has been around for decades now, it's well documented, well
understood, and quite common. You'd learn enough to answer every
single one of your demands in 20 minutes with Google, and if you're
seriously going to continue to argue that JIT doesn't exist (and this
is even granting your own bizarre definition of compile, which may as
well be called "purplizing") you should be able to argue from a
position of knowledge instead of stunning, jaw dropping, soul
shattering ignorance.
Aug 5 '08 #37

Regarding exploring processor instructions.

Let's say you compile a C program targeting the x86 architecture, with optimizations turned on for speed, and let the compiler automatically select MMX and SSE instructions for numeric code.

I now have a program that executes very fast and does what I want very well. But when I execute it on an x86 processor with the new SSE4 instructions, that will not matter, because the program cannot take advantage of them.

With a JIT it is different. Assuming that the JIT is also aware of the SSE4 instructions, it might take advantage of the new set, if for a given instruction sequence it is better to do so.
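To make the same idea concrete in Python terms: an implementation can probe at load time for the fastest primitive the platform offers and bind it once, much as a JIT selects an instruction set on startup. (Illustrative sketch only: `int.bit_count` exists from CPython 3.10 on; the fallback is hypothetical portable code.)

```python
def _popcount_portable(n):
    # Kernighan's trick: each pass clears the lowest set bit.
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count

# Probe once at load time and bind the best implementation available,
# analogous to a JIT choosing SSE4 code paths only on CPUs that have them.
if hasattr(int, "bit_count"):          # CPython 3.10+
    popcount = int.bit_count
else:
    popcount = _popcount_portable
```

After this, every call site uses `popcount` without re-checking what the platform supports.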

For the usage of profile-guided optimizations, here are a few examples.

The JIT might find out that in a given section the vector indexes are always valid, so no bounds verification is needed. Or, if the language is an OOP one, it might come to the conclusion that the same virtual method is always called, so there is no need for a VMT lookup before calling the method; thus it replaces the already generated code with a direct call.

Or that a small method is called often enough that it would be better to inline it.
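A toy model of that profile-then-recompile cycle, in Python (the names and the threshold are made up for illustration): count calls on the generic path and, once it is hot, swap in a specialized version that drops the checks.

```python
HOT_THRESHOLD = 100  # arbitrary cutoff for this sketch

def hot_swap(specialized):
    """Decorator: after HOT_THRESHOLD calls, replace the generic
    function with its specialized version -- a toy stand-in for
    profile-guided recompilation in a JIT."""
    def decorator(generic):
        state = {"calls": 0, "impl": generic}
        def wrapper(*args):
            state["calls"] += 1
            if state["calls"] == HOT_THRESHOLD:
                state["impl"] = specialized   # "recompile" the hot path
            return state["impl"](*args)
        return wrapper
    return decorator

def square_fast(x):             # specialized: no checks, like JIT output
    return x * x

@hot_swap(square_fast)
def square(x):                  # generic path, with the check a JIT
    if not isinstance(x, int):  # would later prove unnecessary
        raise TypeError("int expected")
    return x * x
```

A real JIT does the swap at the machine-code level, of course, but the control flow is the same: run instrumented, observe, replace.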

Here are a few papers about profile-guided optimizations:

http://rogue.colorado.edu/EPIC6/EPIC6-ember.pdf
http://www.cs.princeton.edu/picasso/...otOverview.pdf

Of course most of these optimizations are only visible in applications that you use for longer than five minutes.

--
Paulo
Aug 5 '08 #38

On 5 Aug, 20:54, "Chris Mellon" <arka...@gmail.com> wrote:
JIT has been around for decades now, it's well documented, well
understood, and quite common.
Apart from Psyco, whose status is hopefully that of being revived
somewhat [1], not quite common enough to permeate the most popular
Python implementation, it would seem.
You'd learn enough to answer every
single one of your demands in 20 minutes with Google, and if you're
seriously going to continue to argue that JIT doesn't exist (and this
is even granting your own bizarre definition of compile, which may as
well be called "purplizing") you should be able to argue from a
position of knowledge instead of stunning, jaw dropping, soul
shattering ignorance.
Well, I'd rather that we went through the process of occasional
tuition here on comp.lang.python - a process which I think has shown
progress and remained moderately on-topic - rather than have the
endless recycling of threads on syntax polishing and the usual
furniture rearrangement, punctuated by outbursts fuelled by gross
misunderstandings which never get corrected because everyone has
offended everyone else and stormed off to agitate elsewhere.

Indeed, I'd like to see such matters discussed more in the Python
community, not less, and I imagine that I'm not alone with this
opinion. Python 3000 is a prime example of how language tidying has
had complete dominance over certain practical matters like
performance. If such discussion leads people to insights that they
otherwise wouldn't have had, thus improving the situation, then I for
one am happy to entertain the inquirer's apparent ignorance.

Paul

[1] http://www.europython.org/Talks%20an...s/Abstracts#53
Aug 5 '08 #39

On Aug 5, 1:54 pm, "Chris Mellon" <arka...@gmail.com> wrote:
On Tue, Aug 5, 2008 at 12:12 PM, castironpi <castiro...@gmail.com> wrote:
On Aug 5, 9:21 am, "paulo.jpi...@gmail.com" <paulo.jpi...@gmail.com> wrote:
On Aug 3, 1:26 am, castironpi <castiro...@gmail.com> wrote:
Which is 12 bytes long and runs in a millisecond. What it does is set a memory address to successive integers 0..9, then yields. Due to the nature of program flow control, while it runs its first steps on any x86 machine, the yield only succeeds on Windows 98+, and crashes the machine, or otherwise loses control, if not. (That part depends on those OSes.)
I can try something similar dynamically.
char* mem= alloc( 48 )
setpermission( mem, EXECUTE )
memcpy( mem+ 0, "\x09\x0f\x00\x00", 4 )
for( int x= 0; x< 10; ++x ) {
    memcpy( mem+ 4* (x+ 1 ), "\x04\xA0\x00\x00", 4 )
    mem[ 4* (x+ 1 )+ 3 ]= (char) x
}
memcpy( mem+ 44, "\x01\x20\x00\x01", 4 )
setjump
goto mem
Which with some imagination produces the contents of 'abinary.exe' above (one difference, the last word) in a memory block, at address 'mem', then jumps to it, which then jumps back, and then exits. </fiction>
I'll compare a C compilation to the first example, 'abinary.exe', and a JIT compilation to the second example, 'char* mem'. If the comparison isn't accurate, say how, because these are places I can start from....
(yes, that is, instead of just repeating the claims).
When does a JIT do this, and what does it do in the meantime?
The JIT works like an assembler/linker that writes to memory. It will load the file(s) containing the bytecode and generate the required assembly instructions into memory.
In the case there are dependencies to other modules, they will be loaded as well, and compiled. Then the linker will take care that cross references between modules are correct, like memory addresses and branch targets.
So far this is the same as any compilation, except the first half is
done, and the output location, which is not any bottleneck.
A clever JIT might add instrumentation points, so that it can rewrite the code using profile guided optimizations, this means generating optimized code using as input the program behaviour.
This makes JIT code usually faster than normal compiled code.
Here you need an example. You are suggesting that a compiler can make better optimizations if it knows what functions are going to carry what loads, run how many times, etc., and it can use profile statistics as a partial indicator to do that.
Although normal native code is able to start executing faster, it only targets a specific set of processors.
JIT code is independent of the processor, and a good JIT implementation is able to explore the processor better than a direct native compiler. There is however the time penalty on program startup.
Once again, you are asserting that knowing what the program has done so far, say in the first 5 seconds (or .5), can improve performance. In this case it can make better use of what instructions to use on the CPU. I need an example.
Chris,

I looked at your profile on Google Groups. I found this quote of
yours from May:

"The messiness of the real world is *why* you should use processes,
not
a reason to avoid them."

I like it and I do not hate you. But we are definitely "off" on the
wrong "foot", for relevant values of "off" and "foot".
Is there a reason why you're expecting c.l.p to be your personal tutor
for Introduction to Compilers?
Yes. There is a reason fueling this expectation. If people want me
to understand compilers, they are making progress. If they don't,
they are pretending to. Are you expecting c.l.p. not to tutor anyone
to any extent at all? Or have I merely reached my quota of free
tuition? I could see the latter, and merely would disagree; I know of
nothing that makes either of us the end-all be-all of free tuition
quotas. I am curious to pursue this knowledge as both an end and a
means, I am willing to put in my share of the time in maintaining the
dialog, and I am not forcing anyone to disclose secret information.
So there.
It's not that I want to dissuade you in your quest for
self-betterment, but you're being extremely confrontational as well as
vastly ignorant about terminology. Defining your own terms that don't
agree with formal definitions and then demanding (not even politely
asking, for goodness sake) that people justify to you, in excruciating
detail, why simple concepts are true is simply an inexcusable way to
behave. There seems to be something of a rash of this on c.l.p lately.
I am talking like I do not understand because I do not understand. If
you think I am ignorant, you are right! The degree of my
confrontationality might be unusually high to you, unusually low to
others, but is just right to me. I am frustrated that I do not
understand yet, and if it's coming across in my temper that's my
fault. Furthermore, if I am to question everyone's motives equally,
then there is no reason to expect that everyone out there, or everyone
on this thread, will cooperate to learn; and some reason to expect
that some will detract.

I do not see myself as making demands. "Why not?" is an inflammatory
question if you're lying, and an exciting one if you're eager to
learn, study, and teach. If you do not agree with or do not know the
principle of questioning authority, I subscribe to it and will share.

My old hero's fear was to be vested with authority, and then let
people down; therefore, I will not take any that is *not* subject to
question. Self-perception is critical to self-awareness; if I have
false beliefs about myself, I can't correct them without others'
help. Therefore, it's critical for me in pursuing that goal to seek
questions from my peers and friends, and thus common reciprocal
courtesy to offer mine to them.

I am trying to correct my own beliefs in the least harmful way I know
or can conceive. If I fail, it's never permanent; if it's harmful,
that something I want to know.

Would you rather I recite from rote, "JIT exists, JIT exists", or tell
you what I don't understand until I do?
Aug 6 '08 #40

On Aug 5, 2:09 pm, "paulo.jpi...@gmail.com" <paulo.jpi...@gmail.com> wrote:
There are two things I can emphasize after reading the papers and your
post.

One is the benefit of distributing an incompletely compiled module,
which is that it makes machine-targeted specializations possible right
then and there, and does not require recompiling from the source on
the original author's/distributor's budget. Even if there's no
YOURSUITE.EXE built, the code is still in a state where you can make
one, by only needing JIT.LIB on your machine.
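The first point maps directly onto Python's own pipeline: the portable artifact is bytecode, and anything machine-specific can happen later, on the user's machine. A standard-library sketch (note that `marshal` output is CPython-version-specific, so "any machine" really means "any machine with a compatible interpreter version"):

```python
import marshal

# The distributable, machine-independent artifact: a code object.
code = compile("y = x + 1", "<module>", "exec")

# It can be serialized and shipped, like the payload of a .pyc file...
payload = marshal.dumps(code)

# ...and executed later on another machine's interpreter, which is
# free to run it however it likes (plain interpretation, JIT, etc.).
ns = {"x": 41}
exec(marshal.loads(payload), ns)
print(ns["y"])  # 42
```

Nothing in `payload` commits to a CPU; the commitment is deferred to whatever runtime eventually loads it.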

The second is the massive use of lightweight profiling in choosing optimizations in JIT. One of the advantages is inlining common function sequences, which you would otherwise have to detect yourself, or inline everything, to achieve. What are some others?
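One classic answer is the inline cache: profiling reveals that a call site keeps seeing the same receiver type, so the method lookup can be cached and the call dispatched directly. A rough Python model of a one-entry (monomorphic) inline cache, with invented class names for the example:

```python
class InlineCache:
    """One-entry inline cache: remember the last receiver type and
    its resolved method, so repeat calls skip the full lookup."""
    def __init__(self, method_name):
        self.method_name = method_name
        self.cached_type = None
        self.cached_method = None
        self.misses = 0        # lightweight profiling counter

    def call(self, obj, *args):
        t = type(obj)
        if t is not self.cached_type:     # miss: do the slow lookup
            self.misses += 1
            self.cached_type = t
            self.cached_method = getattr(t, self.method_name)
        return self.cached_method(obj, *args)  # hit: direct call

class Dog:
    def speak(self):
        return "woof"

class Cat:
    def speak(self):
        return "meow"

speak_site = InlineCache("speak")
```

Calling `speak_site.call(Dog())` repeatedly pays for one lookup and then dispatches directly; a JIT does the same thing in machine code by patching the call site.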
Aug 9 '08 #41
