Bytes | Developer Community
writing python extensions in assembly

Can anyone give me pointers/instructions/a template for writing a Python
extension in assembly (or better, HLA)?
Jun 27 '08 #1
inhahe wrote:
> Can anyone give me pointers/instructions/a template for writing a Python
> extension in assembly (or better, HLA)?
You could write a C extension and embed assembly. See the docs for how
to write one. If you know how to implement a C-calling-convention-based
shared library in assembly (being an assembler guru, you surely know how
that works), you could mimic a C extension.

Diez
Jun 27 '08 #2
Well, the problem is that I'm actually not an assembler guru, so I don't know
how to implement a DLL in asm or use a C calling convention, although I'm
sure those instructions are available on the web. I was just afraid of
trying to learn that AND making Python-specific extensions at the same time.
I thought of making a C extension with embedded asm, but that just seemed
less than ideal. But if somebody thinks that's the Right Way to do it,
that's good enough.

Jun 27 '08 #3
inhahe wrote:
> Well, the problem is that I'm actually not an assembler guru, so I don't know
> how to implement a DLL in asm or use a C calling convention, although I'm
> sure those instructions are available on the web. I was just afraid of
> trying to learn that AND making Python-specific extensions at the same time.
> I thought of making a C extension with embedded asm, but that just seemed
> less than ideal. But if somebody thinks that's the Right Way to do it,
> that's good enough.
I think the right thing to do, if you are not fluent in assembly, is not
to do anything in it at all. What do you need it for?

Diez
Jun 27 '08 #4
On Fri, 16 May 2008 11:21:39 -0400
"inhahe" <in****@gmail.com> wrote:
> You could be right, but here are my reasons.
>
> I need to make something that's very CPU-intensive and as fast as possible.
> The faster, the better, and if it's not fast enough it won't even work.
>
> They say that the C++ optimizer can usually optimize better than a person
> coding in assembler by hand, but I just can't believe that, at least for
> me, because when I code in assembler I feel like I can see the best way to
> do it, and I just can't imagine AI would even be smart enough to do it that
> way...
Perhaps. Conventional wisdom says that you shouldn't optimize until
you need to though. That's one of the benefits of the way Python
works. Here's how I would do it.

1. Write the code (call it a prototype) in pure Python. Make sure that
everything is modularized based on functionality. Try to get it split
into nice, bite-size chunks. Make sure that you have unit tests for
everything that you write.

2. Once the code is functioning, benchmark it and find the
bottlenecks. Replace the problem methods with a C extension. Refactor
(and check your unit tests again) if needed to break out the problem
areas into as small a piece as possible.

3. If it is still slow, embed some assembler where it is slowing down.

One advantage of this is that you always know whether your optimizations are
useful. You may be surprised to find that you hardly ever need to go
beyond step 1, leaving you with the most portable and easily maintained
code that you can have.
> For portability, I'd simply write different asm routines for different
> systems. How wide a variety of systems I'd support I don't know. As a bare
> minimum, 32-bit x86, 64-bit x86, and one or more of their available forms of
> SIMD.
Even on the same processor you may have different assemblers depending
on the OS.

--
D'Arcy J.M. Cain <da***@druid.net | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
Jun 27 '08 #5
I like to learn what I need, but I have done assembly before; I wrote a
terminal program in assembly, for example, with ANSI and Avatar support. I'm
just not fluent in much other than the language itself, per se.

Perhaps C would be as fast as my asm would be, but C would not allow me to use
SIMD, which seems like it would improve my speed a lot; I think my goals are
pretty much what SIMD was made for.

Jun 27 '08 #6
inhahe wrote:
> I like to learn what I need, but I have done assembly before; I wrote a
> terminal program in assembly, for example, with ANSI and Avatar support. I'm
> just not fluent in much other than the language itself, per se.
>
> Perhaps C would be as fast as my asm would be, but C would not allow me to use
> SIMD, which seems like it would improve my speed a lot; I think my goals are
> pretty much what SIMD was made for.

That is not true. I've used the AltiVec extensions myself on OS X,
from inside C.

Besides, the parts of your program that are really *worth* optimizing
are astonishingly few. Don't bother using assembler until you need to.

Diez
Jun 27 '08 #7

"D'Arcy J.M. Cain" <da***@druid.net> wrote in message
news:ma***************************************@python.org...
> 2. Once the code is functioning, benchmark it and find the
> bottlenecks. Replace the problem methods with a C extension. Refactor
> (and check your unit tests again) if needed to break out the problem
> areas into as small a piece as possible.
There's probably only 2 or 3 basic algorithms that will need to have all
that speed.
> 3. If it is still slow, embed some assembler where it is slowing down.
I won't know if the assembler is faster until I embed it, and if I'm going
to do that I might as well use it.
Although it's true I'd only have to embed it for one system to see (more or
less).
>> For portability, I'd simply write different asm routines for different
>> systems. How wide a variety of systems I'd support I don't know. As a
>> bare minimum, 32-bit x86, 64-bit x86, and one or more of their
>> available forms of SIMD.

> Even on the same processor you may have different assemblers depending
> on the OS.
Yeah, I don't know much about that; I was figuring perhaps I could limit the
assembler parts / methodology to something I could write generically
enough, and if all else fails write for the other OSes or only support
Windows. Also, I think I should be using SIMD of some sort, and I'm not
sure, but I highly doubt C++ compilers support SIMD.
Jun 27 '08 #8
> yeah I don't know much about that, I was figuring perhaps I could limit the
> assembler parts / methodology to something I could write generically
> enough.. and if all else fails write for the other OS's or only support
> windows. also I think I should be using SIMD of some sort, and I'm not
> sure but I highly doubt C++ compilers support SIMD.
You're wrong.

Maybe we could help you better if you told us what task you are
trying to achieve (or which algorithms you think need optimization).
Jun 27 '08 #9
On May 16, 11:24 am, "inhahe" <inh...@gmail.com> wrote:
> yeah I don't know much about that, I was figuring perhaps I could limit the
> assembler parts / methodology to something I could write generically
> enough.. and if all else fails write for the other OS's or only support
> windows. also I think I should be using SIMD of some sort, and I'm not
> sure but I highly doubt C++ compilers support SIMD.
The Society for Inherited Metabolic Disorders?

Why wouldn't the compilers support it? It's part of the x86
architecture, isn't it?
Jun 27 '08 #10
>> 3. If it is still slow, embed some assembler where it is slowing down.
>
> I won't know if the assembler is faster until I embed it, and if I'm going
> to do that I might as well use it.
> Although it's true I'd only have to embed it for one system to see (more or
> less).
Regardless of whether it's faster, I thought you indicated that really
it's most important that it's fast enough.

That said, it's not true that you won't know if it's faster until you
embed it--that's what unit testing would be for. Write your loop(s)
in Python, C, ASM, <insert language here> and run them, on actual
inputs (or synthetic, if necessary, I suppose). That's how you'll be
able to tell whether it's even worth the effort to get the assembly
callable from Python.

On Fri, May 16, 2008 at 1:27 PM, Mensanator <me********@aol.com> wrote:
> Why wouldn't the compilers support it? It's part of the x86
> architecture, isn't it?
Yeah, but I don't know if it uses it by default, and my guess is it
depends on how the compiler back end goes about optimizing the code
for whether it will see data access/computation patterns amenable to
SIMD.
Jun 27 '08 #11
On May 16, 12:24 pm, "inhahe" <inh...@gmail.com> wrote:
> I won't know if the assembler is faster until I embed it, and if I'm going
> to do that I might as well use it.
You won't know if the C is faster than the assembly until you write
it, and if you're going to do that you might as well use it...

If the C is fast enough, there's no point in wasting time writing the
assembly.

(Also FWIW C and C++ are different languages; you seem to conflate the
two a few times upthread).
Jun 27 '08 #12

"Dan Upton" <up***@virginia.edu> wrote in message
news:ma***************************************@python.org...

>> Why wouldn't the compilers support it? It's part of the x86
>> architecture, isn't it?
>
> Yeah, but I don't know if it uses it by default, and my guess is it
> depends on how the compiler back end goes about optimizing the code
> for whether it will see data access/computation patterns amenable to
> SIMD.
perhaps you explicitly use them with some extended syntax or something?
Jun 27 '08 #13
On Fri, May 16, 2008 at 2:08 PM, inhahe <in****@gmail.com> wrote:
> perhaps you explicitly use them with some extended syntax or something?
Hey, I learned something today.

http://www.tuleriit.ee/progs/rexample.php

Also, from the gcc manpage, 387 floating-point math is apparently the
default when compiling for 32-bit architectures, while SSE instructions
are the default on x86-64; you can use -march=(some architecture with
SIMD instructions), -msse, -msse2, -msse3, or -mfpmath=(one of 387,
sse, or sse,387) to get the compiler to use them.

As long as we're talking about compilers and such... anybody want to
chip in how this works in Python bytecode or what the bytecode
interpreter does? Okay, wait, before anybody says that's
implementation-dependent: does anybody want to chip in what the
CPython implementation does? (or any other implementation they're
familiar with, I guess)
Jun 27 '08 #14
On Fri, 16 May 2008 10:13:04 -0400, inhahe wrote:
> Can anyone give me pointers/instructions/a template for writing a Python
> extension in assembly (or better, HLA)?
Look up the pygame sources. They have some hot inline MMX stuff.
I experimented with this recently and I must admit that it's extremely
hard to beat the C compiler. My first asm code was actually slower than C;
only after reading the Intel docs and figuring out what makes 'movq' and
'movntq' different was I able to write something that was several times
faster than C.

D language inline asm and tools to make Python extensions look very
promising although I haven't tried them yet.

-- Ivan
Jun 27 '08 #15
> As long as we're talking about compilers and such... anybody want to
> chip in how this works in Python bytecode or what the bytecode
> interpreter does? Okay, wait, before anybody says that's
> implementation-dependent: does anybody want to chip in what the
> CPython implementation does? (or any other implementation they're
> familiar with, I guess)
There isn't anything in (C)Python aware of these architecture extensions
- unless 3rd-party libs utilize them. The bytecode interpreter is
machine- and OS-independent, so it's above that level anyway. And AFAIK
all mathematical functionality is what's exposed by the OS math libs.

Having said that, there are of course libs like NumPy that do take
advantage of these architectures, through the use of e.g. the ATLAS lib.

Diez
Jun 27 '08 #16
On Fri, 16 May 2008 11:21:39 -0400, "inhahe"
<in****@gmail.com> wrote:
> They say that the C++ optimizer can usually optimize
> better than a person coding in assembler by hand can,
> but I just can't believe that, at least for me,
> because when I code in assembler,
If one hand-compiles C++ into assembler, the result will
doubtless be crap compared to what the compiler will
generate. If, however, one expresses one's algorithm in
assembler, rather than in C++, the result may well be
dramatically more efficient than expressing one's
algorithm in C++ and having the compiler translate it into
assembler. A factor of ten is quite common.

--
----------------------
We have the right to defend ourselves and our property, because
of the kind of animals that we are. True law derives from this
right, not from the arbitrary power of the omnipotent state.

http://www.jim.com/ James A. Donald
Jun 27 '08 #17
