Standard Forth versus Python: a case study

John Doty

I realized that I have a little job on the table that is a fine test of
the Python versus Standard Forth code availability and reusability issue.

Note that I have little experience with either Python or Standard Forth
(but I have much experience with a very nonstandard Forth). I've noodled
around a bit with both gforth and Python, but I've never done a serious
application in either. In my heart, I'm more of a Forth fan: Python is a
bit too much of a black box for my taste. But in the end, I have work to
get done.

The problem:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

This is a real problem I need to solve, not a made-up toy problem. I was
originally thinking of solving it in C (I know where to get the pieces
in that language), but it seemed like a good test problem for the Python
versus Forth issue.

I looked to import FITS reading/writing, array manipulation, and median
determination. From there, the solution should be pretty easy.

So, first look for a median function in Python. A little googling finds:

http://www.astro.cornell.edu/staff/loredo/statpy/

Wow! This is good stuff! An embarrassment of riches here! There are even
several FITS modules, and I wasn't even looking for those yet. And just
for further gratification, the page's author is an old student of mine
(but I'll try not to let this influence my attitude). So, I followed the
link to:

http://www.nmr.mgh.harvard.edu/Neura...ry/python.html

From there, I downloaded stats.py, and the two other modules the page
says it requires, and installed them in my site-packages directory. Then
"from stats import median" got me a function to approximately determine
the median of a list. It just worked. The approximation is good enough
for my purposes here.

Pyfits required a little more resourcefulness, in part because STSCI's
ftp server was down yesterday, but I got it installed too. It helps that
when something is missing, the error message gives you a module name. It
needs the numarray module, so I got array manipulation as a side effect.

I haven't finished the program, but I've tried out the pieces and all
looks well here.

OK, now for Forth. Googling for "forth dup swap median" easily found:

http://www.taygeta.com/fsl/library/find.seq

At first blush, this looked really good for Forth. The search zeroed in
on just what I wanted, no extras. The algorithm is better than the one
in the Python stats module: it gives exact results, so there's no need
to check that an approximation is good enough. But then, the
disappointment came.

What do you do with this file? It documents the words it depends on, but
not where to get them. I'm looking at a significant effort to assemble
the pieces here, an effort I didn't suffer through with Python. So, my
first question was: "Is it worth it?".

The answer came from searching for FITS support in Forth. If it exists
in public, it must be really well hidden. That's a "show stopper", so
there was no point in pursuing the Forth approach further.

In the end, it was like comparing a muzzle-loading sharpshooter's rifle
with a machine gun: Forth got off one really good shot, but Python just
mowed the problems down.

The advocates of the idea that Standard Forth has been a boon to code
reusability seem mostly to be people with large private libraries of
Forth legacy code. No doubt to them it really has been a boon. But I
think this little experiment shows that for the rest of us, Python has a
published base of reusable code that puts Forth to shame.

--
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.

Oct 11 '06 #1

Subscribe Post Reply

5395

Paul Rubin

John Doty <jp*@whispertel.LoseTheH.netwrites:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

I dunno what FITS is, but if you have a list of pixel values, that
calculation sounds like two lines:

median = sorted(pixels)[len(pixels)//2]
new_pixels = [p-median for p in pixels]

Oct 11 '06 #2

John Doty

Paul Rubin wrote:

John Doty <jp*@whispertel.LoseTheH.netwrites:
>I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

I dunno what FITS is, but if you have a list of pixel values, that
calculation sounds like two lines:

median = sorted(pixels)[len(pixels)//2]
new_pixels = [p-median for p in pixels]

Yes. The efficient exact algorithms for this problem use *partial*
sorts. The Forth one from the FSL is of this class (although I know of
two better ones for big arrays). But it's tough to beat the efficiency
of the approximate histogram-based method the Python stats module
implements.

--
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.

Oct 11 '06 #3

bearophileHUGS

John Doty:

Yes. The efficient exact algorithms for this problem use *partial*
sorts. The Forth one from the FSL is of this class (although I know of
two better ones for big arrays). But it's tough to beat the efficiency
of the approximate histogram-based method the Python stats module
implements.

The usual way to compute a true median with Python may be:

def median(inlist):
newlist = sorted(inlist)
index = len(newlist) // 2
if len(newlist) % 2:
return newlist[index]
else:
return (newlist[index] + newlist[index-1]) / 2.0

If you can use Psyco and your FITS lines are really long (well, maybe
too much, the treshold if about >~3000 in my PC) you can use something
like this instead the builtin timsort:
http://aspn.activestate.com/ASPN/Coo.../Recipe/466330
(To compute the median on a image line the median that uses sort is
probably better in most cases, expecially if you use the built in sort
of numerical libraries.)

Bye,
bearophile

Oct 11 '06 #4

jacko

be************@lycos.com wrote:

John Doty:
Yes. The efficient exact algorithms for this problem use *partial*
sorts. The Forth one from the FSL is of this class (although I know of
two better ones for big arrays). But it's tough to beat the efficiency
of the approximate histogram-based method the Python stats module
implements.

The usual way to compute a true median with Python may be:

def median(inlist):
newlist = sorted(inlist)
index = len(newlist) // 2
if len(newlist) % 2:
return newlist[index]
else:
return (newlist[index] + newlist[index-1]) / 2.0

If you can use Psyco and your FITS lines are really long (well, maybe
too much, the treshold if about >~3000 in my PC) you can use something
like this instead the builtin timsort:
http://aspn.activestate.com/ASPN/Coo.../Recipe/466330
(To compute the median on a image line the median that uses sort is
probably better in most cases, expecially if you use the built in sort
of numerical libraries.)

Bye,
bearophile

partial bucket sort with quicksort of individual bucket needed for
index list.

APL would be fast, try a solution in J

it calculates need on demand i think, and so the calculation dependance
tree only does the one quicksort on the bucket needed, or both if on a
bucket boundry, but this can be avoided with clever bucket selection.

cheers

Oct 11 '06 #5

idknow

be************@lycos.com wrote:

John Doty:
Yes. The efficient exact algorithms for this problem use *partial*
sorts. The Forth one from the FSL is of this class (although I know of
two better ones for big arrays). But it's tough to beat the efficiency
of the approximate histogram-based method the Python stats module
implements.

The usual way to compute a true median with Python may be:

def median(inlist):
newlist = sorted(inlist)
index = len(newlist) // 2
if len(newlist) % 2:
return newlist[index]
else:
return (newlist[index] + newlist[index-1]) / 2.0

[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Oct 12 '06 #6

Terry Reedy

<id****@gmail.comwrote in message
news:11*********************@h48g2000cwc.googlegro ups.com...

be************@lycos.com wrote:
no sort() is needed to calculate the median of a list.
you just need one temp var.

Of course. But for a short enough list, the builtin sort() method may be
faster than an O(n) algorithm coded in Python. And .sort() is already
present.

tjr

Oct 12 '06 #7

Paul McGuire

<id****@gmail.comwrote in message
news:11*********************@h48g2000cwc.googlegro ups.com...

[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Ok, I'll bite. How do you compute the median of a list using just a single
temp var?

-- Paul

Oct 12 '06 #8

werty

Apples/oranges ? programmers are making very little $$ today .
Thats software ! No one is makin money on obsolete Forth ,
so why a comparisom ?

Ultimately the best OpSys will be free and millions of lines of code
obsoleted . Because no one can protect intellectual property , its
simply too costly for a Government to do this .

Notice the crypto fiasco , the USA govt forbad export of SSL and
in short order Australia gave the world a free SSLeasy !

This is software . No one will , for long , own software .
Microsoft and Linux will die in 24 months . No hardware will
produce money/profits using them .

Because we have a new hardware competitor , that obsoletes the BIOS
chip , North bridge/South bridge USB helpers , NICs .
No mo PNP (Plug_and_Pray) .
In 2 chips ( CPU /SoC and LCD controller) it will be faster than
a PC . 100's have allready written HDD drivers and haven't yet
free'd them . But when others give free , what good do it to
hold yours ! You look stupid selling what others improve and
free . Trying to sell what others find easy to create !
Intel made hardware too hard to program , ARM is easy .
Python wont last on ARM , for WCE and Linux will be tossed .
A simpler structured method will be used to customise
Browsers . There will be 2 columns , one on left will be main/orig
but on the Right will have hyperlink result . This way ya dont have
to go back to compare ! side by side .
Text editors will also work this way . You will read orig in left
columns
and 'edit'/annotate your stuff to the Right columns .

Sun microsystems et all ( the ones we seldom think about ) will
all be obsoleted , bankrupted , their hardware surplused .
No more h/w servers .
There will be no high speed 64 bit boxes in the future .
The Pocket PC will do work you could not imagine !
All will have 100 GB HDD , VGA LCD , USBH/USBD , WIFI N
and no WERTY keyboard ! full GUI ...
ethernet and firewire and rs232 will die

Why not "see" the future ?
No Linux ,no WXP , no C+ , no PC .....

John Doty wrote:

I realized that I have a little job on the table that is a fine test of
the Python versus Standard Forth code availability and reusability issue.

Note that I have little experience with either Python or Standard Forth
(but I have much experience with a very nonstandard Forth). I've noodled
around a bit with both gforth and Python, but I've never done a serious
application in either. In my heart, I'm more of a Forth fan: Python is a
bit too much of a black box for my taste. But in the end, I have work to
get done.

The problem:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

This is a real problem I need to solve, not a made-up toy problem. I was
originally thinking of solving it in C (I know where to get the pieces
in that language), but it seemed like a good test problem for the Python
versus Forth issue.

I looked to import FITS reading/writing, array manipulation, and median
determination. From there, the solution should be pretty easy.

So, first look for a median function in Python. A little googling finds:

http://www.astro.cornell.edu/staff/loredo/statpy/

Wow! This is good stuff! An embarrassment of riches here! There are even
several FITS modules, and I wasn't even looking for those yet. And just
for further gratification, the page's author is an old student of mine
(but I'll try not to let this influence my attitude). So, I followed the
link to:

http://www.nmr.mgh.harvard.edu/Neura...ry/python.html

From there, I downloaded stats.py, and the two other modules the page
says it requires, and installed them in my site-packages directory. Then
"from stats import median" got me a function to approximately determine
the median of a list. It just worked. The approximation is good enough
for my purposes here.

Pyfits required a little more resourcefulness, in part because STSCI's
ftp server was down yesterday, but I got it installed too. It helps that
when something is missing, the error message gives you a module name. It
needs the numarray module, so I got array manipulation as a side effect.

I haven't finished the program, but I've tried out the pieces and all
looks well here.

OK, now for Forth. Googling for "forth dup swap median" easily found:

http://www.taygeta.com/fsl/library/find.seq

At first blush, this looked really good for Forth. The search zeroed in
on just what I wanted, no extras. The algorithm is better than the one
in the Python stats module: it gives exact results, so there's no need
to check that an approximation is good enough. But then, the
disappointment came.

What do you do with this file? It documents the words it depends on, but
not where to get them. I'm looking at a significant effort to assemble
the pieces here, an effort I didn't suffer through with Python. So, my
first question was: "Is it worth it?".

The answer came from searching for FITS support in Forth. If it exists
in public, it must be really well hidden. That's a "show stopper", so
there was no point in pursuing the Forth approach further.

In the end, it was like comparing a muzzle-loading sharpshooter's rifle
with a machine gun: Forth got off one really good shot, but Python just
mowed the problems down.

The advocates of the idea that Standard Forth has been a boon to code
reusability seem mostly to be people with large private libraries of
Forth legacy code. No doubt to them it really has been a boon. But I
think this little experiment shows that for the rest of us, Python has a
published base of reusable code that puts Forth to shame.

--
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.

Oct 12 '06 #9

Ben Finney

"werty" <we***@swissinfo.orgwrites:

Browsers . There will be 2 columns , one on left will be main/orig
but on the Right will have hyperlink result . This way ya dont have
to go back to compare ! side by side .
Text editors will also work this way . You will read orig in left
columns
and 'edit'/annotate your stuff to the Right columns .

Hmm. Your ideas are intriguing to me and I wish to subscribe to your
newsletter.

--
\ "A celebrity is one who is known by many people he is glad he |
`\ doesn't know." -- Henry L. Mencken |
_o__) |
Ben Finney

Oct 12 '06 #10

Paul Rubin

"Paul McGuire" <pt***@austin.rr._bogus_.comwrites:

Ok, I'll bite. How do you compute the median of a list using just a single
temp var?

Well there's an obvious quadratic-time method...

Oct 12 '06 #11

bearophileHUGS

id****@gmail.com wrote:

no sort() is needed to calculate the median of a list.
you just need one temp var.

Can you show some actual code?

(There is the median of 5 algorithm too).

Bye,
bearophile

Oct 12 '06 #12

Fredrik Lundh

Paul Rubin wrote:

>Ok, I'll bite. How do you compute the median of a list using just a single
temp var?

Well there's an obvious quadratic-time method...

that does it without modifying the list?

if you can modify the list, there are plenty of algorithms that does it
in expected O(n) or better, but I cannot think of a one that doesn't use
at least a few variables (e.g. two list indexes and a pivot).

but I haven't had enough coffee yet, so I'm probably missing something
simple here.

</F>

Oct 12 '06 #13

Paul Rubin

Fredrik Lundh <fr*****@pythonware.comwrites:

Ok, I'll bite. How do you compute the median of a list using just
a single temp var?
Well there's an obvious quadratic-time method...

that does it without modifying the list?

if you can modify the list, there are plenty of algorithms that does
it in expected O(n) or better, but I cannot think of a one that
doesn't use at least a few variables (e.g. two list indexes and a pivot).

Hmm, whoops, I didn't count the list index for the quadratic time
version (but that version shouldn't need to modify the list).

If you can modify the list, let's see, you can swap two elements with
no temp vars:

a[i] ^= a[i+1]
a[i+1] ^= a[i]
a[i] ^= a[i+1]

This assumes an indexed addressing mode so finding a[i+1] doesn't
require using a temp var to hold (i+1). Let's say the list length is
n, which is not a variable, and constant expressions like n-1 are also
not variables. I'm still envisioning some reasonable type of assembly
code. So now we can straightforwardly sort the list with one index
var and one flag bit:

flag = True
while flag:
flag = False
for i in 0..(n-2):
if a[i] a[i+1]:
swap a[i], a[i+1] as above
flag = True

and then pick the median out of the middle.

but I haven't had enough coffee yet, so I'm probably missing something
simple here.

Yeah, it's night here, maybe after I get some sleep I'll look for a
way to get rid of the flag bit above.

Oct 12 '06 #14

Ian McConnell

be************@lycos.com writes:

John Doty:
>Yes. The efficient exact algorithms for this problem use *partial*
sorts. The Forth one from the FSL is of this class (although I know of
two better ones for big arrays). But it's tough to beat the efficiency
of the approximate histogram-based method the Python stats module
implements.

The usual way to compute a true median with Python may be:

def median(inlist):
newlist = sorted(inlist)
index = len(newlist) // 2
if len(newlist) % 2:
return newlist[index]
else:
return (newlist[index] + newlist[index-1]) / 2.0

If you can use Psyco and your FITS lines are really long (well, maybe
too much, the treshold if about >~3000 in my PC) you can use something
like this instead the builtin timsort:
http://aspn.activestate.com/ASPN/Coo.../Recipe/466330
(To compute the median on a image line the median that uses sort is
probably better in most cases, expecially if you use the built in sort
of numerical libraries.)

sort() sorts all of the data, but you're only after one or two numbers, so
the MODFIND method may be faster for the median:

http://www.geocities.com/SiliconVall...323/algor.html

Ian

Oct 12 '06 #15

idknow

Paul McGuire wrote:

<id****@gmail.comwrote in message
news:11*********************@h48g2000cwc.googlegro ups.com...
[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Ok, I'll bite. How do you compute the median of a list using just a single
temp var?

-- Paul

hi Paul; well when this was a stats-class assignment (back when pascal
was popular :) i just stepped through the vector and compared it

(pseudo-code)

ptr p = [with values].

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

of course, pascal is more verbose but that's median()

Oct 12 '06 #16

Paul Rubin

"id****@gmail.com" <id****@gmail.comwrites:

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

I count two variables, p and x.

Oct 12 '06 #17

Paul Rubin

Paul Rubin <http://ph****@NOSPAM.invalidwrites:

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

I count two variables, p and x.

Also, that finds the maximum, not the median. I had stopped examining
it after seeing it used more than one variable.

Oct 12 '06 #18

Andrew Haley

In comp.lang.forth Paul Rubin <http://ph****@nospam.invalidwrote:

"id****@gmail.com" <id****@gmail.comwrites:
>fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

I count two variables, p and x.

Isn't this the maximum?

Andrew.

Oct 12 '06 #19

Fredrik Lundh

id****@gmail.com wrote:

hi Paul; well when this was a stats-class assignment (back when pascal
was popular :) i just stepped through the vector and compared it

(pseudo-code)

ptr p = [with values].

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

of course, pascal is more verbose but that's median()

that's a rather unusual definition of median, though.

</F>

Oct 12 '06 #20

Peter Decker

On 12 Oct 2006 04:40:32 -0700, Paul Rubin
<"http://phr.cx"@nospam.invalidwrote:

Paul Rubin <http://ph****@NOSPAM.invalidwrites:

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}
I count two variables, p and x.

Also, that finds the maximum, not the median. I had stopped examining
it after seeing it used more than one variable.

Um... isn't 'p' the list in question? There is only one variable used
for the calculation of the return value (yes, it's a maximum), and
that's 'x'.
--

# p.d.

Oct 12 '06 #21

Fredrik Lundh

Peter Decker wrote:

>Also, that finds the maximum, not the median. I had stopped examining
it after seeing it used more than one variable.

Um... isn't 'p' the list in question?

no, it's a pointer to the current item in the list.

</F>

Oct 12 '06 #22

bearophileHUGS

Ian McConnell wrote:

If you can use Psyco and your FITS lines are really long (well, maybe
too much, the treshold if about >~3000 in my PC) you can use something
like this instead the builtin timsort:
http://aspn.activestate.com/ASPN/Coo.../Recipe/466330
(To compute the median on a image line the median that uses sort is
probably better in most cases, expecially if you use the built in sort
of numerical libraries.)

sort() sorts all of the data, but you're only after one or two numbers, so
the MODFIND method may be faster for the median:

The modified quicksort I have shown in the cookbook (466330) is O(n)
too, and it modifies the list in place, so you can apply it twice for
lists of even len.

Bye,
bearophile

Oct 12 '06 #23

Paul McGuire

<id****@gmail.comwrote in message
news:11**********************@i42g2000cwa.googlegr oups.com...

Paul McGuire wrote:
><id****@gmail.comwrote in message
news:11*********************@h48g2000cwc.googlegr oups.com...
[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Ok, I'll bite. How do you compute the median of a list using just a
single
temp var?

-- Paul

hi Paul; well when this was a stats-class assignment (back when pascal
was popular :) i just stepped through the vector and compared it

(pseudo-code)

ptr p = [with values].

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

of course, pascal is more verbose but that's median()

No, that's the maximum. The median value is the value that is in the middle
of the list when the list is sorted. Many analyses prefer median to mean
(also known as "average") because the median is less sensitive to wild
outlier points.

My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)

-- Paul

Oct 12 '06 #24

Fredrik Lundh

Paul McGuire wrote:

My original question was in response to your post, that sort() wasn't required but only a temp
variable. I am very interested in seeing your solution that does not require the data to be
sorted. (This is not just an academic exercise - given a large historical data set, sorting the
data is one of the costliest parts of computing the median, and I would greatly appreciate seeing
an alternative algorithm.)

if you absolutely definitely cannot afford to modify or copy the input data set, but can
read the data sequentially multiple times reasonably fast, you can do what's basically a
binary search for the median, by counting how many values you have that's above or
below the current guess, and repeating until you find the right value. see e.g.

http://ndevilla.free.fr/median/median/src/torben.c

</F>

Oct 12 '06 #25

Paul Rubin

"Paul McGuire" <pt***@austin.rr._bogus_.comwrites:

My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)

There are well known algorithms for finding the median in expected
O(n) time, the most straightforward of which is a modified quicksort.
You do the traditional quicksort pivot step, figure out which
partition the median is in, and recurse on just that partition instead
of on both of them. Expected running time is n + n/2 + n/4 + ...
which is approx 2n.

Tarjan discovered a guaranteed O(n) algorithm in the 1970's(?) whose
operation is much different and quite complex. But all of these need
more than one temp var. See an algorithms book like CLRS or Knuth
for more info.

Oct 12 '06 #26

Paul Rubin

Paul Rubin <http://ph****@NOSPAM.invalidwrites:

Tarjan discovered a guaranteed O(n) algorithm in the 1970's(?) whose
operation is much different and quite complex. But all of these need
more than one temp var. See an algorithms book like CLRS or Knuth
for more info.

Ehh, make that Blum, Floyd, Pratt, Rivest, and Tarjan, and the main
different part is selecting the pivot, plus the complexity analysis.

http://www.cs.cmu.edu/afs/cs.cmu.edu...s/lect0907.pdf

http://en.wikipedia.org/wiki/Selection_algorithm (see "median of
medians algorithm")

Oct 12 '06 #27

Paul McGuire

"Fredrik Lundh" <fr*****@pythonware.comwrote in message
news:ma**************************************@pyth on.org...

Paul McGuire wrote:

>My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just
an academic exercise - given a large historical data set, sorting the
data is one of the costliest parts of computing the median, and I would
greatly appreciate seeing an alternative algorithm.)

if you absolutely definitely cannot afford to modify or copy the input
data set, but can
read the data sequentially multiple times reasonably fast, you can do
what's basically a
binary search for the median, by counting how many values you have that's
above or
below the current guess, and repeating until you find the right value.
see e.g.

http://ndevilla.free.fr/median/median/src/torben.c

</F>

Thanks!

-- Paul

Oct 12 '06 #28

Neil Cerutti

On 2006-10-12, Paul Rubin <httpwrote:

Tarjan discovered a guaranteed O(n) algorithm in the 1970's(?)

Huhn! I thought Tarjan was just the big bad evil guy in Bard's
Tale 2 who was creating eternal winter. I'm glad he also
contributed to our stock of *useful* algorithms.

--
Neil Cerutti
We will not have an all volunteer army. We *will* have an all
volunteer army. --George W. Bush

Oct 12 '06 #29

John Doty

Paul McGuire wrote:

<id****@gmail.comwrote in message
news:11**********************@i42g2000cwa.googlegr oups.com...
>Paul McGuire wrote:
>><id****@gmail.comwrote in message
news:11*********************@h48g2000cwc.googleg roups.com...
[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Ok, I'll bite. How do you compute the median of a list using just a
single
temp var?

-- Paul
hi Paul; well when this was a stats-class assignment (back when pascal
was popular :) i just stepped through the vector and compared it

(pseudo-code)

ptr p = [with values].

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

of course, pascal is more verbose but that's median()

No, that's the maximum. The median value is the value that is in the middle
of the list when the list is sorted. Many analyses prefer median to mean
(also known as "average") because the median is less sensitive to wild
outlier points.

My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)

Here's a K&R C function I wrote almost 20 years ago. It's a general
purpose quantile. The essential idea is to choose an element at random
(thus mostly avoiding perverse behavior with already sorted data) and
divide the set into two pieces around it. Then you figure out which
piece contains the quantile you want, and what quantile it is within
that piece, and recurse. When you see enough identical elements in the
piece you're processing, it's done. In the extreme case you'll get down
to one element.

ixrand(n) generates a random integer in the range 0..n-1. I think that's
the only nonstandard function used.

The style is torqued by what Unisoft C could and couldn't optimize: I
wouldn't write it quite like that today. One of my students pointed out
that all of the recursion is tail recursion so it should be easy to
flatten. Left as an exercise to the reader...

Somewhere, in Knuth I think, I saw a somewhat similar algorithm that
promised a little better performance by estimating the median from a
sample of the data, breaking the data up there, and then looking for a
quantile (statistically guaranteed to be) close to the min or max of one
of the subsets.

It shouldn't be hard to recode in Python, Forth, or whatever. That
wasn't my purpose in the exercise that started the thread though: I
wanted to see if I could import modules good enough to do the job from
public sources. In Python I could, and the entire image processing
program is 15 lines. In Forth I couldn't.

Anyway, here it is:

/* Find the nth from the minimum value in an array */
/* Monte Carlo method intended for finding medians */
/* 2/13/85 jpd */

/* For random data, this routine takes about */
/* 2.6*numdata + O( log( numdata )) comparisons */
/* If the data is tightly clustered about the mean, */
/* there is a speedup; it may take as few as
/* 0.5*numdata comparisons. */
/* There is a slight penalty if the array is completely */
/* or partially sorted; it is at most 25%. */

/* NTH will be nthi, nths, etc., depending on DATATYPE */

NTH( data, numdata, n )
DATATYPE data[]; /* Data array (will be scrambled on return) */
int numdata; /* lemgth of data array */
int n; /* index if item to find:
1 <= n <= numdata */
{
register DATATYPE boundary, thisdata;
register DATATYPE *lowp, *highp;
DATATYPE v1, v2;
int nlowbin;

lowp = data; /* Init data pointer */

v1 = data[ ixrand( numdata )];
{
register DATATYPE v1r = v1;
int nc = 1 + numdata - n; /* "Complement" of n */

if( nc n )
highp = lowp + nc;
else
highp = lowp + n; /* Limit to test for done */

/* Scan for the first point which
doesn't match the boundary point.
If we encounter enough
matching points,
the boundary is the answer */
while( *lowp++ == v1r ) {
if( lowp >= highp ) return( v1r );
}
v2 = *--lowp; /* Back up to get point */
}

boundary = ( v1 >1 ) + ( v2 >1 ); /* Beware overflows */

highp = data + numdata; /* Now process the whole thing */
thisdata = *lowp; /* Prime the pump */

if( v2 < v1 ) { /* Bin 2 is low bin */
for( ; lowp < highp; thisdata = *lowp ) {
if( thisdata <= boundary ) { /* Bin 2 */
*lowp = *--highp; /* Exchange */
*highp = thisdata;
}
else ++lowp; /* Data point in right place */
}

nlowbin = numdata - ( lowp - data );
if( nlowbin >= n ) return( NTH( highp, nlowbin, n ));
else return( NTH( data, lowp - data, n - nlowbin ));

}
else { /* Primary bin is low bin */
for( ; lowp < highp; thisdata = *lowp ) {
if( thisdata boundary ) { /* Bin 2 */
*lowp = *--highp; /* Exchange */
*highp = thisdata;
}
else ++lowp; /* Don't move point */
}

nlowbin = ( lowp - data );
if( nlowbin >= n ) return( NTH( data, nlowbin, n ));
else return( NTH( highp, numdata - nlowbin, n - nlowbin ));
}
}

--
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.

Oct 12 '06 #30

Neal Bridges

"John Doty" <jp*@whispertel.LoseTheH.netwrote in message
news:Wc******************************@wispertel.co m...
[snip]

Here's a K&R C function I wrote almost 20 years ago.

[code snipped]

John, 'man indent' right away!

--
Neal Bridges
http://quartus.net Home of Quartus Forth for the Palm OS!

Oct 12 '06 #31

Neil Cerutti

On 2006-10-12, Neal Bridges <nb******@interlog.comwrote:

"John Doty" <jp*@whispertel.LoseTheH.netwrote in message
news:Wc******************************@wispertel.co m...
[snip]
>Here's a K&R C function I wrote almost 20 years ago.
[code snipped]

John, 'man indent' right away!

As far as I know, he just forgot to strip out the tab characters
before pasting and posting.

--
Neil Cerutti
We will not have an all volunteer army. We *will* have an all
volunteer army. --George W. Bush

Oct 12 '06 #32

Juho Schultz

John Doty wrote:

The problem:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

This may be personal bias... I have spent a few years with FITS files
so every time I see 'FITS' I think 'astronomy' - maybe you are doing
something else. (Someone wrote he does not know what FITS is - see
fits.gsfc.nasa.gov In a nutshell: FITS file is a list of header-data
units. Each header is text containing optional keyword-value pairs and
reading instructions for the data unit. Usually astronomical image data
but it can be anything.)

Sounds like subtracting sky background from the frames, though for some
reason (speed?) column-wise. You could have a look at either PyRAF
(www.stsci.edu/resources/software_hardware/pyraf) or PyMidas
(www.eso.org/sampo/pymidas/). Running some semi-official
sky-subtraction algorithm would at least give you a good case to test
against.

You could also check if ftools already does this. I have not used it
much, but sometimes it saves huge amounts of coding time.
heasarc.gsfc.nasa.gov/docs/software/ftools/ftools_menu.html

This is a real problem I need to solve, not a made-up toy problem. I was
originally thinking of solving it in C (I know where to get the pieces
in that language)

Could you tell a bit more of the data? Are you aiming for accuracy,
speed or getting as complete time series as possible? (for me, speed
has never been an issue) Photometry/astrometry/something else? Is there
some trade-off like "get best possible result in X seconds"?

--
Juho Schultz

Oct 12 '06 #33

John Doty

Juho Schultz wrote:

John Doty wrote:

>The problem:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

This may be personal bias... I have spent a few years with FITS files
so every time I see 'FITS' I think 'astronomy' - maybe you are doing
something else. (Someone wrote he does not know what FITS is - see
fits.gsfc.nasa.gov In a nutshell: FITS file is a list of header-data
units. Each header is text containing optional keyword-value pairs and
reading instructions for the data unit. Usually astronomical image data
but it can be anything.)

Sounds like subtracting sky background from the frames, though for some
reason (speed?) column-wise. You could have a look at either PyRAF
(www.stsci.edu/resources/software_hardware/pyraf) or PyMidas
(www.eso.org/sampo/pymidas/). Running some semi-official
sky-subtraction algorithm would at least give you a good case to test
against.

You could also check if ftools already does this. I have not used it
much, but sometimes it saves huge amounts of coding time.
heasarc.gsfc.nasa.gov/docs/software/ftools/ftools_menu.html

>This is a real problem I need to solve, not a made-up toy problem. I was
originally thinking of solving it in C (I know where to get the pieces
in that language)

Could you tell a bit more of the data? Are you aiming for accuracy,
speed or getting as complete time series as possible? (for me, speed
has never been an issue) Photometry/astrometry/something else? Is there
some trade-off like "get best possible result in X seconds"?

Yeah, I'm fairly familiar with that stuff, I was/am an investigator on
ASCA, Chandra, HETE-2, Suzaku. Didn't know of anything quite right for
this job. With the right imports it was only 15 lines of Python, hard to
beat that.

It isn't my data to release, it's MKI's. Development project. And they
have plenty of folks who know IRAF, ftools, etc. They didn't know how to
do quite what they wanted here.

--
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.

Oct 12 '06 #34

idknow

Paul Rubin wrote:

Paul Rubin <http://ph****@NOSPAM.invalidwrites:

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}
I count two variables, p and x.

Also, that finds the maximum, not the median. I had stopped examining
it after seeing it used more than one variable.

oops, i apologise, the should be <
that's just a typo.

As for your quoting, you cut off the global var line and miscounted; p
is a global or from the caller of median()

also, i plead `too early in the morning to code properly' :)

clearly, i've forgotten the definition of the median of a list.
to that i plead faulty memory.

my apologises for coding at a bad hour.

Oct 12 '06 #35

Gabriel Genellina

At Thursday 12/10/2006 17:44, id****@gmail.com wrote:

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

clearly, i've forgotten the definition of the median of a list.
to that i plead faulty memory.

That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I
can do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the
list when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3
--
Gabriel Genellina
Softlab SRL

__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Oct 13 '06 #36

Paul Rubin

Gabriel Genellina <ga******@yahoo.com.arwrites:

That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I can
do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the list
when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3

How is this? Note that m and n are treated as constants, so
expressions like n-1, n/2 etc. are also constants. Also assume
m[i+1] is evaluated as (m+1)[i] and of course m+1 is constant.

================================================== ==============

#include <stdio.h>

int
median (int m[], int n)
{
int i;
while (1) {
for (i = 0; i < n-1; i++) {
if (m[i] m[i+1]) {
/* swap m[i] and m[i+1] with no temp var */
m[i] ^= m[i+1];
m[i+1] ^= m[i];
m[i] ^= m[i+1];
goto yes;
}
}
break;
yes:
for (i = 0; i < n-1; i++) {
if (m[i] m[i+1]) {
m[i] ^= m[i+1];
m[i+1] ^= m[i];
m[i] ^= m[i+1];
}
}
}
return m[n / 2];
}

int a[] = {9,6,1,5,4,2,8,3,7};

main()
{
int m;
m = median(a, 9);
printf ("%d\n", m);
}

Oct 13 '06 #37

Gabriel Genellina

At Thursday 12/10/2006 21:54, Paul Rubin wrote:

>Gabriel Genellina <ga******@yahoo.com.arwrites:
That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I can
do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the list
when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3

How is this? Note that m and n are treated as constants, so
expressions like n-1, n/2 etc. are also constants. Also assume
m[i+1] is evaluated as (m+1)[i] and of course m+1 is constant.

median (int m[], int n)
{
while (1) {
for (i = 0; i < n-1; i++) {
}
return m[n / 2];

Oh, yes, almost no auxiliary storage needed! You are sorting the data
first, then getting the middle element.
So if you are really really really concerned about variable storage
(maybe an embedded application?) this may work, at the expense of
speed: this sort appears to be O(n2), quicksort would do in O(nlogn),
and a customised version -at each stage, continue sorting only the
branch containing the median- can work in *expected* linear time.
Of course other set of constraints would dictate another approach,
maybe minimizing the number of array accesses by example.
--
Gabriel Genellina
Softlab SRL

__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Oct 13 '06 #38

Fredrik Lundh

Gabriel Genellina wrote:

At Thursday 12/10/2006 17:44, id****@gmail.com wrote:

fun median {
var x = 0.
while( *p++) {
if( (*p) x) x = *p.
}
return x.
}

clearly, i've forgotten the definition of the median of a list.
to that i plead faulty memory.

That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I can
do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the list
when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3

note that both his proposals does indeed find the median if given a list
of zeros, though. maybe they just used lousy sample data in that stat
class he took?

</F>

Oct 13 '06 #39

Paddy

werty wrote:

Apples/oranges ? programmers are making very little $$ today .
Thats software ! No one is makin money on obsolete Forth ,
so why a comparisom ?

Ultimately the best OpSys will be free and millions of lines of code
obsoleted . Because no one can protect intellectual property , its
simply too costly for a Government to do this .

Notice the crypto fiasco , the USA govt forbad export of SSL and
in short order Australia gave the world a free SSLeasy !

This is software . No one will , for long , own software .
Microsoft and Linux will die in 24 months . No hardware will
produce money/profits using them .

Because we have a new hardware competitor , that obsoletes the BIOS
chip , North bridge/South bridge USB helpers , NICs .
No mo PNP (Plug_and_Pray) .
In 2 chips ( CPU /SoC and LCD controller) it will be faster than
a PC . 100's have allready written HDD drivers and haven't yet
free'd them . But when others give free , what good do it to
hold yours ! You look stupid selling what others improve and
free . Trying to sell what others find easy to create !
Intel made hardware too hard to program , ARM is easy .
Python wont last on ARM , for WCE and Linux will be tossed .
A simpler structured method will be used to customise
Browsers . There will be 2 columns , one on left will be main/orig
but on the Right will have hyperlink result . This way ya dont have
to go back to compare ! side by side .
Text editors will also work this way . You will read orig in left
columns
and 'edit'/annotate your stuff to the Right columns .

Sun microsystems et all ( the ones we seldom think about ) will
all be obsoleted , bankrupted , their hardware surplused .
No more h/w servers .
There will be no high speed 64 bit boxes in the future .
The Pocket PC will do work you could not imagine !
All will have 100 GB HDD , VGA LCD , USBH/USBD , WIFI N
and no WERTY keyboard ! full GUI ...
ethernet and firewire and rs232 will die

Why not "see" the future ?
No Linux ,no WXP , no C+ , no PC .....

Hah! This is the output from a program and I claim my prize.
:-)

Oct 14 '06 #40

jacko

Paddy wrote:

werty wrote:
Apples/oranges ? programmers are making very little $$ today .
Thats software ! No one is makin money on obsolete Forth ,
so why a comparisom ?

Ultimately the best OpSys will be free and millions of lines of code
obsoleted . Because no one can protect intellectual property , its
simply too costly for a Government to do this .

Notice the crypto fiasco , the USA govt forbad export of SSL and
in short order Australia gave the world a free SSLeasy !

This is software . No one will , for long , own software .
Microsoft and Linux will die in 24 months . No hardware will
produce money/profits using them .

Because we have a new hardware competitor , that obsoletes the BIOS
chip , North bridge/South bridge USB helpers , NICs .
No mo PNP (Plug_and_Pray) .
In 2 chips ( CPU /SoC and LCD controller) it will be faster than
a PC . 100's have allready written HDD drivers and haven't yet
free'd them . But when others give free , what good do it to
hold yours ! You look stupid selling what others improve and
free . Trying to sell what others find easy to create !
Intel made hardware too hard to program , ARM is easy .
Python wont last on ARM , for WCE and Linux will be tossed .
A simpler structured method will be used to customise
Browsers . There will be 2 columns , one on left will be main/orig
but on the Right will have hyperlink result . This way ya dont have
to go back to compare ! side by side .
Text editors will also work this way . You will read orig in left
columns
and 'edit'/annotate your stuff to the Right columns .

Sun microsystems et all ( the ones we seldom think about ) will
all be obsoleted , bankrupted , their hardware surplused .
No more h/w servers .
There will be no high speed 64 bit boxes in the future .
The Pocket PC will do work you could not imagine !
All will have 100 GB HDD , VGA LCD , USBH/USBD , WIFI N
and no WERTY keyboard ! full GUI ...
ethernet and firewire and rs232 will die

Why not "see" the future ?
No Linux ,no WXP , no C+ , no PC .....

Hah! This is the output from a program and I claim my prize.
:-)

werty bot has to claim the prize to prove intelligence challange ;-)

Oct 14 '06 #41

proteusguy

John you nailed it. I was a big forth fan in the mid-80s but it was
very clear that you either had to spend a lot of money on proprietary
systems or do it ALL yourself. Not having any money I was pleased to be
able to do it all but today, in the age of instant communication and
collaboration, its not a competitive option in any language. Had forth
kicked off the open source community about a decade earlier than the
UNIX folk (in useful terms) I think we'd be living in a much different
world from a computing perspective. Forth is still cool - but only when
its for something I wanna do all by myself... :)

-- Ben

John Doty wrote:

I realized that I have a little job on the table that is a fine test of
the Python versus Standard Forth code availability and reusability issue.

Note that I have little experience with either Python or Standard Forth
(but I have much experience with a very nonstandard Forth). I've noodled
around a bit with both gforth and Python, but I've never done a serious
application in either. In my heart, I'm more of a Forth fan: Python is a
bit too much of a black box for my taste. But in the end, I have work to
get done.

The problem:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

This is a real problem I need to solve, not a made-up toy problem. I was
originally thinking of solving it in C (I know where to get the pieces
in that language), but it seemed like a good test problem for the Python
versus Forth issue.

I looked to import FITS reading/writing, array manipulation, and median
determination. From there, the solution should be pretty easy.

So, first look for a median function in Python. A little googling finds:

http://www.astro.cornell.edu/staff/loredo/statpy/

Wow! This is good stuff! An embarrassment of riches here! There are even
several FITS modules, and I wasn't even looking for those yet. And just
for further gratification, the page's author is an old student of mine
(but I'll try not to let this influence my attitude). So, I followed the
link to:

http://www.nmr.mgh.harvard.edu/Neura...ry/python.html

From there, I downloaded stats.py, and the two other modules the page
says it requires, and installed them in my site-packages directory. Then
"from stats import median" got me a function to approximately determine
the median of a list. It just worked. The approximation is good enough
for my purposes here.

Pyfits required a little more resourcefulness, in part because STSCI's
ftp server was down yesterday, but I got it installed too. It helps that
when something is missing, the error message gives you a module name. It
needs the numarray module, so I got array manipulation as a side effect.

I haven't finished the program, but I've tried out the pieces and all
looks well here.

OK, now for Forth. Googling for "forth dup swap median" easily found:

http://www.taygeta.com/fsl/library/find.seq

At first blush, this looked really good for Forth. The search zeroed in
on just what I wanted, no extras. The algorithm is better than the one
in the Python stats module: it gives exact results, so there's no need
to check that an approximation is good enough. But then, the
disappointment came.

What do you do with this file? It documents the words it depends on, but
not where to get them. I'm looking at a significant effort to assemble
the pieces here, an effort I didn't suffer through with Python. So, my
first question was: "Is it worth it?".

The answer came from searching for FITS support in Forth. If it exists
in public, it must be really well hidden. That's a "show stopper", so
there was no point in pursuing the Forth approach further.

In the end, it was like comparing a muzzle-loading sharpshooter's rifle
with a machine gun: Forth got off one really good shot, but Python just
mowed the problems down.

The advocates of the idea that Standard Forth has been a boon to code
reusability seem mostly to be people with large private libraries of
Forth legacy code. No doubt to them it really has been a boon. But I
think this little experiment shows that for the rest of us, Python has a
published base of reusable code that puts Forth to shame.

--
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.

Oct 15 '06 #42

Elizabeth D Rather

pr********@gmail.com wrote:

John you nailed it. I was a big forth fan in the mid-80s but it was
very clear that you either had to spend a lot of money on proprietary
systems or do it ALL yourself. Not having any money I was pleased to be
able to do it all but today, in the age of instant communication and
collaboration, its not a competitive option in any language. Had forth
kicked off the open source community about a decade earlier than the
UNIX folk (in useful terms) I think we'd be living in a much different
world from a computing perspective. Forth is still cool - but only when
its for something I wanna do all by myself... :)

Fortunately, modern Forths have quite a lot more functionality than
mid-80's Forths, in terms of application-relevant features, programming
aids, more intelligent compilers, OS (e.g. Windows, Unix) compatibility
and access, etc. You should really have another look!

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Oct 16 '06 #43

Standard Forth versus Python: a case study

Similar topics