I haven't implemented this on a large scale, but I'll pass on what I
think is a good approach. I suggested it in a different context in the
thread found here:

http://groups.google.com/groups/sear...exemplar+skass

Variations on this idea are used in GIS and other such systems.

Note that my comments aren't specific to your exact question, but they
should apply reasonably well.

Very briefly, the idea is to maintain a set of points *not* in your set
that are evenly scattered through space - the lattice points of a coarse
grid. These will serve as neighborhood markers, and with each point, you
will store its NeighborhoodID. You can think of these as neighborhood
post offices, for example. The coarseness of the grid depends on what
kinds of questions you ask, but typically the size of the neighborhoods
reflects the meaning of "close".

Points can only be close to each other if they live in neighborhoods
that are close to each other - equivalently, if their neighborhood post
offices are close to each other.

For example, if your 3D coordinates each range from 0 to 1000, you might
slice up space every 100 units and use these as lattice points:
(i*100+50, j*100+50, k*100+50) for 0 <= i,j,k < 10. So you have 1000
lattice points in this case. No point in space is more than
50*SQRT(3) ~ 86.6 units from one of these 1000 points, so this will work
if the "closeness" you need is on this order ("close" should mean same
or adjacent neighborhood).

No matter how many points you have, you have only 1K neighborhoods. If
you have 10M points, that's 10K points per neighborhood on average.
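As a minimal sketch of the bookkeeping (in Python; the function and
parameter names are my own invention, not anything from a particular
product), assigning a NeighborhoodID on this 10x10x10 lattice reduces to
integer division by the cell size:

```python
def neighborhood_id(x, y, z, cell=100, per_axis=10):
    """Map a point to the lattice cell (post office) it belongs to.

    Cell (i, j, k) is centered at (i*cell+50, j*cell+50, k*cell+50);
    IDs 0..999 enumerate the 10x10x10 grid.
    """
    i = min(int(x // cell), per_axis - 1)  # clamp points on the far boundary
    j = min(int(y // cell), per_axis - 1)
    k = min(int(z // cell), per_axis - 1)
    return (i * per_axis + j) * per_axis + k

# A point at (120, 450, 999) lives in cell (1, 4, 9):
# (1*10 + 4)*10 + 9 == 149
print(neighborhood_id(120, 450, 999))
```

You would compute this once per point (for instance in a trigger or as a
computed column) rather than at query time.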

Now the part that reduces the cost of querying: for many queries, you
can first search or look up pairs of post offices, not points, that meet
your requirement of close. (This could be "adjacent pairs" in most
cases, and post office adjacencies could be maintained permanently.)
There is one Mega-pair of post offices in this example, but 100
Tera-pairs of points (at a nanosecond per pair, that's 1 millisecond vs.
about 1 day).

Even if you still have to look at all pairs of points in adjacent
neighborhoods, you have made progress, but for many problems, you might
be able to narrow down the search further.
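Here is a small self-contained sketch of that two-stage join, using
SQLite from Python. The table and column names (Points, Adjacent,
NbhdID) are hypothetical; the idea is that the join touches only point
pairs whose neighborhoods appear in the precomputed adjacency table,
and the exact distance test runs only on those:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
# Hypothetical schema: each point carries its precomputed NeighborhoodID;
# Adjacent lists pairs of neighborhoods close enough to matter (each
# neighborhood paired with itself, plus its true neighbors in practice).
c.execute("CREATE TABLE Points (ID INTEGER PRIMARY KEY, "
          "X REAL, Y REAL, Z REAL, NbhdID INTEGER)")
c.execute("CREATE TABLE Adjacent (NbhdA INTEGER, NbhdB INTEGER)")
c.executemany("INSERT INTO Points VALUES (?,?,?,?,?)",
              [(1, 10, 10, 10, 0), (2, 11, 10, 10, 0),
               (3, 900, 900, 900, 999)])
c.executemany("INSERT INTO Adjacent VALUES (?,?)",
              [(0, 0), (999, 999)])
# Only point pairs whose neighborhoods appear in Adjacent are compared:
rows = c.execute("""
    SELECT a.ID, b.ID
    FROM Points a
    JOIN Adjacent adj ON adj.NbhdA = a.NbhdID
    JOIN Points b ON b.NbhdID = adj.NbhdB AND a.ID < b.ID
    WHERE ABS(a.X-b.X) < 2 AND ABS(a.Y-b.Y) < 2 AND ABS(a.Z-b.Z) < 2
""").fetchall()
print(rows)  # points 1 and 2 share a neighborhood and are within range
```

With indexes on NbhdID and on the Adjacent pair, the expensive distance
predicate never sees the vast majority of point pairs.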

You might maintain various values in your tables: for points, the
distance to the neighborhood post office and the post office ID, or the
same for all adjacent neighborhood post offices. For post offices, you
might maintain the number of points in the neighborhood. You might use a
fixed rectangular lattice with a canonical way of enumerating neighbors,
so you can pack things into arrays, or you might allow the lattice to be
dynamic, depending on where the points are. You might maintain all
inter-postoffice distances or just nearby ones. You might use triggers
heavily if there are many updates and inserts, but if the data is
relatively static, you might maintain larger, well-indexed collections
of relevant calculations. Depending on how real-time the system is, you
might treat parts of it like a warehousing problem. And on and on (there
are many books on subjects like "spatial databases" and geodatabases - I
haven't read whole books on it).

This is only a sketch, and it is not trivial (but not impossible) to
implement, but maybe it will help. The alternative is probably to look
for GIS-type third-party products that can connect to SQL data. I'm sure
there are some out there, but they may not be cheap.

The shorter answer is that there is probably no single fast query to do
most things like this if none of this framework is in place. At best,
maybe you can generate a crude lattice on the fly with derived tables
from a Numbers table - but that is basically putting some of this
framework in place.
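For what the on-the-fly version might look like: a small Numbers table
cross-joined with itself three times yields the 1000 lattice points as a
derived table. A SQLite sketch (the table name Numbers and the column n
are my own choices):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
# A Numbers table holding 0..9; three cross-joined copies enumerate
# every (i, j, k) cell, and the arithmetic recovers the cell centers.
c.execute("CREATE TABLE Numbers (n INTEGER PRIMARY KEY)")
c.executemany("INSERT INTO Numbers VALUES (?)", [(n,) for n in range(10)])
lattice = c.execute("""
    SELECT i.n*100+50 AS X, j.n*100+50 AS Y, k.n*100+50 AS Z
    FROM Numbers i CROSS JOIN Numbers j CROSS JOIN Numbers k
""").fetchall()
print(len(lattice))  # 1000 lattice points, (50,50,50) through (950,950,950)
```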

Steve Kass

Drew University

be*******@gmail.com wrote:

Hi all

I have a large data set of points situated in 3d space. I have a simple
primary key and an x, y and z value.

What I would like is an efficient method for finding the group of
points within a threshold.

So far I have tested the following; however, it is very slow.

---------------

select *
from locations a inner join locations b
on a.ID < b.ID
and abs(a.X - b.X) < 2
and abs(a.Y - b.Y) < 2
and abs(a.Z - b.Z) < 2

---------------

If anyone knows of a more efficient method to arrive at this result, it
would be most appreciated.

Thanks in advance

Bevan