469,898 Members | 1,574 Online

# Generating an Index

Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.

I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.

I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:

ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111

The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:

The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6

However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.

I have looked up functions such as "Levenshtein - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.

I hope I have been clear in my explanation.

I would really appreciate any comments or discussion that could help me
achieve this.

MJ

Nov 13 '05 #1
7 1760
1. Convert A and B to a binary.
2. A Xor B, then add the number of ones in the result.

--
Darryl Kerkeslager

Power corrupts.
Absolute power corrupts absolutely.
Knowledge is power.
"Mary" <ma************@hotmail.com> wrote in message
Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.

I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.

I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:

ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111

The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:

The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6

However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.

I have looked up functions such as "Levenshtein - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.

I hope I have been clear in my explanation.

I would really appreciate any comments or discussion that could help me
achieve this.

MJ

Nov 13 '05 #2
On 7 Sep 2005 19:01:55 -0700, "Mary" <ma************@hotmail.com>
wrote:

Interesting problem. I think your quest is not logical, but to be sure
I want to know:
* How would you sort your example data, and why that way?
* Is there more than one good answer, or is there only one best

My first reaction was to turn the data into GUIDs, but they are only
128 bits. Your 255 bit "value" is huge. What do these 1 and 0 values
represent?

-Tom.

Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.

I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.

I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:

ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111

The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:

The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6

However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.

I have looked up functions such as "Levenshtein - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.

I hope I have been clear in my explanation.

I would really appreciate any comments or discussion that could help me
achieve this.

MJ

Nov 13 '05 #3
Hi Tom,

I put forward limited information in an attempt to not confuse anyone
as to what I want to do with the data.

The "Data" column in the table is not a number as such, but I guess it
could be treated as one. It's simply a text field that happens to have
1's and 0's in it...it could have been A's and B's for example.

The "Data" column holds response time information...let me explain. An
application asks 255 questions and records how long it takes the user
to answer each question. If the response time is below a certain time
limit, a "1" is recorded, otherwise a "0". The response time limit on
each question varies.
This is why a "1" or "0" is incidental and it could easily have been
"A" or "B".

I'm trying to find those "users" that answered all of the questions in
a similar time frame...hence why in the example data record 3 and
record 10 (or 100,000) are more similar than record 3 and record 4.

1. The data is not currently sorted in any order (other than the ID
which is actually a UserID)
2. I'm sure there is more than one answer to this...read on...

Currently, to find all of these "matches" or "similarities", I have to
take each record and compare it to every other record in the database.
As you could imagine, with 100,000 records this can take forever. All
I'm hoping to do is reduce this down somewhat.
Even if I have to compare each record with the next 50 or so in order
I'd be happy....just something to reduce the number of compares I have
to do...hence the "sort by similarity" topic.

Given that I could turn the "Data" into chunks and therefore useable
numbers, could your GUID comment be of use? Sorry, I'm not familiar
with GUIDs, so I'm not sure where you were going with that. I know how
to read the manual though, so I'm prepared to do research in this area
if you think it would help.

MJ

Nov 13 '05 #4
On 7 Sep 2005 20:18:35 -0700, "Mary" <ma************@hotmail.com>
wrote:

Hmmm, let's make sure we know what you are trying to accomplish. Which
one is it:
* Sort the data such that people who scored "1" a similar number of
times are grouped together. E.g. these 3 have an identical score and
would be grouped together:
111111000000
111110000001
101010101010

* Sort the data such that people who scored "1" a similar number of
times WITH A SIMILAR PATTERN are grouped together
e.g. these two:
111111000000
111110000001
would be more similar than
111111000000
101010101010
A sub-variant here takes proximity into account, such that:
111111000000
111111100000
would be more similar than
111111000000
111111000001

* Other?

A GUID is a Globally Unique IDentifier. It's a big number that you can
create yourself (by calling a Windows API) and as the name suggests
the returned value is pretty much guaranteed to be unique. You
typically see them represented like this:
{00000514-0000-0010-8000-00AA006D2EA4}
but that is just the human-readable format for a 128-bit value.
At this time I'm not sure GUIDs will be of any help.

-Tom.
Hi Tom,

I put forward limited information in an attempt to not confuse anyone
as to what I want to do with the data.

The "Data" column in the table is not a number as such, but I guess it
could be treated as one. It's simply a text field that happens to have
1's and 0's in it...it could have been A's and B's for example.

The "Data" column holds response time information...let me explain. An
application asks 255 questions and records how long it takes the user
to answer each question. If the response time is below a certain time
limit, a "1" is recorded, otherwise a "0". The response time limit on
each question varies.
This is why a "1" or "0" is incidental and it could easily have been
"A" or "B".

I'm trying to find those "users" that answered all of the questions in
a similar time frame...hence why in the example data record 3 and
record 10 (or 100,000) are more similar than record 3 and record 4.

1. The data is not currently sorted in any order (other than the ID
which is actually a UserID)
2. I'm sure there is more than one answer to this...read on...

Currently, to find all of these "matches" or "similarities", I have to
take each record and compare it to every other record in the database.
As you could imagine, with 100,000 records this can take forever. All
I'm hoping to do is reduce this down somewhat.
Even if I have to compare each record with the next 50 or so in order
I'd be happy....just something to reduce the number of compares I have
to do...hence the "sort by similarity" topic.

Given that I could turn the "Data" into chunks and therefore useable
numbers, could your GUID comment be of use? Sorry, I'm not familiar
with GUIDs, so I'm not sure where you were going with that. I know how
to read the manual though, so I'm prepared to do research in this area
if you think it would help.

MJ

Nov 13 '05 #5
Hi Tom,

Yes, you are correct to a certain point.

I would consider this a reasonable match:
111111000000
111110000001

ie the difference is 2. (remembering that in reality this Data column
is 255 chars long)

This would not be considered a match:
101010101010
010101010101

Although they have the same amount of 1's, every position is
different...therefore 100% different and not a match.
If it was a case of how many 1's, I'd simply have a third column that
was a total number of 1's for each person and sort by that.

What I'm trying to do is produce a report (yes it will be big) showing
each user, and for each of those users all the other users that matched
(say) 95% the same...so that's people where the "Data" is the same in
95% of the positions.

111111000000
111111100000
would be more similar than
111111000000
111111000001

I would consider these to have the same match value...they both differ
by only one position.

I have seen some posts where people what to compare two strings to find
a "distance"...eg:
"My Name is Mary" and
"My Game is Card"

The "distance" here is 3.

I hope you can understand what I'm trying to do, and why this has
become more of a drama than I first thought it would be.
My initial thoughts were of finding a way to generate a "Weighting
factor" or something similar and sort by that.

Like I said earlier, I'm not expecting all records to be right next to
each other, just (quite) a bit closer so that I don't have to compare
every record.

MJ

Nov 13 '05 #6
On 7 Sep 2005 21:28:11 -0700, "Mary" <ma************@hotmail.com>
wrote:

Hi Mary,
Thanks for clarifying. I think I understand what you want to do.
Coming back to my original post, I do think your question is not
logical. Perhaps that's too strong a statement (I'm not a native
speaker).
The reason I say this, is that the Difference calculation depends on
both values, not an arbitrary standard value such as:
000000000000
or
111111111111

You're saying: I'll give you an unknown value X, and I want to sort
the data based on the Difference to that value. I think it's obvious
the result cannot be pre-calculated because it depends on the value of
X. Theoretically you could do it if you had not just 1 sortcolumn, but
2^255 columns, one for each possible value of X.

I'm thinking of a radically different approach: the Ratcliff/Obershelp
Pattern Recognition Algorithm. I implemented this for situations where
we wanted to find CompanyNames that were very similar (expressed as a
similarity percentage) to a given CompanyName. The list has almost
10,000 elements, and executes in mere seconds (written in C and
compiled into a classic DLL). With 100,000 rows, you should be able to
find the similar values in less than 30 seconds. I've never used the
algorithm for situations like yours, but for CompanyName and
PersonName it works quite well.

Regards,

-Tom.

Hi Tom,

Yes, you are correct to a certain point.

I would consider this a reasonable match:
111111000000
111110000001

ie the difference is 2. (remembering that in reality this Data column
is 255 chars long)

This would not be considered a match:
101010101010
010101010101

Although they have the same amount of 1's, every position is
different...therefore 100% different and not a match.
If it was a case of how many 1's, I'd simply have a third column that
was a total number of 1's for each person and sort by that.

What I'm trying to do is produce a report (yes it will be big) showing
each user, and for each of those users all the other users that matched
(say) 95% the same...so that's people where the "Data" is the same in
95% of the positions.

111111000000
111111100000
would be more similar than
111111000000
111111000001

I would consider these to have the same match value...they both differ
by only one position.

I have seen some posts where people what to compare two strings to find
a "distance"...eg:
"My Name is Mary" and
"My Game is Card"

The "distance" here is 3.

I hope you can understand what I'm trying to do, and why this has
become more of a drama than I first thought it would be.
My initial thoughts were of finding a way to generate a "Weighting
factor" or something similar and sort by that.

Like I said earlier, I'm not expecting all records to be right next to
each other, just (quite) a bit closer so that I don't have to compare
every record.

MJ

Nov 13 '05 #7
Hi Tom,
You're saying: I'll give you an unknown value X, and I want to sort
the data based on the Difference to that value. I think it's obvious
the result cannot be pre-calculated because it depends on the value of
X. Theoretically you could do it if you had not just 1 sortcolumn, but
2^255 columns, one for each possible value of X.

Yes, you are correct. The main problem here is that I need to do
"binary" compares...this is why I was looking for some sort of
"weighting" algorithm that I could use as I added records to the
database. I have looked at "Hamming Distance" and "Levenshtein - Edit
String" algorithms. These work fine when you have the two pieces of
data to compare...but not when you already have 100,000 records.

It's been suggested that I use one of the two algorithms and compare
each record as I add it to the database with a "reference" data
field...maybe "0101010101...."
So I'll try that over the next day or so and see if that helps.

posted. It has helped having someone else offer suggestions as it has
enabled me to think outside of the immediate range of possible
solutions.

Thanks again,

MJ

Nov 13 '05 #8

### This discussion thread is closed

Replies have been disabled for this discussion.