473,699 Members | 2,834 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Generating an Index

Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.

I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.

I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:

ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111

The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:

The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6

However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.

I have looked up functions such as "Levenshtei n - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.

I hope I have been clear in my explanation.

I would really appreciate any comments or discussion that could help me
achieve this.

Thanks for your time,

MJ

Nov 13 '05 #1
7 1978
1. Convert A and B to a binary.
2. A Xor B, then add the number of ones in the result.

--
Darryl Kerkeslager

Power corrupts.
Absolute power corrupts absolutely.
Knowledge is power.
See www.adcritic.com/interactive/view.php?id=5927
"Mary" <ma************ @hotmail.com> wrote in message
news:11******** **************@ o13g2000cwo.goo glegroups.com.. .
Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.

I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.

I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:

ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111

The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:

The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6

However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.

I have looked up functions such as "Levenshtei n - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.

I hope I have been clear in my explanation.

I would really appreciate any comments or discussion that could help me
achieve this.

Thanks for your time,

MJ

Nov 13 '05 #2
On 7 Sep 2005 19:01:55 -0700, "Mary" <ma************ @hotmail.com>
wrote:

Interesting problem. I think your quest is not logical, but to be sure
I want to know:
* How would you sort your example data, and why that way?
* Is there more than one good answer, or is there only one best
answer?

My first reaction was to turn the data into GUIDs, but they are only
128 bits. Your 255 bit "value" is huge. What do these 1 and 0 values
represent?

-Tom.

Hi, I need some assistance with a query. To be honest, I'm not even
sure it can be done.

I'll try to keep the information limited to only what's relevant to
what I have and what I am trying to achieve with this query.

I have a table that contains around 100,000 records. For the sake of
this discussion, assume just two columns:

ID Data
1 000000000000
2 010000000000
3 011111111111
4 101100011101
5 110100011011
6 111100000000
7 111100000111
8 111100010111
9 111100011011
10 111111111111

The Data column contains only 1's and 0's.
The Data column is a text column, not numeric.
The Data column is actually 255 chars long. (I limited it above to 12
for this example only)
Duplicates on the Data column are allowed and do exist.
With 100,000 records, you would note that in the above example record
10 would actually be record 100,000.
My aim is to somehow sort the data (by creating a third column) so that
the records are in order of "string distance" or similar. In other
words, so that similar strings are located next to (or as close as
possible to) each other.
For example, taking the data above:

The "distance" between record 8 and record 9 is 2 (ie Only two
positions in the Data are not the same)
The "distance" between record 3 and record 4 is 6

However, the "distance" between record 3 and record 10 is only 1,
but sorted in the normal fashion, there would be some 99,990 records
between them.
I was hoping that as I add records to this table I could calculate a
number or a code or something to create a third column that could be
indexed. Accessing records using this index would give me the records
in order of "String Distance" or similar.

I have looked up functions such as "Levenshtei n - Edit String
Distance", which is fine when I have to strings to compare. I can't
see it helping in generating an index though.

I hope I have been clear in my explanation.

I would really appreciate any comments or discussion that could help me
achieve this.

Thanks for your time,

MJ


Nov 13 '05 #3
Hi Tom,

I put forward limited information in an attempt to not confuse anyone
as to what I want to do with the data.

The "Data" column in the table is not a number as such, but I guess it
could be treated as one. It's simply a text field that happens to have
1's and 0's in it...it could have been A's and B's for example.

The "Data" column holds response time information...l et me explain. An
application asks 255 questions and records how long it takes the user
to answer each question. If the response time is below a certain time
limit, a "1" is recorded, otherwise a "0". The response time limit on
each question varies.
This is why a "1" or "0" is incidental and it could easily have been
"A" or "B".

I'm trying to find those "users" that answered all of the questions in
a similar time frame...hence why in the example data record 3 and
record 10 (or 100,000) are more similar than record 3 and record 4.

To answer your questions you posted:

1. The data is not currently sorted in any order (other than the ID
which is actually a UserID)
2. I'm sure there is more than one answer to this...read on...

Currently, to find all of these "matches" or "similariti es", I have to
take each record and compare it to every other record in the database.
As you could imagine, with 100,000 records this can take forever. All
I'm hoping to do is reduce this down somewhat.
Even if I have to compare each record with the next 50 or so in order
I'd be happy....just something to reduce the number of compares I have
to do...hence the "sort by similarity" topic.

Given that I could turn the "Data" into chunks and therefore useable
numbers, could your GUID comment be of use? Sorry, I'm not familiar
with GUIDs, so I'm not sure where you were going with that. I know how
to read the manual though, so I'm prepared to do research in this area
if you think it would help.

MJ

Nov 13 '05 #4
On 7 Sep 2005 20:18:35 -0700, "Mary" <ma************ @hotmail.com>
wrote:

Hmmm, let's make sure we know what you are trying to accomplish. Which
one is it:
* Sort the data such that people who scored "1" a similar number of
times are grouped together. E.g. these 3 have an identical score and
would be grouped together:
111111000000
111110000001
101010101010

* Sort the data such that people who scored "1" a similar number of
times WITH A SIMILAR PATTERN are grouped together
e.g. these two:
111111000000
111110000001
would be more similar than
111111000000
101010101010
A sub-variant here takes proximity into account, such that:
111111000000
111111100000
would be more similar than
111111000000
111111000001

* Other?

A GUID is a Globally Unique IDentifier. It's a big number that you can
create yourself (by calling a Windows API) and as the name suggests
the returned value is pretty much guaranteed to be unique. You
typically see them represented like this:
{00000514-0000-0010-8000-00AA006D2EA4}
but that is just the human-readable format for a 128-bit value.
At this time I'm not sure GUIDs will be of any help.

-Tom.
Hi Tom,

I put forward limited information in an attempt to not confuse anyone
as to what I want to do with the data.

The "Data" column in the table is not a number as such, but I guess it
could be treated as one. It's simply a text field that happens to have
1's and 0's in it...it could have been A's and B's for example.

The "Data" column holds response time information...l et me explain. An
application asks 255 questions and records how long it takes the user
to answer each question. If the response time is below a certain time
limit, a "1" is recorded, otherwise a "0". The response time limit on
each question varies.
This is why a "1" or "0" is incidental and it could easily have been
"A" or "B".

I'm trying to find those "users" that answered all of the questions in
a similar time frame...hence why in the example data record 3 and
record 10 (or 100,000) are more similar than record 3 and record 4.

To answer your questions you posted:

1. The data is not currently sorted in any order (other than the ID
which is actually a UserID)
2. I'm sure there is more than one answer to this...read on...

Currently, to find all of these "matches" or "similariti es", I have to
take each record and compare it to every other record in the database.
As you could imagine, with 100,000 records this can take forever. All
I'm hoping to do is reduce this down somewhat.
Even if I have to compare each record with the next 50 or so in order
I'd be happy....just something to reduce the number of compares I have
to do...hence the "sort by similarity" topic.

Given that I could turn the "Data" into chunks and therefore useable
numbers, could your GUID comment be of use? Sorry, I'm not familiar
with GUIDs, so I'm not sure where you were going with that. I know how
to read the manual though, so I'm prepared to do research in this area
if you think it would help.

MJ


Nov 13 '05 #5
Hi Tom,

Yes, you are correct to a certain point.

I would consider this a reasonable match:
111111000000
111110000001

ie the difference is 2. (remembering that in reality this Data column
is 255 chars long)

This would not be considered a match:
101010101010
010101010101

Although they have the same amount of 1's, every position is
different...the refore 100% different and not a match.
If it was a case of how many 1's, I'd simply have a third column that
was a total number of 1's for each person and sort by that.

What I'm trying to do is produce a report (yes it will be big) showing
each user, and for each of those users all the other users that matched
(say) 95% the same...so that's people where the "Data" is the same in
95% of the positions.

Taking your example from your last post:
111111000000
111111100000
would be more similar than
111111000000
111111000001


I would consider these to have the same match value...they both differ
by only one position.

I have seen some posts where people what to compare two strings to find
a "distance"...eg :
"My Name is Mary" and
"My Game is Card"

The "distance" here is 3.

I hope you can understand what I'm trying to do, and why this has
become more of a drama than I first thought it would be.
My initial thoughts were of finding a way to generate a "Weighting
factor" or something similar and sort by that.

Like I said earlier, I'm not expecting all records to be right next to
each other, just (quite) a bit closer so that I don't have to compare
every record.

MJ

Nov 13 '05 #6
On 7 Sep 2005 21:28:11 -0700, "Mary" <ma************ @hotmail.com>
wrote:

Hi Mary,
Thanks for clarifying. I think I understand what you want to do.
Coming back to my original post, I do think your question is not
logical. Perhaps that's too strong a statement (I'm not a native
speaker).
The reason I say this, is that the Difference calculation depends on
both values, not an arbitrary standard value such as:
000000000000
or
111111111111

You're saying: I'll give you an unknown value X, and I want to sort
the data based on the Difference to that value. I think it's obvious
the result cannot be pre-calculated because it depends on the value of
X. Theoretically you could do it if you had not just 1 sortcolumn, but
2^255 columns, one for each possible value of X.

I'm thinking of a radically different approach: the Ratcliff/Obershelp
Pattern Recognition Algorithm. I implemented this for situations where
we wanted to find CompanyNames that were very similar (expressed as a
similarity percentage) to a given CompanyName. The list has almost
10,000 elements, and executes in mere seconds (written in C and
compiled into a classic DLL). With 100,000 rows, you should be able to
find the similar values in less than 30 seconds. I've never used the
algorithm for situations like yours, but for CompanyName and
PersonName it works quite well.

Regards,

-Tom.

Hi Tom,

Yes, you are correct to a certain point.

I would consider this a reasonable match:
111111000000
111110000001

ie the difference is 2. (remembering that in reality this Data column
is 255 chars long)

This would not be considered a match:
101010101010
010101010101

Although they have the same amount of 1's, every position is
different...th erefore 100% different and not a match.
If it was a case of how many 1's, I'd simply have a third column that
was a total number of 1's for each person and sort by that.

What I'm trying to do is produce a report (yes it will be big) showing
each user, and for each of those users all the other users that matched
(say) 95% the same...so that's people where the "Data" is the same in
95% of the positions.

Taking your example from your last post:
111111000000
111111100000
would be more similar than
111111000000
111111000001


I would consider these to have the same match value...they both differ
by only one position.

I have seen some posts where people what to compare two strings to find
a "distance"...eg :
"My Name is Mary" and
"My Game is Card"

The "distance" here is 3.

I hope you can understand what I'm trying to do, and why this has
become more of a drama than I first thought it would be.
My initial thoughts were of finding a way to generate a "Weighting
factor" or something similar and sort by that.

Like I said earlier, I'm not expecting all records to be right next to
each other, just (quite) a bit closer so that I don't have to compare
every record.

MJ


Nov 13 '05 #7
Hi Tom,
You're saying: I'll give you an unknown value X, and I want to sort
the data based on the Difference to that value. I think it's obvious
the result cannot be pre-calculated because it depends on the value of
X. Theoretically you could do it if you had not just 1 sortcolumn, but
2^255 columns, one for each possible value of X.


Yes, you are correct. The main problem here is that I need to do
"binary" compares...this is why I was looking for some sort of
"weighting" algorithm that I could use as I added records to the
database. I have looked at "Hamming Distance" and "Levenshtei n - Edit
String" algorithms. These work fine when you have the two pieces of
data to compare...but not when you already have 100,000 records.

It's been suggested that I use one of the two algorithms and compare
each record as I add it to the database with a "reference" data
field...maybe "0101010101.... "
So I'll try that over the next day or so and see if that helps.

Tom, thanks for persisting with this thread and your comments you have
posted. It has helped having someone else offer suggestions as it has
enabled me to think outside of the immediate range of possible
solutions.

Thanks again,

MJ

Nov 13 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
7280
by: eric.gagnon | last post by:
In a program randomly generating 10 000 000 alphanumeric codes of 16 characters in length (Ex.: "ZAZAZAZAZAZAZ156"), what would be an efficient way to ensure that I do not generate duplicates? STL set, map? Could you give me a little code example? Thank you.
10
1959
by: Al Christoph | last post by:
Please forgive me if this is the wrong place to post this. The last place I posted got me a fairly rude response. I guess vb.db people just don't want to think about XML as database. At any rate, here is what I posted: I have downloaded the final production version of VS 2005 pro after testing with Team version. I wonder if I've been screwed or if it's just too early on a Sunday morning. It seems to me that it was relatively trivial to...
16
12071
by: Leon | last post by:
I need a program that generate 5 non-duplicates random number between 1-10 as string values store in an array. Do anybody know of any good books or websites that explain how to generator random numbers using asp.net? I know about the random namespace within .net, but I need a reference to some code that do the similar stated function above. Plus If you have any coding practice ideas for the above defined project please share them.
4
1600
by: Lee Chapman | last post by:
Hi, I am having difficulty getting the ASP.NET framework to generate valid XHTML. My immediate problem surrounds user input in, for example, textbox controls. I consider characters such as less-than and ampersand perfectly valid in user input. So I've disabled request validation by adding the following to my web.config file.
13
1554
by: vasudevmukherjee | last post by:
Hi! Can somebody help tell me why the following code gives a garbage value while producing first student's name, whereas it gives the names correctly for other three students - I really fail to understand since it is not generating the required results.- Thank you in anticipation - Vasudev. #include <stdio.h> main () { char student, *x;
1
1559
by: Ryan Ginstrom | last post by:
I have been maintaining a body of documentation in plain HTML files. I would now like to automate the generation of the HTML files. Maintaining the HTML files now is tedious and error prone, because every time I move/remove/add a section, I have to renumber the following sections and update my internal links. I have looked at some of the HTML-generation frameworks, but the selection is somewhat bewildering, and most also seem geared to...
18
2104
by: baltimoredude1 | last post by:
Hi I was writing a simple code to generate the first 100 prime numbers. Everything looks fine to me except the output of the program. What's wrong with it? I am attaching the program as well as the output. Would appreciate if someone could mail me at baltimoredude1@gmail.com Thanks A M Rahman
0
923
by: manuaz | last post by:
Hi I am having problems generating C# class from .xsd files. I have multiple .xsd pointing to each other in different sub directories. In a different situation I was able to generate a class when all the .xsd files are in one directory but in the above specified case I am getting URI formats are not supported error. I am new to this field and trying to understand as I go. The schema is defined by third party entity which I have no control on....
1
2203
nine72
by: nine72 | last post by:
Ok, I am at a complete loss on this and have finally come to the XML Parsing Gods (and perhaps a PHP minor deity) for guidance… I will try my best to describe what I have going on… 1) I have 15 form pages, well over 500 potential fields, which are written in PHP. While most pages are one time entry forms, there are 5 that can be “recycled” as many times as needed. An example would be the Contacts Form. A user can give me 1 contact and move...
0
8612
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9171
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8880
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7743
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
4373
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4625
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3053
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
2342
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2008
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.