faster way to get from SQL database into array

Cleverbum

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Dec 11 '06 #1

Subscribe Reply

1982

Erwin Moller

Cl*******@hotmail.com wrote:

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Hi,

Loading the whole table into PHP's memory is surely slow.
I think it is faster to make the field textmd5 UNIQUE on databaselevel, and
just insert and of course catch the possible error (UNIQUE CONSTAINT
VIOLATION etc etc): In case of collision it will complain. than handle
that.
In that way only your database has to scan the table, and will not tranfer
its contents to php.

Also, avoid * when making queries, it slows down too.
You only want textmd5, so just ask for that field. :-)

Regards,
Erwin Moller

Dec 11 '06 #2

Rik

Erwin Moller wrote:

Cl*******@hotmail.com wrote:

>I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found
that the fastest way to do this is to have all the md5's stored in
an array and use the php function in_array().
my only problem now is that populating this array with data from my
sql server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Hi,

Loading the whole table into PHP's memory is surely slow.
I think it is faster to make the field textmd5 UNIQUE on
databaselevel, and just insert and of course catch the possible error
(UNIQUE CONSTAINT VIOLATION etc etc): In case of collision it will
complain. than handle that.
In that way only your database has to scan the table, and will not
tranfer its contents to php.

Indeed, or possibly check it first:
$searchstring = md5(//something);
$result = mysql_query("SELECT `textmd5` FROM `logs_full` WHERE `textmd5` =
'{$searchstring}'");
if(mysql_num_rows($result) < 1){
//not in table logic..
} else {
//is in table logic..
}

Also, avoid * when making queries, it slows down too.
You only want textmd5, so just ask for that field. :-)

Amen.
--
Rik Wasmus

Dec 12 '06 #3

Toby Inkster

Rik wrote:

Indeed, or possibly check it first:
$searchstring = md5(//something);
$result = mysql_query("SELECT `textmd5` FROM `logs_full` WHERE `textmd5` =
'{$searchstring}'");
if(mysql_num_rows($result) < 1){
//not in table logic..
} else {
//is in table logic..
}

As the OP said, he needs to check "hundreds of thousands" of md5 strings
-- using an SQL call for each one will slow him down.

My advice would be to keep doing roughly what you're already doing, but
speed up your array search.

To speed up your search, make sure your array is sorted in alphabetical
order. You can do this using your initial SQL query:

SELECT textmd5 FROM logs_full ORDER BY textmd5;

(Note: the query will speed up by specifying the exact column you need to
select; not '*'.)

You can then use a binary search function[1] instead of in_array() to
check that a value exists in your array.

____
1. Such as...
http://www.rci.rutgers.edu/~jfulton/...ary_search.inc

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 12 '06 #4

Cleverbum

Toby Inkster wrote:

Rik wrote:

Indeed, or possibly check it first:
$searchstring = md5(//something);
$result = mysql_query("SELECT `textmd5` FROM `logs_full` WHERE `textmd5` =
'{$searchstring}'");
if(mysql_num_rows($result) < 1){
//not in table logic..
} else {
//is in table logic..
}

As the OP said, he needs to check "hundreds of thousands" of md5 strings
-- using an SQL call for each one will slow him down.

Hit the nail on the head.

>
My advice would be to keep doing roughly what you're already doing, but
speed up your array search.

To speed up your search, make sure your array is sorted in alphabetical
order. You can do this using your initial SQL query:

SELECT textmd5 FROM logs_full ORDER BY textmd5;

(Note: the query will speed up by specifying the exact column you need to
select; not '*'.)

You can then use a binary search function[1] instead of in_array() to
check that a value exists in your array.

____
1. Such as...
http://www.rci.rutgers.edu/~jfulton/...ary_search.inc

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Thanks for the binary search idea I've done some timing and the
in_array seems to be taking a lot longer than I originally thought it
would.

Dec 12 '06 #5

Erwin Moller

Cl*******@hotmail.com wrote:

>
Toby Inkster wrote:
>Rik wrote:

Indeed, or possibly check it first:
$searchstring = md5(//something);
$result = mysql_query("SELECT `textmd5` FROM `logs_full` WHERE
`textmd5` = '{$searchstring}'");
if(mysql_num_rows($result) < 1){
//not in table logic..
} else {
//is in table logic..
}

As the OP said, he needs to check "hundreds of thousands" of md5 strings
-- using an SQL call for each one will slow him down.

Hit the nail on the head.

Sorry to be rude, but I think you both missed the nail.
You have a certain value that you transform to a md5 and check if it is your
db allready, right?
PHP can do that md5.

After that you simply need to find that value in the table, like:
$mymd5= md5("whatever");
$SQL = "SELECT md5text FROM logs_full WHERE (md5text='".$mymd5."');"

You do not have to make hundreds of thousands of queries, or hundreds of
thousands md5-calculations.
You only feed this query once and sees if it returns any results.

Am I missing something completely here?

Regards,
Erwin Moller

>
>>
My advice would be to keep doing roughly what you're already doing, but
speed up your array search.

To speed up your search, make sure your array is sorted in alphabetical
order. You can do this using your initial SQL query:

SELECT textmd5 FROM logs_full ORDER BY textmd5;

(Note: the query will speed up by specifying the exact column you need to
select; not '*'.)

You can then use a binary search function[1] instead of in_array() to
check that a value exists in your array.

____
1. Such as...
http://www.rci.rutgers.edu/~jfulton/...ary_search.inc

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Thanks for the binary search idea I've done some timing and the
in_array seems to be taking a lot longer than I originally thought it
would.

Dec 12 '06 #6

Geoff Berrow

Message-ID: <45*********************@news.xs4all.nlfrom Erwin Moller
contained the following:

>You do not have to make hundreds of thousands of queries, or hundreds of
thousands md5-calculations.
You only feed this query once and sees if it returns any results.

Am I missing something completely here?

If you are, so am I.
--
Geoff Berrow (put thecat out to email)
It's only Usenet, no one dies.
My opinions, not the committee's, mine.
Simple RFDs http://www.ckdog.co.uk/rfdmaker/

Dec 12 '06 #7

Norman Peelman

<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Maybe,

I think you are going about this project all wrong:

Assumption - your 'list' of md5's is actually the md5's in the database.
Assumption - you arecreating md5's from strings in text files.
Assumption - you need to check to make sure that the new value isn't already
in the database.

Problem - You are reading in the ENTIRE database.
Problem - you may be tempted to think in_array() would be faster, but it has
to start at the beginning of the array for each new value every time.
Problem - big waste of time and resources. You are doing double work.

Solution - Let MySQL do what it was designed to do. Since md5's are meant to
be unique in their own right, simply make your 'textmd5' field the PRIMARY
KEY and it will automatically be indexed. Now only do the operations you
actually require:

$query = "SELECT textmd5 FROM logs_full WHERE textmd5 = 'search_md5';
$resone = mysql_query($query, $dbc);
if (!mysql_num_rows($resone))
{
// no match found
// insert new info into database
}
else
{
//match found
// no need to insert
}

no arrays being used and I guarantee it will be WAY faster for any size of
database. Now for the real question. Any particular reason you are creating
an md5 database? It's already being done...

Norm
--
FREE Avatar hosting at www.easyavatar.com

Dec 12 '06 #8

Cl*******@hotmail.com wrote:

Toby Inkster wrote:
Rik wrote:

Indeed, or possibly check it first:
$searchstring = md5(//something);
$result = mysql_query("SELECT `textmd5` FROM `logs_full` WHERE `textmd5` =
'{$searchstring}'");
if(mysql_num_rows($result) < 1){
//not in table logic..
} else {
//is in table logic..
}
As the OP said, he needs to check "hundreds of thousands" of md5 strings
-- using an SQL call for each one will slow him down.

Hit the nail on the head.

PHP arrays are not particularly fast / efficient - particularly when
they get big. The fastest way to resolve two large lists like this is
to keep the "hundreds of thousands" of needles in a table as well as
the haystack and join the two in an SQL query (rDBMS are designed
specifically to do this kind of thing).

C.

Dec 12 '06 #9

Jerry Stuckle

Norman Peelman wrote:

<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...

>>I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Maybe,

I think you are going about this project all wrong:

Assumption - your 'list' of md5's is actually the md5's in the database.
Assumption - you arecreating md5's from strings in text files.
Assumption - you need to check to make sure that the new value isn't already
in the database.

Problem - You are reading in the ENTIRE database.
Problem - you may be tempted to think in_array() would be faster, but it has
to start at the beginning of the array for each new value every time.
Problem - big waste of time and resources. You are doing double work.

Solution - Let MySQL do what it was designed to do. Since md5's are meant to
be unique in their own right, simply make your 'textmd5' field the PRIMARY
KEY and it will automatically be indexed. Now only do the operations you
actually require:

Actually, md5's are NOT unique. If they were, it would be the most
impressive compression mechanism known to computers. Imagine - any size
file compressed to 32 characters!

$query = "SELECT textmd5 FROM logs_full WHERE textmd5 = 'search_md5';
$resone = mysql_query($query, $dbc);
if (!mysql_num_rows($resone))
{
// no match found
// insert new info into database
}
else
{
//match found
// no need to insert
}

no arrays being used and I guarantee it will be WAY faster for any size of
database. Now for the real question. Any particular reason you are creating
an md5 database? It's already being done...

Norm

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Dec 12 '06 #10

Norman Peelman

"Jerry Stuckle" <js*******@attglobal.netwrote in message
news:As******************************@comcast.com. ..

Norman Peelman wrote:
<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...

>I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Maybe,

I think you are going about this project all wrong:

Assumption - your 'list' of md5's is actually the md5's in the database.
Assumption - you arecreating md5's from strings in text files.
Assumption - you need to check to make sure that the new value isn't

already

in the database.

Problem - You are reading in the ENTIRE database.
Problem - you may be tempted to think in_array() would be faster, but it

has

to start at the beginning of the array for each new value every time.
Problem - big waste of time and resources. You are doing double work.

Solution - Let MySQL do what it was designed to do. Since md5's are

meant to

be unique in their own right, simply make your 'textmd5' field the

PRIMARY

KEY and it will automatically be indexed. Now only do the operations you
actually require:

Actually, md5's are NOT unique. If they were, it would be the most
impressive compression mechanism known to computers. Imagine - any size
file compressed to 32 characters!

Well, I knew i'd get some flak for that. :) I was assuming they would be
unique enough for the situation.

$query = "SELECT textmd5 FROM logs_full WHERE textmd5 = 'search_md5';
$resone = mysql_query($query, $dbc);
if (!mysql_num_rows($resone))
{
// no match found
// insert new info into database
}
else
{
//match found
// no need to insert
}

no arrays being used and I guarantee it will be WAY faster for any size

database. Now for the real question. Any particular reason you are

creating

an md5 database? It's already being done...

Norm

Norm
--
FREE Avatar hosting at www.easyavatar.com

Dec 12 '06 #11

Toby Inkster

Erwin Moller wrote:

You have a certain value that you transform to a md5 and check if it is your
db allready, right?

No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.

What you are suggesting is (pseudo-code abound):

$needles = (hundreds of thousands of values);
foreach ($needles as $n)
{
$nMD5 = md5($n);
$r = sql_query("SELECT md5text FROM logsfull WHERE md5text='$nMD5';");
if (sql_fetch_array($r))
print "$nMD5 exists in database.\n";
else
print "$nMD5 does not exist in database.\n";
}

This will involve hundreds of thousands of SQL queries. Say the "logsfull"
table has zero rows (as it well might!), then that is hundreds of
thousands of useless calls to your RDBMS.

What I am suggesting is:

$needles = (hundreds of thousands of values);
$r = sql_query("SELECT md5text FROM logsfull ORDER BY md5text;");
while (list($straw) = sql_fetch_array($r))
$haystack[] = $straw;
foreach ($needles as $n)
{
$nMD5 = md5($n);
if (bsearch($haystack, $n))
print "$nMD5 does exist in database\n";
else
print "$nMD5 does not exist in database\n";
}

A single call to our poor beleagured RDBMS, and then an efficient binary
search for each needle. The bottleneck is likely to be the md5() function
here.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 12 '06 #12

Toby Inkster

Norman Peelman wrote:

Well, I knew i'd get some flak for that. :) I was assuming they would be
unique enough for the situation.

There are varying degrees of uniqueness?

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 12 '06 #13

Norman Peelman

<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

I can only shorten it:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_assoc($resone)) {
$md5array[]= $row['textmd5'];
}

How big (how many rows) is the table you are reading in? Are you storing new
md5's when they aren't found in the db?
Norm
--
FREE Avatar hosting at www.easyavatar.com

Dec 12 '06 #14

Norman Peelman

"Toby Inkster" <us**********@tobyinkster.co.ukwrote in message
news:l2************@ophelia.g5n.co.uk...

Norman Peelman wrote:

Well, I knew i'd get some flak for that. :) I was assuming they would

unique enough for the situation.

There are varying degrees of uniqueness?

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

The poster is converting strings not files to md5. md5 is not a method
of encryption per se but a more complicated method of CRC to 'help' validate
data... It is convienent to use md5 to 'one-way encrypt' small strings used
as passwords. I would assume that the poster is also storing the matching
plaintext along with the md5 so if a match was found he could then see if
the plaintext matched as well but, according to the definitions i've read on
the net; the odds are extremely high, paraphrasing:

Taken from
http://www.google.com/search?hl=en&l...ition&ct=title

The odds of any two strings having the same md5 are on the order of 2^64
operations.
The odds of finding a string having a given md5 is on the order of 2^128
operations.

....so yes, i'd say that there are variable degrees of uniqueness. So maybe
an easier way than my first post would be:

Using textmd5 as PRIMARY KEY

$db_link = mysql_connect(...,...,...); //etc.
$query = "INSERT INTO logs_full VALUES ($textmd5, $textplain)";
$resone = mysql_query($query, $dbc);
if (!mysql_errno($db_link))
{
// match found (duplicate entry for key error 1062 returned from MySQL)
// could then double check textplain here although odds of being here
are very high
// do other stuff here
}
else
{
// no match found
// new entry has just been inserted
}

Of course this assumes he is also storing new md5's to check against. I
gueuss we don't fully know what the poster completely intends on doing
unless he pops back in with a little more info.

Norm

Dec 12 '06 #15

Toby Inkster

Norman Peelman wrote:

The odds of any two strings having the same md5 are on the order of 2^64
operations.

The odds of any two *given* strings having the same MD5 are about 3.4E-38.

The odds that some two strings have the same MD5 are 1. On a modern
computer, you should be able to calculate a "collision" in minutes.
<http://www.links.org/?p=6>

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 13 '06 #16

Jerry Stuckle

Norman Peelman wrote:

"Toby Inkster" <us**********@tobyinkster.co.ukwrote in message
news:l2************@ophelia.g5n.co.uk...

>>Norman Peelman wrote:

>>>Well, I knew i'd get some flak for that. :) I was assuming they would

be

>>>unique enough for the situation.

There are varying degrees of uniqueness?

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

The poster is converting strings not files to md5. md5 is not a method
of encryption per se but a more complicated method of CRC to 'help' validate
data... It is convienent to use md5 to 'one-way encrypt' small strings used
as passwords. I would assume that the poster is also storing the matching
plaintext along with the md5 so if a match was found he could then see if
the plaintext matched as well but, according to the definitions i've read on
the net; the odds are extremely high, paraphrasing:

Taken from
http://www.google.com/search?hl=en&l...ition&ct=title

The odds of any two strings having the same md5 are on the order of 2^64
operations.
The odds of finding a string having a given md5 is on the order of 2^128
operations.

...so yes, i'd say that there are variable degrees of uniqueness. So maybe
an easier way than my first post would be:

Using textmd5 as PRIMARY KEY

$db_link = mysql_connect(...,...,...); //etc.
$query = "INSERT INTO logs_full VALUES ($textmd5, $textplain)";
$resone = mysql_query($query, $dbc);
if (!mysql_errno($db_link))
{
// match found (duplicate entry for key error 1062 returned from MySQL)
// could then double check textplain here although odds of being here
are very high
// do other stuff here
}
else
{
// no match found
// new entry has just been inserted
}

Of course this assumes he is also storing new md5's to check against. I
gueuss we don't fully know what the poster completely intends on doing
unless he pops back in with a little more info.

Norm

And when you're comparing unknown strings, the odds go down rapidly.

The odds of anyone having the same birthday as you is 1 in 365.25.
However, in a group of 24 people, the odds of any two of them having the
same birthday are 1 in 2.

I don't have the time (or inkling) to determine how many md5 hashes you
would have to generate to get a reasonable chance of hitting a
duplicate. But it is much less than 2^64.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Dec 13 '06 #17

Norman Peelman

"Toby Inkster" <us**********@tobyinkster.co.ukwrote in message
news:f9************@ophelia.g5n.co.uk...

Norman Peelman wrote:

The odds of any two strings having the same md5 are on the order of 2^64
operations.

The odds of any two *given* strings having the same MD5 are about 3.4E-38.

The odds that some two strings have the same MD5 are 1. On a modern
computer, you should be able to calculate a "collision" in minutes.
<http://www.links.org/?p=6>

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Ok, not sure what's going on (read a bunch of posts concerning collisions)
but I cannot reproduce the output of the link above. Give the two separate
strings I get two different hashes:

simple script to show differences in input strings and the md5 hashes

<?php
$str1 =
'd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4 004583eb8fb7f8955ad340609f
4b30283e488832571415a085125e8f7cdc99fd91dbdf280373 c5bd8823e3156348f5bae6dacd
436c919c6dd53e2b487da03fd02396306d248cda0e99f33420 f577ee8ce54b67080a80d1ec69
821bcb6a8839396f9652b6ff72a70';
$str2 =
'd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4 004583eb8fb7f8955ad340609f
4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373 c5bd8823e3156348f5bae6dacd
436c919c6dd53e23487da03fd02396306d248cda0e99f33420 f577ee8ce54b67080280d1ec69
821bcb6a8839396f965ab6ff72a70';
$md51 = md5($str1);
$md52 = md5($str2);

$str_diffs = show_diff($str1, $str2);

echo '<pre>';
echo "$str_diffs ";
echo "$str1 ";
echo 'Length: '.strlen($str1).' ';
echo "md5: $md51 ";
echo 'md5 length: '.strlen($md51).' ';
echo "$str2 ";
echo 'Length: '.strlen($str2).' ';
echo "md5: $md52 ";
echo 'md5 length: '.strlen($md52).' ';
echo '</pre>';
echo ' ';

function show_diff($str1, $str2)
{
$str1_tmp = '';
$str2_tmp = '';
for ($loop = 0; $loop <= strlen($str1)-1; $loop++)
{
if ($str1{$loop} != $str2{$loop})
{
$font_start = '';
$font_end = '';
}
else
{
$font_start = '';
$font_end = '';
}
$str1_tmp .= $font_start.$str1{$loop}.$font_end;
$str2_tmp .= $font_start.$str2{$loop}.$font_end;
}
$final = $str1_tmp.' '.$str2_tmp.' ';
return $final;
}
?>

output:

d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab40 04583eb8fb7f8955ad340609f4
b30283e488832571415a085125e8f7cdc99fd91dbdf280373c 5bd8823e3156348f5bae6dacd4
36c919c6dd53e2b487da03fd02396306d248cda0e99f33420f 577ee8ce54b67080a80d1ec698
21bcb6a8839396f9652b6ff72a70

d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab40 04583eb8fb7f8955ad340609f4
b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c 5bd8823e3156348f5bae6dacd4
36c919c6dd53e23487da03fd02396306d248cda0e99f33420f 577ee8ce54b67080280d1ec698
21bcb6a8839396f965ab6ff72a70
d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab40 04583eb8fb7f8955ad340609f4
b30283e488832571415a085125e8f7cdc99fd91dbdf280373c 5bd8823e3156348f5bae6dacd4
36c919c6dd53e2b487da03fd02396306d248cda0e99f33420f 577ee8ce54b67080a80d1ec698
21bcb6a8839396f9652b6ff72a70

Length: 256

md5: edde4181249fea68547c2fd0edd2e22f

md5 length: 32

d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab40 04583eb8fb7f8955ad340609f4
b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c 5bd8823e3156348f5bae6dacd4
36c919c6dd53e23487da03fd02396306d248cda0e99f33420f 577ee8ce54b67080280d1ec698
21bcb6a8839396f965ab6ff72a70

Length: 256

md5: e234dbc6aa0932d9dd5facd53ba0372a

md5 length: 32

Is it possible this has been fixed (running PHP4.3.4) as I notice the post
is from 2005... Or am I not looking at something right?

Norm

Dec 13 '06 #18

Erwin Moller

Toby Inkster wrote:

Erwin Moller wrote:

>You have a certain value that you transform to a md5 and check if it is
your db allready, right?

No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.

What you are suggesting is (pseudo-code abound):

No, I am explicitely NOT suggesting that because that would be very stupid
and I am not. :P
I suggested he does 1 query with the precalculated md5-hash (You must have
misread my post: please reread my post, it clearly suggest 1 query).

But it is entirely possible I do not understand the problem.
That has to do with the fact I cannot imagine a setup that needs to do this,
and if it is needed, that design was bad and needs to be redone IMHO.

I understood that the database holds the md5-hashes, not the raw data, and
the OP said this in his original message:
"I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times."

What is unclear to me is WHY he has zillion values to transform to
md5-hashes.
Sounds like very bad design to me.

But since Cleverburn dropped out of this discussion, it will be hard to tell
what he actually is doing. We can only guess.

Regards,
Erwin Moller

Dec 13 '06 #19

Toby Inkster

Norman Peelman wrote:

Is it possible this has been fixed (running PHP4.3.4) as I notice the post
is from 2005... Or am I not looking at something right?

The strings given in the article need to be un-wordwrapped and un-hexed
before comparison...

<?php

# These are hexadecimal versions of the true strings that need to be
# tested for MD5 values.
$hex_A =
'd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4 004583eb8fb7f8955ad340609f
4b30283e488832571415a085125e8f7cdc99fd91dbdf280373 c5bd8823e3156348f5bae6dacd
436c919c6dd53e2b487da03fd02396306d248cda0e99f33420 f577ee8ce54b67080a80d1ec69
821bcb6a8839396f9652b6ff72a70';
$hex_B =
'd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4 004583eb8fb7f8955ad340609f
4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373 c5bd8823e3156348f5bae6dacd
436c919c6dd53e23487da03fd02396306d248cda0e99f33420 f577ee8ce54b67080280d1ec69
821bcb6a8839396f965ab6ff72a70';

# First, remove extra characters introduced by wrapping in article.
$hex_A = preg_replace('/[\r\n\s]/', '', $hex_A);
$hex_B = preg_replace('/[\r\n\s]/', '', $hex_B);

# Now, unhex them.
$str_A = pack('H*', $hex_A);
$str_B = pack('H*', $hex_B);

$sha1_A = sha1($str_A);
$sha1_B = sha1($str_B);

$crc32_A = crc32($str_A);
$crc32_B = crc32($str_B);

$md5_A = md5($str_A);
$md5_B = md5($str_B);

echo "<pre>\n";
echo "To prove the strings are different, look at their SHA-1 and CRC32 values.\n";
echo "\n";
echo "SHA-1 A: {$sha1_A}\n";
echo "SHA-1 B: {$sha1_B}\n";
echo "\n";
echo "CRC32 A: {$crc32_A}\n";
echo "CRC32 B: {$crc32_B}\n";
echo "\n";
echo "Now behold...\n";
echo "\n";
echo "MD5 A: {$md5_A}\n";
echo "MD5 B: {$md5_B}\n";
echo "</pre>\n";

?>
--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 13 '06 #20

Norman Peelman

"Erwin Moller"
<si******************************************@spam yourself.comwrote in
message news:45*********************@news.xs4all.nl...

Toby Inkster wrote:

Erwin Moller wrote:

You have a certain value that you transform to a md5 and check if it is
your db allready, right?
No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.

What you are suggesting is (pseudo-code abound):

No, I am explicitely NOT suggesting that because that would be very stupid
and I am not. :P
I suggested he does 1 query with the precalculated md5-hash (You must have
misread my post: please reread my post, it clearly suggest 1 query).

But it is entirely possible I do not understand the problem.
That has to do with the fact I cannot imagine a setup that needs to do

this,

and if it is needed, that design was bad and needs to be redone IMHO.

I understood that the database holds the md5-hashes, not the raw data, and
the OP said this in his original message:
"I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times."

What is unclear to me is WHY he has zillion values to transform to
md5-hashes.
Sounds like very bad design to me.

But since Cleverburn dropped out of this discussion, it will be hard to

tell

what he actually is doing. We can only guess.

Regards,
Erwin Moller

Yeah, that's what I said...

Norm
--
FREE Avatar hosting at www.easyavatar.com

Dec 13 '06 #21

Norman Peelman

"Toby Inkster" <us**********@tobyinkster.co.ukwrote in message
news:cv************@ophelia.g5n.co.uk...

Norman Peelman wrote:

Is it possible this has been fixed (running PHP4.3.4) as I notice the

post

is from 2005... Or am I not looking at something right?

The strings given in the article need to be un-wordwrapped and un-hexed
before comparison...

>
--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Ok, I guess that headecimal conversion wasn't apparrant (to me). I
understand it now. Wondered what I was doing wrong. My question becomes
that of feasability... how bad is this for passwords as opposed to forging
documents?

Norm
--
FREE Avatar hosting at www.easyavatar.com

Dec 13 '06 #22

Cleverbum

Erwin Moller wrote:

Toby Inkster wrote:

Erwin Moller wrote:

You have a certain value that you transform to a md5 and check if it is
your db allready, right?
No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.

What you are suggesting is (pseudo-code abound):

No, I am explicitely NOT suggesting that because that would be very stupid
and I am not. :P
I suggested he does 1 query with the precalculated md5-hash (You must have
misread my post: please reread my post, it clearly suggest 1 query).

But it is entirely possible I do not understand the problem.
That has to do with the fact I cannot imagine a setup that needs to do this,
and if it is needed, that design was bad and needs to be redone IMHO.

I understood that the database holds the md5-hashes, not the raw data, and
the OP said this in his original message:
"I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times."

What is unclear to me is WHY he has zillion values to transform to
md5-hashes.
Sounds like very bad design to me.

But since Cleverburn dropped out of this discussion, it will be hard to tell
what he actually is doing. We can only guess.

Regards,
Erwin Moller

Sorry to have dropped out for a day, I've got a datafile which contains
some duplicate entries, to check that the entry is not a duplicate I
first check that the MD5 of that entry is not in a list of the MD5s of
lines which have already been processed.
Computing the md5 and then checking it against a database is very much
quicker than comparing the raw data strings to one another.

Dec 13 '06 #23

Cleverbum

Norman Peelman wrote:

<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...
I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

I can only shorten it:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_assoc($resone)) {
$md5array[]= $row['textmd5'];
}

How big (how many rows) is the table you are reading in? Are you storing new
md5's when they aren't found in the db?
Norm
--
FREE Avatar hosting at www.easyavatar.com

The table currently has about 390,000 rows, and no I don't send the new
MD5s to it as soon as they are found to be new, I just put them into
the array.
I've found that moving away from single insert queries speeds up the
script many times over and so now I insert the data into the SQL tables
when I have 1000 new unique values.

Dec 13 '06 #24

Cleverbum

Norman Peelman wrote:

<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...
I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

Maybe,

I think you are going about this project all wrong:

Assumption - your 'list' of md5's is actually the md5's in the database.
Assumption - you arecreating md5's from strings in text files.
Assumption - you need to check to make sure that the new value isn't already
in the database.

Problem - You are reading in the ENTIRE database.
Problem - you may be tempted to think in_array() would be faster, but it has
to start at the beginning of the array for each new value every time.
Problem - big waste of time and resources. You are doing double work.

Solution - Let MySQL do what it was designed to do. Since md5's are meant to
be unique in their own right, simply make your 'textmd5' field the PRIMARY
KEY and it will automatically be indexed. Now only do the operations you
actually require:

$query = "SELECT textmd5 FROM logs_full WHERE textmd5 = 'search_md5';
$resone = mysql_query($query, $dbc);

I don't know if it's because my SQL server and webserver are on
different machines, or because it's a feature of the language, but this
simply isn't as fast as the binary search which I am now using.

if (!mysql_num_rows($resone))
{
// no match found
// insert new info into database
}
else
{
//match found
// no need to insert
}

no arrays being used and I guarantee it will be WAY faster for any size of
database. Now for the real question. Any particular reason you are creating
an md5 database? It's already being done...

Norm
--
FREE Avatar hosting at www.easyavatar.com

Dec 13 '06 #25

Norman Peelman

<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...

>
Norman Peelman wrote:
<Cl*******@hotmail.comwrote in message
news:11**********************@f1g2000cwa.googlegro ups.com...
I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my

sql

server is rather slow, I currently use the lines:
>
$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();
>
while ($row = mysql_fetch_array($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}
>
to do this. does anyone have a faster method?
>
I can only shorten it:

$resone = mysql_query("SELECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_assoc($resone)) {
$md5array[]= $row['textmd5'];
}

How big (how many rows) is the table you are reading in? Are you storing

new

md5's when they aren't found in the db?
Norm
--
FREE Avatar hosting at www.easyavatar.com

The table currently has about 390,000 rows, and no I don't send the new
MD5s to it as soon as they are found to be new, I just put them into
the array.
I've found that moving away from single insert queries speeds up the
script many times over and so now I insert the data into the SQL tables
when I have 1000 new unique values.

Ok, so then the next set of questions would be:
1) How large do you expect this table to get and how soon? (already consumes
12.5megs just for the textmd5 field alone)
2) How are you INSERTING the new info at the 1000 mark?
....what about LOAD DATA INFILE with the IGNORE switch?
http://dev.mysql.com/doc/refman/4.1/en/load-data.html

....or using array_diff()?
http://us2.php.net/manual/en/function.array-diff.php

Norm

Dec 13 '06 #26

Toby Inkster

Norman Peelman wrote:

My question becomes that of feasability... how bad is this for
passwords as opposed to forging documents?

For low to medium security purposes, I'm guessing that MD5 will still be a
useful hash function for a few years yet -- I don't think it is yet the
time to rip all the md5() functions out of your existing programmes and
replace them with more modern hashes. But my advice would certainly be to
consider hashes such as SHA256 and Whirlpool for future projects.

For checksummy-type purposes, like verifying a file hasn't been
accidentally damaged during a download or compression/decompression, then
I'd recommend comparing not just hashes, but also file sizes.

For passwordy-type purposes, I'd recommend *not* storing the length of the
un-hashed password, as this gives an additional clue to what the password
is.

(Also, Google for "md5deep".)

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 13 '06 #27

Toby Inkster

Erwin Moller wrote:

I suggested he does 1 query with the precalculated md5-hash.

I know you did -- but there are "hundreds of thousands" of precalculated
ms5-hashes (needles) -- it said so in the original post -- thus hundreds
of thousands of queries.

You must have misread my post: please reread my post, it clearly
suggest 1 query

.... per hash to be found!

Please reread my code: it clearly implements the technique you suggested.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Dec 14 '06 #28

faster way to get from SQL database into array

Similar topics