Bytes IT Community

checksum

Hi,
I was wondering if PGSQL has a function similar to binary_checksum() of
MS SQL Server 2000. It is pretty handy when it comes to comparing rows of
data instead of having to write long boolean expressions.
binary_checksum() takes a list of fields and returns an integer value
which summarizes the row content.

Thanks,
Fed
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #1
13 Replies


> I was wondering if PGSQL has a function similar to binary_checksum() of
> MS SQL Server 2000. It is pretty handy when it comes to comparing rows of
> data instead of having to write long boolean expressions.
> binary_checksum() takes a list of fields and returns an integer value
> which summarizes the row content.


On a similar note, I've found myself wanting an extended '=' operator
meaning
(a = b or (a is null and b is null))

same goal of course, for more general comparisons...

d.
--
David Helgason,
Business Development et al.,
Over the Edge I/S (http://otee.dk)
Direct line +45 2620 0663
Main line +45 3264 5049
On 26. sep 2004, at 19:58, Federico Balbi wrote:

Nov 23 '05 #2

David Helgason wrote:
> I was wondering if PGSQL has a function similar to binary_checksum() of
> MS SQL Server 2000. It is pretty handy when it comes to comparing rows of
> data instead of having to write long boolean expressions.
> binary_checksum() takes a list of fields and returns an integer value
> which summarizes the row content.

You could use the md5() function, such as:

select md5(foo) from bar where baz = 2;

J

> On a similar note, I've found myself wanting an extended '=' operator
> meaning
> (a = b or (a is null and b is null))
>
> same goal of course, for more general comparisons...
>
> d.


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL

Nov 23 '05 #3

> David Helgason wrote:
>> I was wondering if PGSQL has a function similar to binary_checksum() of
>> MS SQL Server 2000. It is pretty handy when it comes to comparing rows of
>> data instead of having to write long boolean expressions.
>> binary_checksum() takes a list of fields and returns an integer value
>> which summarizes the row content.
>
> You could use the md5() function, such as:
>
> select md5(foo) from bar where baz = 2;


Looks like md5() takes only a string. I need to pass a list of fields
instead. I was looking at the documentation and I think I can write
something like:

(field1, field2, ..., fieldn) = (expr1, expr2, ..., exprn)

This way one operator will check all the fields for equality.
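For what it's worth, PostgreSQL's row-value constructors support exactly this; a minimal sketch (the table and column names here are hypothetical):

```sql
-- Compare whole rows field by field with a single operator.
-- Note: if any field is NULL, the comparison yields NULL, not true,
-- so this is not a NULL-safe equality test.
SELECT *
FROM orders_a a
JOIN orders_b b
  ON (a.f1, a.f2, a.f3) = (b.f1, b.f2, b.f3);
```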

Fed

Nov 23 '05 #4

>> You could use the md5() function, such as:
>>
>> select md5(foo) from bar where baz = 2;
>
> Looks like md5() takes only a string. I need to pass a list of fields
> instead. I was looking at the documentation and I think I can write
> something like:
>
> (field1, field2, ..., fieldn) = (expr1, expr2, ..., exprn)
>
> This way one operator will check all the fields for equality.


Maybe I am missing what you are saying, but you can md5() the data
column. So you could do:

select one, two from foo where md5(one) = 'e4da3b7fbbce2345d7772b0674a318d5';

or

select one, two from foo where md5(one) = $1;

or whatever, for example.

Sincerely,

Joshua D. Drake

Nov 23 '05 #5

On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <da***@uti.is> wrote:
> I was wondering if PGSQL has a function similar to binary_checksum() of
> MS SQL Server 2000. It is pretty handy when it comes to comparing rows of
> data instead of having to write long boolean expressions.
> binary_checksum() takes a list of fields and returns an integer value
> which summarizes the row content.


As noted, you can use the md5(text) function with the || (concat) operator.
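For instance, a sketch against a hypothetical table t — coalesce() guards the NULLs (|| with a NULL yields NULL), and a delimiter keeps adjacent fields from running together:

```sql
-- One checksum per row, built from several fields.
-- ('ab','c') and ('a','bc') stay distinct thanks to the '|' separator.
SELECT md5(coalesce(f1::text, '') || '|' ||
           coalesce(f2::text, '') || '|' ||
           coalesce(f3::text, '')) AS row_checksum
FROM t;
```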
> On a similar note, I've found myself wanting an extended '=' operator
> meaning
> (a = b or (a is null and b is null))

Setting 'transform_null_equals' to true in postgresql.conf should do
what you want.

--miker
> same goal of course, for more general comparisons...
>
> d.




Nov 23 '05 #6


On Mon, 27 Sep 2004, Mike Rylander wrote:
> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <da***@uti.is> wrote:
>> On a similar note, I've found myself wanting an extended '=' operator
>> meaning
>> (a = b or (a is null and b is null))
>
> Setting 'transform_null_equals' to true in postgresql.conf should do
> what you want.


Unfortunately, it probably won't. That only changes the explicit token
sequence "= NULL" into an "IS NULL"; it won't help if you're doing a = b
where a or b may be NULL.

The original does appear to be equivalent to "not(a is distinct from b)",
although I'm not sure that's necessarily easier to use than the above.
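Spelled out (newer PostgreSQL releases also accept the positive form a IS NOT DISTINCT FROM b directly; table t is hypothetical):

```sql
-- NULL-safe equality: true when both sides are NULL,
-- false when exactly one is NULL, ordinary = otherwise.
SELECT * FROM t WHERE NOT (a IS DISTINCT FROM b);
```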


Nov 23 '05 #7


Stephan Szabo <ss****@megazone.bigpanda.com> writes:
> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <da***@uti.is> wrote:
>> On a similar note, I've found myself wanting an extended '=' operator
>> meaning
>> (a = b or (a is null and b is null))
>
> The original does appear to be equivalent to "not(a is distinct from b)",
> although I'm not sure that's necessarily easier to use than the above.


I often do things like "coalesce(a,0) = coalesce(b,0)".
(Or whatever value you know won't appear)

Though for pretty small values of "often". It always makes me think twice
about my data model when I find myself doing this. But there are definitely
still cases where it's useful and as clean as anything else I could think of.
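As a sketch of the trick, assuming 0 can never occur as real data in either column (that assumption is the whole trick):

```sql
-- Map NULL on both sides to the same sentinel before comparing.
-- Safe only if 0 is provably absent from a and b.
SELECT * FROM t WHERE coalesce(a, 0) = coalesce(b, 0);
```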

--
greg

Nov 23 '05 #8

Even simpler: COALESCE( a = b, a IS NULL AND b IS NULL )

-- Dean
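To unpack how that evaluates: when a = b yields true or false it is used as-is; when it yields NULL (at least one side is NULL), COALESCE falls through to the both-NULL check (table t is hypothetical):

```sql
-- COALESCE(a = b, a IS NULL AND b IS NULL), case by case:
--   a=1,    b=1    -> true   (a = b is true)
--   a=1,    b=2    -> false  (a = b is false)
--   a=1,    b=NULL -> false  (a = b is NULL; fallback is false)
--   a=NULL, b=NULL -> true   (a = b is NULL; fallback is true)
SELECT COALESCE(a = b, a IS NULL AND b IS NULL) AS null_safe_eq
FROM t;
```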

Greg Stark wrote on 2004-09-27 08:17:
> Stephan Szabo <ss****@megazone.bigpanda.com> writes:
>> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <da***@uti.is> wrote:
>>> On a similar note, I've found myself wanting an extended '=' operator
>>> meaning
>>> (a = b or (a is null and b is null))
>>
>> The original does appear to be equivalent to "not(a is distinct from b)",
>> although I'm not sure that's necessarily easier to use than the above.
>
> I often do things like "coalesce(a,0) = coalesce(b,0)".
> (Or whatever value you know won't appear)
>
> Though for pretty small values of "often". It always makes me think twice
> about my data model when I find myself doing this. But there are definitely
> still cases where it's useful and as clean as anything else I could think of.

Nov 23 '05 #9

On 27. sep 2004, at 22:08, Dean Gibson (DB Administrator) wrote:
> Greg Stark wrote on 2004-09-27 08:17:
>> Stephan Szabo <ss****@megazone.bigpanda.com> writes:
>>>> On a similar note, I've found myself wanting an extended '=' operator
>>>> meaning
>>>> (a = b or (a is null and b is null))
>>>
>>> The original does appear to be equivalent to "not(a is distinct from b)",
>>> although I'm not sure that's necessarily easier to use than the above.
>>
>> I often do things like "coalesce(a,0) = coalesce(b,0)".
>> (Or whatever value you know won't appear)
>
> Even simpler: COALESCE( a = b, a IS NULL AND b IS NULL )


I'm not quite sure what is being accomplished here... My original
expression wasn't that bad, just clunky. I'd prefer a === b or (a
samevalue b), but the above just complicates matters. Also, a 'set'
command outside the expression goes completely against the idea that
certain fields have 'null' as a legal, comparable value, while others
do not.

Anyway, idle speculation :)

d.

Nov 23 '05 #10

On Mon, 27 Sep 2004, David Helgason wrote:
> On 27. sep 2004, at 22:08, Dean Gibson (DB Administrator) wrote:
>> Greg Stark wrote on 2004-09-27 08:17:
>>> Stephan Szabo <ss****@megazone.bigpanda.com> writes:
>>>>> On a similar note, I've found myself wanting an extended '=' operator
>>>>> meaning
>>>>> (a = b or (a is null and b is null))
>>>>
>>>> The original does appear to be equivalent to "not(a is distinct from b)",
>>>> although I'm not sure that's necessarily easier to use than the above.
>>>
>>> I often do things like "coalesce(a,0) = coalesce(b,0)".
>>> (Or whatever value you know won't appear)
>>
>> Even simpler: COALESCE( a = b, a IS NULL AND b IS NULL )
>
> I'm not quite sure what is being accomplished here... My original
> expression wasn't that bad, just clunky. I'd prefer a === b or (a
> samevalue b), but the above just complicates matters. Also, a 'set'
> command outside the expression goes completely against the idea that
> certain fields have 'null' as a legal, comparable value, while others
> do not.
>
> Anyway, idle speculation :)
>
> d.


> (a = b or (a is null and b is null))

that raises a flag for me. It seems that NULL is used as a special value,
which it is not. NULL just means 'unknown', nothing more, nothing less.
That's why any boolean expression involving a NULL is NULL.

'unknown' when compared to anything else just gives 'unknown'.
It means: "I can't tell whether the two expressions are the same, because
I don't know the value of one of them."
Note that that's different from "I know they're different".
They _could_ be equal, we just don't know.

'unknown' compared to 'unknown' gives of course 'unknown', since it's
just a special case of the above. Not knowing the value of both the
expressions doesn't help much. The answer can never be 'I know they
are the same.' Not knowing one is enough to say you don't know the
result of the comparison.

I can hardly imagine why you may want to select all rows that you
are certain have equal values, plus others that have potentially
different values because they are both unknown. It smells like a
design problem.
BTW,

coalesce(a,0) = coalesce(b,0)

is wrong, since it assumes 0 is a special value, never used in the
table. If so, it's better to use it from the start instead of NULL for
those special rows. That expression is true for the following rows:

 a | b
---+---
 1 | 1
 2 | 2
   |
 0 |
   | 0

The last two rows are wrongly selected.

coalesce(a = b, a is null and b is null)

is correct, and maybe slightly better than the original

(a = b) or (a is null and b is null)

if the implementation is smart enough to evaluate its arguments only
when needed. The or operator needs to evaluate the right side when
the left side is either false or null, COALESCE only when it's null.
I think the docs mention that.

..TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Co*****@ESI.it


Nov 23 '05 #11


Marco Colombo <pg***@esiway.net> writes:
>> (a = b or (a is null and b is null))
>
> that raises a flag for me. It seems that NULL is used as a special value,
> which it is not.
Well, as I said, it raised a flag for me too. However, it's not good to be too
dogmatic about things. General rules are useful guiding principles but you
have to recognize when it's worth it to break them. We don't know enough about
his problem to say another approach would be any better.

For example, in one application I have a table that *does* have "unknown"
values. However I do need to look up records that match criteria including
having "unknown" values in specific positions. For most queries using NULL is
convenient and it's perfectly appropriate. But I have queries like this user
does and I use coalesce since I find the resulting expression much clearer
than using the three-way logical expression above.

Incidentally, coalesce(a,0)=coalesce(b,0) has the advantage over all the other
suggestions that you can build an index on coalesce(a,0) and/or coalesce(b,0)
and use them for the join or for individual record lookups.
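The index Greg has in mind would be a functional (expression) index; a sketch, again under the assumption that 0 is a safe sentinel and that t is a hypothetical table:

```sql
-- An expression index the planner can use for the coalesce-based
-- join, or for individual lookups such as:
--   SELECT * FROM t WHERE coalesce(a, 0) = 5;
CREATE INDEX t_a_coalesce_idx ON t ((coalesce(a, 0)));
```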
> BTW,
>
> coalesce(a,0) = coalesce(b,0)
>
> is wrong, since it assumes 0 is a special value, never used in the
> table. If so, it's better to use it from the start instead of NULL for
> those special rows.
I specifically said you had to use a special value in my suggestion. Saying
something is "wrong" when it does what's needed just because it violates some
abstract design principle is just short-sighted.

Using 0 in the table might violate unique constraints or foreign key
constraints. I try to avoid having a single quirky table propagate its
quirkiness to the rest of the system.

For example, creating a bogus "0" record in some other table just to satisfy
the foreign key constraint then having the rest of the application have to
work around this bogus record results in a much less workable system than
simply using NULL instead of 0 for the special value.
> coalesce(a = b, a is null and b is null)
>
> is correct, and maybe slightly better than the original
>
> (a = b) or (a is null and b is null)
>
> if the implementation is smart enough to evaluate its arguments only
> when needed. The or operator needs to evaluate the right side when
> the left side is either false or null, COALESCE only when it's null.
> I think the docs mention that.


Actually it's only the latter expression that will be able to avoid evaluating
the extra expression, not the coalesce example. In any case the time to
evaluate the "a is null and b is null" part is negligible. And the fact that
neither can use any indexes is the only relevant performance question. It's
possible that's not a concern, but if it is they both lose.

That's one possible argument in favour of a === operator. It would be easy (I
think?) to make === use a btree index without even having to build a
functional index like with coalesce(a,0).
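Such an operator can in fact be rolled by hand, indexing questions aside. A hypothetical sketch for int4 on a reasonably recent PostgreSQL (the function and operator names are made up):

```sql
-- NULL-safe equality as a user-defined operator.
-- The default CALLED ON NULL INPUT behavior lets the function
-- see NULL arguments instead of short-circuiting to NULL.
CREATE FUNCTION int4_samevalue(int4, int4) RETURNS boolean AS $$
    SELECT coalesce($1 = $2, $1 IS NULL AND $2 IS NULL);
$$ LANGUAGE sql;

CREATE OPERATOR === (
    LEFTARG   = int4,
    RIGHTARG  = int4,
    PROCEDURE = int4_samevalue
);

-- Usage: SELECT * FROM t WHERE a === b;
```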

--
greg

Nov 23 '05 #12

On Tue, 28 Sep 2004, Greg Stark wrote:
> Marco Colombo <pg***@esiway.net> writes:
>>> (a = b or (a is null and b is null))
>>
>> that raises a flag for me. It seems that NULL is used as a special value,
>> which it is not.
>
> Well, as I said, it raised a flag for me too. However, it's not good to be too
> dogmatic about things. General rules are useful guiding principles but you
> have to recognize when it's worth it to break them. We don't know enough about
> his problem to say another approach would be any better.


The way SQL treats NULLs in boolean expressions is not a "general rule".
It's just the way it is. I'm being pragmatic, not dogmatic.
> For example, in one application I have a table that *does* have "unknown"
> values. However I do need to look up records that match criteria including
> having "unknown" values in specific positions. For most queries using NULL is
> convenient and it's perfectly appropriate. But I have queries like this user
> does and I use coalesce since I find the resulting expression much clearer
> than using the three-way logical expression above.
Using special values is a mistake, but I agree that's being dogmatic.

Using NULL as a special value is a way worse mistake. Three-way logic
is there, and there's nothing you can do about that. NULL can't be a value.
That's why you can't use it in comparisons.

As for the "dogma" part, I bet you're using special values only to
model states, not values, and are using the same columns you use
to model values. Just add another column, life will be better.
This has nothing to do with NULLs, BTW.
> Incidentally, coalesce(a,0)=coalesce(b,0) has the advantage over all the other
> suggestions that you can build an index on coalesce(a,0) and/or coalesce(b,0)
> and use them for the join or for individual record lookups.
>> BTW,
>>
>> coalesce(a,0) = coalesce(b,0)
>>
>> is wrong, since it assumes 0 is a special value, never used in the
>> table. If so, it's better to use it from the start instead of NULL for
>> those special rows.
> I specifically said you had to use a special value in my suggestion. Saying
> something is "wrong" when it does what's needed just because it violates some
> abstract design principle is just short-sighted.
>
> Using 0 in the table might violate unique constraints or foreign key
> constraints. I try to avoid having a single quirky table propagate its
> quirkiness to the rest of the system.


_Unique_ constraints? I don't get this. 0 has no special meaning.
1, -1, 1000000 might violate unique constraints as well. Any value might.
What's the point here? Same goes for foreign key constraints.
Actually same goes for _any_ constraint.

I agree that using a value you _know_ is invalid due to some constraints
allows you to do the "coalesce trick" safely. But this assumes there is
at least _one_ invalid value. This is not true in general. In general,

coalesce(a,0) = coalesce(b,0)

is not the same as

(a = b) or (a is null and b is null).

I've even provided a fine example. It can't be a general equivalence.
> For example, creating a bogus "0" record in some other table just to satisfy
> the foreign key constraint then having the rest of the application have to
> work around this bogus record results in a much less workable system than
> simply using NULL instead of 0 for the special value.
I'm lost here. I've never proposed to add 0 anywhere. I just wonder:
"what if someone does?". In order to use 0 in coalesce(a,0) = coalesce(b,0)
you have to make sure 0 is invalid for both a and b. If you knew -1 is
invalid instead, you would use coalesce(a,-1) = coalesce(b,-1).
But what if there's no invalid value?

It might seem a "dogmatic" question, but my point is: why bother?
Just get the model right. Mapping NULLs to 0 or -1 or whatever is
meaningless, if the model is right.

That's again the whole point. What you're proposing sounds like this:
"I'd use 0 (or other special value) in the table, but that's not good
cause it may break some constraint. So I use NULLs in place of my
special value, and convert them at later time with coalesce(), so that
I can compare them again."

Now, that's abusing NULLs. There's a reason why NULLs don't break
foreign key constraints, and a reason why you can't compare them.
Your use of coalesce(), your functional index, is just placing a
brown paper bag over the real problem, which is a wrong model. Don't play
with NULLs, fix your schema. A badly designed model is not a matter
of "general principles". It's a plain real-world mistake. And it leads
to all kinds of acrobatic exercises in SQL to get the system to work.
One day you'll run into another query that you have a hard time writing
or making run efficiently.

NULLs are not meant to be "values" and should never be used as such.
Anytime you feel the need for an index on them, or to compare them,
you're treating them as values, and that's plain wrong. Not because
of any dogma, but because that's how the system works. The _only_
way to use them as values is to abuse them.
>> coalesce(a = b, a is null and b is null)
>>
>> is correct, and maybe slightly better than the original
>>
>> (a = b) or (a is null and b is null)
>>
>> if the implementation is smart enough to evaluate its arguments only
>> when needed. The or operator needs to evaluate the right side when
>> the left side is either false or null, COALESCE only when it's null.
>> I think the docs mention that.
>
> Actually it's only the latter expression that will be able to avoid evaluating
> the extra expression, not the coalesce example.

Why not?

> In any case the time to
> evaluate the "a is null and b is null" part is negligible. And the fact that
> neither can use any indexes is the only relevant performance question. It's
> possible that's not a concern, but if it is they both lose.

We agree here. I wrote "maybe slightly better".
> That's one possible argument in favour of a === operator. It would be easy (I
> think?) to make === use a btree index without even having to build a
> functional index like with coalesce(a,0).


A new, non-standard operator just to support badly modeled schemas?
No, thanks.

In C, you can store a (small) string in a pointer-type variable,
if you're careful enough and know what you're doing. It might work
for you, but it's still abusing the language: you can't
expect the language to _support_ that. And for sure, you can't sell
it as a general solution.

I'm not against abusing the db, or playing dirty tricks, if that fits
your needs. You're free to design your db the way you like and face
the cost of a careful design or of later SQL gymnastics. I'm fine,
as long as you don't ask for syntactic sugar to support those "features".

..TM.

Nov 23 '05 #13


Marco Colombo <pg***@esiway.net> writes:
> I'm not against abusing the db, or playing dirty tricks, if that fits
> your needs. You're free to design your db the way you like and face
> the cost of a careful design or of later SQL gymnastics. I'm fine,
> as long as you don't ask for syntactic sugar to support those "features".


My point is that you're making judgements about his schema without actually
knowing what you're talking about. For all we know his schema is entirely
reasonable and it's the query that has unusual requirements. Spouting general
design principles that may or may not apply as being iron-clad rules and
saying it's just wrong to break them is wilful blindness.

Database modelling is not something you can do by holding up some textbook and
screaming "third normal form" until the developers trying to get work done
cower in submission. There are lots of times when breaking or bending the
rules is entirely reasonable and blindly following them is simply a waste of
time.

For example, I have a table that uses NULLs to represent absent data. In 90%
of the queries three value logic is just exactly what's needed. In any case
they each have foreign key dependencies and having special values to represent
the absent values would be a major pain. It would require satisfying the
foreign keys with bogus records.

However I have queries that have to match provided data with other records,
including having missing data in the same position. For such a query I need to
break the usual model of three value logic and write something similar to what
this user needs. In my case no index would really be reasonable since there
are half a dozen such fields, but in general there's no reason an index
shouldn't be available for such cases.

--
greg

Nov 23 '05 #14
