By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
424,963 Members | 920 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 424,963 IT Pros & Developers. It's quick & easy.

sorting/merging in C

P: n/a
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
Jun 16 '07 #1
Share this Question
Share on Google+
6 Replies


P: n/a

"Osiris" <et**@hotmail.comwrote in message
news:aq********************************@4ax.com...
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
It isn't really a C question.
Chug through the A-records, and keep a list of keys of B records.
Sort the B record keys.
Chug through the B records and check if each record is on the key list.
If it is not, mark as "unreferenced".

If the list of keys fits in memory there is no problem.
If it is hundreds of millions rather than tens of millions then it might not
fit into a standard PC memory, and you've got to look at other strategies.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm
Jun 16 '07 #2

P: n/a
Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

(Translation to C is left as an exercise.)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Jun 16 '07 #3

P: n/a
On Sat, 16 Jun 2007 13:29:04 +0000, Richard Heathfield
<rj*@see.sig.invalidwrote:
>Osiris said:
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?
Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall. Better is to sort both allB and do a compare of sorted
lists which is O(n).

There are two kinds of errors, type I and type II. Type I errors are
references to non-existent B records. Type II errors are B records without
errors. (OP does not discuss type I errors but we can get them for free.

The comparison loop has three possible terminations, (a) exhausting
"knownrefs" only, (b) exhausting "allB" only, or (c) exhausting both
simultaneously. In case (a) all remaining "allB" records are type II
errors; in case (b) all remaining "knownref" records are type I errors.
And, of course, case (c) is a successful termination with no error cleanup.
The inner comparison loop logic is something like this:

while (knownrefs.id < allB.id)
write knownrefs.idA to type I error file.
advance knownrefs
while (allB.id < knownrefs.id )
write allB.id to type II error file
advance allB
advance knownrefs
advance allB

An alternative would be to use a hash table for allB with a "seen" tag;
however despite the fact that hash tables are O(1) (modulo ignored
performance factors) using a hash table may be a poor choice due to cache
and page miss costs.
>
(Translation to C is left as an exercise.)
And a good one; comparison loops and merge loops are trickier to write than
is immediately apparent. Example: in the pseudo code above there is a
potential "run off the end" bug. How to avoid it is an exercise for the
programmer. :-)
Jun 16 '07 #4

P: n/a
Richard Harter said:
On Sat, 16 Jun 2007 13:29:04 +0000, Richard Heathfield
<rj*@see.sig.invalidwrote:
>>This involves only one sort step. Can anyone improve on this?

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall.
Oh darn, I was writing Phase 4 as if both files were sorted. Yeah. Sort
that one too. Oops.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Jun 16 '07 #5

P: n/a
Osiris wrote:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
How does an A record reference a B record? How many B records may be
referenced by an A record? How?
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.
I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.
>
However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
Answer my questions and I'll try to answer yours.

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Jun 16 '07 #6

P: n/a
On Sat, 16 Jun 2007 18:22:56 -0400, Joe Wright
<jo********@comcast.netwrote:
>Osiris wrote:
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,

How does an A record reference a B record? How many B records may be
referenced by an A record? How?
just by record ID nr. A may have a B-record ID in it.
Most of the B records are referenced by A records.
Say, maybe 5% are not....

>But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.
>>
sure, but the initial question is : how many. Later I can correct.
That will be the easy part.
The issue here is performance. The job may take 10 hours
>However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Answer my questions and I'll try to answer yours.
Jun 17 '07 #7

This discussion thread is closed

Replies have been disabled for this discussion.