473,387 Members | 1,882 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

sorting/merging in C

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
Jun 16 '07 #1
6 1462

"Osiris" <et**@hotmail.comwrote in message
news:aq********************************@4ax.com...
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
It isn't really a C question.
Chug through the A-records, and keep a list of keys of B records.
Sort the B record keys.
Chug through the B records and check if each record is on the key list.
If it is not, mark as "unreferenced".

If the list of keys fits in memory there is no problem.
If it is hundreds of millions rather than tens of millions then it might not
fit into a standard PC memory, and you've got to look at other strategies.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm
Jun 16 '07 #2
Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

(Translation to C is left as an exercise.)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Jun 16 '07 #3
On Sat, 16 Jun 2007 13:29:04 +0000, Richard Heathfield
<rj*@see.sig.invalidwrote:
>Osiris said:
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?
Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall. Better is to sort both allB and do a compare of sorted
lists which is O(n).

There are two kinds of errors, type I and type II. Type I errors are
references to non-existent B records. Type II errors are B records without
errors. (OP does not discuss type I errors but we can get them for free.

The comparison loop has three possible terminations, (a) exhausting
"knownrefs" only, (b) exhausting "allB" only, or (c) exhausting both
simultaneously. In case (a) all remaining "allB" records are type II
errors; in case (b) all remaining "knownref" records are type I errors.
And, of course, case (c) is a successful termination with no error cleanup.
The inner comparison loop logic is something like this:

while (knownrefs.id < allB.id)
write knownrefs.idA to type I error file.
advance knownrefs
while (allB.id < knownrefs.id )
write allB.id to type II error file
advance allB
advance knownrefs
advance allB

An alternative would be to use a hash table for allB with a "seen" tag;
however despite the fact that hash tables are O(1) (modulo ignored
performance factors) using a hash table may be a poor choice due to cache
and page miss costs.
>
(Translation to C is left as an exercise.)
And a good one; comparison loops and merge loops are trickier to write than
is immediately apparent. Example: in the pseudo code above there is a
potential "run off the end" bug. How to avoid it is an exercise for the
programmer. :-)
Jun 16 '07 #4
Richard Harter said:
On Sat, 16 Jun 2007 13:29:04 +0000, Richard Heathfield
<rj*@see.sig.invalidwrote:
>>This involves only one sort step. Can anyone improve on this?

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall.
Oh darn, I was writing Phase 4 as if both files were sorted. Yeah. Sort
that one too. Oops.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Jun 16 '07 #5
Osiris wrote:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
How does an A record reference a B record? How many B records may be
referenced by an A record? How?
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.
I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.
>
However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
Answer my questions and I'll try to answer yours.

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Jun 16 '07 #6
On Sat, 16 Jun 2007 18:22:56 -0400, Joe Wright
<jo********@comcast.netwrote:
>Osiris wrote:
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,

How does an A record reference a B record? How many B records may be
referenced by an A record? How?
just by record ID nr. A may have a B-record ID in it.
Most of the B records are referenced by A records.
Say, maybe 5% are not....

>But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.
>>
sure, but the initial question is : how many. Later I can correct.
That will be the easy part.
The issue here is performance. The job may take 10 hours
>However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Answer my questions and I'll try to answer yours.
Jun 17 '07 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: William Ahern | last post by:
I'm looking for resources on splitting and merging XML trees. Specifically, on methods to pare large XML documents into smaller documents which can be merged later. Off of the top of my head, I...
2
by: Klatuu | last post by:
Whew, I've struggled my way through figuring out how to use XML to transport data..now I can imagine what having a baby is like :) But, I'm stuck now. I generate the XML (single table, no...
3
by: Patrick | last post by:
I have got 2 XML documents, both of which conform to the same XSD Schema, which define possible optional elements. The 2 XML documents contain 2 disjoint set of XML elements. What is the best,...
2
by: Emmett Power | last post by:
Hi, I have an Access table with a number of records which refer to the same person but with data in different fields. So for example the table would look like this: Name..............Field...
1
by: svdh | last post by:
I have posed a question last saturday and have advanced alot in the meantime. But I am still not there Problem is that I try to merging various fields from various tables in one document in Word...
2
by: Dave Taylor | last post by:
Is there a decent explanation of how menu merging with MDI forms work in VB.NET? I've read through the online help and it still seems that whenever I change menus around or whatever, it breaks...
1
by: zoro | last post by:
there is a property in the book saying: block sorting on P processors using butcher's sort with merging comparators can sort N records in a bout (Log p)2/ 2 parallell sort. and i have question...
12
by: bisuvious | last post by:
hi all, I am looking for a technique to sort data of a large file, say 2GB/ 4GB. Please suggest any algorithm, if you have any idea. thanks bisuvious
10
by: John | last post by:
I have two separate arrays A and B. The comparison function only depends on elements in A (keys are in A), but the swap() function needs to swap not only A,A, but also B,B ; whenever it swaps A,A....
2
by: anilbisen | last post by:
I am working on an application which requires merging of XML documents and then reorder/sort them based on elements in the XML. This needs to be done in C++ and requires very high performance. I...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.