sorting/merging in C

Osiris

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Jun 16 '07 #1

Subscribe Post Reply

1462

Malcolm McLean

"Osiris" <et**@hotmail.comwrote in message
news:aq********************************@4ax.com...

>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It isn't really a C question.
Chug through the A-records, and keep a list of keys of B records.
Sort the B record keys.
Chug through the B records and check if each record is on the key list.
If it is not, mark as "unreferenced".

If the list of keys fits in memory there is no problem.
If it is hundreds of millions rather than tens of millions then it might not
fit into a standard PC memory, and you've got to look at other strategies.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Jun 16 '07 #2

Richard Heathfield

Osiris said:

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

(Translation to C is left as an exercise.)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.

Jun 16 '07 #3

Richard Harter

On Sat, 16 Jun 2007 13:29:04 +0000, Richard Heathfield
<rj*@see.sig.invalidwrote:

>Osiris said:

>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall. Better is to sort both allB and do a compare of sorted
lists which is O(n).

There are two kinds of errors, type I and type II. Type I errors are
references to non-existent B records. Type II errors are B records without
errors. (OP does not discuss type I errors but we can get them for free.

The comparison loop has three possible terminations, (a) exhausting
"knownrefs" only, (b) exhausting "allB" only, or (c) exhausting both
simultaneously. In case (a) all remaining "allB" records are type II
errors; in case (b) all remaining "knownref" records are type I errors.
And, of course, case (c) is a successful termination with no error cleanup.
The inner comparison loop logic is something like this:

while (knownrefs.id < allB.id)
write knownrefs.idA to type I error file.
advance knownrefs
while (allB.id < knownrefs.id )
write allB.id to type II error file
advance allB
advance knownrefs
advance allB

An alternative would be to use a hash table for allB with a "seen" tag;
however despite the fact that hash tables are O(1) (modulo ignored
performance factors) using a hash table may be a poor choice due to cache
and page miss costs.

>
(Translation to C is left as an exercise.)

And a good one; comparison loops and merge loops are trickier to write than
is immediately apparent. Example: in the pseudo code above there is a
potential "run off the end" bug. How to avoid it is an exercise for the
programmer. :-)

Jun 16 '07 #4

Richard Heathfield

Richard Harter said:

On Sat, 16 Jun 2007 13:29:04 +0000, Richard Heathfield
<rj*@see.sig.invalidwrote:

>>This involves only one sort step. Can anyone improve on this?

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall.

Oh darn, I was writing Phase 4 as if both files were sorted. Yeah. Sort
that one too. Oops.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.

Jun 16 '07 #5

Joe Wright

Osiris wrote:

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,

How does an A record reference a B record? How many B records may be
referenced by an A record? How?

But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.

>
However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Answer my questions and I'll try to answer yours.

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Jun 16 '07 #6

Osiris

On Sat, 16 Jun 2007 18:22:56 -0400, Joe Wright
<jo********@comcast.netwrote:

>Osiris wrote:
>I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,

How does an A record reference a B record? How many B records may be
referenced by an A record? How?

just by record ID nr. A may have a B-record ID in it.
Most of the B records are referenced by A records.
Say, maybe 5% are not....

>But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.
>>

sure, but the initial question is : how many. Later I can correct.
That will be the easy part.
The issue here is performance. The job may take 10 hours

>However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Answer my questions and I'll try to answer yours.

Jun 17 '07 #7

Similar topics

Tree splitting/merging

by: William Ahern | last post by:

I'm looking for resources on splitting and merging XML trees. Specifically, on methods to pare large XML documents into smaller documents which can be merged later. Off of the top of my head, I...

.NET Framework

READING/MERGING XML DATA

by: Klatuu | last post by:

Whew, I've struggled my way through figuring out how to use XML to transport data..now I can imagine what having a baby is like :) But, I'm stuck now. I generate the XML (single table, no...

.NET Framework

Merging XML Documents (with ADO.NET?)

by: Patrick | last post by:

I have got 2 XML documents, both of which conform to the same XSD Schema, which define possible optional elements. The 2 XML documents contain 2 disjoint set of XML elements. What is the best,...

.NET Framework

Merging and deleting duplicate records

by: Emmett Power | last post by:

Hi, I have an Access table with a number of records which refer to the same person but with data in different fields. So for example the table would look like this: Name..............Field...

Microsoft Access / VBA

merging multiple fields from various tables in one word document

by: svdh | last post by:

I have posed a question last saturday and have advanced alot in the meantime. But I am still not there Problem is that I try to merging various fields from various tables in one document in Word...

Microsoft Access / VBA

Explanation of menu merging

by: Dave Taylor | last post by:

Is there a decent explanation of how menu merging with MDI forms work in VB.NET? I've read through the online help and it still seems that whenever I change menus around or whatever, it breaks...

Visual Basic .NET

need help in sorting

by: zoro | last post by:

there is a property in the book saying: block sorting on P processors using butcher's sort with merging comparators can sort N records in a bout (Log p)2/ 2 parallell sort. and i have question...

C / C++

sorting large file

by: bisuvious | last post by:

hi all, I am looking for a technique to sort data of a large file, say 2GB/ 4GB. Please suggest any algorithm, if you have any idea. thanks bisuvious

C / C++

Sorting two arrays with one call to sort()?

by: John | last post by:

I have two separate arrays A and B. The comparison function only depends on elements in A (keys are in A), but the swap() function needs to swap not only A,A, but also B,B ; whenever it swaps A,A....

C / C++

XML sorting

by: anilbisen | last post by:

I am working on an application which requires merging of XML documents and then reorder/sort them based on elements in the XML. This needs to be done in C++ and requires very high performance. I...

.NET Framework

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing