By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,413 Members | 1,023 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,413 IT Pros & Developers. It's quick & easy.

Datastructure design

P: n/a
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance
Nov 13 '05 #1
Share this Question
Share on Google+
6 Replies


P: n/a
On Tue, 18 Nov 2003 16:13:49 -0800, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.
Your question is off topic here, but I have to wonder a bit... You
have a billion records in a file? How interesting. You know there
are these things called databases. They're quite useful in fact.
I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data.


No kidding... Do a google on disk sort algorithms.

Nov 13 '05 #2

P: n/a
Mac
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

Nov 13 '05 #3

P: n/a
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks

"Mac" <fo*@bar.net> wrote in message news:<pa****************************@bar.net>...
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

Nov 13 '05 #4

P: n/a
Santosh wrote:

I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.


It's still off-topic, and C programmers are not ipso
facto experts on managing large data collections. Maybe
there will eventually be some C questions as you move ahead
with the implementation, but as yet no C question has been
asked.

<off-topic>

Start with Knuth "The Art of Computer Programming, Volume
III: Sorting and Searching," section 6.5 "Retrieval on
secondary keys." Follow bibliographical links as appropriate.

</off-topic>

--
Er*********@sun.com
Nov 13 '05 #5

P: n/a

"Santosh" <sa************@yahoo.com> wrote in message
news:a5**************************@posting.google.c om...
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks


Any sort implies a comparison of elements. Factor this
comparison out into a function (or several if different
comparisons are needed), and have your sort function call
the appropriate comparison function(s) (Do this by providing
a function pointer parameter for your sort function). This
is the way the standard library's 'qsort()' function works
(but 'qsort()' sorts an array, not a 'tree').

-Mike
Nov 13 '05 #6

P: n/a
sa************@yahoo.com (Santosh) wrote in message news:<a5**************************@posting.google. com>...
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.
You can achieve it by using SQL. Easy!

Thanks

"Mac" <fo*@bar.net> wrote in message news:<pa****************************@bar.net>...
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

Nov 13 '05 #7

This discussion thread is closed

Replies have been disabled for this discussion.