473,378 Members | 1,439 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,378 software developers and data experts.

Datastructure design

Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance
Nov 13 '05 #1
6 3636
On Tue, 18 Nov 2003 16:13:49 -0800, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.
Your question is off topic here, but I have to wonder a bit... You
have a billion records in a file? How interesting. You know there
are these things called databases. They're quite useful in fact.
I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data.


No kidding... Do a google on disk sort algorithms.

Nov 13 '05 #2
Mac
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

Nov 13 '05 #3
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks

"Mac" <fo*@bar.net> wrote in message news:<pa****************************@bar.net>...
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

Nov 13 '05 #4
Santosh wrote:

I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.


It's still off-topic, and C programmers are not ipso
facto experts on managing large data collections. Maybe
there will eventually be some C questions as you move ahead
with the implementation, but as yet no C question has been
asked.

<off-topic>

Start with Knuth "The Art of Computer Programming, Volume
III: Sorting and Searching," section 6.5 "Retrieval on
secondary keys." Follow bibliographical links as appropriate.

</off-topic>

--
Er*********@sun.com
Nov 13 '05 #5

"Santosh" <sa************@yahoo.com> wrote in message
news:a5**************************@posting.google.c om...
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks


Any sort implies a comparison of elements. Factor this
comparison out into a function (or several if different
comparisons are needed), and have your sort function call
the appropriate comparison function(s) (Do this by providing
a function pointer parameter for your sort function). This
is the way the standard library's 'qsort()' function works
(but 'qsort()' sorts an array, not a 'tree').

-Mike
Nov 13 '05 #6
sa************@yahoo.com (Santosh) wrote in message news:<a5**************************@posting.google. com>...
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.
You can achieve it by using SQL. Easy!

Thanks

"Mac" <fo*@bar.net> wrote in message news:<pa****************************@bar.net>...
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

Nov 13 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Prateek Basu | last post by:
Hi, Does there exist any resource (link or book) where one can find the solution for some datastructure and algorithm related problems. Pls provide the pointer. Thanks in advance.. Prateek
3
by: raju | last post by:
i am having a single linked list say 5 node are present, and one pointer that is pointing to 3rd node using this pointer only i have to come to the 1st node..... or how can i get back to the 1st...
7
by: Curious | last post by:
Hi, I have created my own data structure. Basically it is a two way 'stack', where I can push elements and pop elements both from top and bottom. I also included a counter to show the number...
15
by: A. Farber | last post by:
Hello, I'm programming a web game on OpenBSD, but am also trying to keep in runnable on Linux and Cygwin. I have a list of tables at which a player/kibitzer can sit down or create a new empty...
0
by: lumirb | last post by:
Hello, I would like to disccuss with the following issue: We want to create database in which we can dynamically add attributes. The problem is, we do not know the datastructure before. The...
6
by: jason | last post by:
Hello, I have a question about what kind of datastructure to use. I'm reading collumn based data in the form of: 10\t12\t9\t11\n 24\t11\t4\t10\n ..... I now have a structure which allows me...
21
by: Pieter | last post by:
Hi, I need some type of array/list/... In which I can store objects together with a unique key. The most important thing is performance: I will need to do the whole time searches in the list of...
2
Digital Don
by: Digital Don | last post by:
Hi, Iwas trying to look for a STL datastructure in C++ where I can store structure variables and then get out value from it by just comparing the value as follows; The structure variable:...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.