473,503 Members | 11,281 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

sorting large file

hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
thanks

bisuvious

Mar 25 '07 #1
12 10711
bisuvious wrote:
hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
Provide more details. How much memory do you have? Do you
need a stable sort or not? Is the file a text file or a
binary file?

If you have a lot of memory available, perhaps you could read
the entire file into memory and then use std::sort. If not,
it will get complicated.

HTH,
- J.
Mar 25 '07 #2
"bisuvious" writes:
I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.
Mar 25 '07 #3
On 25 Mar 2007 08:21:35 -0700, "bisuvious" wrote:
>I am looking for a technique to sort data of a large file, say 2GB/
4GB.
If you like old-fashioned 'modern' C++:
http://www.informatik.hs-bremen.de/%7Ebrey/stlbe.html
Chapter 10: External sorting

Good luck,
Roland Pibinger
Mar 25 '07 #4
On Mar 25, 7:21 pm, "bisuvious" <dhrubamaha...@gmail.comwrote:
hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.

thanks

bisuvious
Well What size are your records stored in the file?

Mar 25 '07 #5
On Mar 25, 11:21 pm, "bisuvious" <dhrubamaha...@gmail.comwrote:
hi all,

I am looking for a technique to sort data of alargefile, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.

thanks

bisuvious
Hi, here is more details of my problem...

File size will be of order of GB. Data to be sorted are of integer
type. Main memory is considered to be lower(256MB-512MB) than the file
size.

Theorytically We can use External memory merge sort procedure. But
phase II (merging) of this procedure seems to be problematic.

Please suggest in this regard.

thanks
bisu
Mar 26 '07 #6
On Mar 26, 10:16 am, "bisuvious" <dhrubamaha...@gmail.comwrote:
>
Theorytically We can use External memory merge sort procedure. But
phase II (merging) of this procedure seems to be problematic.
I agree.
If you are facing low disk space you can define some container type
whose elements stored on disk (file) locations rather than ram, then
use standard algorithms (e.g. std::sort) ,but this looks too slow
and I am worried about too many times of writing to disk.

osmium wrote:
>
Break the file into big "chunks"
This is good: If your using 32-bit ints then we are dealing with the
range [0,4G-1] of integers you can divide this range into a few (say
8) files each one catching numbers in its own range (say [0,512M-1] ,
[512M,1G-1] ... [3.5G,4G-1] respectively ) then delete/archive the
original file and merge the above files into one.I do not see too much
problem in merging any more.
Mar 26 '07 #7
On Mar 25, 6:04 pm, "osmium" <r124c4u...@comcast.netwrote:
"bisuvious" writes:
I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.

Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.
I Got a Q' about the way you proposed.
If he will divide the data into big chuncks then the size of two
chuncks should be less than
the size of RAM so the merge method wont crush am i right? (lets call
it size N)
But ... after merging two files you got to merge the big chunk (2*N)
with a small chunk (N)... and by that exceed the RAM size... at the
end you will merge a file in the size_of (ORIGINAL_SIZE - N) with a
file in the size_of (N)...
isnt that equivalent to sorting the original file?

Mar 27 '07 #8
On Mar 26, 11:01 pm, "ManicQin" <Manic...@gmail.comwrote:
On Mar 25, 6:04 pm, "osmium" <r124c4u...@comcast.netwrote:
Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.

I Got a Q' about the way you proposed.
If he will divide the data into big chuncks then the size of two
chuncks should be less than
the size of RAM so the merge method wont crush am i right? (lets call
it size N)
But ... after merging two files you got to merge the big chunk (2*N)
with a small chunk (N)... and by that exceed the RAM size... at the
end you will merge a file in the size_of (ORIGINAL_SIZE - N) with a
file in the size_of (N)...
isnt that equivalent to sorting the original file?
If the individual chunks are sorted internally, then you don't have to
load the entire chunk into memory -- just the next element out of that
chunk. Then you take the loaded element with the least value, append
it to output, and replace it with the next value from the chunk it
came from. If you reach the end of one chunk before the others, stop
considering values from it.

Mar 27 '07 #9
"ManicQin" writes:
>Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before
starting
the endeavor.
I Got a Q' about the way you proposed.
If he will divide the data into big chuncks then the size of two
chuncks should be less than
the size of RAM so the merge method wont crush am i right? (lets call
it size N)
But ... after merging two files you got to merge the big chunk (2*N)
with a small chunk (N)... and by that exceed the RAM size... at the
end you will merge a file in the size_of (ORIGINAL_SIZE - N) with a
file in the size_of (N)...
isnt that equivalent to sorting the original file?
Let's say you have 8 chunks and you have reached the point where they have
all been sorted into 8 files. You open these 8 files for reading plus
another one, for writing, which will be a final, sorted, file. The OS lets
you see a little "window" in RAM into each file, not the entire file. One
of those 8 files has the smallest record of the original, unsorted, file.
Write it to the output file. Now find the next one. Repeat until done.

Note that when making the "chunks", you have to observe record boundaries,
the "seam" can't be just any old place.
Mar 27 '07 #10
In message <11*********************@n76g2000hsh.googlegroups. com>,
bisuvious <dh***********@gmail.comwrites
>hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
Don't reinvent the wheel: read chapter 5 of Knuth's "Art of Computer
Programming" before you write a line of code.

--
Richard Herring
Apr 4 '07 #11
4N
did you try b-trees?
Apr 5 '07 #12
4N wrote:
did you try b-trees?
Not an optimal solution for "sorting" a file that does not fit into memory.

Probably the best solution for this is the "merge sort" or some variant.

http://en.wikipedia.org/wiki/Merge_sort

If, however, the file does fit in memory, then reading it fully into
memory and sorting it using std::sort or similar would be just about as
fast as you will get.
Apr 5 '07 #13

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

13
4608
by: Paul | last post by:
Hi all I have a sorting problem, but my experience with Python is rather limited (3 days), so I am running this by the list first. I have a large database of 15GB, consisting of 10^8 entries...
22
4116
by: mike | last post by:
If I had a date in the format "01-Jan-05" it does not sort properly with my sort routine: function compareDate(a,b) { var date_a = new Date(a); var date_b = new Date(b); if (date_a < date_b)...
25
2190
by: Dan Stromberg | last post by:
Hi folks. Python appears to have a good sort method, but when sorting array elements that are very large, and hence have very expensive compares, is there some sort of already-available sort...
7
1763
by: Cerebrus99 | last post by:
Hi all, I am confused about how to sort an XML file. I mean how to *actually* sort the data in the physical file, not how to display sorted data. I am using a large XML file as a back-end...
0
2557
by: SvenMathijssen | last post by:
Hi, I've been wrestling with a problem for some time that ought to be fairly simple, but turns out to be very difficult for me to solve. Maybe someone here knows the answer. What I try to do is...
13
2782
by: JJ | last post by:
I have a need to input a large tab delimited text file, which I will parse to check it has the expected columns, before allowing the user to submit it to the database. The user may paste the file...
2
2684
by: Simon | last post by:
Hi, I have a table I am populating with data from a DB, and have it sortable on the client side. When the table exceeds around 300 rows, I start running into problems with rendering and sorting...
7
1905
by: Steve Bergman | last post by:
I'm involved in a discussion thread in which it has been stated that: """ Anything written in a language that is 20x slower (Perl, Python, PHP) than C/C++ should be instantly rejected by users...
5
4902
by: jrod11 | last post by:
hi, I found a jquery html table sorting code i have implemented. I am trying to figure out how to edit how many colums there are, but every time i remove code that I think controls how many colums...
0
7207
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
7095
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
7294
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
7361
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
7470
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
4693
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
0
3183
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
3173
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
403
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.