473,569 Members | 2,526 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

sorting large file

hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
thanks

bisuvious

Mar 25 '07 #1
12 10724
bisuvious wrote:
hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
Provide more details. How much memory do you have? Do you
need a stable sort or not? Is the file a text file or a
binary file?

If you have a lot of memory available, perhaps you could read
the entire file into memory and then use std::sort. If not,
it will get complicated.

HTH,
- J.
Mar 25 '07 #2
"bisuvious" writes:
I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.
Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.
Mar 25 '07 #3
On 25 Mar 2007 08:21:35 -0700, "bisuvious" wrote:
>I am looking for a technique to sort data of a large file, say 2GB/
4GB.
If you like old-fashioned 'modern' C++:
http://www.informatik.hs-bremen.de/%7Ebrey/stlbe.html
Chapter 10: External sorting

Good luck,
Roland Pibinger
Mar 25 '07 #4
On Mar 25, 7:21 pm, "bisuvious" <dhrubamaha...@ gmail.comwrote:
hi all,

I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.

thanks

bisuvious
Well What size are your records stored in the file?

Mar 25 '07 #5
On Mar 25, 11:21 pm, "bisuvious" <dhrubamaha...@ gmail.comwrote:
hi all,

I am looking for a technique to sort data of alargefile, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.

thanks

bisuvious
Hi, here is more details of my problem...

File size will be of order of GB. Data to be sorted are of integer
type. Main memory is considered to be lower(256MB-512MB) than the file
size.

Theorytically We can use External memory merge sort procedure. But
phase II (merging) of this procedure seems to be problematic.

Please suggest in this regard.

thanks
bisu
Mar 26 '07 #6
On Mar 26, 10:16 am, "bisuvious" <dhrubamaha...@ gmail.comwrote:
>
Theorytically We can use External memory merge sort procedure. But
phase II (merging) of this procedure seems to be problematic.
I agree.
If you are facing low disk space you can define some container type
whose elements stored on disk (file) locations rather than ram, then
use standard algorithms (e.g. std::sort) ,but this looks too slow
and I am worried about too many times of writing to disk.

osmium wrote:
>
Break the file into big "chunks"
This is good: If your using 32-bit ints then we are dealing with the
range [0,4G-1] of integers you can divide this range into a few (say
8) files each one catching numbers in its own range (say [0,512M-1] ,
[512M,1G-1] ... [3.5G,4G-1] respectively ) then delete/archive the
original file and merge the above files into one.I do not see too much
problem in merging any more.
Mar 26 '07 #7
On Mar 25, 6:04 pm, "osmium" <r124c4u...@com cast.netwrote:
"bisuvious" writes:
I am looking for a technique to sort data of a large file, say 2GB/
4GB.
Please suggest any algorithm, if you have any idea.

Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.
I Got a Q' about the way you proposed.
If he will divide the data into big chuncks then the size of two
chuncks should be less than
the size of RAM so the merge method wont crush am i right? (lets call
it size N)
But ... after merging two files you got to merge the big chunk (2*N)
with a small chunk (N)... and by that exceed the RAM size... at the
end you will merge a file in the size_of (ORIGINAL_SIZE - N) with a
file in the size_of (N)...
isnt that equivalent to sorting the original file?

Mar 27 '07 #8
On Mar 26, 11:01 pm, "ManicQin" <Manic...@gmail .comwrote:
On Mar 25, 6:04 pm, "osmium" <r124c4u...@com cast.netwrote:
Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.

I Got a Q' about the way you proposed.
If he will divide the data into big chuncks then the size of two
chuncks should be less than
the size of RAM so the merge method wont crush am i right? (lets call
it size N)
But ... after merging two files you got to merge the big chunk (2*N)
with a small chunk (N)... and by that exceed the RAM size... at the
end you will merge a file in the size_of (ORIGINAL_SIZE - N) with a
file in the size_of (N)...
isnt that equivalent to sorting the original file?
If the individual chunks are sorted internally, then you don't have to
load the entire chunk into memory -- just the next element out of that
chunk. Then you take the loaded element with the least value, append
it to output, and replace it with the next value from the chunk it
came from. If you reach the end of one chunk before the others, stop
considering values from it.

Mar 27 '07 #9
"ManicQin" writes:
>Break the file into big "chunks" Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before
starting
the endeavor.
I Got a Q' about the way you proposed.
If he will divide the data into big chuncks then the size of two
chuncks should be less than
the size of RAM so the merge method wont crush am i right? (lets call
it size N)
But ... after merging two files you got to merge the big chunk (2*N)
with a small chunk (N)... and by that exceed the RAM size... at the
end you will merge a file in the size_of (ORIGINAL_SIZE - N) with a
file in the size_of (N)...
isnt that equivalent to sorting the original file?
Let's say you have 8 chunks and you have reached the point where they have
all been sorted into 8 files. You open these 8 files for reading plus
another one, for writing, which will be a final, sorted, file. The OS lets
you see a little "window" in RAM into each file, not the entire file. One
of those 8 files has the smallest record of the original, unsorted, file.
Write it to the output file. Now find the next one. Repeat until done.

Note that when making the "chunks", you have to observe record boundaries,
the "seam" can't be just any old place.
Mar 27 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

13
4622
by: Paul | last post by:
Hi all I have a sorting problem, but my experience with Python is rather limited (3 days), so I am running this by the list first. I have a large database of 15GB, consisting of 10^8 entries of approximately 100 bytes each. I devised a relatively simple key map on my database, and I would like to order the database with respect to the...
22
4128
by: mike | last post by:
If I had a date in the format "01-Jan-05" it does not sort properly with my sort routine: function compareDate(a,b) { var date_a = new Date(a); var date_b = new Date(b); if (date_a < date_b) { return -1; } else
25
2198
by: Dan Stromberg | last post by:
Hi folks. Python appears to have a good sort method, but when sorting array elements that are very large, and hence have very expensive compares, is there some sort of already-available sort function that will merge like elements into a chain, so that they won't have to be recompared as many times? Thanks!
7
1771
by: Cerebrus99 | last post by:
Hi all, I am confused about how to sort an XML file. I mean how to *actually* sort the data in the physical file, not how to display sorted data. I am using a large XML file as a back-end database, and am making many inserts and updates using the XmlDocument class. But I need to make the XML file human readable too, and want to physically...
0
2567
by: SvenMathijssen | last post by:
Hi, I've been wrestling with a problem for some time that ought to be fairly simple, but turns out to be very difficult for me to solve. Maybe someone here knows the answer. What I try to do is sort the records in a plain-text index file based on certain columns. The index file consists of records and fields within the records. The...
13
2789
by: JJ | last post by:
I have a need to input a large tab delimited text file, which I will parse to check it has the expected columns, before allowing the user to submit it to the database. The user may paste the file into a textbox, or upload it (haven't decided yet). The problem I have is that the text file consists of around 3000 lines, and I want to display...
2
2691
by: Simon | last post by:
Hi, I have a table I am populating with data from a DB, and have it sortable on the client side. When the table exceeds around 300 rows, I start running into problems with rendering and sorting it. I started out using Prototype to help with the sorting, but it appears that this results in a lot of calls to _getElementsByXPath, which are...
7
1909
by: Steve Bergman | last post by:
I'm involved in a discussion thread in which it has been stated that: """ Anything written in a language that is 20x slower (Perl, Python, PHP) than C/C++ should be instantly rejected by users on those grounds alone. """ I've challenged someone to beat the snippet of code below in C, C++, or assembler, for reading in one million pairs...
5
4916
by: jrod11 | last post by:
hi, I found a jquery html table sorting code i have implemented. I am trying to figure out how to edit how many colums there are, but every time i remove code that I think controls how many colums there are, it crashes. There are currently 6 columns, and I only want 4. How do I remove the last two (discount and date)? Here is a link:...
0
7700
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7614
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
0
7924
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
8125
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
0
7974
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
0
5219
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert...
0
3653
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
0
3642
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
938
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.