473,396 Members | 1,892 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

Sorting Large File (Code/Performance)

Hello all,

I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
to sort based on first two characters.

I'd greatly appreciate if someone can post sample code that can help
me do this.

Also, any ideas on approximately how long is the sort process going to
take (XP, Dual Core 2.0GHz w/2GB RAM).

Cheers,

Ira

Jan 24 '08 #1
12 3908
Ir*******@gmail.com writes:
I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
to sort based on first two characters.

I'd greatly appreciate if someone can post sample code that can help
me do this.
Use the unix sort command:

sort inputfile -o outputfile

I think there is a cygwin port.
Also, any ideas on approximately how long is the sort process going to
take (XP, Dual Core 2.0GHz w/2GB RAM).
Eh, unix sort would probably take a while, somewhere between 15
minutes and an hour. If you only have to do it once it's not worth
writing special purpose code. If you have to do it a lot, get some
more ram for that box, suck the file into memory and do a radix sort.
Jan 24 '08 #2
Ir*******@gmail.com wrote:
Hello all,

I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
to sort based on first two characters.
Given those numbers, the average number of characters per line is
less than 2. Please check.

John Nagle
Jan 24 '08 #3
Thanks to all who replied. It's very appreciated.

Yes, I had to doublecheck line counts and the number of lines is ~16
million (insetead of stated 1.6B).

Also:
>What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
The file is UTF-8
Do the first two characters always belong to the ASCII subset?
Yes, first two always belong to ASCII subset
What are you going to do with it after it's sorted?
I need to isolate all lines that start with two characters (zz to be
particular)
Here's a start: http://docs.python.org/lib/typesseq-mutable.html
Google "GnuWin32" and see if their sort does what you want.
Will do, thanks for the tip.
If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
I am limited with resources. Unfortunately.

Cheers,

Ira
Jan 24 '08 #4
Ir*******@gmail.com wrote:
>What are you going to do with it after it's sorted?
I need to isolate all lines that start with two characters (zz to be
particular)
"Isolate" as in "extract"? Remove the rest?

Then why don't you extract the lines first, without sorting the file? (or sort
it afterwards if you still need to). That would heavily cut down your memory
footprint.

Stefan
Jan 24 '08 #5
Stefan Behnel wrote:
Ir*******@gmail.com wrote:
>>What are you going to do with it after it's sorted?
I need to isolate all lines that start with two characters (zz to be
particular)

"Isolate" as in "extract"? Remove the rest?

Then why don't you extract the lines first, without sorting the file? (or sort
it afterwards if you still need to). That would heavily cut down your memory
footprint.
Just for fun, this is what I meant:

for utf8_line in open(filename, 'rb'):
if utf8_line.startswith('zz'):
print utf8_line

Stefan
Jan 24 '08 #6
On Jan 25, 8:26 am, Ira.Ko...@gmail.com wrote:
I need to isolate all lines that start with two characters (zz to be
particular)
What does "isolate" mean to you? What does this have to do with
sorting? What do you actually want to do with (a) the lines starting
with "zz" (b) the other lines? What percentage of the lines start with
"zz"?

When looking at the GnuWin32 collection:
(a) "Unicode" is not relevant to your sort problem.
(b) grab yourself a wc and a grep while you're there -- they will help
with "how many lines" and "what percentage of lines" questions.

Jan 24 '08 #7
On Thursday 24 January 2008 20:56 John Nagle wrote:
Ir*******@gmail.com wrote:
>Hello all,

I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
to sort based on first two characters.

Given those numbers, the average number of characters per line is
less than 2. Please check.
which would be true if 1.599.999.999 had 2 chars and the rest of the lines
just one :)

(but yes that would be an interesting question how to sort a 1 character
line based on the first 2 of that line)

martin

--
http://noneisyours.marcher.name
http://feeds.feedburner.com/NoneIsYours

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Jan 24 '08 #8
Ir*******@gmail.com wrote:
Thanks to all who replied. It's very appreciated.

Yes, I had to double check line counts and the number of lines is ~16
million (instead of stated 1.6B).
OK, that's not bad at all.

You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
or Python's "sort" function.
- Thrash; in-memory sorts do very badly with virtual memory, but eventually
they finish. Might take many hours.
- Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
There's a free 30-day trial. It can probably sort your data
in about a minute.)
- Load the data into a database like MySQL and let it do the work.
This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself. Fan out the incoming file into
one file for each first letter, sort each subfile, merge the
results.

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.

John Nagle
Jan 25 '08 #9
John Nagle <na***@animats.comwrites:
- Get enough memory to do the sort with an in-memory sort, like
UNIX "sort" or Python's "sort" function.
Unix sort does external sorting when needed.
Jan 25 '08 #10
Paul Rubin wrote:
John Nagle <na***@animats.comwrites:
> - Get enough memory to do the sort with an in-memory sort, like
UNIX "sort" or Python's "sort" function.

Unix sort does external sorting when needed.
Ah, someone finally put that in. Good. I hadn't looked at "sort"'s manual
page in many years.

John Nagle
Jan 25 '08 #11
John Nagle <na***@animats.comwrites:
Unix sort does external sorting when needed.

Ah, someone finally put that in. Good. I hadn't looked at
"sort"'s manual page in many years.
Huh? It has been like that from the beginning. It HAD to be. Unix
was originally written on a PDP-11. The GNU version has also always
been external.
Jan 25 '08 #12
In article <b1**********************************@i7g2000prf.g ooglegroups.com>,
<Ir*******@gmail.comwrote:
>Thanks to all who replied. It's very appreciated.

Yes, I had to doublecheck line counts and the number of lines is ~16
million (insetead of stated 1.6B).

Also:
>>What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
The file is UTF-8
>Do the first two characters always belong to the ASCII subset?
Yes, first two always belong to ASCII subset
>What are you going to do with it after it's sorted?
I need to isolate all lines that start with two characters (zz to be
particular)
Like in?
grep '^zz' longfile aapje

You will have a hard time to beat that with python, in every respect.

<SNIP>
>
Cheers,

Ira
Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Feb 2 '08 #13

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: ohaya | last post by:
Hi, I'm a real newbie, but have been asked to try to fix a problem in one of our JSP pages that is suppose to read in a text file and display it. From my testing thus far, it appears this page...
24
by: Joerg Schuster | last post by:
Hello, I am looking for a method to "shuffle" the lines of a large file. I have a corpus of sorted and "uniqed" English sentences that has been produced with (1): (1) sort corpus | uniq >...
1
by: DJTB | last post by:
zodb-dev@zope.org] Hi, I'm having problems storing large amounts of objects in a ZODB. After committing changes to the database, elements are not cleared from memory. Since the number of...
0
by: pruebauno | last post by:
Hello all, I am having issues compiling Python with large file support. I tried forcing the configure script to add it but then it bombs in the make process. Any help will be appreciated. ...
7
by: Joseph | last post by:
Hi, I'm having bit of questions on recursive pointer. I have following code that supports upto 8K files but when i do a file like 12K i get a segment fault. I Know it is in this line of code. ...
3
by: chris chan | last post by:
I just want to create a large file with size of 80MB and add a char 1 at the end. Two method I has been used. i) seek to the End and write one char ii) open the file , and append it to the end ...
3
by: Joseph Geretz | last post by:
I'm working on an file transfer gateway using WSE with DIME for file attachments. Our goal is to replace our direct file repository access (via windows network folder sharre) with the Web Service...
12
by: bisuvious | last post by:
hi all, I am looking for a technique to sort data of a large file, say 2GB/ 4GB. Please suggest any algorithm, if you have any idea. thanks bisuvious
30
by: galiorenye | last post by:
Hi, Given this code: A** ppA = new A*; A *pA = NULL; for(int i = 0; i < 10; ++i) { pA = ppA; //do something with pA
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.