Bytes IT Community

Working with Huge Text Files

Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size. Basically certain rows,
sorted by the first (primary) field maybe second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be
written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at
simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in
what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn

Jul 18 '05 #1
10 Replies



Lorn Davies wrote:
Hi there, I'm a Python newbie hoping for some direction in working
with text files that range from 100MB to 1G in size. [snip]


You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
    print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N']
"""

The csv module will bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.
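
Assuming it is sorted, the steps described above can be sketched roughly like this (a minimal sketch in current Python syntax; the file name is illustrative, and the per-field reformatting is left out):

```python
import csv

def split_by_key(path):
    """Split a CSV sorted by its first field into one file per key.

    Opens a new output file whenever the first field changes, closing
    the previous one.  Returns the list of files written."""
    written = []
    current_key = None
    out = None
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row[0] != current_key:        # key changed: switch files
                if out:
                    out.close()
                current_key = row[0]
                out = open(current_key + ".txt", "w")
                written.append(current_key + ".txt")
            out.write(", ".join(row[1:]) + "\n")
    if out:
        out.close()
    return written
```

The date/time/price manipulations would go on `row` just before the write.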

Jul 18 '05 #2

me********@aol.com wrote:
It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.


And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.
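
The append-mode variant is a small change to the loop: reopen the per-key file in append mode for every row, so the input order no longer matters (a minimal sketch in current Python syntax; names are illustrative, and note it pays an open/close per row):

```python
import csv

def split_unsorted(path):
    """Append-mode variant: correct even if the input is not sorted
    by the first field, since each row is appended to its key's file."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # "a" creates the file if needed and appends otherwise
            with open(row[0] + ".txt", "a") as out:
                out.write(", ".join(row[1:]) + "\n")
```

A practical middle ground is to cache the open file objects in a dictionary keyed by the first field.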
--
Michael Hoffman
Jul 18 '05 #3


me********@aol.com wrote:
Here's the example from the manual with your sample data in a file
named simple.csv: [snip]


Obviously, I meant "some.csv". Make sure the name in the program
matches the file you want to process, or pass the input file name
to the program as an argument.

Jul 18 '05 #4

Hi,

Lorn Davies wrote:
..... working with text files that range from 100MB to 1G in size.
.....
XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
.....


I've found that for working with simple large text files like this,
nothing beats the plain old built-in string operations. Using a parsing
library is convenient if the data format is complex, but otherwise it's
overkill.
In this particular case, even the csv module isn't much of an
advantage. I'd just use split.

The following code should do the job:

data_file = open('data.txt', 'r')
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
          'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
          'NOV':'11', 'DEC':'12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename+'.txt', 'w')
    fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
    fields[2] = fields[2].replace(':', '')
    fields[3] = fields[3].replace('.', '')
    print >>output_files[filename], ', '.join(fields[1:])
for filename in output_files:
    output_files[filename].close()
data_file.close()

Note that it does work with unsorted data - at the minor cost of
keeping all output files open till the end of the entire process.

Chirag Wazir
http://chirag.freeshell.org

Jul 18 '05 #5

I did some similar stuff way back about 12-15 years ago -- in 640k
MS-DOS with gigabyte files on 33 MHz machines. I got good performance,
able to bring up any record out of 10 million or so on the screen in a
couple of seconds (not using Python, but that should not make much
difference, maybe even some things in Python would make it work better.)

Even though my files were text, I read them as random-access binary
files. You need to be able to dive in at an arbitrary point in the
file, read a chunk of data, split it up into lines, discarding any
partial lines at the beginning and end, pull out the keys and see where
you are. Even with a gigabyte of file, if you are reading a decent size
chunk, you can binary search down to the spot you want in 15-20 tries or
so. That's the first time, but after that you've got a better idea
where to look. Use a dictionary to save the information from each chunk
to give you an index to get a headstart on the next search. If you can
keep 10k to 100k entries in your index, you can do 1000's of searches or
so before you even have to worry about having too many index entries.

I did learn that on 32-bit hardware, doing a binary search of a file
over a gigabyte will fail if you calculate the next place to look as
(a+b)/2, because a+b can be more than 2GB and overflow. You gotta do
(a + (b-a)/2)
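
A minimal sketch of that idea in current Python syntax (illustrative names; it assumes the file is sorted by a comma-separated first field, seeks to an arbitrary byte offset, discards the partial line, and narrows with the overflow-safe midpoint):

```python
def find_first(path, key):
    """Binary-search a text file sorted by its first comma-separated
    field; return the first line whose key is >= `key` (bytes), or
    b"" if there is none."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # measure the file size
        lo, hi = 0, f.tell()
        while lo < hi:
            mid = lo + (hi - lo) // 2     # overflow-safe midpoint
            f.seek(mid)
            if mid:
                f.readline()              # discard the partial line
            line = f.readline()           # first complete line after mid
            if not line or line.split(b",", 1)[0] >= key:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()
        return f.readline()
```

(Python's integers can't overflow, but the `lo + (hi - lo) // 2` form is the habit worth keeping for 32-bit languages.)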
Al
Jul 18 '05 #6

Michael Hoffman wrote:
me********@aol.com wrote:
It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.

And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.


There used to be a very nice sort program for PC's that came from
someplace in Nevada. It cost less than $100 and could sort files
faster than most programming languages could read or write them. For
linux, you've gotta figure out the posix sort. If you do, please splain
it to me.
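
For what it's worth, pre-sorting data like the above is a one-liner with the standard sort utility (assuming comma-separated fields; the -s "stable" flag is a GNU extension that keeps the original time order within each symbol):

```shell
# Sort by the first comma-separated field only, stably, so rows for
# each symbol stay in their original (chronological) order.
sort -t',' -k1,1 -s data.txt > sorted.txt
```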

Al
Jul 18 '05 #7

Thank you all very much for your suggestions and input... they've been
very helpful. I found the easiest approach, as a beginner to this, was
working with Chirag's code. Thanks Chirag, I was actually able to read
and make some edits to the code and then use it... woohooo!

My changes are annotated with ##:

data_file = open('G:\pythonRead.txt', 'r')
data_file.readline() ## this was to skip the first line
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
          'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
          'NOV':'11', 'DEC':'12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    length = len(fields[3]) ## check how long the field is
    N = 'P','N'
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename+'.txt', 'w')
    if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
        ## This line above doesn't work, can't figure out how to struct?
        fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
        fields[2] = fields[2].replace(':', '')
        if length == 6: ## check for 6 if not add a 0
            fields[3] = fields[3].replace('.', '')
        else:
            fields[3] = fields[3].replace('.', '') + '0'
        print >>output_files[filename], ', '.join(fields[1:5])
for filename in output_files:
    output_files[filename].close()
data_file.close()

The main changes were to create a check for the length of fields[3]; I
wanted to normalize it at 6 digits. The problem I can see with it
potentially is if I come across lengths < 5, but I have some ideas to
fix that. The other change I attempted was a criterion for what to
print based on the values of fields[8] and fields[6]. It didn't work so
well. I'm a little confused about how to structure booleans like
that... I come from a little experience in a Pascal-type scripting
language where "x and y" would entail both having to be true before
continuing, and "x or y" would mean either could be true before
continuing. Python, unless I'm misunderstanding (very possible),
doesn't organize it as such. I thought of perhaps using a set of if,
elif, else statements for processing the fields, but didn't think that
would be the most elegant/efficient solution.

Anyway, any critiques/ideas are welcome... they'll most definitely help
me understand this language a bit better. Thank you all again for your
great replies and thank you Chirag for getting me up and going.

Lorn

Jul 18 '05 #8

Lorn Davies wrote:
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?
In Python you would need to phrase that as follows:

if (fields[8] == 'N' or fields[8] == 'P') and \
   (fields[6] == '0' or fields[6] == '1'):

or alternatively:

if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):

The main changes were to create a check for the length of fields[3],
I wanted to normalize it at 6 digits...


Well, you needn't really check the length - you could directly do this:
fields[3] = (fields[3].replace('.', '') + '000000')[:6]
(of course if there are more than 6 digits originally, they'd get
truncated in this case)

Chirag Wazir
http://chirag.freeshell.org

Jul 18 '05 #9


cw****@yahoo.com wrote:
Lorn Davies wrote:
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?


In Python you would need to phrase that as follows:

if (fields[8] == 'N' or fields[8] == 'P') and \
   (fields[6] == '0' or fields[6] == '1'):

or alternatively:

if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):


and given that the files are huge, a little bit of preprocessing
wouldn't go astray:

initially:

valid_8 = set(['N', 'P'])
valid_6 = set(['0', '1'])

then for each record:

if fields[8] in valid_8 and fields[6] in valid_6:

More meaningful names wouldn't go astray either :-)

Jul 18 '05 #10

John Machin wrote:
More meaningful names wouldn't go astray either :-)


I heartily concur!

Instead of starting with:
fields = line.strip().split(',')
you could use something like:
(f_name, f_date, f_time, ...) = line.strip().split(',')

Of course then you won't be able to use ', '.join(fields[1:])
for the output, but the rest of the program will be
MUCH more readable/maintainable.

Chirag Wazir
http://chirag.freeshell.org

Jul 18 '05 #11
