Working with Huge Text Files

Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size. Basically certain rows,
sorted by the first (primary) field maybe second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be
written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at
simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in
what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn

Jul 18 '05 #1

Lorn Davies wrote:
Hi there, I'm a Python newbie hoping for some direction in working
with text files that range from 100MB to 1G in size. Basically
certain rows, sorted by the first (primary) field maybe second
(date), need to be copied and written to their own file, and some
string manipulations need to happen as well.


You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
    print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N ']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N ']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N ']
"""

The csv module will bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.
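
Here's a rough, untested sketch of that approach using the csv module
(file and variable names are just placeholders); it closes and reopens
the output file whenever the first field changes:

import csv

reader = csv.reader(file("some.csv"))
current_key = None
out = None
for row in reader:
    if row[0] != current_key:              # first field has changed
        if out is not None:
            out.close()
        current_key = row[0]
        out = open(current_key + ".txt", "w")
    # reformatting of the date/time/price fields (row[1..3]) would go here
    out.write(", ".join(row[1:]) + "\n")
if out is not None:
    out.close()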

Jul 18 '05 #2
me********@aol.com wrote:
It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.


And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.
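
A rough, untested sketch of the append-mode idea (the file name is a
placeholder and the field reformatting is left out): because each output
file is opened with 'a' and closed again per line, the input doesn't
need to be sorted at all.

data_file = open('data.txt', 'r')
for line in data_file:
    fields = line.strip().split(',')
    out = open(fields[0] + '.txt', 'a')   # append, so earlier writes survive
    out.write(', '.join(fields[1:]) + '\n')
    out.close()
data_file.close()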
--
Michael Hoffman
Jul 18 '05 #3

me********@aol.com wrote:
Here's the example from the manual with your sample data in a file
named simple.csv:


Obviously, I meant "some.csv". Make sure the name in the program
matches the file you want to process, or pass the input file name
to the program as an argument.

Jul 18 '05 #4
Hi,

Lorn Davies wrote:
..... working with text files that range from 100MB to 1G in size.
.....
XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
.....


I've found that for working with simple large text files like this,
nothing beats the plain old built-in string operations. Using a parsing
library is convenient if the data format is complex, but otherwise it's
overkill.
In this particular case, even the csv module isn't much of an
advantage. I'd just use split.

The following code should do the job:

data_file = open('data.txt', 'r')
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
          'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
          'NOV':'11', 'DEC':'12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename + '.txt', 'w')
    fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
    fields[2] = fields[2].replace(':', '')
    fields[3] = fields[3].replace('.', '')
    print >>output_files[filename], ', '.join(fields[1:])
for filename in output_files:
    output_files[filename].close()
data_file.close()

Note that it does work with unsorted data - at the minor cost of
keeping all output files open till the end of the entire process.

Chirag Wazir
http://chirag.freeshell.org

Jul 18 '05 #5
I did some similar stuff way back about 12-15 years ago -- in 640k
MS-DOS with gigabyte files on 33 MHz machines. I got good performance,
able to bring up any record out of 10 million or so on the screen in a
couple of seconds (not using Python, but that should not make much
difference, maybe even some things in Python would make it work better.)

Even though my files were text, I read them as random-access binary
files. You need to be able to dive in at an arbitrary point in the
file, read a chunk of data, split it up into lines, discarding any
partial lines at the beginning and end, pull out the keys and see where
you are. Even with a gigabyte of file, if you are reading a decent size
chunk, you can binary search down to the spot you want in 15-20 tries or
so. That's the first time, but after that you've got a better idea
where to look. Use a dictionary to save the information from each chunk
to give you an index to get a headstart on the next search. If you can
keep 10k to 100k entries in your index, you can do 1000's of searches or
so before you even have to worry about having too many index entries.

I did learn that on 32-bit hardware, doing a binary search of a file
over a gigabyte will fail if you calculate the next place to look as
(a+b)/2, because a+b can be more than 2GB and overflow. You gotta do
(a + (b-a)/2)
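
A rough sketch of the idea in Python 2 (untested, just to illustrate the
seek-and-discard-partial-line trick; it assumes the file is sorted on
its first comma-separated field):

def find_first(filename, key):
    """Return the first line whose first field equals key, or None."""
    f = open(filename, 'rb')
    f.seek(0, 2)                       # jump to the end to get the file size
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = lo + (hi - lo) // 2      # overflow-safe midpoint, as noted above
        f.seek(mid)
        f.readline()                   # discard the (possibly partial) line
        line = f.readline()            # this one starts at a real boundary
        if line == '' or line.split(',', 1)[0] >= key:
            hi = mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo:
        f.readline()                   # skip the line we landed inside
    for line in iter(f.readline, ''):  # short forward scan from the cut point
        field = line.split(',', 1)[0]
        if field == key:
            return line
        if field > key:
            break
    return None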
Al
Jul 18 '05 #6
Michael Hoffman wrote:
me********@aol.com wrote:
It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.

And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.


There used to be a very nice sort program for PC's that came from
someplace in Nevada. It cost less than $100 and could sort files
faster than most programming languages could read or write them. For
linux, you've gotta figure out the posix sort. If you do, please splain
it to me.

Al
Jul 18 '05 #7
Thank you all very much for your suggestions and input... they've been
very helpful. I found the easiest approach, as a beginner to this, was
working with Chirag's code. Thanks Chirag, I was actually able to read
and make some edits to the code and then use it... woohooo!

My changes are annotated with ##:

data_file = open('G:\pythonRead.txt', 'r')
data_file.readline() ## this was to skip the first line
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
          'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
          'NOV':'11', 'DEC':'12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    length = len(fields[3]) ## check how long the field is
    N = 'P','N'
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename+'.txt', 'w')
    if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
        ## This line above doesn't work, can't figure out how to struct?
        fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
        fields[2] = fields[2].replace(':', '')
        if length == 6: ## check for 6 if not add a 0
            fields[3] = fields[3].replace('.', '')
        else:
            fields[3] = fields[3].replace('.', '') + '0'
        print >>output_files[filename], ', '.join(fields[1:5])
for filename in output_files:
    output_files[filename].close()
data_file.close()

The main changes were to create a check for the length of fields[3]; I
wanted to normalize it at 6 digits... the problem I can see with it
potentially is if I come across lengths < 5, but I have some ideas to
fix that. The other change I attempted was a criterion for what to print
based on the values of fields[8] and fields[6]. It didn't work so well.
I'm a little confused about how to structure booleans like that... I
come from a little experience in a Pascal-type scripting language where
"x and y" would require both to be true before continuing and "x or y"
would mean either could be true before continuing. Python, unless I'm
misunderstanding (very possible), doesn't seem to organize it that way.
I thought of perhaps using a set of if/elif/else statements for
processing the fields, but didn't think that would be the most
elegant/efficient solution.

Anyway, any critiques/ideas are welcome... they'll most definitely help
me understand this language a bit better. Thank you all again for your
great replies and thank you Chirag for getting me up and going.

Lorn

Jul 18 '05 #8
Lorn Davies wrote:
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?

In Python, fields[8] == 'N' or 'P' is read as (fields[8] == 'N') or 'P',
and a non-empty string like 'P' always counts as true, so that test
passes for every row. You would need to phrase it as follows:

if (fields[8] == 'N' or fields[8] == 'P') and \
   (fields[6] == '0' or fields[6] == '1'):

or alternatively:

if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):
The main changes were to create a check for the length of fields[3],
I wanted to normalize it at 6 digits...


Well, you needn't really check the length - you could directly do this:
fields[3] = (fields[3].replace('.', '') + '000000')[:6]
(of course if there are more than 6 digits originally, they'd get
truncated in this case)
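
For example, with the prices from your sample data:

>>> ('28.87'.replace('.', '') + '000000')[:6]
'288700'
>>> ('28.875'.replace('.', '') + '000000')[:6]
'288750'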

Chirag Wazir
http://chirag.freeshell.org

Jul 18 '05 #9

cw****@yahoo.com wrote:
Lorn Davies wrote:
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?


In Python you would need to phrase that as follows:
if (fields[8] == 'N' or fields[8] == 'P') and (fields[6] == '0'
or fields[6] == '1'):
or alternatively:
if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):


and given that the files are huge, a little bit of preprocessing
wouldn't go astray:

initially:

valid_8 = set(['N', 'P'])
valid_6 = set(['0', '1'])

then for each record:

if fields[8] in valid_8 and fields[6] in valid_6:

More meaningful names wouldn't go astray either :-)

Jul 18 '05 #10
John Machin wrote:
More meaningful names wouldn't go astray either :-)


I heartily concur!

Instead of starting with:
fields = line.strip().split(',')
you could use something like:
(f_name, f_date, f_time, ...) = line.strip().split(',')

Of course then you won't be able to use ', '.join(fields[1:])
for the output, but the rest of the program will be
MUCH more readable/maintainable.
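
For instance (the names after f_time are just guesses at what the
fields mean, purely for illustration):

line = 'XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N'
(f_name, f_date, f_time, f_price, f_size,
 f_a, f_b, f_cond, f_corr) = line.strip().split(',')
out_line = ', '.join([f_date, f_time, f_price, f_size,
                      f_a, f_b, f_cond, f_corr])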

Chirag Wazir
http://chirag.freeshell.org

Jul 18 '05 #11
