processing a Very Large file

DJTB

Hi,

I'm trying to manually parse a dataset stored in a file. The data should be
converted into Python objects.

Here is an example of a single line of a (small) dataset:

3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549 955187488
589884464 -1547848504 857311165 585616830 -749910209 194940864 -1102778558
-1282985276 -1220931512 792256075 -340699912 1496177106 1760327384
-1068195107 95705193 1286147818 -416474772 745439854 1932457456 -1266423822
-1150051085 1359928308 129778935 1235905400 532121853

The first integer specifies the length of a tuple object. In this case, the
tuple has three elements: (13, 17, 19).
The other values (-626177023 to 532121853) are elements of a Set.

I use the following code to process a file:

from time import time
from sets import Set
from string import split
file = 'pathtable_ht.dat'
result = []
start_time = time ()
f=open(file,'r')
for line in f:
    splitres = line.split()
    tuple_size = int(splitres[0])+1
    path_tuple = tuple(splitres[1:tuple_size])
    conflicts = Set(map(int,splitres[tuple_size:-1]))
    # do something with 'path_tuple' and 'conflicts'
    # ... do some processing ...
    result.append(( path_tuple, conflicts))

f.close()
print time() - start_time

The elements (integer objects) in these Sets are being shared between the
sets; in fact, there are as many distinct elements as there are lines in the
file (e.g. 1000 lines -> 1000 distinct set elements). AFAIK, the elements are
stored only once and each Set contains a pointer to the actual object.

This works fine with relatively small datasets, but it doesn't work at all
with large datasets (4500 lines, 45000 chars per line).

After a few seconds of loading, all main memory is consumed by the Python
process and the computer starts swapping. After a few more seconds, CPU
usage drops from 99% to 1% and all swap memory is consumed:

Mem: 386540k total, 380848k used, 4692k free, 796k buffers
Swap: 562232k total, 562232k used, 0k free, 27416k cached

At this point, my computer becomes unusable.

I'd like to know if I should buy some more memory (a few GB?) or if it is
possible to make my code more memory efficient.

Thanks in advance,
Stan.

Jul 19 '05 #1


Tim Peters

[DJTB]

> I'm trying to manually parse a dataset stored in a file. The data should be
> converted into Python objects.
>
> Here is an example of a single line of a (small) dataset:
>
> 3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549 955187488 589884464 -1547848504 857311165 585616830 -749910209 194940864 -1102778558 -1282985276 -1220931512 792256075 -340699912 1496177106 1760327384 -1068195107 95705193 1286147818 -416474772 745439854 1932457456 -1266423822 -1150051085 1359928308 129778935 1235905400 532121853
>
> The first integer specifies the length of a tuple object. In this case, the
> tuple has three elements: (13, 17, 19)
> The other values (-626177023 to 532121853) are elements of a Set.
>
> I use the following code to process a file:
>
> from time import time
> from sets import Set
> from string import split

Note that you don't use string.split later.

> file = 'pathtable_ht.dat'
> result = []
> start_time = time ()
> f=open(file,'r')
> for line in f:
>     splitres = line.split()

Since they're all integers, may as well:

    splitres = map(int, line.split())

here and skip repeated int() calls later.

>     tuple_size = int(splitres[0])+1
>     path_tuple = tuple(splitres[1:tuple_size])
>     conflicts = Set(map(int,splitres[tuple_size:-1]))

Do you really mean to throw away the last value on the line? That is,
why is the slice here [tuple_size:-1] rather than [tuple_size:]?

>     # do something with 'path_tuple' and 'conflicts'
>     # ... do some processing ...
>     result.append(( path_tuple, conflicts))
>
> f.close()
> print time() - start_time
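
Putting the two suggestions above together (convert every field with int() once, and slice to the end of the line), the parsing step might look roughly like the sketch below. It assumes a current Python, so it uses the built-in set instead of sets.Set, and parse_line is just a name made up for the illustration.

```python
def parse_line(line):
    """Split one dataset line into (path_tuple, conflicts)."""
    # Convert every field to int once, instead of calling int() again later.
    fields = [int(tok) for tok in line.split()]
    # The first field is the tuple length; the tuple itself follows it.
    tuple_end = fields[0] + 1
    path_tuple = tuple(fields[1:tuple_end])
    # Slice to the end of the line so the last value is not thrown away.
    conflicts = set(fields[tuple_end:])
    return path_tuple, conflicts


# Shortened version of the sample line from the original post.
sample = "3 13 17 19 -626177023 -1688330994 532121853\n"
path_tuple, conflicts = parse_line(sample)
print(path_tuple)          # (13, 17, 19)
print(sorted(conflicts))   # [-1688330994, -626177023, 532121853]
```

Note that line.split() with no argument already discards the trailing newline, so nothing special is needed for the last field.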

> The elements (integer objects) in these Sets are being shared between the
> sets, in fact, there are as many distinct elements as there are lines in the
> file (eg 1000 lines -> 1000 distinct set elements). AFAIK, the elements are
> stored only once and each Set contains a pointer to the actual object

Only "small" integers are stored uniquely; e.g., these aren't:

    >>> 100 * 100 is 100 * 100
    False
    >>> int("12345") is int("12345")
    False

You could manually do something akin to Python's "string interning" to
store ints uniquely, like:

    int_table = {}
    def uniqueint(i):
        return int_table.setdefault(i, i)

Then, e.g.,

    >>> uniqueint(100 * 100) is uniqueint(100 * 100)
    True
    >>> uniqueint(int("12345")) is uniqueint(int("12345"))
    True

Doing Set(map(uniqueint, etc)) would then feed truly shared int
(and/or long) objects to the Set constructor.
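
As a rough illustration of how the shared objects end up inside the sets, here is a small sketch for a current Python (built-in set instead of sets.Set). uniqueint is the function from the post; shared_set is a hypothetical helper added only for this example.

```python
int_table = {}

def uniqueint(i):
    # Return the first int object seen with this value, so that equal
    # values always map to one shared object.
    return int_table.setdefault(i, i)

def shared_set(tokens):
    # Hypothetical helper: build a set whose elements all come from int_table.
    return set(uniqueint(int(tok)) for tok in tokens)

# Two "lines" that contain the same values, parsed independently.
s1 = shared_set("-626177023 1496177106".split())
s2 = shared_set("1496177106 -626177023".split())

# Compare object identities, not just values.
same_objects = {id(x) for x in s1} == {id(x) for x in s2}
print(same_objects)    # True
print(len(int_table))  # 2
```

The table itself holds one entry per distinct value, so it only pays off when values repeat across many sets, which is the situation described in the question.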

> This works fine with relatively small datasets, but it doesn't work at all
> with large datasets (4500 lines, 45000 chars per line).

Well, chars/line doesn't mean anything to us. Knowing # of set
elements/line might help. Say there are 4500 per line. Then you've
got about 20 million integers. That will consume at least several hundred
MB if you don't work to share duplicates. But if you do so work, it
should cut the memory burden by a factor of thousands.
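
The arithmetic behind that estimate can be sketched as follows. The 4500 elements per line is the guess made above, not a measured figure, and the count of distinct values (one per line) comes from the original question. The per-object size is whatever sys.getsizeof reports on the interpreter running the sketch, so the MB figures will differ from those of a 2005-era 32-bit build.

```python
import sys

lines = 4500
elements_per_line = 4500      # assumed figure from the reply, not measured
distinct_values = lines       # the question says: one distinct value per line

total_ints = lines * elements_per_line
int_bytes = sys.getsizeof(-626177023)   # size of one int object here

# Bytes spent on int objects alone, without and with sharing.
unshared_bytes = total_ints * int_bytes
shared_bytes = distinct_values * int_bytes

print(total_ints)                           # 20250000
print(unshared_bytes / 1e6, "MB unshared")
print(shared_bytes / 1e6, "MB shared")
print(unshared_bytes // shared_bytes)       # 4500
```

This counts only the integer objects. Each set still needs its own hash table with a slot per element, and sharing does not shrink that part.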

> After a few seconds of loading, all main memory is consumed by the Python
> process and the computer starts swapping. After a few more seconds, CPU
> usage drops from 99% to 1% and all swap memory is consumed:
>
> Mem: 386540k total, 380848k used, 4692k free, 796k buffers
> Swap: 562232k total, 562232k used, 0k free, 27416k cached
>
> At this point, my computer becomes unusable.
>
> I'd like to know if I should buy some more memory (a few GB?) or if it is
> possible to make my code more memory efficient.

See above for the latter. If you have a 32-bit processor, you won't
be able to _address_ more than a few GB anyway. Still, 384MB of RAM
is on the light side these days <wink>.

Jul 19 '05 #2

Steve M

I'm surprised you didn't recommend to use ZODB. Seems like an ideal way
to manage this large amount of data as a collection of Python objects...

Jul 19 '05 #3

Gregory Bond

DJTB wrote:

> Hi,
>
> I'm trying to manually parse a dataset stored in a file. The data should be
> converted into Python objects.

In addition to what the others have mentioned, this sort of problem is
pretty easy to do with a C coded extension type, if you have (or can
buy/borrow) any C skills. The result is waaaay more efficient in time
and memory, particularly if you actually don't wind up looking at most
elements.

But Robert's solution (use iterators/generators rather than store them
all in a list) is probably best/quickest if that is possible for your app.
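
The post being referred to is not shown in this thread, so the following is only a guess at the general idea: a generator that yields one record at a time, so only the current line's objects are alive. It is a minimal sketch for a current Python; iter_records is a made-up name, and io.StringIO stands in for the real data file.

```python
import io

def iter_records(lines):
    """Yield (path_tuple, conflicts) one line at a time."""
    for line in lines:
        fields = [int(tok) for tok in line.split()]
        if not fields:
            continue  # skip blank lines
        tuple_end = fields[0] + 1
        yield tuple(fields[1:tuple_end]), set(fields[tuple_end:])

# Stand-in for open('pathtable_ht.dat'); any iterable of lines works.
data = io.StringIO("3 13 17 19 -5 7\n2 1 2 7 9 11\n")

# Process each record as it arrives instead of appending it to a list.
largest = max(len(conflicts) for _, conflicts in iter_records(data))
print(largest)  # 3
```

This only helps if the processing can be done record by record; if every set has to stay in memory at once, sharing the integers is still needed.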

Jul 19 '05 #4

DJTB

Tim Peters wrote:

>>     tuple_size = int(splitres[0])+1
>>     path_tuple = tuple(splitres[1:tuple_size])
>>     conflicts = Set(map(int,splitres[tuple_size:-1]))
>
> Do you really mean to throw away the last value on the line? That is,
> why is the slice here [tuple_size:-1] rather than [tuple_size:]?

Thanks, you saved me from another bug-hunting hell...
(In a previous test version, split returned a '\n' as the last item in the
list...)
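
For anyone wondering how split can return a '\n' item at all: one way (an assumption about what the earlier version did, it is not shown here) is splitting on an explicit single space when the line ends with a space before the newline. A small sketch with a made-up line:

```python
line = "3 13 17 19 -5 7 \n"   # note the space before the newline

# Splitting on an explicit space keeps the newline as its own item.
with_sep = line.split(" ")
print(with_sep)     # ['3', '13', '17', '19', '-5', '7', '\n']

# Splitting on any whitespace (no argument) drops it.
without_sep = line.split()
print(without_sep)  # ['3', '13', '17', '19', '-5', '7']
```

With the no-argument form used in the posted code, the [:-1] slice removes a real value, which is the bug pointed out above.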

> You could manually do something akin to Python's "string interning" to
> store ints uniquely, like:
>
>     int_table = {}
>     def uniqueint(i):
>         return int_table.setdefault(i, i)
>
> Then, e.g.,
>
>     >>> uniqueint(100 * 100) is uniqueint(100 * 100)
>     True
>     >>> uniqueint(int("12345")) is uniqueint(int("12345"))
>     True
>
> Doing Set(map(uniqueint, etc)) would then feed truly shared int
> (and/or long) objects to the Set constructor.

I've implemented this and it does seem to work, thanks.

Stan.

Jul 19 '05 #5
