
Processing huge datasets

Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.
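
For reference, the adjustment itself is just a standard library call, e.g.:

import sys
sys.setrecursionlimit(100000)   # the default limit is 1000
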
Now I simply get MemoryError.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?

I'm looking at rewriting the code to operate on parts of the hierarchy at a
time and store the processed data structure in another Berkeley DB so I can
query that afterwards. But I'd really prefer keeping all things in memory
due to the huge performance gain.
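
Roughly what I have in mind for the second DB (the marshal packing and
key layout are just one possible choice; bsddb here is the module from
the Python 2 standard library):

import bsddb, marshal

db = bsddb.btopen('dirsummary.db', 'c')    # btree access keeps keys sorted,
                                           # so path prefixes cluster together
summary = {'n_files': 42, 'volume': 1048576, 'newest': 1084190403}
db['/usr/local'] = marshal.dumps(summary)  # bsddb stores plain strings
db.sync()

info = marshal.loads(db['/usr/local'])
db.close()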

Any pointers?

Cheers, Anders
Jul 18 '05 #1
In article <79*******************@news1.nokia.com>,
Anders Søndergaard <an*****************@nokia.com> wrote:

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.


Well, Don't Do That. ;-)

I don't understand the data structure you're describing; you'll either
need to choose something more appropriate for recursive processing or
switch to iterative processing (probably using generators).
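
For example, an explicit stack turns a recursive depth-first walk into
an iterative generator (the child/sibling attribute names below are an
assumption about your node layout, not from your post):

def walk(root):
    # Depth-first traversal using a Python list as the stack,
    # so tree depth no longer hits the interpreter recursion limit.
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        child = node.child
        while child is not None:
            stack.append(child)
            child = child.sibling
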
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

Adopt A Process -- stop killing all your children!
Jul 18 '05 #2
On Mon, 10 May 2004 12:00:03 +0000, Anders Søndergaard wrote:
Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.
Now I simply get MemoryError.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?


Hi Anders,

I use ZODB.
http://zope.org/Wikis/ZODB/FrontPage/guide/index.html
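
A minimal sketch of the idea (the exact API varies a bit between ZODB
releases, so treat this as a starting point rather than gospel):

from ZODB import FileStorage, DB
from persistent import Persistent
import transaction

class DirNode(Persistent):
    # Persistent subclasses are faulted in from disk on demand,
    # so the whole tree never has to sit in RAM at once.
    def __init__(self, name):
        self.name = name
        self.n_files = 0
        self.volume = 0

storage = FileStorage.FileStorage('fstree.fs')
db = DB(storage)
conn = db.open()
conn.root()['tree'] = DirNode('/')
transaction.commit()
conn.close()
db.close()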

HTH,
Thomas

Jul 18 '05 #3
On Mon, 10 May 2004 12:00:03 GMT, Anders Søndergaard
<an*****************@nokia.com> declaimed the following in
comp.lang.python:
Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.
Assuming the modification time is a 32-bit integer, and sizes
are 32-bit integers, you've got 8 bytes per file right there... or
160MB just for the timestamp/size for your 20 million files... Add in
overhead for the data structures themselves (are you also keeping file
names? Even a fixed 8.3 format -- no count/terminator/"." -- will add
another 220MB, giving 380MB without overhead).
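
Back-of-the-envelope version of that estimate:

n_files = 20 * 10**6
per_file = 4 + 4                       # 32-bit mtime + 32-bit size
name_83 = 11                           # fixed 8.3 name, no separators
print(n_files * per_file)              # 160000000 -> ~160MB
print(n_files * (per_file + name_83))  # 380000000 -> ~380MB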

--
wl*****@ix.netcom.com  |  Wulfraed  Dennis Lee Bieber  KD6MOG
wu******@dm.net        |  Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 18 '05 #4

"Anders Søndergaard" <an************ *****@nokia.com > wrote in message
news:79******** ***********@new s1.nokia.com...
I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?


I would start with 2 gigs of RAM, which would allow about 90 bytes per
entry after allowing 200 megs for the OS and interpreter. Even that might not
be enough.

tjr


Jul 18 '05 #5
Aahz <aa**@pythoncraft.com> wrote:
Well, Don't Do That. ;-)

I don't understand the data structure you're describing; you'll either
need to choose something more appropriate for recursive processing or
switch to iterative processing (probably using generators).


Let me explain the problem a bit more in detail.
(Sorry for confusing things by using my private email address)

I'm trying to produce a system that will analyze a filesystem
for 'dead leaves', large data consumption, per-UID volume consumption,
trend analysis, etc.

To do that I first save each directory along with the sum of C, M and A
timestamps for the files in the directory (all in epoch secs.), the number
of files, the volume of files, a list of UIDs, a list of GIDs, a dict of volume
per UID, a dict of volume per GID and the newest file.

Then I build a hierarchy of instances that know their parent, children
and siblings.
Each object is populated with the summarized file information.
When that is done, I traverse the hierarchy from the bottom up,
accumulating average C, M and A times, volumes and numbers of files.

This hierarchy allows me to instantly query, say, the average
modification time for any given point in the directory structure and
below. That'll show where files that haven't been modified in a long time
hide, and how much space they take, amongst other things.

The LCRS tree lends itself very well to recursion in terms of beauty
and elegance of the code. However, keeping that amount of data in memory
obviously just doesn't fly.
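
For concreteness, a stripped-down version of the node and the bottom-up
pass look roughly like this (illustrative names, not my production code):

class DirNode:
    __slots__ = ('name', 'child', 'sibling',
                 'n_files', 'volume', 'sum_mtime')

    def __init__(self, name):
        self.name = name
        self.child = None        # leftmost child
        self.sibling = None      # next sibling to the right
        self.n_files = 0
        self.volume = 0
        self.sum_mtime = 0       # summed epoch mtimes, averaged on demand

def accumulate(node):
    # Bottom-up pass: fold each child's totals into its parent, so any
    # node can answer "average mtime below here" instantly.
    child = node.child
    while child is not None:
        accumulate(child)
        node.n_files += child.n_files
        node.volume += child.volume
        node.sum_mtime += child.sum_mtime
        child = child.sibling

I'm experimenting with __slots__ since per-instance dict overhead adds
up fast at 20 million objects.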

I'm at a point where the system actually just *barely* might work. I'll
know tomorrow when it's done. But the system might be used to work on
much larger filesystems, and then the party is over.

I'm looking for the best-suited way of attacking the problem. Is it
a memory-mapped file? A Berkeley DB? A specially crafted metadata filesystem?
(that would be fun but probably overkill..:) or something completely
different.

The processing might be switched to iterative, but that's a 'minor' concern.
The main problem is how to handle the data in the fastest possible way.

Thanks for your pointers!

Cheers,
Anders
Jul 18 '05 #6
Anders S. Jensen <do****@freakout.dk> wrote:
I'm trying to produce a system that will analyze a filesystem for
'dead leaves', large data consumption, per-UID volume consumption,
trend analysis, etc.

To do that I first save each directory along with the sum of C, M and A
timestamps for the files in the directory (all in epoch secs.), the number
of files, the volume of files, a list of UIDs, a list of GIDs, a dict of
volume per UID, a dict of volume per GID and the newest file.

Then I build a hierarchy of instances that know their parent,
children and siblings. Each object is populated with the summarized
file information. When that is done, I traverse the hierarchy from
the bottom up, accumulating average C, M and A times, Volumes and
number of files.

This hierarchy allows me to instantly query, say, the average
modification time for any given point in the directory structure and
below. That'll show where files that haven't been modified in a long
time hide, and how much space they take, amongst other things.

The LCRS tree lends itself very well to recursion in terms of beauty
and elegance of the code. However, keeping that amount of data in
memory obviously just doesn't fly.

I'm at a point where the system actually just *barely* might work.
I'll know tomorrow when it's done. But the system might be used to
work on much larger filesystems, and then the party is over.

I'm looking for the best-suited way of attacking the problem. Is it a
memory-mapped file? A Berkeley DB? A specially crafted metadata
filesystem? (that would be fun but probably overkill..:) or something
completely different.
How about a good old text file in the filesystem, i.e.
/some/dir/.timestamp ?
This way, you don't use up memory, you get a hierarchical data structure
automatically, no need to mess with a database interface, a simple search
mechanism, etc. And, most of all, no need to ask other people who have
absolutely no idea what you're talking about... :-)
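
Something as simple as this sketch would do it (the .dirinfo name and
the line format are arbitrary choices):

import os

def write_summary(dirpath, n_files, volume, newest):
    # One plain-text summary line per directory, stored in the
    # directory itself -- the filesystem *is* the hierarchy.
    f = open(os.path.join(dirpath, '.dirinfo'), 'w')
    f.write('%d %d %d\n' % (n_files, volume, newest))
    f.close()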

The processing might be switched to iterative, but that's a 'minor'
concern. The main problem is how to handle the data in the fastest
possible way.

Thanks for your pointers!

Cheers, Anders


--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution/training/migration, Thin-client
Jul 18 '05 #7
