Processing huge datasets

Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.
Now I simply get MemoryError.
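
(For reference, adjusting the limit is a one-liner; the value below is only an illustrative guess at what a deep tree might need, not a recommendation:)

    import sys

    # The default limit is 1000; a deep directory tree can blow past it.
    # 50000 is an arbitrary illustrative value.
    sys.setrecursionlimit(50000)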

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?

I'm looking at rewriting the code to operate on parts of the hierarchy at a
time and store the processed data structure in another Berkeley DB so I can
query that afterwards. But I'd really prefer keeping all things in memory
due to the huge performance gain.

Any pointers?

Cheers, Anders
Jul 18 '05 #1
In article <79*******************@news1.nokia.com>,
Anders Søndergaard <an*****************@nokia.com> wrote:

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.


Well, Don't Do That. ;-)

I don't understand the data structure you're describing; you'll either
need to choose something more appropriate for recursive processing or
switch to iterative processing (probably using generators).
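
As an illustration only -- the attribute names child and sibling are assumptions about the poster's structure, not anything shown in the thread -- an iterative, generator-based walk of a left-child/right-sibling tree might look like this:

    def iter_nodes(root):
        # Depth-first, pre-order walk using an explicit stack instead of
        # recursion, so the recursion limit never comes into play.
        stack = [root]
        while stack:
            node = stack.pop()
            yield node
            if node.sibling is not None:   # siblings come after this subtree
                stack.append(node.sibling)
            if node.child is not None:     # children are popped first (LIFO)
                stack.append(node.child)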
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

Adopt A Process -- stop killing all your children!
Jul 18 '05 #2
On Mon, 10 May 2004 12:00:03 +0000, Anders Søndergaard wrote:
Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.

First I banged my head on the recursion limit, which could luckily be
adjusted.
Now I simply get MemoryError.

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?


Hi Anders,

I use ZODB.
http://zope.org/Wikis/ZODB/FrontPage/guide/index.html
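
For anyone curious, a minimal sketch of the idea (untested; the class and file names are invented here, and the import spellings follow recent ZODB releases rather than anything in this thread):

    from ZODB import DB, FileStorage
    from persistent import Persistent
    import transaction

    class DirSummary(Persistent):
        # Persistent subclasses are written to disk transparently and
        # loaded back on demand, so the whole tree need not fit in RAM.
        def __init__(self, name):
            self.name = name
            self.file_count = 0
            self.volume = 0

    storage = FileStorage.FileStorage('summaries.fs')
    db = DB(storage)
    conn = db.open()
    root = conn.root()
    root['top'] = DirSummary('/')
    transaction.commit()
    conn.close()
    db.close()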

HTH,
Thomas

Jul 18 '05 #3
On Mon, 10 May 2004 12:00:03 GMT, Anders Søndergaard
<an*****************@nokia.com> declaimed the following in
comp.lang.python:
Hi,

I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.
Assuming the modification time is a 32-bit integer, and sizes
are a 32-bit integer, you've got 8 bytes per file right there... or
160MB just for the timestamp/size for your 20 million files... Add in
overhead for the data structures themselves (are you also keeping file
names? Even a fixed 8.3 format -- no count/terminator/"." -- will add
another 220MB, giving 380MB without overhead).
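
Spelling that arithmetic out (raw data only; actual Python objects carry far more per-instance overhead than these figures):

    files = 20 * 10 ** 6
    per_file = 4 + 4                     # 32-bit timestamp + 32-bit size
    name = 11                            # fixed 8.3 name, no dot/terminator
    print(files * per_file)              # 160000000 bytes, ~160MB
    print(files * (per_file + name))     # 380000000 bytes, ~380MB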

--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net | Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 18 '05 #4

"Anders Søndergaard" <an*****************@nokia.com> wrote in message
news:79*******************@news1.nokia.com...
I'm trying to process a large filesystem (+20 million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.
Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?


I would start with 2 gigs of RAM, which would allow about 90 bytes per
entry after allowing 200 megs for the OS and interpreter. Even that might not
be enough.
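
That estimate is roughly the straight division (assuming decimal gigs and megs):

    # (2 GB for the box - 200 MB for OS and interpreter) / 20 million entries
    print((2 * 10 ** 9 - 200 * 10 ** 6) // (20 * 10 ** 6))   # 90 bytes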

tjr


Jul 18 '05 #5
Aahz <aa**@pythoncraft.com> wrote:
Well, Don't Do That. ;-)

I don't understand the data structure you're describing; you'll either
need to choose something more appropriate for recursive processing or
switch to iterative processing (probably using generators).


Let me explain the problem a bit more in detail.
(Sorry for confusing things by using my private email address)

I'm trying to produce a system that will analyze a filesystem
for 'dead leaves', large data consumption, per-UID volume consumption,
trend analysis, etc.

To do that I first save each directory along with the sum of C, M and A
timestamps for the files in the directory (all in epoch secs.), number
of files, volume of files, list of UIDs, list of GIDs, dict of volume
per UID, dict of volume per GID, and newest file.

Then I build a hierarchy of instances, each of which knows its parent, children
and siblings.
Each object is populated with the summarized file information.
When that is done, I traverse the hierarchy from the bottom up,
accumulating average C, M and A times, volumes and number of files.

This hierarchy allows me to instantly query, say, the average
modification time for any given point in the directory structure and
below. That'll show where files that haven't been modified in a long time
hide, and how much space they take, among other things.

The LCRS tree lends itself very well for recursion in terms of beauty
and elegance of the code. However keeping that amount of data in memory
obviously just doesn't fly.
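
To make the memory question concrete, here is a rough sketch of what such a node and the bottom-up pass might look like; every attribute name is invented for illustration, and __slots__ is one commonly used way to cut per-instance overhead when there are millions of small objects:

    class DirNode(object):
        # __slots__ avoids a per-instance __dict__, usually the single
        # biggest memory saving for millions of small objects.
        __slots__ = ('name', 'parent', 'child', 'sibling',
                     'file_count', 'volume', 'sum_mtime')

        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.child = None
            self.sibling = None
            self.file_count = 0
            self.volume = 0
            self.sum_mtime = 0

    def accumulate(root):
        # Bottom-up accumulation without recursion: collect nodes in
        # pre-order with an explicit stack, then fold each node's totals
        # into its parent in reverse order, turning per-directory totals
        # into subtree totals in place.
        order, stack = [], [root]
        while stack:
            node = stack.pop()
            order.append(node)
            if node.sibling is not None:
                stack.append(node.sibling)
            if node.child is not None:
                stack.append(node.child)
        for node in reversed(order):
            if node.parent is not None:
                node.parent.file_count += node.file_count
                node.parent.volume += node.volume
                node.parent.sum_mtime += node.sum_mtime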

I'm at a point where the system actually just *barely* might work. I'll
know tomorrow when it's done. But the system might be used to work on
much larger filesystems, and then the party is over.

I'm looking for the best-suited way of attacking the problem. Is it
a memory-mapped file? A Berkeley DB? A specially crafted metadata filesystem?
(That would be fun, but probably overkill. :) Or something completely
different?
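
One hedged sketch of the Berkeley DB route, using the standard shelve module (which pickles values into whatever dbm backend is available, possibly Berkeley DB itself); the path and field names below are made up:

    import shelve

    # Spill finished per-directory summaries to disk, keyed by path,
    # so only the subtree currently being processed stays in RAM.
    store = shelve.open('dir-summaries.db')
    store['/some/dir'] = {'files': 1234, 'volume': 56789012,
                          'newest': 1084190400}
    store.close()

    store = shelve.open('dir-summaries.db')
    print(store['/some/dir']['files'])    # 1234
    store.close()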

The processing might be switched to iterative, but that's a 'minor' concern.
The main problem is how to handle the data in the fastest possible way.

Thanks for your pointers!

Cheers,
Anders
Jul 18 '05 #6
Anders S. Jensen <do****@freakout.dk> wrote:
I'm trying to produce a system that will analyze a filesystem for
'dead leaves', large data consumption, per-UID volume consumption,
trend analysis, etc.

To do that I first save each directory along with the sum of C, M and A
timestamps for the files in the directory (all in epoch secs.), number
of files, volume of files, list of UIDs, list of GIDs, dict of
volume per UID, dict of volume per GID, and newest file.

Then I build a hierarchy of instances, each of which knows its parent,
children and siblings. Each object is populated with the summarized
file information. When that is done, I traverse the hierarchy from
the bottom up, accumulating average C, M and A times, volumes and
number of files.

This hierarchy allows me to instantly query, say, the average
modification time for any given point in the directory structure and
below. That'll show where files that haven't been modified in a long
time hide, and how much space they take, among other things.

The LCRS tree lends itself very well for recursion in terms of beauty
and elegance of the code. However keeping that amount of data in
memory obviously just doesn't fly.

I'm at a point where the system actually just *barely* might work.
I'll know tomorrow when it's done. But the system might be used to
work on much larger filesystems, and then the party is over.

I'm looking for the best-suited way of attacking the problem. Is it a
memory-mapped file? A Berkeley DB? A specially crafted metadata
filesystem? (That would be fun, but probably overkill. :) Or something
completely different?
How about a good old text file in the filesystem, i.e.
/some/dir/.timestamp
? This way, you don't use up memory, you get a hierarchical data structure
automatically, no need to mess with a database interface, simple search
mechanism, etc. And, most of all, no need to ask other people who have
absolutely no idea what you're talking about... :-)
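
A tiny sketch of that idea (the '.summary' filename and the field layout are just one possible choice):

    import os

    def write_summary(dirpath, file_count, volume, newest_mtime):
        # One small text file per directory, as suggested above.
        path = os.path.join(dirpath, '.summary')
        f = open(path, 'w')
        f.write('%d %d %d\n' % (file_count, volume, newest_mtime))
        f.close()

    def read_summary(dirpath):
        f = open(os.path.join(dirpath, '.summary'))
        count, volume, newest = [int(x) for x in f.read().split()]
        f.close()
        return count, volume, newest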

The processing might be switched to iterative, but that's a 'minor'
concern. The main problem is how to handle the data in the fastest
possible way.

Thanks for your pointers!

Cheers, Anders


--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution/training/migration, Thin-client
Jul 18 '05 #7
