
Faster os.walk()

I am trying to get the number of bytes used by the files in a directory.
I am walking a large directory tree (lots of stuff checked out of
multiple large CVS repositories), and a lot of time is wasted doing
multiple os.stat() calls on the same dirs and files from different
methods.
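
The current code is essentially this shape (a minimal sketch, not the
exact code; tree_size is a made-up name):

import os

def tree_size(top):
    # Naive version: os.walk() decides dir-vs-file with its own stat()
    # calls, then getsize() stat()s every file again.
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total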

Jul 19 '05 #1
8 Replies


fuzzylollipop wrote:
I am trying to get the number of bytes used by the files in a directory.
I am walking a large directory tree (lots of stuff checked out of
multiple large CVS repositories), and a lot of time is wasted doing
multiple os.stat() calls on the same dirs and files from different methods.

Do you need a precise value, or is an approximation acceptable?
Under which operating system? The 'du' command can be your friend.

man du
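
For example (a rough sketch; du_bytes is a made-up name, and -b is
GNU du's apparent-size-in-bytes flag, so check the man page on other
systems):

import commands

def du_bytes(path):
    # du -s = summarize, -b = apparent size in bytes (GNU du)
    out = commands.getoutput('du -sb ' + path)
    return int(out.split()[0])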

Best,

Laci 2.0

--
_________________________________________________________________
Laszlo Nagy web: http://designasign.biz
IT Consultant mail: ga*****@geochemsource.com

Python forever!
Jul 19 '05 #2

Laszlo Zsolt Nagy wrote:
fuzzylollipop wrote:
I am trying to get the number of bytes used by the files in a directory.
I am walking a large directory tree (lots of stuff checked out of
multiple large CVS repositories), and a lot of time is wasted doing
multiple os.stat() calls on the same dirs and files from different methods.

Do you need a precise value, or is an approximation acceptable?
Under which operating system? The 'du' command can be your friend.


How can "du" find the sizes without do os.stat() on each
file?
Jul 19 '05 #3

du is faster than my code that does the same thing in Python; it is
highly optimized at the OS level.

That said, I profiled spawning an external process to call du, and over
the large number of times I need to do this it is actually slower to
execute du externally than my os.walk() implementation.

du does not return the value I need anyway: I need file sizes only, not
the raw blocks consumed, which is what du reports. I also need to
filter out some files and dirs.

After extensive profiling I found that, the way os.walk() is
implemented, os.stat() gets called on the same dirs and files multiple
times, and that is where all the time is going.

I guess I need something like the deprecated statcache module, but that
probably wouldn't fix my problem anyway: I only walk the dir once and
then cache all the byte counts. It is the multiple os.stat() calls that
os.walk() kicks off internally, via isdir() and getsize() and whatnot,
that hurt.

Just wanted to check whether anyone had already solved this problem.
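
To show where the calls come from, the 2.4-era os.walk() looks roughly
like this (simplified, not the exact stdlib source):

import os
from os.path import join, isdir, islink

def walk(top):
    names = os.listdir(top)
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):   # one stat() per entry
            dirs.append(name)
        else:
            nondirs.append(name)
    yield top, dirs, nondirs
    for name in dirs:
        path = join(top, name)
        if not islink(path):         # a second stat() for each dir
            for x in walk(path):
                yield x

Every entry gets one stat() from isdir(), directories get another from
islink(), and the getsize() in my code stat()s each file yet again.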

Jul 19 '05 #4

How about redirecting stdout/stderr and popen()ing something like

/bin/find . -name '*' -exec a_script_or_cmd_that_does_what_i_want_with_the_file {} \;

?
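
A variation on the same idea that avoids spawning one process per
file: GNU find can print each file's size itself, and Python just sums
the output (an untested sketch; the path is a placeholder and -printf
is GNU-specific):

import os

# -type f restricts output to regular files; '%s\n' prints each size
# in bytes, one per line.
pipe = os.popen("find /some/dir -type f -printf '%s\\n'")
total = sum(int(line) for line in pipe)
pipe.close()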

Regards,

Philippe



Jul 19 '05 #5

fuzzylollipop wrote:
After extensive profiling I found that, the way os.walk() is
implemented, os.stat() gets called on the same dirs and files multiple
times, and that is where all the time is going.


os.walk() is pretty simple; you could copy it and make your own version
that calls os.stat() just once for each item. The dirnames and filenames
lists it yields could be lists of (name, os.stat(path)) tuples, so you
would have the sizes available.
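
An untested sketch of that idea (walk_stat is a made-up name; it
follows the shape of the stdlib os.walk):

import os
import stat

def walk_stat(top):
    # Like os.walk(), but each entry is lstat()ed exactly once; the
    # dir and file lists carry (name, stat_result) pairs.
    try:
        names = os.listdir(top)
    except os.error:
        return
    dirs, nondirs = [], []
    for name in names:
        st = os.lstat(os.path.join(top, name))  # the one stat() per entry
        if stat.S_ISDIR(st.st_mode):
            dirs.append((name, st))
        else:
            nondirs.append((name, st))
    yield top, dirs, nondirs
    for name, st in dirs:
        for item in walk_stat(os.path.join(top, name)):
            yield item

Summing file sizes then needs no further stat() calls:

total = 0
for dirpath, dirs, files in walk_stat('.'):
    for name, st in files:
        total += st.st_size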

Kent
Jul 19 '05 #6

fuzzylollipop <ja*************@gmail.com> wrote:
I am trying to get the number of bytes used by the files in a directory.
I am walking a large directory tree (lots of stuff checked out of
multiple large CVS repositories), and a lot of time is wasted doing
multiple os.stat() calls on the same dirs and files from different methods.


I presume you are saying that os.walk() has to stat() each file to
see whether it is a directory or not, and that you are then stat()-ing
each file again to count its bytes?

If you want to just get away with the one stat() you'll have to
re-implement os.walk yourself.

Another trick for speeding up lots of stats is to chdir() to the
directory you are processing, and then just use the leafnames in
stat(). The OS then doesn't have to spend ages parsing lots of paths.
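
Roughly like this (a sketch; leaf_sizes is a made-up name and error
handling is minimal):

import os

def leaf_sizes(dirpath):
    # stat() bare leafnames from inside the directory, so the kernel
    # doesn't re-parse the full path prefix on every call.
    cwd = os.getcwd()
    os.chdir(dirpath)
    try:
        return [(name, os.lstat(name).st_size) for name in os.listdir('.')]
    finally:
        os.chdir(cwd)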

However, even if you implement both of the above, I don't reckon you'll
see a lot of improvement, given that decent OSes have a very good cache
for stat results and that parsing file names is very quick too,
compared to Python.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 19 '05 #7

If you're trying to track changes to files over time (e.g. by comparing
current size with previously recorded size), FAM (the File Alteration
Monitor) might obviate a lot of filesystem traversal.

http://python-fam.sourceforge.net/

Jul 19 '05 #8

Ding, ding, ding, we have a winner.

One of the guys on the team did just this: he re-implemented the
os.walk() logic and embedded the S_IFDIR, S_IFMT and S_IFREG checks
directly into the traversal code.

This is all going to run on Unix or Linux machines in production, so
the platform-specific approach is not a big deal.
All in all we went from 64,000+ function calls for 7,070 files/dirs to
one stat() per dir/file.

The new code is a little more than twice as fast.

Huge improvement!
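
For anyone landing here later, the shape of that fix is roughly as
follows (my reconstruction, not the actual team code; the file/dir
filtering mentioned earlier is left out):

import os
from stat import S_IFMT, S_IFDIR, S_IFREG

def count_bytes(top):
    # One lstat() per entry; the mode bits are tested directly with
    # S_IFMT/S_IFDIR/S_IFREG instead of extra isdir()/getsize() calls.
    total = 0
    for name in os.listdir(top):
        path = os.path.join(top, name)
        st = os.lstat(path)
        fmt = S_IFMT(st.st_mode)
        if fmt == S_IFDIR:
            total += count_bytes(path)   # recurse into subdirectories
        elif fmt == S_IFREG:
            total += st.st_size          # regular files only
    return total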

Jul 19 '05 #9
