newb question: file searching

I'm new at Python and I need a little advice. Part of the script I'm
trying to write needs to be aware of all the files of a certain
extension in the script's path and all sub-directories. Can someone
set me on the right path to what modules and calls to use to do that?
You'd think that it would be a fairly simple proposition, but I can't
find examples anywhere. Thanks.

Aug 8 '06
Here's my code:

def getFileList():
    import os
    imageList = []
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            for dirname in dirnames:
                if not dirname.startswith('.'):
                    if filename.lower().endswith('.jpg') and not filename.startswith('.'):
                        imageList.append(os.path.join(dirpath, filename))
    return imageList

I've adapted it around all the much appreciated suggestions. However,
I'm running into two very peculiar logical errors. First, I'm getting
repeated entries. That's no good. One image, one entry in the list.
The other is that if I run the script from my Desktop folder, it won't
find any files, and I make sure to have lots of jpegs in the Desktop
folder for the test. Can anyone figure this out?

Aug 9 '06 #21
Something's really not reliable in my logic. I say this because if I
change the extension to .png then a file in a hidden directory (one that
starts with '.') shows up! The most frustrating part is that there are
.jpg files in the very same directory that don't show up when it
searches for jpegs.

I tried os.walk('.') and it works, so I'll be using that instead.

Aug 9 '06 #22
At Tuesday 8/8/2006 21:11, ja*******@gmail.com wrote:

> Here's my code:
>
> def getFileList():
>     import os
>     imageList = []
>     for dirpath, dirnames, filenames in os.walk(os.getcwd()):
>         for filename in filenames:
>             for dirname in dirnames:
>                 if not dirname.startswith('.'):
>                     if filename.lower().endswith('.jpg') and not filename.startswith('.'):
>                         imageList.append(os.path.join(dirpath, filename))
>     return imageList
>
> I've adapted it around all the much appreciated suggestions. However,
> I'm running into two very peculiar logical errors. First, I'm getting
> repeated entries. That's no good. One image, one entry in the list.

That's because of the double iteration. dirnames and filenames are
two distinct, complementary, lists. (If a directory entry is a
directory it goes into dirnames; if it's a file it goes into
filenames). So you have to process them one after another.

def getFileList():
    import os
    imageList = []
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            if filename.lower().endswith('.jpg') and not filename.startswith('.'):
                imageList.append(os.path.join(dirpath, filename))
        for i in reversed(range(len(dirnames))):
            if dirnames[i].startswith('.'): del dirnames[i]
    return imageList

reversed() because you need to modify dirnames in-place, so it's
better to process the list backwards.
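
A minimal sketch of the double-iteration problem described above, assuming a throwaway directory tree built with tempfile. The function names and the file and directory names are made up for illustration. With the nested loop, a file is appended once for every visible subdirectory next to it, and a file in a directory with no subdirectories is never appended at all, which would be consistent with both symptoms reported earlier.

```python
import os
import tempfile

def nested_loop_matches(root):
    """Mimic the nested loop: one append per (file, visible subdirectory) pair."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            for dirname in dirnames:
                if not dirname.startswith('.'):
                    if filename.lower().endswith('.jpg'):
                        found.append(os.path.join(dirpath, filename))
    return found

def single_loop_matches(root):
    """Look at filenames only; dirnames plays no part in matching."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith('.jpg'):
                found.append(os.path.join(dirpath, filename))
    return found

# Hypothetical tree: root/a.jpg, root/sub1/b.jpg, root/sub2/ (empty)
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'sub1'))
os.mkdir(os.path.join(root, 'sub2'))
for rel in ('a.jpg', os.path.join('sub1', 'b.jpg')):
    open(os.path.join(root, rel), 'w').close()

nested = nested_loop_matches(root)
single = single_loop_matches(root)
print(len(nested))  # a.jpg is counted once per subdirectory; b.jpg is missed
print(len(single))  # each image is counted exactly once
```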

Gabriel Genellina
Softlab SRL


Aug 9 '06 #23
I've narrowed down the problem. All the problems start when I try to
eliminate the hidden files and directories. Is there a better way to
do this?
Aug 9 '06 #24
That worked perfectly. Thank you. That was exactly what I was looking
for. However, can you explain to me what the following code actually
does?

reversed(range(len(dirnames)))
Aug 9 '06 #25
ja*******@gmail.com wrote:
> I've narrowed down the problem. All the problems start when I try to
> eliminate the hidden files and directories. Is there a better way to
> do this?

Well you almost have it, but your problem is that you are trying to do
too many things in one function. (I bet I am starting to sound like a
broken record :-)) The four distinct things you are doing are:

* getting a list of all files in a tree
* combining a file's directory with its name to give the full path
* ignoring hidden directories
* matching files based on their extension

If you split up each of those things into their own function you will
end up with smaller, easier-to-test pieces, and separate, reusable
functions.

The core function would be basically what you already have:

import os

def get_files(directory, include_hidden=False):
    """Return an expanded list of files for a directory tree
    optionally not ignoring hidden directories"""
    for path, dirs, files in os.walk(directory):
        for fn in files:
            full = os.path.join(path, fn)
            yield full

        if not include_hidden:
            remove_hidden(dirs)

and remove_hidden is a short, but tricky function since the directory
list needs to be edited in place:

def remove_hidden(dirlist):
    """For a list containing directory names, remove
    any that start with a dot"""

    dirlist[:] = [d for d in dirlist if not d.startswith('.')]

at this point, you can play with get_files on its own, and test
whether or not the include_hidden parameter works as expected.
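
A small sketch of why the edit has to happen in place, assuming a temporary tree with one hidden and one visible directory. The helper names walk_rebinding and walk_in_place are hypothetical. Rebinding the name dirs to a new list leaves the list that os.walk holds untouched, so the hidden directory is still visited; slice assignment changes that same list object, so os.walk skips it.

```python
import os
import tempfile

def walk_rebinding(root):
    """Rebinds the local name only; os.walk never sees the filtered list."""
    visited = []
    for path, dirs, files in os.walk(root):
        visited.append(os.path.basename(path))
        dirs = [d for d in dirs if not d.startswith('.')]
    return visited

def walk_in_place(root):
    """Slice assignment edits the list object that os.walk recurses from."""
    visited = []
    for path, dirs, files in os.walk(root):
        visited.append(os.path.basename(path))
        dirs[:] = [d for d in dirs if not d.startswith('.')]
    return visited

# Hypothetical tree: root/.hidden/ and root/visible/
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, '.hidden'))
os.mkdir(os.path.join(root, 'visible'))

print('.hidden' in walk_rebinding(root))  # hidden directory is still walked
print('.hidden' in walk_in_place(root))   # hidden directory is pruned
```

This relies on os.walk running top-down, which is its default.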

For the final step, I'd use an approach that pulls out the extension
itself, and checks to see if it is in a list (or better, a set) of
allowed extensions. Globbing (*.foo) works as well, but if you are only
ever matching on the extension, I believe this will work better.

def get_files_by_ext(directory, ext_list, include_hidden=False):
    """Return an expanded list of files for a directory tree
    where the file ends with one of the extensions in ext_list"""
    ext_list = set(ext_list)

    for fn in get_files(directory, include_hidden):
        _, ext = os.path.splitext(fn)
        ext = ext[1:]  # remove dot
        if ext.lower() in ext_list:
            yield fn

notice at this point we still haven't said anything about images! The
task of finding files by extension is pretty generic, so it shouldn't
be concerned about the actual extensions.
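
To illustrate the extension check in isolation, here is a rough sketch of what os.path.splitext returns and how the lowercased, dot-stripped extension is tested against a set. The helper has_allowed_ext and the sample filenames are made up for this example.

```python
import os

def has_allowed_ext(filename, allowed):
    """True if the extension (dot removed, lowercased) is in allowed."""
    _, ext = os.path.splitext(filename)
    return ext[1:].lower() in allowed

allowed = set(['jpg', 'png'])  # hypothetical extension set, stored without dots

print(os.path.splitext('photo.JPG'))       # only the last dot counts
print(os.path.splitext('archive.tar.gz'))  # so this splits off '.gz'
print(has_allowed_ext('photo.JPG', allowed))
print(has_allowed_ext('notes.txt', allowed))
print(has_allowed_ext('README', allowed))  # no extension at all
```

Note that this set stores extensions without the leading dot, while the getFileList(*extensions) variant later in the thread passes them with the dot; either works as long as both sides of the comparison agree.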

once that works, you can simply do

def get_images(directory, include_hidden=False):
    image_exts = ('jpg', 'jpeg', 'gif', 'png', 'bmp')
    return get_files_by_ext(directory, image_exts, include_hidden)

Hope this helps :-)

--
- Justin

Aug 9 '06 #26
I do appreciate the advice, but I've got a 12 line function that does
all of that. And it works! I just wish I understood a particular line
of it.

def getFileList(*extensions):
    import os
    imageList = []
    for dirpath, dirnames, files in os.walk('.'):
        for filename in files:
            name, ext = os.path.splitext(filename)
            if ext.lower() in extensions and not filename.startswith('.'):
                imageList.append(os.path.join(dirpath, filename))
        for dirname in reversed(range(len(dirnames))):
            if dirnames[dirname].startswith('.'):
                del dirnames[dirname]

    return imageList

print(getFileList('.jpg', '.gif', '.png'))

The line I don't understand is:
reversed(range(len(dirnames)))
Aug 9 '06 #27
ja*******@gmail.com wrote:
> I do appreciate the advice, but I've got a 12 line function that does
> all of that. And it works! I just wish I understood a particular line
> of it.

You miss the point. The functions I posted, up until get_files_by_ext
which is the equivalent of your getFileList, total 17 actual lines.
The 5 extra lines give 3 extra features. Maybe in a while when you
need to do a similar file search you will realize why my way is better.

> [snip]
> The line I don't understand is:
> reversed(range(len(dirnames)))

This is why I wrote and documented a separate remove_hidden function;
it can be tricky. If you broke it up into multiple lines and added
print statements, it would be clear what it does.

l = len(dirnames) # l is the number of elements in dirnames, e.g. 6
r = range(l) # r contains the numbers 0,1,2,3,4,5
rv = reversed(r) # rv contains the numbers 5,4,3,2,1,0

The problem arises from how to remove elements in a list as you are
going through it. If you delete element 0, element 1 then becomes
element 0, and funny things happen. That particular solution is
relatively simple, it just deletes elements from the end instead. That
complicated expression arises because Python doesn't have "normal" for
loops. The version of remove_hidden I wrote is simpler, but relies on
the even more obscure lst[:] construct for re-assigning a list. Both
of them accomplish the same thing though, so if you wanted, you should
be able to replace those 3 lines with just

dirnames[:] = [d for d in dirnames if not d.startswith('.')]
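
A short sketch comparing the two approaches on a made-up list of directory names. It assumes nothing beyond built-ins: the backwards index loop and the slice assignment should leave the same contents, and the slice assignment keeps the same list object, which is what os.walk needs.

```python
dirnames = ['.git', '.cache', 'src', '.venv', 'docs']  # made-up names

# The indexes produced by the expression, from last to first.
indexes = list(reversed(range(len(dirnames))))
print(indexes)

# Deleting from the end never shifts the entries still to be visited.
by_index = list(dirnames)
for i in reversed(range(len(by_index))):
    if by_index[i].startswith('.'):
        del by_index[i]

# Slice assignment replaces the contents but keeps the same list object.
by_slice = list(dirnames)
same_object = by_slice
by_slice[:] = [d for d in by_slice if not d.startswith('.')]

print(by_index)
print(by_slice)
print(by_slice is same_object)
```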
--
- Justin

Aug 9 '06 #28
ja*******@gmail.com wrote:
> I do appreciate the advice, but I've got a 12 line function that does
> all of that. And it works! I just wish I understood a particular line
> of it.
>
> def getFileList(*extensions):
>     import os
>     imageList = []
>     for dirpath, dirnames, files in os.walk('.'):
>         for filename in files:
>             name, ext = os.path.splitext(filename)
>             if ext.lower() in extensions and not filename.startswith('.'):
>                 imageList.append(os.path.join(dirpath, filename))
>         for dirname in reversed(range(len(dirnames))):
>             if dirnames[dirname].startswith('.'):
>                 del dirnames[dirname]
>
>     return imageList
>
> print(getFileList('.jpg', '.gif', '.png'))
>
> The line I don't understand is:
> reversed(range(len(dirnames)))

For a start, change "dirname" to "dirindex" (without changing
"dirnames"!) in that line and the next two lines -- this may help your
understanding.

The purpose of that loop is to eliminate from dirnames any entries
which start with ".". This needs to be done in-situ -- concocting a new
list and binding the name "dirnames" to it won't work.
The safest understandable way to delete entries from a list while
iterating over it is to do it backwards.

Doing it forwards doesn't always work; example:

>>> dirnames = ['foo', 'bar', 'zot']
>>> for x in range(len(dirnames)):
...     if dirnames[x] == 'bar':
...         del dirnames[x]
...
Traceback (most recent call last):
  File "<stdin>", line 2, in ?
IndexError: list index out of range
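
The same pitfall as a script, in a minimal sketch with hypothetical helper names and made-up directory names. It shows two ways that forward deletion can go wrong: the index loop runs past the end of the shortened list, and the item loop finishes quietly but skips the entry that follows each deletion.

```python
def delete_forward_by_index(names):
    """Forward index loop: range() was sized before any deletion happened."""
    names = list(names)
    try:
        for x in range(len(names)):
            if names[x].startswith('.'):
                del names[x]
    except IndexError:
        return 'IndexError'
    return names

def delete_forward_by_item(names):
    """Forward item loop: no exception, but the entry after a deletion is skipped."""
    names = list(names)
    for name in names:
        if name.startswith('.'):
            names.remove(name)
    return names

sample = ['.a', '.b', 'src']  # made-up directory names
print(delete_forward_by_index(sample))  # the loop hits an index that no longer exists
print(delete_forward_by_item(sample))   # '.b' survives because it was skipped
```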

HTH,
John

Aug 9 '06 #29
I'm sorry. I didn't mean to offend you. I never thought your way was
inferior.
Aug 9 '06 #30
