467,077 Members | 976 Online
Bytes | Developer Community
Ask Question

Home New Posts Topics Members FAQ

Post your question to a community of 467,077 developers. It's quick & easy.

FileSystem search algorithm needed!

Hi

What is the most efficient way to gather a list of files
under a given folder recursively with the ability to
specify exclusion file masks?

For example, say I wanted to get a list of all files
under \Program Files and it's subdirectories that meet
the following criteria:

Include *.exe recursively
Include *.dll recursively
Exclude Common Files\*.* recursively

Ie I want all exe and dlls under Program Files and
subdirectorys, with the exception of any under \Program
files\Common Files.

At the moment, I'm gathering all included files into an
ArrayList and all excluded into another, and then
NAND'ing the two lists. This is turning out to be very
slow.

Any guidance greatfully received.

Ben Fidge
Nov 15 '05 #1
  • viewed: 6788
Share:
4 Replies

"Ben Fidge" <be*******@btopenworld.com> wrote in message
news:09****************************@phx.gbl...
Hi

What is the most efficient way to gather a list of files
under a given folder recursively with the ability to
specify exclusion file masks?

For example, say I wanted to get a list of all files
under \Program Files and it's subdirectories that meet
the following criteria:

Include *.exe recursively
Include *.dll recursively
Exclude Common Files\*.* recursively

Ie I want all exe and dlls under Program Files and
subdirectorys, with the exception of any under \Program
files\Common Files.

At the moment, I'm gathering all included files into an
ArrayList and all excluded into another, and then
NAND'ing the two lists. This is turning out to be very
slow.

Any guidance greatfully received.
This is a fairly quick technique(timing was 04.6250000 seconds on my machine
without display, thats pretty much on par with cmd.exe which was ~6.5 secs
including listing):

string[] LoadFiles(string searchDirectory)
{
ArrayList fileList = new ArrayList();
ArrayList ignoreList = new ArrayList();
ignoreList.Add(searchDirectory + "\\Common Files");

string[] directorys =
System.IO.Directory.GetDirectories(searchDirectory );
string[] files;
foreach (string directory in directorys)
{
if (!ignoreList.Contains(directory))
{
files = LoadFiles(directory);
fileList.AddRange(files);
}
}
files = System.IO.Directory.GetFiles(searchDirectory,"*.ex e");
fileList.AddRange(files);
files = System.IO.Directory.GetFiles(searchDirectory,"*.dl l");
fileList.AddRange(files);

return (string[])fileList.ToArray(typeof(string));
}

For a more reusable method, you may want to push ignoreList out as a
parameter(or a class member, depending on the situation) as well as perhaps
providing an extension list instead of hardcoding them.
Ben Fidge

Nov 15 '05 #2
Daniele,

Thanks for the help. I should have specified my criteria
a little more clearly.

Basically, I've written a small backup utility that
allows the backing up of a web-sites content via a Web
Service running on that site.

The site is large (100,000+ aspx, xml, xsl etc files) and
the web service is taking too long to compile the list of
files to download.

The backup utility needs to specify filemasks for
including and excluding in the final result set. So,
hypothetically speaking, a backup set could be as follwos:

include inetpub\wwwroot\*.aspx recursively
include inetpub\wwwroot\*.xml recursively
include inetpub\wwwroot\*.xsl recursively
exclude inetpub\wwwroot\data\*.xml
exclude inetpub\wwwroot\data\savepage.aspx

Your method is certainly very fast, but is it possible to
tailor it for unlimited arbitrary include and exclude
filemasks?

The method I currently use is taking about 90 seconds to
execute on our web server and the Web Service call is
timing out, hence the need for the most efficient file
gathering algorithm.

Thanks for your time, you'll be on my Christmas Card list
if you can solve this one!!

Ben Fidge

-----Original Message-----

"Ben Fidge" <be*******@btopenworld.com> wrote in message
news:09****************************@phx.gbl...
Hi

What is the most efficient way to gather a list of files under a given folder recursively with the ability to
specify exclusion file masks?

For example, say I wanted to get a list of all files
under \Program Files and it's subdirectories that meet
the following criteria:

Include *.exe recursively
Include *.dll recursively
Exclude Common Files\*.* recursively

Ie I want all exe and dlls under Program Files and
subdirectorys, with the exception of any under \Program
files\Common Files.

At the moment, I'm gathering all included files into an
ArrayList and all excluded into another, and then
NAND'ing the two lists. This is turning out to be very
slow.

Any guidance greatfully received.
This is a fairly quick technique(timing was 04.6250000

seconds on my machinewithout display, thats pretty much on par with cmd.exe which was ~6.5 secsincluding listing):

string[] LoadFiles(string searchDirectory)
{
ArrayList fileList = new ArrayList();
ArrayList ignoreList = new ArrayList();
ignoreList.Add(searchDirectory + "\\Common Files");

string[] directorys =
System.IO.Directory.GetDirectories(searchDirector y);
string[] files;
foreach (string directory in directorys)
{
if (!ignoreList.Contains(directory))
{
files = LoadFiles(directory);
fileList.AddRange(files);
}
}
files = System.IO.Directory.GetFiles (searchDirectory,"*.exe"); fileList.AddRange(files);
files = System.IO.Directory.GetFiles (searchDirectory,"*.dll"); fileList.AddRange(files);

return (string[])fileList.ToArray(typeof(string));
}

For a more reusable method, you may want to push ignoreList out as aparameter(or a class member, depending on the situation) as well as perhapsproviding an extension list instead of hardcoding them.

Ben Fidge

.

Nov 15 '05 #3

"Ben Fidge" <be*******@btopenworld.com> wrote in message
news:08****************************@phx.gbl...
Daniele,

Thanks for the help. I should have specified my criteria
a little more clearly.

Basically, I've written a small backup utility that
allows the backing up of a web-sites content via a Web
Service running on that site.

The site is large (100,000+ aspx, xml, xsl etc files) and
the web service is taking too long to compile the list of
files to download.

The backup utility needs to specify filemasks for
including and excluding in the final result set. So,
hypothetically speaking, a backup set could be as follwos:

include inetpub\wwwroot\*.aspx recursively
include inetpub\wwwroot\*.xml recursively
include inetpub\wwwroot\*.xsl recursively
exclude inetpub\wwwroot\data\*.xml
exclude inetpub\wwwroot\data\savepage.aspx

Your method is certainly very fast, but is it possible to
tailor it for unlimited arbitrary include and exclude
filemasks?

The method I currently use is taking about 90 seconds to
execute on our web server and the Web Service call is
timing out, hence the need for the most efficient file
gathering algorithm.

Thanks for your time, you'll be on my Christmas Card list
if you can solve this one!!

Ben Fidge


I'm not to fond of this, but it seems to work to an extent. This code runs
across my C: drive(80k files) in 12 seconds. Perhaps filtering out to scan
each folder one time and using regex to match both includes and excludes
will be faster, but I don't have any more time to spend on this today, I
hope it helps:

string[] GetDirectories(string startingPoint, params string[]
searchPatterns)
{
ArrayList directories = new ArrayList();
if (startingPoint == "c:\\System Volume Information")
return (string[])directories.ToArray(typeof(string));
foreach (string directory in Directory.GetDirectories(startingPoint))
{
string[] results = GetDirectories(directory,searchPatterns);
if (directory == "c:\\System Volume Information")
continue;
foreach (string searchPattern in searchPatterns)
directories.Add(directory + "\\" + searchPattern);
directories.AddRange(results);
}
return (string[])directories.ToArray(typeof(string));
}
/// <summary>
/// Returns a fileset based on a list of include and exclude patterns
/// </summary>
/// <param name="include">An IList of wildcard patterns to include</param>
/// <param name="exclude">An IList of wildcard patterns to exclude</param>
/// <returns>An array of strings containing the resultant set</returns>
string[] LoadFiles(IList include, IList exclude)
{
ArrayList files = new ArrayList();
PrepareFilter(exclude);
foreach (string mask in include)
{
string directory = Path.GetDirectoryName(mask);
string fileMask = Path.GetFileName(mask);
string[] filesFound = Directory.GetFiles(directory,fileMask);
filesFound = Filter(filesFound,exclude);
files.AddRange(filesFound);
}

return (string[])files.ToArray(typeof(string));
}
/// <summary>
/// Filters out files based on a list of exclusions
/// </summary>
/// <param name="files">The files to filter</param>
/// <param name="exclude">An IList of patterns to use for exclusion
matching</param>
/// <returns>An array of strings that passed the filter</returns>
string[] Filter(string[] files, IList exclude)
{
ArrayList results = new ArrayList();
foreach (string result in files)
{
bool include = true;
foreach (string filter in exclude)
{

Regex regex;
regex = regexCache[filter] as Regex;

if (regex == null)
{
regex = PrepareFilter(filter,WildcardToRegex(filter));
}
if (regex.IsMatch(result))
{
include = false;
break;
}
}
if (include)
results.Add(result);
}
return (string[])results.ToArray(typeof(string));
}
/// <summary>
/// Prepares filtration by compiling and caching Regex statements
/// </summary>
/// <param name="exclude">A list of Wildcard patterns to prepare</param>
void PrepareFilter(IList patterns)
{
foreach (string path in patterns)
{
string result = WildcardToRegex(path);
PrepareFilter(path,result);
}
}
/// <summary>
/// Prepares and adds a single regex pattern to the regex cache
/// </summary>
/// <param name="name">the pattern name</param>
/// <param name="pattern">The pattern</param>
/// <returns>A Regex object to perform filtration for that
pattern</returns>
Regex PrepareFilter(string name, string pattern)
{
if (regexCache.Contains(name))
return regexCache[name] as Regex;
Regex regex = new Regex(pattern,RegexOptions.Compiled |
RegexOptions.IgnoreCase );
PrepareFilter(name,regex);
return regex;
}
/// <summary>
/// Adds a Regex to the regexcache.
/// </summary>
/// <param name="name">The regex name</param>
/// <param name="regex">The Regex object to cache</param>
/// <returns>the Regex passed in as regex, this is done simply to provide
return type compatability with PrepareFilter(string,string)</returns>
Regex PrepareFilter(string name, Regex regex)
{
if (regexCache.Contains(name))
return regexCache[name] as Regex;
regexCache.Add(name,regex);
return regex;
}
/// <summary>
/// Converts dos wildcard(*) string into regex in a generic manner
/// </summary>
/// <param name="pattern">The pattern to convert</param>
/// <returns>The converted regex string.</returns>
/// <remarks>
/// This method does not handle ? in patterns, if this support is needed
it should be added here.
/// </remarks>
string WildcardToRegex(string pattern)
{
pattern = pattern.Replace(@"\",@"\\");
pattern = pattern.Replace("^",@"\^");
pattern = pattern.Replace(".",@"\.");
pattern = pattern.Replace("*",".*");
pattern = pattern + "$";
return pattern;
}
IDictionary regexCache = new Hashtable();
Nov 15 '05 #4
Daniel,

Thanks for your time, I will give this a go.

Merry Christmas,

Ben
-----Original Message-----

"Ben Fidge" <be*******@btopenworld.com> wrote in message
news:08****************************@phx.gbl...
Daniele,

Thanks for the help. I should have specified my criteria a little more clearly.

Basically, I've written a small backup utility that
allows the backing up of a web-sites content via a Web
Service running on that site.

The site is large (100,000+ aspx, xml, xsl etc files) and the web service is taking too long to compile the list of files to download.

The backup utility needs to specify filemasks for
including and excluding in the final result set. So,
hypothetically speaking, a backup set could be as follwos:
include inetpub\wwwroot\*.aspx recursively
include inetpub\wwwroot\*.xml recursively
include inetpub\wwwroot\*.xsl recursively
exclude inetpub\wwwroot\data\*.xml
exclude inetpub\wwwroot\data\savepage.aspx

Your method is certainly very fast, but is it possible to tailor it for unlimited arbitrary include and exclude
filemasks?

The method I currently use is taking about 90 seconds to execute on our web server and the Web Service call is
timing out, hence the need for the most efficient file
gathering algorithm.

Thanks for your time, you'll be on my Christmas Card list if you can solve this one!!

Ben Fidge

I'm not to fond of this, but it seems to work to an

extent. This code runsacross my C: drive(80k files) in 12 seconds. Perhaps filtering out to scaneach folder one time and using regex to match both includes and excludeswill be faster, but I don't have any more time to spend on this today, Ihope it helps:

string[] GetDirectories(string startingPoint, params string[]searchPatterns)
{
ArrayList directories = new ArrayList();
if (startingPoint == "c:\\System Volume Information")
return (string[])directories.ToArray(typeof(string));
foreach (string directory in Directory.GetDirectories (startingPoint)) {
string[] results = GetDirectories (directory,searchPatterns); if (directory == "c:\\System Volume Information")
continue;
foreach (string searchPattern in searchPatterns)
directories.Add(directory + "\\" + searchPattern);
directories.AddRange(results);
}
return (string[])directories.ToArray(typeof(string));
}
/// <summary>
/// Returns a fileset based on a list of include and exclude patterns /// </summary>
/// <param name="include">An IList of wildcard patterns to include</param> /// <param name="exclude">An IList of wildcard patterns to exclude</param> /// <returns>An array of strings containing the resultant set</returns> string[] LoadFiles(IList include, IList exclude)
{
ArrayList files = new ArrayList();
PrepareFilter(exclude);
foreach (string mask in include)
{
string directory = Path.GetDirectoryName(mask);
string fileMask = Path.GetFileName(mask);
string[] filesFound = Directory.GetFiles (directory,fileMask); filesFound = Filter(filesFound,exclude);
files.AddRange(filesFound);
}

return (string[])files.ToArray(typeof(string));
}
/// <summary>
/// Filters out files based on a list of exclusions
/// </summary>
/// <param name="files">The files to filter</param>
/// <param name="exclude">An IList of patterns to use for exclusionmatching</param>
/// <returns>An array of strings that passed the filter</returns> string[] Filter(string[] files, IList exclude)
{
ArrayList results = new ArrayList();
foreach (string result in files)
{
bool include = true;
foreach (string filter in exclude)
{

Regex regex;
regex = regexCache[filter] as Regex;

if (regex == null)
{
regex = PrepareFilter(filter,WildcardToRegex (filter)); }
if (regex.IsMatch(result))
{
include = false;
break;
}
}
if (include)
results.Add(result);
}
return (string[])results.ToArray(typeof(string));
}
/// <summary>
/// Prepares filtration by compiling and caching Regex statements /// </summary>
/// <param name="exclude">A list of Wildcard patterns to prepare</param> void PrepareFilter(IList patterns)
{
foreach (string path in patterns)
{
string result = WildcardToRegex(path);
PrepareFilter(path,result);
}
}
/// <summary>
/// Prepares and adds a single regex pattern to the regex cache /// </summary>
/// <param name="name">the pattern name</param>
/// <param name="pattern">The pattern</param>
/// <returns>A Regex object to perform filtration for thatpattern</returns>
Regex PrepareFilter(string name, string pattern)
{
if (regexCache.Contains(name))
return regexCache[name] as Regex;
Regex regex = new Regex(pattern,RegexOptions.Compiled |RegexOptions.IgnoreCase );
PrepareFilter(name,regex);
return regex;
}
/// <summary>
/// Adds a Regex to the regexcache.
/// </summary>
/// <param name="name">The regex name</param>
/// <param name="regex">The Regex object to cache</param> /// <returns>the Regex passed in as regex, this is done simply to providereturn type compatability with PrepareFilter (string,string)</returns> Regex PrepareFilter(string name, Regex regex)
{
if (regexCache.Contains(name))
return regexCache[name] as Regex;
regexCache.Add(name,regex);
return regex;
}
/// <summary>
/// Converts dos wildcard(*) string into regex in a generic manner /// </summary>
/// <param name="pattern">The pattern to convert</param> /// <returns>The converted regex string.</returns>
/// <remarks>
/// This method does not handle ? in patterns, if this support is neededit should be added here.
/// </remarks>
string WildcardToRegex(string pattern)
{
pattern = pattern.Replace(@"\",@"\\");
pattern = pattern.Replace("^",@"\^");
pattern = pattern.Replace(".",@"\.");
pattern = pattern.Replace("*",".*");
pattern = pattern + "$";
return pattern;
}
IDictionary regexCache = new Hashtable();
.

Nov 15 '05 #5

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

10 posts views Thread by pembed2003 | last post: by
4 posts views Thread by Tarique Jawed | last post: by
28 posts views Thread by joshc | last post: by
12 posts views Thread by Divick | last post: by
4 posts views Thread by Dameon | last post: by
14 posts views Thread by S | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.