FileSystem search algorithm needed!

Hi

What is the most efficient way to gather a list of files
under a given folder recursively with the ability to
specify exclusion file masks?

For example, say I wanted to get a list of all files
under \Program Files and its subdirectories that meet
the following criteria:

Include *.exe recursively
Include *.dll recursively
Exclude Common Files\*.* recursively

I.e., I want all .exe and .dll files under Program Files and its
subdirectories, with the exception of any under \Program
Files\Common Files.

At the moment, I'm gathering all included files into one
ArrayList and all excluded files into another, and then
removing every excluded entry from the included list. This
is turning out to be very slow.
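
Roughly, that last step looks like this (a simplified sketch of what
I'm doing, the method name is just for illustration):

// Keep every included file that doesn't also appear in the excluded list.
// ArrayList.Contains is a linear scan, so this works out to roughly
// O(included.Count * excluded.Count) string comparisons.
string[] Subtract(ArrayList included, ArrayList excluded)
{
    ArrayList result = new ArrayList();
    foreach (string file in included)
    {
        if (!excluded.Contains(file))
            result.Add(file);
    }
    return (string[])result.ToArray(typeof(string));
}

I suspect those Contains calls, plus having to walk the excluded
folders at all, are where the time goes.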

Any guidance gratefully received.

Ben Fidge
Nov 15 '05 #1

"Ben Fidge" <be*******@btopenworld.com> wrote in message
news:09****************************@phx.gbl...
This is a fairly quick technique (timing was 4.625 seconds on my machine
without displaying the output; that's pretty much on par with cmd.exe, which
took ~6.5 seconds including the listing):

string[] LoadFiles(string searchDirectory)
{
    ArrayList fileList = new ArrayList();
    ArrayList ignoreList = new ArrayList();
    ignoreList.Add(searchDirectory + "\\Common Files");

    // recurse into every subdirectory that isn't on the ignore list
    string[] directorys = System.IO.Directory.GetDirectories(searchDirectory);
    string[] files;
    foreach (string directory in directorys)
    {
        if (!ignoreList.Contains(directory))
        {
            files = LoadFiles(directory);
            fileList.AddRange(files);
        }
    }

    // then pick up the matching files in the current directory
    files = System.IO.Directory.GetFiles(searchDirectory, "*.exe");
    fileList.AddRange(files);
    files = System.IO.Directory.GetFiles(searchDirectory, "*.dll");
    fileList.AddRange(files);

    return (string[])fileList.ToArray(typeof(string));
}
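
Calling it is just (adjust the root path to suit):

string[] files = LoadFiles("C:\\Program Files");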

For a more reusable method, you may want to push ignoreList out as a
parameter (or a class member, depending on the situation), and perhaps
provide an extension list instead of hardcoding the patterns.
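
Something like this, roughly (an untested sketch just to show the shape;
the parameter names are illustrative):

string[] LoadFiles(string searchDirectory, IList ignoreList, string[] extensions)
{
    ArrayList fileList = new ArrayList();

    // recurse into every subdirectory that isn't on the caller-supplied ignore list
    foreach (string directory in System.IO.Directory.GetDirectories(searchDirectory))
    {
        if (!ignoreList.Contains(directory))
            fileList.AddRange(LoadFiles(directory, ignoreList, extensions));
    }

    // one GetFiles call per extension pattern instead of hardcoding *.exe / *.dll
    foreach (string extension in extensions)
        fileList.AddRange(System.IO.Directory.GetFiles(searchDirectory, extension));

    return (string[])fileList.ToArray(typeof(string));
}

You'd call it with something like LoadFiles("C:\\Program Files", ignore,
new string[] {"*.exe", "*.dll"}), where ignore is an ArrayList holding the
full path of the Common Files folder.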

Nov 15 '05 #2
Daniele,

Thanks for the help. I should have specified my criteria
a little more clearly.

Basically, I've written a small backup utility that
allows backing up a web site's content via a Web
Service running on that site.

The site is large (100,000+ .aspx, .xml, .xsl, etc. files) and
the web service is taking too long to compile the list of
files to download.

The backup utility needs to specify file masks for
including and excluding in the final result set. So,
hypothetically speaking, a backup set could be as follows:

include inetpub\wwwroot\*.aspx recursively
include inetpub\wwwroot\*.xml recursively
include inetpub\wwwroot\*.xsl recursively
exclude inetpub\wwwroot\data\*.xml
exclude inetpub\wwwroot\data\savepage.aspx

Your method is certainly very fast, but is it possible to
tailor it for unlimited arbitrary include and exclude
file masks?

The method I currently use is taking about 90 seconds to
execute on our web server, and the Web Service call is
timing out, hence the need for a more efficient file-gathering
algorithm.

Thanks for your time, you'll be on my Christmas Card list
if you can solve this one!!

Ben Fidge


Nov 15 '05 #3

"Ben Fidge" <be*******@btopenworld.com> wrote in message
news:08****************************@phx.gbl...


I'm not too fond of this, but it seems to work to an extent. This code runs
across my C: drive (80k files) in 12 seconds. Scanning each folder only once
and using regex to match both the includes and the excludes would probably
be faster, but I don't have any more time to spend on this today. I hope it
helps:

// Expands each search pattern into one "<directory>\<pattern>" mask for every
// subdirectory under startingPoint (the System Volume Information folder is skipped).
string[] GetDirectories(string startingPoint, params string[] searchPatterns)
{
    ArrayList directories = new ArrayList();
    if (startingPoint == "c:\\System Volume Information")
        return (string[])directories.ToArray(typeof(string));
    foreach (string directory in Directory.GetDirectories(startingPoint))
    {
        string[] results = GetDirectories(directory, searchPatterns);
        if (directory == "c:\\System Volume Information")
            continue;
        foreach (string searchPattern in searchPatterns)
            directories.Add(directory + "\\" + searchPattern);
        directories.AddRange(results);
    }
    return (string[])directories.ToArray(typeof(string));
}
/// <summary>
/// Returns a fileset based on a list of include and exclude patterns
/// </summary>
/// <param name="include">An IList of wildcard patterns to include</param>
/// <param name="exclude">An IList of wildcard patterns to exclude</param>
/// <returns>An array of strings containing the resultant set</returns>
string[] LoadFiles(IList include, IList exclude)
{
    ArrayList files = new ArrayList();
    PrepareFilter(exclude);
    foreach (string mask in include)
    {
        string directory = Path.GetDirectoryName(mask);
        string fileMask = Path.GetFileName(mask);
        string[] filesFound = Directory.GetFiles(directory, fileMask);
        filesFound = Filter(filesFound, exclude);
        files.AddRange(filesFound);
    }

    return (string[])files.ToArray(typeof(string));
}
/// <summary>
/// Filters out files based on a list of exclusions
/// </summary>
/// <param name="files">The files to filter</param>
/// <param name="exclude">An IList of patterns to use for exclusion matching</param>
/// <returns>An array of strings that passed the filter</returns>
string[] Filter(string[] files, IList exclude)
{
    ArrayList results = new ArrayList();
    foreach (string result in files)
    {
        bool include = true;
        foreach (string filter in exclude)
        {
            Regex regex = regexCache[filter] as Regex;
            if (regex == null)
            {
                regex = PrepareFilter(filter, WildcardToRegex(filter));
            }
            if (regex.IsMatch(result))
            {
                include = false;
                break;
            }
        }
        if (include)
            results.Add(result);
    }
    return (string[])results.ToArray(typeof(string));
}
/// <summary>
/// Prepares filtration by compiling and caching Regex statements
/// </summary>
/// <param name="patterns">A list of wildcard patterns to prepare</param>
void PrepareFilter(IList patterns)
{
    foreach (string path in patterns)
    {
        string result = WildcardToRegex(path);
        PrepareFilter(path, result);
    }
}
/// <summary>
/// Prepares and adds a single regex pattern to the regex cache
/// </summary>
/// <param name="name">The pattern name</param>
/// <param name="pattern">The pattern</param>
/// <returns>A Regex object to perform filtration for that pattern</returns>
Regex PrepareFilter(string name, string pattern)
{
    if (regexCache.Contains(name))
        return regexCache[name] as Regex;
    Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    PrepareFilter(name, regex);
    return regex;
}
/// <summary>
/// Adds a Regex to the regex cache.
/// </summary>
/// <param name="name">The regex name</param>
/// <param name="regex">The Regex object to cache</param>
/// <returns>The Regex passed in as regex; this is done simply to provide
/// return type compatibility with PrepareFilter(string, string)</returns>
Regex PrepareFilter(string name, Regex regex)
{
    if (regexCache.Contains(name))
        return regexCache[name] as Regex;
    regexCache.Add(name, regex);
    return regex;
}
/// <summary>
/// Converts a DOS wildcard (*) pattern into a regex in a generic manner
/// </summary>
/// <param name="pattern">The pattern to convert</param>
/// <returns>The converted regex string.</returns>
/// <remarks>
/// This method does not handle ? in patterns; if that support is needed
/// it should be added here.
/// </remarks>
string WildcardToRegex(string pattern)
{
    pattern = pattern.Replace(@"\", @"\\");
    pattern = pattern.Replace("^", @"\^");
    pattern = pattern.Replace(".", @"\.");
    pattern = pattern.Replace("*", ".*");
    pattern = pattern + "$";
    return pattern;
}
IDictionary regexCache = new Hashtable();
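
To wire that up for your backup set, the call might look something like this
(untested; the paths come straight from your post and the drive/root is just a
guess; GetDirectories is the helper above, which expands each pattern into one
mask per subdirectory so the includes effectively recurse):

ArrayList include = new ArrayList();
include.Add("c:\\inetpub\\wwwroot\\*.aspx");
include.Add("c:\\inetpub\\wwwroot\\*.xml");
include.Add("c:\\inetpub\\wwwroot\\*.xsl");
// add the same masks for every subdirectory under wwwroot
include.AddRange(GetDirectories("c:\\inetpub\\wwwroot", "*.aspx", "*.xml", "*.xsl"));

ArrayList exclude = new ArrayList();
exclude.Add("c:\\inetpub\\wwwroot\\data\\*.xml");
exclude.Add("c:\\inetpub\\wwwroot\\data\\savepage.aspx");

string[] backupSet = LoadFiles(include, exclude);

The exclude patterns get turned into regexes by WildcardToRegex and are matched
against the full paths returned by GetFiles, so anything under the data folder
that matches one of them is dropped from the result.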
Nov 15 '05 #4
Daniel,

Thanks for your time, I will give this a go.

Merry Christmas,

Ben
Nov 15 '05 #5
