Storing file information in memory

I'm writing a command line utility to move some files. I'm dealing with
thousands of files and I was wondering if anyone had any suggestions.

This is what I have currently:

$arrayVirtualFile =
array( 'filename'=>'filename',
'basename'=>'filename.ext',
'extension'=>'ext',
'size'=>0,
'dirname'=>'',
'uxtimestamp'=>'');

I then loop through a directory and for each file I populate the $arrayVirtualFile
and add it to $arrayOfVirtualFiles.
A directory of ~2500 files takes up about ~1.7 MB of memory when I run
the script.
Anyone have any suggestions as to how to take up less space?

Thanks!!

Nov 15 '07 #1

"deciacco" <a@awrote in message
news:c6******************************@ghytred.com. ..
I'm writing a command line utility to move some files. I'm dealing with
thousands of files and I was wondering if anyone had any suggestions.

This is what I have currently:

$arrayVirtualFile =
array( 'filename'=>'filename',
'basename'=>'filename.ext',
'extension'=>'ext',
'size'=>0,
'dirname'=>'',
'uxtimestamp'=>'');

I then loop through a directory and for each file I populate the
$arrayVirtualFile
and add it to $arrayOfVirtualFiles.
A directory of ~2500 files takes up about ~1.7 MB of memory when I run
the script.
Anyone have any suggestions as to how to take up less space?
well, that all depends what you're doing with that information. plus, your
array structure is a moot point. why not just store the file names in an
array. when you need all that info, just use the pathinfo() function. with
just that, so far you have the file name, basename, extension, path...all
you need now is to call fstat() to get the size and the touch time. that
should knock down your memory consumption monumentally. plus, using pathinfo
and fstat will give you a bunch more information than your current
structure.

so, store minimally what you need. then use functions to get the info when
you need it. but again, you should really define what you're doing this all
for...as in, once you have that info, what are you doing?
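For illustration, a minimal sketch of that idea (store only the paths, look up
the details on demand with pathinfo() and stat()); the directory path here is
just a placeholder, not anything from the original posts:

<?php
// Keep only the path strings in memory; derive everything else when needed.
function listPaths($dir)
{
    $paths = array();
    $entries = scandir($dir);
    if ($entries === false) { return $paths; }
    foreach ($entries as $entry) {
        $full = $dir . '/' . $entry;
        if (is_file($full)) {
            $paths[] = $full;      // store only the path string
        }
    }
    return $paths;
}

function fileDetails($path)
{
    // pathinfo() supplies dirname/basename/extension/filename;
    // stat() supplies size and timestamps, computed only when asked for.
    $info = pathinfo($path);
    $stat = stat($path);
    $info['size'] = $stat['size'];
    $info['uxtimestamp'] = $stat['mtime'];
    return $info;
}

$paths = listPaths('/some/dir');     // placeholder directory
if (!empty($paths)) {
    print_r(fileDetails($paths[0]));
}
?>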
Nov 15 '07 #2
thanks for the reply steve...

basically, i want to collect the file information into memory so that I can
then do analysis, like compare file times and sizes. it's much faster to do
this in memory than to do it from disk. should have mentioned this earlier
as you said...

"Steve" <no****@example.comwrote in message
news:W9***************@newsfe02.lga...
>
"deciacco" <a@awrote in message
news:c6******************************@ghytred.com. ..
>I'm writing a command line utility to move some files. I'm dealing with
thousands of files and I was wondering if anyone had any suggestions.

This is what I have currently:

$arrayVirtualFile =
array( 'filename'=>'filename',
'basename'=>'filename.ext',
'extension'=>'ext',
'size'=>0,
'dirname'=>'',
'uxtimestamp'=>'');

I then loop through a directory and for each file I populate the
$arrayVirtualFile
and add it to $arrayOfVirtualFiles.
A directory of ~2500 files takes up about ~1.7 MB of memory when I run
the script.
Anyone have any suggestions as to how to take up less space?

well, that all depends what you're doing with that information. plus, your
array structure is a moot point. why not just store the file names in an
array. when you need all that info, just use the pathinfo() function. with
just that, so far you have the file name, basename, extension, path...all
you need now is to call fstat() to get the size and the touch time. that
should knock down your memory consumption monumentally. plus, using
pathinfo and fstat will give you a bunch more information than your
current structure.

so, store minimally what you need. then use functions to get the info when
you need it. but again, you should really define what you're doing this
all for...as in, once you have that info, what are you doing?

Nov 15 '07 #3
deciacco wrote:
thanks for the reply steve...

basically, i want to collect the file information into memory so that I can
then do analysis, like compare file times and sizes. it's much faster to do
this in memory than to do it from disk. should have mentioned this earlier
as you said...
Why do you care how much memory it takes?

1.7MB is not very much.
Nov 16 '07 #4

"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net...
deciacco wrote:
>thanks for the reply steve...

basically, i want to collect the file information into memory so that I
can then do analysis, like compare file times and sizes. it's much faster
to do this in memory than to do it from disk. should have mentioned this
earlier as you said...

Why do you care how much memory it takes?

1.7MB is not very much.
why do you care if he cares?

solve the problem!
Nov 16 '07 #5
These days memory is not an issue, but that does not mean we shouldn't write
good, efficient code that utilizes memory well.

While 1.7MB is not much, that is what is generated when I look at ~2500
files. I have approximately 175000 files to look at and my script uses up
about 130MB. I was simply wondering if someone out there with more
experience had a better way of doing this that would utilize less memory.

"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net...
deciacco wrote:
>thanks for the reply steve...

basically, i want to collect the file information into memory so that I
can then do analysis, like compare file times and sizes. it's much faster
to do this in memory than to do it from disk. should have mentioned this
earlier as you said...

Why do you care how much memory it takes?

1.7MB is not very much.

Nov 16 '07 #6
deciacco wrote:
"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net...
>deciacco wrote:
>>thanks for the reply steve...

basically, i want to collect the file information into memory so that I
can then do analysis, like compare file times and sizes. it's much faster
to do this in memory than to do it from disk. should have mentioned this
earlier as you said...
Why do you care how much memory it takes?

1.7MB is not very much.

These days memory is not an issue, but that does not mean we shouldn't
write good, efficient code that utilizes memory well.
There is also something known as "premature optimization".
While 1.7MB is not much, that is what is generated when I look at
~2500 files. I have approximately 175000 files to look at and my
script uses up about 130MB. I was simply wondering if someone out
there with more experience, had a better way of doing this that would
utilize less memory.
(Top posting fixed)

How are you figuring your 1.7Mb? If you're just looking at how much
memory is being used by the process, for instance, there will be a lot
of other things in there, also - like your code.

1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
seems rather a bit large to me. But it also depends on just how much
you're storing in the array (i.e. how long are your path names).

I also wonder why you feel a need to store so much info in memory, but
I'm sure you have a good reason.

P.S. Please don't top post. Thanks.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Nov 16 '07 #7
"Jerry Stuckle" <js*******@attglobal.netwrote in message
news:Oa******************************@comcast.com. ..
deciacco wrote:
>"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net.. .
>>deciacco wrote:
thanks for the reply steve...
basically, i want to collect the file information into memory so
that I can then do analysis, like compare file times and sizes.
it's much faster to do this in memory than to do it from disk.
should have mentioned this earlier as you said...
Why do you care how much memory it takes?
1.7MB is not very much.
These days memory is not an issue, but that does not mean we shouldn't
write good, efficient code that utilizes memory well.
There is also something known as "premature optimization".
>While 1.7MB is not much, that is what is generated when I look at
~2500 files. I have approximately 175000 files to look at and my
script uses up about 130MB. I was simply wondering if someone out
there with more experience, had a better way of doing this that would
utilize less memory.
(Top posting fixed)
How are you figuring your 1.7Mb? If you're just looking at how much
memory is being used by the process, for instance, there will be a lot of
other things in there, also - like your code.
1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
seems rather a bit large to me. But it also depends on just how much
you're storing in the array (i.e. how long are your path names).
I also wonder why you feel a need to store so much info in memory, but I'm
sure you have a good reason.
P.S. Please don't top post. Thanks.
Jerry...

I use Outlook Express and it does top-posting by default. Didn't realize
top-posting was bad.

To answer your questions:

"Premature Optimization"
I first noticed this problem in my first program. It was running much slower
and taking up 5 times as much memory. I realized I needed to rethink my
code.

"Figuring Memory Use"
To get the amount of memory used, I take a reading with memory_get_usage()
at the start of the code in question and then take another reading at the
end of the snippet. I then take the difference and that should give me a
good idea of the amount of memory my code is utilizing.
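As an aside, a minimal sketch of that measurement technique; the directory
path and what gets stored in the array are only placeholders:

<?php
// Measure how much memory a snippet adds by sampling before and after.
$before = memory_get_usage();

$arrayOfVirtualFiles = array();
$files = glob('/some/dir/*');          // placeholder directory
if ($files === false) { $files = array(); }
foreach ($files as $file) {
    $arrayOfVirtualFiles[] = pathinfo($file);
}

$after = memory_get_usage();
echo 'approx. bytes used by the array: ' . ($after - $before) . "\n";
?>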

"Feel the Need"
The first post shows you an array of the type of data I store. This array
gets created for each file and added as an item to another array. In other
words, an array of arrays. As I mentioned in a follow-up posting, the reason
I'm doing this is because I want to do some analysis of file information,
like comparing file times and sizes from two separate directories. This is
much faster in memory than on disk.
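For illustration, a minimal sketch of that kind of in-memory comparison,
indexing each directory by basename and then comparing sizes and mtimes; the
paths are examples only, not the poster's actual directories:

<?php
// Build a small index per directory, then compare the two indexes in memory.
function indexDir($dir)
{
    $index = array();
    foreach (glob($dir . '/*') as $file) {
        if (is_file($file)) {
            $index[basename($file)] = array(
                'size'  => filesize($file),
                'mtime' => filemtime($file),
            );
        }
    }
    return $index;
}

$src = indexDir('/path/to/source');    // placeholder paths
$dst = indexDir('/path/to/target');

foreach ($src as $name => $info) {
    if (!isset($dst[$name])) {
        echo "$name is missing from the target\n";
    } elseif ($info['size'] !== $dst[$name]['size'] || $info['mtime'] > $dst[$name]['mtime']) {
        echo "$name differs in size or is newer\n";
    }
}
?>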


Nov 16 '07 #8

"deciacco" <a@awrote in message
news:Xr******************************@giganews.com ...
"Jerry Stuckle" <js*******@attglobal.netwrote in message
news:Oa******************************@comcast.com. ..
>deciacco wrote:
>>"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net. ..
deciacco wrote:
thanks for the reply steve...
basically, i want to collect the file information into memory so
that I can then do analysis, like compare file times and sizes.
it's much faster to do this in memory than to do it from disk.
should have mentioned this earlier as you said...
Why do you care how much memory it takes?
1.7MB is not very much.
These days memory is not an issue, but that does not mean we shouldn't
write good, efficient code that utilizes memory well.
There is also something known as "premature optimization".
>>While 1.7MB is not much, that is what is generated when I look at
~2500 files. I have approximately 175000 files to look at and my
script uses up about 130MB. I was simply wondering if someone out
there with more experience, had a better way of doing this that would
utilize less memory.
(Top posting fixed)
How are you figuring your 1.7Mb? If you're just looking at how much
memory is being used by the process, for instance, there will be a lot of
other things in there, also - like your code.
1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
seems rather a bit large to me. But it also depends on just how much
you're storing in the array (i.e. how long are your path names).
I also wonder why you feel a need to store so much info in memory, but
I'm sure you have a good reason.
P.S. Please don't top post. Thanks.

Jerry...

I use Outlook Express and it does top-posting by default. Didn't realize
top-posting was bad.
i use oe too. just hit ctrl+end immediately after hitting 'reply group'. a
usenet thread isn't like an email conversation where both parties already
know what was said in the previous correspondence. top posting in usenet
forces *everyone* to start reading a post from the bottom up. this is
particularly painful when in-line responses are made...you have to not only
read from the bottom up, but find the start of a response, read down to see
the in-line response(s), then scroll back up past the start of that post
again.

tons of other reasons. we just ask that you know and try to follow as best
you can what usenet considers uniform/standard netiquette.
To answer your questions:
<snip>
"Feel the Need"
The first post shows you an array of the type of data I store. This array
gets created for each file and added as an item to another array. In other
words, an array of arrays. As I mentioned in a follow-up posting, the
reason I'm doing this is because I want to do some analysis of file
information, like comparing file times and sizes from two separate
directories. This is much faster in memory than on disk.
ok, for the comparisons...consider speed and memory consumption. if you were
to get a list of file names, your memory consumption would be at its bare
minimum (almost). when doing the comparison, you can vastly improve your
performance *and* maintainability by iterating through the files, getting
the file info, putting that info into a db, and then running queries against
the table. the db will beat your php comparison algorithms any day of the
week. plus, sql is formalized...so everyone will understand how you are
making your comparisons.

the only way to get lower memory consumption would be to, during the process
of listing files, DON'T store the file but immediately put all the
information into the db at that point. that will be the theoretical best
performance and memory utilization combination there can be.
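A minimal sketch of that approach using SQLite via PDO; steve doesn't name a
particular database, and the table layout, labels and paths here are only
illustrative assumptions:

<?php
// Push file info into a database while scanning, then let SQL compare.
$db = new PDO('sqlite:files.db');
$db->exec('CREATE TABLE IF NOT EXISTS files (
    dir TEXT, name TEXT, size INTEGER, mtime INTEGER)');

$insert = $db->prepare('INSERT INTO files (dir, name, size, mtime) VALUES (?, ?, ?, ?)');

function scanInto($insert, $label, $dir)
{
    // Insert each file's info immediately instead of keeping it in an array.
    foreach (glob($dir . '/*') as $file) {
        if (is_file($file)) {
            $insert->execute(array($label, basename($file), filesize($file), filemtime($file)));
        }
    }
}

scanInto($insert, 'src', '/path/to/source');   // placeholder paths
scanInto($insert, 'dst', '/path/to/target');

// Files present in the source but missing from, or newer than, the target:
$sql = "SELECT s.name FROM files s
        LEFT JOIN files d ON d.dir = 'dst' AND d.name = s.name
        WHERE s.dir = 'src' AND (d.name IS NULL OR s.mtime > d.mtime)";
foreach ($db->query($sql) as $row) {
    echo $row['name'] . "\n";
}
?>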

btw, i posted this function in another group and someone asked today what
the hell it does. since it directly relates to what you're doing AND uses
pathinfo and fstat, which i mentioned to you briefly in this thread before,
i thought i'd post this example to help:

==============

<?php
// List the files under $path that match the given extension(s).
// Returns an array of file names, prefixed with $path when $combine is true.
function listFiles($path = '.', $extension = array(), $combine = false)
{
    $wd = getcwd();
    $path .= substr($path, -1) != '/' ? '/' : '';
    if (!chdir($path)){ return array(); }
    if (!$extension){ $extension = array('*'); }
    if (!is_array($extension)){ $extension = array($extension); }
    // Build a brace pattern such as "*.{jpg,png}" for glob().
    $extensions = '*.{' . implode(',', $extension) . '}';
    $files = glob($extensions, GLOB_BRACE);
    chdir($wd);
    if (!$files){ return array(); }
    $list = array();
    $path = $combine ? $path : '';
    foreach ($files as $file)
    {
        $list[] = $path . $file;
    }
    return $list;
}

$files = listFiles('c:/inetpub/wwwroot/images', 'jpg', true);
$images = array();
foreach ($files as $file)
{
    // pathinfo() supplies dirname/basename/extension; fstat() on an open
    // handle supplies size, times, etc. Merge them into one info array.
    $fileInfo = pathinfo($file);
    $handle = fopen($file, 'r');
    $fileInfo = array_merge($fileInfo, fstat($handle));
    fclose($handle);
    // fstat() duplicates its values under numeric keys 0-12; drop those.
    for ($i = 0; $i < 13; $i++){ unset($fileInfo[$i]); }
    echo '<pre>' . print_r($fileInfo, true) . '</pre>';
}
?>
Nov 16 '07 #9
deciacco wrote:
"Jerry Stuckle" <js*******@attglobal.netwrote in message
news:Oa******************************@comcast.com. ..
>deciacco wrote:
>>"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net. ..
deciacco wrote:
thanks for the reply steve...
basically, i want to collect the file information into memory so
that I can then do analysis, like compare file times and sizes.
it's much faster to do this in memory than to do it from disk.
should have mentioned this earlier as you said...
Why do you care how much memory it takes?
1.7MB is not very much.
These days memory is not an issue, but that does not mean we shouldn't
write good, efficient code that utilizes memory well.
There is also something known as "premature optimization".
>>While 1.7MB is not much, that is what is generated when I look at
~2500 files. I have approximately 175000 files to look at and my
script uses up about 130MB. I was simply wondering if someone out
there with more experience, had a better way of doing this that would
utilize less memory.
(Top posting fixed)
How are you figuring your 1.7Mb? If you're just looking at how much
memory is being used by the process, for instance, there will be a lot of
other things in there, also - like your code.
1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
seems rather a bit large to me. But it also depends on just how much
you're storing in the array (i.e. how long are your path names).
I also wonder why you feel a need to store so much info in memory, but I'm
sure you have a good reason.
P.S. Please don't top post. Thanks.

Jerry...

I use Outlook Express and it does top-posting by default. Didn't realize
top-posting was bad.
No problem. Recommendation - get Thunderbird. Much superior, and free :-)
To answer your questions:

"Premature Optimization"
I first noticed this problem in my first program. It was running much slower
and taking up 5 times as much memory. I realized I needed to rethink my
code.
OK, so you've identified a problem. Good.
"Figuring Memory Use"
To get the amount of memory used, I take a reading with memory_get_usage()
at the start of the code in question and then take another reading at the
end of the snippet. I then take the difference and that should give me a
good idea of the amount of memory my code is utilizing.
At last - someone who knows how to figure memory usage correctly! :-)

But I'm still confused why it would take almost 700 bytes per entry on
average. The array overhead shouldn't be *that* bad.

"Feel the Need"
The first post shows you an array of the type of data I store. This array
gets created for each file and added as an item to another array. In other
words, an array of arrays. As I mentioned in a follow-up posting, the reason
I'm doing this is because I want to do some analysis of file information,
like comparing file times and sizes from two separate directories. This is
much faster in memory than on disk.

Yes, it would be faster to do the comparisons in memory. However, you
also need to consider the amount of time it takes to create your arrays.
It isn't minor compared to some other operations.

When you're searching for files on the disk, as you get the file info,
the first one will take a while because the system has to (probably)
fetch the info from disk. But this caches several file entries, so the
next few will be relatively quick, until the system has to hit the disk
again (a big enough cache and that might never happen).

However, at the same time, if you just read one file from each directory
(assuming you're comparing the same file names) and compare them, then
go to the next file, the cache will still probably be valid, unless your
system is heavily loaded with high CPU and disk utilization. So in that
case your current algorithm probably will be slower than reading one at
a time and comparing.

Of course, if you're doing multiple compares, e.g. 'a' from the first
directory with 'x', 'y' and 'z' from the second directory, this wouldn't
be the case.
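For illustration, a minimal sketch of the one-file-at-a-time comparison
described above, which keeps nothing large in memory; the directory paths are
placeholders:

<?php
// Stat one file from each directory, compare, and move on.
$src = '/path/to/source';     // placeholder paths
$dst = '/path/to/target';

$entries = scandir($src);
if ($entries === false) { $entries = array(); }
foreach ($entries as $name) {
    $a = $src . '/' . $name;
    $b = $dst . '/' . $name;
    if (!is_file($a)) {
        continue;
    }
    if (!file_exists($b)) {
        echo "$name missing from target\n";
    } elseif (filesize($a) !== filesize($b) || filemtime($a) > filemtime($b)) {
        echo "$name differs\n";
    }
}
?>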
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Nov 16 '07 #10
Jerry Stuckle wrote:
deciacco wrote:
>"Jerry Stuckle" <js*******@attglobal.netwrote in message
news:Oa******************************@comcast.com ...
>>deciacco wrote:
"The Natural Philosopher" <a@b.cwrote in message
news:11***************@proxy00.news.clara.net.. .
deciacco wrote:
>thanks for the reply steve...
>basically, i want to collect the file information into memory so
>that I can then do analysis, like compare file times and sizes.
>it's much faster to do this in memory than to do it from disk.
>should have mentioned this earlier as you said...
Why do you care how much memory it takes?
1.7MB is not very much.
These days memory is not an issue, but that does not mean we shouldn't
write good, efficient code that utilizes memory well.
There is also something known as "premature optimization".
While 1.7MB is not much, that is what is generated when I look at
~2500 files. I have approximately 175000 files to look at and my
script uses up about 130MB. I was simply wondering if someone out
there with more experience, had a better way of doing this that would
utilize less memory.
(Top posting fixed)
How are you figuring your 1.7Mb? If you're just looking at how much
memory is being used by the process, for instance, there will be a
lot of other things in there, also - like your code.
1.7Mb for 2500 files comes out to just under 700 bytes per entry,
which seems rather a bit large to me. But it also depends on just
how much you're storing in the array (i.e. how long are your path
names).
I also wonder why you feel a need to store so much info in memory,
but I'm sure you have a good reason.
P.S. Please don't top post. Thanks.

Jerry...

I use Outlook Express and it does top-posting by default. Didn't
realize top-posting was bad.

No problem. Recommendation - get Thunderbird. Much superior, and free :-)
Coming to you from Thunderbird. I had given up on it since there was
some talk of discontinuing it / putting it on the back burner at Mozilla.
I got it installed and configured as a newsreader only. Pretty cool!
>
>To answer your questions:

"Premature Optimization"
I first noticed this problem in my first program. It was running much
slower and taking up 5 times as much memory. I realized I needed to
rethink my code.

OK, so you've identified a problem. Good.
Yeah, was a real eye opener too. I figured I didn't need to worry. It's
PHP after all, right!
>
>"Figuring Memory Use"
To get the amount of memory used, I take a reading with
memory_get_usage() at the start of the code in question and then take
another reading at the end of the snippet. I then take the difference
and that should give me a good idea of the amount of memory my code is
utilizing.

At last - someone who knows how to figure memory usage correctly! :-)
Thank you!
>
But I'm still confused why it would take almost 700 bytes per entry on
average. The array overhead shouldn't be *that* bad.
Hmm.. I will have to do some digging and try to pay closer attention.
Right now the focus was to simply get it down to a more reasonable
amount. The current solution is much faster, taking a few seconds instead
of a few minutes, and the memory use is much lower. If I stick to the
100,000 to 200,000 file range I will be more than fine.
>
>"Feel the Need"
The first post shows you an array of the type of data I store. This
array gets created for each file and added as an item to another
array. In other words, an array of arrays. As I mentioned in a
follow-up posting, the reason I'm doing this is because I want to do
some analysis of file information, like comparing file times and sizes
from two separate directories. This is much faster in memory than on
disk.


Yes, it would be faster to do the comparisons in memory. However, you
also need to consider the amount of time it takes to create your arrays.
It isn't minor compared to some other operations.

When you're searching for files on the disk, as you get the file info,
the first one will take a while because the system has to (probably)
fetch the info from disk. But this caches several file entries, so the
next few will be relatively quick, until the system has to hit the disk
again (a big enough cache and that might never happen).

However, at the same time, if you just read one file from each directory
(assuming you're comparing the same file names) and compare them, then
go to the next file, the cache will still probably be valid, unless your
system is heavily loaded with high CPU and disk utilization. So in that
case your current algorithm probably will be slower than reading one at
a time and comparing.

Of course, if you're doing multiple compares, e.g. 'a' from the first
directory with 'x', 'y' and 'z' from the second directory, this wouldn't
be the case.

Thanks to you and everyone else for the input on this post.
Nov 16 '07 #11
