
Which is the better option for directory hashing to store a large number of image files?

Hi All,

I am not sure if this is the right place to ask this question, but I am
very sure you have faced this problem. I have already found some
posts related to this, but not the answer I am looking for.

My problem is that I have to upload images and store them. I am using
the filesystem for that.

The setup is something like this: there will be items/groups/users, and each
can have up to 6 images, each of which needs to be scaled to 4 different
sizes, i.e. every item can have up to 24 images of varying sizes.

Now, the standard way of storing these files would be to store them in
subdirectories based on some hash.

My partial solution is to split the four types of files into four
fixed base folders, one for each dimension.

Since the filename is in the format "YmdHis", I decided to use the directory
structure Y/m/d/<filename>,
but I realize that even this could be inefficient.

So now I am thinking about going deeper by creating a
Y/m/d/H/i/<filename> directory structure.

Now my question is how to go about creating subdirectories below the base
folders: will my scheme hold, or should I use an md5 hash of the filename,
as suggested by others, then take 2-3 characters, create one
or two levels of directory structure, and store the files there?

Regards,
Amit

Sep 17 '07 #1
8 Replies


I use databases for this.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Sep 17 '07 #2

I personally use something like /images/front/controller/row_id/ -
that way I only need to store the name of the image.

Sep 17 '07 #3

> Moral: Programming, as well as life, is not always an either-or.
> Sometimes a compromise/hybrid is the best solution.
>
> --
> Shelly

ahhh, but shelly, the thing i like most is that in programming, it is always
either/or: on/off. to say otherwise is to not know programming. the same
holds true for life. you either do or do not. any notions about the nobility
or superiority of human action in his contemplation of life are simply
false, save the fact that there is none of either. do or do not is all that
remains and that directly linked to his own survivability - as is the
impetus of all animals.

compromise. chuckle.
Sep 17 '07 #4

"Steve" <no****@example.com> wrote in message
news:3r**************@newsfe05.lga...
>> Moral: Programming, as well as life, is not always an either-or.
>> Sometimes a compromise/hybrid is the best solution.
>>
>> --
>> Shelly
>
> ahhh, but shelly, the thing i like most is that in programming, it is
> always either/or: on/off. to say otherwise is to not know programming. the
> same holds true for life. you either do or do not. any notions about the
> nobility or superiority of human action in his contemplation of life are
> simply false, save the fact that there is none of either. do or do not is
> all that remains and that directly linked to his own survivability - as is
> the impetus of all animals.
>
> compromise. chuckle.

So, I take it that if you were fed a meal which is a wonderfully prepared, 10
pound, filet mignon you either (a) eat all of it or (b) eat none of it?

or,

If you are faced with a court appearance for excessive speeding in your car
you should either be acquitted or should get the death sentence?

On one project about 25 years ago I needed to modify a very large
application that was written in Fortran. I needed dynamic allocation.
According to you, I should have been faced with two choices. One was to
emulate dynamic allocation by setting aside a large part of memory and doing
my own allocation from that memory heap. A second would have been to
totally rewrite that entire (largggggeeeee) application in C. I chose a
"compromise". I wrote a small module in C and used that in conjunction with
the rest of the Fortran code.

The point here is that there are two extremes in handling his situation.
Either avoid a database and just use the file system, or avoid the file
system and put all of the contents of the file into a blob field in the
database. Often, the better way is to use the database as a rapid search
engine for a file in the file system.
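
To make the hybrid idea concrete, here is a minimal sketch of a database used purely as a lookup index, with the image bytes staying on the file system. It is only an illustration under assumptions that are not from this thread: SQLite as the database, a hypothetical `images` table, hypothetical size names such as 'thumb', and a two-level md5 directory layout.

```python
import hashlib
import sqlite3
from pathlib import PurePosixPath


def open_index(db_path=":memory:"):
    """Open (and create if needed) the hypothetical lookup table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS images (
               item_id INTEGER NOT NULL,
               slot    INTEGER NOT NULL,  -- 1..6, one per uploaded image
               size    TEXT    NOT NULL,  -- assumed names, e.g. 'thumb'
               path    TEXT    NOT NULL,  -- relative to the image root
               PRIMARY KEY (item_id, slot, size)
           )"""
    )
    return conn


def register_image(conn, item_id, slot, size, filename):
    """Record where a file lives; only the path goes into the database."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Assumed layout: <size>/<2 hex chars>/<2 hex chars>/<filename>
    path = str(PurePosixPath(size, digest[:2], digest[2:4], filename))
    conn.execute(
        "INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?)",
        (item_id, slot, size, path),
    )
    conn.commit()
    return path


def find_image(conn, item_id, slot, size):
    """Return the stored path, or None if no such image is registered."""
    row = conn.execute(
        "SELECT path FROM images WHERE item_id = ? AND slot = ? AND size = ?",
        (item_id, slot, size),
    ).fetchone()
    return row[0] if row else None
```

The application would then open the returned path itself, so the database never has to hold the image contents.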

I guess you aren't married? I have been for over four decades. Believe me,
"all or nothing" just doesn't work. Even with a switch for the lights you
can always add a dimmer.

By the way, I have been programming for over forty years. We are not
talking ones and zeros, true or false, here. We are talking design
philosophy -- and that is usually a compromise among various alternatives to
achieve the most efficient results in the shortest time for the least cost.

Shelly
Sep 17 '07 #5

Splitting the files by date (down to whatever resolution) is potentially still
susceptible to a large number arriving at the same time, and ending up with a
large number of files in a single directory. If the goal is to spread the files
across a number of directories, then you probably want the value that
determines the directories to be approximately randomly distributed, and to
have a bounded and reasonable number of possible directory names.
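
As a rough illustration of that point, the sketch below simulates a burst of uploads that all land in the same minute and compares the fullest directory under a Y/m/d/H/i layout with the fullest directory under an md5-prefix layout. The burst size, the start time and the "YmdHis plus counter" file naming are assumptions made up for the example, not something from the original setup.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta


def date_dir(ts):
    """Directory derived from the upload time, down to the minute (Y/m/d/H/i)."""
    return ts.strftime("%Y/%m/%d/%H/%M")


def hash_dir(filename, chars=2, levels=2):
    """Directory derived from the leading hex characters of the name's md5."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return "/".join(digest[i * chars:(i + 1) * chars] for i in range(levels))


def simulate_burst(n_uploads, start=datetime(2007, 9, 17, 0, 9, 0)):
    """Return (fullest date directory, fullest hash directory) for a burst
    of n_uploads files that all arrive within one minute."""
    date_counts, hash_counts = Counter(), Counter()
    for i in range(n_uploads):
        # Spread the uploads over less than 59 seconds so they share a minute.
        ts = start + timedelta(seconds=59 * i / n_uploads)
        # Hypothetical naming: "YmdHis" plus a counter to keep names unique.
        filename = f"{ts.strftime('%Y%m%d%H%M%S')}_{i}.jpg"
        date_counts[date_dir(ts)] += 1
        hash_counts[hash_dir(filename)] += 1
    return max(date_counts.values()), max(hash_counts.values())
```

With the date layout every file of the burst ends up in the same directory, while the hash layout spreads the same files over up to 65,536 directories.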

md5 of some property (name? or even contents?) likely fits this reasonably
well. The number of bytes you use for subdirectories depends on however many
images you have. If you don't actually expose the
hash-used-for-storage-directory in the URL, then you're free to re-hash the
images' directories if you end up needing more levels to split the directories
(if it was in the URL, then it would change the URLs of all your images, which
is something to be avoided).
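
A minimal sketch of this, assuming the hash is taken over the file name (hashing the contents would work the same way) and that the function names are just placeholders: `shard_path` builds the storage path, and `reshard_moves` lists the renames needed if more levels are added later. Because the extra level only extends the same md5 prefix, each file moves into a subdirectory of the directory it is already in.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(base, filename, levels=2, chars=2):
    """Build base/<hex>/<hex>/.../filename from the md5 of the file name."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    if levels * chars > len(digest):
        raise ValueError("an md5 hex digest has only 32 characters")
    parts = [digest[i * chars:(i + 1) * chars] for i in range(levels)]
    return PurePosixPath(base, *parts, filename)


def reshard_moves(base, filenames, old_levels, new_levels, chars=2):
    """Yield (old_path, new_path) pairs for files whose location changes
    when the number of directory levels is changed."""
    for name in filenames:
        old = shard_path(base, name, old_levels, chars)
        new = shard_path(base, name, new_levels, chars)
        if old != new:
            yield old, new
```

If the URL is built from something stable such as an image id and the path is computed on the server, the pairs from `reshard_moves` can be applied with ordinary renames without changing any URL.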

Substrings of just the name may work as well, although there could be a bias
to particular letters or numbers depending on where the names come from and
what language they're in.

There's more than one way to do it, as ever, and the way to go depends on what
exactly you're doing. Have you checked whether your initial assumption is true,
though? Whilst "large number of entries in a directory is slow" is true in many
filesystems, it's not a universal truth. What's the threshold for your
filesystem, and are you planning on getting anywhere close to it in the
foreseeable future? (after overestimating it a bit to be safely pessimistic)
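
One way to check the assumption on your own filesystem is to measure it. The sketch below is a rough, hypothetical micro-benchmark: it fills a temporary directory with empty files and times `os.stat()` on a sample of them. It only measures lookups by known name, the numbers depend heavily on the filesystem and on caching, and the temporary directory may not live on the same filesystem as your image store, so treat the results as an indication only.

```python
import os
import tempfile
import time


def time_lookups(n_files, n_lookups=1000):
    """Create n_files empty files in one temporary directory and return the
    average number of seconds one os.stat() by name takes."""
    with tempfile.TemporaryDirectory() as directory:
        for i in range(n_files):
            open(os.path.join(directory, f"{i:08d}.jpg"), "w").close()
        # Sample names evenly across the directory.
        step = max(1, n_files // n_lookups)
        targets = [
            os.path.join(directory, f"{i:08d}.jpg")
            for i in range(0, n_files, step)
        ]
        started = time.perf_counter()
        for path in targets:
            os.stat(path)
        return (time.perf_counter() - started) / len(targets)
```

Calling `time_lookups` with, say, 1,000, 10,000 and 100,000 files and comparing the averages shows whether lookups degrade noticeably as the directory grows.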

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Sep 17 '07 #6

Shelly wrote:
> "Steve" <no****@example.com> wrote in message
> news:3r**************@newsfe05.lga...
>>> Moral: Programming, as well as life, is not always an either-or.
>>> Sometimes a compromise/hybrid is the best solution.
>>>
>>> --
>>> Shelly
>>
>> ahhh, but shelly, the thing i like most is that in programming, it is
>> always either/or: on/off. to say otherwise is to not know programming. the
>> same holds true for life. you either do or do not. any notions about the
>> nobility or superiority of human action in his contemplation of life are
>> simply false, save the fact that there is none of either. do or do not is
>> all that remains and that directly linked to his own survivability - as is
>> the impetus of all animals.
>>
>> compromise. chuckle.
>
> So, I take it that if you were fed a meal which is a wonderfully prepared, 10
> pound, filet mignon you either (a) eat all of it or (b) eat none of it?

(a). (b) is not even an option!

> or,
>
> If you are faced with a court appearance for excessive speeding in your car
> you should either be acquitted or should get the death sentence?

No, but I should either be acquitted or found guilty. And if found
guilty, I should receive the appropriate punishment. The death sentence
is not appropriate for all infractions.

> On one project about 25 years ago I needed to modify a very large
> application that was written in Fortran. I needed dynamic allocation.
> According to you, I should have been faced with two choices. One was to
> emulate dynamic allocation by setting aside a large part of memory and doing
> my own allocation from that memory heap. A second would have been to
> totally rewrite that entire (largggggeeeee) application in C. I chose a
> "compromise". I wrote a small module in C and used that in conjunction with
> the rest of the Fortran code.

What is your point?

> The point here is that there are two extremes in handling his situation.
> Either avoid a database and just use the file system, or avoid the file
> system and put all of the contents of the file into a blob field in the
> database. Often, the better way is to use the database as a rapid search
> engine for a file in the file system.

Sure, there are extremes. But have you actually tried storing the data
in a blob field and tuning your database for it? I thought not. Access
is quite fast - virtually always faster than a mix of the two, because
you don't have to make both a database and a file system call. Less
overhead - the database returns the blob just as effectively as it does
a file name.

> I guess you aren't married? I have been for over four decades. Believe me,
> "all or nothing" just doesn't work. Even with a switch for the lights you
> can always add a dimmer.

Sure it does. If I don't let my wife have her own way ALL the time, I
get "nothing". :-)

> By the way, I have been programming for over forty years. We are not
> talking ones and zeros, true or false, here. We are talking design
> philosophy -- and that is usually a compromise among various alternatives to
> achieve the most efficient results in the shortest time for the least cost.
>
> Shelly

Sure we are. Everything in programming comes down to ones and zeros.
It's just the approach to getting there that differs.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Sep 17 '07 #7

Hi Andy,

Thanks for the sensible reply.
We need to upload around 2.5 million images as seed data for the
website. We are using a Linux system (CentOS), so any ideas what would be
a reasonable number of files per directory?

And unless thousands of users want to upload images at the same time, I
am sure it will never happen that there are a large number of files in
one directory every minute.

Anyway, I have decided to go with MD5, as a 3/3 letter combination gives
me a good spread for a long time :)
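
A quick back-of-the-envelope check of that spread, assuming "3/3" means two directory levels of three hex characters each and that the 2.5 million files are evenly distributed (both are assumptions; the helper names are made up for the example):

```python
def leaf_directories(chars, levels, alphabet=16):
    """Number of possible leaf directories for hex-prefix sharding."""
    return (alphabet ** chars) ** levels


def average_files_per_leaf(n_files, chars, levels):
    """Average files per leaf directory under an even distribution."""
    return n_files / leaf_directories(chars, levels)


# Two levels of three hex characters: 4096 * 4096 possible leaves.
print(leaf_directories(3, 2))                            # 16777216
print(round(average_files_per_leaf(2_500_000, 3, 2), 3))  # 0.149
# For comparison, two levels of two hex characters.
print(leaf_directories(2, 2))                            # 65536
print(round(average_files_per_leaf(2_500_000, 2, 2), 1))  # 38.1
```

Under those assumptions a 3/3 layout has far more possible leaf directories than files, so most leaves would hold one file or none, whereas 2/2 would average roughly 38 files per leaf.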

Sep 18 '07 #8

On Tue, 18 Sep 2007 05:26:12 -0000, theCancerus <th*********@gmail.com> wrote:
> On Sep 17, 11:29 pm, Andy Hassall <a...@andyh.co.uk> wrote:
>> There's more than one way to do it, as ever, and the way to go depends on what
>> exactly you're doing. Have you checked whether your initial assumption is true,
>> though? Whilst "large number of entries in a directory is slow" is true in many
>> filesystems, it's not a universal truth. What's the threshold for your
>> filesystem, and are you planning on getting anywhere close to it in the
>> foreseeable future? (after overestimating it a bit to be safely pessimistic)
>
> Thanks for the sensible reply.
> We need to upload around 2.5 million images as seed data for the
> website. We are using a Linux system (CentOS), so any ideas what would be
> a reasonable number of files per directory?

So, you're probably using the ext3 filesystem? This has an option for "hashed
b-tree" storage of directory entries, which helps with the
large-number-of-files issue (at least, the relevant part of it - obviously it
still takes a while to iterate through them all, but accessing one file that
you already know the filename of doesn't have the same problems as older
filesystems that do a linear scan every time).

On my CentOS system:

# tune2fs -l /dev/mapper/VolGroup00-LogVol00 | grep features
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

The "dir_index" option says it's turned on for me, and I didn't change it, so
it must be the default.

I don't know what the limits of this are, though.
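
The same check can be scripted. This is a small sketch, assuming the `tune2fs -l` output contains a single "Filesystem features:" line as shown above; `parse_features` and `has_dir_index` are made-up helper names, and running `tune2fs` normally needs root privileges.

```python
import subprocess


def parse_features(tune2fs_output):
    """Return the set of feature names from `tune2fs -l` output."""
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_dir_index(device):
    """Run `tune2fs -l <device>` and report whether dir_index is enabled."""
    result = subprocess.run(
        ["tune2fs", "-l", device],
        capture_output=True,
        text=True,
        check=True,
    )
    return "dir_index" in parse_features(result.stdout)
```

For example, `has_dir_index("/dev/mapper/VolGroup00-LogVol00")` would correspond to the command above on that system.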

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Sep 18 '07 #9
