Filesystem vs. Postgres for images

Michal Hlavac

Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Second question is, what kind of hardware I need for storing in DB. Now
I have Intel(R) Pentium(R) 4 CPU 1.70GHz with 512MB RAM and 120GB HDD.

thanx for advices...

miso

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #1

Subscribe Reply

8184

Development - multi.art.studio

Hello,

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.
but postres also stores its data on the filesystem.
maybe take a better FS like XFS (xfs is very nice and performes good),
imho other filesystems like reiser have some version-problems ;)
so storing images in postres as blob imho is not as fast as XFS, but
maybe you have to do some performance tests.
maybe you can store an index table for your images with path and
filenames - if did not so far, this should speed up your file-search.
also having a lot of ram, at least 1Gb for such a big portal, and scsi
is a good idea, if you want to boost it up, take a dual proz system with
scsi-raid and a lot of ram - and costs a lot of money also :((.
another tip is to activate proxy or httpd-cache and compression or other
caching stuff in php, java etc.

sorry for my broken english.
volker

Michal Hlavac wrote:
Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Second question is, what kind of hardware I need for storing in DB. Now
I have Intel(R) Pentium(R) 4 CPU 1.70GHz with 512MB RAM and 120GB HDD.

thanx for advices...

miso

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #2

Holger Klawitter

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

imho other filesystems like reiser have some version-problems ;)

Oh please. Reiser is as unstable as postgres is slow - in other words, both
have to suffer prejudice which used to be true loooong ago. ;-)

In cases of large directories ext2/3 perform extremely bad (as in the original
post) So this guy will be better off with anything but ext2/3. That's why I
switched from ext2 to reiser ~2 years ago (without any problems since).

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists <at> klawitter <dot> de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQFAe8dP1Xdt0HKSwgYRAkLqAJ4vmqkDGkFYDL67aPMAK6 qGAavEQgCfekvV
JCht52XAoXE8DrbXX24B8gc=
=XjOz
-----END PGP SIGNATURE-----
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #3

Cott Lang

On Tue, 2004-04-13 at 01:44, Michal Hlavac wrote:

Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Consider breaking your directories up, i.e.:

/ads/(ID % 1000)/ID

I use that for a system with several million images, works great. I
really don't think putting them in the database will do anything
positive for you. :)

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #4

Jeremiah Jahn

There has got to be some sort of standard way to do this. We have the
same problem where I work. Terabytes of images, but the question is
still sort of around "BLOBs or Files?" Our final decision was to use the
file system. We found that you didn't really gain anything by storing
the images in the DB, other than having one place to get the data from.
The file system approach is much easier to backup, because each image
can be archived separately as well as browsed by 3rd party tools.

-jj-
On Tue, 2004-04-13 at 07:40, Cott Lang wrote:

On Tue, 2004-04-13 at 01:44, Michal Hlavac wrote:
Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Consider breaking your directories up, i.e.:

/ads/(ID % 1000)/ID

I use that for a system with several million images, works great. I
really don't think putting them in the database will do anything
positive for you. :)

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

--
Jeremiah Jahn <je******@cs.earlham.edu>
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #5

Joshua D. Drake

Hello,

No standard way that I know of :). We tend to use BLOBS because we can
have associated tables
with metadata about the images that can be searched etc.... Of course
you could that with the filesystem
as well but we find blobs easier.

I will say we tend to use BLOBS or Bytea.

J
Jeremiah Jahn wrote:

There has got to be some sort of standard way to do this. We have the
same problem where I work. Terabytes of images, but the question is
still sort of around "BLOBs or Files?" Our final decision was to use the
file system. We found that you didn't really gain anything by storing
the images in the DB, other than having one place to get the data from.
The file system approach is much easier to backup, because each image
can be archived separately as well as browsed by 3rd party tools.

-jj-
On Tue, 2004-04-13 at 07:40, Cott Lang wrote:

On Tue, 2004-04-13 at 01:44, Michal Hlavac wrote:

Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Consider breaking your directories up, i.e.:

/ads/(ID % 1000)/ID

I use that for a system with several million images, works great. I
really don't think putting them in the database will do anything
positive for you. :)

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #6

Christopher Petrilli

On Apr 13, 2004, at 9:40 AM, Jeremiah Jahn wrote:

There has got to be some sort of standard way to do this. We have the
same problem where I work. Terabytes of images, but the question is
still sort of around "BLOBs or Files?" Our final decision was to use
the
file system. We found that you didn't really gain anything by storing
the images in the DB, other than having one place to get the data from.
The file system approach is much easier to backup, because each image
can be archived separately as well as browsed by 3rd party tools.

This is a pretty "classic problem," of performance modeling. While it
wasn't images, I worked on a system that had several million small
files (5-100K) that needed to be stored. The performance bottleneck
was a couple of things, in storing them in the FS (the bottleneck is
similar in PostgreSQL):

1. Directory name lookups do not scale well, so keep the number of
files in a directory to a manageable number (100-500).
2. Retrieval time is limited not by disk bandwidth, but by I/O seek
performance. More spindles = more concurrent I/O in flight. Also, this
is where SCSI takes a massive lead with tag-command-queuing.

In our case, we ended up using a three-tier directory structure, so
that we could manage the number of files per directory, and then
because load was relatively even across the top 20 "directories", we
split them onto 5 spindle-pairs (i.e. RAID-1). This is a place where
RAID-5 is your enemy. RAID-1, when implemented with read-balancing, is
a substantial performance increase.

Hope this helps. Some of these things apply to PostgreSQL, except
until there's better manageability of TABLESPACE, and the ability to
split tables across multiple spaces, it's going to be hard to hit those
numbers. This is a place where the "big databases" are better. But
then, that's the top 5% of installs. Tradeoffs.

Chris
--
| Christopher Petrilli
| petrilli (at) amber.org
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #7

Mattias Kregert

> Hi,

imho other filesystems like reiser have some version-problems ;)

Oh please. Reiser is as unstable as postgres is slow - in other words, both
have to suffer prejudice which used to be true loooong ago. ;-)

In cases of large directories ext2/3 perform extremely bad (as in the original
post) So this guy will be better off with anything but ext2/3. That's whyI
switched from ext2 to reiser ~2 years ago (without any problems since).

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --

I use reiserfs, too. Large directories (hundreds of thousand files) does not slow down file retrival, and i never had any problems with stability.

/Mattias
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #8

Jeremiah Jahn

I tried the bytea types, but the parsing done by the system on insert
etc. was so bad that it made it usable for me. Our solution is to keep
all of the metadata in the db plus an id and then a web service that
gets the image from the FS.

On Tue, 2004-04-13 at 09:05, Joshua D. Drake wrote:

Hello,

No standard way that I know of :). We tend to use BLOBS because we can
have associated tables
with metadata about the images that can be searched etc.... Of course
you could that with the filesystem
as well but we find blobs easier.

I will say we tend to use BLOBS or Bytea.

J
Jeremiah Jahn wrote:
There has got to be some sort of standard way to do this. We have the
same problem where I work. Terabytes of images, but the question is
still sort of around "BLOBs or Files?" Our final decision was to use the
file system. We found that you didn't really gain anything by storing
the images in the DB, other than having one place to get the data from.
The file system approach is much easier to backup, because each image
can be archived separately as well as browsed by 3rd party tools.

-jj-
On Tue, 2004-04-13 at 07:40, Cott Lang wrote:

On Tue, 2004-04-13 at 01:44, Michal Hlavac wrote:
Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.
Consider breaking your directories up, i.e.:

/ads/(ID % 1000)/ID

I use that for a system with several million images, works great. I
really don't think putting them in the database will do anything
positive for you. :)

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #9

Phil Endecott

> I am working on web portal. There are some ads. We have about 200 000

ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Certainly the problem you are experiencing is because you have 200,000
directories, and directory lookups are not scaling well.

I had a look at this a few weeks ago for an email storage application.
Using a filesystem with better directory lookup performance (? xfs,
resiserfs, jfs)is one obvious solution, as is storing the data in the
database. If you want to use files in an ext2/3 filesystem, you need to
break up the directories into a hierarchy.

I did some web research trying to find numbers for how many entries you
can get away with in an ext2/3 filesystem before the lookup time starts
to bite. I didn't find very much useful data. The best answer I got
was "between 100 and 1000". Since my identifiers are decimail numbers,
I had a choice of breaking them up into groups of two or three (i.e.
12/34/56/78 or 012/345/678). I went for groups of two and it works
well. Certainly this is not the limiting factor in the system as a
whole.

Looking back, I wonder if I should have gone for groups of three. Is
the lookup time a function of the number of entries in the directory, or
the size (in bytes) of the directory? Since my directory names are
short in this scheme, I get more directory entries per disk block.

One other thing to mention: have you turned off access time (atime)
logging for the filesystem? (man mount)

--Phil.

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #10

scott.marlowe

On Tue, 13 Apr 2004, Christopher Petrilli wrote:

2. Retrieval time is limited not by disk bandwidth, but by I/O seek
performance. More spindles = more concurrent I/O in flight. Also, this
is where SCSI takes a massive lead with tag-command-queuing.

In our case, we ended up using a three-tier directory structure, so
that we could manage the number of files per directory, and then
because load was relatively even across the top 20 "directories", we
split them onto 5 spindle-pairs (i.e. RAID-1). This is a place where
RAID-5 is your enemy. RAID-1, when implemented with read-balancing, is
a substantial performance increase.

Please explain why RAID 5 is so bad here. I would think that on a not
very heavily updated fs, RAID-5 would be the functional equivalent of a
RAID 0 array with one fewer disks, wouldn't it? Or is RAID 0 also a bad
idea (other than the unreliability of it) because it only puts the data on
one spindle, unlike RAID-1 which puts it on many.

In that case >2 drive RAID 1 setups might be a huge win. The linux kernel
certainly supports them, and I think some RAID cards do too.

Just wondering.
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #11

Christopher Petrilli

On Apr 13, 2004, at 11:27 AM, scott.marlowe wrote:

On Tue, 13 Apr 2004, Christopher Petrilli wrote:
2. Retrieval time is limited not by disk bandwidth, but by I/O seek
performance. More spindles = more concurrent I/O in flight. Also, this
is where SCSI takes a massive lead with tag-command-queuing.

In our case, we ended up using a three-tier directory structure, so
that we could manage the number of files per directory, and then
because load was relatively even across the top 20 "directories", we
split them onto 5 spindle-pairs (i.e. RAID-1). This is a place where
RAID-5 is your enemy. RAID-1, when implemented with read-balancing, is
a substantial performance increase.

Please explain why RAID 5 is so bad here. I would think that on a not
very heavily updated fs, RAID-5 would be the functional equivalent of a
RAID 0 array with one fewer disks, wouldn't it? Or is RAID 0 also a
bad
idea (other than the unreliability of it) because it only puts the
data on
one spindle, unlike RAID-1 which puts it on many.

In that case >2 drive RAID 1 setups might be a huge win. The linux
kernel
certainly supports them, and I think some RAID cards do too.

The issue comes down to read and write strategies. If your files are
bigger than the stripe size and begin to involve multiple drives, then
the rotational latency of each drive can come into play. This is often
hidden under caching during those wonderful comparison reviews, but
when you're talking about near random distributed access of more
information than could fit in the cache, then you have to face the
rotational issues of drives. Since the spindles are not locked
together, they drift apart in location, and you often end up with
worst-case latency in the drive subsystem. Mirroring doesn't face
this, especially when you can distribute the READS across all the
drives.

For example, if you ran triplex RAID-0, meaning 3 copies of the data,
which is often done in large environments so that you can take one copy
offline for a backup, while maintaining 2 copies online, then you can
basically handle 3 reads for the cost of 1, increasing the number of
read ops you can handle. This doesn't work with RAID-0, or RAID-5.

Chris
--
| Christopher Petrilli
| petrilli (at) amber.org
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #12

Alex

Hi,
is the file system approach really easier and faster? What if you need
to protect the image data e.g. you dont want users just to just dowload
the pictures directly from your website?

-a

Jeremiah Jahn wrote:

There has got to be some sort of standard way to do this. We have the
same problem where I work. Terabytes of images, but the question is
still sort of around "BLOBs or Files?" Our final decision was to use the
file system. We found that you didn't really gain anything by storing
the images in the DB, other than having one place to get the data from.
The file system approach is much easier to backup, because each image
can be archived separately as well as browsed by 3rd party tools.

-jj-
On Tue, 2004-04-13 at 07:40, Cott Lang wrote:

On Tue, 2004-04-13 at 01:44, Michal Hlavac wrote:

Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.

Consider breaking your directories up, i.e.:

/ads/(ID % 1000)/ID

I use that for a system with several million images, works great. I
really don't think putting them in the database will do anything
positive for you. :)

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #13

Jeremiah Jahn

Your code is retrieving the file from the file system. It doesn't have
to be accessible from the web server at all. Our current design uses a
JDBC connection to the database for the metadata (digital
signature,path,name,file type, etc..) and a SOAP call to the same server
(but doesn't have to be) to retrieve/store the image data.
-jj-
On Wed, 2004-04-14 at 08:15, Alex wrote:

Hi,
is the file system approach really easier and faster? What if you need
to protect the image data e.g. you dont want users just to just dowload
the pictures directly from your website?

-a

Jeremiah Jahn wrote:
There has got to be some sort of standard way to do this. We have the
same problem where I work. Terabytes of images, but the question is
still sort of around "BLOBs or Files?" Our final decision was to use the
file system. We found that you didn't really gain anything by storing
the images in the DB, other than having one place to get the data from.
The file system approach is much easier to backup, because each image
can be archived separately as well as browsed by 3rd party tools.

-jj-
On Tue, 2004-04-13 at 07:40, Cott Lang wrote:

On Tue, 2004-04-13 at 01:44, Michal Hlavac wrote:
Hello,

I am working on web portal. There are some ads. We have about 200 000
ads. Every ad have own directory called ID, where is 5 subdirectories
with various sizes of 5 images.

Filesystem is too slow. But I don't know, if I store these images into
postgres, performace will grow.
Consider breaking your directories up, i.e.:

/ads/(ID % 1000)/ID

I use that for a system with several million images, works great. I
really don't think putting them in the database will do anything
positive for you. :)

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

--
Jeremiah Jahn <je******@cs.earlham.edu>
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #14

Steve Atkins

On Wed, Apr 14, 2004 at 10:15:51PM +0900, Alex wrote:

Hi,
is the file system approach really easier and faster? What if you need
to protect the image data e.g. you dont want users just to just dowload
the pictures directly from your website?

It can be much faster, if implemented correctly, to put the large
files directly on the filesystem. It makes it a little harder to cluster,
but it can significantly reduce DB overhead.

There's no issue with the users downloading images directly, as you
normally wouldn't mount them directly into the URL namespace. Instead
the URL would point to a script that would lookup the image in the
database, and check permissions. If the user is allowed to load the
image the script will close it's connection to the database, and start
shoveling bytes from the filesystem to the http connection. Most
decent web application platforms have some amount of support for this
sort of thing built in.

That has a number of other advantages too - it can take a long time
for a user to download a large file, and you really don't want the
thread handling them to tie up a database connection for all that
time. If you're on a platform that supports nice things like
sendfile(2) you can even have the kernel do almost all the work.

Cheers,
Steve

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #15

Development - multi.art.studio

hello all,

if somebody is interested in a script using blobs on pgsql and php,
leave me a mail at ka***@erdtrabant.de
this little skript can upload files to filesystem and directly into db,
release files from db to filesystem, and store files from filesystem to
database.
its tested with php 4.3 and postgres 7.1

greetings,
volker

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #16

Development - multi.art.studio

hi all,
maybe somewhat off-topic , but here it is....

http://www.erdtrabant.de/index.php?i=500200104

a little php-script to demonstrate how to store files as blobs into
postgres (tested with v7.1) as a base for testing etc., not very
beautiful script-style, but useable
feel free to download and change what and use where ever you want

thanks for looking
volker

Development - multi.art.studio wrote:

hello all,

if somebody is interested in a script using blobs on pgsql and php,
leave me a mail at ka***@erdtrabant.de
this little skript can upload files to filesystem and directly into
db, release files from db to filesystem, and store files from
filesystem to database.
its tested with php 4.3 and postgres 7.1

greetings,
volker

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #17

Filesystem vs. Postgres for images

Similar topics