Enough RAM for entire Database.. cost aside, is this going to be fastest?

Andy B

Hello,

Sorry for this newbish question.

Briefly, my problem:
------------------
I expect the database I'm working on to reach something in the order of
12-16 Gigabytes, and I am interested in understanding as much as I can about
how I can make this go as fast as possible on a linux system. I haven't run
such a large database before. The nature of the database is such that
successive queries are very likely to lead to poor cache performance.

I have lots of experience with disks, operating systems, caching, virtual
memory, RAM etc. - just none running gargantuan databases very quickly!
-------------------

I've read all the performance tuning and configuration stuff I can, but
there is one basic question I can't get an answer to:

My question:
--------------------
If I can afford to have the entire database residing in RAM (within the
postgresql shared buffer cache, swapped into real memory) without impacting
other services on the machine, is there any reason why I shouldn't do it,
other than cost? (!)
--------------------
Basically, I'm finding it almost impossible to predict 'how much RAM is
right'. I know I don't need the *entire* database to sit in RAM, and a lot
of this answer depends on a lot of things - the speed of IO, the nature of
queries etc. But when you get to a certain amount of RAM, (specifically, the
amount where nothing needs to be swapped out), then surely things get a bit
more certain... or do they?

So, could I, for example, setup postgresql with a 16 GB shared buffer cache
and expect the postgresql backend processes to fly like the wind (CPU, RAM
and disk write speed permitting)?

I understand that writes can delay the progression of updates if setup in a
certain way, and that's ok - I'm really just wondering if there are some
*other* boundaries that will get in the way. I've read that I should be able
to configure a linux box (with the right processors) to address up to 64GB
of RAM, and I'm aware of more esoteric boxes like the SGI Altix 3000 which
can go far higher, but maybe that's overkill..

If there are any resources out there that point to other experiences of
others trying to coerce a huge database to run largely from RAM, I'd be
grateful for the links.

Many thanks
Andy
____

Nov 23 '05 #1

Subscribe Post Reply

2710

Shridhar Daithankar

Andy B wrote:

My question:
--------------------
If I can afford to have the entire database residing in RAM (within the
postgresql shared buffer cache, swapped into real memory) without impacting
other services on the machine, is there any reason why I shouldn't do it,
other than cost? (!)
--------------------
There is no reason why you should not do it. How remains to be a point of
disagreement though. You don't allocate 16GB of shared buffers to postgresql.
That won't give you performance you need.
Basically, I'm finding it almost impossible to predict 'how much RAM is
right'. I know I don't need the *entire* database to sit in RAM, and a lot
of this answer depends on a lot of things - the speed of IO, the nature of
queries etc. But when you get to a certain amount of RAM, (specifically, the
amount where nothing needs to be swapped out), then surely things get a bit
more certain... or do they?
Yes. If you have database size on disk + 256MB, then you should be done for a
dedicated database server machine/
So, could I, for example, setup postgresql with a 16 GB shared buffer cache
and expect the postgresql backend processes to fly like the wind (CPU, RAM
and disk write speed permitting)?
Umm.. you could assign 32MB of shared buffers and still achieve the effect you
want. Postgresql would not ever get 16GB of shared buffers. See other post by
Tom in last few hours.
If there are any resources out there that point to other experiences of
others trying to coerce a huge database to run largely from RAM, I'd be
grateful for the links.

Well, I haven't run it that large anytime but it is very easy to guess. An
opteron with 16GB of RAM and a 64 bit linux will get you there in cheapest
fashion. Add the disk array of your choice.

HTH

Shridhar

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #2

Andy B

Hello Shridhar,

Thanks for the reply.

There is no reason why you should not do it. How remains to be a point of
disagreement though. You don't allocate 16GB of shared buffers to postgresql. That won't give you performance you need.

I think in the other thread, Tom was alluding to this too. What is it about
the shared buffer cache behaviour that makes it inefficient when it is very
large? (assuming that the address space it occupies is allocated to RAM
pages)

Is there a good place I could look for some in depth details of its
behaviour?

Many thanks,
Andy

Nov 23 '05 #3

Mike Rylander

<posted & mailed>

Andy B wrote:

Hello Shridhar,

Thanks for the reply.
There is no reason why you should not do it. How remains to be a point of
disagreement though. You don't allocate 16GB of shared buffers to postgresql.
That won't give you performance you need.

I think in the other thread, Tom was alluding to this too. What is it
about the shared buffer cache behaviour that makes it inefficient when it
is very large? (assuming that the address space it occupies is allocated
to RAM pages)

It's not that making the cache bigger is inefficient, it's that the cache is
not used the way you are thinking. Postgres does not try to create its own
large persistent cache of recently used data, because the OS (especially
Linux, and especially on an Opteron and compiled for 64 bit) is really much
better at caching. In fact, other that herding programs and implementing
security, optimizing the use of resources is what the OS is for.

Is there a good place I could look for some in depth details of its
behaviour?
There's a good bit of depth in the archives of this list. I would start
searching back for discussions of effective_cache_size, as that is involved
in *costing* the caching job that the OS is doing.

Many thanks,
Andy

--miker

Nov 23 '05 #4

Andy B

> There's a good bit of depth in the archives of this list. I would start

searching back for discussions of effective_cache_size, as that is involved in *costing* the caching job that the OS is doing.

Thanks - that's just what I need to sink my teeth into. I'll have a trawl
and get back later.

Regards
Andy

Nov 23 '05 #5

Andy B

It's not that making the cache bigger is inefficient, it's that the cache is not used the way you are thinking.

Ok, I think I've got it now. The missing piece of the puzzle was the
existence of the Linux buffer cache. So that's what the
effective_buffer_cache value is for(!)

I read Shridhar Daithankar and Josh Berkus's 'Tuning Postgresql for
performance' document, and some things stike me (though only with respect to
my RAM much bigger than database scenario.)

I think I'm a bit more depressed than I was earlier today!

1. Postgresql is a two tiered cache mechanism. The first tier - the
postgresql shared buffer cache sits on the second, larger tier, the linux
buffer cache. So bits of the same data end up being in memory...twice, and
two cache mechanisms operate at the same time. (That's how I understand it).

2. Even if the linux buffer cache contains all the data required for an
execution of a plan, there is still a load of memory copying to do between
these two tiers. Though memory copying is faster than disk access, it is
still an overhead, and isn't there the real problem of thrashing between
these two tiers if the plan can't fit all the data into the top tier, even
if the thrashing is restricted to the memory system?

3. The OS will implement a 'Least recently Used' cache replacement strategy
on data in its cache. This isn't the optimal cache replacement strategy for
the database. If postgresql is in charge of all its RAM resident data, it
can make more intelligent decisions about which stuff isn't needed once
used.

This still leads me to think that the best thing to do in my 'much bigger
RAM than database size' scenario would be for postgresql to allocate a
shared buffer cache big enough for all the data + a bit.

By doing this, there would be some performance enhancing gains, including:

1. If the planner *knew* (rather than guessed) that all the data was
effectively 'there' in RAM, in the right place (i.e. its workspace),
wouldn't it make choosing the optimal plan easier? (I see that
effective_buffer_cache is a stab in that direction, but just because
postgresql guesses the data will have been cached by linux, it doesn't mean
it actually is - surely an important distinction.)

2. You'd avoid a whole layer of caching, along with the not necessarily
aligned memory copies and other overheads that this introduces. Even in the
optimal case where all the data needed does reside in RAM, it's not in the
right bit of RAM. (I may have misunderstood this relationship between the
shared buffer cache and the linux buffer cache - if no memory copying
actually occurs - then I'll go away for a bit!)

Two interesting things I dug up today:

www.linuxjournal.com/article.php?sid=5482 (to do with a high performance DB
living in an OS controlled environment)

and

http://www.ixora.com.au/tips/avoid_buffered_io.htm (to do with Oracle
tuning)

The way things are, I see that postgresql will 'gracefully degrade' in the
very common scenario where the database shares a machine with limited
resources and lots of processes, all of them hammering the machine, but in
my 'optimal RAM optimised db server ' scenario, the scheme seems actually to
reduce the potential for a blistering system.

So given all of that, I can't help swinging back to my original question
about whether it's a bad thing to make the shared buffer cache huge, so long
as you have much more RAM than the size you choose.

It'll still require all that nasty Linux cache <->postgresql cache memory
copying, but it will at least reduce postgresql's exposure to the 'not
particularly clever' least recently used cache replacement strategy employed
by the OS.

Am I getting any closer to understanding the way things are?

Thanks for your tolerance,
Andy
p.s.

(It would be great if you could just load the data directly into the shared
buffer cache with direct IO, since that OS file buffer->shared buffer cache
transfer is just a waste of effort in this scenario), and direct IO allows
spectacular IO gains when you're working with certain raid configurations
connected with a suitable fat wire.)

Ok - I'll shutup now

Nov 23 '05 #6

Mike Rylander

<posted & mailed>

Andy B wrote:

It's not that making the cache bigger is inefficient, it's that the cache is
not used the way you are thinking.

Ok, I think I've got it now. The missing piece of the puzzle was the
existence of the Linux buffer cache. So that's what the
effective_buffer_cache value is for(!)

I read Shridhar Daithankar and Josh Berkus's 'Tuning Postgresql for
performance' document, and some things strike me (though only with respect
to my RAM much bigger than database scenario.)

I think I'm a bit more depressed than I was earlier today!

1. Postgresql is a two tiered cache mechanism. The first tier - the
postgresql shared buffer cache sits on the second, larger tier, the linux
buffer cache. So bits of the same data end up being in memory...twice, and
two cache mechanisms operate at the same time. (That's how I understand
it).

2. Even if the linux buffer cache contains all the data required for an
execution of a plan, there is still a load of memory copying to do between
these two tiers. Though memory copying is faster than disk access, it is
still an overhead, and isn't there the real problem of thrashing between
these two tiers if the plan can't fit all the data into the top tier, even
if the thrashing is restricted to the memory system?

Perhaps I'm a bit naive about complex data structure caching strategies, but
it seems to me that the overhead of tracking tuples (which is what you
would want if you are going to manage your own cache, as opposed to simply
caching disk pages as the OS does) would be memory-prohibitive except in
the most extreme cases of RAM>DB SIZE.

3. The OS will implement a 'Least recently Used' cache replacement
strategy on data in its cache. This isn't the optimal cache replacement
strategy for the database. If postgresql is in charge of all its RAM
resident data, it can make more intelligent decisions about which stuff
isn't needed once used.

I find that experience does not bear this out. There is a saying a coworker
of mine has about apps that try to solve problems, in this case caching,
that are well understood and generally handled well at other levels of the
"software stack"... he calls them "too smart by half" :)
This still leads me to think that the best thing to do in my 'much bigger
RAM than database size' scenario would be for postgresql to allocate a
shared buffer cache big enough for all the data + a bit.

The problem this this assumption, as Tom Lane has said, is that generally
speaking, a kernel is much faster with respect to cache *lookup* than
postgres is. Postgres assumes that only a subset of the data stored in a
cluster will be needed at any one time. Under this assumption (that it
will need to store and shovel through thousands, instead of millions or
more, tuples) some optimizations have been made that would cause
performance degradation if used over the entire (potentially enormous)
dataset.
By doing this, there would be some performance enhancing gains, including:

1. If the planner *knew* (rather than guessed) that all the data was
effectively 'there' in RAM, in the right place (i.e. its workspace),
wouldn't it make choosing the optimal plan easier? (I see that
effective_buffer_cache is a stab in that direction, but just because
postgresql guesses the data will have been cached by linux, it doesn't
mean it actually is - surely an important distinction.)

2. You'd avoid a whole layer of caching, along with the not necessarily
aligned memory copies and other overheads that this introduces. Even in
the optimal case where all the data needed does reside in RAM, it's not in
the right bit of RAM. (I may have misunderstood this relationship between
the shared buffer cache and the linux buffer cache - if no memory copying
actually occurs - then I'll go away for a bit!)

Two interesting things I dug up today:

www.linuxjournal.com/article.php?sid=5482 (to do with a high performance
DB living in an OS controlled environment)

Come now, Nov. 2001? Most of the issues brought up here have been solved
*in the kernel* since then. (In the general case, with the bevy of IO
schedulers, and the interuptable kernel.)
and

http://www.ixora.com.au/tips/avoid_buffered_io.htm (to do with Oracle
tuning)
Well, Oracle isn't PG, and I can't speak to their buffering and caching
implementation.

The way things are, I see that postgresql will 'gracefully degrade' in the
very common scenario where the database shares a machine with limited
resources and lots of processes, all of them hammering the machine, but in
my 'optimal RAM optimized db server ' scenario, the scheme seems actually
to reduce the potential for a blistering system.

So given all of that, I can't help swinging back to my original question
about whether it's a bad thing to make the shared buffer cache huge, so
long as you have much more RAM than the size you choose.
See above...

It'll still require all that nasty Linux cache <->postgresql cache memory
copying, but it will at least reduce postgresql's exposure to the 'not
particularly clever' least recently used cache replacement strategy
employed by the OS.
<rant> EVIDENCE, PLEASE </rant>

Am I getting any closer to understanding the way things are?

Thanks for your tolerance,
Andy
p.s.

(It would be great if you could just load the data directly into the
shared buffer cache with direct IO, since that OS file buffer->shared
buffer cache transfer is just a waste of effort in this scenario), and
<rant> again: EVIDENCE, PLEASE </rant>
direct IO allows spectacular IO gains when you're working with certain
raid configurations connected with a suitable fat wire.)

select count(*) from table;

Put that in a loop inside a function, select the table names from
INFORMATION_SCHEMA.TABLES.
Ok - I'll shutup now

We (well, I) don't want you to shut up. Infact I enjoy discussions like
this, because it freshens my memory on just why PG rocks so hard! ;)

--miker

Nov 23 '05 #7

Andy B

Hello again Mike,

Thanks for the replies! Here's my next salvo!

Perhaps I'm a bit naive about complex data structure caching strategies, but it seems to me that the overhead of tracking tuples (which is what you
would want if you are going to manage your own cache, as opposed to simply
caching disk pages as the OS does) would be memory-prohibitive except in
the most extreme cases of RAM>DB SIZE.
But doesn't postgresql already make a decision about what needs to be in its
workspace prior to executing a plan? I'm not actually suggesting it do
anything different. In fact, the only thing I'm suggesting is that the
workspace need not be limited in size.

Remember - I'm only talking about the scenario where the entire database -
every single tuple - can reside in the workspace (the shared buffer cache)
at the same time. There is no extra 'managing your own cache' as you put it.
Just let postgresql's postmaster do what it does. (assuming it doesn't kick
stuff out of its shared buffer cache unless it needs the space for something
else... ok -that may be an incorrect assumption - I need to find out for
sure)

But I think the overall effort is less, not more. Once in the shared buffer
cache, a tuple never needs to be kicked out again to make way for another.
That's what is so nice. Everything is just there... No need to calculate a
least damaging cache replacement strategy, no need to wait for blocks of
data to work their way from the disk or linux buffer cache to the postgresql
workspace -it all ends up there, and stays there, with only syncing activity
needing to take place when updates happen.

I find that experience does not bear this out. There is a saying a coworker of mine has about apps that try to solve problems, in this case caching,
that are well understood and generally handled well at other levels of the
"software stack"... he calls them "too smart by half" :)
Well, I guess I'd agree and disagree with your friend. What a political
animal I am!

Here's why.

Imagine this scenario. Your application uses 5 bits of data - A B C D and E.
Only 3 can sit in memory at once.

You want to do two transactions. The first combines A, B and C, so you load
those, in order.

Once you're finished with that first transaction, you then require to
combine A, C and D.

Now, an application driven cache will see immediately that the optimal cache
replacement is to load D into the slot taken by B. With the OS in charge,
you'll probably end up with a cache thrash.

This is simplistic, but it is just this sort of smart caching that can lead
to dramatic speedups, and I've got quite a bit of experience of seeing just
that.

I don't want to devalue the linux cache strategy. It is very appropriate for
almost all situations. I'm talking about a very specific scenario here
though.

Does that make sense?
The problem this this assumption, as Tom Lane has said, is that generally
speaking, a kernel is much faster with respect to cache *lookup* than
postgres is. Postgres assumes that only a subset of the data stored in a
cluster will be needed at any one time. Under this assumption (that it
will need to store and shovel through thousands, instead of millions or
more, tuples) some optimizations have been made that would cause
performance degradation if used over the entire (potentially enormous)
dataset.

Again, in my scenario, postgres is *relieved* of the task of having to
decide what data to choose to be in the workspace, since ALL of the data is
accessible there in my scenario.

The point about postgres having to shovel through millions of tuples is a
different point. If I do a query which can only be executed by looking at a
million tuples, then that is what has to happen, no matter what caching
strategy you employ. Obviously, if you're shovelling data in and out of
memory, then the *way* you traverse those tuples becomes important to
efficiency, as loading sequential stuff off a disk is always faster. If it's
*all* in the workspace, well it doesn't matter at all then - the planner can
forget about disks and choose the fastest algorithmic option.

Again - this is a simplification, but I hope you can see what I'm getting
at.
Come now, Nov. 2001? Most of the issues brought up here have been solved
*in the kernel* since then. (In the general case, with the bevy of IO
schedulers, and the interuptable kernel.)
The point is that if your data is sitting in two parts of memory at the same
time, then that's a waste of memory, and it's extra time copying all that
stuff around. The date of the article isn't relevant to this, I don't think.

They make the point that no matter how efficient the OS kernel is, in the
sort of scenario I am interested in, it just ends up being an unnecessary
step.
Well, Oracle isn't PG, and I can't speak to their buffering and caching
implementation.
I'm just saying that these issues are generic issues for any high
performance database server, regardless of whether it is Oracle, Postgresql,
Mysql etc. The things they highlight in that article are the very things I
allude to here for my special scenario. Even if you're a postgresql fan (and
I chose to use it too), it is worth poking your head over the parapet and
seeing how others are dealing with these issues too.

(It would be great if you could just load the data directly into the
shared buffer cache with direct IO, since that OS file buffer->shared
buffer cache transfer is just a waste of effort in this scenario), and

<rant> again: EVIDENCE, PLEASE </rant>

In this case, I can speak with a little authority (for once) : this is just
best practice for a high performance system. If you want an application to
get at its data quickly, you use block aligned direct IO to do a single
transfer from the disk subsystem (or logical volume or whatever) to the
address space where the app wants the data. In fact, the larger you can make
the size of a single direct IO transfer, the nearer to the theoretical
bandwidth IO limit you can reach. Sticking a generic linux style buffer
cache in the middle destroys any chance you will have of touching the
performance you could otherwise get from a high performance IO subsystem.
You'll just have to trust me on this one.. :) Of course, in my scenario, the
only interest in the high speed loading is just to populate the workspace
when the server starts up initially, though optimising write performance is
something that will benefit you from then on.

I've spent years getting data on and off disks quickly, and even something
as simple as a non machine page boundary aligned transfer can slow things
down by an order of magnitude. It's just a question of doing the
straightest, most 'hardware optimised' route.

We (well, I) don't want you to shut up. Infact I enjoy discussions like
this, because it freshens my memory on just why PG rocks so hard! ;)

Cool! I'm glad that you're really into PG, because I'm sort of counting on
it too, so that makes me feel happier, and it's really good to see the
newsgroups for the database being so active.

Thanks anyway for the replies. I'm going to do a lot more digging and see if
I can find anyone who has got their hands dirty trying to do this
'everything in RAM' scenario.

I'll come back with anything more interesting that I find.

Regards
Andy

Nov 23 '05 #8

Mike Rylander

<posted & mailed>

Andy B wrote:

Hello again Mike,

Thanks for the replies! Here's my next salvo! -- snip --

Imagine this scenario. Your application uses 5 bits of data - A B C D and
E. Only 3 can sit in memory at once.

You want to do two transactions. The first combines A, B and C, so you
load those, in order.

Once you're finished with that first transaction, you then require to
combine A, C and D.

Now, an application driven cache will see immediately that the optimal
cache replacement is to load D into the slot taken by B. With the OS in
charge, you'll probably end up with a cache thrash.

If all the data fits into memory, then this cache thrashing won't occur,
yes?
This is simplistic, but it is just this sort of smart caching that can
lead to dramatic speedups, and I've got quite a bit of experience of
seeing just that.

I don't want to devalue the linux cache strategy. It is very appropriate
for almost all situations. I'm talking about a very specific scenario here
though.

Does that make sense?

It does. And I will support your strategy as soon as there is some
empirical evidence surrounding PG. I'm all for more speed!

Again, in my scenario, postgres is *relieved* of the task of having to
decide what data to choose to be in the workspace, since ALL of the data
is accessible there in my scenario.

It's not the deciding what to pull into the workspace, it's the finding of
the target data in the workspace. As I've said, I'm no expert in caching
strategies (I will leave that to the likes of Tome Lane and Linus... and
you, if you would like to submit a patch :).
The point about postgres having to shovel through millions of tuples is a
different point. If I do a query which can only be executed by looking at
a million tuples, then that is what has to happen, no matter what caching
strategy you employ. Obviously, if you're shovelling data in and out of
memory, then the *way* you traverse those tuples becomes important to
efficiency, as loading sequential stuff off a disk is always faster. (we both want it all in memory... I never want to hit a disk for read!) If
it's *all* in the workspace, well it doesn't matter at all then - the
planner can forget about disks and choose the fastest algorithmic option.

Well, I think I have a way for you to test this...:

Test of BIG shared_buffers:
------------------------------
1) inflate shared buffers as as much as possible.
2) deflate effective cache size as much as possible.
3) set random_page_cost to 0.01, or even lower (force planner to think disk
= ram)
4) for all relevent tables, do SELECT count(*) FROM table;
5) run queries involving many tables using EXPLAIN ANALYZE and collect the
times

Test of currently recommended shared_buffers:
-------------------------------------------------
1) set shared_buffers to 20000 (goodly setting for production DB...)
2) set effective cache size using current recommendations (see General Bits,
etc)
3) proceed to steps 4 and 5 above

Unfortunately, I don't have a machine on which to test this at the moment,
but if anyone out there does, I would find the results very interesting!
Again - this is a simplification, but I hope you can see what I'm getting
at.
Come now, Nov. 2001? Most of the issues brought up here have been solved
*in the kernel* since then. (In the general case, with the bevy of IO
schedulers, and the interuptable kernel.)
The point is that if your data is sitting in two parts of memory at the
same time, then that's a waste of memory, and it's extra time copying all
that stuff around. The date of the article isn't relevant to this, I don't
think.

I understand what you are saying. My point was simply that the specific
issues raised in the article have been largely addressed in kernel space on
linux.
They make the point that no matter how efficient the OS kernel is, in the
sort of scenario I am interested in, it just ends up being an unnecessary
step.
That is true, and will be until PG implements it's own cache memory
management AND on-disk storage. I'd be willing to bet that any patches
that dramatically increase speed for the majority of use cases would be
considered strongly.

Well, Oracle isn't PG, and I can't speak to their buffering and caching
implementation.

I'm just saying that these issues are generic issues for any high
performance database server, regardless of whether it is Oracle,
Postgresql, Mysql etc. The things they highlight in that article are the
very things I allude to here for my special scenario. Even if you're a
postgresql fan (and I chose to use it too), it is worth poking your head
over the parapet and seeing how others are dealing with these issues too.

I understand and agree. In truth, that was an off-handed comment coming
more from my own ignorance of the issue at hand then from my desire to stay
within my "walls"... the article presents interesting points, and is worth
investigating.

I've spent years getting data on and off disks quickly, and even
something
as simple as a non machine page boundary aligned transfer can slow things
down by an order of magnitude. It's just a question of doing the
straightest, most 'hardware optimised' route.

I couldn't agree more, though I think at this point we are arguing around
the problem. We both want all the data in RAM... it's where the data stays
for the majority of its off-disk time we disagree on.

If all the data is in memory, OS or app cache, then the point is who is
better at looking up that data.

Finding target data (complex lookup in the PG shared buffers) will take
longer than transferring disk pages from the OS cache, based on index
information, and then doing a simple, fast scan. This is the current
working assumption for postres.

Your contention, correct me if I've gotten this wrong, is that transferring
it from the OS cache will be slower than finding it using a more complex
lookup, but without moving the data from the OS cache to the PG workspace.

In both scenarios all data is in RAM.

I see what you mean about a waste of memory, but if the costing parameters
are tuned well for the use of the specific DB then current evidence with
postgres shows that the former strategy is faster. I would be interested
to see if the latter scenario could be made to work with the current PG
codebase, or even if there is a simple patch that could speed up the
shared_buffers case.

I guess, though, that the main problem I see with the big shared_buffers
idea on linux specifically is that SHMEM is slow on linux. Rumor has it
that the SHMEM implementation on Solaris is much faster than on linux, so
PG on Solaris may very well benefit from using your approach, but I just
don't see it on linux, which is where this discussion started IIRC.

We (well, I) don't want you to shut up. Infact I enjoy discussions like
this, because it freshens my memory on just why PG rocks so hard! ;)

(sorry, just watched "School of Rock" last night...)
Cool! I'm glad that you're really into PG, because I'm sort of counting on
it too, so that makes me feel happier, and it's really good to see the
newsgroups for the database being so active.

Thanks anyway for the replies. I'm going to do a lot more digging and see
if I can find anyone who has got their hands dirty trying to do this
'everything in RAM' scenario.

I'll come back with anything more interesting that I find.
I will wait with baited breath ;)

--miker

Regards
Andy

Nov 23 '05 #9

Andy B

If all the data fits into memory, then this cache thrashing won't occur, yes?

No - it *can* occur in a 2 tier cache strategy.

The critical question here is: *If* the data postgresql needs is in the
linux buffer cache, what (if anything) does the OS have to do to make it
available to the postmaster? The worst case would be an actual memory copy
of the data from the linux cache to where postgresql actually asked for the
file data to be loaded. I'll need to find out more about the linux buffer
cache to answer this question for myself.

If a memory copy is required, then this is a good way of visualising it:

I live in London, but have to work in 5th avenue, NYC for a week.
My company stupidly sets me up in a smelly hotel by JFK airport.

London is 'The disk system'
5th Avenue is 'The postgresql workspace'
My smelly hotel at JFK is the 'linux buffer cache'.

Clearly, it's far quicker to commute to work each day from JFK, than it is
to commute from London each day.
But wouldn't it be faster still if I had a room in the swanky hotel on Fifth
avenue above the office?

Yes, I'd still have to arrive at JFK, but then I could stay in 5th avenue
all week until I'm finished.

And as far as direct IO is concerned, it gets better: that would be like
taking a plane right straight from London to 5th avenue, avoiding the
tedious immigration line at JFK.

Having said all of that, I'll go and find more about the linux buffer cache
before I pester the Postrgresql cache experts.
<snipped your test plan>

Thanks for that test plan...

I'm in the same position with you w.r.t. machinery - hence my thread (I was
hoping someone had tried this!). But I will get my hands on something quite
soon and I'll do your test and post the results. I knew I'd have no option
than to become very familiar with the behaviour of whatever DB I chose, so
I'm looking forward to it (sort of)

That is true, and will be until PG implements it's own cache memory management AND on-disk storage. I'd be willing to bet that any patches
that dramatically increase speed for the majority of use cases would be
considered strongly.
I have to find out for sure, but since linux is being used in lots of HPC
clusters now and for some real time applications, I'd be very surprised if
there wasn't already a direct IO strategy in the kernel, but that's one of
the things I'm looking at now. That being the case, and assuming that the
postgresql cache operates on arbitrary sized pages (it seems set at 8KB
pages by default - which, depending on the platform may be good, assuming
also that the address space is appropriately machine page aligned), then it
might be 'very easy indeed' to upgrade postgresql's cache manager to fire
the necessary IO calls at the OS. Again - this is really something I need to
talk to a pg cache expert about, but I've got a lot of catching up to do
first.

By the way - Direct IO doesn't mean the application has to know about the
disk configuration or anything messy like that. It is still a block IO
operation directed through the kernel. From the programmer point of view the
changes are trivial. It just bypasses the OS's own generic file cache, and
does a super efficient single DMA memory transfer between the IO system and
the host RAM - all of this managed has to be managed by the OS, or you'll
end up with a platform specific implementation that is of no real use.

I would run a mile if this work required Postgresql to know about actual
storage.
Finding target data (complex lookup in the PG shared buffers) will take longer than transferring disk pages from the OS cache, based on index
information, and then doing a simple, fast scan. This is the current
working assumption for postres.

This is what scares me a little. Why would the PG cache lookup be complex?
That's a question for me to find the answer to... I'll wade through the
developer resources and source code etc. before bothering the experts.
Your contention, correct me if I've gotten this wrong, is that transferring it from the OS cache will be slower than finding it using a more complex
lookup, but without moving the data from the OS cache to the PG workspace.
In both scenarios all data is in RAM.
Well if the PG cache is slower at dealing with large data, then yes, I
concede that this changes the picture. But at this point in time, my gut
reaction would be to assume that the PG cache is blazingly efficient. It is,
after all, the heart of the DB, and all the DBs I've seen have a similar
architecture.
I see what you mean about a waste of memory, but if the costing parameters are tuned well for the use of the specific DB then current evidence with
postgres shows that the former strategy is faster. I would be interested
to see if the latter scenario could be made to work with the current PG
codebase, or even if there is a simple patch that could speed up the
shared_buffers case.

Well it's still not clear that the memory does get wasted, in which case
this 'problem' doesn't exist (See my comment about whether a copy occurs
between the linux buffer cache and the postmaster's address space.) If linux
just returns instantly and it looks like the data was copied, but it just
ended up being a vmem pointer reference change, then the only issue is that
of making the IO much quicker (when the disk actually needs to be read from
or written to). Anyway.. as I said. I'm in learning mode right now!
I guess, though, that the main problem I see with the big shared_buffers idea on linux specifically is that SHMEM is slow on linux. Rumor has it
that the SHMEM implementation on Solaris is much faster than on linux, so
PG on Solaris may very well benefit from using your approach, but I just
don't see it on linux, which is where this discussion started IIRC.

Hmm... again more reading for me. It occurs to me that I haven't *fully*
decided on linux at all. Linux is great for developing on a shoestring. At
the end of the day, I'm quite committed to postgresql in my project. I like
a lot of its features and would be loathed to move away from it, so it might
be that the final production DB server ends up running pgsql on an OS other
than linux.

Still, it would be nice if it did work under linux.
I will wait with baited breath ;)

Hehe - I've just read back what I've written, and it seems I have to
suddenly understand the depths of both the pg kernel and the linux kernel.
And I didn't even know about the linux buffer cache until yesterday!

So it'll take a wee while, but I'll be back!

Thanks again for your comments,

Regards,
Andy

Nov 23 '05 #10

Soeren Gerlach

Briefly, my problem:
------------------
I expect the database I'm working on to reach something in the order of
12-16 Gigabytes, and I am interested in understanding as much as I can
about how I can make this go as fast as possible on a linux system. I
haven't run such a large database before. The nature of the database is
such that successive queries are very likely to lead to poor cache
performance.

I have lots of experience with disks, operating systems, caching, virtual
memory, RAM etc. - just none running gargantuan databases very quickly!

Two things come into my mind, although I don't have experience with both of
them ,-))

1) Use a kind of "silicon disk". Ususally it's battery backuped DDR-Ram with
a kind of IDE-interface ready to attach to the IDE bus. This is not only
fast in terms of transfer rates but also in BLAZING fast in terms of seek
times.

2) Put the database completely in a RAM disk and replicate it live to a disk
based DB server - if the server crashes you still have the "copy" on
another server on the disk. I don't know how much overhead is imposed with
such replicating for Postgres, but for commercial DBs you normally get
happy with ~5% addon. This depends strongly from the usage pattern of the
DB, i.e. a OLTP-usage imposes stronger overhead then a read-only-DB (MIS
type).

Some years ago I've committed lots of work in tuning a commercial database
product for 10 Gig+ databases. The problem was really always that tuning on
a database level always needed some help from the application development
too to really work good. And when the database changed (schemes, indexes
and some costly operations) you needed to re-tune again for some bigger
changes - always needing lots of testing, approving etc.etc.
As RAM got cheaper and cheaper (even for the big and evil big iron servers
<g>>) the last years I can see an increasing tendency to buy more RAM for
the usage of caches, ramdisks etc. because: This normally helps always,
fast and is not that expensive nowadays. Although I personally think that
tuning is much more fun than putting some ram sticks into the mobo, the
latter one becomes more effective in terms of "bang for the bucks".

This explains why my two solutions have nothing really specific to do with
Postgres but more with a general approach. If you decide to switch from
Linux to Solaris (or other way round) in some time you get likely lots of
work to do if you now stay to a specific solution.
Best regards,
Soeren Gerlach
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #11

Shridhar Daithankar

Hi,

Andy B wrote:

1. Postgresql is a two tiered cache mechanism. The first tier - the
postgresql shared buffer cache sits on the second, larger tier, the linux
buffer cache. So bits of the same data end up being in memory...twice, and
two cache mechanisms operate at the same time. (That's how I understand it).
That is correct. But I would advise you to see shared buffers as workspace
rather than cache.
2. Even if the linux buffer cache contains all the data required for an
execution of a plan, there is still a load of memory copying to do between
these two tiers. Though memory copying is faster than disk access, it is
still an overhead, and isn't there the real problem of thrashing between
these two tiers if the plan can't fit all the data into the top tier, even
if the thrashing is restricted to the memory system?
That is certainly not correct. I don't have hard sources to back it up, but if
you open a file that you jus close it, linux does not go copying it from it's
cache to the process address space. It would rather juggle the page table to mak
e memory pages available to your process.

By that argument, there would be three caches. Postgresql shared buffers, files
mapped into process address space and linux buffers. I think that defeats the
purpose of caching.
3. The OS will implement a 'Least recently Used' cache replacement strategy
on data in its cache. This isn't the optimal cache replacement strategy for
the database. If postgresql is in charge of all its RAM resident data, it
can make more intelligent decisions about which stuff isn't needed once
used.
There are many things to note about approach postgresql adopts

- If the machine is shared between multiple services such as web server,
database and mail server etc., this approach make postgresql good citizen by
letting the OS decide which is most hard hit app. that needs memory.
- On a dedicated machine, this approach anyway yields almost entire buffer cache
to postgresql
- Getting caching right w.r.t execution is a hrd thing to do. Postgresql still
has some room to go before it can claim to be making best use of all available
information in query plan. Till that time, taking over file system caching is
not such a good idea. Hackers are already working on making postgresql planner
smarter than it already is.
- Postgresql or no postgresql, OS has to manage buffer cache. Why duplicate the
efforts which somebody is already better at? If OS improves, everybody benefits.
I can recall quite a few posts benefitting from moving to 2.6.x kernel from 2.4.x.

Again, shared buffers is not cache. It is a workspace. Leave it around 64MB, no
matter how big your data set is.

1. If the planner *knew* (rather than guessed) that all the data was
effectively 'there' in RAM, in the right place (i.e. its workspace),
wouldn't it make choosing the optimal plan easier? (I see that
effective_buffer_cache is a stab in that direction, but just because
postgresql guesses the data will have been cached by linux, it doesn't mean
it actually is - surely an important distinction.)
Not with current planner. Please correct me if I am wrong. And given the
assumption that OS buffer cache->postgresql shared buffers is a low/no cost
operation, this is not an optimization that would win high priority on TODO list.
2. You'd avoid a whole layer of caching, along with the not necessarily
aligned memory copies and other overheads that this introduces. Even in the
optimal case where all the data needed does reside in RAM, it's not in the
right bit of RAM. (I may have misunderstood this relationship between the
shared buffer cache and the linux buffer cache - if no memory copying
actually occurs - then I'll go away for a bit!)
I believe there is no memory copying. At the most page table juglary at kernel
level and mmaps at libc level. But again this is a guess.
Two interesting things I dug up today:

www.linuxjournal.com/article.php?sid=5482 (to do with a high performance DB
living in an OS controlled environment)

and

http://www.ixora.com.au/tips/avoid_buffered_io.htm (to do with Oracle
tuning)

The way things are, I see that postgresql will 'gracefully degrade' in the
very common scenario where the database shares a machine with limited
resources and lots of processes, all of them hammering the machine, but in
my 'optimal RAM optimised db server ' scenario, the scheme seems actually to
reduce the potential for a blistering system.
I would say it asymptotically approaches the best performance.
So given all of that, I can't help swinging back to my original question
about whether it's a bad thing to make the shared buffer cache huge, so long
as you have much more RAM than the size you choose.

Well, best thing I can tell you is test yourself and find out the sweet spot. It
should be between 64-128MB by general wisdom. If you find otherwise, we would be
more than glad to hear.

I read another thread where you posted your actual requirements etc. I would be
inclined to split it into multiple database servers with application making
decision how to consolidate data from various database servers. It might take
some development but could save you lot on hardware.

Just a thought..

HTH

Shridhar

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #12

Marc Slemko

On Thu, 08 Jul 2004 14:27:29 +0530, Shridhar Daithankar
<sh******@frodo.hserus.net> wrote:

- Postgresql or no postgresql, OS has to manage buffer cache. Why duplicate the
efforts which somebody is already better at? If OS improves, everybody benefits.
I can recall quite a few posts benefitting from moving to 2.6.x kernel from 2.4.x.

One big reason why a large memory space to cache things in your
database server can be beneficial is that it allows for some sometimes
very effective on the fly optimizations to use data structures
designed around in-memory access instead of on-disk access.

Anything in the OS's buffer cache must be represented in the on-disk format.

There are a variety of highly specialized databases that are designed
around being able to build certain structures such as indexes in
memory, and their claim to fame is being much faster for certain types
of queries on in-memory databases. There are many commodity databases
such as innodb that can do things such as build, on the fly, an in
memory hash of a btree index (or a range from a btree index) if
queries are frequently doing lookups that could benefit from it.

As the cost of having large quantities of data present in main memory
drops, postgresql's approach makes it hard to adopt significant
optimizations that minimize the difference between algorithms designed
for in-memory database access and disk based access.

How postgresql approaches these things isn't a bad way, and works
quite well for many people and does allow more effort to be focused in
other areas, but it still is a limitation in certain use cases.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #13

Edmund Dengler

Greetings!

On Fri, 2 Jul 2004, Mike Rylander wrote:

I find that experience does not bear this out. There is a saying a coworker
of mine has about apps that try to solve problems, in this case caching,
that are well understood and generally handled well at other levels of the
"software stack"... he calls them "too smart by half" :)

But on the other hand, general algorithms which are designed to work under
a wide variety of circumstances may fail in specific cases. I am thinking
of VACUUM which would kill most caching algorithms simply because we
cannot tell the O/S "by the by, this set of pages will not be used again,
and therefore it would be fine to use almost none of the general cache to
store this". All algorithms have assumptions of value distribution and
usages. Caches depend on locality of reference, and we do not have an easy
way to say when this is broken.

Regards!
Ed

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #14

Andy B

Hi,

That is certainly not correct. I don't have hard sources to back it up, but if you open a file that you jus close it, linux does not go copying it from it's cache to the process address space. It would rather juggle the page table to mak e memory pages available to your process.
Phew. I wasn't sure about this. At least the CPUS can concentrate on all
that page swapping.
- Postgresql or no postgresql, OS has to manage buffer cache. Why duplicate the efforts which somebody is already better at? If OS improves, everybody benefits. I can recall quite a few posts benefitting from moving to 2.6.x kernel from 2.4.x.

I did read that the Linux kernel now tries something better than just LRU
for cache replacement. (Least Frequently Used? Or a combination - I can't
remember exactly).

That given, I understand the arguments you give when thinking about systems
where most of the database can't fit on the machine... and that is generally
the case now.

But when, as will become more and more common over the next 3 or 4 years,
the whole database *easily* fits into RAM on a dedicated DB machine, with
room for everything else to operate without swapping, then there is a real
opportunity to improve performance by simply getting the data into RAM in an
optimised indexed layout. There are some really interesting 64 bit linux
multi CPU server solutions coming out now, with huge addressing and RAM
potential.

Well, best thing I can tell you is test yourself and find out the sweet spot. It should be between 64-128MB by general wisdom. If you find otherwise, we would be more than glad to hear.
Well I'll start with that, and see how it goes and let you know -it's a
while off.
I read another thread where you posted your actual requirements etc. I would be inclined to split it into multiple database servers with application making decision how to consolidate data from various database servers. It might

take

Thanks for the suggestion - it's something to bear in mind and hope I don't
ever need to go down that road... :(

Regards,
Andy

Nov 23 '05 #15

Andy B

Ok, this is very interesting. I had assumed that PG's cache already built up
such in memory structures from the data it gets off disk (aka OS buffer, God
willing).

Which particular DB's out there would be worth checking out for my
particular 'all in RAM scenario'?

(Though I need geographical index support or the ability to add my
own...that was the thing that originally made me look at PG - that and
having tried to use Mysql)

One big reason why a large memory space to cache things in your
database server can be beneficial is that it allows for some sometimes
very effective on the fly optimizations to use data structures
designed around in-memory access instead of on-disk access.

Anything in the OS's buffer cache must be represented in the on-disk format.
There are a variety of highly specialized databases that are designed
around being able to build certain structures such as indexes in
memory, and their claim to fame is being much faster for certain types
of queries on in-memory databases. There are many commodity databases
such as innodb that can do things such as build, on the fly, an in
memory hash of a btree index (or a range from a btree index) if
queries are frequently doing lookups that could benefit from it.

As the cost of having large quantities of data present in main memory
drops, postgresql's approach makes it hard to adopt significant
optimizations that minimize the difference between algorithms designed
for in-memory database access and disk based access.

How postgresql approaches these things isn't a bad way, and works
quite well for many people and does allow more effort to be focused in
other areas, but it still is a limitation in certain use cases.

Nov 23 '05 #16

Harald Fuchs

In article <Pi******************************@cyclops4.esentir e.com>,
Edmund Dengler <ed*****@eSentire.com> writes:

But on the other hand, general algorithms which are designed to work under
a wide variety of circumstances may fail in specific cases. I am thinking
of VACUUM which would kill most caching algorithms simply because we
cannot tell the O/S "by the by, this set of pages will not be used again,
and therefore it would be fine to use almost none of the general cache to
store this".

We _can_ tell that the OS: man madvise.
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #17

Marco Colombo

On Thu, 8 Jul 2004, Shridhar Daithankar wrote:

Hi,

Andy B wrote:
1. Postgresql is a two tiered cache mechanism. The first tier - the
postgresql shared buffer cache sits on the second, larger tier, the linux
buffer cache. So bits of the same data end up being in memory...twice, and
two cache mechanisms operate at the same time. (That's how I understand it).
That is correct. But I would advise you to see shared buffers as workspace
rather than cache.

Hmm, I'm not sure that's true. The first time you read the data,
maybe it gets copied twice (but I don't know the details of the
implementation of buffers in PostgreSQL, I'm making wild guesses here).

Later, things are not so simple. Since we're considering nested caches
here, I think that whatever is "hot" in the PostgreSQL buffers, will
automatically be "cold" in the Linux page cache, and will be a good
canditate for eviction. You don't access _both_ copies for sure.
If you find the data in a buffer, Linux won't notice you accessed it,
and slowly mark its copy as "not recently used".

So, on the long run, I think that "hot" data stays (only) in some
application buffer, "warm" data in the Linux cache, "cold" data on disk.
Multiple copies occur rarely, and for a relatively short time. Of course,
I'm assuming there's some kind of memory pressure. If not, unless
copies of data may stay in RAM "forever".

2. Even if the linux buffer cache contains all the data required for an
execution of a plan, there is still a load of memory copying to do between
these two tiers. Though memory copying is faster than disk access, it is
still an overhead, and isn't there the real problem of thrashing between
these two tiers if the plan can't fit all the data into the top tier, even
if the thrashing is restricted to the memory system?

That is certainly not correct. I don't have hard sources to back it up, but if
you open a file that you jus close it, linux does not go copying it from it's
cache to the process address space. It would rather juggle the page table to mak
e memory pages available to your process.

I'm not familiar with recent kernel development. For sure, the kernel
used copy_from/to_user() a lot in the past. You seem to overestimate
the cost of RAM-to-RAM copy vs. the cost of messing with VM mappings.

The open()/close() sequence won't copy anything, agreed. It's read()
we're considering here.
By that argument, there would be three caches. Postgresql shared buffers, files
mapped into process address space and linux buffers. I think that defeats the
purpose of caching.

[...]

..TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Co*****@ESI.it

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #18

Mike Rylander

On Thursday 08 July 2004 02:12 pm, you wrote:

Greetings!

On Fri, 2 Jul 2004, Mike Rylander wrote:
I find that experience does not bear this out. There is a saying a
coworker of mine has about apps that try to solve problems, in this case
caching, that are well understood and generally handled well at other
levels of the "software stack"... he calls them "too smart by half" :)
But on the other hand, general algorithms which are designed to work under
a wide variety of circumstances may fail in specific cases. I am thinking
of VACUUM which would kill most caching algorithms simply because we
cannot tell the O/S "by the by, this set of pages will not be used again,
and therefore it would be fine to use almost none of the general cache to
store this".

True enough! I don't know the internals of the Linux file cache, but I wonder
if there would be a way that we could ignore post-VACUUM useless pages so
they would evenutally fall out of the cache... perhaps a position pointer
within the file that points to the next "valid" page? Then we request that
particular page from the OS and it skips loading the interveening pages, and
pushes the "invalid"pages out of the cache over time...
All algorithms have assumptions of value distribution and
usages. Caches depend on locality of reference, and we do not have an easy
way to say when this is broken.

Absolutely. I don't want to give the impression that I think the Linux file
cache is the be-all-end-all of caching alogs, but the initial post was
regarding a DB that does fit entirely in RAM, be it in the OS cache or the
shared_buffers workspace.
Regards!
Right back a'cha!
--miker
Ed

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #19

Scott Marlowe

On Thu, 2004-07-08 at 13:19, Andy B wrote:

Ok, this is very interesting. I had assumed that PG's cache already built up
such in memory structures from the data it gets off disk (aka OS buffer, God
willing).

Which particular DB's out there would be worth checking out for my
particular 'all in RAM scenario'?

(Though I need geographical index support or the ability to add my
own...that was the thing that originally made me look at PG - that and
having tried to use Mysql)

You're going about this the wrong way. Build a reasonable large test
database and see how postgresql does. Examining the underlying tech in
a database is all well and good for pointers, but until you really
benchmark it with realistic load, you don't know what it will actually
do. You're gonna buy a big honkin box anyway, so after you get it, test
the different engines to see what they can do.
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #20

Enough RAM for entire Database.. cost aside, is this going to be fastest?

Similar topics