Connecting Tech Pros Worldwide Forums | Help | Site Map

Method for compressing varchar/nvarchar columns?

Junkmail
Guest
 
Posts: n/a
#1: Jul 20 '05
I have an application with highly compressable strings (gzip encoding
usually does somewhere between 20-50X reduction.) My base 350MB
database is mostly made up of these slowly (or even static) strings. I
would like to compress these so that my disk I/O and memory footprint
is greatly reduced.

Some databases have the ability to provide a compressed
table, compressed column, or provide a user defined function to
compress an indvidual Field with a user defined function
[ala. COMPRESS() and DECOMPRESS() ].

I could right a UDF with an extended prodcedure if I need to but I'm
wondering if there are any other known methods to do this in MS SQL
Server 2000 today?

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)





No Junkmail
Guest
 
Posts: n/a
#2: Jul 20 '05

re: Method for compressing varchar/nvarchar columns?


Based on a co-workers experience with similar compression algorithms
as long as the compressed data isn't being used as an index directly then
the performance slow down would be more than made up by shifting the
bulk of the I/O from disk read/write to cached pages in memory. The
20-50X reduction of data size really makes a difference (5 to >1000X
speed improvement depending on the caching scheme.)

It's a both a performance and a disk storage issue. Although a single
database is ~350MBytes I might want to have dozens of them on a
single machine... not everything is a hardware problem :-).

It sounds like there is no native or add-on support for compressing
columns in SQL Server 2000. There is such support in the OLAP
cubes when stored in their native cube format (reading the docs
yesterday so I'll look into that too.) Most of the other DBMS's out
there have native compression support on relational tables or are
adding it so I can hope.

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)


"Simon Hayes" <sql@hayes.ch> wrote in message
news:3f2bb107$1_5@news.bluewin.ch...[color=blue]
> "Junkmail" <junkmail@anywhere.com> wrote in message
> news:EQHWa.46033$o%2.23461@sccrnsc02...[color=green]
> > I have an application with highly compressable strings (gzip encoding
> > usually does somewhere between 20-50X reduction.) My base 350MB
> > database is mostly made up of these slowly (or even static) strings. I
> > would like to compress these so that my disk I/O and memory footprint
> > is greatly reduced.
> >
> > Some databases have the ability to provide a compressed
> > table, compressed column, or provide a user defined function to
> > compress an indvidual Field with a user defined function
> > [ala. COMPRESS() and DECOMPRESS() ].
> >
> > I could right a UDF with an extended prodcedure if I need to but I'm
> > wondering if there are any other known methods to do this in MS SQL
> > Server 2000 today?
> >
> > --
> > Frederick Staats
> > frederick dot w dot staats at intel dot com (I hate junk mail :-)
> >[/color]
>
> I'm not sure that I understand your question - are you saying that you'd
> like to stored the compressed text in a column, and dynamically
> compress/decompress on INSERT/SELECT? If so, I don't think this is such a
> good approach - disk space is very cheap, and a 350MB database is very
> small. In general, the overhead of compression/decompression would[/color]
outweigh[color=blue]
> any benefits of a smaller database. In addition, you wouldn't be able to
> index or search the column efficiently, although you may not need to if[/color]
you[color=blue]
> always access it via a key, and never use the column in a WHERE clause.
>
> But since it's not clear from your post what the real issue is that you're
> trying to solve, it's possible you may have a good reason for trying to do
> this. There is no built-in functionality in MSSQL to do the
> compression/decompression, so an extended proc is the obvious server-side
> solution, or handle it in the client application.
>
> Simon
>
>[/color]


DHatheway
Guest
 
Posts: n/a
#3: Jul 20 '05

re: Method for compressing varchar/nvarchar columns?


Just out of curiosity, which other DBMS's that you know of support
compression?


"No Junkmail" <nojunkmail@anywhere.com> wrote in message
news:NjuXa.63017$uu5.6655@sccrnsc04...[color=blue]
> Based on a co-workers experience with similar compression algorithms
> as long as the compressed data isn't being used as an index directly then
> the performance slow down would be more than made up by shifting the
> bulk of the I/O from disk read/write to cached pages in memory. The
> 20-50X reduction of data size really makes a difference (5 to >1000X
> speed improvement depending on the caching scheme.)
>
> It's a both a performance and a disk storage issue. Although a single
> database is ~350MBytes I might want to have dozens of them on a
> single machine... not everything is a hardware problem :-).
>
> It sounds like there is no native or add-on support for compressing
> columns in SQL Server 2000. There is such support in the OLAP
> cubes when stored in their native cube format (reading the docs
> yesterday so I'll look into that too.) Most of the other DBMS's out
> there have native compression support on relational tables or are
> adding it so I can hope.
>
> --
> Frederick Staats[/color]



No Junkmail
Guest
 
Posts: n/a
#4: Jul 20 '05

re: Method for compressing varchar/nvarchar columns?


MySQL has compressed tables (read only) and is adding COMPRESS()
and DECOMPRESS() UDF in version 5.0.

MS does compression on it's native OLAP cubes.

Oracle 9i can do data compression on Enterprise Edition relational tables
and
OLAP cubes.

Data compression seems to be such a basic staple when working with
large data of any kind I was surprised that MS did not have it as a
standard feature in the current release

"DHatheway" <dlhatheway@mmm.com.nospam> wrote in message
news:bgm90p$gbb$1@tuvok3.mmm.com...[color=blue]
> Just out of curiosity, which other DBMS's that you know of support
> compression?
>
>
> "No Junkmail" <nojunkmail@anywhere.com> wrote in message
> news:NjuXa.63017$uu5.6655@sccrnsc04...[color=green]
> > Based on a co-workers experience with similar compression algorithms
> > as long as the compressed data isn't being used as an index directly[/color][/color]
then[color=blue][color=green]
> > the performance slow down would be more than made up by shifting the
> > bulk of the I/O from disk read/write to cached pages in memory. The
> > 20-50X reduction of data size really makes a difference (5 to >1000X
> > speed improvement depending on the caching scheme.)
> >
> > It's a both a performance and a disk storage issue. Although a single
> > database is ~350MBytes I might want to have dozens of them on a
> > single machine... not everything is a hardware problem :-).
> >
> > It sounds like there is no native or add-on support for compressing
> > columns in SQL Server 2000. There is such support in the OLAP
> > cubes when stored in their native cube format (reading the docs
> > yesterday so I'll look into that too.) Most of the other DBMS's out
> > there have native compression support on relational tables or are
> > adding it so I can hope.
> >
> > --
> > Frederick Staats[/color]
>
>
>[/color]


No Junkmail
Guest
 
Posts: n/a
#5: Jul 20 '05

re: Method for compressing varchar/nvarchar columns?


Just a little more detail if anyone cares:

On Oracle you can look at the following white paper on indexed
tables (the feature is call "key compression".)
http://otn.oracle.com/products/oracl...ts/iot_ds.html

On MySQL http://www.mysql.com/doc/en/index.html an look for
the key word compress. You will have to look into the prerelease
documentation for the new UDF functions to come.

"No Junkmail" <nojunkmail@anywhere.com> wrote in message
news:ONEXa.67708$o%2.32778@sccrnsc02...[color=blue]
> MySQL has compressed tables (read only) and is adding COMPRESS()
> and DECOMPRESS() UDF in version 5.0.
>
> MS does compression on it's native OLAP cubes.
>
> Oracle 9i can do data compression on Enterprise Edition relational tables
> and
> OLAP cubes.
>
> Data compression seems to be such a basic staple when working with
> large data of any kind I was surprised that MS did not have it as a
> standard feature in the current release
>
> "DHatheway" <dlhatheway@mmm.com.nospam> wrote in message
> news:bgm90p$gbb$1@tuvok3.mmm.com...[color=green]
> > Just out of curiosity, which other DBMS's that you know of support
> > compression?
> >
> >
> > "No Junkmail" <nojunkmail@anywhere.com> wrote in message
> > news:NjuXa.63017$uu5.6655@sccrnsc04...[color=darkred]
> > > Based on a co-workers experience with similar compression algorithms
> > > as long as the compressed data isn't being used as an index directly[/color][/color]
> then[color=green][color=darkred]
> > > the performance slow down would be more than made up by shifting the
> > > bulk of the I/O from disk read/write to cached pages in memory. The
> > > 20-50X reduction of data size really makes a difference (5 to >1000X
> > > speed improvement depending on the caching scheme.)
> > >
> > > It's a both a performance and a disk storage issue. Although a single
> > > database is ~350MBytes I might want to have dozens of them on a
> > > single machine... not everything is a hardware problem :-).
> > >
> > > It sounds like there is no native or add-on support for compressing
> > > columns in SQL Server 2000. There is such support in the OLAP
> > > cubes when stored in their native cube format (reading the docs
> > > yesterday so I'll look into that too.) Most of the other DBMS's out
> > > there have native compression support on relational tables or are
> > > adding it so I can hope.
> > >
> > > --
> > > Frederick Staats[/color]
> >
> >
> >[/color]
>
>[/color]


DHatheway
Guest
 
Posts: n/a
#6: Jul 20 '05

re: Method for compressing varchar/nvarchar columns?


Thanks for the pointers. I looked at both.

The Oracle feature seems to be a key structure optimization only.

The MySQL readonly compressed table would be OK for situations that are,
well, readonly.

You could probably get away with setting up a Microsoft SQL Server database
that was "readonly" on a compressed volume, too. I do note that MS SQL
Server does NOT support databases on compressed volumes but, according to
the KB article I read, it has do do with sector-aligned writes on the log
device and a "readonly" database wouldn't get into trouble that way, so I
imagine it would work.

I couldn't find the prerelease documentation for the MySQL COMPRESS() and
DECOMPRESS() functions but I'm guessing they would be used this way:

insert mytable (mykey, mycomprsdcol) values (1,compress('data to store in
compressed form')

and then

select * from mytable where DECOMPRESS(mycomprsdcol) like '%store%'

.... in other words, lots of CPU overhead, a tendency to force scans and
programmer awareness of the compressed structure? You'd have to save a LOT
of space and only do comparisons on the compressed columns very rarely to
make this pay off.

I suppose one could write one's own COMPRESS() and DECOMPRESS() functions in
MS SQL Server and have functionality equivalent to MySQL, along with the
equivalent tradeoffs.

I also looked at the DB2 site... Compression there appears to take advantage
of special hardware available on the IBM mainframe systems. I'm impressed
with that. However, if you do compression on a system that does not have
the special hardware, your CPU will take the hit. I suppose that would
include all Intel-based servers (and maybe this feature is only supported at
all on MVS DB2s).

Although it's available in some DBMS systems, it's different in each one and
seems to be of limited utility in some and completely untransparent in
others; I'd hesitate to call it a "basic staple" at this point.

If I were to request this feature from MS, I think I'd ask for it in terms
of "support database devices on compressed volumes." That way, you could
establish the highly compressible part of your database on a filegroup
that's compressed and the parts of your database where you don't think
compression will work well for you could be on uncompressed volumes. Since
NTFS compression already exists, I'd also think this was the fastest way to
get the feature to market. In fact, explicitly limited support, e.g., "no
log files on compressed devices" would be reasonable.

Ideally, I guess, I'd want to be able to declare particular columns to be
compressed, specifying compression as an attribute:

create table foo
(fookey int not null,
footext varchar(100) compressed)

and have handling of the compressed data be transparent to the programmer.
It would still be an nteresting performance tradeoff problem for the DBA, of
course.


"No Junkmail" <nojunkmail@anywhere.com> wrote in message
news:yXEXa.67772$o%2.32105@sccrnsc02...[color=blue]
> Just a little more detail if anyone cares:
>
> On Oracle you can look at the following white paper on indexed
> tables (the feature is call "key compression".)
> http://otn.oracle.com/products/oracl...ts/iot_ds.html
>
> On MySQL http://www.mysql.com/doc/en/index.html an look for
> the key word compress. You will have to look into the prerelease
> documentation for the new UDF functions to come.
>
> "No Junkmail" <nojunkmail@anywhere.com> wrote in message
> news:ONEXa.67708$o%2.32778@sccrnsc02...[color=green]
> > MySQL has compressed tables (read only) and is adding COMPRESS()
> > and DECOMPRESS() UDF in version 5.0.
> >
> > MS does compression on it's native OLAP cubes.
> >
> > Oracle 9i can do data compression on Enterprise Edition relational[/color][/color]
tables[color=blue][color=green]
> > and
> > OLAP cubes.
> >
> > Data compression seems to be such a basic staple when working with
> > large data of any kind I was surprised that MS did not have it as a
> > standard feature in the current release
> >
> > "DHatheway" <dlhatheway@mmm.com.nospam> wrote in message
> > news:bgm90p$gbb$1@tuvok3.mmm.com...[color=darkred]
> > > Just out of curiosity, which other DBMS's that you know of support
> > > compression?[/color][/color][/color]



No Junkmail
Guest
 
Posts: n/a
#7: Jul 20 '05

re: Method for compressing varchar/nvarchar columns?


My comments are embedded in your message below:

"DHatheway" <dlhatheway@mmm.com.nospam> wrote in message
news:bgrm6i$d3o$1@tuvok3.mmm.com...[color=blue]
> Thanks for the pointers. I looked at both.
>
> The Oracle feature seems to be a key structure optimization only.[/color]

From the Oracle white paper:

"Reduced Storage
Index-organized tables maintain a single storage structure -- the B-tree
index.
Primary key column values are stored only in the B-tree index and not
duplicated
in the table and index as happens in a conventional heap-organized table.
Because rows of an index-organized table are stored in primary key order, a
significant amount of additional storage space savings can be obtained
through
the use of key compression."

It clearly reads to me that the primary key is not duplicated and can be
compressed. Yet it remains in sorted order within the b-tree making
clustered
access to ranges of similar value quicker due to locality and caching.
[color=blue]
> The MySQL readonly compressed table would be OK for situations that are,
> well, readonly.[/color]

Yes... it forces you to a more complicated solution (eg. two table lookup
and periodic offline updates to the compressed table.) This is why the new
5.0 (pre alpha) COMPRESS() and DECOMPRESS() are in some ways
more attractive to me.
[color=blue]
> You could probably get away with setting up a Microsoft SQL Server[/color]
database[color=blue]
> that was "readonly" on a compressed volume, too. I do note that MS SQL
> Server does NOT support databases on compressed volumes but, according to
> the KB article I read, it has do do with sector-aligned writes on the log
> device and a "readonly" database wouldn't get into trouble that way, so I
> imagine it would work.[/color]

Don't do that! Volume compression breaks the smart caching patterns of
any database... if it works at all it will be unusable. MS strongly advises
against
this approach. Try it and you will see :-).
[color=blue]
> I couldn't find the prerelease documentation for the MySQL COMPRESS() and
> DECOMPRESS() functions but I'm guessing they would be used this way:[/color]

The doc's for 5.0 are burried in the 5.0 pre-release notes which may be why
you
couldn't find them.
[color=blue]
> insert mytable (mykey, mycomprsdcol) values (1,compress('data to store in
> compressed form')
>
> and then
>
> select * from mytable where DECOMPRESS(mycomprsdcol) like '%store%'
>
> ... in other words, lots of CPU overhead, a tendency to force scans and
> programmer awareness of the compressed structure? You'd have to save a[/color]
LOT[color=blue]
> of space and only do comparisons on the compressed columns very rarely to
> make this pay off.[/color]

I typically have 8 to 1 compression (many, many, many text fields of 100+
characters differ by only a few bits.) This can easily take my working set
from thrashing the disks to fitting comfortably in the servers RAM cache.

I think the "lots of CPU overhead" is ungrounded. It really is a tradeoff
between
disk I/O and cpu. On many access patterns I'm almost always waiting for
the disk. In this case it is a net win to reduce disk I/O by compressing
the data
to be read or written. Last time I looked my disks keep getting bigger
about
10X more quickly than they get faster... also the cpu is getting faster than
my
disks over time. If all else fails I can get a box with more cpu's for less
money
than it takes to buy a substanually faster disk subsystem. (Just because
big
cheap disks are available it doesn't mean that a motherboard with fast I/O,
fast SCSI RAID controller, and small but fast 15K RPM SCSI disks are.)

There are many different compression algorithms out there. Some are much
quicker than others. Some are also very asymmetric and can decompress much
faster than compress... which works very well for "slowly" changing tables.
I
have researcher friend who has found that hash look-ups on a "compressed"
hash table are about 3X slower... didn't help him in his all in memory
application ... but would help me a lot in my disk bound database.
[color=blue]
> I suppose one could write one's own COMPRESS() and DECOMPRESS() functions[/color]
in[color=blue]
> MS SQL Server and have functionality equivalent to MySQL, along with the
> equivalent tradeoffs.
>
> I also looked at the DB2 site... Compression there appears to take[/color]
advantage[color=blue]
> of special hardware available on the IBM mainframe systems. I'm impressed
> with that. However, if you do compression on a system that does not have
> the special hardware, your CPU will take the hit. I suppose that would
> include all Intel-based servers (and maybe this feature is only supported[/color]
at[color=blue]
> all on MVS DB2s).[/color]

On a networked SQL Server what else is the CPU there for? I'm not running
any fancy screen savers on mine :-).
[color=blue]
> Although it's available in some DBMS systems, it's different in each one[/color]
and[color=blue]
> seems to be of limited utility in some and completely untransparent in
> others; I'd hesitate to call it a "basic staple" at this point.
>
> If I were to request this feature from MS, I think I'd ask for it in terms
> of "support database devices on compressed volumes." That way, you could
> establish the highly compressible part of your database on a filegroup
> that's compressed and the parts of your database where you don't think
> compression will work well for you could be on uncompressed volumes.[/color]
Since[color=blue]
> NTFS compression already exists, I'd also think this was the fastest way[/color]
to[color=blue]
> get the feature to market. In fact, explicitly limited support, e.g., "no
> log files on compressed devices" would be reasonable.
>
> Ideally, I guess, I'd want to be able to declare particular columns to be
> compressed, specifying compression as an attribute:
>
> create table foo
> (fookey int not null,
> footext varchar(100) compressed)
>
> and have handling of the compressed data be transparent to the programmer.
> It would still be an nteresting performance tradeoff problem for the DBA,[/color]
of[color=blue]
> course.[/color]

I to would prefer a transparent solution too, which was why I asked in the
news
group in the first place.

--
frederick dot w dot staats at intel dot com


Closed Thread