Method for compressing varchar/nvarchar columns?

I have an application with highly compressible strings (gzip encoding usually achieves somewhere between a 20-50X reduction). My base 350MB database is mostly made up of these slowly changing (or even static) strings. I would like to compress them so that my disk I/O and memory footprint are greatly reduced.

Some databases have the ability to provide a compressed table or compressed column, or provide user-defined functions to compress an individual field [a la COMPRESS() and DECOMPRESS()].

I could write a UDF with an extended procedure if I need to, but I'm wondering if there are any other known methods to do this in MS SQL Server 2000 today?

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)


Jul 20 '05 #1
Based on a co-worker's experience with similar compression algorithms, as long as the compressed data isn't being used as an index directly, the performance slowdown would be more than made up for by shifting the bulk of the I/O from disk reads/writes to cached pages in memory. The 20-50X reduction in data size really makes a difference (5X to >1000X speed improvement depending on the caching scheme).

It's both a performance and a disk storage issue. Although a single database is ~350MB, I might want to have dozens of them on a single machine... not everything is a hardware problem :-).

It sounds like there is no native or add-on support for compressing columns in SQL Server 2000. There is such support in OLAP cubes when stored in their native cube format (I was reading the docs yesterday, so I'll look into that too). Most of the other DBMSs out there have native compression support on relational tables or are adding it, so I can hope.

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)
"Simon Hayes" <sq*@hayes.ch> wrote in message
news:3f**********@news.bluewin.ch...
"Junkmail" <ju******@anywhere.com> wrote in message
news:EQHWa.46033$o%2.23461@sccrnsc02...
I have an application with highly compressable strings (gzip encoding
usually does somewhere between 20-50X reduction.) My base 350MB
database is mostly made up of these slowly (or even static) strings. I
would like to compress these so that my disk I/O and memory footprint
is greatly reduced.

Some databases have the ability to provide a compressed
table, compressed column, or provide a user defined function to
compress an indvidual Field with a user defined function
[ala. COMPRESS() and DECOMPRESS() ].

I could right a UDF with an extended prodcedure if I need to but I'm
wondering if there are any other known methods to do this in MS SQL
Server 2000 today?

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)

I'm not sure that I understand your question - are you saying that you'd like to store the compressed text in a column, and dynamically compress/decompress on INSERT/SELECT? If so, I don't think this is such a good approach - disk space is very cheap, and a 350MB database is very small. In general, the overhead of compression/decompression would outweigh any benefits of a smaller database. In addition, you wouldn't be able to index or search the column efficiently, although you may not need to if you always access it via a key and never use the column in a WHERE clause.

But since it's not clear from your post what the real issue is that you're
trying to solve, it's possible you may have a good reason for trying to do
this. There is no built-in functionality in MSSQL to do the compression/decompression, so an extended proc is the obvious server-side solution; otherwise, handle it in the client application.
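
A minimal sketch of what the client-side approach might look like on the schema side (table and column names are purely illustrative; the application would gzip the text before the INSERT and gunzip it after the SELECT):

-- Illustrative only: the client compresses footext before inserting, so the
-- server just stores opaque bytes. The key stays uncompressed and indexable;
-- the compressed column is never referenced in a WHERE clause.
create table foo_compressed
( fookey  int   not null primary key,
  footext image not null  -- gzip'd bytes; varbinary(8000) if rows are small enough
)

-- reads are always by key; decompression happens in the client
select footext from foo_compressed where fookey = 1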

Simon

Jul 20 '05 #2
Just out of curiosity, which other DBMSs do you know of that support compression?
"No Junkmail" <no********@anywhere.com> wrote in message
news:NjuXa.63017$uu5.6655@sccrnsc04...


Jul 20 '05 #3
MySQL has compressed tables (read-only) and is adding COMPRESS() and DECOMPRESS() UDFs in version 5.0.

MS does compression on its native OLAP cubes.

Oracle 9i can do data compression on Enterprise Edition relational tables and OLAP cubes.

Data compression seems to be such a basic staple when working with large data of any kind that I was surprised MS did not have it as a standard feature in the current release.

"DHatheway" <dl********@mmm.com.nospam> wrote in message
news:bg**********@tuvok3.mmm.com...
Just out of curiosity, which other DBMS's that you know of support
compression?
"No Junkmail" <no********@anywhere.com> wrote in message
news:NjuXa.63017$uu5.6655@sccrnsc04...
Based on a co-workers experience with similar compression algorithms
as long as the compressed data isn't being used as an index directly then the performance slow down would be more than made up by shifting the
bulk of the I/O from disk read/write to cached pages in memory. The
20-50X reduction of data size really makes a difference (5 to >1000X
speed improvement depending on the caching scheme.)

It's a both a performance and a disk storage issue. Although a single
database is ~350MBytes I might want to have dozens of them on a
single machine... not everything is a hardware problem :-).

It sounds like there is no native or add-on support for compressing
columns in SQL Server 2000. There is such support in the OLAP
cubes when stored in their native cube format (reading the docs
yesterday so I'll look into that too.) Most of the other DBMS's out
there have native compression support on relational tables or are
adding it so I can hope.

--
Frederick Staats


Jul 20 '05 #4
Just a little more detail if anyone cares:

On Oracle, you can look at the following white paper on index-organized tables (the feature is called "key compression"): http://otn.oracle.com/products/oracl...ts/iot_ds.html

On MySQL, see http://www.mysql.com/doc/en/index.html and look for the keyword compress. You will have to look in the prerelease documentation for the new UDF functions to come.

"No Junkmail" <no********@anywhere.com> wrote in message
news:ONEXa.67708$o%2.32778@sccrnsc02...


Jul 20 '05 #5
Thanks for the pointers. I looked at both.

The Oracle feature seems to be a key structure optimization only.

The MySQL readonly compressed table would be OK for situations that are,
well, readonly.

You could probably get away with setting up a Microsoft SQL Server database that was "readonly" on a compressed volume, too. I do note that MS SQL Server does NOT support databases on compressed volumes but, according to the KB article I read, that has to do with sector-aligned writes on the log device, and a "readonly" database wouldn't get into trouble that way, so I imagine it would work.

I couldn't find the prerelease documentation for the MySQL COMPRESS() and
DECOMPRESS() functions but I'm guessing they would be used this way:

insert mytable (mykey, mycomprsdcol) values (1, compress('data to store in compressed form'))

and then

select * from mytable where DECOMPRESS(mycomprsdcol) like '%store%'

... in other words, lots of CPU overhead, a tendency to force scans, and programmer awareness of the compressed structure? You'd have to save a LOT of space and only do comparisons on the compressed columns very rarely to make this pay off.
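
For what it's worth, if rows are always fetched by key, the decompression never has to appear in the WHERE clause at all, so no scan is forced (still just guessing at the function name, as above):

select mykey, DECOMPRESS(mycomprsdcol) as mytext
from mytable
where mykey = 1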

I suppose one could write one's own COMPRESS() and DECOMPRESS() functions in
MS SQL Server and have functionality equivalent to MySQL, along with the
equivalent tradeoffs.

I also looked at the DB2 site... Compression there appears to take advantage
of special hardware available on the IBM mainframe systems. I'm impressed
with that. However, if you do compression on a system that does not have
the special hardware, your CPU will take the hit. I suppose that would
include all Intel-based servers (and maybe this feature is only supported at
all on MVS DB2s).

Although it's available in some DBMS systems, it's different in each one and seems to be of limited utility in some and completely non-transparent in others; I'd hesitate to call it a "basic staple" at this point.

If I were to request this feature from MS, I think I'd ask for it in terms of "support database devices on compressed volumes." That way, you could establish the highly compressible part of your database on a filegroup that's compressed, and the parts of your database where you don't think compression will work well could be on uncompressed volumes. Since NTFS compression already exists, I'd also think this was the fastest way to get the feature to market. In fact, explicitly limited support, e.g., "no log files on compressed devices," would be reasonable.
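
Just to illustrate the placement half of that idea (the file paths are made up, and putting the secondary filegroup on an NTFS-compressed volume is exactly the part that isn't supported today):

-- highly compressible data gets its own filegroup on its own volume
CREATE DATABASE foo
ON PRIMARY
   ( NAME = foo_data, FILENAME = 'D:\data\foo_data.mdf' ),
FILEGROUP compress_fg
   ( NAME = foo_comp, FILENAME = 'E:\compdata\foo_comp.ndf' )
LOG ON
   ( NAME = foo_log, FILENAME = 'D:\log\foo_log.ldf' )
GO
USE foo
GO
CREATE TABLE big_text
( fookey  int           NOT NULL PRIMARY KEY,
  footext varchar(8000) NOT NULL
) ON compress_fg
GO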

Ideally, I guess, I'd want to be able to declare particular columns to be
compressed, specifying compression as an attribute:

create table foo
(fookey int not null,
footext varchar(100) compressed)

and have handling of the compressed data be transparent to the programmer.
It would still be an interesting performance tradeoff problem for the DBA, of course.
"No Junkmail" <no********@anywhere.com> wrote in message
news:yXEXa.67772$o%2.32105@sccrnsc02...
Just a little more detail if anyone cares:

On Oracle you can look at the following white paper on indexed
tables (the feature is call "key compression".)
http://otn.oracle.com/products/oracl...ts/iot_ds.html

On MySQL http://www.mysql.com/doc/en/index.html an look for
the key word compress. You will have to look into the prerelease
documentation for the new UDF functions to come.

"No Junkmail" <no********@anywhere.com> wrote in message
news:ONEXa.67708$o%2.32778@sccrnsc02...
MySQL has compressed tables (read only) and is adding COMPRESS()
and DECOMPRESS() UDF in version 5.0.

MS does compression on it's native OLAP cubes.

Oracle 9i can do data compression on Enterprise Edition relational tables and
OLAP cubes.

Data compression seems to be such a basic staple when working with
large data of any kind I was surprised that MS did not have it as a
standard feature in the current release

"DHatheway" <dl********@mmm.com.nospam> wrote in message
news:bg**********@tuvok3.mmm.com...
Just out of curiosity, which other DBMS's that you know of support
compression?


Jul 20 '05 #6
My comments are embedded in your message below:

"DHatheway" <dl********@mmm.com.nospam> wrote in message
news:bg**********@tuvok3.mmm.com...
Thanks for the pointers. I looked at both.

The Oracle feature seems to be a key structure optimization only.
From the Oracle white paper:

"Reduced Storage
Index-organized tables maintain a single storage structure -- the B-tree
index.
Primary key column values are stored only in the B-tree index and not
duplicated
in the table and index as happens in a conventional heap-organized table.
Because rows of an index-organized table are stored in primary key order, a
significant amount of additional storage space savings can be obtained
through
the use of key compression."

It clearly reads to me that the primary key is not duplicated and can be compressed. Yet it remains in sorted order within the B-tree, making clustered access to ranges of similar values quicker due to locality and caching.

The MySQL readonly compressed table would be OK for situations that are,
well, readonly.
Yes... it forces you to a more complicated solution (e.g. a two-table lookup and periodic offline updates to the compressed table). This is why the new 5.0 (pre-alpha) COMPRESS() and DECOMPRESS() are in some ways more attractive to me.
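
Purely illustrative of what I mean by a two-table lookup -- the table names are made up: the volatile rows live in a small ordinary table, the bulk sits in the read-only compressed table that gets rebuilt offline, and queries union the two:

select doc_id, body from docs_recent where doc_id = 42
union all
select doc_id, body from docs_archive where doc_id = 42
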
You could probably get away with setting up a Microsoft SQL Server database that was "readonly" on a compressed volume, too. I do note that MS SQL Server does NOT support databases on compressed volumes but, according to the KB article I read, that has to do with sector-aligned writes on the log device, and a "readonly" database wouldn't get into trouble that way, so I imagine it would work.
Don't do that! Volume compression breaks the smart caching patterns of any database... if it works at all it will be unusable. MS strongly advises against this approach. Try it and you will see :-).
I couldn't find the prerelease documentation for the MySQL COMPRESS() and
DECOMPRESS() functions but I'm guessing they would be used this way:
The docs for 5.0 are buried in the 5.0 pre-release notes, which may be why you couldn't find them.
insert mytable (mykey, mycomprsdcol) values (1, compress('data to store in compressed form'))

and then

select * from mytable where DECOMPRESS(mycomprsdcol) like '%store%'

... in other words, lots of CPU overhead, a tendency to force scans, and programmer awareness of the compressed structure? You'd have to save a LOT of space and only do comparisons on the compressed columns very rarely to make this pay off.
I typically get 8-to-1 compression (many, many, many text fields of 100+ characters differ by only a few bits). This can easily take my working set from thrashing the disks to fitting comfortably in the server's RAM cache.

I think the "lots of CPU overhead" concern is ungrounded. It really is a tradeoff between disk I/O and CPU. On many access patterns I'm almost always waiting for the disk, and in that case it is a net win to reduce disk I/O by compressing the data to be read or written. Last time I looked, my disks keep getting bigger about 10X more quickly than they get faster... and the CPU is getting faster than my disks over time. If all else fails I can get a box with more CPUs for less money than it takes to buy a substantially faster disk subsystem. (Just because big cheap disks are available doesn't mean that a motherboard with fast I/O, a fast SCSI RAID controller, and small but fast 15K RPM SCSI disks are.)

There are many different compression algorithms out there. Some are much quicker than others. Some are also very asymmetric and can decompress much faster than they compress... which works very well for "slowly" changing tables. I have a researcher friend who has found that hash look-ups on a "compressed" hash table are about 3X slower... it didn't help him in his all-in-memory application, but it would help me a lot in my disk-bound database.
I suppose one could write one's own COMPRESS() and DECOMPRESS() functions in MS SQL Server and have functionality equivalent to MySQL, along with the equivalent tradeoffs.

I also looked at the DB2 site... Compression there appears to take advantage of special hardware available on the IBM mainframe systems. I'm impressed with that. However, if you do compression on a system that does not have the special hardware, your CPU will take the hit. I suppose that would include all Intel-based servers (and maybe this feature is only supported at all on MVS DB2s).
On a networked SQL Server what else is the CPU there for? I'm not running
any fancy screen savers on mine :-).
Although it's available in some DBMS systems, it's different in each one and seems to be of limited utility in some and completely non-transparent in others; I'd hesitate to call it a "basic staple" at this point.

If I were to request this feature from MS, I think I'd ask for it in terms of "support database devices on compressed volumes." That way, you could establish the highly compressible part of your database on a filegroup that's compressed, and the parts of your database where you don't think compression will work well could be on uncompressed volumes. Since NTFS compression already exists, I'd also think this was the fastest way to get the feature to market. In fact, explicitly limited support, e.g., "no log files on compressed devices," would be reasonable.

Ideally, I guess, I'd want to be able to declare particular columns to be
compressed, specifying compression as an attribute:

create table foo
(fookey int not null,
footext varchar(100) compressed)

and have handling of the compressed data be transparent to the programmer. It would still be an interesting performance tradeoff problem for the DBA, of course.


I too would prefer a transparent solution, which is why I asked in the newsgroup in the first place.

--
frederick dot w dot staats at intel dot com
Jul 20 '05 #7

