
Method for compressing varchar/nvarchar columns?

I have an application with highly compressible strings (gzip encoding
usually gives somewhere between a 20-50X reduction). My base 350MB
database is mostly made up of these slowly changing (or even static)
strings. I would like to compress them so that my disk I/O and memory
footprint are greatly reduced.

Some databases provide a compressed table, a compressed column, or a
user-defined function to compress an individual field
[a la COMPRESS() and DECOMPRESS()].

I could write a UDF with an extended procedure if I need to, but I'm
wondering if there are any other known methods to do this in MS SQL
Server 2000 today?
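
For what it's worth, a rough sketch of what that extended-procedure route
might look like is below. xp_compress and xp_decompress are hypothetical
names for procedures you would have to write yourself (e.g. a DLL wrapping
zlib); nothing like them ships with SQL Server 2000, and the column sizes
are only illustrative:

-- hypothetical extended procs: xp_compress / xp_decompress do not exist,
-- you would have to implement them yourself as an extended stored proc DLL
create table doc_store
(doc_id int not null primary key,
 doc_gz varbinary(8000) null)   -- compressed bytes; use image for larger values

declare @plain varchar(8000), @packed varbinary(8000)
set @plain = 'highly repetitive text ...'

-- compress on the way in
exec master..xp_compress @plain, @packed output    -- hypothetical
insert doc_store (doc_id, doc_gz) values (1, @packed)

-- locate the row by key, then decompress only that row on the way out
select @packed = doc_gz from doc_store where doc_id = 1
exec master..xp_decompress @packed, @plain output  -- hypothetical
select @plain as doc_text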

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)


Jul 20 '05 #1
Based on a co-worker's experience with similar compression algorithms,
as long as the compressed data isn't being used as an index directly,
the performance slowdown would be more than made up for by shifting the
bulk of the I/O from disk reads/writes to cached pages in memory. The
20-50X reduction in data size really makes a difference (5X to >1000X
speed improvement depending on the caching scheme).

It's both a performance and a disk storage issue. Although a single
database is ~350MBytes, I might want to have dozens of them on a
single machine... not everything is a hardware problem :-).

It sounds like there is no native or add-on support for compressing
columns in SQL Server 2000. There is such support in the OLAP
cubes when stored in their native cube format (I was reading the docs
yesterday, so I'll look into that too). Most of the other DBMSs out
there have native compression support on relational tables or are
adding it, so I can hope.

--
Frederick Staats
frederick dot w dot staats at intel dot com (I hate junk mail :-)
"Simon Hayes" <sq*@hayes.ch> wrote in message
news:3f**********@news.bluewin.ch...
I'm not sure that I understand your question - are you saying that you'd
like to store the compressed text in a column, and dynamically
compress/decompress on INSERT/SELECT? If so, I don't think this is such a
good approach - disk space is very cheap, and a 350MB database is very
small. In general, the overhead of compression/decompression would
outweigh any benefits of a smaller database. In addition, you wouldn't be
able to index or search the column efficiently, although you may not need
to if you always access it via a key and never use the column in a WHERE
clause.

But since it's not clear from your post what the real issue is that you're
trying to solve, it's possible you may have a good reason for trying to do
this. There is no built-in functionality in MSSQL to do the
compression/decompression, so an extended proc is the obvious server-side
solution; otherwise, handle it in the client application.

Simon

Jul 20 '05 #2
Just out of curiosity, which other DBMSs do you know of that support
compression?

Jul 20 '05 #3
MySQL has compressed tables (read-only) and is adding COMPRESS()
and DECOMPRESS() UDFs in version 5.0.

MS does compression on its native OLAP cubes.

Oracle 9i can do data compression on Enterprise Edition relational tables
and OLAP cubes.

Data compression seems to be such a basic staple when working with
large data of any kind that I was surprised MS did not have it as a
standard feature in the current release.

Jul 20 '05 #4
Just a little more detail if anyone cares:

On Oracle you can look at the following white paper on index-organized
tables (the feature is called "key compression"):
http://otn.oracle.com/products/oracl...ts/iot_ds.html

On MySQL, see http://www.mysql.com/doc/en/index.html and look for
the keyword "compress". You will have to look in the prerelease
documentation for the new UDF functions to come.

Jul 20 '05 #5
Thanks for the pointers. I looked at both.

The Oracle feature seems to be a key structure optimization only.

The MySQL readonly compressed table would be OK for situations that are,
well, readonly.

You could probably get away with setting up a Microsoft SQL Server database
that was "readonly" on a compressed volume, too. I do note that MS SQL
Server does NOT support databases on compressed volumes but, according to
the KB article I read, it has to do with sector-aligned writes on the log
device and a "readonly" database wouldn't get into trouble that way, so I
imagine it would work.

I couldn't find the prerelease documentation for the MySQL COMPRESS() and
DECOMPRESS() functions, but I'm guessing they would be used this way:

insert mytable (mykey, mycomprsdcol) values (1, compress('data to store in
compressed form'))

and then

select * from mytable where DECOMPRESS(mycomprsdcol) like '%store%'

... in other words, lots of CPU overhead, a tendency to force scans, and
programmer awareness of the compressed structure? You'd have to save a LOT
of space and only do comparisons on the compressed columns very rarely to
make this pay off.
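
One possible way around the forced scans, sketched in the same guessed
syntax (the function names here just follow this thread and may not match
whatever MySQL finally ships): keep the key and any searchable columns
uncompressed, and only decompress rows that an index has already located.

create table mytable
(mykey int not null primary key,
 title varchar(100),        -- left uncompressed so it stays searchable
 mycomprsdcol blob,         -- compressed body
 index (title))

insert into mytable (mykey, title, mycomprsdcol)
values (1, 'store layout notes', compress('data to store in compressed form'))

-- index seek on title; decompression only runs for the matching rows
select mykey, decompress(mycomprsdcol) as body
from mytable
where title like 'store%'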

I suppose one could write one's own COMPRESS() and DECOMPRESS() functions in
MS SQL Server and have functionality equivalent to MySQL, along with the
equivalent tradeoffs.

I also looked at the DB2 site... Compression there appears to take advantage
of special hardware available on the IBM mainframe systems. I'm impressed
with that. However, if you do compression on a system that does not have
the special hardware, your CPU will take the hit. I suppose that would
include all Intel-based servers (and maybe this feature is only supported at
all on MVS DB2s).

Although it's available in some DBMS systems, it's different in each one and
seems to be of limited utility in some and completely untransparent in
others; I'd hesitate to call it a "basic staple" at this point.

If I were to request this feature from MS, I think I'd ask for it in terms
of "support database devices on compressed volumes." That way, you could
establish the highly compressible part of your database on a filegroup
that's compressed and the parts of your database where you don't think
compression will work well for you could be on uncompressed volumes. Since
NTFS compression already exists, I'd also think this was the fastest way to
get the feature to market. In fact, explicitly limited support, e.g., "no
log files on compressed devices" would be reasonable.
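
A rough sketch of how that would sit on top of the existing filegroup
mechanism - purely illustrative, since (as noted above) SQL Server 2000 does
not actually support data files on compressed volumes, and E:\compressed\ is
just a hypothetical NTFS-compressed folder:

alter database mydb add filegroup compressed_fg

alter database mydb add file
(name = 'mydb_compressed',
 filename = 'E:\compressed\mydb_compressed.ndf',
 size = 100MB)
to filegroup compressed_fg

-- only the highly compressible table (and its text data) goes on that filegroup
create table big_text
(textkey int not null primary key,
 body text)
on compressed_fg textimage_on compressed_fg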

Ideally, I guess, I'd want to be able to declare particular columns to be
compressed, specifying compression as an attribute:

create table foo
(fookey int not null,
footext varchar(100) compressed)

and have handling of the compressed data be transparent to the programmer.
It would still be an interesting performance tradeoff problem for the DBA,
of course.

Jul 20 '05 #6
My comments are embedded in your message below:

"DHatheway" <dl********@mmm.com.nospam> wrote in message
news:bg**********@tuvok3.mmm.com...
> Thanks for the pointers. I looked at both.
>
> The Oracle feature seems to be a key structure optimization only.

From the Oracle white paper:

"Reduced Storage
Index-organized tables maintain a single storage structure -- the B-tree
index. Primary key column values are stored only in the B-tree index and not
duplicated in the table and index as happens in a conventional heap-organized
table. Because rows of an index-organized table are stored in primary key
order, a significant amount of additional storage space savings can be
obtained through the use of key compression."

It clearly reads to me that the primary key is not duplicated and can be
compressed, yet it remains in sorted order within the B-tree, making
clustered access to ranges of similar values quicker due to locality and
caching.

> The MySQL readonly compressed table would be OK for situations that are,
> well, readonly.

Yes... it forces you to a more complicated solution (e.g. a two-table lookup
and periodic offline updates to the compressed table). This is why the new
5.0 (pre-alpha) COMPRESS() and DECOMPRESS() are in some ways more attractive
to me.

> You could probably get away with setting up a Microsoft SQL Server database
> that was "readonly" on a compressed volume, too. I do note that MS SQL
> Server does NOT support databases on compressed volumes but, according to
> the KB article I read, it has to do with sector-aligned writes on the log
> device and a "readonly" database wouldn't get into trouble that way, so I
> imagine it would work.

Don't do that! Volume compression breaks the smart caching patterns of any
database... if it works at all, it will be unusable. MS strongly advises
against this approach. Try it and you will see :-).

> I couldn't find the prerelease documentation for the MySQL COMPRESS() and
> DECOMPRESS() functions but I'm guessing they would be used this way:

The docs for 5.0 are buried in the 5.0 pre-release notes, which may be why
you couldn't find them.

> insert mytable (mykey, mycomprsdcol) values (1, compress('data to store in
> compressed form'))
>
> and then
>
> select * from mytable where DECOMPRESS(mycomprsdcol) like '%store%'
>
> ... in other words, lots of CPU overhead, a tendency to force scans and
> programmer awareness of the compressed structure? You'd have to save a LOT
> of space and only do comparisons on the compressed columns very rarely to
> make this pay off.

I typically get 8 to 1 compression (many, many, many text fields of 100+
characters that differ by only a few bits). This can easily take my working
set from thrashing the disks to fitting comfortably in the server's RAM
cache.

I think the "lots of CPU overhead" concern is ungrounded. It really is a
tradeoff between disk I/O and CPU. On many access patterns I'm almost always
waiting for the disk, and in that case it is a net win to reduce disk I/O by
compressing the data to be read or written. Last time I looked, my disks keep
getting bigger about 10X more quickly than they get faster... and the CPU is
getting faster than my disks over time. If all else fails, I can get a box
with more CPUs for less money than it takes to buy a substantially faster
disk subsystem. (Just because big cheap disks are available, it doesn't mean
that a motherboard with fast I/O, a fast SCSI RAID controller, and small but
fast 15K RPM SCSI disks is.)

There are many different compression algorithms out there. Some are much
quicker than others. Some are also very asymmetric and can decompress much
faster than they compress... which works very well for "slowly" changing
tables. I have a researcher friend who has found that hash look-ups on a
"compressed" hash table are about 3X slower... that didn't help him in his
all-in-memory application, but it would help me a lot in my disk-bound
database.

> I suppose one could write one's own COMPRESS() and DECOMPRESS() functions
> in MS SQL Server and have functionality equivalent to MySQL, along with the
> equivalent tradeoffs.
>
> I also looked at the DB2 site... Compression there appears to take advantage
> of special hardware available on the IBM mainframe systems. I'm impressed
> with that. However, if you do compression on a system that does not have
> the special hardware, your CPU will take the hit. I suppose that would
> include all Intel-based servers (and maybe this feature is only supported at
> all on MVS DB2s).

On a networked SQL Server what else is the CPU there for? I'm not running
any fancy screen savers on mine :-).

> Although it's available in some DBMS systems, it's different in each one and
> seems to be of limited utility in some and completely untransparent in
> others; I'd hesitate to call it a "basic staple" at this point.
>
> If I were to request this feature from MS, I think I'd ask for it in terms
> of "support database devices on compressed volumes." That way, you could
> establish the highly compressible part of your database on a filegroup
> that's compressed and the parts of your database where you don't think
> compression will work well for you could be on uncompressed volumes. Since
> NTFS compression already exists, I'd also think this was the fastest way to
> get the feature to market. In fact, explicitly limited support, e.g., "no
> log files on compressed devices" would be reasonable.

> Ideally, I guess, I'd want to be able to declare particular columns to be
> compressed, specifying compression as an attribute:
>
> create table foo
> (fookey int not null,
> footext varchar(100) compressed)
>
> and have handling of the compressed data be transparent to the programmer.
> It would still be an interesting performance tradeoff problem for the DBA,
> of course.


I too would prefer a transparent solution, which was why I asked in the
newsgroup in the first place.

--
frederick dot w dot staats at intel dot com
Jul 20 '05 #7
