By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
440,016 Members | 2,265 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 440,016 IT Pros & Developers. It's quick & easy.

How can mySQL only use 1 byte for storage?

P: n/a
>From the manual "Storage Requirements":

"ENUM('value1','value2',...) =1 or 2 bytes, depending on the number
of enumeration values (65,535 values maximum)"

This seems to mean:

"a" = 1 byte of storage
"prestidigitation" = 1 byte of storage

and

enum("a","b","c","d") = 4 bytes of storage
enum("prestidigitation","sleight of hand","disappearances","card
tricks") = 4 bytes of storage

How is this possible? I'm trying to get my head around WHY these values
in an ENUM-type column occupy the same amount (1 byte or 4 bytes) of
storage space (<256 enumeration values).

Am I misreading the manual? Is it just referring to the index, and if
so, what accounts for the string storage? If not, doesn't the index
take up 1 byte per value, itself? How can these strings occupy the same
amount of storage? Is there some sort of serialization going on inside
a MyISAM table?

Thanks in advance for any illumination.

Aug 14 '06 #1
Share this Question
Share on Google+
7 Replies


P: n/a
StupidScript wrote:
enum("a","b","c","d") = 4 bytes of storage
No, that would only take 1 byte of storage. For example:
"a" = 0000 in binary,
"b" = 0001,
"c" = 0010,
"d" = 0011

Actually only 2 bits are used to record which of four values are used.
But it's easier to round up to whole bytes, so even though only 2 bits
are being used, the implementation "generously" allocates 1 byte.

Somewhere else, the mapping from binary values to the four strings is
stored, but it is stored only once. No need to store it for every row
that has an ENUM column. When you query the column, it looks up the
string corresponding to the binary pattern, and returns that string.
enum("prestidigitation","sleight of hand","disappearances","card
tricks") = 4 bytes of storage
That does require more storage for the mapping between binary values and
the strings, but only once. It requires only 2 bits (or round up to 1
whole byte) per row.

Regards,
Bill K.
Aug 14 '06 #2

P: n/a
>>From the manual "Storage Requirements":

This is the storage requirements *FOR A ROW OF DATA*. The table
definition is not included in each row of data. Strings are not
included in each row of data because they do not appear there.

It is my observation that table definitions take between 8.5K and
10.5K for the tables I've got, which don't go overboard with strings
in enums. Table definitions don't change size with the number of
rows stored.
>"ENUM('value1','value2',...) =1 or 2 bytes, depending on the number
of enumeration values (65,535 values maximum)"

This seems to mean:

"a" = 1 byte of storage
"prestidigitation" = 1 byte of storage

and

enum("a","b","c","d") = 4 bytes of storage
enum("prestidigitation","sleight of hand","disappearances","card
tricks") = 4 bytes of storage

How is this possible?
It's easy. Number the possible values. Store the index in the record.
MySQL has to do some translation on the way in and out, a tradeoff
of a little CPU time for reduced disk storage.

>I'm trying to get my head around WHY these values
in an ENUM-type column occupy the same amount (1 byte or 4 bytes) of
storage space (<256 enumeration values).
How come "Sue Li" and "Alexander Brownschweiger
Supercalifragilisticexpialidocious" both have 9-digit Social Security
numbers?
>Am I misreading the manual? Is it just referring to the index, and if
so, what accounts for the string storage?
As I understand it, it's referring to a data row (*.MYD for MyISAM).
It does *NOT* include the index (*.MYI for MyISAM), if the enum field
is in an index. And the table definition (*.frm for MyISAM) also isn't
included. If you really need to sweat over the disk storage used
by a table definition, you have much more serious problems to worry
about (like logs).

Do you really care about the length of the strings, which are stored
once *per table*, in a table with a few hundred million rows where the
data field is stored *per row*?
>If not, doesn't the index
take up 1 byte per value, itself?
No. Values not indexed don't take any storage at all and values that
are indexed certainly take more than one byte per value even if that
value is only one byte.
>How can these strings occupy the same
amount of storage?
The strings aren't stored in each row.
>Is there some sort of serialization going on inside
a MyISAM table?

Thanks in advance for any illumination.

Aug 14 '06 #3

P: n/a
Bill:
>Somewhere else, the mapping from binary values
to the four strings is stored, but it is stored
only once."
But where is it stored only once? And why doesn't that data retention
add to the storage requirements for a set of variable string data, like
an ENUM row?

Gordon:
>It's easy. Number the possible values. Store the
index in the record. MySQL has to do some translation
on the way in and out, a tradeoff of a little CPU
time for reduced disk storage.
Translation ... from the index mySQL stores to the string value ...
which is stored ... where? And how is it reduced, if there needs to be
an additional 1-2 byte associative element generated and stored for
each element in the ENUM array?
>How come "Sue Li" and "Alexander Brownschweiger
Supercalifragilisticexpialidocious" both have 9-digit
Social Security numbers?
Unless every string is reduced to some predictable-length value, such
as that used to count bytes-stored, as we are discussing, how could a
simple binary string of 1-2 byte length be capable of associating a
predictable 9-digit code (SSN) with a variable User Name? I mean, the
second name you used in your example is 22 characters! I can imagine
much longer strings. Doesn't that need to be stored, somewhere?

Somewhere the name is associated with the number, and the length of
that name's characters would seem to take up additional storage space
.... somewhere. Maybe there's serialization occuring, in order to
standardize the string length. I dunno. But there's more than 1-2 bytes
involved, isn't there?

I'm delighted with the avenues these comments are making into my brain,
and thank you both very much. I'm just wondering where my understanding
of 'storage' should break with the laws of physics?

Seriously, though, thanks. Good stuff! :)

Aug 15 '06 #4

P: n/a
Okay, how about this? (No C.S. degree ... just guessing ...)

1) mySQL does, indeed, serialize both the individual elements of the
ENUM row (!) and the aggregated total of those values and then adds an
index for itself to minimize its own seek-deliver activity ... dunno
exactly how, but that's my working theory ... all taking up 1-2 bytes
of total storage space per ENUM row (1 value).

2) Therefore, the ENUM-type column is better-suited for short,
relatively predictable types of values, like "one", "two", "three" and
"true", "false", and not-so-good for pages of text or image data, etc.

Here's why:

Jimmy Crunch gets it into his head that he should store EVERY value he
needs in ENUM-type columns because, after all, it only uses 1-2 bytes
for the whole row! (1 value) The problem is, serializing gets lossy ...
it can't quite reconstruct the long-ish strings right, so it fudges a
little. (Y'all know what I mean ...)

So Mr. Crunch decides to use ENUM columns for short, predictable string
values, and BLOB and TEXT and VARCHAR and, et al. for the other
datatypes, because those types of storage simply do the job better, for
his data storage needs.

I know it seems like I'm hooked on serial, but whaddya think? Something
like that?

Regards,

James

Aug 15 '06 #5

P: n/a
StupidScript wrote:
Bill:
>Somewhere else, the mapping from binary values
to the four strings is stored, but it is stored
only once."

But where is it stored only once? And why doesn't that data retention
add to the storage requirements for a set of variable string data, like
an ENUM row?

Somewhere the name is associated with the number, and the length of
that name's characters would seem to take up additional storage space
... somewhere.
They're stored as part of the table definition. Just as when you store
a column with type VARCHAR(200), the table definition stores the value
200 as the maximum length of that column. An ENUM column must store all
the possible values (strings).

In the case of MySQL, it goes in the .FRM file for a given table, which
is where it stores all information about the columns for the respective
table. Try it:

CREATE TABLE test.test_enum (
col1 ENUM("prestidigitation", "sleight of hand",
"disappearances","card tricks")
);
EXIT;

Now go look at "<datadir>/test/test_enum.FRM":

$ cd <datadir>/test
$ od -c test_enum.FRM
.. . .
0020520 004 \0 005 c o l 1 \0 004 005 020 020 \0 002 \0 \0
0020540 \b 201 020 \0 001 367 \b \0 \0 377 c o l 1 377 \0
0020560 377 p r e s t i d i g i t a t i o
0020600 n 377 s l e i g h t o f h a n
0020620 d 377 d i s a p p e a r a n c e s
0020640 377 c a r d t r i c k s 377 \0
.. . .

"od" is the UNIX command for "octal dump", or output a binary file with
unprintable characters represented by their base-eight value.

Notice how the values are separated with a byte with octal value 377.
This is apparently how MySQL knows where each enum value terminates.
Then at the end of all these values, there's an 0 byte value. I assume
that MySQL uses the ordering of these strings to determine the value to
store in rows of data.

I don't know the purpose of all those other bytes, but one can assume
they say other things about the column, like the fact that it is an
ENUM, the position of the column within the table, whether it is NOT
NULL or not, etc.
I'm delighted with the avenues these comments are making into my brain,
and thank you both very much. I'm just wondering where my understanding
of 'storage' should break with the laws of physics?
I think you are making this harder than it needs to be! :-)

A similar technique is used by GIF image files. You can have 256
distinct colors in the palette in a GIF file, but you can choose any
24-bit color for each of those palette entries. The palette is stored
near the beginning of the file, before the image data, that maps values
in the image data to the 24-bit RGB values. Therefore a GIF image
stores 1 byte per pixel, even though the color is a 3-byte RGB value.
There's a small amount of overhead (256 x 3 bytes) near the beginning of
the file. Any software that reads GIF images knows to use that palette
data as an index for the image data, and therefore knows how to display
the right color.

Regards,
Bill K.
Aug 15 '06 #6

P: n/a
StupidScript wrote:
Jimmy Crunch gets it into his head that he should store EVERY value
he needs in ENUM-type columns because, after all, it only uses 1-2
bytes for the whole row!
I wouldn't use ENUM for storing a large number of distinct strings.
You'd be likely to make the .FRM file very large, and that could have
its own scalability problems. The strings should never be lossy, but
they are probably managed with linear search, instead of hashing or
indexing.

Also, any time you add a new value would constitute a schema change. In
MySQL, this causes the table to be _restructured_ (making a copy of the
whole table and then dropping the original). Changing the values of an
ENUM shouldn't require this, but MySQL isn't very smart about optimizing
schema changes. There are many cases which shouldn't require a table
restructure, but MySQL does one anyway.

An ENUM should be used for a column that has a small number of distinct
values, which change very seldom (preferably never). For example:
yes/no, male/female, red/yellow/green (states of a traffic light),
january through december (month names), etc. I wouldn't use ENUM for an
arbitrary text column.

In fact, I don't think I've ever actually used an ENUM in a real
project. It's not worth the trouble.

Anything that ENUM would be used for, usually is better implemented as a
foreign key referencing a lookup table. Then it's very easy to add a
new value to the set, query the set of permitted values, add attributes
or descriptions to each value, etc.

Being overly meticulous with storage efficiency is not worth the effort.
Disk drives are incredibly cheap these days -- on the order of pennies
per gigabyte. You're better off spending your effort on designing a
simple physical data model, and implementing it in a way that makes it
easy to work on.

Regards,
Bill K.
Aug 15 '06 #7

P: n/a
Many, many thanks, Bill.

I believe I have a handle on how ENUM fields operate, and why they only
take 1-2 bytes in the table's DATA storage. The table definitions
storage requirement (which would be of a variable amount depending on
the ENUM's definition values) is insignificant.

I also appreciate the insight into using alternative field types for
efficiency, scalability and portability.

Great answers! Thanks.

James

Aug 15 '06 #8

This discussion thread is closed

Replies have been disabled for this discussion.