
What character set should I use in my case?

I have a column called blogs, of type TEXT, with a FULLTEXT
index.

Most entries will be in English, but a few might be in any other
language.

If I make it UTF8, it seems a waste of space to use 3 bytes for
each character, since most entries (say 99%) are in English.

What should my approach be here, to allow multiple languages
in the column but also save space?

Any workaround?

Mike

Jul 23 '05 #1
3 Replies


siliconmike (si*********@yahoo.com) wrote:
: Somewhere there is a column called blogs - type TEXT with a FULLTEXT
: index.

: Most entries would be in English, but few might be in any other
: language.

: Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
: each character since most entries, say 99% are in English.

But UTF-8 doesn't require 3 bytes per character; it requires a varying
number of bytes depending on the characters in the data.

I don't see anything in the MySQL docs to suggest that the TEXT datatype
allocates three bytes per character for UTF-8 data. I would assume that
MySQL's column allocation routines vary the bytes allocated in
pretty much the same manner as with single-byte character sets,
except that some characters will take more than one byte.

But I haven't had to worry about this in MySQL, so perhaps its UTF-8
handling doesn't work that way.
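
To illustrate the variable-width point, here is a minimal Python sketch that counts the bytes UTF-8 needs for a few characters. The sample characters and the helper name utf8_width are my own choices for illustration; this shows the encoding itself, not how MySQL lays out a TEXT column on disk.

```python
# Sketch: how many bytes UTF-8 needs for individual characters.
# The sample characters are illustrative, not taken from any real data.

def utf8_width(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

# ASCII letter, accented Latin letter (U+00E9), CJK ideograph (U+4E2D)
samples = ["a", "\u00e9", "\u4e2d"]
widths = {ch: utf8_width(ch) for ch in samples}

for ch, width in widths.items():
    print(f"{ch!r}: {width} byte(s)")

# Pure ASCII text costs exactly one byte per character in UTF-8.
english = "Most entries would be in English"
english_bytes = len(english.encode("utf-8"))
print(len(english), "characters,", english_bytes, "bytes")
```

So for text that is almost entirely English, the encoded size should be close to one byte per character.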
--

This space not for rent.
Jul 23 '05 #2



Malcolm Dew-Jones wrote:
> siliconmike (si*********@yahoo.com) wrote:
> : Somewhere there is a column called blogs - type TEXT with a FULLTEXT
> : index.
>
> : Most entries would be in English, but few might be in any other
> : language.
>
> : Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
> : each character since most entries, say 99% are in English.
>
> But utf-8 doesn't require 3 bytes per character, it requires a varying
> number of bytes depending on the characters in the data.
>
> I don't see anything in the mysql docs to suggest that the TEXT datatype
> allocates three bytes per character for utf-8 data. I would assume that
> the column allocation routines of mysql will vary the bytes allocated in
> pretty much the same manner as when using single byte character sets,
> except that some characters will take more than one byte.
>
> But I haven't had to worry about this in mysql, so perhaps the mysql utf-8
> handling doesn't work that way.


See this:
http://dev.mysql.com/doc/mysql/en/data-size.html

At one point it says that UTF-8 uses a fixed allocation of 3 bytes per
character (though I guess that applies only to the
ROW_FORMAT=REDUNDANT setting).

It would be nice to check, though.

Any ideas on how to determine a record's size in bytes?

Mike

Jul 23 '05 #3

P: n/a
siliconmike (si*********@yahoo.com) wrote:
: Malcolm Dew-Jones wrote:
: > siliconmike (si*********@yahoo.com) wrote:
: > : Somewhere there is a column called blogs - type TEXT with a FULLTEXT
: > : index.
: >
: > : Most entries would be in English, but few might be in any other
: > : language.
: >
: > : Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
: > : each character since most entries, say 99% are in English.
: >
: > But utf-8 doesn't require 3 bytes per character, it requires a varying
: > number of bytes depending on the characters in the data.
: >
: > I don't see anything in the mysql docs to suggest that the TEXT datatype
: > allocates three bytes per character for utf-8 data. I would assume that
: > the column allocation routines of mysql will vary the bytes allocated in
: > pretty much the same manner as when using single byte character sets,
: > except that some characters will take more than one byte.
: >
: > But I haven't had to worry about this in mysql, so perhaps the mysql utf-8
: > handling doesn't work that way.
: >

: See this:
: http://dev.mysql.com/doc/mysql/en/data-size.html

: At a point it says that UTF-8 uses 3 bytes fixed allocation. (But I
: guess it is only for ROW_FORMAT=REDUNDANT setting)

The page you reference discusses fixed-length fields.

At one point it says:

    Since many languages can be written mostly with single-byte UTF-8
    characters, a fixed storage length often wastes space. The
    ROW_FORMAT=COMPACT format allocates a variable amount

So the point of that section is how to use ROW_FORMAT=COMPACT to reduce
wasted space when using fixed-length fields with UTF-8 data.

As I mentioned, that page discusses fixed-length fields (e.g. CHAR), as
opposed to variable-length fields (e.g. VARCHAR). It seems to me that
this still implies that a variable-length field will use a variable number
of bytes, and that the number of bytes for UTF-8 data will still be the
minimum required, give or take the usual overhead that varying-length
data has.

: It would be nice to check though.
: Any ideas how to determine a record size in bytes ?

No, I don't know how to check individual fields in records to confirm how
they are stored.
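
One possible way to approximate the check, offered as a sketch rather than a statement about MySQL's on-disk format: MySQL's LENGTH() returns a string's length in bytes and CHAR_LENGTH() returns its length in characters, so comparing the two for the blogs column should show how many bytes each value's text occupies. The Python below mimics that comparison offline. The rows are hypothetical sample data, the helper char_and_byte_length is my own, and the "fixed reservation" figure just models the 3-bytes-per-character worry from the original question. It does not query MySQL and ignores per-row storage overhead.

```python
# Sketch: compare actual UTF-8 size with a fixed 3-bytes-per-character
# reservation. The rows below are hypothetical sample data.

def char_and_byte_length(text):
    """Return (characters, UTF-8 bytes), analogous to CHAR_LENGTH and LENGTH."""
    return len(text), len(text.encode("utf-8"))

rows = [
    "Plain English entry",   # ASCII only
    "Caf\u00e9 review",       # contains one 2-byte character
    "\u65e5\u672c\u8a9e",     # three 3-byte characters
]

total_chars = 0
total_bytes = 0
for row in rows:
    chars, nbytes = char_and_byte_length(row)
    total_chars += chars
    total_bytes += nbytes
    print(f"{chars:3d} chars, {nbytes:3d} bytes")

# What a fixed 3-bytes-per-character allocation would reserve instead.
fixed_reservation = 3 * total_chars
print("variable-width total:", total_bytes)
print("fixed 3-byte total:  ", fixed_reservation)
```

All sample characters are within the range that needs at most 3 bytes in UTF-8, which matches the 3-byte maximum discussed in this thread.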
--

This space not for rent.
Jul 23 '05 #4
