Bytes | Software Development & Data Engineering Community
What charset to use in my case?

Somewhere there is a column called blogs - type TEXT with a FULLTEXT
index.

Most entries would be in English, but a few might be in other
languages.

Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
each character, since most entries, say 99%, are in English.

What should be my approach here - to somehow enable multiple languages
in the column but also save space?

Any workaround?

Mike

Jul 23 '05 #1
3 replies, 1547 views
siliconmike (si*********@yahoo.com) wrote:
: Somewhere there is a column called blogs - type TEXT with a FULLTEXT
: index.

: Most entries would be in English, but few might be in any other
: language.

: Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
: each character since most entries, say 99% are in English.

But utf-8 doesn't require 3 bytes per character; it requires a varying
number of bytes depending on the characters in the data.

I don't see anything in the mysql docs to suggest that the TEXT datatype
allocates three bytes per character for utf-8 data. I would assume that
mysql's column allocation routines vary the bytes allocated in
much the same manner as with single-byte character sets,
except that some characters will take more than one byte.

But I haven't had to worry about this in mysql, so perhaps the mysql utf-8
handling doesn't work that way.
--

This space not for rent.
Jul 23 '05 #2


Malcolm Dew-Jones wrote:
> siliconmike (si*********@yahoo.com) wrote:
> : Somewhere there is a column called blogs - type TEXT with a FULLTEXT
> : index.
>
> : Most entries would be in English, but few might be in any other
> : language.
>
> : Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
> : each character since most entries, say 99% are in English.
>
> But utf-8 doesn't require 3 bytes per character, it requires a varying
> number of bytes depending on the characters in the data.
>
> I don't see anything in the mysql docs to suggest that the TEXT datatype
> allocates three bytes per character for utf-8 data. I would assume that
> the column allocation routines of mysql will vary the bytes allocated in
> pretty much the same manner as when using single byte character sets,
> except that some characters will take more than one byte.
>
> But I haven't had to worry about this in mysql, so perhaps the mysql utf-8
> handling doesn't work that way.


See this:
http://dev.mysql.com/doc/mysql/en/data-size.html

At one point it says that UTF-8 uses a fixed 3-byte allocation per
character. (But I guess that applies only to the ROW_FORMAT=REDUNDANT
setting.)

It would be nice to check, though.

Any ideas how to determine a record's size in bytes?
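One way to compare sizes: MySQL's LENGTH() function returns a string's length in bytes, while CHAR_LENGTH() returns it in characters, so SELECTing both on a utf8 column shows how many extra bytes the non-ASCII characters cost. The same distinction, sketched in Python (sample text is hypothetical):

```python
def byte_length(s: str) -> int:
    """Bytes needed to store s as UTF-8 (what MySQL's LENGTH() reports)."""
    return len(s.encode("utf-8"))

def char_length(s: str) -> int:
    """Number of characters (what MySQL's CHAR_LENGTH() reports)."""
    return len(s)

entry = "mostly English with a little français"
# Only the 'ç' needs a second byte, so the two counts differ by one.
print(byte_length(entry), char_length(entry))
```

If the two counts match for an English row, the column is storing one byte per character and nothing is being wasted on that row.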

Mike

Jul 23 '05 #3
siliconmike (si*********@yahoo.com) wrote:
: Malcolm Dew-Jones wrote:
: > siliconmike (si*********@yahoo.com) wrote:
: > : Somewhere there is a column called blogs - type TEXT with a FULLTEXT
: > : index.
: >
: > : Most entries would be in English, but few might be in any other
: > : language.
: >
: > : Now, if I make it UTF8, it will be a waste of space to use 3 bytes for
: > : each character since most entries, say 99% are in English.
: >
: > But utf-8 doesn't require 3 bytes per character, it requires a varying
: > number of bytes depending on the characters in the data.
: >
: > I don't see anything in the mysql docs to suggest that the TEXT datatype
: > allocates three bytes per character for utf-8 data. I would assume that
: > the column allocation routines of mysql will vary the bytes allocated in
: > pretty much the same manner as when using single byte character sets,
: > except that some characters will take more than one byte.
: >
: > But I haven't had to worry about this in mysql, so perhaps the mysql utf-8
: > handling doesn't work that way.
: >

: See this:
: http://dev.mysql.com/doc/mysql/en/data-size.html

: At a point it says that UTF-8 uses 3 bytes fixed allocation. (But I
: guess it is only for ROW_FORMAT=REDUNDANT setting)

The page you reference discusses fixed-length fields.

At one point it says

Since many languages can be written mostly with single-byte UTF-8
characters, a fixed storage length often wastes space. The
ROW_FORMAT=COMPACT format allocates a variable amount

so the point of that section is how to use ROW_FORMAT=COMPACT to reduce
wasted space when using fixed length fields with utf-8 data.

As I mentioned, that page discusses fixed-length fields (e.g. CHAR), as
opposed to variable-length fields (e.g. VARCHAR). It seems to me that
this still implies a variable-length field will use a variable number
of bytes, and the number of bytes for utf-8 data will still be the minimum
required, give or take the usual overhead that variable-length data
involves.
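If that reading is right, the overhead for a mostly-English corpus is tiny, since only the non-ASCII characters cost extra bytes. A rough estimate, sketched in Python under the assumption that storage is just the raw UTF-8 byte length:

```python
def utf8_overhead(text: str) -> float:
    """Extra bytes per character vs. a single-byte charset, as a fraction."""
    byte_len = len(text.encode("utf-8"))
    return byte_len / len(text) - 1.0

# Hypothetical corpus: 99 ASCII characters plus one 2-byte character,
# matching the "99% English" case from the original question.
ascii_part = "a" * 99
mixed_text = ascii_part + "é"
print(f"{utf8_overhead(mixed_text):.0%} overhead")  # 1% overhead
```

In other words, a column that is 99% ASCII stored as variable-length UTF-8 costs about 1% more than a single-byte charset, not 200% more.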

: It would be nice to check though.
: Any ideas how to determine a record size in bytes ?

No, I don't know how to check individual fields in records to confirm how
they are stored.
--

This space not for rent.
Jul 23 '05 #4

This discussion thread is closed

Replies have been disabled for this discussion.

