Calculate the length of text in bytes

Hi,

I need to calculate the physical length of text in a text
input. The term "physical" means in this context, that I
consider 7bit-Ascii as one-byte-per character. Other characters
may be longer, e.g. cyrillic would be 2 bytes per character.

Is there a safe and easy way to notice non-7bit-Ascii input?

Cheers,

M.

--
Interested in sophisticated fun? You, hubby, girl friends. The more the
merrier. Get in touch with Kirby, through WASTE only, Box 7391, L.A.
-=-=- -=-=-=-=- --=-=-
Martin Dietze -=-=- http://www.the-little-red-haired-girl.org
Mar 29 '06 #1
Martin Herbert Dietze wrote:
> Hi,
>
> I need to calculate the physical length of text in a text
> input. The term "physical" means in this context, that I
> consider 7bit-Ascii as one-byte-per character. Other characters
> may be longer, e.g. cyrillic would be 2 bytes per character.
>
> Is there a safe and easy way to notice non-7bit-Ascii input?


There are two questions here. The first is the number of unicode
characters (code points). The second is whether there are any unicode
characters requiring more than 7 bits to represent.

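function charCount(str) {
  // counts the number of Unicode characters, which may differ from
  // str.length
  var esc = escape(str); // escape more robust than encodeURI
  return esc.replace(/%uD[C-F]../g,'')  // drop low surrogates so a pair counts once
            .replace(/%u..../g,'"')      // each remaining %uXXXX becomes one character
            .replace(/%../g,"'")         // each %XX becomes one character
            .length;
}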

function SevenBitsOrLess(str) {
  // returns false if str includes a Unicode character requiring 8 or
  // more bits
  var esc = escape(str);
  if (esc.match(/%u/)) return false;  // %uXXXX: code unit above 0xFF
  return (!esc.match(/%[89A-F]/));    // %80 to %FF: 8-bit character
}

// Example
var str0 = "ab$1_^%25$.%D8%B7 %EA%96%A4,%F3%87%BB%81";
var str = decodeURI(str0);
alert(charCount(str));

Note that in the example,
str0.length => 40
str.length => 15
alert (charCount(str)) => 14
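
In engines that support ES2015 or later, the same count can be obtained without the deprecated escape() function, because iterating a string walks code points rather than UTF-16 code units. A minimal sketch under that assumption (countCodePoints is a made-up name, not part of the code above):

```javascript
// Hypothetical helper: counts Unicode code points, not UTF-16 code units.
// Assumes an ES2015+ engine, where the string iterator yields one item
// per code point (a surrogate pair counts once).
function countCodePoints(str) {
  var count = 0;
  for (var ch of str) {
    count++;
  }
  return count;
}

var sample = decodeURI("ab$1_^%25$.%D8%B7 %EA%96%A4,%F3%87%BB%81");
console.log(sample.length);            // 15 UTF-16 code units
console.log(countCodePoints(sample));  // 14 code points
```

The last character of the sample lies outside the Basic Multilingual Plane, which is why the two numbers differ by one.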

Csaba Gabor from Vienna

Mar 29 '06 #2

Martin Herbert Dietze wrote:
> Hi,
>
> I need to calculate the physical length of text in a text
> input. The term "physical" means in this context, that I
> consider 7bit-Ascii as one-byte-per character. Other characters
> may be longer, e.g. cyrillic would be 2 bytes per character.
>
> Is there a safe and easy way to notice non-7bit-Ascii input?
>
> Cheers,
>
> M.


Hi

I assume by "text input" you mean the value of an <INPUT type="text">

I also assume that by "7 bit ASCII" you are not including extended ASCII
(160-255), which is still within one byte.

Note: when you extract the input "value" into a JavaScript String, the
String stores each character as a 16 bit value.

To spot non-ASCII characters:-

One option would be to loop through the String, using charCodeAt(), and
test its value.

An alternative is to use a RegExp thus:-

var a=String.fromCharCode(0x61);
var b=String.fromCharCode(0xFC);
var c=String.fromCharCode(0x2020);

var s=a+b+c+a+b+c;
alert(s);

var ASCII=/[\x00-\x7f]/gi;
var NON_ASCII=/[^\x00-\x7f]/gi;
alert("Non ASCII Count: "+s.replace(ASCII,"").length);
alert("ASCII Count: "+s.replace(NON_ASCII,"").length);
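
The charCodeAt() loop mentioned as the first option could look roughly like this. It is only a sketch; isAsciiOnly is a hypothetical name chosen for illustration:

```javascript
// Hypothetical helper: returns true if every UTF-16 code unit of str is in
// the 7-bit ASCII range (0x00 to 0x7F). Surrogate code units are all above
// 0x7F, so characters outside the BMP are reported as non-ASCII too.
function isAsciiOnly(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0x7F) {
      return false;
    }
  }
  return true;
}

console.log(isAsciiOnly("plain text"));     // true
console.log(isAsciiOnly("gr\xFC\xDFe"));    // false (U+00FC, U+00DF)
console.log(isAsciiOnly("dagger \u2020"));  // false (U+2020)
```

Unlike the replace() approach, the loop stops at the first non-ASCII character, which is enough when only a yes/no answer is needed.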

Regards

Julian Turner

Mar 29 '06 #3
Csaba Gabor wrote:
> Martin Herbert Dietze wrote:
>> I need to calculate the physical length of text in a text
>> input. The term "physical" means in this context, that I
>> consider 7bit-Ascii as one-byte-per character. Other characters
>> may be longer, e.g. cyrillic would be 2 bytes per character.
>>
>> Is there a safe and easy way to notice non-7bit-Ascii input?
>
> There are two questions here. The first is the number of unicode
> characters (code points).

A code point is not a Unicode character. It can represent one.

> The second is whether there are any unicode characters requiring
> more than 7 bits to represent.

More than 8. He considers 7-bit-ASCII as one byte per character.

> [...]

Your algorithm does not have anything to do with the OP's request.

PointedEars
Mar 29 '06 #4
Martin Herbert Dietze wrote:
> I need to calculate the physical length of text in a text
> input. The term "physical" means in this context, that I
> consider 7bit-Ascii as one-byte-per character.

OK

> Other characters may be longer, e.g. cyrillic would be 2 bytes per
> character.

You need to differentiate between the code point and the character encoding.
A Cyrillic character from the Universal Character Set/Unicode would require
two bytes if encoded with UTF-8 (two UTF-8 code units) or UTF-16 (one
UTF-16 code unit). It would require four bytes if encoded with UTF-32.

> Is there a safe and easy way to notice non-7bit-Ascii input?


Yes, but that is only a start when talking about calculating the "physical
length".
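
To make the difference between encodings concrete, here is a small sketch that reports the size of one string in UTF-8, UTF-16 and UTF-32. It assumes an environment that provides TextEncoder (current browsers, recent Node.js) and well-formed input; encodedSizes is a name invented for this example:

```javascript
// Hypothetical helper: size in bytes of the same text in three Unicode
// encodings. Assumes TextEncoder is available and str is well-formed.
function encodedSizes(str) {
  var codePoints = Array.from(str).length;       // one entry per code point
  return {
    utf8: new TextEncoder().encode(str).length,  // 1 to 4 bytes per code point
    utf16: str.length * 2,                       // 2 bytes per UTF-16 code unit
    utf32: codePoints * 4                        // always 4 bytes per code point
  };
}

var latinA = encodedSizes("a");        // utf8 1, utf16 2, utf32 4
var zhe = encodedSizes("\u0416");      // Cyrillic Zhe: utf8 2, utf16 2, utf32 4
console.log(latinA, zhe);
```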

<URL:http://unicode.org/faq/>
PointedEars
Mar 29 '06 #5
Julian Turner wrote:
> var a=String.fromCharCode(0x61);
> var b=String.fromCharCode(0xFC);
> var c=String.fromCharCode(0x2020);

There are "\x61", "\xFC", and "\u2020" for that.

> var s=a+b+c+a+b+c;
> alert(s);
>
> var ASCII=/[\x00-\x7f]/gi;
> var NON_ASCII=/[^\x00-\x7f]/gi;
> alert("Non ASCII Count: "+s.replace(ASCII,"").length);
> alert("ASCII Count: "+s.replace(NON_ASCII,"").length);


String values are encoded with UTF-16, and Unicode is supported, since
JavaScript 1.3, and ECMAScript Edition 3. This code will only provide
the number of characters in the string anyway.
PointedEars
Mar 29 '06 #6
Thomas 'PointedEars' Lahn <Po*********@web.de> wrote:
>> Is there a safe and easy way to notice non-7bit-Ascii input?
>
> Yes, but that is only a start when talking about calculating the "physical
> length".


Hmm, "physical length" was obviously not the best term to use.
My application needs to decide which encoding to use in order
to represent the text I get from an input field. If there are
any characters not representable as 7bit-Ascii I will have to
switch the whole buffer's encoding after having received the
data from the form. I need the JavaScript code to notify the
user of the space requirement of the text entered.
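
One way to approach this on the client is sketched below. It assumes the text will end up stored as UTF-8 when it is not pure ASCII, which is an assumption rather than something stated here, and it needs an ES2015+ engine for codePointAt(). The name spaceRequirement is hypothetical:

```javascript
// Hypothetical helper for a form field: reports whether the text is pure
// ASCII and how many bytes it would occupy if stored as UTF-8.
function spaceRequirement(text) {
  var bytes = 0;
  var asciiOnly = true;
  for (var i = 0; i < text.length; i++) {
    var cp = text.codePointAt(i);
    if (cp > 0xFFFF) i++;  // skip the low half of a surrogate pair
    if (cp <= 0x7F) {
      bytes += 1;
    } else {
      asciiOnly = false;
      bytes += cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
    }
  }
  return { asciiOnly: asciiOnly, utf8Bytes: bytes };
}

var plain = spaceRequirement("abc");                 // asciiOnly true, 3 bytes
var cyrillic = spaceRequirement("\u0416\u0443\u043A"); // asciiOnly false, 6 bytes
console.log(plain, cyrillic);
```

In a page, the function could be called from the field's input handler to update a counter; that wiring is left out because it depends on the form.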

Cheers,

Martin

--
(9) For all resources, whatever it is, you need more.
---RFC 1925 "The Twelve Networking Truths"
-=-=- -=-=-=-=- --=-=-
Martin Dietze -=-=- http://www.the-little-red-haired-girl.org
Mar 29 '06 #8
Martin Herbert Dietze wrote:
> Thomas 'PointedEars' Lahn <Po*********@web.de> wrote:
>>> Is there a safe and easy way to notice non-7bit-Ascii input?
>> Yes, but that is only a start when talking about calculating
>> the "physical length".
>
> Hmm, "physical length" was obviously not the best term to use.
> My application needs to decide which encoding to use in order
> to represent the text I get from an input field. If there are
> any characters not representable as 7bit-Ascii I will have to
> switch the whole buffer's encoding after having received the
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> data from the form. I need the JavaScript code to notify the
> user of the space requirement of the text entered.


You do not have to worry about that. Use UTF-8 always, and use a VARCHAR
or equivalent data type. UTF-8 will encode characters from ASCII (Unicode
code points U+0000 to U+007F) and other characters from ISO-8859-1 (Unicode
code points U+0080 to U+00FF) with 8 bit (one UTF-8 code unit), and other
Unicode characters in the range supported by HTML 4.01 and XML 1.0 with up
to 4 bytes. So you can calculate the maximum number of bytes that are
needed, provided that you know how long (in characters) the string value
can be. If you do not know that, use a BLOB or equivalent data type
instead, where UTF-8 would still be the preferred encoding.
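
The worst-case calculation is plain arithmetic, since UTF-8 as defined in RFC 3629 never needs more than 4 bytes per code point. A sketch, with hypothetical function names and the caveat that it matters how the length limit is counted:

```javascript
// Hypothetical helpers: upper bounds on UTF-8 storage for a length-limited
// field.

// Limit counted in code points: at most 4 bytes each.
function maxUtf8BytesForCodePoints(maxCodePoints) {
  return maxCodePoints * 4;
}

// Limit counted in UTF-16 code units (what str.length reports): a BMP
// character is one unit and at most 3 bytes; a surrogate pair is two units
// and 4 bytes, so 3 bytes per unit is the worst case.
function maxUtf8BytesForCodeUnits(maxCodeUnits) {
  return maxCodeUnits * 3;
}

console.log(maxUtf8BytesForCodePoints(100));  // 400
console.log(maxUtf8BytesForCodeUnits(100));   // 300
```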

<URL:http://en.wikipedia.org/wiki/Unicode>
PointedEars
Mar 29 '06 #9

Thomas 'PointedEars' Lahn wrote:

[snip]
> String values are encoded with UTF-16, and Unicode is supported, since
> JavaScript 1.3, and ECMAScript Edition 3. This code will only provide
> the number of characters in the string anyway.


Thanks. I think I missed the point as to what the "OP" was looking for
with "physical length".

Regards

Julian Turner

Mar 29 '06 #10
Martin Herbert Dietze wrote:
> Hi,
>
> I need to calculate the physical length of text in a text
> input. The term "physical" means in this context, that I
> consider 7bit-Ascii as one-byte-per character.

Hi there,

I come with an off-topic rant :)

Your consideration is good, for ASCII *is* a 7 bit
encoding (that is *always* represented using 8 bits).
Extended ASCII is not ASCII (even if many people
mistakenly think they're the same, they're not, and this
is the cause of many problems). An ASCII text file,
for example, will have exactly the same length, in bytes,
as the number of (ASCII) characters and will always
have every 8th bit set to zero (if this is not the case, it
simply is *not* an ASCII text file).

Wikipedia has a very fine explanation on the subject, and
they explain very clearly that the 8th bit is set to zero (in very
old days, on very odd architectures that don't exist anymore,
the 8th bit sometimes was set to one).

> Is there a safe and easy way to notice non-7bit-Ascii input?


"non-7bit-Ascii" is redundant. You're looking for "non-ASCII" inputs.

I know I'm repeating myself, but it bears repeating ;)

Based on the fact that ASCII is a 7 bit encoding, that Unicode
is a superset of ASCII and that ASCII characters (and only
ASCII characters) are encoded on a single byte using the UTF-8
encoding, there's a nice trick that works in Java (not Javascript)
to check if a String contains non-ASCII characters:

"some string".length() == "some string".getBytes("UTF-8").length;

will *always* be true when the string contains only ASCII chars
and will *always* be false when the string contains non-ASCII
chars (if any Java programmer here disagrees, then I challenge
him to construct me a string that proves this wrong... For this
is simply impossible).
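
For comparison, a JavaScript analogue of the same length comparison might look like the sketch below. It assumes TextEncoder is available (current browsers, recent Node.js), and isAsciiByLength is a name made up for this example:

```javascript
// Hypothetical helper: compares the UTF-8 byte count with the UTF-16 code
// unit count. Every non-ASCII character takes more UTF-8 bytes than UTF-16
// code units, so the two are equal only for pure ASCII text.
function isAsciiByLength(str) {
  return new TextEncoder().encode(str).length === str.length;
}

console.log(isAsciiByLength("some string"));   // true
console.log(isAsciiByLength("caf\xE9"));       // false: U+00E9 takes 2 bytes
console.log(isAsciiByLength("\uD83D\uDE00"));  // false: 4 bytes, 2 code units
```

This encodes the whole string just to compare two numbers, so a loop or regular expression is cheaper when only detection is needed.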

ASCII is a 7 bit encoding

ISO-Latin-1 is a 8 bit encoding that is a superset of ASCII (with
the ASCII code points having the same values) and that is in very
widespread use

Extended-ASCII is an 8 bit monstrosity that barely exists anymore
on the Web and is definitely NOT ASCII

Unicode is a superset of ASCII (with the ASCII code points having
the same values).

I leave it to the c.l.j. gurus to explain how to check for non-ASCII
characters...

Mar 29 '06 #11
Thomas 'PointedEars' Lahn wrote:
> Martin Herbert Dietze wrote:
>> Thomas 'PointedEars' Lahn <Po*********@web.de> wrote:
>>>> Is there a safe and easy way to notice non-7bit-Ascii input?
>>> Yes, but that is only a start when talking about calculating
>>> the "physical length".
>>
>> Hmm, "physical length" was obviously not the best term to use.
>> My application needs to decide which encoding to use in order
>> to represent the text I get from an input field. If there are
>> any characters not representable as 7bit-Ascii I will have to
>> switch the whole buffer's encoding after having received the
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> data from the form. I need the JavaScript code to notify the
>> user of the space requirement of the text entered.
>
> You do not have to worry about that. Use UTF-8 always, and use a VARCHAR
> or equivalent data type. UTF-8 will encode characters from ASCII (Unicode
> code points U+0000 to U+007F) and other characters from ISO-8859-1 (Unicode
> code points U+0080 to U+00FF) with 8 bit (one UTF-8 code unit)


Utter nonsense. (I learned that one here ;)

You're a little bit confused as to how UTF-8 works: your
explanation of it is wrong.

UTF-8 guarantees that ASCII and *only* ASCII characters will always
be encoded using one byte, whose 8th bit is set to zero.

UTF-8 also guarantees that every ISO-8859-1 character that is not
an ASCII character will be encoded using two bytes. It also
guarantees that the 8th bit of these two bytes will *always* be set.

In fact UTF-8 guarantees more than that: every single byte found
in a UTF-8 encoding that doesn't have the 8th bit set is guaranteed
to be an ASCII char.
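
These guarantees are easy to see by printing the encoded bytes in binary. A small sketch, assuming TextEncoder and String.prototype.padStart are available (utf8Bits is a hypothetical helper name):

```javascript
// Hypothetical helper: returns each UTF-8 byte of str as an 8-digit binary
// string, so the high (8th) bit is the first digit.
function utf8Bits(str) {
  return Array.from(new TextEncoder().encode(str), function (b) {
    return b.toString(2).padStart(8, "0");
  });
}

console.log(utf8Bits("A").join(" "));     // 01000001
console.log(utf8Bits("\xE9").join(" "));  // 11000011 10101001
```

The ASCII letter is a single byte starting with 0, while U+00E9 from ISO-8859-1 becomes two bytes that both start with 1.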

Links for clearing up your misconceptions and stopping regurgitating
utter nonsense on that subject ;)

http://en.wikipedia.org/wiki/Utf-8

Or at the source, the RFC defining UTF-8:

http://www.ietf.org/rfc/rfc3629.txt

Mar 29 '06 #12
ne******@yahoo.fr wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Martin Herbert Dietze wrote:
>>> Thomas 'PointedEars' Lahn <Po*********@web.de> wrote:
>>>>> Is there a safe and easy way to notice non-7bit-Ascii input?
>>>> Yes, but that is only a start when talking about calculating
>>>> the "physical length".
>>>
>>> Hmm, "physical length" was obviously not the best term to use.
>>> My application needs to decide which encoding to use in order
>>> to represent the text I get from an input field. If there are
>>> any characters not representable as 7bit-Ascii I will have to
>>> switch the whole buffer's encoding after having received the
>>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> data from the form. I need the JavaScript code to notify the
>>> user of the space requirement of the text entered.
>>
>> You do not have to worry about that. Use UTF-8 always, and use a VARCHAR
>> or equivalent data type. UTF-8 will encode characters from ASCII
>> (Unicode code points U+0000 to U+007F) and other characters from
>> ISO-8859-1 (Unicode code points U+0080 to U+00FF) with 8 bit (one UTF-8
>> code unit)
>
> Utter nonsense. (I learned that one here ;)

No.

> you're a little bit confused as to how UTF-8 works.

I have confused something here.

> For your explanation as to how UTF-8 works is wrong.
>
> UTF-8 guarantees that ASCII and *only* ASCII characters will always
> be encoded using one byte, whose 8th bit is set to zero.
>
> UTF-8 also guarantees that every ISO-8859-1 character that is not
> an ASCII character will be encoded using two bytes. It also
> guarantees that the 8th bit of these two bytes will *always* be set.
>
> In fact UTF-8 guarantees more than that: every single byte found
> in an UTF-8 encoding that doesn't have the 8th bit set is guaranteed
> to be an ASCII char.

True.

PointedEars
Mar 29 '06 #13
ne******@yahoo.fr wrote:
> Martin Herbert Dietze wrote:
>> I need to calculate the physical length of text in a text
>> input. The term "physical" means in this context, that I
>> consider 7bit-Ascii as one-byte-per character.
> [...]
> Your consideration is good,

Debatable.

> for ASCII *is* a 7 bit encoding

True.

> (that is *always* represented using 8 bits).

Wrong, because

> Extended ASCII is not ASCII

"At the time ASCII was introduced, many computers dealt with eight-bit
groups (bytes or, more specifically, octets) as the smallest unit of
information; the eighth bit was commonly used as a parity bit for error
checking on communication lines or other device-specific functions."
(Wikipedia)

PointedEars
Mar 29 '06 #14
