
Validating utf-8 character strings in javascript regular expression

P: n/a
los
Hi,

I've created a web application using Struts. In one of the forms I want
to allow values containing special characters from other languages, but
not symbols such as (, <, +, }, etc. Writing a regular expression that
handles these values is proving quite hard. Right now I have
^([a-zA-Z0-9_\x81-\xFF])*$ and this works for some UTF-8 characters
such as , , , etc... But doesn't work for other characters such
as , , , etc...

I was wondering if someone has come across this issue and has found a
solution for the problem.

Thanks,

-Los

Sep 20 '05 #1
6 Replies


P: n/a
ASM
los wrote:
Hi,

I've created a web application using Struts. In one of the forms I want
to allow values containing special characters from other languages, but
not symbols such as (, <, +, }, etc. Writing a regular expression that
handles these values is proving quite hard. Right now I have
^([a-zA-Z0-9_\x81-\xFF])*$ and this works for some UTF-8 characters
such as , , , etc... But doesn't work for other characters such
as , , , etc...

I was wondering if someone has come across this issue and has found a
solution for the problem.


If you are using the ISO-8859-1 encoding:

? ^([a-zA-Z0-9_\xA0-\xFF])*$
? ^([a-zA-Z0-9_\x A0-\x FF])*$
? ^([a-zA-Z0-9_/A0/-/FF/])*$
? ^([\x61-\x7A\x41-\x5A\x30-\x39\x5F\xA0-\xFF])*$

? ^([/61/-/7A/41/-/5A/30/-/39//5F/A0-/FF/])*$
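
As a rough check of the Latin-1 idea above, here is a minimal sketch written as a RegExp literal. The name `latin1Mask` is made up for this example. It suggests that a \xA0-\xFF range accepts accented Latin-1 letters, but also lets through symbols that live in the same block, and rejects letters outside Latin-1.

```javascript
// Sketch only: `latin1Mask` is a hypothetical name, not part of Struts.
const latin1Mask = /^[a-zA-Z0-9_\xA0-\xFF]*$/;

console.log(latin1Mask.test("caf\u00E9")); // true:  é (U+00E9) is inside \xA0-\xFF
console.log(latin1Mask.test("a(b"));       // false: "(" is rejected as intended
console.log(latin1Mask.test("\u00D7"));    // true:  × (U+00D7) is a symbol inside the range
console.log(latin1Mask.test("\u0142"));    // false: ł (U+0142) lies outside Latin-1
```

So the range is both too wide (symbols such as × pass) and too narrow (non-Latin-1 letters fail) for the stated goal.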

--
Stephane Moriaux et son [moins] vieux Mac
Sep 20 '05 #2

P: n/a
los
What if we don't want to restrict to just ISO-8859-1 characters? What
if we want to be all of the UTF-8 characters?

I tried doing something like ^([a-zA-Z0-9_\x0080-\xFFFF])*$ and it
didn't work.

-Los

Sep 21 '05 #3

P: n/a
ASM
los wrote:
What if we don't want to restrict to just ISO-8859-1 characters? What
if we want to be all of the UTF-8 characters?
Because \x?? is not UTF-8.
It is a hexadecimal escape,
and the hexadecimal code of a character is not the same in every charset.

Example:
non-breaking space = \xA0 (hex) = 00A0 (Unicode) = C2 A0 (UTF-8)
non-breaking space = hex A0 in charsets ISO-8859-1 & CP1252
non-breaking space = hex FF in charsets CP850 & CP437
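
The distinction between the code number and the UTF-8 bytes can be sketched in JavaScript itself. Note one assumption worth stating: inside JavaScript source, \xA0 in a string or RegExp denotes the character U+00A0 rather than a byte of the page encoding, so the pattern is compared against characters, not UTF-8 bytes. The names `nbsp` and `utf8Percent` are made up for this example.

```javascript
// Sketch only: `nbsp` and `utf8Percent` are hypothetical names.
const nbsp = "\xA0";                          // the same string as "\u00A0"
const utf8Percent = encodeURIComponent(nbsp); // percent-encoded UTF-8 bytes

console.log(nbsp === "\u00A0");               // true
console.log(nbsp.charCodeAt(0).toString(16)); // "a0"
console.log(utf8Percent);                     // "%C2%A0"
console.log(/^[\xA0-\xFF]$/.test(nbsp));      // true: the match is on the character
```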

http://www.miakinen.net/vrac/c10/charsets
I tried doing something like ^([a-zA-Z0-9_\x0080-\xFFFF])*$ and it
didn't work.
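
One likely reason the \x0080-\xFFFF attempt fails: in a JavaScript RegExp, \x takes exactly two hex digits. A minimal sketch of what the pattern then means (the name `misparsed` is made up for this example): the class reads as \x00, the digit 8, the range from "0" to \xFF, and the letter F.

```javascript
// Sketch only: `misparsed` is a hypothetical name.
// \x0080-\xFFFF parses as: \x00, "8", the range "0"-\xFF, "F", "F".
const misparsed = /^([a-zA-Z0-9_\x0080-\xFFFF])*$/;

console.log(misparsed.test("<"));      // true:  "<" (U+003C) falls inside "0"-\xFF
console.log(misparsed.test("}"));      // true:  "}" (U+007D) falls inside "0"-\xFF
console.log(misparsed.test("("));      // false: "(" (U+0028) is below "0"
console.log(misparsed.test("\u0142")); // false: ł (U+0142) is above \xFF
```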


? ^([a-zA-Z0-9_/0080/-/FFFF/])

think it could be :

0081 to 00FF unicode
or
C2A0 to C3BF utf-8
from :
http://www.macchiato.com/unicode/chart/
or :
other url above

--
Stephane Moriaux et son [moins] vieux Mac
Sep 21 '05 #4

P: n/a
los
Thanks for the reply!

I tried your approach but for some reason the javascript parser doesn't
recognize the utf-8 characters still.

Could someone please verify that the correct regex should be
^([a-zA-Z0-9_\u00A1-\uFFFF])*$ ?

If I use the above regex in my xml, in the javascript that gets
generated on the web page I get the following rule;

this.mask=/^([a-zA-Z0-9_\\u00A1-\\uFFFF])*$/;

I apologize if this is a trivial question, but I'm new to this and am
learning as I go along.

Thanks,

-Los

Sep 21 '05 #5

P: n/a
los wrote:
I tried your approach but for some reason the javascript parser doesn't
recognize the utf-8 characters still.

Could someone please verify that the correct regex should be
^([a-zA-Z0-9_\u00A1-\uFFFF])*$ ?
It should not. Firstly, Unicode escapes need to be supported, which is not
the case with every script engine. Test it like

/\u00A1/.toString().length < 4 ? supported : unsupported
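
An alternative feature test, offered as a sketch rather than as the check above: it asks whether a pattern built from the text \u00A1 actually matches the character U+00A1. In current engines toString() returns the pattern source unchanged, so a length check may not tell the two cases apart. The function name `supportsUnicodeEscapes` is made up for this example.

```javascript
// Sketch only: `supportsUnicodeEscapes` is a hypothetical helper.
function supportsUnicodeEscapes() {
  try {
    // The string "\\u00A1" hands the six characters \u00A1 to the RegExp engine.
    // An engine without \u support would not match the single character U+00A1.
    return new RegExp("^\\u00A1$").test(String.fromCharCode(0xA1));
  } catch (e) {
    return false; // the engine rejected the pattern outright
  }
}

console.log(supportsUnicodeEscapes() ? "supported" : "unsupported");
```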

Secondly, using the asterisk (`*') quantifier means that the pattern also
matches the empty string; you should use the plus (`+') quantifier instead.
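
The difference between the two quantifiers can be seen in a short sketch (the names `starMask` and `plusMask` are made up for this example):

```javascript
// Sketch only: `starMask` and `plusMask` are hypothetical names.
const starMask = /^[a-zA-Z0-9_\u00A1-\uFFFF]*$/;
const plusMask = /^[a-zA-Z0-9_\u00A1-\uFFFF]+$/;

console.log(starMask.test(""));          // true:  zero repetitions are allowed
console.log(plusMask.test(""));          // false: at least one character is required
console.log(plusMask.test("caf\u00E9")); // true
```

If an empty field should be legal, keeping `*` is a valid choice; the point is only that it has to be a deliberate one.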

Thirdly, you have to specify which Unicode characters you consider to be
"symbols". For example, including the range 0x00A1 to 0xFFFF as above
would also include the range 0x2100 to 0x214F (Letterlike Symbols).
See <http://unicode.org/> and <http://pointedears.de/scripts/test/charset>
for details.
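
A sketch of the problem, plus one possible way around it that is not from this thread: engines supporting ES2018 accept Unicode property escapes with the `u` flag, so letters and digits can be named by category instead of by range. The names `wideRange` and `lettersOnly` are made up for this example, and whether "letters, numbers and underscore" is the right definition is an assumption.

```javascript
// Sketch only: `wideRange` and `lettersOnly` are hypothetical names.
const wideRange = /^[a-zA-Z0-9_\u00A1-\uFFFF]+$/;
// Assumes an engine with ES2018 property escapes (\p{...} with the u flag).
const lettersOnly = /^[\p{L}\p{N}_]+$/u;

console.log(wideRange.test("\u2103"));               // true:  ℃ is a Letterlike Symbol
console.log(lettersOnly.test("\u2103"));             // false: ℃ is not a letter or number
console.log(lettersOnly.test("\u65E5\u672C\u8A9E")); // true:  日本語
console.log(lettersOnly.test("a<b"));                // false
```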
If I use the above regex in my xml, in the javascript that gets
generated on the web page I get the following rule;

this.mask=/^([a-zA-Z0-9_\\u00A1-\\uFFFF])*$/;


Apart from the fact that this would match the empty string as well, that
is quite obviously a completely different RegExp from the one above.
Escaping the backslash includes it as a literal character in the
character class, along with all following elements of the former escape
sequence (here: u, 0, A, 1, F).
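
What the generated literal accepts can be checked with a small sketch (the name `generatedMask` is made up for this example). One detail beyond the list above: the sequence 1-\\ inside the class forms a range from "1" to the backslash, which happens to cover characters such as "<".

```javascript
// Sketch only: `generatedMask` is a hypothetical name for the generated rule.
// The class holds a-z, A-Z, 0-9, _, a literal backslash, u, 0, A, F,
// and the range "1" (U+0031) through "\" (U+005C).
const generatedMask = /^([a-zA-Z0-9_\\u00A1-\\uFFFF])*$/;

console.log(generatedMask.test("\\"));        // true:  a literal backslash is accepted
console.log(generatedMask.test("<"));         // true:  "<" (U+003C) lies in the range 1-\
console.log(generatedMask.test("caf\u00E9")); // false: é is not in the class at all
```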

What you possibly could want is

this.mask = new RegExp("^([a-zA-Z0-9_\\u00A1-\\uFFFF])+$");

where the escaped backslashes collapse to single ones before the string
is passed to the RegExp constructor, and so yield the first RegExp
literal (apart from the quantifier). But you should rather configure
your server-side code generator not to escape escape sequences.
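
A short sketch of that equivalence (the names `fromString` and `fromLiteral` are made up for this example): in a string literal each \\ becomes one backslash, so the constructor receives the same pattern text as the literal form.

```javascript
// Sketch only: `fromString` and `fromLiteral` are hypothetical names.
const fromString = new RegExp("^([a-zA-Z0-9_\\u00A1-\\uFFFF])+$");
const fromLiteral = /^([a-zA-Z0-9_\u00A1-\uFFFF])+$/;

console.log(fromString.source === fromLiteral.source); // true: same pattern text
console.log(fromString.test("caf\u00E9"));             // true
console.log(fromString.test(""));                      // false, because of the + quantifier
console.log(fromString.test("a<b"));                   // false
```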
PointedEars
Oct 16 '05 #6

P: n/a
JRS: In article <11****************@PointedEars.de>, dated Sun, 16 Oct
2005 18:41:31, seen in news:comp.lang.javascript, Thomas 'PointedEars'
Lahn <Po*********@web.de> posted :
los wrote:


ON 21 SEPTEMBER
AISB, your attribution line does not comply with the minimum current
Usenet thinking - this is not news:de,* here, as you should know.
I tried your approach but for some reason the javascript parser doesn't
recognize the utf-8 characters still.

Could someone please verify that the correct regex should be
^([a-zA-Z0-9_\u00A1-\uFFFF])*$ ?


It should not. Firstly, Unicode escapes needs to be supported which is not
the case with every script engine. Test it like

One had hoped that the turd who thinks it useful to disinter aged
threads had himself passed on to another place.

--
John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
Oct 16 '05 #7
