
regex/replace white list

Hi,

What is the best way to white list a set of allowable characters using
regex or replace? I understand it is safer to whitelist than to
blacklist, but am not sure how to go about it.

Many thanks!

Feb 17 '06 #1
4 Replies


jg*****@gmail.com wrote:
Hi,

What is the best way to white list a set of allowable characters using
regex or replace? I understand it is safer to whitelist than to
blacklist, but am not sure how to go about it.


Whether to use a white list (i.e. list of allowed characters) or a black
list (list of not allowed characters) is probably best decided by which
one gives the smaller list. I'm not sure 'safety' is an issue.

As far as a regular expression is concerned, the difference between the
two is whether to use the NOT (!) operator or not (or use an else
statement).

To build the white/black list, use a string of characters and the
RegExp() function as a constructor, e.g. if you want to disallow the
letter 'a' in a string, then:

var re = new RegExp('a');

will create a regular expression that can be used to match the letter
'a' anywhere, e.g.:

if ( re.test(someString) )
{
// someString contains the letter 'a'
} else {
// someString doesn't contain the letter 'a'
}

or:

if ( ! re.test(someString) )
{
// someString doesn't contain the letter 'a'
}

To make the regular expression case-insensitive, add the 'i' flag:

var re = new RegExp('a','i');

To match any word character or the '$' character:

var re = new RegExp('[\\w$]');

To match any non-word character (anything other than a-z, A-Z, 0-9 and
the underscore):

var re = new RegExp('\\W');

You can build the expression and flags as string variables and use those:

var reString = '\\W'; // Expression string
var flString = 'g'; // Flag string
var re = new RegExp(reString, flString);

and so on... Search the archives for lots of examples.
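
For the white list the original question asks about, one possible sketch (not taken from the posts in this thread) is a negated character class: [^...] matches any single character that is not listed. The allowed set and the function names allowedOnly and stripDisallowed are assumptions chosen for illustration.

```javascript
// Assumed white list: letters, digits, space and hyphen.
// [^...] matches any single character NOT in the list.
var notAllowed = /[^A-Za-z0-9 -]/;

// True only if every character of str is on the white list.
function allowedOnly(str) {
  return !notAllowed.test(str);
}

// Remove every character that is not on the white list.
function stripDisallowed(str) {
  return str.replace(/[^A-Za-z0-9 -]/g, '');
}
```

With this sketch, allowedOnly('Cost: $6') is false because of the ':' and '$', and stripDisallowed('Cost: $6') gives 'Cost 6'.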

--
Rob
Feb 17 '06 #2

RobG wrote:
To build the white/black list, use a string of characters and the
RegExp() function as a constructor, e.g. if you want to disallow the
letter 'a' in a string, then:

var re = new RegExp('a');

will create a regular expression that can be used to match the letter
'a' anywhere, [...]


There is not much point in using the RegExp() constructor instead
of a Regular Expression literal when the expression is invariant. As was
discussed here recently, efficiency and compatibility are seldom an issue:

As for efficiency, the RegExp object created by a RegExp literal is created
before execution, and the literal is then merely a reference to that
object. The RegExp object is not recreated by repeated use of the same
literal (say, in a loop). (Which must be considered regarding efficiency,
though, since this will create a new RegExp object always if the expression
differs, unconditionally. Even if the object is used only when a certain
condition applies.)

As for compatibility, even though RegExp literals were not specified
before ECMAScript Edition 3 (issued 1999, seven years ago already, though),
they have been supported since JavaScript 1.2 (Netscape 4.0, June 1997)
except for the `m' modifier. They have been supported including the `m'
modifier since JavaScript 1.5 (Mozilla/5.0 rv:0.6, November 2000) and
JScript 3.0 (Internet Explorer 4.0, and Internet Information Server 4.0,
October 1997). (The problems that remain compared to ECMAScript Edition 3
are non-capturing parentheses and non-greedy expressions, which are not
universally supported, but you have to deal with those problems with the
RegExp() constructor as well.)

However, using the RegExp constructor removes and introduces a maintenance
problem. It removes the problem that Regular Expressions cannot span lines
because string concatenation serves the purpose. It introduces the problem
that one has to escape the expression twice: one time to avoid escape
sequences in the string literal, and again to have RegExp special
characters parsed as expression atoms instead. (This is often very
confusing to people who are fairly new to the language.)
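
The double escaping described above can be illustrated with a small sketch. The variable names are hypothetical, and the pattern (digits followed by a literal dot) is only an assumed example.

```javascript
// Both expressions match one or more digits followed by a literal dot.
var fromLiteral = /\d+\./;
var fromString = new RegExp('\\d+\\.'); // each backslash doubled in the string

// Forgetting to double the backslashes: the string literal '\d+\.' is
// just 'd+.', so this matches one or more 'd' followed by any character.
var mistaken = new RegExp('\d+\.');
```

Comparing fromLiteral.source with fromString.source shows that the two are the same expression, while mistaken.source is 'd+.'.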

var re = /a/;

and the like certainly suffices here.

As a final note, I want to add that if special features of Regular
Expressions compared to strings are not used, it is probably more
efficient not to use Regular Expressions at all. Instead of writing

if (re.test(someString))

using the RegExp() constructor or the above RegExp object initializer,
it is probably more efficient to write

if (someString.indexOf("a") > -1)

instead.
PointedEars
Feb 17 '06 #3

Thomas 'PointedEars' Lahn wrote:
RobG wrote:

To build the white/black list, use a string of characters and the
RegExp() function as a constructor, e.g. if you want to disallow the
letter 'a' in a string, then:

var re = new RegExp('a');

will create a regular expression that can be used to match the letter
'a' anywhere, [...]

There is not much point in using the RegExp() constructor instead
of a Regular Expression literal when the expression is invariant.


My understanding of the request is that the string *is* variant. The OP
wishes to build a list of characters to allow/disallow, I presumed it
would not be hard-coded - though it might be built that way at the
server where the value is extracted from a database and the appropriate
value hard-coded into the script.

But I supposed that the value would be written to some variable, which is
then accessed by the script, e.g.

var blackList = '$%#';

and then later:

var re = new RegExp('[' + blackList + ']');
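
When the list comes from a variable, characters that are special inside a character class (the backslash, ']', '^' and '-') change the meaning of the class unless they are escaped. A minimal sketch of one way to handle that, assuming the list holds raw, unescaped characters; escapeForClass and makeBlackListRegExp are hypothetical helper names:

```javascript
// Prefix the characters that are special inside [...] with a backslash.
function escapeForClass(chars) {
  return chars.replace(/[\\\]\^\-]/g, '\\$&');
}

// Build a RegExp that matches any one character from the raw list.
function makeBlackListRegExp(chars) {
  return new RegExp('[' + escapeForClass(chars) + ']');
}
```

With this approach a list such as 'a-z' blacklists exactly the three characters 'a', '-' and 'z' instead of the whole range of lowercase letters.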

[...] As was
discussed here recently, efficiency and compatibility are seldom an issue:

As for efficiency, the RegExp object created by a RegExp literal is created
before execution, and the literal is then merely a reference to that
object. The RegExp object is not recreated by repeated use of the same
literal (say, in a loop). (Which must be considered regarding efficiency,
though, since this will create a new RegExp object always if the expression
differs, unconditionally. Even if the object is used only when a certain
condition applies.)

Quite true, I was addressing efficiency from the point of view of the
length of the expression, e.g. to allow only letters, digits and the
underscore, \w will do the trick. To disallow only '@#$', [@#$] is much
shorter than a list of everything else.

The difference in efficiency between using RegExp as a constructor and
using a literal in the above scenario is likely irrelevant (though I
understand your point and in general much prefer to use literals).

[...] However, using the RegExp constructor removes and introduces a maintenance
problem. It removes the problem that Regular Expressions cannot span lines
because string concatenation serves the purpose. It introduces the problem
that one has to escape the expression twice: one time to avoid escape
sequences in the string literal, and again to have RegExp special
characters parsed as expression atoms instead.
Escaping characters is always an issue, especially if multi-line input
is accepted. Should new lines & line feeds be allowed? The solution is
for the OP to learn about matching characters and apply that to their
particular circumstance.
[...]
var re = /a/;

and the like certainly suffices here.
Probably a result of my trivial example - a better example is below.

As a final note, I want to add that if special features of Regular
Expressions compared to strings are not used, it is probably more
efficient not to use Regular Expressions at all. Instead of writing

if (re.test(someString))

using the RegExp() constructor or the above RegExp object initializer,
it is probably more efficient to write

if (someString.indexOf("a") > -1)


If the need was a test for a specific character, then that would be
fine. Maybe you could use it with a loop to go through each character
in the black list, but how many characters/loops would it take before a
regular expression was faster?
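
For comparison, the loop mentioned above might look like the following sketch. containsAnyOf is a hypothetical name, and no claim is made here about which approach is faster.

```javascript
// True if str contains at least one character from blackList.
// Uses only indexOf(), so no character needs escaping.
function containsAnyOf(str, blackList) {
  for (var i = 0; i < blackList.length; i++) {
    if (str.indexOf(blackList.charAt(i)) > -1) {
      return true;
    }
  }
  return false;
}
```

One property of this variant is that characters such as ']' or '^' in the list need no special treatment.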

The following example may be better:

<script type="text/javascript">

function checkList(blID, strID)
{
var blackList = document.getElementById(blID).value;
var inString = document.getElementById(strID).value;
var re = new RegExp('[' + blackList + ']');
document.getElementById('xx').innerHTML = re.test(inString);
}
</script>
<label for="blackList">Blacklist characters:<input
type="text" id="blackList" value="\^\]$#@"></label><br>

<label for="inputText">String to check:<input
type="text" id="inputText" value="Cost: $6"></label>

<input type="button" value="Check input with blacklist"
onclick="checkList('blackList','inputText');">

<div>Result: <span id="xx" style="font-weight: bold;">
<i>no check done yet...</i></span></div>

If new lines, line feeds, etc. need to be tested too, use a textarea
instead of a text input for the input string. Variations on how
browsers represent new lines may need to be accommodated too.
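
One way to accommodate those variations, as a sketch only, is to normalize all line breaks to a single form before testing. normalizeNewlines is a hypothetical helper, and the choice of '\n' as the target form is an assumption.

```javascript
// Convert CR+LF and lone CR line breaks to a single LF.
function normalizeNewlines(str) {
  return str.replace(/\r\n?/g, '\n');
}
```

The normalized string can then be checked against the list in the same way as single-line input.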

--
Rob
Feb 20 '06 #4

RobG wrote:
Thomas 'PointedEars' Lahn wrote:
However, using the RegExp constructor removes and introduces a
maintenance problem. It removes the problem that Regular Expressions
cannot span lines because string concatenation serves the purpose. It
introduces the problem that one has to escape the expression twice: one
time to avoid escape sequences in the string literal, and again to have
RegExp special characters parsed as expression atoms instead.
Escaping characters is always an issue, especially if multi-line input
is accepted. Should new lines & line feeds be allowed?


You misunderstood. This was not about matching newline in the input.
The solution is for the OP to learn about matching characters and apply
that to their particular circumstance.
My point was that

var rx = /very_long_Regular_Expression.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p.
r.s.t.u.v.w.x.y.z.\..#.#.4.2.1.3.3.7./

is not possible (consider the above a _hard_ line break to avoid crossing
the 80-columns border), but

var rx = new RegExp(
"very_long_Regular_Expression.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p."
+ "r.s.t.u.v.w.x.y.z.\\..#.#.4.2.1.3.3.7.");

(and the like) is. The latter introduces the maintenance problem that the
literal "." must be escaped twice, but it removes the maintenance problem
that literals are not allowed to span lines (in the source code).
As a final note, I want to add that if special features of Regular
Expressions compared to strings are not used, it is probably more
efficient not to use Regular Expressions at all. Instead of writing

if (re.test(someString))

using the RegExp() constructor or the above RegExp object initializer,
it is probably more efficient to write

if (someString.indexOf("a") > -1)


If the need was a test for a specific character, then that would be
fine. Maybe you could use it with a loop to go through each character
in the black list, but how many characters/loops would it take before a
regular expression was faster?


I do not know. This was a general note.
The following example may be better:
Maybe not :)
<script type="text/javascript">

function checkList(blID, strID)
{
var blackList = document.getElementById(blID).value;
var inString = document.getElementById(strID).value;
A `form' element would have avoided the inefficient and not downwards
compatible referencing.

function checkList(f, blID, strID)
{
var es;
if (blID && strID
&& f && (es = f.elements)
&& es[blID] && es[strID])
{
var blackList = es[blID].value;
var inString = es[strID].value;

// ...
}
else
{
window.alert("foobar!");
}

return false;
}

<form action="..."
onsubmit="return checkList(this, 'blackList', 'inputText');">
...
<input type="submit" value="Check input with blacklist">
</form>
var re = new RegExp('[' + blackList + ']');
What about the escaping part? You do not want the user to handle that,
do you?
document.getElementById('xx').innerHTML = re.test(inString);
Mixing standards compliant and proprietary DOM features unnecessarily.

es["xx"].style.fontStyle = "normal"; // I prefer setStyleProperty()[1]
es["xx"].value = re.test(inString);

<form ...>
...
<div>Result: <input id="xx"
value="no check done yet..."
style="border:0; font-weight:bold; font-style:italic"></div>
</form>
[...]

PointedEars
___________
[1] <URL:http://pointedears.de/scripts/dhtml.js>
Feb 20 '06 #5
