
allowed characters in a string (stripping it)

P: n/a
I'd like to have a set of "allowed characters", and strip a string
from everything besides those.

I've tried and tried but so far every time I enter strings containing
unicode, it goes mad and output makes no sense.

I'm sure I'm missing something but no idea what.

$ACCENTED_ALL_LOW="àèìòùáéíóúâêîôû ëïöüÿãõñçæøåăāĕēĭīŏōŭūəß";
$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß";
$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
$SYMBOLS_NAME=".'- ";

first time I tried using something like this:
$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
(bear in mind I do *NOT* know regexp, a friend wrote this line)

now I tried instead using str_split:
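function clear_name_complex ($name, $ok_chars) {
    // str_split() cuts by byte, so a multi-byte character becomes several keys
    $ass=str_split($ok_chars);
    $al=array();
    foreach ($ass as $a) {
        $al[$a]=TRUE;
    }

    $s=str_split($name);
    $ret="";
    foreach ($s as $c) {
        // isset() avoids an undefined-index notice for characters not in the set
        if (!isset($al[$c])) continue;
        $ret.=$c;
    }

    return $ret;
}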

still nothing: every time I enter strings containing
unicode, it goes mad and output makes no sense.
I believe that's because in both cases it treats unicode characters by
splitting them into single bytes, but still, I'm clueless about what I am
supposed to do.
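
The byte-splitting suspicion can be checked with a short sketch. It assumes the script and its input are UTF-8; the sample string is made up for illustration, with the accented character written as explicit bytes so the file encoding does not matter.

```php
<?php
$word = "caf\xC3\xA9";   // "café", with é as its two UTF-8 bytes

// str_split() cuts the string into single bytes, so é becomes two entries.
$bytes = str_split($word);

// An empty pattern with the u modifier splits between code points instead.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

var_dump(count($bytes)); // int(5)
var_dump(count($chars)); // int(4)
```

If the two counts differ, any lookup table keyed on the output of str_split() is comparing half-characters, which would explain the garbled output.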
Dec 11 '07 #1
11 Replies


P: n/a

"Lo'oris" <lo****@gmail.com> wrote in message
news:f0**********************************@r60g2000hsc.googlegroups.com...
I'd like to have a set of "allowed characters", and strip a string
from everything besides those.

I've tried and tried but so far every time I enter strings containing
unicode, it goes mad and output makes no sense.

I'm sure I'm missing something but no idea what.

$ACCENTED_ALL_LOW="aae eiioouu?";
$ACCENTED_ALL_BIG="YAAE EIIOOUU?";
$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
$SYMBOLS_NAME=".'- ";

first time I tried using something like this:
$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
(bear in mind I do *NOT* know regexp, a friend wrote this line)

=================

preg_replace is your best bet.

$notAllowed = "/[^a-zaaeeiioouu?]/si";
$newString = preg_replace($notAllowed, '', $stringToSearch);

or something to that effect...it tested fine in 'the regulator' - a regex
tester.
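
A minimal sketch of the same negated-class idea, assuming the input is UTF-8. The function name keep_allowed and its parameters are hypothetical, not from the thread. The u modifier asks PCRE to treat pattern and subject as UTF-8, and preg_quote() keeps characters such as - or ] in the allowed set from being read as class syntax.

```php
<?php
// Hypothetical helper: remove every character that is not in $allowed.
// Assumes $allowed is non-empty and both strings are UTF-8.
function keep_allowed(string $input, string $allowed): string
{
    // Escape regex metacharacters and the / delimiter inside the class.
    $pattern = '/[^' . preg_quote($allowed, '/') . ']+/u';

    // preg_replace() returns null on failure, e.g. when $input is not valid UTF-8.
    $result = preg_replace($pattern, '', $input);
    return $result === null ? '' : $result;
}

// Allowed set for the example: ASCII letters, ë, é and the space.
$allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' . "\xC3\xAB\xC3\xA9 ";
echo keep_allowed("Zo\xC3\xAB <3 caf\xC3\xA9!", $allowed), "\n";
```

Returning an empty string on failure is only one possible choice; rejecting the input outright may suit a name field better.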
Dec 11 '07 #2

P: n/a
"Lo'oris" <lo****@gmail.com> wrote:
>I'd like to have a set of "allowed characters", and strip a string
>from everything besides those.
>
>I've tried and tried but so far every time I enter strings containing
>unicode, it goes mad and output makes no sense.

How are you entering "strings containing unicode"? Browsers don't send
Unicode.

>I'm sure I'm missing something but no idea what.

>$ACCENTED_ALL_LOW="?? ?????????";
>$ACCENTED_ALL_BIG="ܟ?? ?????????";
>$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>$SYMBOLS_NAME=".'- ";
>
>first time I tried using something like this:
>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>(bear in mind I do *NOT* know regexp, a friend wrote this line)

This matches one of 4 things:
    [a-zA-Z]   an upper or lower case letter,
    |          or
    -          a hyphen
    |          or
    [$al]      whatever is contained in $al
    |          or
    .          any character,

and replaces it with whatever was matched. Clearly, this is a very
expensive no-op. You do NOT want the '.' in there.
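
Whether the line really leaves the string unchanged can be checked on a small sample. A minimal sketch, assuming a hypothetical value for $al (the thread never shows how it was built). Note that the replacement is '$1', the first capture group, not the whole match, so a character that matches only the trailing dot comes back as an empty string.

```php
<?php
// Hypothetical allowed set: apostrophe, period, space.
$al = "'. ";
$pattern = "/([a-zA-Z]|-|[$al])|./";

// Characters matched inside the parentheses fill group 1 and are kept.
// Characters matched only by the dot leave group 1 empty and disappear.
$clean = preg_replace($pattern, '$1', "O'Neil-2 <b>");

var_dump($clean); // string(9) "O'Neil- b"
```

On this sample the digit and the angle brackets are removed, so the pattern acts as a filter here. Without the u modifier it still works byte by byte, so multi-byte characters in $al or in the input are not handled as whole characters.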

I would suggest this:

preg_match_all("/[$al]+/", $name, $out);
$result = implode('', $out[0]);
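
A hedged variant of this suggestion, assuming UTF-8 input and a hypothetical $allowed string assembled the way the question builds its sets. Two details are added here and are not from the thread: the u modifier, and preg_quote(). Unescaped, the "'- " part of a set like $SYMBOLS_NAME would be read inside the brackets as a range from ' to space, which PCRE rejects as out of order.

```php
<?php
// Hypothetical allowed set: lowercase ASCII, é, and the name symbols.
$allowed = "abcdefghijklmnopqrstuvwxyz" . "\xC3\xA9" . ".'- ";

// Escape the hyphen, the period and the / delimiter for use inside [...].
$class = preg_quote($allowed, '/');

$name = "jean-ren\xC3\xA9 o'hara <script>";

// Collect every run of allowed characters, then glue the runs together.
preg_match_all("/[$class]+/u", $name, $out);
$result = implode('', $out[0]);

echo $result, "\n";
```

The angle brackets are dropped while the hyphen, apostrophe and accented letter survive.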
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Dec 11 '07 #3

P: n/a

"Tim Roberts" <ti**@probo.com> wrote in message
news:gt********************************@4ax.com...
>"Lo'oris" <lo****@gmail.com> wrote:
>>I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>>
>>I've tried and tried but so far every time I enter strings containing
>>unicode, it goes mad and output makes no sense.
>
>How are you entering "strings containing unicode"? Browsers don't send
>Unicode.
>
>>I'm sure I'm missing something but no idea what.
>>
>>$ACCENTED_ALL_LOW="? ??????????";
>>$ACCENTED_ALL_BIG="Y? ??????????";
>>$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>>$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>>$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>>$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>>$SYMBOLS_NAME=".'- ";
>>
>>first time I tried using something like this:
>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>(bear in mind I do *NOT* know regexp, a friend wrote this line)
>
>This matches one of 4 things:
>    [a-zA-Z]   an upper or lower case letter,
>    |          or
>    -          a hyphen
>    |          or
>    [$al]      whatever is contained in $al
>    |          or
>    .          any character,
>
>and replaces it with whatever was matched. Clearly, this is a very
>expensive no-op. You do NOT want the '.' in there.
>
>I would suggest this:
>
>preg_match_all("/[$al]+/", $name, $out);
>$result = implode('', $out[0]);

it's not expensive at all. and a dot is any single character...not a greedy
wild card. the only reason he wouldn't want a dot is because it could be an
'illegal' character that he's trying to get rid of anyway. as it is, he just
didn't escape the dot so that it is the character (period) and not the
directive (any single character).
Dec 11 '07 #4

P: n/a
Tim Roberts wrote:
>"Lo'oris" <lo****@gmail.com> wrote:
>>I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>>I've tried and tried but so far every time I enter strings containing
>>unicode, it goes mad and output makes no sense.
>
>How are you entering "strings containing unicode"? Browsers don't send
>Unicode.

Excuse me? They sure can, depending on the language being used.

So the rest of your post is immaterial. Steve's suggestion is a lot closer.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Dec 11 '07 #5

P: n/a
Steve wrote:
>damnit if everyone in the world won't speak and use the english
>language!

Some English words use non-ASCII characters too.

There are obviously those adopted from other languages such as résumé,
café and crèche. Many such words will lose their accents when they come to
English, but the examples given in the previous sentence typically retain
them.

The forenames Chloë and Zoë use a diaeresis mark to indicate that the 'e'
should be pronounced independently from the 'o' sound.

The surname Brontë has a similar mark, though that was a fanciful addition
by their father. They were originally from Ireland and thought that the
English might have a hard time correctly pronouncing Proinntigh, so
anglicised it. One might wonder why he thought adding an 'ë' counted as
anglicising, but at that time, the mark was quite common, used in words
such as coöperate, coördinate, reënact and noöne. (Now that the mark has
become less common, these words are somewhat more awkward to spell. Using
a hyphen looks wrong, but the words look even worse without!) It is still
retained in naïve.

Also, 'æ' and to a lesser extent, 'œ' are used in many words. (Try looking
up pre-mediæval diseases of the œsophagus in an encyclopædia, and you
might find that many of them have names which are an onomatopœia.)

All of these are still readable when transliterated into a purely ASCII
alphabet, though 'resume' is ambiguous.

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 3 days, 21:11.]

Sharing Music with Apple iTunes
http://tobyinkster.co.uk/blog/2007/1...tunes-sharing/
Dec 11 '07 #6

P: n/a
Greetings, Lo'oris.
In reply to Your message dated Tuesday, December 11, 2007, 05:47:08,
>I'd like to have a set of "allowed characters", and strip a string
>from everything besides those.
>I've tried and tried but so far every time I enter strings containing
>unicode, it goes mad and output makes no sense.

I think we should stop here and decide what we can do to be sure we have
proper user input before dealing with it.

Typical case: You're working with a posting form and forget to set the
accept-charset attribute on it, leaving the user agent to decide for itself
how to send the data. In most cases it returns the data in the encoding in
which Your server sent the page to the user. But that holds only for a fair
browser, like Opera, and only if Your server supplies any encoding at all.
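
The accept-charset point can be sketched from the PHP side. This is a minimal illustration, assuming the page is meant to be served as UTF-8; render_name_form and /submit.php are hypothetical names, not from the thread.

```php
<?php
// Hypothetical helper that returns the markup for a small name form.
// accept-charset tells the browser which encoding to submit the data in.
function render_name_form(string $action): string
{
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return '<form method="post" action="' . $action . '" accept-charset="UTF-8">'
         . '<input type="text" name="name">'
         . '<input type="submit" value="Send">'
         . '</form>';
}

// Declare the page encoding as well; header() only works before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo render_name_form('/submit.php');
```

Declaring the encoding in both places narrows what the browser is likely to send, though the server should still validate what actually arrives.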

Sorry, I can't add anything beyond this explanation, as it is not really a
PHP question.

The worst case is when You cannot take any action to make the user input
manageable. Then You can try to detect the encoding of the data passed to Your
application. Hopefully there are some articles on the Net covering this task.
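
Full encoding detection is guesswork, but checking whether the data is at least valid UTF-8 is straightforward. A minimal sketch, relying on preg_match() returning false when the u modifier meets an invalid byte sequence; the function name is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: true when $s is a well-formed UTF-8 string.
// The empty pattern matches anything, so only the UTF-8 check can fail.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): é as two UTF-8 bytes
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): é as one ISO-8859-1 byte
```

What to do with input that fails the check (reject it, or convert it from an assumed legacy encoding) is a separate decision this sketch leaves open.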
--
Sincerely Yours, AnrDaemon <an*******@freemail.ru>

Dec 12 '07 #7

P: n/a
"Steve" <no****@example.com> wrote:
>
>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>
>it's not expensive at all. and a dot is any single character...not a greedy
>wild card. the only reason he wouldn't want a dot is because it could be an
>'illegal' character that he's trying to get rid of anyway. as it is, he just
>didn't escape the dot so that it is the character (period) and not the
>directive (any single character).

That statement as written will replace each character with itself, one by
one, repeatedly, for each character in $name.

It is an expensive no-op.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Dec 13 '07 #8

P: n/a
Jerry Stuckle <js*******@attglobal.net> wrote:
>Tim Roberts wrote:
>>"Lo'oris" <lo****@gmail.com> wrote:
>>>I'd like to have a set of "allowed characters", and strip a string
>>>from everything besides those.
>>>I've tried and tried but so far every time I enter strings containing
>>>unicode, it goes mad and output makes no sense.
>>
>>How are you entering "strings containing unicode"? Browsers don't send
>>Unicode.
>
>Excuse me? They sure can, depending on the language being used.

Yes, I know better. That was not the sentiment I intended to convey.

>So the rest of your post is immaterial. Steve's suggestion is a lot closer.

Damn you, Stuckle. How can you see anything at all from up there on your
high horse?

Despite my faux pas, my suggestion was also correct, your invective
notwithstanding.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Dec 13 '07 #9

P: n/a

"Tim Roberts" <ti**@probo.com> wrote in message
news:dp********************************@4ax.com...
>"Steve" <no****@example.com> wrote:
>>
>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>
>>it's not expensive at all. and a dot is any single character...not a
>>greedy wild card. the only reason he wouldn't want a dot is because it
>>could be an 'illegal' character that he's trying to get rid of anyway. as
>>it is, he just didn't escape the dot so that it is the character (period)
>>and not the directive (any single character).
>
>That statement as written will replace each character with itself, one by
>one, repeatedly, for each character in $name.
>
>It is an expensive no-op.

that is true, however each character is analyzed *as a single character*.
there is no marker being set and a pattern being sought beyond that marker
to see if there is another pattern match. markers are set, the replacement
is made to those characters marked, the process is done. one of the least
expensive operations one could ask of preg.

may be a good idea to write a pattern you think would be less expensive that
does similar things...see if you can time-test compare the two. you can also
measure memory consumption too. i don't think you'll find any significant
consumption of resources running the above, esp. comparatively.
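
A rough sketch of such a time-test. The allowed set, the sample input, the run count and the helper time_it are all hypothetical, and the second pattern is just one candidate alternative, not one proposed in the thread. The two patterns agree on this sample; they would differ on input containing newlines, which the unmodified dot does not match.

```php
<?php
// Hypothetical allowed set and sample input, chosen only for this timing run.
$al = "'. ";
$name = str_repeat("O'Neil-2 <b> ", 1000);

// Run a callable $runs times; return elapsed seconds and the last result.
function time_it(callable $fn, int $runs = 200): array
{
    $start = microtime(true);
    $result = null;
    for ($i = 0; $i < $runs; $i++) {
        $result = $fn();
    }
    return [microtime(true) - $start, $result];
}

// Pattern A: the line from the thread, one match per character.
$patternA = "/([a-zA-Z]|-|[$al])|./";
// Pattern B: delete whole runs of characters outside the allowed set.
$patternB = '/[^a-zA-Z' . preg_quote('-' . $al, '/') . ']+/';

[$timeA, $outA] = time_it(function () use ($patternA, $name) {
    return preg_replace($patternA, '$1', $name);
});
[$timeB, $outB] = time_it(function () use ($patternB, $name) {
    return preg_replace($patternB, '', $name);
});

printf("A: %.4fs  B: %.4fs  same output: %s\n",
    $timeA, $timeB, var_export($outA === $outB, true));
printf("peak memory: %d bytes\n", memory_get_peak_usage());
```

The absolute numbers depend on the machine and PHP build, so only the ratio between the two timings is worth reading.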
Dec 13 '07 #10

P: n/a

"Tim Roberts" <ti**@probo.com> wrote in message
news:i2********************************@4ax.com...
>Jerry Stuckle <js*******@attglobal.net> wrote:
>>Tim Roberts wrote:
>>>"Lo'oris" <lo****@gmail.com> wrote:
>>>>I'd like to have a set of "allowed characters", and strip a string
>>>>from everything besides those.
>>>>I've tried and tried but so far every time I enter strings containing
>>>>unicode, it goes mad and output makes no sense.
>>>
>>>How are you entering "strings containing unicode"? Browsers don't send
>>>Unicode.
>>
>>Excuse me? They sure can, depending on the language being used.
>
>Yes, I know better. That was not the sentiment I intended to convey.
>
>>So the rest of your post is immaterial. Steve's suggestion is a lot
>>closer.
>
>Damn you, Stuckle. How can you see anything at all from up there on your
>high horse?

with jerry, it's a matter of people in glass houses. except when you start
throwing rocks at his, he will claim you have no rock and that, in fact,
you've not broken any windows. :)

>Despite my faux pas, my suggestion was also correct, your invective
>notwithstanding.

good word, invective...he likes doing that apparently. at least we encounter
it often in his posts.

cheers.
Dec 13 '07 #11

P: n/a

"Steve" <no****@example.com> wrote in message
news:HI*************@newsfe07.lga...
>
>"Tim Roberts" <ti**@probo.com> wrote in message
>news:dp********************************@4ax.com...
>>"Steve" <no****@example.com> wrote:
>>>
>>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>>
>>>it's not expensive at all. and a dot is any single character...not a
>>>greedy wild card. the only reason he wouldn't want a dot is because it
>>>could be an 'illegal' character that he's trying to get rid of anyway. as
>>>it is, he just didn't escape the dot so that it is the character (period)
>>>and not the directive (any single character).
>>
>>That statement as written will replace each character with itself, one by
>>one, repeatedly, for each character in $name.
>>
>>It is an expensive no-op.
>
>that is true, however each character is analyzed *as a single character*.
>there is no marker being set and a pattern being sought beyond that marker
>to see if there is another pattern match. markers are set, the replacement
>is made to those characters marked, the process is done. one of the least
>expensive operations one could ask of preg.
>
>may be a good idea to write a pattern you think would be less expensive that
>does similar things...see if you can time-test compare the two. you can
>also measure memory consumption too. i don't think you'll find any
>significant consumption of resources running the above, esp.
>comparatively.

sorry tim, i needed to make it clear - as i'd mentioned in one of my first
responses to you - that i think the dot in his preg is just mistakenly not
escaped. i don't think he means "any single character", rather, "a period".
anyway, my comments above are made under that assumption. otherwise you are
more right than before, however more in the line of "that's dumb to put
or'ed patterns when one of those will basically make the other
conditions/patterns moot". still, in this case, the expense is nominal since
all conditions/patterns work over a single character.

just thought i'd clarify.

cheers
Dec 13 '07 #12
