allowed characters in a string (stripping it)

I'd like to have a set of "allowed characters", and strip a string
from everything besides those.

I've tried and tried but so far every time I enter strings containing
unicode, it goes mad and output makes no sense.

I'm sure I'm missing something but no idea what.

$ACCENTED_ALL_LOW="àèìòùáéíóúâêîôûäëïöüãõñçæøåăāĕēĭīŏŭūəß";
$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛÄËÏÖÜÃÕÑÇÆØÅĂĀĔĒĬĪŎŬŪƏß";
$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
$SYMBOLS_NAME=".'- ";

first time I tried using something like this:
$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
(bear in mind I do *NOT* know regexp, a friend wrote this line)

now I tried instead using str_split:
function clear_name_complex ($name, $ok_chars) {
    $ass=str_split($ok_chars);
    $al=array();
    foreach ($ass as $a) {
        $al[$a]=TRUE;
    }

    $s=str_split($name);
    $ret="";
    foreach ($s as $c) {
        if (!$al[$c]) continue;
        $ret.=$c;
    }

    return $ret;
}

still nothing: as soon as the string contains unicode, it goes mad and
output makes no sense.
I believe that's because in both cases it treats unicode characters
by splitting them into single bytes, but still, I'm clueless about what I'm
supposed to do.
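
The byte-splitting suspicion can be checked directly. A minimal sketch, assuming the strings are UTF-8: str_split() works on bytes, while preg_split() with an empty pattern and the /u modifier splits on whole characters. clear_name_utf8 is a hypothetical rewrite of the function above, not something from the original post, and it assumes both arguments are valid UTF-8.

```php
<?php
// "Zoë" written with explicit UTF-8 bytes so the file encoding does not matter.
$name = "Zo\xC3\xAB";

// str_split() cuts into single bytes; "ë" is two bytes in UTF-8.
$bytes = str_split($name);

// With /u, PCRE reads the subject as UTF-8, so an empty pattern
// splits between whole characters instead of between bytes.
$chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), " bytes, ", count($chars), " characters\n";

// Hypothetical character-aware version of the allow-list loop.
function clear_name_utf8($name, $ok_chars) {
    // Build a lookup table keyed by whole characters, not bytes.
    $allowed = array_flip(preg_split('//u', $ok_chars, -1, PREG_SPLIT_NO_EMPTY));
    $ret = '';
    foreach (preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY) as $c) {
        if (isset($allowed[$c])) {
            $ret .= $c;
        }
    }
    return $ret;
}
```

Called as clear_name_utf8($name, $ALPHABET_ALL.$ACCENTED_ALL.$SYMBOLS_NAME), it should keep accented letters intact as long as the script file itself is saved as UTF-8.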
Dec 11 '07 #1

"Lo'oris" <lo****@gmail.com> wrote in message
news:f0**********************************@r60g2000hsc.googlegroups.com...
> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.
>
> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.
>
> I'm sure I'm missing something but no idea what.
>
> $ACCENTED_ALL_LOW=" aaeeiioouu?";
> $ACCENTED_ALL_BIG=" Y AAEEIIOOUU?";
> $ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
> $ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
> $ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
> $ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
> $SYMBOLS_NAME=".'- ";
>
> first time I tried using something like this:
> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
> (bear in mind I do *NOT* know regexp, a friend wrote this line)

=================

preg_replace is your best bet.

$notAllowed = "/[^a-z a aeeiioouu?]/si";
$newString = preg_replace($notAllowed, '', $stringToSearch);

or something to that effect...it tested fine in 'the regulator' - a regex
tester.
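
For UTF-8 input, a negated character class like the one above generally needs the /u modifier, otherwise PCRE compares single bytes rather than characters. A minimal sketch, assuming UTF-8 strings; strip_disallowed and the sample allow-list are made up for illustration:

```php
<?php
// Hypothetical helper: remove every character that is not listed in $allowed.
function strip_disallowed($input, $allowed) {
    if ($allowed === '') {
        return '';   // nothing is allowed, and "[^]" would not be a valid class
    }
    // preg_quote() escapes regex metacharacters (including "-" and "]");
    // the second argument also escapes the "/" delimiter.
    $pattern = '/[^' . preg_quote($allowed, '/') . ']/u';
    // Note: preg_replace() returns null if $input is not valid UTF-8.
    return preg_replace($pattern, '', $input);
}

// ASCII letters, a few name symbols, plus "è" and "é" as explicit UTF-8 bytes.
$allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.'- \xC3\xA8\xC3\xA9";

echo strip_disallowed("Ren\xC3\xA9e O'Neil!?", $allowed), "\n";
```

Building the class from a plain list of characters keeps the allow-list editable without touching the regex itself.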
Dec 11 '07 #2
"Lo'oris" <lo****@gmail.com> wrote:
> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.
>
> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.

How are you entering "strings containing unicode"? Browsers don't send
Unicode.

> I'm sure I'm missing something but no idea what.
>
> $ACCENTED_ALL_LOW=" ??????????? ";
> $ACCENTED_ALL_BIG=" ܟ ??????????? ";
> $ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
> $ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
> $ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
> $ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
> $SYMBOLS_NAME=".'- ";
>
> first time I tried using something like this:
> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
> (bear in mind I do *NOT* know regexp, a friend wrote this line)

This matches one of 4 things:
    [a-zA-Z]   an upper or lower case letter,
    |          or
    -          a hyphen
    |          or
    [$al]      whatever is contained in $al
    |          or
    .          any character,

and replaces it with whatever was matched. Clearly, this is a very
expensive no-op. You do NOT want the '.' in there.
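
One way to check what the pattern actually does is to run it on a small ASCII sample. In the sketch below the value of $al and the sample name are assumptions (the original post never shows how $al was set). Note that a character which only matches the final "." leaves capture group 1 empty, so '$1' puts nothing back for it.

```php
<?php
// Assumed allow-list; ASCII only, so byte vs. character handling is not an issue here.
$al = ".' ";
$name = "J0hn-O'Neil_3";

// Characters matched by the parenthesised part are captured and re-inserted via $1.
// Anything else is consumed by ".", where group 1 is empty.
$result = preg_replace("/([a-zA-Z]|-|[$al])|./", '$1', $name);

echo $result, "\n";
```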

I would suggest this:

preg_match_all("/[$al]+/", $name, $out);
$result = implode('', $out[0]);
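
If $al contains multi-byte UTF-8 characters, the same approach probably needs the /u modifier so that the class is read as characters rather than bytes. A minimal sketch with made-up sample values, assuming valid UTF-8 input:

```php
<?php
// Assumed allow-list: it is placed inside [...], so ranges such as a-z work.
// "è" and "é" are written as explicit UTF-8 bytes.
$al = "a-zA-Z.' \xC3\xA8\xC3\xA9";
$name = "Ren\xC3\xA9e_42_O'Neil";

// /u makes PCRE treat both pattern and subject as UTF-8.
preg_match_all("/[$al]+/u", $name, $out);

// $out[0] holds each run of allowed characters; gluing them drops the rest.
$result = implode('', $out[0]);

echo $result, "\n";
```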
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Dec 11 '07 #3

"Tim Roberts" <ti**@probo.com> wrote in message
news:gt********************************@4ax.com...
> "Lo'oris" <lo****@gmail.com> wrote:
>> I'd like to have a set of "allowed characters", and strip a string
>> from everything besides those.
>>
>> I've tried and tried but so far every time I enter strings containing
>> unicode, it goes mad and output makes no sense.
>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.
>
>> I'm sure I'm missing something but no idea what.
>>
>> $ACCENTED_ALL_LOW=" ??????????? ";
>> $ACCENTED_ALL_BIG=" Y ??????????? ";
>> $ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>> $ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>> $ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>> $ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>> $SYMBOLS_NAME=".'- ";
>>
>> first time I tried using something like this:
>> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>> (bear in mind I do *NOT* know regexp, a friend wrote this line)
>
> This matches one of 4 things:
>     [a-zA-Z]   an upper or lower case letter,
>     |          or
>     -          a hyphen
>     |          or
>     [$al]      whatever is contained in $al
>     |          or
>     .          any character,
>
> and replaces it with whatever was matched. Clearly, this is a very
> expensive no-op. You do NOT want the '.' in there.
>
> I would suggest this:
>
> preg_match_all("/[$al]+/", $name, $out);
> $result = implode('', $out[0]);

it's not expensive at all. and a dot is any single character...not a greedy
wild card. the only reason he wouldn't want a dot is because it could be an
'illegal' character that he's trying to get rid of anyway. as it is, he just
didn't escape the dot so that it is the character (period) and not the
directive (any single character).
Dec 11 '07 #4
Tim Roberts wrote:
> "Lo'oris" <lo****@gmail.com> wrote:
>> I'd like to have a set of "allowed characters", and strip a string
>> from everything besides those.
>> I've tried and tried but so far every time I enter strings containing
>> unicode, it goes mad and output makes no sense.
>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.

Excuse me? They sure can, depending on the language being used.

So the rest of your post is immaterial. Steve's suggestion is a lot closer.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Dec 11 '07 #5
Steve wrote:
> damnit if everyone in the world won't speak and use the english
> language!

Some English words use non-ASCII characters too.

There are obviously those adopted from other languages such as résumé,
café and crèche. Many such words will lose their accents when they come to
English, but the examples given in the previous sentence typically retain
them.

The forenames Chloë and Zoë use a diaeresis mark to indicate that the 'e'
should be pronounced independently from the 'o' sound.

The surname Brontë has a similar mark, though that was a fanciful addition
by their father. They were originally from Ireland and thought that the
English might have a hard time correctly pronouncing Proinntigh, so
anglicised it. One might wonder why he thought adding an 'ë' counted as
anglicising, but at that time, the mark was quite common, used in words
such as coöperate, coördinate, reënact and noöne. (Now that the mark has
become less common, these words are somewhat more awkward to spell. Using
a hyphen looks wrong, but the words look even worse without!) It is still
retained in naïve.

Also, 'æ' and to a lesser extent, 'œ' are used in many words. (Try looking
up pre-mediæval diseases of the œsophagus in an encyclopædia, and you
might find that many of them have names which are an onomatopœia.)

All of these are still readable when transliterated into a purely ASCII
alphabet, though 'resume' is ambiguous.

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 3 days, 21:11.]

Sharing Music with Apple iTunes
http://tobyinkster.co.uk/blog/2007/1...tunes-sharing/
Dec 11 '07 #6
Greetings, Lo'oris.
In reply to Your message dated Tuesday, December 11, 2007, 05:47:08,
> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.
> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.

I think we should stop here and decide what we can do to be sure we have
proper user input before dealing with it.

Typical case: You're working with a posting form and forget to set the
accept-charset attribute on it, leaving the user agent to decide on its own how
to send the data. In most cases it returns the data in the encoding Your server
used when sending Your pages to the user. But that holds only for a fair
browser, like Opera, and only if Your server declares an encoding at all.

Sorry, I can't add anything beyond this explanation, as it is not a PHP
question in general.

The worst case is when You cannot take any action to make the user input
usable for You. Then You can try to detect the encoding of the data passed to
Your application. I hope there are some articles on the Net covering this task.
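
On the PHP side, one possible way to approach the detection step is to validate the input as UTF-8 and convert it only when that check fails. A minimal sketch, assuming the mbstring extension is loaded; to_utf8 is a hypothetical helper, and ISO-8859-1 as the fallback encoding is only a guess that would have to match what the form really sends:

```php
<?php
// Hypothetical helper: return $input as UTF-8.
// $fallback is the encoding assumed for input that is not valid UTF-8.
function to_utf8($input, $fallback = 'ISO-8859-1') {
    // mb_check_encoding() reports whether the byte sequence is valid UTF-8.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise reinterpret the bytes in the fallback encoding and convert.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// "café" in ISO-8859-1 (one byte for "é") becomes two-byte UTF-8.
echo bin2hex(to_utf8("caf\xE9")), "\n";
```

This cannot tell similar single-byte encodings apart, so declaring the charset on the page and the form remains the more reliable route.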
--
Sincerely Yours, AnrDaemon <an*******@freemail.ru>

Dec 12 '07 #7
"Steve" <no****@example.com> wrote:
>
>> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>
> it's not expensive at all. and a dot is any single character...not a greedy
> wild card. the only reason he wouldn't want a dot is because it could be an
> 'illegal' character that he's trying to get rid of anyway. as it is, he just
> didn't escape the dot so that it is the character (period) and not the
> directive (any single character).

That statement as written will replace each character with itself, one by
one, repeatedly, for each character in $name.

It is an expensive no-op.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Dec 13 '07 #8
Jerry Stuckle <js*******@attglobal.net> wrote:
> Tim Roberts wrote:
>> "Lo'oris" <lo****@gmail.com> wrote:
>>> I'd like to have a set of "allowed characters", and strip a string
>>> from everything besides those.
>>> I've tried and tried but so far every time I enter strings containing
>>> unicode, it goes mad and output makes no sense.
>>
>> How are you entering "strings containing unicode"? Browsers don't send
>> Unicode.
>
> Excuse me? They sure can, depending on the language being used.

Yes, I know better. That was not the sentiment I intended to convey.

> So the rest of your post is immaterial. Steve's suggestion is a lot closer.

Damn you, Stuckle. How can you see anything at all from up there on your
high horse?

Despite my faux pas, my suggestion was also correct, your invective
notwithstanding.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Dec 13 '07 #9

"Tim Roberts" <ti**@probo.com> wrote in message
news:dp********************************@4ax.com...
> "Steve" <no****@example.com> wrote:
>>
>>> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>
>> it's not expensive at all. and a dot is any single character...not a greedy
>> wild card. the only reason he wouldn't want a dot is because it could be an
>> 'illegal' character that he's trying to get rid of anyway. as it is, he just
>> didn't escape the dot so that it is the character (period) and not the
>> directive (any single character).
>
> That statement as written will replace each character with itself, one by
> one, repeatedly, for each character in $name.
>
> It is an expensive no-op.

that is true, however each character is analyzed *as a single character*.
there is no marker being set and a pattern being sought beyond that marker
to see if there is another pattern match. markers are set, the replacement
is made to those characters marked, the process is done. one of the least
expensive operations one could ask of preg.

may be a good idea to write a pattern you think would be less expensive that
does similar things...see if you can time-test compare the two. you can
measure memory consumption too. i don't think you'll find any significant
consumption of resources running the above, esp. comparatively.
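
A rough way to run such a comparison is sketched below. Everything in it is an assumption made for illustration: the value of $al, the sample input, the negated-class alternative, and the time_it helper. The timings will vary by machine and PHP version, so no numbers are claimed here.

```php
<?php
// Assumed ASCII allow-list and a repeated sample string.
$al = ".' ";
$name = str_repeat("J0hn-O'Neil_3 ", 1000);

// Hypothetical helper: run $fn $rounds times, return [seconds, last result].
function time_it($fn, $rounds = 100) {
    $result = null;
    $start = microtime(true);
    for ($i = 0; $i < $rounds; $i++) {
        $result = $fn();
    }
    return array(microtime(true) - $start, $result);
}

// Pattern from the thread: capture allowed characters, consume the rest with ".".
list($t1, $r1) = time_it(function () use ($al, $name) {
    return preg_replace("/([a-zA-Z]|-|[$al])|./", '$1', $name);
});

// Alternative for comparison: delete anything outside a negated class.
list($t2, $r2) = time_it(function () use ($al, $name) {
    return preg_replace("/[^a-zA-Z\\-$al]/", '', $name);
});

printf("capture + dot: %.4fs\nnegated class: %.4fs\n", $t1, $t2);
printf("same output:   %s\n", $r1 === $r2 ? 'yes' : 'no');
printf("peak memory:   %d bytes\n", memory_get_peak_usage());
```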
Dec 13 '07 #10
