Removing Bad Words

Looking for suggestions on how to handle bad words that might
get passed in through $_GET['item'] variables.

My first thoughts included using str_replace() to strip out such
content, but then one ends up looking for characters that wrap
around the stripped characters and it ends up as a recursive
ordeal that fails to identify a poorly constructed $_GET['item']
variable (when someone hand-types the item into the line and
makes a simple typing error).

So the next thoughts involved employing a list of good words
and if any word in the $_GET['item'] list doesn't fall into the
list of good words, then an empty string gets returned.

Any suggestions on how to handle this?

Thanks,

Jim Carlock

Feb 22 '06 #1
Jim Carlock wrote:
Any suggestions on how to handle this?


You will have to implement "fuzzy logic" that can filter not
only "badword" but also "b a d w o r d", "b@d word", "b*dword", et cetera.

Although you should be able to catch some of those, the best filter is still
the human moderator...
JW
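
A minimal sketch of the normalisation idea above, not a complete filter. The banned list, the look-alike table, and the function names NormalizeForFilter() and LooksBad() are assumptions made for illustration. It catches the spaced-out and "@" variants, but not the wildcard form "b*dword", which would need real fuzzy matching.

```php
<?php
// Hypothetical banned list; "badword" is the placeholder used in the post above.
$aBanned = array('badword');

// Map a few common look-alike characters to letters (an assumed, incomplete table),
// then drop everything that is not a letter.
function NormalizeForFilter($sText) {
    $sText = strtolower($sText);
    $sText = strtr($sText, array('@' => 'a', '0' => 'o', '1' => 'i', '3' => 'e', '$' => 's'));
    // "b a d w o r d" collapses to "badword" once spaces and punctuation are removed.
    return preg_replace('/[^a-z]/', '', $sText);
}

// True if any banned word appears in the normalised text.
function LooksBad($sText, $aBanned) {
    $sNorm = NormalizeForFilter($sText);
    foreach ($aBanned as $sBad) {
        if (strpos($sNorm, $sBad) !== false) {
            return true;
        }
    }
    return false;
}
```

Because spaces are removed before matching, innocent phrases that happen to contain a banned word across a word boundary will also be flagged, which is one more reason to treat a hit as "needs review" rather than proof.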
Feb 22 '06 #2
On Wed, 22 Feb 2006 19:36:41 GMT, "Jim Carlock" <an*******@127.0.0.1>
wrote:
Looking for suggestions on how to handle bad words that might
get passed in through $_GET['item'] variables.

My first thoughts included using str_replace() to strip out such
content, but then one ends up looking for characters that wrap
around the stripped characters and it ends up as a recursive
ordeal that fails to identify a poorly constructed $_GET['item']
variable (when someone hand-types the item into the line and
makes a simple typing error).

So the next thoughts involved employing a list of good words
and if any word in the $_GET['item'] list doesn't fall into the
list of good words, then an empty string gets returned.

Any suggestions on how to handle this?


Automatic removal is just about impossible to do reliably. (People
living in places such as Sussex and Scunthorpe have complained that
their addresses get rejected by some sites.) If at all possible use a
matching routine to detect doubtful entries and place them on one side
for subsequent manual review.
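
One way the "detect and set aside" approach might look, as a sketch only. The watch list and the function name FindDoubtfulWords() are hypothetical. Matching on word boundaries avoids flagging a place name merely because a watched word is embedded in it; anything that does match would go to a review queue instead of being rejected outright.

```php
<?php
// Hypothetical watch list; "sex" stands in for a word that also occurs inside place names.
$aWatch = array('sex');

// Returns the watch-list words that occur as whole words in $sText.
function FindDoubtfulWords($sText, $aWatch) {
    $aHits = array();
    foreach ($aWatch as $sWord) {
        // \b anchors the match at word boundaries; preg_quote() escapes regex metacharacters.
        if (preg_match('/\b' . preg_quote($sWord, '/') . '\b/i', $sText)) {
            $aHits[] = $sWord;
        }
    }
    return $aHits;
}
```

An empty result means the entry can pass; a non-empty one means a human should look at it.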

--
Stephen Poley

http://www.xs4all.nl/~sbpoley/webmatters/
Feb 22 '06 #3
Jim Carlock wrote:
Looking for suggestions on how to handle bad words that might
get passed in through $_GET['item'] variables.

My first thoughts included using str_replace() to strip out such
content, but then one ends up looking for characters that wrap
around the stripped characters and it ends up as a recursive
ordeal that fails to identify a poorly constructed $_GET['item']
variable (when someone hand-types the item into the line and
makes a simple typing error).

So the next thoughts involved employing a list of good words
and if any word in the $_GET['item'] list doesn't fall into the
list of good words, then an empty string gets returned.

Any suggestions on how to handle this?

Thanks,

Jim Carlock


Jim, not knowing your requirements or what the website will be used for makes it
a little difficult to give you a solution. Would a drop-down list of acceptable
words be better than expecting the user to type them correctly?

That being said, if you type as badly as I do, you have probably made all of teh
tpying errors most commonly seen. Including a str_replace() for all of those
examples would not be that difficult. Better yet, put it into a JavaScript function
and let the client side handle the word corrections (onclick or onsubmit).

I have worked with several products (OS and database) that will auto-correct
some commands, like eixt = EXIT or comit = COMMIT. The Digital TOPS-10/20 OS
that ran on the KL10/20 systems (36-bit, circa mid-'70s to early '80s) would prompt
you for a yes/no:
did you mean [whatever the correct spelling of the command is]? Pretty cool for
its day...
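
A "did you mean" prompt of that kind could be approximated in PHP with the built-in levenshtein() function. This is only a sketch: the function name SuggestWord() and the default threshold of 2 edits are assumptions, not anything taken from TOPS-10/20.

```php
<?php
// Returns the closest known word within $iMaxDistance edits, or '' if none is close enough.
function SuggestWord($sInput, $aKnown, $iMaxDistance = 2) {
    $sBest = '';
    $iBest = $iMaxDistance + 1;
    foreach ($aKnown as $sWord) {
        // levenshtein() counts single-character inserts, deletes and substitutions.
        $iDist = levenshtein(strtolower($sInput), strtolower($sWord));
        if ($iDist < $iBest) {
            $iBest = $iDist;
            $sBest = $sWord;
        }
    }
    return $sBest;
}
```

A swapped pair of letters such as "eixt" counts as two edits here, which is why the threshold is 2 rather than 1.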

--
Michael Austin.
DBA Consultant
Donations welcomed. Http://www.firstdbasource.com/donations.html
:)
Feb 22 '06 #4
Jim Carlock wrote:
So the next thoughts involved employing a list of good words
and if any word in the $_GET['item'] list doesn't fall into the
list of good words, then an empty string gets returned.

Any suggestions on how to handle this?

"Michael Austin" replied:
Jim, not knowing your requirements or what the website will be
used for makes it a little difficult to give you a solution. Would
a drop-down list of acceptable words be better than expecting
the user to type them correctly?

Well, a drop-down list will work for some things, but
anyone can edit the line of text in the address bar. So instead
of filtering for bad words, I'm looking for suggestions on how to
parse through a list of good words (stored inside an array), and if
any of the words in the address bar fail to match the words in the
array, the individual gets routed to a bad-word page (the website
homepage). I see a database as a very useful option but I'm working
with PHP arrays at the moment. The database will be the future, but
for the moment, I think an array of 200 possible words might work
very well.

Just need an effective way to compare a word to a list of words
inside an array and return true if it matches, false if it fails the
match.

My thoughts include:

function IsValidWord($sCheckThis) {
    global $aWords;
    foreach ($aWords as $sWord) {
        if ($sWord === $sCheckThis) {
            return(TRUE);
        }
    }
    return(FALSE);
}

So I'm looking for any other suggestions.

"Michael Austin" also wrote:
That being said, if you type as badly as I do, you have probably
made all of teh tpying errors most commonly seen. Including a
str_replace() for all of those examples would not be that difficult
- better yet include it into a javascript and let the client-side
handle the word-corrections (onclick or onsubmit).


The list of words is to remain on the server, so JavaScript in this
case, seems to be an invalid option. Any mistyped words are to
route the client to the homepage, or perhaps present the page in
question with no selections selected. Either/or seems appropriate
in this case.
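
The server-side routing described here might be sketched as follows. The whitelist contents and the function name AllWordsValid() are hypothetical, and the redirect target "/" is an assumption standing in for the homepage.

```php
<?php
// Hypothetical whitelist, stored lowercase; the real list would hold the ~200 allowed words.
$aWords = array('good', 'better', 'best');

// True only if $sItem is non-empty and every whitespace-separated word is in the whitelist.
function AllWordsValid($sItem, $aWords) {
    $aParts = preg_split('/\s+/', strtolower(trim($sItem)), -1, PREG_SPLIT_NO_EMPTY);
    if (count($aParts) == 0) {
        return false;
    }
    foreach ($aParts as $sPart) {
        // Third argument true = strict comparison, so only identical strings match.
        if (!in_array($sPart, $aWords, true)) {
            return false;
        }
    }
    return true;
}

// Usage in a page script (commented out so the sketch runs standalone):
// $sItem = isset($_GET['item']) ? $_GET['item'] : '';
// if (!AllWordsValid($sItem, $aWords)) {
//     header('Location: /');   // send the visitor to the homepage
//     exit;
// }
```

Since the list never leaves the server, a hand-edited address bar cannot get around the check.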

<snip>...</snip>

Jim Carlock
Post replies to the group.
Feb 22 '06 #5
The function you need is in_array() although an associative array would
be more efficient. E.g.

$good_hash = array(
    'good' => true,
    'better' => true,
    'best' => true,
    // ...
);

if (!array_key_exists(strtolower($word), $good_hash)) {
    // ...
}
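
If the words already live in a plain list, the lookup table does not have to be typed out by hand. A small sketch, assuming a hypothetical list $aList and helper IsGoodWord(): array_flip() turns each word into a key, and array_key_exists() then does a direct key lookup instead of the linear scan that in_array() performs.

```php
<?php
// Hypothetical plain list of allowed words.
$aList = array('Good', 'Better', 'Best');

// Turn the list into a lookup table keyed by the lowercase word.
// array_flip() swaps keys and values, so each word becomes a key.
$good_hash = array_flip(array_map('strtolower', $aList));

// Case-insensitive membership test against the lookup table.
function IsGoodWord($word, $good_hash) {
    return array_key_exists(strtolower($word), $good_hash);
}
```

With a couple of hundred words either approach is fast enough; the keyed version mainly pays off as the list grows.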

Feb 23 '06 #6
On 23 Feb 2006 00:29:48 GMT,
"Chung Leong" <ch***********@hotmail.com> posted:
The function you need is in_array() although an associative array
would be more efficient. E.g.


$good_hash = array(
    'good' => true,
    'better' => true,
    'best' => true,
    // ...
);

if (!array_key_exists(strtolower($word), $good_hash)) {
    // ...
}

Thanks, Chung. It seems like it's best to store everything inside the
array as lowercase and then fill in some appropriate variables for
display.

I initially started out with mixed-case arrays. For example:

// array of states
function Create_USA_States_Array() {
    $aStates = array(
        // http://www.usps.com/ncsc/lookups/usp...eviations.html
        array("Alabama", "AL"),
        array("Alaska", "AK"),
        array("Arizona", "AZ"),
        array("Arkansas", "AR"),
        array("California", "CA"),
        array("Colorado", "CO"),
        array("Connecticut", "CT"),
        array("Delaware", "DE"),
        array("Florida", "FL"),
        array("Georgia", "GA"),
        array("Hawaii", "HI"),
        array("Idaho", "ID"),
        array("Illinois", "IL"),
        array("Indiana", "IN"),
        array("Iowa", "IA"),
        array("Kansas", "KS"),
        array("Kentucky", "KY"),
        array("Louisiana", "LA"),
        array("Maine", "ME"),
        array("Maryland", "MD"),
        array("Massachusetts", "MA"),
        array("Michigan", "MI"),
        array("Minnesota", "MN"),
        array("Mississippi", "MS"),
        array("Missouri", "MO"),
        array("Montana", "MT"),
        array("Nebraska", "NE"),
        array("Nevada", "NV"),
        array("New Hampshire", "NH"),
        array("New Jersey", "NJ"),
        array("New Mexico", "NM"),
        array("New York", "NY"),
        array("North Carolina", "NC"),
        array("North Dakota", "ND"),
        array("Ohio", "OH"),
        array("Oklahoma", "OK"),
        array("Oregon", "OR"),
        array("Pennsylvania", "PA"),
        array("Rhode Island", "RI"),
        array("South Carolina", "SC"),
        array("South Dakota", "SD"),
        array("Tennessee", "TN"),
        array("Texas", "TX"),
        array("Utah", "UT"),
        array("Vermont", "VT"),
        array("Virginia", "VA"),
        array("Washington", "WA"),
        array("Washington, D.C.", "DC"),
        array("West Virginia", "WV"),
        array("Wisconsin", "WI"),
        array("Wyoming", "WY"));
    return($aStates);
}

The function established to return a state name works as follows:

// this function is incomplete
// PURPOSE: RETURN statename from parameter passed in
// INPUT: City-State String, OPTIONAL default string
// RETURNS: empty string if invalid parameter requested
// $sDS represents default state name to return
// $sCS = $_GET['citystate'];
// "Charlotte NC" or "Charlotte North Carolina" or "Charlotte" or
// "usertyped garbage"
function GetStateNameFromCityState($sCS, $sDS = "") {
    $sStateAbbr = trim($sCS);
    $iLen = strlen($sStateAbbr);
    // first check to see if the string is empty or too short
    if ($iLen < 2) { return($sDS); }
    if (GetStateFromAbbr($sStateAbbr)) {
        // a valid abbreviation was passed in
        return(GetStateFromAbbr($sStateAbbr));
    }
    $aStates = Create_USA_States_Array();
    // possible state name in parameter so check for a state name,
    // before checking against abbreviations
    foreach ($aStates as $aState) {
        // state name: $aState[0]
        if (stristr($sStateAbbr, $aState[0]) !== FALSE) {
            // return state name
            return($aState[0]);
        }
    }
    // no valid statename found, so start abbreviation checks
    // first determine if there's an abbreviation present
    // explode(separator, string to separate)
    $aWords = explode(" ", $sStateAbbr);
    $yAbbrFound = FALSE;
    // check for abbreviations
    foreach ($aWords as $sWord) {
        if (strlen($sWord) == 2) {
            // assume a 2-letter word represents a state abbreviation
            $sStateAbbr = $sWord;
            $yAbbrFound = TRUE;
            break;
        }
    }
    if (!$yAbbrFound) {
        // no abbreviation to check, so return the default string
        return($sDS);
    }
    // now validate abbreviation found
    // COULD this fail? NEEDS MORE TESTING.
    foreach ($aStates as $aState) {
        // now check against abbreviations: $aState[1]
        if (stristr($sStateAbbr, $aState[1]) !== FALSE) {
            // return state name in proper formatting
            return($aState[0]);
        }
    }
    // return the default string (empty unless given) when it all fails
    return($sDS);
}

Haven't fully tested the user-typed garbage being passed in, but
my question specifically involves configuring the state array, and
alternative suggestions for this.

Note, that the above function actually returns what's found inside
the predefined array, rather than what's found in the address-bar.
This in effect, should get me words proper for HTML presentation,
where I don't have to mess with capitalizing ALL state abbrev's,
or capitalizing the first word of anything.

I still need to test the code above some more, so if anyone happens
to catch a flaw please point it out.
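
One flaw that seems to follow from the array order: the state-name loop returns the first name found anywhere in the string, and "Virginia" comes before "West Virginia" in the array, so "Charleston West Virginia" would come back as "Virginia". "Washington" shadows "Washington, D.C." in the same way. The two-letter check also looks fragile, since "La" in "La Jolla" would be read as the abbreviation for Louisiana. A possible fix for the first problem, sketched with a shortened copy of the array and the hypothetical names CompareNameLength() and FindStateName(), is to try the longest names first:

```php
<?php
// Shortened, hypothetical copy of the states array for illustration.
$aStates = array(
    array("Virginia", "VA"),
    array("Washington", "WA"),
    array("Washington, D.C.", "DC"),
    array("West Virginia", "WV"),
);

// Compare callback: longer state names sort first.
function CompareNameLength($a, $b) {
    return strlen($b[0]) - strlen($a[0]);
}

// Returns the longest state name contained in $sCS, or $sDS if none matches.
function FindStateName($sCS, $aStates, $sDS = "") {
    // Arrays are passed by value, so this sorts a local copy only.
    usort($aStates, 'CompareNameLength');
    foreach ($aStates as $aState) {
        if (stristr($sCS, $aState[0]) !== false) {
            return $aState[0];
        }
    }
    return $sDS;
}
```

Sorting once when the array is built, rather than on every call, would be the obvious refinement.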

And again back to the question in the topic... "Lowercase Versus
Mixed-case" words inside the array that holds the states and state
abbreviations. Anyone here that knows of a better way to do this?
Another array might get created to identify each city with its proper
state, as the list of targeted cities is over 100 at the moment.

I plan on getting something going whereby a new array appears as
follows:

"city name", iStateNumber

"state number" represents an integer 0 to 50 (51 entries: the 50 states
plus Washington, D.C.). Duplicate city names could exist, so the database
combines the "city name" and the "state number" into an index. The "state
number" ends up being a pointer to the StateID in the State
database. So continuing along the lines of the indexed arrays,
as presented by Chung Leong, how would I go about indexing
such an array as above and would indexing be appropriate for
such?
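
One possible shape for such an index, offered as a sketch rather than the answer: key the array by lowercase city name and let each entry hold a list of state numbers, so duplicate city names simply collect more than one number. The sample cities and the function names BuildCityIndex() and GetStateNumbersForCity() are made up for illustration; the numbers are positions in the states array above (Illinois 12, Missouri 24, North Carolina 32).

```php
<?php
// Hypothetical sample data: array(city name, state number).
$aCities = array(
    array("Charlotte", 32),
    array("Springfield", 12),
    array("Springfield", 24),
);

// Builds: lowercase city name => list of state numbers.
function BuildCityIndex($aCities) {
    $aIndex = array();
    foreach ($aCities as $aCity) {
        $sKey = strtolower($aCity[0]);
        if (!isset($aIndex[$sKey])) {
            $aIndex[$sKey] = array();
        }
        $aIndex[$sKey][] = $aCity[1];
    }
    return $aIndex;
}

// Returns every state number recorded for the city, or an empty array.
function GetStateNumbersForCity($sCity, $aIndex) {
    $sKey = strtolower(trim($sCity));
    return isset($aIndex[$sKey]) ? $aIndex[$sKey] : array();
}

$aIndex = BuildCityIndex($aCities);
```

When a city maps to more than one state, the caller still needs the state from the request to pick the right one, which mirrors the combined city-plus-state index planned for the database.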

Thanks, Chung Leong. I did put the indexed array into play in
another function where the number of items is greater. I didn't
know how to work it into this particular array (or an array with
multiple fields with duplicate records).

Jim Carlock
Post replies to the group.
Feb 24 '06 #7
Jim Carlock wrote:
And again back to the question in the topic... "Lowercase Versus
Mixed-case" words inside the array that holds the states and state
abbreviations. Anyone here that knows of a better way to do this?
Another array might get created to identify each city with its proper
state, as the list of targeted cities is over 100 at the moment.


Just have the static array be in mixed case, then generate the other
one(s) programmatically:

$states = array(
    "AL" => "Alabama",
    // ...
    "WY" => "Wyoming"
);

$state_hash = array_flip(array_map('strtolower', $states));
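
To show how the two arrays might be used together, here is a sketch with an abbreviated copy of $states and a hypothetical helper LookupStateName(). It accepts either an abbreviation or a full name in any case and always returns the mixed-case name stored in the static array.

```php
<?php
// Abbreviated, illustrative copy of the mixed-case array.
$states = array(
    "AL" => "Alabama",
    "NC" => "North Carolina",
    "WY" => "Wyoming"
);

// Lowercase state name => abbreviation, generated from the static array.
$state_hash = array_flip(array_map('strtolower', $states));

// Returns the properly cased state name, or '' if the input is not recognised.
function LookupStateName($sInput, $states, $state_hash) {
    $sInput = trim($sInput);
    // First try the input as an abbreviation.
    $sUpper = strtoupper($sInput);
    if (isset($states[$sUpper])) {
        return $states[$sUpper];
    }
    // Then try it as a full state name.
    $sLower = strtolower($sInput);
    if (isset($state_hash[$sLower])) {
        return $states[$state_hash[$sLower]];
    }
    return '';
}
```

Only the static array has to be maintained by hand; the lowercase lookup is rebuilt from it on each request.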

Feb 25 '06 #8
