I'm trying to make a naughty word filter. It removes bad words fine, but words that merely contain a bad word, like "assist" and "asses", get caught in the filter as well. Strangely, though, if the sentence is "My asses to assist me." the clean version reads "My asses to ***ist me." It seems to let the first occurrence inside another word through, but then blocks the rest. Any ideas? My script is below. Thanks.

function cleanWords($value) {

    /* strip naughty words */
    $bad_word_file = 'standards/badwords.txt';
    $strtofile = fopen($bad_word_file, "r");
    $badwords = explode("\n", fread($strtofile, filesize($bad_word_file)));
    fclose($strtofile);

    for ($i = 0; $i < count($badwords); $i++) {
        $wordlist .= str_replace(chr(13),'',$badwords[$i]).'|';
    }
    $wordlist = substr($wordlist,0,-1);

    $value = preg_replace("/\b($wordlist)\b/ie", 'preg_replace("/./","*","\\1")', $value);
    return $value;

}

Atli 5,058
Expert 4TB
Hey.
If you print the $wordlist, does it look right?
I tested this by just creating the $wordlist manually and it seemed to work fine.
Yes, $wordlist is correct. If it helps, the word list is just over 1,000 words.
Use the whitespace character with "or" conditions:
(\s|^)(badword1|badword2)(\s|$)
That checks for either whitespace before the word or the start of the string, then for either whitespace after the word or the end of the string.
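A minimal sketch of that suggestion, for anyone who wants to try it. The function name maskSpaced and the sample words are made up for illustration. Because the pattern captures the surrounding whitespace, the callback puts groups 1 and 3 back around the masked word.

```php
<?php
// Hypothetical helper: masks a listed word only when it is delimited by
// whitespace or by the start/end of the string.
function maskSpaced(string $text): string
{
    // Group 1: leading whitespace or start of string.
    // Group 2: the word itself (sample words, not a real list).
    // Group 3: trailing whitespace or end of string.
    $pattern = '/(\s|^)(darn|heck)(\s|$)/i';

    return preg_replace_callback($pattern, function ($m) {
        // Keep the delimiters, replace only the word with asterisks.
        return $m[1] . str_repeat('*', strlen($m[2])) . $m[3];
    }, $text);
}

echo maskSpaced('that darn cat darns socks'), "\n"; // that **** cat darns socks
```

Two limits worth knowing: a word followed by punctuation ("darn,") is not caught, and since the trailing whitespace is consumed by the match, the second of two listed words separated by a single space is skipped ("darn heck" becomes "**** heck"). Lookarounds such as (?<!\S) and (?!\S) would avoid consuming the whitespace.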
I ended up finding that the entry "a.s.s." was in my list. I think the dots were messing up the expression. For those interested, this is my new code. Thanks for all the suggestions that helped get it where it is.
$_SESSION[wordlist] = join("|", array_map('trim', file('standards/badwords.txt')));

function cleanWords($value) {

    global $_SESSION;

    $value = preg_replace("/\b($_SESSION[wordlist])\b/ie", 'str_repeat("*", strlen("\\1")) ', $value);
    return $value;

}

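On the dots: an unescaped "." in a pattern matches any character, so a list entry containing dots can match ordinary words. The sketch below shows one way to guard against that with preg_quote(). The helper names and the sample list are hypothetical, and the list is passed in directly rather than read from standards/badwords.txt. It also uses preg_replace_callback(), since the "e" modifier was removed in PHP 7.0.

```php
<?php
// Hypothetical helper: builds the alternation, optionally escaping each entry.
function buildPattern(array $words, bool $escape = true): string
{
    $parts = [];
    foreach ($words as $word) {
        $word = trim($word);
        if ($word === '') {
            continue; // ignore blank lines in the list
        }
        // preg_quote() makes ".", "*", "?" etc. literal; the second argument
        // also escapes the "/" delimiter.
        $parts[] = $escape ? preg_quote($word, '/') : $word;
    }
    return '/\b(' . implode('|', $parts) . ')\b/i';
}

// Hypothetical helper: replaces each match with asterisks of the same length.
function maskWords(string $value, string $pattern): string
{
    // A callback stands in for the "e" modifier, which PHP 7.0 removed.
    return preg_replace_callback($pattern, function ($m) {
        return str_repeat('*', strlen($m[1]));
    }, $value);
}

$words = ['b.s.', 'heck']; // sample entries, one of them dotted
$text  = 'Do your best, heck';

echo maskWords($text, buildPattern($words, false)), "\n"; // Do your ****, ****
echo maskWords($text, buildPattern($words)), "\n";        // Do your best, ****
```

One caveat: \b needs a word character on one side, so an escaped entry that ends in a dot will not match when it is followed by a space or the end of the string. Dotted entries may need a different boundary test than \b.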
Atli 5,058
Expert 4TB
Hey.
Glad you got it working.
However, I would consider using a different method.
Putting the whole thing into the session is very inefficient. The list remains constant for every user, and rarely changes (if ever), right?
If so, then compiling it for every user like that and storing it in separate sessions for each one is just doing two things: eating up resources and cluttering the sessions with duplicate data.
You would be far better off compiling the regular expression into a common file, shared between all users.
This is how I would do this. (I wouldn't usually make a ready-to-use code example, but since you already solved this on your own...)
<?php

define("BADWORDS_RAW_FILE", "/path/to/badwords.txt");
define("BADWORDS_EXP_FILE", "/path/to/badwords_expression.txt");

/**
 * Returns a regular expression that can be used to check
 * for "bad" words. Returns an expression in the format:
 *   /\b(list|of|bad|words)\b/i
 */
function getBadWordsRegexp()
{
    $regexp = "";

    // Try to fetch an existing expression.
    if(!file_exists(BADWORDS_EXP_FILE) ||
       filesize(BADWORDS_EXP_FILE) <= 0 ||
       ($regexp = file_get_contents(BADWORDS_EXP_FILE)) === false)
    {
        // Make sure the raw word list exists
        if(!file_exists(BADWORDS_RAW_FILE)) {
            trigger_error("The bad words file does not exist.", E_USER_ERROR);
            return false;
        }

        // Read the words, dropping line endings and blank lines
        $words = array_filter(array_map('trim', file(BADWORDS_RAW_FILE)), 'strlen');

        // Escape regex metacharacters (e.g. the dots in "a.s.s.") and the "/" delimiter
        $words = array_map(function ($word) { return preg_quote($word, '/'); }, $words);

        // Compile the regular expression
        $regexp = '/\b(' . join("|", $words) . ')\b/i';

        // Try to save it. file_put_contents() creates the file if needed,
        // so only a failed write is reported.
        if(@file_put_contents(BADWORDS_EXP_FILE, $regexp) === false)
        {
            trigger_error("Could not save badwords expression. Check file permissions.", E_USER_WARNING);
        }
    }

    // Return it
    return $regexp;
}

?>
Then you could use it like:
<?php

function cleanWords($value) {
    $regexp = getBadWordsRegexp();

    // The "e" modifier was removed in PHP 7.0, so use a callback instead
    return preg_replace_callback($regexp, function ($match) {
        return str_repeat("*", strlen($match[1]));
    }, $value);
}

?>
P.S.
I have a couple of notes on your code, though.
- You don't need to import $_SESSION into functions using the global keyword. $_SESSION is a "super-global", which makes it available to you wherever you are in your code.
- All strings need to be quoted. That includes array keys. Which means that:
// This
$_SESSION[wordlist];

// Should be
$_SESSION['wordlist'];
If you leave the quotes out, PHP assumes it is a constant. Failing to find a constant, it raises a notice (a warning since PHP 7.2) and uses the name as a string, which is why it works even though it is technically an error. PHP 8 throws an Error instead. For future-compatibility and performance reasons (minor as they may be), it is best to just remember to quote the strings.
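A short sketch of the quoting rules, using a plain array named $data as a stand-in for $_SESSION so it runs without a session. The one place an unquoted key is valid is inside a double-quoted string using the simple interpolation syntax, which is why the pattern string in cleanWords() worked while the bare assignment above it did not follow the rule.

```php
<?php
// $data stands in for $_SESSION; the key and value are sample data.
$data = ['wordlist' => 'foo|bar'];

// Outside a string, the key must be quoted.
$list = $data['wordlist'];

// Inside double quotes, the simple syntax takes an unquoted key...
$simple = "/\b($data[wordlist])\b/i";

// ...while the curly-brace syntax takes a quoted one.
$braced = "/\b({$data['wordlist']})\b/i";

var_dump($simple === $braced); // bool(true)
```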
Thanks, Atli! Your suggestions are much appreciated.