Preg_replace whole word only

100+
P: 162
I'm trying to make a naughty word filter. It removes bad words fine, but words that merely contain a bad word, like "assist" and "asses", get caught in the filter as well. Strangely, though, if the sentence is "My asses to assist me." the clean version reads "My asses to ***ist me." It seems to let the first occurrence inside another word through, but then blocks the rest. Any ideas? My script is below. Thanks.

```php
function cleanWords($value) {

    /*   strip naughty words   */
    $bad_word_file = 'standards/badwords.txt';
    $strtofile = fopen($bad_word_file, "r");
    $badwords = explode("\n", fread($strtofile, filesize($bad_word_file)));
    fclose($strtofile);

    for ($i = 0; $i < count($badwords); $i++) {
        $wordlist .= str_replace(chr(13),'',$badwords[$i]).'|';
    }
    $wordlist = substr($wordlist,0,-1);

    $value = preg_replace("/\b($wordlist)\b/ie", 'preg_replace("/./","*","\\1")', $value);
    return $value;

}
```
Mar 12 '10 #1
6 Replies


Atli
Expert 5K+
P: 5,058
Hey.

If you print the $wordlist, does it look right?
I tested this by just creating the $wordlist manually and it seemed to work fine.
Mar 12 '10 #2

100+
P: 162
Yes, $wordlist is correct. If it helps, the word list is just over 1,000 words.
Mar 12 '10 #3

P: 50
Use whitespace checks with "or" conditions:

(\s|^)(badword1|badword2)(\s|$)

That checks for either whitespace before the word or the start of the string, and then for either whitespace after it or the end of the string.
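
For comparison, here is a small sketch of how the whitespace-delimited pattern behaves next to the `\b` version from the original post. The sample words (`darn`, `heck`) and the test strings are made up for illustration; they are not from the actual word list.

```php
<?php
// Hypothetical word list, for illustration only.
$words = 'darn|heck';

$spacePattern    = '/(\s|^)(' . $words . ')(\s|$)/i';
$boundaryPattern = '/\b(' . $words . ')\b/i';

// A word followed by punctuation has no whitespace after it, so the
// whitespace-delimited pattern misses it, while \b still matches.
$text        = 'What the heck, man';
$spaceHit    = preg_match($spacePattern, $text);
$boundaryHit = preg_match($boundaryPattern, $text);

// A listed word inside a longer word is rejected by both patterns.
$spaceInner    = preg_match($spacePattern, 'heckler');
$boundaryInner = preg_match($boundaryPattern, 'heckler');
```

Note that the whitespace-delimited pattern also consumes the surrounding whitespace as part of the match, so a replacement based on it would have to put those characters back.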
Mar 12 '10 #4

100+
P: 162
I ended up finding that the word "a.s.s." was in my list. I think the dots were messing up the expression. For those interested, this is my new code. Thanks for all the suggestions that helped get it where it is.

```php
$_SESSION[wordlist] = join("|", array_map('trim', file('standards/badwords.txt')));

function cleanWords($value) {

    global $_SESSION;

    $value = preg_replace("/\b($_SESSION[wordlist])\b/ie", 'str_repeat("*", strlen("\\1")) ', $value);
    return $value;

}
```
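
The dots matter because an unescaped `.` in a regular expression matches any character, so an entry like `a.s.s.` can match ordinary words. One way to guard against this is to escape every entry with `preg_quote()` before joining the list. The sketch below assumes a helper named `buildWordPattern` and an inline word array, both invented for illustration and not part of the code in this thread.

```php
<?php
// Hypothetical helper: builds the pattern from an array of words,
// escaping regex metacharacters (and the "/" delimiter) in each entry.
function buildWordPattern(array $words)
{
    $escaped = array();
    foreach ($words as $word) {
        $word = trim($word);
        if ($word !== '') {              // skip blank lines in the list
            $escaped[] = preg_quote($word, '/');
        }
    }
    return '/\b(' . implode('|', $escaped) . ')\b/i';
}

// Unescaped, "a.s.s." means: a, any char, s, any char, s, any char.
$unescaped = '/\b(a.s.s.)\b/i';
$escaped   = buildWordPattern(array('a.s.s.'));

$matchesUnescaped = preg_match($unescaped, 'assist'); // dots match "s", "i", "t"
$matchesEscaped   = preg_match($escaped, 'assist');   // literal dots are required
```

One caveat: `\b` needs a word character on one side of the boundary, so an escaped entry that ends in a dot will not match when it is followed by a space or the end of the text. Entries like that may need separate handling.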
Mar 13 '10 #5

Atli
Expert 5K+
P: 5,058
Hey.
Glad you got it working.

However, I would consider using a different method. Putting the whole thing into the session is very inefficient. The list is the same for every user and rarely changes (if ever), right? If so, compiling it for every user and storing it in a separate session for each one does just two things: it eats up resources and clutters the sessions with duplicate data.

You would be far better off compiling the regular expression into a common file, shared between all users. This is how I would do it. (I wouldn't usually post a ready-to-use code example, but since you already solved this on your own...)
```php
<?php
define("BADWORDS_RAW_FILE", "/path/to/badwords.txt");
define("BADWORDS_EXP_FILE", "/path/to/badwords_expression.txt");

/**
 * Returns a regular expression that can be used to check
 * for "bad" words. Returns an expression in the format:
 *  - /\b(list|of|bad|words)\b/i
 */
function getBadWordsRegexp()
{
    $regexp = "";

    // Try to fetch an existing expression.
    if(!file_exists(BADWORDS_EXP_FILE) ||
       filesize(BADWORDS_EXP_FILE) <= 0 ||
       ($regexp = file_get_contents(BADWORDS_EXP_FILE)) === false)
    {
        // Make sure the raw word list exists
        if(!file_exists(BADWORDS_RAW_FILE)) {
            trigger_error("The bad words file does not exist.", E_USER_ERROR);
            return false;
        }

        // Compile the regular expression
        $regexp = '/\b(' . join("|", array_map('trim', file(BADWORDS_RAW_FILE))) . ')\b/i';

        // Try to save it. Note: is_writeable() is false for a file that does
        // not exist yet, so the expression file must be created beforehand.
        if(!is_writeable(BADWORDS_EXP_FILE) ||
           !file_put_contents(BADWORDS_EXP_FILE, $regexp))
        {
            trigger_error("Could not save badwords expression. Check file permissions.", E_USER_WARNING);
        }
    }

    // Return it
    return $regexp;
}
?>
```
Then you could use it like:
```php
<?php
function cleanWords($value) {
    $regexp = getBadWordsRegexp();
    return preg_replace($regexp . 'e', 'str_repeat("*", strlen("\\1")) ', $value);
}
?>
```
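
The `e` modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions the same masking can be sketched with `preg_replace_callback()` instead. The function name `maskWords`, the explicit `$regexp` parameter, and the sample words are assumptions made to keep the example self-contained; in the code above the pattern would come from `getBadWordsRegexp()`.

```php
<?php
// Hypothetical variant of cleanWords() that takes the pattern as an argument.
function maskWords($value, $regexp)
{
    return preg_replace_callback(
        $regexp,
        function ($matches) {
            // Replace the matched word with the same number of asterisks.
            return str_repeat('*', strlen($matches[0]));
        },
        $value
    );
}

// Sample pattern and text, invented for illustration.
$masked = maskWords('What the heck, heckler?', '/\b(darn|heck)\b/i');
```

Because `strlen()` counts bytes, a word containing multi-byte characters would get more asterisks than it has letters; `mb_strlen()` may be a better fit in that case.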
P.S.
I have a couple of notes on your code, though.
  • You don't need to import $_SESSION into functions using the global keyword. $_SESSION is a "super-global", which makes it available to you wherever you are in your code.
  • All strings need to be quoted. That includes array keys. Which means that:
    ```php
    // This
    $_SESSION[wordlist];

    // Should be
    $_SESSION['wordlist'];
    ```
    If you leave the quotes out, PHP assumes it is a constant. Failing to find a constant, it raises a notice and uses the name as a string (which is why it works, even though it is technically an error). For future-compatibility and performance reasons (minor as they may be), it is best to just remember to quote the strings.
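
A short sketch of the quoting rules, using a plain array named `$data` (a stand-in chosen so the example runs without a session). Outside a string the key should be quoted; on PHP 8 an unquoted key there is no longer a notice but a thrown `Error`. Inside a double-quoted string, the unquoted form is valid simple interpolation syntax, and the braced form takes a quoted key.

```php
<?php
// $data stands in for $_SESSION; the word list is made up for illustration.
$data = array('wordlist' => 'darn|heck');

// Outside a string: quote the key.
$direct = $data['wordlist'];

// Inside a double-quoted string, both of these forms are valid:
$simple = "/\b($data[wordlist])\b/i";      // simple syntax, key left unquoted
$braced = "/\b({$data['wordlist']})\b/i";  // braced syntax, key quoted
```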
Mar 14 '10 #6

100+
P: 162
Thanks, Atli! Your suggestions are much appreciated.
Mar 14 '10 #7
