On Mon, 09 Jul 2007 09:53:10 -0700, tshad <t@home.com> wrote:
> Does anyone have or know where I can get some code that will check a
> TextBox for inappropriate language?
> At the moment, we need to manually check submissions for language before
> posting. This takes a lot of time and resources and in some cases a lot
> of time before a submission is actually made live.
For what it's worth, you may want to reconsider making this an automated
process. You could, as Nicholas points out, use various text searching
mechanisms to match a dictionary of "inappropriate language" against
submissions. However, this runs the risk of being too aggressive
(blocking text that is perfectly fine in some contexts), too passive
(allowing people to bypass the filter easily), or both at the same time.
The last case usually arises when a user intentionally obfuscates
offensive language so that it remains obvious to a human while a computer
is unable to recognize it.
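To make both failure modes concrete, here is a rough sketch of the simplest
possible filter, plain substring matching with IndexOf. The class name
NaiveFilter and the one-word placeholder blocklist are made up purely for
illustration; this is not code from anyone in this thread.

```csharp
using System;

static class NaiveFilter
{
    // Hypothetical placeholder blocklist; a real one would be much larger.
    static readonly string[] Blocked = { "darn" };

    // Returns true if any blocked word occurs anywhere in the text,
    // ignoring case. This is plain substring matching, nothing smarter.
    public static bool IsBlocked(string text)
    {
        foreach (string word in Blocked)
        {
            if (text.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
```

With this sketch, IsBlocked("I was darning a sock") returns true even
though the text is harmless (too aggressive), while IsBlocked("d4rn it")
returns false even though any human reads it correctly (too passive).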
When dealing with problems that are unique to humans, it is usually best
to leave the solution to humans. You can either invest a lot of time and
effort into creating a dictionary-based text matching system that tries to
filter inappropriate language, or you can just put a little "report post"
link in the user's viewing UI and automatically block posts (and maybe
even users) when some threshold (probably based on proportion of total
user base) of users reports the post as inappropriate.
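As a minimal sketch of that "report post" idea, something like the
following might do. Everything here is an assumption for illustration: the
ReportTracker name, the integer post and user IDs, and the threshold
expressed as a fraction of the total user base. Persistence, abuse of the
report link itself, and blocking users are all left out.

```csharp
using System;
using System.Collections.Generic;

class ReportTracker
{
    // For each post ID, the set of distinct user IDs that reported it.
    private readonly Dictionary<int, HashSet<int>> _reportersByPost =
        new Dictionary<int, HashSet<int>>();

    // Fraction of the user base that must report a post before it is blocked.
    private readonly double _threshold;

    public ReportTracker(double threshold)
    {
        _threshold = threshold;
    }

    // Records one report and returns true once the post should be blocked.
    public bool Report(int postId, int userId, int totalUsers)
    {
        HashSet<int> reporters;
        if (!_reportersByPost.TryGetValue(postId, out reporters))
        {
            reporters = new HashSet<int>();
            _reportersByPost[postId] = reporters;
        }

        // A HashSet ignores duplicates, so repeat reports from the
        // same user do not count twice.
        reporters.Add(userId);

        return totalUsers > 0
            && (double)reporters.Count / totalUsers >= _threshold;
    }
}
```

For example, with new ReportTracker(0.015) and 100 total users, the first
report on a post returns false, a repeat report from the same user still
returns false, and a report from a second user returns true.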
Using such a mechanism, a relative handful of users will still be
subjected to inappropriate language, but hopefully it's not really that
harmful to them, and the end result will be that inappropriate language is
much more accurately identified and blocked. That is, even though some
users are guaranteed to see any given inappropriate post before it gets
blocked, on average all users are likely to see less inappropriate
language than would be the case with a completely automated system.
That said, if you do decide to go the dictionary route, you may find that
simple Regex or IndexOf as Nicholas suggested doesn't perform well. If
the submissions are short and the dictionary only has a small number of
words in it, that's probably fine. But otherwise, you are likely to find
that the cost scales badly: checking each dictionary word against the
text separately takes time roughly proportional to the submission length
multiplied by the number of words in the dictionary.
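One way to avoid that scaling, sketched here under my own assumptions, is
to split the submission into words once and look each word up in a
HashSet<string>, so the cost grows with the submission length and hardly
at all with the dictionary size. Note that this is a different technique
from substring search: it matches whole words only. The WordSetFilter name
and the placeholder word list are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class WordSetFilter
{
    // Hypothetical placeholder dictionary, compared case-insensitively.
    static readonly HashSet<string> Blocked = new HashSet<string>(
        new[] { "darn", "heck" }, StringComparer.OrdinalIgnoreCase);

    // Scans the text once, treating each run of letters or digits as a
    // word, and returns true if any such word is in the dictionary.
    public static bool ContainsBlockedWord(string text)
    {
        int start = -1; // index where the current word began, or -1

        // Run one step past the end so a trailing word is also checked.
        for (int i = 0; i <= text.Length; i++)
        {
            bool inWord = i < text.Length && char.IsLetterOrDigit(text[i]);
            if (inWord)
            {
                if (start < 0) start = i;
            }
            else if (start >= 0)
            {
                if (Blocked.Contains(text.Substring(start, i - start)))
                    return true;
                start = -1;
            }
        }
        return false;
    }
}
```

Here ContainsBlockedWord("Well, DARN it") returns true and
ContainsBlockedWord("I was darning a sock") returns false. Whole-word
matching avoids that kind of false positive, but it does nothing about
deliberately obfuscated text, so the accuracy concerns above still apply.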
If so, you may want to consider something based on existing indexing
and/or spell-check functionality. I admit, I'm not that familiar with
what's already out there. I'd guess there are already good, full-featured
libraries (maybe even classes in .NET for all I know) that can handle that
sort of work. However, if not, you may find this class, which I wrote as
an exercise for a similar problem, useful:
<http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06f696d4500b77?dmode=source>
The original poster in that thread never mentioned whether he found it
useful or not. Maybe he didn't, and maybe you wouldn't either. But I
mention it anyway, just in case. :)
Pete