Procedure searching for inappropriate language

Does anyone have, or know where I can get, some code that will check a
TextBox for inappropriate language?

At the moment, we need to manually check submissions for language before
posting. This takes a lot of time and resources, and in some cases it is a
long time before a submission actually goes live.

Thanks,

Tom
Jul 9 '07 #1
Tom,

First, you have to define what is "inappropriate". That's going to
vary widely among people.

I assume that when you do it manually, you have established guidelines
indicating what is inappropriate language. That should serve as your design
spec (or at least serve as the basis for one).

Once you have that, the rest should be easy, as it will really boil down
to some regular expression code, or some calls to IndexOf on the string
class.
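
A minimal sketch of that idea, written in Python for brevity (the same two approaches map onto Regex and String.IndexOf in .NET). The word list and the names `find_flagged` and `contains_substring` are hypothetical; the real list would come from your own guidelines.

```python
import re

# Hypothetical placeholder list; the real one comes from your guidelines.
BLOCKED_WORDS = ["darn", "heck"]

# One case-insensitive alternation, anchored on word boundaries.
BLOCKED_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)

def find_flagged(text):
    """Return every blocked word that appears as a whole word, in order."""
    return [m.group(0).lower() for m in BLOCKED_PATTERN.finditer(text)]

def contains_substring(text):
    """IndexOf-style check: true if any blocked word occurs anywhere."""
    lowered = text.lower()
    return any(lowered.find(w) != -1 for w in BLOCKED_WORDS)
```

The two checks can disagree: the substring version fires on "darning needle", the whole-word version does not. Which behaviour you want is a policy decision, not a coding one.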

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com
Jul 9 '07 #2
What I do is have a Static Hashtable in global called "Badwords", that
contains my list of baddies, and a static accompanying IsBadWord method. It's
pretty quick to pass a post through this and either remove the offenders or
decide not to accept the post at all. George Carlin would be proud.
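
Roughly what that looks like, sketched in Python rather than C# (a frozenset plays the role of the static Hashtable). The entries, `scrub`, and `is_acceptable` are made-up names for illustration, not Peter's actual code.

```python
import re

# Hypothetical placeholder entries standing in for the real list.
BAD_WORDS = frozenset({"darn", "heck"})

def is_bad_word(word):
    """Hash lookup: the cost does not depend on the size of the list."""
    return word.lower() in BAD_WORDS

def scrub(post):
    """Option 1: replace each listed word with asterisks of equal length."""
    return re.sub(
        r"[A-Za-z']+",
        lambda m: "*" * len(m.group(0)) if is_bad_word(m.group(0)) else m.group(0),
        post,
    )

def is_acceptable(post):
    """Option 2: reject the whole post if any word is on the list."""
    return not any(is_bad_word(w) for w in re.findall(r"[A-Za-z']+", post))
```

Because the lookup is per token, anything that does not tokenize to a listed word (extra spaces, a word embedded in a longer one) passes straight through.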

-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
BlogMetaFinder(BETA): http://www.blogmetafinder.com
Jul 9 '07 #3
On Mon, 09 Jul 2007 09:53:10 -0700, tshad <t@home.com> wrote:
> Does anyone have or know where I can get a some code that will check a
> TextBox for inappropriate language.
>
> At the moment, we need to manually check submissions for language before
> posting. This takes a lot of time and resources and in some cases a lot
> of time before a submission is actually made live.

For what it's worth, you may want to reconsider making this an automated
process. You could, as Nicholas points out, use various text searching
mechanisms to match a dictionary of "inappropriate language" against
submissions. However, a filter like that risks being too aggressive,
blocking text that is perfectly fine in some contexts, or too passive,
letting people bypass it easily. It can even have both problems at once:
blocking things that shouldn't be blocked while letting offensive things
through far too easily (usually when the user intentionally obfuscates
their offensive language in a way that is obvious to a human but opaque
to a computer).

When dealing with problems that are unique to humans, it is usually best
to leave the solution to humans. You can either invest a lot of time and
effort into creating a dictionary-based text matching system that tries to
filter inappropriate language, or you can just put a little "report post"
link in the user's viewing UI and automatically block posts (and maybe
even users) when some threshold (probably based on proportion of total
user base) of users reports the post as inappropriate.
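
A rough sketch of the "report post" idea in Python, assuming a threshold of 1% of the user base. The `Post` class, `report` function, and `REPORTS_PER_THOUSAND` value are all hypothetical; pick whatever proportion suits your site.

```python
from dataclasses import dataclass, field

# Assumed policy value: block once 10 in every 1000 users (1%) report a post.
REPORTS_PER_THOUSAND = 10

@dataclass
class Post:
    body: str
    reporters: set = field(default_factory=set)
    blocked: bool = False

def report(post, user_id, total_users):
    """Record one report per user and block once the threshold is reached."""
    post.reporters.add(user_id)  # a set ignores repeat reports by one user
    threshold = max(1, total_users * REPORTS_PER_THOUSAND // 1000)
    if len(post.reporters) >= threshold:
        post.blocked = True
    return post.blocked
```

Storing reporter IDs rather than a bare counter keeps a single user from blocking a post by clicking the link repeatedly.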

Using such a mechanism, a relative handful of users will still be
subjected to inappropriate language, but hopefully it's not really that
harmful to them, and the end result will be that inappropriate language is
much more accurately identified and blocked. That is, even though you're
guaranteed some users will always see the inappropriate language in any
post, on average all users are likely to see less inappropriate language
than would be the case with a completely automated system.

That said, if you do decide to go the dictionary route, you may find that
simple Regex or IndexOf as Nicholas suggested doesn't perform well. If
the submissions are short and the dictionary only has a small number of
words in it, that's probably fine. But otherwise, you are likely to find
that the cost, which grows roughly with submission length times dictionary
size when every word is searched for separately, gets out of hand as both
get large.
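
One way to keep the cost down, sketched in Python under my own assumptions (this is not the class linked below): put the dictionary in a trie and walk the text once. The work per character is then bounded by the longest dictionary word, not by the number of entries. `build_trie` and `find_matches` are illustrative names.

```python
END = None  # sentinel key; never collides with a one-character string key

def build_trie(words):
    """Nested dicts keyed by character; END marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_matches(text, trie):
    """Return (start, word) for every dictionary word found in the text.

    Start offsets refer to the lowercased text.
    """
    lowered = text.lower()
    hits = []
    for start in range(len(lowered)):
        node = trie
        for end in range(start, len(lowered)):
            node = node.get(lowered[end])
            if node is None:
                break  # no dictionary word continues with this character
            if END in node:
                hits.append((start, lowered[start:end + 1]))
    return hits
```

Note that this matches substrings, so it finds words embedded in longer ones, innocent or otherwise. If that is still too slow, Aho-Corasick is the usual next step.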

If so, you may want to consider something based on existing indexing
and/or spell-check functionality. I admit, I'm not that familiar with
what's already out there. I'd guess there are already good, full-featured
libraries (maybe even classes in .NET for all I know) that can handle that
sort of work. However, if not you may find this class that I wrote as an
exercise for a similar problem useful:
<http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06f696d4500b77?dmode=source>

The original poster in that thread never mentioned whether he found it
useful or not. Maybe he didn't, and maybe you wouldn't either. But I
mention it anyway, just in case. :)

Pete
Jul 9 '07 #4
On Mon, 09 Jul 2007 10:24:02 -0700, Peter Bromberg [C# MVP]
<pb*******@yahoo.yabbadabbadoo.com> wrote:
> What I do is have a Static Hashtable in global called "Badwords", that
> contains my list of baddies, and a static accompanying IsBadWord method.
> It's pretty quick to pass a post through this and either remove the
> offenders or decide not to accept the post at all. George Carlin would
> be proud.

How well does it work when someone posts "b a d w o r d" or
"preBADWORDpost" or...
Jul 9 '07 #5
It wouldn't, and at that point, you get into the area of heuristics, and
to be quite frank, it becomes a back-and-forth affair, since everything
becomes a guess. You develop a heuristic, the people posting develop a
counter, the heuristic is updated, etc, etc.

At that point, you really want to look into things like Bayesian filters
and the like.
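
For a sense of what a Bayesian filter involves, here is a toy two-class naive Bayes classifier in Python with add-one smoothing. It is only a sketch: the class name, the "ok"/"bad" labels, and the tokenizer are my assumptions, and it needs at least one training post per label before it can score anything.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesFilter:
    """Two-class word-count model with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"ok": Counter(), "bad": Counter()}
        self.doc_counts = {"ok": 0, "bad": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        # Assumes both labels have been trained at least once.
        counts = self.word_counts[label]
        vocab = set(self.word_counts["ok"]) | set(self.word_counts["bad"])
        total = sum(counts.values())
        # Prior: share of training posts carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def is_bad(self, text):
        return self.log_score(text, "bad") > self.log_score(text, "ok")
```

The classifier is only as good as the posts it was trained on, which is where the back-and-forth comes in: every new evasion needs new labeled examples.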

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com
Jul 9 '07 #6
On Mon, 09 Jul 2007 12:18:26 -0700, Nicholas Paldino [.NET/C# MVP]
<mv*@spam.guard.caspershouse.com> wrote:
> It wouldn't, and at that point, you get into the area of heuristics,
> and to be quite frank, it becomes a back-and-forth affair, since
> everything becomes a guess. You develop a heuristic, the people posting
> develop a counter, the heuristic is updated, etc, etc.
>
> At that point, you really want to look into things like Bayesian
> filters and the like.

Even when you can apply such a filter to a complete message, as is the
case with spam filtering, errors are made. False negatives, not so bad.
False positives are downright maddening, and do happen with existing
filtering technology on a fairly regular basis. Reducing false positives
necessarily means increasing false negatives.

My point was that a simple dictionary lookup will have lots of holes in
it, and a complicated mechanism like Bayesian filtering will still have
some holes in it and will hugely add to the development cost as well.

One of the best spam-filtering paradigms is the one in which clients mark
a particular message as spam, and when enough mark it so, that message is
simply blocked for all users. Spam mutates, and that's a problem when
trying to block the same thing repeatedly.

But in the case of some sort of user-shared forum where users make
submissions, you have just one instance of each message that is being
marked and so relying on a user-input monitoring scheme is not only much
simpler than trying to design a robust natural language processor, it is
actually much more reliable as well.

The only real downside is the requirement that users participate. But if
you have users that don't want to participate, IMHO that calls into
question the value of the user-shared forum in the first place. :)

Pete
Jul 9 '07 #7
"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote in
message news:e1**************@TK2MSFTNGP05.phx.gbl...
> Tom,
>
> First, you have to define what is "inappropriate". That's going to
> range widely among people.

I am not trying to be all things to all people. I know you can't catch
everything. We know this from spam filters. This doesn't mean you don't put
something in place. If it handles 90% of the problem, that's fine.

> I assume that when you do it manually, you have established guidelines
> indicating what is inappropriate language. That should serve as your
> design spec (or at least serve as the basis for one).

Of course.

> Once you have that, the rest should be easy, as it will really boil
> down to some regular expression code, or some calls to IndexOf on the
> string class.

Right - but I was hoping there was something already out there that would
handle this. Especially something I could handle without changing my code.

Thanks,

Tom
Jul 10 '07 #8
"Peter Duniho" <Np*********@nnowslpianmk.com> wrote in message
news:op***************@petes-computer.local...
> On Mon, 09 Jul 2007 10:24:02 -0700, Peter Bromberg [C# MVP]
> <pb*******@yahoo.yabbadabbadoo.com> wrote:
>> What I do is have a Static Hashtable in global called "Badwords", that
>> contains my list of baddies, and a static accompanying IsBadWord method.
>> It's pretty quick to pass a post through this and either remove the
>> offenders or decide not to accept the post at all. George Carlin would
>> be proud.
>
> How well does it work when someone posts "b a d w o r d" or
> "preBADWORDpost" or...

So you are saying I shouldn't do the best I can and put something in place
because someone might do their best to get around the filters, and that I
couldn't possibly catch anything?

If that were the case we wouldn't have spam filters, antivirus programs, or
anti-spyware programs.

Tom
Jul 10 '07 #9
"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote in
message news:OI**************@TK2MSFTNGP03.phx.gbl...
> It wouldn't, and at that point, you get into the area of heuristics,
> and to be quite frank, it becomes a back-and-forth affair, since
> everything becomes a guess. You develop a heuristic, the people posting
> develop a counter, the heuristic is updated, etc, etc.
>
> At that point, you really want to look into things like Bayesian
> filters and the like.

I don't want to get that complicated. We don't expect a real problem here -
but it could happen.

Thanks,

Tom
Jul 10 '07 #10
On Tue, 10 Jul 2007 14:31:00 -0700, tshad <t@home.com> wrote:
> So you are saying I shouldn't do the best I can and put something in
> place because someone might do their best to get around the filters and
> that I couldn't possibly catch anything?

Yes. It's not a useful expenditure of your time. A user-feedback based
system is likely to be at least as effective, and is much simpler to
implement.

> If that were the case we wouldn't have Spam Filters, Anti Virus program
> or Spyware programs.

That's not true. Those kinds of tools do something very different from
the goal you're trying to achieve here, especially the latter two. Spam
filters are somewhat more related, but the delivery and filtering of email
create extra headaches for a user-driven system that would not exist in
your scenario (and even so, there are spam-filtering tools that do in
fact simply rely on a user-driven system).

You need to compare apples to apples, not to oranges.

Pete
Jul 10 '07 #11
Be careful -- this is a perilous thing to try to do. Do it with dictionary
in hand. The world is plagued with people with small vocabularies who don't
realize that certain words have legitimate uses, e.g., "naked-eye astronomy"
or "bastard file" (metalworking tool). And do not assume that other
people's vocabularies are smaller than yours.
Jul 10 '07 #12
Michael A. Covington wrote:
> Be careful -- this is a perilous thing to try to do. Do it with dictionary
> in hand. The world is plagued with people with small vocabularies who don't
> realize that certain words have legitimate uses, e.g., "naked-eye astronomy"
> or "bastard file" (metalworking tool). And do not assume that other
> people's vocabularies are smaller than yours.

Yes, I saw it once in a Java forum: they replaced the word "ass" with
stars, so "class MyClass {}" was converted to "cl* MyClass {}".
Plain stupid.
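
For anyone wondering how a filter ends up doing that, here is a small Python sketch contrasting blind substring replacement with whole-word matching. I don't know what that forum actually ran; `mask_substring` and `mask_whole_word` are hypothetical names for illustration.

```python
import re

def mask_substring(text, word):
    """Naive approach: star out the word wherever its letters appear."""
    return text.replace(word, "*" * len(word))

def mask_whole_word(text, word):
    """Star out the word only when it stands alone between word boundaries."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)
```

The substring version mangles "class MyClass {}", while the whole-word version leaves it alone. Whole-word matching still cannot tell a metalworking file from an insult, so it only fixes part of the problem.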
Aug 20 '07 #13
