Finding People's Names in Files

Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True, otherwise return False.
Oct 11 '07 #1
On Oct 11, 5:22 pm, brad <byte8b...@gmail.com> wrote:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and or 'Robert' and
or 'Susan', then we should return True, otherwise return False.
Can't you just use the string function .findall() ?
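
For reference, Python strings have no .findall() method; the nearest equivalent is re.findall() in the standard re module. A minimal sketch, assuming the names are known in advance (the KNOWN_NAMES list and the contains_name helper are made up for illustration and are not from the thread):

```python
import re

# Hypothetical list of known names; the thread only offers these three as examples.
KNOWN_NAMES = ['Guido', 'Robert', 'Susan']

# \b on both sides keeps 'Robert' from matching inside 'Roberta'.
NAME_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, KNOWN_NAMES)) + r')\b')

def contains_name(text):
    """Return True if any known name occurs in text as a whole word."""
    return bool(NAME_RE.findall(text))
```

This only helps when the list of names exists up front, which is the limitation raised in the next reply.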

Oct 11 '07 #2
On 11/10/2007, brad <by*******@gmail.com> wrote:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and or 'Robert' and
or 'Susan', then we should return True, otherwise return False.

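fname = 'example.txt'  # placeholder path (not in the original post); point it at the file to scan
Text = open(fname).read()

def a_function():
    # Return True as soon as any of the known names appears in the text.
    for Name in ['Guido', 'Robert', 'Susan']:
        if Name in Text:
            return True
    return False

if a_function():
    print("A name was found")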

:)
Oct 11 '07 #3
co*********@gmail.com wrote:
On Oct 11, 5:22 pm, brad <byte8b...@gmail.com> wrote:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and or 'Robert' and
or 'Susan', then we should return True, otherwise return False.

Can't you just use the string function .findall() ?
I mean *any* possible person's name... I don't *know* the names
beforehand :)
Oct 11 '07 #4
On 10/11/07, brad <by*******@gmail.com> wrote:
co*********@gmail.com wrote:
On Oct 11, 5:22 pm, brad <byte8b...@gmail.com> wrote:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and or 'Robert' and
or 'Susan', then we should return True, otherwise return False.
Can't you just use the string function .findall() ?

I mean *any* possible person's name... I don't *know* the names
beforehand :)

"I cannot combine some characters

dhcmrlchtdj

which the divine Library has not foreseen and which in one of
its secret tongues do not contain a terrible meaning. No one can
articulate a syllable which is not filled with tenderness and fear,
which is not, in one of these languages, the powerful name of a god."

Jorge Luis Borges, The Library of Babel
Oct 11 '07 #5
co*********@gmail.com wrote:
However...how can you know it is a name...
OK, I admitted in my first post that it was a crazy question, but if one
could find an answer, one would be onto something. Maybe it's not a 100%
answerable question, but I would guess that it is an 80% answerable
question... I just don't know how... yet :)

Besides admitting that it's a crazy question, I should stop and explain
how it would be useful to me at least. Is a credit card number itself
valuable? I would think not. One can easily re and luhn check for credit
card numbers located in files with a great degree of accuracy, but a
number without a name is not very useful to me. So, if one could
associate names to luhn checked numbers automatically, then one would be
onto something. Or at least say, "hey, this file has luhn validated CCs
*AND* it seems to have people's names in it as well." Now then, I'd have
less to review or perhaps as much as I have now, but I could push the
files with numbers and names to the top of the list so that they would
be reviewed first.

Brad
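
The "re and luhn check" step described above might look roughly like the sketch below. It is not Brad's actual code: the candidate pattern (13 to 16 digits, optionally separated by single spaces or dashes) and the names luhn_ok and find_card_numbers are assumptions made for illustration.

```python
import re

# Assumption: a candidate is 13 to 16 digits, with optional single
# spaces or dashes between digits.
CANDIDATE_RE = re.compile(r'\b\d(?:[ -]?\d){12,15}\b')

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return the Luhn-valid digit runs found in text, separators stripped."""
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        digits = re.sub(r'[ -]', '', match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```

A file where find_card_numbers() returns hits and a name detector also fires could then be moved to the top of the review list, which is the prioritization described in the post.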

Oct 11 '07 #6
On Oct 11, 9:11 am, brad <byte8b...@gmail.com> wrote:
cokofree...@gmail.com wrote:
However...how can you know it is a name...

OK, I admitted in my first post that it was a crazy question, but if one
could find an answer, one would be onto something. Maybe it's not a 100%
answerable question, but I would guess that it is an 80% answerable
question... I just don't know how... yet :)

Besides admitting that it's a crazy question, I should stop and explain
how it would be useful to me at least. Is a credit card number itself
valuable? I would think not. One can easily re and luhn check for credit
card numbers located in files with a great degree of accuracy, but a
number without a name is not very useful to me. So, if one could
associate names to luhn checked numbers automatically, then one would be
onto something. Or at least say, "hey, this file has luhn validated CCs
*AND* it seems to have people's names in it as well." Now then, I'd have
less to review or perhaps as much as I have now, but I could push the
files with numbers and names to the top of the list so that they would
be reviewed first.

Brad
What the hell are you doing? Your post sounds to me like you have a
huge amount of stolen, or at the very least misappropriated, data. Now
you want to search it for credit card numbers and names so that you
can use them.

I am not cool with this! This is a public forum about a programming
language. What makes you think that anybody in this forum will be cool
with that? Perhaps you aren't doing anything illegal, but it sure is
coming off that way. If you are doing something illegal I hope you get
caught.

At the very least, you might want to clarify why you are looking for
such capability so that you don't get effectively black-listed (well,
by me at least).

Matt

Oct 11 '07 #7
On Thu, 11 Oct 2007 11:22:50 -0400, brad wrote:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and or 'Robert' and
or 'Susan', then we should return True, otherwise return False.
It'll be hard to handle the Dweezils and Moon Units of the world (I
believe these are Frank Zappa's kids?), but you could compile a list of
reasonably common names by gaining access to a usenet news spool, and
pulling the names from the headers.

But then this is starting to sound dangerously like a spam campaign - in
which case, "Please don't!".
Oct 11 '07 #8
On Oct 11, 12:49 pm, Matimus <mccre...@gmail.com> wrote:
What the hell are you doing? Your post sounds to me like you have a
huge amount of stolen, or at the very least misappropriated, data. Now
you want to search it for credit card numbers and names so that you
can use them.

I am not cool with this! This is a public forum about a programming
language. What makes you think that anybody in this forum will be cool
with that? Perhaps you aren't doing anything illegal, but it sure is
coming off that way. If you are doing something illegal I hope you get
caught.

At the very least, you might want to clarify why you are looking for
such capability so that you don't get effectively black-listed (well,
by me at least).

Matt
Go have a beer and calm down a bit :) It's a legitimate purpose,
although it could be (and probably is being) used by bad guys right now.
My intent, as you can see from the links below, is to catch it before
the bad guys do.

http://filebox.vt.edu/users/rtilley/public/find_ccns/
http://filebox.vt.edu/users/rtilley/public/find_ssns/

Brad

Oct 11 '07 #9
brad <by*******@gmail.com> writes:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and or 'Robert'
and or 'Susan', then we should return True, otherwise return False.
A few ideas:

1. If you don't have a list of names, find a list of words that
doesn't contain proper nouns (there are a few word lists out there,
not sure if any exclude people's names, though). Look for short runs
of two or three "words" (punctuation-separated tokens) in the email
that aren't in the dictionary. Some of them will be people's names.

2. Send the text through Google translate and look for runs of words
that are unchanged. Some of them will be people's names.

3. Search the literature and look for fancy algorithms. Here are some
papers (the last mentions some commercial software to do this):

http://citeseer.ist.psu.edu/bikel99algorithm.html

http://citeseer.ist.psu.edu/618945.html

http://arxiv.org/html/cmp-lg/9706017
John
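
Idea 1 above can be sketched as follows. This is only an illustration: the candidate_names function, its parameters, and the tiny dictionary in the usage example are invented here, and a real run would need a large word list with proper nouns removed.

```python
import re

def candidate_names(text, dictionary, min_run=2, max_run=3):
    """Return runs of min_run..max_run consecutive tokens missing from dictionary.

    dictionary is a set of lowercase ordinary words. Tokens are runs of
    letters and apostrophes, so a run can span punctuation; a stricter
    version would also break runs at sentence boundaries.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    runs, current = [], []
    for tok in tokens + [None]:          # None is a sentinel that flushes the last run
        if tok is not None and tok.lower() not in dictionary:
            current.append(tok)
            continue
        if min_run <= len(current) <= max_run:
            runs.append(' '.join(current))
        current = []
    return runs

# Usage with a toy dictionary (far too small for real use):
words = {'please', 'send', 'the', 'invoice', 'to', 'today'}
print(candidate_names("Please send the invoice to Guido van Rossum today.", words))
# prints ['Guido van Rossum']
```

As the post says, only some of the runs will be names; misspellings, jargon, and product names will show up too, so the output is a list of candidates rather than a verdict.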
Oct 11 '07 #10
On 10/11/07, by*******@gmail.com <by*******@gmail.com> wrote:
Go have a beer and calm down a bit :) It's a legitimate purpose,
although it could be (and probably is being) used by bad guys right now.
My intent, as you can see from the links below, is to catch it before
the bad guys do.

http://filebox.vt.edu/users/rtilley/public/find_ccns/
http://filebox.vt.edu/users/rtilley/public/find_ssns/

Brad
In case you're doing this for PCI validation, be aware that just the
CC number is considered sensitive and you'd get some false negatives
if you filter on anything except that.

Random strings that match CC checksums are really quite rare and false
positives from that alone are unlikely to be a problem. Unless I
deployed this and there was a significant false positive rate I
wouldn't risk the false negatives, personally.
Oct 11 '07 #11
Chris Mellon wrote:
In case you're doing this for PCI validation, be aware that just the
CC number is considered sensitive and you'd get some false negatives
if you filter on anything except that.

Random strings that match CC checksums are really quite rare and false
positives from that alone are unlikely to be a problem. Unless I
deployed this and there was a significant false positive rate I
wouldn't risk the false negatives, personally.
Yes, it is for PCI. Our rate of false positives is low, very low. I
wasn't aware that a number alone was a PCI violation. Thank you! On
another note, we're a university (Virginia Tech) and we're subject to
FERPA, HIPAA, GLBA, etc... in addition to PCI. So we do these checks for
U.S. Social Security Numbers too in an effort to prevent or lessen the
chance of ID theft. Unfortunately, there is no Luhn check for SSNs. We
follow the Social Security Administration verification guideline
religiously... here's a web front-end to my logic:

http://black.cirt.vt.edu/public/valid_ssn/index.html

but still have many false positives on SSNs, so being able to id *names
and numbers* in files would still be a benefit to us.

Brad
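
Since SSNs carry no checksum, validation comes down to structural rules. The sketch below applies only the widely published ones (area 000, 666, and 900-999 are never issued; group 00 and serial 0000 are never issued) and assumes the dashed AAA-GG-SSSS format. It is not the logic behind the linked front-end, and the plausible_ssn name is made up for illustration.

```python
import re

# Assumption: candidates are written as AAA-GG-SSSS with dashes.
SSN_RE = re.compile(r'(\d{3})-(\d{2})-(\d{4})')

def plausible_ssn(candidate):
    """Return True if candidate looks like AAA-GG-SSSS and passes basic structural checks."""
    m = SSN_RE.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = m.groups()
    # Area numbers 000, 666 and 900-999 are never issued.
    if area in ('000', '666') or area >= '900':
        return False
    # Group 00 and serial 0000 are never issued.
    if group == '00' or serial == '0000':
        return False
    return True
```

Rules this loose still accept most nine-digit strings in that format, which is consistent with the many false positives mentioned above and with the interest in pairing numbers with names.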
Oct 11 '07 #12
