Finding People's Names in Files

Crazy question, but has anyone attempted this or seen Python code that
does it? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True; otherwise return False.
Oct 11 '07 #1
On Oct 11, 5:22 pm, brad <byte8b...@gmail.com> wrote:
> Crazy question, but has anyone attempted this or seen Python code that
> does? For example, if a text file contained 'Guido' and or 'Robert' and
> or 'Susan', then we should return True, otherwise return False.
Can't you just use the string function .findall() ?
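
For what it's worth, findall() lives in the re module rather than on str objects. A minimal sketch for the known-names case, assuming the file contents are already in a string (contains_known_name and the sample inputs are made up for illustration):

```python
import re

def contains_known_name(text, names):
    """Return True if any of the given names occurs in text as a whole word."""
    if not names:
        return False
    # re.escape keeps characters such as '.' literal; \b stops 'Susan'
    # from matching inside a longer word such as 'Susanna'.
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    return bool(re.findall(pattern, text))

names = ["Guido", "Robert", "Susan"]
print(contains_known_name("Patch reviewed by Guido.", names))
print(contains_known_name("Nothing to see here.", names))
```

This only covers names you already know, which (as the follow-ups show) is not quite what was being asked.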

Oct 11 '07 #2
On 11/10/2007, brad <by*******@gmail.com> wrote:
> Crazy question, but has anyone attempted this or seen Python code that
> does? For example, if a text file contained 'Guido' and or 'Robert' and
> or 'Susan', then we should return True, otherwise return False.

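# fname is the path of the file to scan
Text = open(fname).read()

def a_function():
    # True as soon as any of the known names appears in the text
    for Name in ['Guido', 'Robert', 'Susan']:
        if Name in Text:
            return True
    return False

if a_function():
    print("A name was found")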
:)
Oct 11 '07 #3
co*********@gmail.com wrote:
> On Oct 11, 5:22 pm, brad <byte8b...@gmail.com> wrote:
>> Crazy question, but has anyone attempted this or seen Python code that
>> does? For example, if a text file contained 'Guido' and or 'Robert' and
>> or 'Susan', then we should return True, otherwise return False.
>
> Can't you just use the string function .findall() ?
I mean *any* possible person's name... I don't *know* the names
beforehand :)
Oct 11 '07 #4
On 10/11/07, brad <by*******@gmail.com> wrote:
> co*********@gmail.com wrote:
>> On Oct 11, 5:22 pm, brad <byte8b...@gmail.com> wrote:
>>> Crazy question, but has anyone attempted this or seen Python code that
>>> does? For example, if a text file contained 'Guido' and or 'Robert' and
>>> or 'Susan', then we should return True, otherwise return False.
>> Can't you just use the string function .findall() ?
>
> I mean *any* possible person's name... I don't *know* the names
> beforehand :)

"I cannot combine some characters

dhcmrlchtdj

which the divine Library has not foreseen and which in one of
its secret tongues do not contain a terrible meaning. No one can
articulate a syllable which is not filled with tenderness and fear,
which is not, in one of these languages, the powerful name of a god."

Jorge Luis Borges, The Library of Babel
Oct 11 '07 #5
co*********@gmail.com wrote:
> However...how can you know it is a name...
OK, I admitted in my first post that it was a crazy question, but if one
could find an answer, one would be onto something. Maybe it's not a 100%
answerable question, but I would guess that it is an 80% answerable
question... I just don't know how... yet :)

Besides admitting that it's a crazy question, I should stop and explain
how it would be useful to me at least. Is a credit card number itself
valuable? I would think not. One can easily use re and a Luhn check to find
credit card numbers in files with a great degree of accuracy, but a
number without a name is not very useful to me. So, if one could
associate names to luhn checked numbers automatically, then one would be
onto something. Or at least say, "hey, this file has luhn validated CCs
*AND* it seems to have people's names in it as well." Now then, I'd have
less to review or perhaps as much as I have now, but I could push the
files with numbers and names to the top of the list so that they would
be reviewed first.
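
As a rough illustration of the "re and Luhn check" step, here is a minimal sketch. The pattern (13 to 16 digits, optionally separated by single spaces or dashes) and the names CANDIDATE, luhn_ok and find_card_numbers are assumptions for the example, not the code behind the scanner described here:

```python
import re

# Assumed candidate shape: 13 to 16 digits, optionally separated by
# single spaces or dashes, not embedded in a longer run of digits.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def luhn_ok(digits):
    """Return True if a string of decimal digits passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return the digit strings in text that match CANDIDATE and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```

A file could then be flagged when find_card_numbers() returns anything at all, and pushed higher in the review list if something name-like is found in the same file.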

Brad

Oct 11 '07 #6
On Oct 11, 9:11 am, brad <byte8b...@gmail.com> wrote:
> cokofree...@gmail.com wrote:
>> However...how can you know it is a name...
>
> OK, I admitted in my first post that it was a crazy question, but if one
> could find an answer, one would be onto something. Maybe it's not a 100%
> answerable question, but I would guess that it is an 80% answerable
> question... I just don't know how... yet :)
>
> Besides admitting that it's a crazy question, I should stop and explain
> how it would be useful to me at least. Is a credit card number itself
> valuable? I would think not. One can easily re and luhn check for credit
> card numbers located in files with a great degree of accuracy, but a
> number without a name is not very useful to me. So, if one could
> associate names to luhn checked numbers automatically, then one would be
> onto something. Or at least say, "hey, this file has luhn validated CCs
> *AND* it seems to have people's names in it as well." Now then, I'd have
> less to review or perhaps as much as I have now, but I could push the
> files with numbers and names to the top of the list so that they would
> be reviewed first.
>
> Brad
What the hell are you doing? Your post sounds to me like you have a
huge amount of stolen, or at the very least misapprehended, data. Now
you want to search it for credit card numbers and names so that you
can use them.

I am not cool with this! This is a public forum about a programming
language. What makes you think that anybody in this forum will be cool
with that? Perhaps you aren't doing anything illegal, but it sure is
coming off that way. If you are doing something illegal I hope you get
caught.

At the very least, you might want to clarify why you are looking for
such capability so that you don't get effectively black-listed (well,
by me at least).

Matt

Oct 11 '07 #7
On Thu, 11 Oct 2007 11:22:50 -0400, brad wrote:
> Crazy question, but has anyone attempted this or seen Python code that
> does? For example, if a text file contained 'Guido' and or 'Robert' and
> or 'Susan', then we should return True, otherwise return False.
It'll be hard to handle the Dweezils and Moon Units of the world (I
believe these are Frank Zappa's kids?), but you could compile a list of
reasonably common names by gaining access to a usenet news spool, and
pulling the names from the headers.

But then this is starting to sound dangerously like a spam campaign - in
which case, "Please don't!".
Oct 11 '07 #8
On Oct 11, 12:49 pm, Matimus <mccre...@gmail.com> wrote:
> On Oct 11, 9:11 am, brad <byte8b...@gmail.com> wrote:
>> cokofree...@gmail.com wrote:
>>> However...how can you know it is a name...
>> OK, I admitted in my first post that it was a crazy question, but if one
>> could find an answer, one would be onto something. Maybe it's not a 100%
>> answerable question, but I would guess that it is an 80% answerable
>> question... I just don't know how... yet :)
>> Besides admitting that it's a crazy question, I should stop and explain
>> how it would be useful to me at least. Is a credit card number itself
>> valuable? I would think not. One can easily re and luhn check for credit
>> card numbers located in files with a great degree of accuracy, but a
>> number without a name is not very useful to me. So, if one could
>> associate names to luhn checked numbers automatically, then one would be
>> onto something. Or at least say, "hey, this file has luhn validated CCs
>> *AND* it seems to have people's names in it as well." Now then, I'd have
>> less to review or perhaps as much as I have now, but I could push the
>> files with numbers and names to the top of the list so that they would
>> be reviewed first.
>> Brad
>
> What the hell are you doing? Your post sounds to me like you have a
> huge amount of stolen, or at the very least misapprehended, data. Now
> you want to search it for credit card numbers and names so that you
> can use them.
>
> I am not cool with this! This is a public forum about a programming
> language. What makes you think that anybody in this forum will be cool
> with that. Perhaps you aren't doing anything illegal, but it sure is
> coming off that way. If you are doing something illegal I hope you get
> caught.
>
> At the very least, you might want to clarify why you are looking for
> such capability so that you don't get effectively black-listed (well,
> by me at least).
>
> Matt
Go have a beer and calm down a bit :) It's a legitimate purpose,
although it could be (and probably is being) used by bad guys right now.
My intent, as you can see from the links below, is to catch it before
the bad guys do.

http://filebox.vt.edu/users/rtilley/public/find_ccns/
http://filebox.vt.edu/users/rtilley/public/find_ssns/

Brad

Oct 11 '07 #9
brad <by*******@gmail.com> writes:
> Crazy question, but has anyone attempted this or seen Python code that
> does? For example, if a text file contained 'Guido' and or 'Robert'
> and or 'Susan', then we should return True, otherwise return False.
A few ideas:

1. If you don't have a list of names, find a list of words that
doesn't contain proper nouns (there are a few word lists out there,
not sure if any exclude people's names, though). Look for short runs
of two or three "words" (punctuation-separated tokens) in the email
that aren't in the dictionary. Some of them will be people's names.

2. Send the text through Google translate and look for runs of words
that are unchanged. Some of them will be people's names.

3. Search the literature and look for fancy algorithms. Here are some
papers (the last mentions some commercial software to do this):

http://citeseer.ist.psu.edu/bikel99algorithm.html

http://citeseer.ist.psu.edu/618945.html

http://arxiv.org/html/cmp-lg/9706017
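
Idea 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: known_words is a set of lowercase dictionary words that you supply from whatever word list you trust, and unknown_runs is a made-up helper name.

```python
import re

def unknown_runs(text, known_words, min_run=2, max_run=3):
    """Return runs of min_run..max_run adjacent words absent from known_words."""
    runs, current = [], []
    # Tokens are either runs of word characters or single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token.isalpha() and token.lower() not in known_words:
            current.append(token)
            continue
        # A dictionary word, a number or punctuation ends the current run.
        if min_run <= len(current) <= max_run:
            runs.append(" ".join(current))
        current = []
    if min_run <= len(current) <= max_run:
        runs.append(" ".join(current))
    return runs

known = {"please", "send", "the", "report", "to", "by", "friday"}
print(unknown_runs("Please send the report to Robert Tilley by Friday.", known))
```

Expect plenty of noise: product names, place names and misspellings all come out as candidates too, so this is a filter for human review rather than a detector.
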
John
Oct 11 '07 #10
On 10/11/07, by*******@gmail.com <by*******@gmail.com> wrote:
> On Oct 11, 12:49 pm, Matimus <mccre...@gmail.com> wrote:
>> On Oct 11, 9:11 am, brad <byte8b...@gmail.com> wrote:
>>> cokofree...@gmail.com wrote:
>>>> However...how can you know it is a name...
>>> OK, I admitted in my first post that it was a crazy question, but if one
>>> could find an answer, one would be onto something. Maybe it's not a 100%
>>> answerable question, but I would guess that it is an 80% answerable
>>> question... I just don't know how... yet :)
>>> Besides admitting that it's a crazy question, I should stop and explain
>>> how it would be useful to me at least. Is a credit card number itself
>>> valuable? I would think not. One can easily re and luhn check for credit
>>> card numbers located in files with a great degree of accuracy, but a
>>> number without a name is not very useful to me. So, if one could
>>> associate names to luhn checked numbers automatically, then one would be
>>> onto something. Or at least say, "hey, this file has luhn validated CCs
>>> *AND* it seems to have people's names in it as well." Now then, I'd have
>>> less to review or perhaps as much as I have now, but I could push the
>>> files with numbers and names to the top of the list so that they would
>>> be reviewed first.
>>> Brad
>> What the hell are you doing? Your post sounds to me like you have a
>> huge amount of stolen, or at the very least misapprehended, data. Now
>> you want to search it for credit card numbers and names so that you
>> can use them.
>>
>> I am not cool with this! This is a public forum about a programming
>> language. What makes you think that anybody in this forum will be cool
>> with that. Perhaps you aren't doing anything illegal, but it sure is
>> coming off that way. If you are doing something illegal I hope you get
>> caught.
>>
>> At the very least, you might want to clarify why you are looking for
>> such capability so that you don't get effectively black-listed (well,
>> by me at least).
>>
>> Matt
>
> Go have a beer and calm down a bit :) It's a legitimate purpose,
> although it could (and probably is being used by bad guys right now).
> My intent, as you can see from the links below, is to catch it before
> the bad guys do.
>
> http://filebox.vt.edu/users/rtilley/public/find_ccns/
> http://filebox.vt.edu/users/rtilley/public/find_ssns/
>
> Brad
In case you're doing this for PCI validation, be aware that just the
CC number is considered sensitive and you'd get some false negatives
if you filter on anything except that.

Random strings that match CC checksums are really quite rare and false
positives from that alone are unlikely to be a problem. Unless I
deployed this and there was a significant false positive rate I
wouldn't risk the false negatives, personally.
Oct 11 '07 #11
Chris Mellon wrote:
> In case you're doing this for PCI validation, be aware that just the
> CC number is considered sensitive and you'd get some false negatives
> if you filter on anything except that.
>
> Random strings that match CC checksums are really quite rare and false
> positives from that alone are unlikely to be a problem. Unless I
> deployed this and there was a significant false positive rate I
> wouldn't risk the false negatives, personally.
Yes, it is for PCI. Our rate of false positives is low, very low. I
wasn't aware that a number alone was a PCI violation. Thank you! On
another note, we're a university (Virginia Tech) and we're subject to
FERPA, HIPAA, GLBA, etc... in addition to PCI. So we do these checks for
U.S. Social Security Numbers too in an effort to prevent or lessen the
chance of ID theft. Unfortunately, there is no luhn check for SSNs. We
follow the Social Security Administration verification guideline
religiously... here's a web front-end to my logic:

http://black.cirt.vt.edu/public/valid_ssn/index.html

but still have many false positives on SSNs, so being able to id *names
and numbers* in files would still be a benefit to us.
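
To show why SSN matching stays noisy without a checksum, here is a minimal sketch of a purely structural check. It assumes the dashed AAA-GG-SSSS layout and uses only the commonly cited rules (area not 000, 666 or 9xx; group not 00; serial not 0000). It is not the logic behind the page linked above, and the names SSN_PATTERN, plausible_ssn and find_plausible_ssns are made up for the example.

```python
import re

# Assumed layout: AAA-GG-SSSS with dashes; other layouts are ignored here.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def plausible_ssn(area, group, serial):
    """Apply the structural rules only; this cannot prove a number was issued."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    if group == "00" or serial == "0000":
        return False
    return True

def find_plausible_ssns(text):
    """Return dashed SSN-shaped strings in text that pass the structural rules."""
    return [m.group() for m in SSN_PATTERN.finditer(text)
            if plausible_ssn(*m.groups())]
```

Most nine-digit strings in that layout pass these rules, which is why a second signal in the same file, such as something that looks like a name, helps with ranking.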

Brad
Oct 11 '07 #12
