I need some help with a regexp please

Hi,

I am trying to get a regexp to validate email addresses but can't get
it quite right. The problem is I can't find a regexp that rejects the
case james..ki**@fred.com, which is not valid. Here's my attempt;
neither of my regexps works quite how I want:

    import os
    import re

    # Sample text: a mix of plausible addresses and junk.
    s = ('Hi james..ki**@fred.com dr******@blarg.com ji*@home.com @@not '
         'sc*****@home.space.com partridge in a pear tree')
    r = re.compile(r'\w+\.?\w+@[^@\s]+\.\w+')
    #r = re.compile(r'[a-z\-\.]+@[a-z\-\.]+')

    # Collect the unique matches.
    addys = set()
    for a in r.findall(s):
        addys.add(a)

    for a in sorted(addys):
        print(a)

This gives:
dr******@blarg.com
ji*@home.com
ki**@fred.com <-- shouldn't be here :(
sc*****@home.space.com

Nearly there but no cigar :)

I can't see the wood for the trees now :) Can anyone suggest a fix
please?

Thanks,
Tony

Sep 21 '06 #1
23 Replies


'[\w.]+@\w+(\.\w+)*'
Works for me, and SHOULD for you, but I haven't tested it all that
much.
Good luck.
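
Two things may be worth checking before relying on that pattern. The sketch below uses made-up sample addresses (the ones in the thread are obfuscated) and only standard `re` behaviour: when a pattern contains a capturing group, `re.findall` returns the group's contents instead of the whole match, and `[\w.]+` accepts consecutive dots.

```python
import re

# Hypothetical sample text; the addresses are illustrative stand-ins.
text = 'Hi james..kirk@fred.com jim@home.space.com @@not'

# Pattern as posted: it has one capturing group, so findall returns only
# what that group matched last (the final ".xyz" piece of the domain).
posted = re.compile(r'[\w.]+@\w+(\.\w+)*')
print(posted.findall(text))

# The same pattern with a non-capturing group returns whole matches,
# but it still accepts the double-dot address.
whole = re.compile(r'[\w.]+@\w+(?:\.\w+)*')
print(whole.findall(text))
```

So this pattern, as far as this sketch shows, does not reject the double-dot case from the original question.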

Sep 21 '06 #2

On 2006-09-21, codefire <to**********@gmail.com> wrote:
> I am trying to get a regexp to validate email addresses but
> can't get it quite right. The problem is I can't quite find the
> regexp to deal with ignoring the case james..ki**@fred.com,
> which is not valid. Here's my attempt, neither of my regexps
> work quite how I want:

I suggest a websearch for email address validators instead of
writing your own.

Here's a hit that looks useful:

http://aspn.activestate.com/ASPN/Coo...n/Recipe/66439

--
Neil Cerutti
Next Sunday Mrs. Vinson will be soloist for the morning service.
The pastor will then speak on "It's a Terrible Experience."
--Church Bulletin Blooper
Sep 21 '06 #3

The problem is that your pattern doesn't start out by confirming that
it's either at the start of a line or after whitespace. You could do
this with a "look-behind assertion" if you wanted.
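
A minimal sketch of the look-behind idea, assuming "start of line or after whitespace" can be expressed as the negative look-behind `(?<!\S)`. The sample addresses are hypothetical, unobfuscated stand-ins for the ones in the question.

```python
import re

# Hypothetical stand-ins for the obfuscated addresses in the question.
text = ('Hi james..kirk@fred.com drwho@blarg.com jim@home.com @@not '
        'scotty@home.space.com partridge in a pear tree')

original = re.compile(r'\w+\.?\w+@[^@\s]+\.\w+')
# (?<!\S) is a negative look-behind: a match may only begin at the
# start of the text or immediately after whitespace.
anchored = re.compile(r'(?<!\S)\w+\.?\w+@[^@\s]+\.\w+')

print(sorted(set(original.findall(text))))
print(sorted(set(anchored.findall(text))))
```

With the look-behind in place the pattern can no longer start matching in the middle of `james..kirk`, so the partial match `kirk@fred.com` disappears.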

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 21 '06 #4

Hi,

thanks for the advice guys.

Well took the kids swimming, watched some TV, read your hints and
within a few minutes had this:

r = re.compile(r'[^.\w]\w+\.?\w+@[^@\s]+\.\w+')

This works for me. That is if you have an invalid email such as
tony..bATblah.com it will reject it (note the double dots).
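
One side effect of the leading `[^.\w]` may be worth knowing about: it consumes a character, so each match includes the character before the address, and an address at the very start of the text is missed. The sketch below (hypothetical sample addresses) compares it with a negative look-behind that tests the same character class without consuming anything.

```python
import re

posted = re.compile(r'[^.\w]\w+\.?\w+@[^@\s]+\.\w+')
# Negative look-behind version: checks the preceding character
# without making it part of the match.
lookbehind = re.compile(r'(?<![.\w])\w+\.?\w+@[^@\s]+\.\w+')

# Hypothetical sample addresses.
text = 'ann@blah.com tony..b@blah.com bob@blah.com'

print(posted.findall(text))
print(lookbehind.findall(text))
```

Both versions reject the double-dot address; only the look-behind version returns clean matches and finds the first address.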

Anyway, now know a little more about regexps :)

Thanks again for the hints,

Tony

Sep 21 '06 #5

codefire wrote:
> Hi,
>
> thanks for the advice guys.
>
> Well took the kids swimming, watched some TV, read your hints and
> within a few minutes had this:
>
> r = re.compile(r'[^.\w]\w+\.?\w+@[^@\s]+\.\w+')
>
> This works for me. That is if you have an invalid email such as
> tony..bATblah.com it will reject it (note the double dots).
>
> Anyway, now know a little more about regexps :)

A little more is unfortunately not enough. The best advice you got was
to use an existing e-mail address validator. The definition of a valid
e-mail address is complicated. You may care to check out "Mastering
Regular Expressions" by Jeffrey Friedl. In the first edition, at least
(I haven't looked at the 2nd), he works through assembling a 4700+ byte
regex for validating e-mail addresses. Yes, that's 4KB. It's the best
advertisement for *not* using regexes for a task like that that I've
ever seen.

Cheers,
John

Sep 21 '06 #6

"John Machin" <sj******@lexicon.net> writes:
> A little more is unfortunately not enough. The best advice you got was
> to use an existing e-mail address validator. The definition of a valid
> e-mail address is complicated. You may care to check out "Mastering
> Regular Expressions" by Jeffrey Friedl. In the first edition, at least
> (I haven't looked at the 2nd), he works through assembling a 4700+ byte
> regex for validating e-mail addresses. Yes, that's 4KB. It's the best
> advertisement for *not* using regexes for a task like that that I've
> ever seen.

The best advice I've seen when people ask "How do I validate whether
an email address is valid?" was "Try sending mail to it".

It's both Pythonic, and truly the best way. If you actually want to
confirm, don't try to validate it statically; *use* the email address,
and check the result. Send an email to that address, and don't use it
any further unless you get a reply saying "yes, this is the right
address to use" from the recipient.

The sending system's mail transport agent, not regular expressions,
determines which part is the domain to send the mail to.

The domain name system, not regular expressions, determines what
domains are valid, and what host should receive mail for that domain.

Most especially, the receiving mail system, not regular expressions,
determines what local-parts are valid.

--
\ "I believe in making the world safe for our children, but not |
`\ our children's children, because I don't think children should |
_o__) be having sex." -- Jack Handey |
Ben Finney

Sep 22 '06 #7

Ben Finney wrote:
> "John Machin" <sj******@lexicon.net> writes:
>
>> A little more is unfortunately not enough. The best advice you got was
>> to use an existing e-mail address validator. The definition of a valid
>> e-mail address is complicated. You may care to check out "Mastering
>> Regular Expressions" by Jeffrey Friedl. In the first edition, at least
>> (I haven't looked at the 2nd), he works through assembling a 4700+ byte
>> regex for validating e-mail addresses. Yes, that's 4KB. It's the best
>> advertisement for *not* using regexes for a task like that that I've
>> ever seen.
>
> The best advice I've seen when people ask "How do I validate whether
> an email address is valid?" was "Try sending mail to it".

That only applies if it's a likely-looking email address. If someone
asks me to send mail to "splurge.!#$%*&^from@thingie?><{}_)" I will
probably assume that it isn't worth my time trying.

If the email looks syntactically correct, *then* it's worth further
validation by trying a delivery attempt.

> It's both Pythonic, and truly the best way. If you actually want to
> confirm, don't try to validate it statically; *use* the email address,
> and check the result. Send an email to that address, and don't use it
> any further unless you get a reply saying "yes, this is the right
> address to use" from the recipient.

This is a rather scatter-shot approach. Many possibilities can be
properly eliminated by judicious lexical checks before delivery is
considered.

> The sending system's mail transport agent, not regular expressions,
> determines which part is the domain to send the mail to.
>
> The domain name system, not regular expressions, determines what
> domains are valid, and what host should receive mail for that domain.
>
> Most especially, the receiving mail system, not regular expressions,
> determines what local-parts are valid.

Nevertheless, I am *not* going to try delivery to (for example) a
non-local address that doesn't contain an "@" sign.
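
As a rough illustration of what such a lexical pre-check might look like, here is a deliberately permissive sketch. The function name and the exact rules are assumptions made for this example, not anything proposed in the thread; in particular, rejecting whitespace is a judgment call, since quoted local parts may legally contain it.

```python
def worth_a_delivery_attempt(address):
    """Cheap lexical screen; deliberately permissive (hypothetical helper)."""
    # Split on the last "@" so any earlier "@" stays in the local part.
    local, at, domain = address.rpartition('@')
    if not at:
        return False          # no "@" at all
    if not local or not domain:
        return False          # nothing before or nothing after the "@"
    if any(ch.isspace() for ch in address):
        return False          # assumption: whitespace is treated as a typo
    return True


print(worth_a_delivery_attempt('fred'))
print(worth_a_delivery_attempt('fred@example.com'))
```

Anything that passes would still need a real delivery attempt to be confirmed.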

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 22 '06 #8

Steve Holden <st***@holdenweb.com> writes:
> Ben Finney wrote:
>> The best advice I've seen when people ask "How do I validate
>> whether an email address is valid?" was "Try sending mail to it".
>
> That only applies if it's a likely-looking email address. If someone
> asks me to send mail to "splurge.!#$%*&^from@thingie?><{}_)" I will
> probably assume that it isn't worth my time trying.

You, as a human, can possibly make that decision, if you don't care
about turning away someone who *does* have such an email address. How
can an algorithm do so? There are many valid email addresses that look
as bizarre as the example you gave.

>> The sending system's mail transport agent, not regular
>> expressions, determines which part is the domain to send the mail
>> to.
>>
>> The domain name system, not regular expressions, determines what
>> domains are valid, and what host should receive mail for that
>> domain.
>>
>> Most especially, the receiving mail system, not regular
>> expressions, determines what local-parts are valid.
>
> Nevertheless, I am *not* going to try delivery to (for example) a
> non-local address that doesn't contain an "@" sign.

Would you try delivery to an email address that contains two or more
"@" symbols? If not, you will be denying delivery to valid RFC2821
addresses.

This is, of course, something you're entitled to do. But you've then
consciously chosen not to use "is the email address valid?" as your
criterion, and the original request for such validation becomes moot.

--
\ "During my service in the United States Congress, I took the |
`\ initiative in creating the Internet." -- Al Gore |
_o__) |
Ben Finney

Sep 22 '06 #9

Ant

John Machin wrote:
> ...
> A little more is unfortunately not enough. The best advice you got was
> to use an existing e-mail address validator.

We got bitten by this at the last place I worked - we were using a
regex email validator (from Microsoft IIRC), and we kept having
problems with specific email addresses from Ireland. There are stacks of
Irish email addresses out there of the form paddy.o'reilly@domain -
perfectly valid email address, but doesn't satisfy the usual naive
versions of regex validators.
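
To make the apostrophe problem concrete, here is a small sketch. Both patterns are illustrative assumptions (not the Microsoft validator mentioned above), and `example.ie` is a placeholder domain. The looser character class lists the characters RFC 5322 permits in an unquoted local part, plus the dot.

```python
import re

# A typical naive pattern: only word characters and dots before the "@".
naive = re.compile(r'[\w.]+@[\w.-]+')

# Looser local part: the RFC 5322 "atext" characters plus the dot.
# Still a sketch, not a full validator.
looser = re.compile(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+@[\w.-]+")

address = "paddy.o'reilly@example.ie"  # placeholder domain
print(bool(naive.fullmatch(address)))
print(bool(looser.fullmatch(address)))
```

The naive pattern stops at the apostrophe and rejects the address; the looser one accepts it.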

We use an even worse validator at my current job, but the feeling the
management have (not one I agree with) is that unusual email addresses,
whilst perhaps valid, are uncommon enough not to worry about....

Sep 22 '06 #10

Ant

Ben Finney wrote:
> ...
> The best advice I've seen when people ask "How do I validate whether
> an email address is valid?" was "Try sending mail to it".

There are advantages to the regex method. It is faster than sending an
email and getting a positive or negative return code. The delay may not
be acceptable in many applications. Secondly, the false negatives found
by a reasonable regex will be few compared to the number you'd get if
the smtp server went down, or a remote relay was having problems
delivering the message etc etc.

From a business point of view, it is probably more important to reduce
the number of false negatives than to reduce the number of false
positives - every false negative is a potential loss of a customer.
False positives? Who cares really as long as they are paying ;-)

Sep 22 '06 #11

Ben Finney wrote:
> Steve Holden <st***@holdenweb.com> writes:
>> Ben Finney wrote:
>>> The best advice I've seen when people ask "How do I validate
>>> whether an email address is valid?" was "Try sending mail to it".
>>
>> That only applies if it's a likely-looking email address. If someone
>> asks me to send mail to "splurge.!#$%*&^from@thingie?><{}_)" I will
>> probably assume that it isn't worth my time trying.
>
> You, as a human, can possibly make that decision, if you don't care
> about turning away someone who *does* have such an email address. How
> can an algorithm do so? There are many valid email addresses that look
> as bizarre as the example you gave.
>
>>> The sending system's mail transport agent, not regular
>>> expressions, determines which part is the domain to send the mail
>>> to.
>>>
>>> The domain name system, not regular expressions, determines what
>>> domains are valid, and what host should receive mail for that
>>> domain.
>>>
>>> Most especially, the receiving mail system, not regular
>>> expressions, determines what local-parts are valid.
>>
>> Nevertheless, I am *not* going to try delivery to (for example) a
>> non-local address that doesn't contain an "@" sign.
>
> Would you try delivery to an email address that contains two or more
> "@" symbols? If not, you will be denying delivery to valid RFC2821
> addresses.
>
> This is, of course, something you're entitled to do. But you've then
> consciously chosen not to use "is the email address valid?" as your
> criterion, and the original request for such validation becomes moot.

What proportion of deliverable e-mail addresses have more than one @ in
them?

It may be a good idea, if the supplier of the e-mail address is a human
and is on-line, to run a plausibility check -- does it look like the
vast majority of addresses? Sure,
"fr**@final.com@re********@relay1.net" may be valid and deliverable,
but "cl****@pastetwice.unorgclumsy@pastetwice.unorg" may be valid and
undeliverable. IMHO a quick "Please check and confirm" dialogue would
be warranted.
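
A plausibility check of this kind could be sketched as a triage step that never rejects anything and only decides whether to ask the user to look again. The function name and the list of "unusual" signs below are assumptions chosen for illustration; a real form would tune them to its own traffic.

```python
def triage(address):
    """Return 'ok' or 'confirm'; never rejects outright (hypothetical helper)."""
    local, _, domain = address.rpartition('@')
    unusual = (
        address.count('@') != 1      # none, or more than one "@"
        or '.' not in domain         # no dot in the domain part
        or '..' in address           # consecutive dots
        or not local                 # empty local part
    )
    return 'confirm' if unusual else 'ok'


# A double paste, using a placeholder domain.
pasted_twice = 'clumsy@pastetwice.example' * 2
print(triage('fred@example.com'))
print(triage(pasted_twice))
```

An address flagged `'confirm'` is still accepted if the user says it is right.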

Cheers,
John

Sep 22 '06 #12

Ant wrote:
> John Machin wrote:
> ...
>> A little more is unfortunately not enough. The best advice you got was
>> to use an existing e-mail address validator.
>
> We got bitten by this at the last place I worked - we were using a
> regex email validator (from Microsoft IIRC), and we kept having
> problems with specific email addresses from Ireland. There are stacks of
> Irish email addresses out there of the form paddy.o'reilly@domain -
> perfectly valid email address, but doesn't satisfy the usual naive
> versions of regex validators.
>
> We use an even worse validator at my current job, but the feeling the
> management have (not one I agree with) is that unusual email addresses,
> whilst perhaps valid, are uncommon enough not to worry about....

Oh, sorry for the abbreviation. "use" implies "source from believedly
reliable s/w source; test; then deploy" :-)

Sep 22 '06 #13

"John Machin" <sj******@lexicon.net> writes:
> What proportion of deliverable e-mail addresses have more than one @
> in them?

I don't know. Fortunately, I don't need to; I don't "validate" email
addresses by regular expression.

What proportion of deliverable email addresses do you want to discard
as "not valid"?

--
\ "Theology is the effort to explain the unknowable in terms of |
`\ the not worth knowing." -- Henry L. Mencken |
_o__) |
Ben Finney

Sep 22 '06 #14

Ben Finney wrote:
> "John Machin" <sj******@lexicon.net> writes:
>> What proportion of deliverable e-mail addresses have more than one @
>> in them?
>
> I don't know. Fortunately, I don't need to; I don't "validate" email
> addresses by regular expression.
>
> What proportion of deliverable email addresses do you want to discard
> as "not valid"?

None. Re-read my post. I was suggesting an "are you sure" in
the case of weird or infrequent ones. Discarding wasn't mentioned.

Sep 22 '06 #15

Ben Finney wrote:
> "John Machin" <sj******@lexicon.net> writes:
>
>> What proportion of deliverable e-mail addresses have more than one @
>> in them?
>
> I don't know. Fortunately, I don't need to; I don't "validate" email
> addresses by regular expression.
>
> What proportion of deliverable email addresses do you want to discard
> as "not valid"?

Just as a matter of interest, are you expecting that you'll find out
about the undeliverable ones? Because in many cases nowadays you won't,
since so many domains are filtering out "undeliverable mail" messages as
an anti-spam defence.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 22 '06 #16

Ant

John Machin wrote:
> Ant wrote:
>> John Machin wrote:
>> ...
>>> A little more is unfortunately not enough. The best advice you got was
>>> to use an existing e-mail address validator.
>>
>> We got bitten by this at the last place I worked - we were using a
>> regex email validator (from Microsoft IIRC)
>> ....
>
> Oh, sorry for the abbreviation. "use" implies "source from believedly
> reliable s/w source; test; then deploy" :-)

I actually meant that we got bitten by using a regex validator, not by
using an existing one. Though we did get bitten by an existing one, and
it being from Microsoft we should have known better ;-)

Sep 22 '06 #17

> Just as a matter of interest, are you expecting that you'll find out
> about the undeliverable ones? Because in many cases nowadays you won't,
> since so many domains are filtering out "undeliverable mail" messages as
> an anti-spam defence.

...and then there is the problem of validating that the valid email
address belongs to the person entering it !! If it doesn't, any
correspondence you send to that email address will itself be spam (in
the greater modern definition of spam).

You could allow your form to accept any email address, then send a
verification in an email to the address given, asking the recipient
to click a link if they did in fact fill in the form. When they click
the link the details from the original form are then verified and can
be activated and processed.

HTH :)
Sep 22 '06 #18

Steve Holden <st***@holdenweb.com> writes:
> Ben Finney wrote:
>> I don't "validate" email addresses by regular expression.
>
> Just as a matter of interest, are you expecting that you'll find out
> about the undeliverable ones? Because in many cases nowadays you
> won't, since so many domains are filtering out "undeliverable mail"
> messages as an anti-spam defence.

I wouldn't expect a program to treat a user-supplied email address as
known-good until receiving a confirmation email with a cookie, or some
out-of-band confirmation (e.g., the email addresses are seeded by some
trusted source).

Until then, it's an untrusted piece of user-supplied data, to be kept
around for a limited time pending confirmation, and then discarded.
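
The "untrusted until confirmed" idea could be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: the class name, the one-day lifetime, and the storage are all hypothetical, and actually mailing the token (the "cookie") to the address is left out.

```python
import secrets
import time


class PendingAddresses:
    """In-memory store of unconfirmed addresses (illustrative only)."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._pending = {}  # token -> (address, expiry timestamp)

    def add(self, address, now=None):
        """Record an address and return the token to mail to it."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(16)
        self._pending[token] = (address, now + self.ttl)
        return token

    def confirm(self, token, now=None):
        """Return the address if the token is known and unexpired, else None."""
        now = time.time() if now is None else now
        # pop() discards the entry whether or not it is still valid.
        address, expires = self._pending.pop(token, (None, 0))
        if address is None or now > expires:
            return None
        return address


store = PendingAddresses(ttl_seconds=60)
token = store.add('someone@example.com', now=0)
print(store.confirm(token, now=30))
```

The `now` parameter exists only so the expiry logic can be exercised without waiting; a token works once and an expired one is simply dropped.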

--
\ "Man cannot be uplifted; he must be seduced into virtue." -- |
`\ Donald Robert Perry Marquis |
_o__) |
Ben Finney

Sep 23 '06 #19

Yes, I didn't make it clear in my original post - the purpose of the
code was to learn something about regexps (I only started coding Python
last week). In terms of learning "a little more" the example was
successful. However, creating a full email validator is way beyond me -
the rules are far too complex!! :)

Sep 25 '06 #20

> I still don't touch regular expressions... They may be fast, but to
> me they are just as much line noise as PERL... I can usually code a
> partial "parser" faster than try to figure out an RE.

Yes, it seems to me that REs are a bit "hit and miss" - the only way to
tell if you've got a RE "right" is by testing exhaustively - but you
can never be sure.... They are fine for simple pattern matching though.

Sep 26 '06 #21

Dennis Lee Bieber wrote:
> On 25 Sep 2006 10:25:01 -0700, "codefire" <to**********@gmail.com>
> declaimed the following in comp.lang.python:
>
>> Yes, I didn't make it clear in my original post - the purpose of the
>> code was to learn something about regexps (I only started coding Python
>> last week). In terms of learning "a little more" the example was
>> successful. However, creating a full email validator is way beyond me -
>> the rules are far too complex!! :)
>
> I've been doing small things in Python for over a decade now
> (starting with the Amiga port)...
>
> I still don't touch regular expressions... They may be fast, but to
> me they are just as much line noise as PERL... I can usually code a
> partial "parser" faster than try to figure out an RE.

If I may add another thought along the same line: regular expressions
seem to tend towards an art form, or an intellectual game. Many
discussions revolving around regular expressions convey the impression
that the challenge being pursued is finding a magic formula much more
than solving a problem. In addition there seems to exist some code of
honor which dictates that the magic formula must consist of one single
expression that does it all. I suspect that the complexity of one single
expression grows somehow exponentially with the number of
functionalities it has to perform and at some point enters a gray zone
of impending conceptual intractability where the quest for the magic
formula becomes particularly fascinating. I also suspect that some
problems are impossible to solve with a single expression and that no
test of intractability exists other than giving up after so many hours
of trying.

With reference to the OP's question, what speaks against passing his
texts through several simple expressions in succession? Speed of
execution? Hardly. The speed penalty would not be perceptible.
Conversely, what speaks in favor of multiple expressions is that they
can be kept simple and that the performance of the entire set can be
incrementally improved by adding another simple expression whenever an
unexpected contingency occurs, as may happen at any time with
informal systems. One may not win a coding contest this way, but saving
time isn't bad either, and may even be better.
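
The "several simple expressions in succession" approach might look something like the sketch below. The three checks and their names are assumptions picked for illustration, and the sample addresses are hypothetical stand-ins for the obfuscated ones in the original question.

```python
import re

# Each check is a small, separately testable pattern (all illustrative).
HAS_ONE_AT = re.compile(r'^[^@]+@[^@]+$')      # exactly one "@", text either side
NO_DOUBLE_DOT = re.compile(r'^(?!.*\.\.)')     # fails if ".." appears anywhere
DOTTED_DOMAIN = re.compile(r'@[^@]*\.\w+$')    # domain ends in ".something"

CHECKS = [HAS_ONE_AT, NO_DOUBLE_DOT, DOTTED_DOMAIN]


def find_addresses(text):
    """Tokenize on whitespace, then keep tokens that pass every check."""
    return [tok for tok in text.split()
            if all(check.search(tok) for check in CHECKS)]


# Hypothetical stand-ins for the obfuscated addresses in the question.
sample = ('Hi james..kirk@fred.com drwho@blarg.com jim@home.com @@not '
          'sc@home.space.com partridge in a pear tree')
print(find_addresses(sample))
```

When an unexpected case turns up, another small pattern can be appended to `CHECKS` without touching the existing ones.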

Frederic

Sep 26 '06 #22

Frederic Rentsch wrote:
> If I may add another thought along the same line: regular expressions
> seem to tend towards an art form, or an intellectual game. Many
> discussions revolving around regular expressions convey the impression
> that the challenge being pursued is finding a magic formula much more
> than solving a problem. In addition there seems to exist some code of
> honor which dictates that the magic formula must consist of one single
> expression that does it all.

hear! hear!

for dense guys like myself, regular expressions work best if you use
them as simple tokenizers, and they suck pretty badly if you're trying
to use them as parsers.

and using a few RE:s per problem (or none at all) is a perfectly good
way to get things done.

</F>

Sep 26 '06 #23

> for dense guys like myself, regular expressions work best if you use
> them as simple tokenizers, and they suck pretty badly if you're trying
> to use them as parsers.

:) Well, I'm with you on that one Fredrik! :)

Sep 26 '06 #24
