Bytes IT Community

Extract email addresses from big file.

P: 5
Hey.

I have a big text file with data,
and I want to extract the email addresses from it.

How can I do it?
May 17 '07 #1
29 Replies


arne
Expert 100+
P: 315
I guess there are plenty of ways to do it. Any constraints on the tool/language?
May 17 '07 #2

P: 5
Perl, or a shell script using awk, sed, or cut?
May 17 '07 #3

arne
Expert 100+
P: 315
Perl is certainly a reasonable choice, yes. If I had to do it, I would use it.
May 17 '07 #4

Motoma
Expert 2.5K+
P: 3,235
Regular expressions would be a great way to do this. Try looking at the sed tool.
May 17 '07 #5

Expert 100+
P: 511
    awk '
    {
      for (i=1;i<=NF;i++) {
        if ( $i ~ /[[:alpha:]]@[[:alpha:]]/ ) {
          print $i
        }
      }
    }' "file"
May 18 '07 #6

P: 5
Thanks for the code, dude :)
May 18 '07 #7

prn
Expert 100+
P: 254
It's been quite a while since I did anything with awk, so I wasn't sure how well ghostdog's code would work. It looked like it should handle only alphabetics with no more than one component on each side of the "@". So I made up a test file (test.txt):

    this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
    addresses inside, 00test@leo.bar.com, a text file with no
    particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
    on the format or locations of the 23skidoo@bar.co.uk addresses inside the file.
    Let's try one at the end joe27@aol.com.

I ran ghostdog's awk script on this and got the output:
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com,
    fname.lname@bar.baz.net
    23skidoo@bar.co.uk

Note that this output has FIVE email addresses, but the file has SEVEN so there is something wrong. The two that are omitted have digits just beside the "@" so it looks like I was close but not quite right on how much awk would match with this RE. It catches everything between spaces into $i whenever it matches /[[:alpha:]]@[[:alpha:]]/

But note that it also caught the comma following the third address "00test@leo.bar.com," which it should not include in the email address.
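
To see why only five addresses come out, the field-by-field test can be mimicked in a few lines of Python. This is only a sketch for illustration, not ghostdog's code: it assumes awk's default whitespace field splitting and an ASCII reading of [[:alpha:]], and the names awk_style_fields and sample are made up here.

```python
import re

# Same test as the awk script: a letter immediately on each side of "@".
# Assumption: [[:alpha:]] is read as ASCII letters only.
ALPHA_AT_ALPHA = re.compile(r"[A-Za-z]@[A-Za-z]")


def awk_style_fields(text):
    """Return each whitespace-separated field that passes the awk test."""
    return [field for field in text.split() if ALPHA_AT_ALPHA.search(field)]


# The contents of test.txt from this post.
sample = """\
this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
addresses inside, 00test@leo.bar.com, a text file with no
particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
on the format or locations of the 23skidoo@bar.co.uk addresses inside the file.
Let's try one at the end joe27@aol.com.
"""

for field in awk_style_fields(sample):
    print(field)
```

The par72 and joe27 addresses are skipped because the character just before the "@" is a digit, and the comma survives because the whole field is printed, not just the part that matched.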

Here's a Perl one-liner:
    perl -wne'while(/[\w\.]+@[\w\.]+/g){print "$&\n"}' test.txt
This gives the output
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com
    fname.lname@bar.baz.net
    par72@take.the.bus.au
    23skidoo@bar.co.uk
    joe27@aol.com.
which is almost correct (and does not include the comma following number 3, although it does include the period at the end).

Here's a corrected version:
    perl -wne'while(/[\w\.]+@[\w\.]+\w+/g){print "$&\n"}' test.txt
This yields
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com
    fname.lname@bar.baz.net
    par72@take.the.bus.au
    23skidoo@bar.co.uk
    joe27@aol.com
I'm sure ghostdog74's awk script could also easily be fixed, but as I said, it's been a long time and I'm not sure how much I want to play with it. ;)
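
For comparison, here is roughly the same pair of patterns in Python. Treat it as a sketch: re.ASCII is an assumption made to keep \w to ASCII letters, digits and underscore, and the name find_addresses is invented for this example. The extra \w+ forces the match to end on a word character, so the regex engine backtracks off the trailing period.

```python
import re

# First attempt: the domain part may end in a dot.
FIRST_TRY = re.compile(r"[\w.]+@[\w.]+", re.ASCII)
# Corrected: the trailing \w+ makes the match end on a word character.
CORRECTED = re.compile(r"[\w.]+@[\w.]+\w+", re.ASCII)


def find_addresses(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


text = "addresses inside, 00test@leo.bar.com, Let's try one at the end joe27@aol.com."

print(find_addresses(FIRST_TRY, text))
print(find_addresses(CORRECTED, text))
```

The first list still carries the final period on the last address; the second does not. Neither includes the comma, since a comma is not in the character class.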

HTH,
Paul
May 21 '07 #8

P: 1
Hi.
Thanks for this. I was using it for a while and thought it was wonderful. However it misses the legitimate hyphen character within emails. Here's an updated version.

    perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt
I also piped it through sort to get a sorted, unique list of emails.
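
A quick way to see what the hyphen changes is to run both character classes over the same text. The following Python sketch is illustrative only; the sample text and the address bob@my-host.example.org are made up, and re.ASCII is an assumption to keep \w to ASCII.

```python
import re

# Character classes without and with the hyphen.
WITHOUT_HYPHEN = re.compile(r"[\w.]+@[\w.]+\w+", re.ASCII)
WITH_HYPHEN = re.compile(r"[\w.\-]+@[\w.\-]+\w+", re.ASCII)

# Made-up sample text with hyphens on both sides of the "@".
text = "write to j-q-public@foo.bar or bob@my-host.example.org, please"

print(WITHOUT_HYPHEN.findall(text))
print(WITH_HYPHEN.findall(text))
```

Without the hyphen, the matches are cut short at the nearest hyphen on either side of the "@" (public@foo.bar and bob@my); with it, both full addresses are returned.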
Jul 27 '07 #9

Motoma
Expert 2.5K+
P: 3,235
Great catch peripatetic! Thanks for the addition, and welcome to The Scripts!
Jul 27 '07 #10

P: 3
Guys, can this Perl script be used on websites? Can I replace the file with a web address, or how else can I get the emails contained in a website?

And let's say www.domain.com/aa.php=1 has some emails saved inside, and www.domain.com/aa.php=2 also has some. How can I make a loop that goes through all the aa.php=variable pages and gets the emails from all of them?
Thanks in advance, and sorry for my English.
Feb 9 '08 #11

P: 1
I have a big file with many email addresses. How do I extract only the email addresses? If possible, please mention the software I can use.
Mar 19 '08 #12

P: 1
How would I use a script like this on a group of files that are in a directory to retrieve email addresses from all of them?
May 21 '08 #13

gpraghuram
Expert 100+
P: 1,275
Try to combine the find command with xargs and the perl script given here like this.

find . -name "*.txt" | xargs perl <script given here>
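
If find and xargs are not at hand, the same idea (visit every *.txt file below a directory and collect the addresses) can be sketched in Python. This is an assumption-laden illustration, not a drop-in replacement: the function name emails_under, the UTF-8 decoding with replacement of bad bytes, and the demo file names are all invented here, and the pattern is the hyphen-aware one from earlier in the thread.

```python
import re
import tempfile
from pathlib import Path

EMAIL = re.compile(r"[\w.\-]+@[\w.\-]+\w+", re.ASCII)


def emails_under(root, pattern="*.txt"):
    """Collect the unique addresses found in every matching file below root."""
    found = set()
    for path in Path(root).rglob(pattern):   # recurses like find . -name "*.txt"
        if path.is_file():
            # errors="replace" keeps odd bytes from aborting the whole scan
            text = path.read_text(encoding="utf-8", errors="replace")
            found.update(EMAIL.findall(text))
    return sorted(found)


# Small self-contained demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("contact foo@bar.com today\n", encoding="utf-8")
    (root / "sub" / "b.txt").write_text(
        "or moo@drop.dhcp.bar.com, or foo@bar.com\n", encoding="utf-8"
    )
    (root / "notes.log").write_text("skipped@bar.com\n", encoding="utf-8")
    demo_result = emails_under(root)

print(demo_result)
```

The .log file is ignored because it does not match the pattern, and the duplicate foo@bar.com is reported once.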


Raghu
May 22 '08 #14

RADEP
P: 1
I tried the above example but it didn't work for me.

I got the following error:

C:\Documents and Settings\user\Desktop\abc\trunk\docs>perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}'db_em
ails.txt | sort -u > output.txt
Can't find string terminator "'" anywhere before EOF at -e line 1.
-uThe system cannot find the file specified.
Dec 28 '09 #15

prn
Expert 100+
P: 254
Hi RADEP,

You are apparently trying to do this in a MS Windows environment rather than a *nix environment. The error you are seeing comes from the Windows command-line parser. I am going to assume (which may get me into trouble) that you entered the offending command on a single line. If so, then the only thing that leaps out at me is the lack of space between the final ' and the apparent input file "db_em". Did you copy and paste your example directly from your command window? If so, the first thing I'd try would be to make sure that you do have a space there.

If that does not help, then we have to look a little further. The prior discussion here has been under a Linux/Unix assumption and the *nix shells do parse command lines differently from Windows. The Perl itself should be OK, especially with peripatetic's modification. However, you may have to run it differently. If the Windows command-line parser can't handle this as a one-liner, you can always just put the Perl into a file (e.g., "ExtractEmail.pl") and then you should be able to run it that way as:

C:\...>perl ExtractEmail.pl <db_em

or the like.

Let us know if this meets your needs.

Paul
Dec 31 '09 #16

P: 1
Hey Paul,

I'm in the same boat as RADEP. I'm very much able to use the script on my Linux machine, but unable to once I try it on my Windows VM.

I tried copying the code verbatim into a .pl file and running it from the command line per your suggestion, with output similar to what RADEP saw.

As for the "'" terminator, I have no clue, but I am going to guess that Windows will not support the 'sort -u' command near the end. What are your thoughts?

And by the way, thanks for everyone's help in this. It's forums like these that help me get through the work day. :)

Cheers,
Scott
Dec 31 '09 #17

prn
Expert 100+
P: 254
Hi Scott,

I'm afraid I was too lazy the other day.

If you're going to create a file to do the same job, you have to do the read from STDIN explicitly. So the file ExtractEmail.pl could look like this:

    while (<STDIN>) {
        while (/[\w\.\-]+@[\w\.\-]+\w+/g)
            {print "$&\n"}
    }
Then you can invoke it like this:

    C:\...>perl ExtractEmail.pl <test.txt >out.txt
Of course, the sorting as in a *nix environment is not available in the native Windows environment. There are several ways to get the capability. You could install Cygwin, which provides the bash shell along with utilities including sort. (Then you should be able to use the original one-liner.) If you search for "windows unix sort" you should find some advice (which I have not tested) on other ports of the sort utility.
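
Another option that does not depend on a sort utility at all is to de-duplicate and sort inside the script itself. Here is a minimal Python sketch of that idea, assuming Python is installed; the names unique_sorted_addresses and main are invented for this example, and the in-memory streams exist only so the demo runs without any files.

```python
import io
import re
import sys

EMAIL = re.compile(r"[\w.\-]+@[\w.\-]+\w+", re.ASCII)


def unique_sorted_addresses(lines):
    """Return each distinct address once, in sorted order (like sort -u)."""
    seen = set()
    for line in lines:
        seen.update(EMAIL.findall(line))
    return sorted(seen)


def main(source=None, sink=None):
    """Read lines from source (default stdin), write addresses to sink (default stdout)."""
    source = sys.stdin if source is None else source
    sink = sys.stdout if sink is None else sink
    for address in unique_sorted_addresses(source):
        sink.write(address + "\n")


# Demo with in-memory streams instead of real files.
demo_in = io.StringIO("joe27@aol.com. foo@bar.com\nfoo@bar.com again\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Saved to a file (say extract_unique.py, a hypothetical name) with a call to main() at the bottom, it would be invoked the same way as the Perl file, with <test.txt >out.txt redirection. Note that Python sorts by character code, which may order mixed-case addresses differently from a locale-aware sort -u.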

HTH,
Paul
Jan 4 '10 #18

P: 1
I keep getting this error:

    syntax error at email.pl line 1, near "){"
    Can't find string terminator "'" anywhere before EOF at email.pl line 1

and I'm using this code:

    perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt
Jun 1 '10 #19

P: 2
How about this...

Suppose that for some reason the file has lost some spaces or gained extra letters, so that it contains cases like these:

onani12@yahoo.coms@ <--- or what about this ... lacama@yaho.co

In those cases I want to create
onani12@yahoo.com
and also keep the one that was caught, onani12@yahoo.coms (notice that it doesn't have the @ at the end),
and fix lacama@yaho.com (which is certainly a public email provider). But just in case, we want to keep the one we found,
lacama@yaho.com
and add
lacama@yahoo.com

Many years ago I had some code that did just that, and I'm going to try to find it. But if you have a regular expression or a short piece of code that can fix this, that would be wonderful!

Regards
Angelo
Sep 9 '10 #20

P: 2
Also, what about emails like...

sam.s.schuchts.34@packardbell.net.jp ?

This is very interesting, I love it!
Sep 9 '10 #21

P: 1
Hello,

This one-liner is great! I would love to exclude all email addresses ending in one specific domain from the output (e.g. don't print any address ending with thisdomain.com), is there an easy way to do that? Thanks!
Sep 19 '10 #22

P: n/a
Just pipe the output to e/f/grep with the "invert" option to exclude specific domains e.g. grep -v @DOMAIN
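
The same filtering step can be sketched in Python for anyone post-processing the list there. One difference worth labeling: grep -v @DOMAIN drops any line that contains the text anywhere (and treats "." as "any character"), whereas this sketch only drops addresses that end in exactly @domain. The function name drop_domain and the sample list are made up for illustration.

```python
def drop_domain(addresses, domain):
    """Keep only the addresses that do not end in @domain (case-insensitive)."""
    suffix = "@" + domain.lower()
    return [a for a in addresses if not a.lower().endswith(suffix)]


addresses = [
    "halle@thisdomain.com",
    "foo@bar.com",
    "Bob@ThisDomain.com",
    "x@notthisdomain.com",
]

print(drop_domain(addresses, "thisdomain.com"))
```

Only foo@bar.com and x@notthisdomain.com remain; the second is kept because its domain merely ends with the same letters.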
Oct 16 '10 #23

P: 1
Hi,

Great code. Unfortunately,

perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt

catches:

joeuser@earthlink.net...User

Can someone correct this???

Many thanks.
Jan 28 '11 #24

P: 1
Hi,
I know it's been a while since the last post, but I couldn't just leave it without an answer.
I made some small upgrades so that the code matches emails better.
I believe this is what you all were looking for:

    perl -ne'while(/[\w\.\-\_]+@([\w\-\_]+\.)+[A-Za-z]{2,4}/g){print "$&\n"}' *.txt

Have a nice mail extraction!
Aug 3 '11 #25

P: 1
$ egrep -o '[A-Za-z0-9_.]*[-]*[A-Za-z0-9_.]*@[A-Za-z0-9_.]*[.][A-Za-z]*' < filename

This should work in any Unix version. This should take care of all possibilities.
Feb 8 '12 #26

prn
Expert 100+
P: 254
I haven't really revisited this topic for a while now, but here are a few comments.

TIMTOWTDI, prashantva. :) That is, "There Is More Than One Way To Do It." Your way appears to be almost equivalent to what we had before. The only difference I note will show up in the case of more than one hyphen in a name. Check out the result from the "j-q-public" entry in the output of the perl program below versus the egrep re. Some experimentation with the re should fix that.

I'm not quite sure what Exenfris wants, but it looks like a database of valid domains. I don't think that's practical in a short program, certainly not in a one-liner. I suppose you could add a piece that checks whois for each domain. I'll happily leave that as an exercise for someone else. :) I can imagine adding a routine to check for valid domains to the perl script, but personally, I'm not prepared to try to correct those that turn out to be invalid.
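
For what it's worth, the "suggest a nearby known domain" part can be sketched with fuzzy string matching. The following Python is a rough illustration under loud assumptions: KNOWN_DOMAINS is a tiny hypothetical list (a real one would have to come from somewhere), the 0.8 cutoff is arbitrary, and suggest_domain is a name made up here. It only proposes a correction; it does not check that any domain actually exists.

```python
import difflib

# Hypothetical list of domains considered "known"; purely for illustration.
KNOWN_DOMAINS = ["yahoo.com", "gmail.com", "hotmail.com", "aol.com"]


def suggest_domain(address, known=KNOWN_DOMAINS, cutoff=0.8):
    """Return address with its domain replaced by the closest known domain.

    Returns None when the domain is already known or nothing is close enough.
    """
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    matches = difflib.get_close_matches(domain, known, n=1, cutoff=cutoff)
    if matches and matches[0] != domain:
        return f"{local}@{matches[0]}"
    return None


for found in ["lacama@yaho.com", "onani12@yahoo.coms", "foo@yahoo.com"]:
    print(found, "->", suggest_domain(found))
```

Keeping the address as found and adding the suggestion alongside it, as requested, is then just a matter of printing both.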

Halle asks for a way to exclude specific domains. Obi Wan suggests just piping the output through grep using "-v". That would work, but how convenient it would be depends a lot on how many domains you want to exclude and how often you want to run the search. Personally, when I start needing more features, I like to document them by including them into a single file, so I've added this feature into the perl script below. (Obviously, a shell script could work equally well.) For convenience the exclude pattern is prominently defined using the regex quote operator.

Carl win points out a problem when an address is followed/terminated by multiple dots. I've included a fix for this below too. I haven't seen a way to solve this one in a single re, but someone clever may be able to do that.

File: ExtractEmail.pl
    #! /usr/bin/perl
    use strict;
    use warnings;

    # exclude the following domains:
    my $exclude = qr/thisdomain\.com|otherdomain\.com/;

    # We need a variable here because we cannot assign to $1
    my $address;

    while (<STDIN>) {
      # (?!\w) ends the address without consuming a character, so an address
      # at the very end of the input (no trailing newline) is still matched
      while ( /([\w\.\-]+@[\w\.\-]+\.[A-Za-z]{2,4})(?!\w)/g ) {
        $address = $1;              # use a variable so we can modify if needed
        $address =~ s/\.\..*// ;    # terminate the address at multiple "."s
        print "$address\n" unless $address =~ $exclude;
      }
    }
    exit;

And here's a test file that should cover all the cases that have been discussed above.
File: emailtest.txt
    this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
    addresses inside, 00test@leo.bar.com, a text file with no
    particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
    on the format or locations of the 23skidoo@bar.co.uk addresses inside the
    file. And, let's not forget addresses with hyphens like j-q-public@foo.bar or
    underscores like j_q_public@foo.bar
    Also what about emails like sam.s.schuchts.34@packardbell.net.jp?
    Let's also say we want to exclude addresses like halle@thisdomain.com from a
    specific domain.
    Unfortunately, the simpler regex catches: joeuser@earthlink.net...User so what do we do?
    But it should not have a problem with: jimuser@earthlink.net ... User. right?
    Trailing stuff with hyphens or something else after the end of the actual address
    should be ruled out automatically as in nikt@frombork.pl--bad, right?
    Let's also exclude stuff from otherdomain as in halle@otherdomain.com, OK?
    This version also rules out TLDs that are too long as in foo@bar.topleveldomain.
    Let's try one at the end joe27@aol.com.
    And one at the very end without a final newline paul@mydomain.com
To test this script, you can use
    perl ExtractEmail.pl <emailtest.txt
I have tested the script with this data both in a command prompt window on MSWindows and in Linux.
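
For readers who would rather try the same logic in Python, here is a rough port. It is a sketch, not a tested equivalent of the Perl: re.ASCII is an assumption to keep \w to ASCII, the names ADDRESS, EXCLUDE, MULTI_DOT and extract are invented here, and it ends the address with a lookahead (?!\w) so that an address at the very end of the input, with no trailing newline, is still matched. The sample lines are taken from emailtest.txt.

```python
import re

# Address with a 2-4 letter TLD that is not followed by another word character.
ADDRESS = re.compile(r"([\w.\-]+@[\w.\-]+\.[A-Za-z]{2,4})(?!\w)", re.ASCII)
# Domains to leave out of the output.
EXCLUDE = re.compile(r"thisdomain\.com|otherdomain\.com")
# Everything from the first run of two dots onward.
MULTI_DOT = re.compile(r"\.\..*")


def extract(lines):
    """Yield each address found, cut at multiple dots, minus excluded domains."""
    for line in lines:
        for match in ADDRESS.finditer(line):
            address = MULTI_DOT.sub("", match.group(1))
            if not EXCLUDE.search(address):
                yield address


sample = [
    "addresses inside, 00test@leo.bar.com, a text file with no\n",
    "Let's also say we want to exclude addresses like halle@thisdomain.com from a\n",
    "Unfortunately, the simpler regex catches: joeuser@earthlink.net...User so what do we do?\n",
    "should be ruled out automatically as in nikt@frombork.pl--bad, right?\n",
    "This version also rules out TLDs that are too long as in foo@bar.topleveldomain.\n",
    "And one at the very end without a final newline paul@mydomain.com",
]

for address in extract(sample):
    print(address)
```

On these lines it prints four addresses: the excluded domain and the over-long TLD are dropped, and the "...User" and "--bad" tails are cut off.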

To test prashantva's egrep version for comparison, you can use
    egrep -o '[A-Za-z0-9_.]*[-]*[A-Za-z0-9_.]*@[A-Za-z0-9_.]*[.][A-Za-z]*' <emailtest.txt
Paul
Feb 8 '12 #27

P: 1
for i in `cat filename`
do
echo $i >> filename1
done
grep "@" filename

The above code will extract all the email Ids
Feb 15 '12 #28

prn
Expert 100+
P: 254
Let's adjust sun123's script to:
    for i in `cat emailtest.txt`
      do
        echo $i >> filename1
      done
    grep "@" filename1

This now uses the name of the file that we have been using as the original input, and the file to be grepped in the last line has been corrected from "filename" to "filename1", i.e., the output of the loop instead of the original input.

The "for" loop simply writes "words" from the original file to separate lines in an intermediate file.

sun123's corrected script can alternatively be simplified by eliminating the unnecessary temporary file:
    for i in `cat emailtest.txt`
     do
       echo $i | grep "@"
     done

The output from either version is:
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com,
    fname.lname@bar.baz.net
    par72@take.the.bus.au
    23skidoo@bar.co.uk
    j-q-public@foo.bar
    j_q_public@foo.bar
    sam.s.schuchts.34@packardbell.net.jp?
    halle@thisdomain.com
    joeuser@earthlink.net...User
    jimuser@earthlink.net
    nikt@frombork.pl--bad,
    halle@otherdomain.com,
    foo@bar.topleveldomain.
    joe27@aol.com.
    paul@mydomain.com

Note that the result incorrectly includes trailing punctuation, and not just dots, and fails to exclude foo@bar.topleveldomain (where "topleveldomain" is not a valid top level domain, being too long). Obviously further filters could be added, but this one is not there yet.

Best Regards,
Paul
Feb 15 '12 #29

P: 1
Excellent script.
I found one error: if an email address has two consecutive dots in it, such as

Paul.A..Foothead@mine.com

it will be displayed as Paul.A

I'm using this with sorting and uniq like this:

    perl extractemail.pl < myfile | sort | uniq > emaillist.txt

Running a quick sed from bash first, like this:

    sed -i -e 's/\.\./\./g' myfile

followed by

    perl extractemail.pl < myfile | sort | uniq > emaillist.txt

works great.
Apr 18 '12 #30
