Extract email addresses from a big file.

superc0red

Hey.

I have a big text file with data,
and I want to extract email addresses.

How can I do it?

May 17 '07 #1

arne

315

Expert 100+

I guess there are plenty of ways to do it. Any constraints on the tool/language?

May 17 '07 #2

superc0red

perl / shellscript using awk-sed-cut ??

May 17 '07 #3

arne

315

Expert 100+

Perl is certainly a reasonable choice, yes. If I had to do it, I would use it.

May 17 '07 #4

Motoma

3,237

Expert 2GB

Regular expressions would be a great way to do this. Try looking at the sed tool.

May 17 '07 #5

ghostdog74

511

Expert 256MB

awk '
{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /[[:alpha:]]@[[:alpha:]]/) {
      print $i
    }
  }
}' "file"

May 18 '07 #6

superc0red

Thanx for the code dude :)

May 18 '07 #7

prn

254

Expert 100+

It's been quite a while since I did anything with awk, so I wasn't sure how well ghostdog's code would work. It looked like it should handle only alphabetics with no more than one component on each side of the "@". So I made up a test file (test.txt):

this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
addresses inside, 00test@leo.bar.com, a text file with no
particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
on the format or locations of the 23skidoo@bar.co.uk addresses inside the file.
Let's try one at the end joe27@aol.com.

I ran ghostdog's awk script on this and got the output:

foo@bar.com
moo@drop.dhcp.bar.com
00test@leo.bar.com,
fname.lname@bar.baz.net
23skidoo@bar.co.uk

Note that this output has FIVE email addresses, but the file has SEVEN, so something is wrong. The two that are omitted have a digit immediately before the "@", so it looks like I was close but not quite right about how much awk would match with this RE. The script prints the whole whitespace-delimited field $i whenever that field matches /[[:alpha:]]@[[:alpha:]]/, that is, whenever a letter sits directly on each side of the "@".

But note that it also caught the comma following the third address "00test@leo.bar.com," which it should not include in the email address.

Here's a Perl one-liner:

perl -wne'while(/[\w\.]+@[\w\.]+/g){print "$&\n"}' test.txt

This gives the output

foo@bar.com
moo@drop.dhcp.bar.com
00test@leo.bar.com
fname.lname@bar.baz.net
par72@take.the.bus.au
23skidoo@bar.co.uk
joe27@aol.com.

which is almost correct (and does not include the comma following number 3, although it does include the period at the end).

Here's a corrected version:

perl -wne'while(/[\w\.]+@[\w\.]+\w+/g){print "$&\n"}' test.txt

This yields

foo@bar.com
moo@drop.dhcp.bar.com
00test@leo.bar.com
fname.lname@bar.baz.net
par72@take.the.bus.au
23skidoo@bar.co.uk
joe27@aol.com

I'm sure ghostdog74's awk script could also easily be fixed, but as I said, it's been a long time and I'm not sure how much I want to play with it. ;)

HTH,
Paul

May 21 '07 #8
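
For anyone who would rather stay with awk, here is a minimal sketch of one possible fix to the awk approach discussed above. It is not ghostdog74's or prn's own code. It assumes an address is a run of letters, digits, "_", "." and "-" around an "@" that ends in a letter or digit, and the function name extract_emails is made up for illustration. Instead of printing the whole field, it uses match() and substr() to print only the matched text, so digits next to the "@" are accepted and a trailing comma or period is dropped.

```shell
#!/bin/sh
# Sketch only: extract_emails is a hypothetical helper name.
# Reads the files given as arguments, or stdin when there are none.
extract_emails() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH to the leftmost-longest match in line
    while (match(line, /[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]*[A-Za-z0-9]/)) {
      print substr(line, RSTART, RLENGTH)
      # keep scanning the rest of the line after this match
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$@"
}

# Small demo on two lines taken from the test file in this thread
printf '%s\n' \
  'addresses inside, 00test@leo.bar.com, and par72@take.the.bus.au' \
  'one at the end joe27@aol.com.' | extract_emails
```

Explicit character ranges are used rather than [[:alnum:]] because some older awk builds do not support POSIX character classes.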

peripatetic

Hi.
Thanks for this. I was using it for a while and thought it was wonderful. However it misses the legitimate hyphen character within emails. Here's an updated version.

perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt

I also piped it through sort to get a sorted, unique list of emails.

Jul 27 '07 #9

Motoma

3,237

Expert 2GB

Great catch peripatetic! Thanks for the addition, and welcome to The Scripts!

Jul 27 '07 #10

HostQ8i

Guys, can this Perl script be used on websites? Can I replace the file with a web address? Or how else can I get the emails contained in a website?

Let's say www.domain.com/aa.php=1 has some emails saved inside
and www.domain.com/aa.php=2 also has some emails. How can I make a loop over all the aa.php=variable pages and get the emails from all of them?
Thanks in advance, and sorry for my English.

Feb 9 '08 #11

David Akpan

I have a big file with many email addresses. How do I extract only the email addresses? If possible, please include the software I can use.

Mar 19 '08 #12

Freakin

How would I use a script like this on a group of files that are in a directory to retrieve email addresses from all of them?

May 21 '08 #13

gpraghuram

1,275

Expert 1GB

Try to combine the find command with xargs and the perl script given here like this.

find . -name "*.txt" | xargs perl <script given here>

Raghu

May 22 '08 #14
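
A minimal sketch of the directory-wide idea above, using grep instead of the Perl one-liner so that it stands on its own. It assumes a grep that supports -o (print only the matched text) and -h (omit file names), which GNU and BSD grep do; the function name emails_in_dir, the pattern, and the demo file names are illustrative assumptions, not something from the thread.

```shell
#!/bin/sh
# Sketch only: emails_in_dir is a hypothetical helper name.
# Prints a sorted, de-duplicated list of address-like strings
# found in every .txt file under the directory given as $1.
emails_in_dir() {
  find "$1" -type f -name '*.txt' \
    -exec grep -ohE '[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]*[A-Za-z0-9]' {} + |
    sort -u
}

# Demo on a throwaway directory with made-up files
demo=$(mktemp -d)
printf 'write to foo@bar.com, please\n' > "$demo/a.txt"
printf 'foo@bar.com or j-q-public@foo.bar.\n' > "$demo/b.txt"
emails_in_dir "$demo"
rm -r "$demo"
```

Using find's own -exec ... {} + rather than a pipe into xargs avoids trouble with file names that contain spaces.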

RADEP

I tried the above example but it didn't work for me.

I got the following error:

C:\Documents and Settings\user\Desktop\abc\trunk\docs>perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}'db_em
ails.txt | sort -u > output.txt
Can't find string terminator "'" anywhere before EOF at -e line 1.
-uThe system cannot find the file specified.

Dec 28 '09 #15

prn

254

Expert 100+

Hi RADEP,

You are apparently trying to do this in an MS Windows environment rather than a *nix environment. The "Can't find string terminator" message is printed by Perl itself, but the cause lies in how the Windows command-line parser handles the command: cmd.exe does not treat single quotes as quoting characters, so Perl does not receive the one-liner as a single argument. I am going to assume (which may get me into trouble) that you entered the offending command on a single line. If so, the only other thing that leaps out at me is the lack of a space between the final ' and the apparent input file "db_em". Did you copy and paste your example directly from your command window? If so, the first thing I'd try would be to make sure that you do have a space there.

If that does not help, then we have to look a little further. The prior discussion here has been under a Linux/Unix assumption and the *nix shells do parse command lines differently from Windows. The Perl itself should be OK, especially with peripatetic's modification. However, you may have to run it differently. If the windows command-line parser can't handle this as a one-liner, you can always just put the Perl into a file (e.g., "ExtractEmail.pl") and then you should be able to run it that way as:

C:\...>perl ExtractEmail.pl <db_em

or the like.

Let us know if this meets your needs.

Paul

Dec 31 '09 #16

smcgimpsey

Hey Paul,

I'm in the same boat as RADEP. I'm very much able to use the script on my linux machine, but unable to once I try it on my windows vm.

I tried copying the code verbatim into a .pl file and running it from command line per your suggestion with a similar output to RADEP's experience.

As for the "'" terminator, I have no clue, but I am going to guess that windows will not support the 'sort -u' command near the end. What are your thoughts?

And by the way, thanks for everyone's help in this. It's forums like these that help me get through the work day. :)

Cheers,
Scott

Dec 31 '09 #17

prn

254

Expert 100+

Hi Scott,

I'm afraid I was too lazy the other day.

If you're going to create a file to do the same job, you have to do the read from STDIN explicitly. So the file ExtractEmail.pl could look like this:

while (<STDIN>) {
    while (/[\w\.\-]+@[\w\.\-]+\w+/g) {
        print "$&\n";
    }
}

Then you can invoke it like this:

C:\...>perl ExtractEmail.pl <test.txt >out.txt

Of course, the sorting as in a *nix environment is not available in the native windows environment. There are several ways to get the capability. You could install cygwin, which is a port of the bash shell with utilities including sort. (Then you should be able to use the original one-liner.) If you search for "windows unix sort" you should find some advice (which I have not tested) on other ports of the sort utility.

HTH,
Paul

Jan 4 '10 #18

mbstrlbstr

I keep getting this error:

syntax error at email.pl line 1, near "){"
Can't find string terminator "'" anywhere before EOF at email.pl line 1

and I'm using this code:

perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt

Jun 1 '10 #19

Exenfris

How about this...

Suppose the file has lost some spaces or picked up extra letters, so that you get cases like these:

onani12@yahoo.coms@ <--- or what about this ... lacama@yaho.co

In those cases I want to produce onani12@yahoo.com in addition to the address the script catches, onani12@yahoo.coms (notice it doesn't have the @ at the end).
I also want to fix lacama@yaho.com (which is certainly a public email provider), but just in case, keep the one we found:
lacama@yaho.com
and add
lacama@yahoo.com

Many years ago I had code that did just that. I'm going to try to find it, but if you have a regular expression or short code that can do it, that would be wonderful!

Regards
Angelo

Sep 9 '10 #20

Exenfris

Also what about emails like ...

sam.s.schuchts.34@packardbell.net.jp ?

this is very interesting i love it !

Sep 9 '10 #21

Halle

Hello,

This one-liner is great! I would love to exclude all email addresses ending in one specific domain from the output (e.g. don't print any address ending with thisdomain.com), is there an easy way to do that? Thanks!

Sep 19 '10 #22

Obi Wan

Just pipe the output to e/f/grep with the "invert" option to exclude specific domains e.g. grep -v @DOMAIN

Oct 16 '10 #23
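
A minimal sketch of that grep -v filter. The domain names are placeholders, and the function name exclude_domains is made up for illustration. Anchoring the pattern with $ and escaping the dots is an addition here: a bare "grep -v @DOMAIN" would also drop addresses that merely contain the domain somewhere, such as one ending in .com.au.

```shell
#!/bin/sh
# Sketch only: exclude_domains is a hypothetical helper name and the
# two domains are placeholders. Reads one address per line on stdin.
# -v inverts the match, -i ignores case, -E enables the (a|b) alternation.
# [@.] lets the filter catch subdomains of the listed domains as well.
exclude_domains() {
  grep -viE '[@.](thisdomain\.com|otherdomain\.com)$'
}

printf '%s\n' foo@bar.com halle@thisdomain.com joe@otherdomain.com.au |
  exclude_domains
```

In this demo only halle@thisdomain.com is dropped; joe@otherdomain.com.au survives because its domain does not end in otherdomain.com.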

carl win

Hi,

Great code. Unfortunately,

perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt

catches:

joeuser@earthlink.net...User

Can someone correct this???

Many thanks.

Jan 28 '11 #24

MaGiK

Hi,
I know it's been a while since the last post, but I couldn't just leave it without an answer.
I made some small upgrades so that the code matches emails better.
I believe this is what you all were looking for:

perl -ne'while(/[\w\.\-\_]+@([\w\-\_]+\.)+[A-Za-z]{2,4}/g){print "$&\n"}' *.txt

Have a nice mail extraction!

Aug 3 '11 #25

prashantva

$ egrep -o '[A-Za-z0-9_.]*[-]*[A-Za-z0-9_.]*@[A-Za-z0-9_.]*[.][A-Za-z]*' < filename

This should work in any Unix version. This should take care of all possibilities.

Feb 8 '12 #26

prn

254

Expert 100+

I haven't really revisited this topic for a while now, but here are a few comments.

TIMTOWTDI, prashantva. :) That is, "There Is More Than One Way To Do It." Your way appears to be almost equivalent to what we had before. The only difference I note will show up in the case of more than one hyphen in a name. Check out the result from the "j-q-public" entry in the output of the perl program below versus the egrep re. Some experimentation with the re should fix that.

I'm not quite sure what Exenfris wants, but it looks like a database of valid domains. I don't think that's practical in a short program, certainly not in a one-liner. I suppose you could add a piece that checks whois for each domain. I'll happily leave that as an exercise for someone else. :) I can imagine adding a routine to check for valid domains to the perl script, but personally, I'm not prepared to try to correct those that turn out to be invalid.

Halle asks for a way to exclude specific domains. Obi Wan suggests just piping the output through grep using "-v". That would work, but how convenient it would be depends a lot on how many domains you want to exclude and how often you want to run the search. Personally, when I start needing more features, I like to document them by including them into a single file, so I've added this feature into the perl script below. (Obviously, a shell script could work equally well.) For convenience the exclude pattern is prominently defined using the regex quote operator.

Carl win points out a problem when an address is followed/terminated by multiple dots. I've included a fix for this below too. I haven't seen a way to solve this one in a single re, but someone clever may be able to do that.

File: ExtractEmail.pl

#! /usr/bin/perl
use strict;
use warnings;

# Exclude addresses in the following domains:
my $exclude = qr/thisdomain\.com|otherdomain\.com/;

# We need a variable here because we cannot assign to $1
my $address;

while (<STDIN>) {
  # \b rather than \W, so an address at the very end of the input still matches
  while ( /([\w.\-]+@[\w.\-]+\.[A-Za-z]{2,4})\b/g ) {
    $address = $1;              # use a variable so we can modify if needed
    $address =~ s/\.\..*// ;    # terminate the address at multiple "."s
    print "$address\n" unless $address =~ $exclude;
  }
}

exit;

And here's a test file that should cover all the cases that have been discussed above.
File: emailtest.txt

this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
addresses inside, 00test@leo.bar.com, a text file with no
particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
on the format or locations of the 23skidoo@bar.co.uk addresses inside the
file. And, let's not forget addresses with hyphens like j-q-public@foo.bar or
underscores like j_q_public@foo.bar
Also what about emails like sam.s.schuchts.34@packardbell.net.jp?
Let's also say we want to exclude addresses like halle@thisdomain.com from a
specific domain.
Unfortunately, the simpler regex catches: joeuser@earthlink.net...User so what do we do?
But it should not have a problem with: jimuser@earthlink.net ... User. right?
Trailing stuff with hyphens or something else after the end of the actual address
should be ruled out automatically as in nikt@frombork.pl--bad, right?
Let's also exclude stuff from otherdomain as in halle@otherdomain.com, OK?
This version also rules out TLDs that are too long as in foo@bar.topleveldomain.
Let's try one at the end joe27@aol.com.
And one at the very end without a final newline paul@mydomain.com

To test this script, you can use

perl ExtractEmail.pl <emailtest.txt

I have tested the script with this data both in a command prompt window on MSWindows and in Linux.

To test prashantva's egrep version for comparison, you can use

egrep -o '[A-Za-z0-9_.]*[-]*[A-Za-z0-9_.]*@[A-Za-z0-9_.]*[.][A-Za-z]*' <emailtest.txt

Paul

Feb 8 '12 #27

sun123

for i in `cat filename`
do
echo $i >> filename1
done
grep "@" filename

The above code will extract all the email Ids

Feb 15 '12 #28

prn

254

Expert 100+

Let's adjust sun123's script to:

for i in `cat emailtest.txt`
do
  echo "$i" >> filename1
done
grep "@" filename1

This now uses the name of the file that we have been using for the original input, and the file to be grepped in the last line has been corrected from "filename" to "filename1", i.e., the output of the loop instead of the original input.

The "for" loop simply writes "words" from the original file to separate lines in an intermediate file.

sun123's corrected script can alternatively be simplified by eliminating the unnecessary temporary file:

for i in `cat emailtest.txt`
do
  echo "$i" | grep "@"
done

The output from either version is:

foo@bar.com
moo@drop.dhcp.bar.com
00test@leo.bar.com,
fname.lname@bar.baz.net
par72@take.the.bus.au
23skidoo@bar.co.uk
j-q-public@foo.bar
j_q_public@foo.bar
sam.s.schuchts.34@packardbell.net.jp?
halle@thisdomain.com
joeuser@earthlink.net...User
jimuser@earthlink.net
nikt@frombork.pl--bad,
halle@otherdomain.com,
foo@bar.topleveldomain.
joe27@aol.com.
paul@mydomain.com

Note that the result incorrectly includes trailing punctuation, and not just dots, and fails to exclude foo@bar.topleveldomain (where "topleveldomain" is not a valid top level domain, being too long). Obviously further filters could be added, but this one is not there yet.

Best Regards,
Paul

Feb 15 '12 #29
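
As a rough illustration of the "further filters" mentioned above, here is a minimal sketch that replaces the for loop with tr and adds one filter that trims trailing punctuation. The function name words_with_at is made up for illustration. It deliberately stays simple: it does not shorten joeuser@earthlink.net...User or nikt@frombork.pl--bad, and it does not reject over-long top-level domains.

```shell
#!/bin/sh
# Sketch only: words_with_at is a hypothetical helper name.
# 1. tr puts every whitespace-separated word on its own line
# 2. grep keeps the words that contain "@"
# 3. sed trims any run of non-alphanumeric characters from the end
words_with_at() {
  tr -s '[:space:]' '\n' | grep '@' | sed 's/[^A-Za-z0-9]*$//'
}

printf '%s\n' \
  'addresses inside, 00test@leo.bar.com, a text file' \
  'like sam.s.schuchts.34@packardbell.net.jp? or joe27@aol.com.' |
  words_with_at
```

Reading from stdin also removes the need for the intermediate file and for the backtick-and-cat loop.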

mdnuts

Excellent script.
I found one error if an email address has two consecutive dots in it, such as:

Paul.A..Foothead@mine.com - will display as Paul.A

I'm using this with sorting and uniq like this.

perl extractemail.pl < myfile | sort | uniq > emaillist.txt

Performing a quick sed pass like this

sed -i -e 's/\.\./\./g' myfile

followed by the

perl extractemail.pl < myfile | sort | uniq > emaillist.txt

works great

Apr 18 '12 #30
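
The sed -i pass above rewrites the input file in place. A possible alternative, sketched here under the assumption that candidate addresses arrive one per line on stdin, is to cut at a doubled dot only when it comes after the "@". The function name trim_after_domain_dots is made up for illustration, and this is a stand-alone filter rather than a change to ExtractEmail.pl.

```shell
#!/bin/sh
# Sketch only: trim_after_domain_dots is a hypothetical helper name.
# The captured group runs from "@" to the last non-dot character that is
# directly followed by ".."; everything from the ".." onward is dropped.
# Doubled dots before the "@" are left untouched.
trim_after_domain_dots() {
  sed 's/\(@[^@]*[^@.]\)\.\..*/\1/'
}

printf '%s\n' 'Paul.A..Foothead@mine.com' 'joeuser@earthlink.net...User' |
  trim_after_domain_dots
```

With this filter the first address passes through unchanged and the second is cut back to joeuser@earthlink.net, while the source file stays as it was.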
