Hey.
I have a big text file of data,
and I want to extract the email addresses from it.
How can I do it?
arne 315
Expert 100+
I guess there are plenty of ways to do it. Any constraints on the tool/language?
Perl, or a shell script using awk, sed, and cut?
arne 315
Expert 100+
Perl is certainly a reasonable choice, yes. If I had to do it, I would use it.
Regular expressions would be a great way to do this. Try looking at the sed tool.
    awk '
    {
        for (i=1;i<=NF;i++) {
            if ( $i ~ /[[:alpha:]]@[[:alpha:]]/ ) {
                print $i
            }
        }
    }' "file"
Thanks for the code, dude :)
prn 254
Expert 100+
It's been quite a while since I did anything with awk, so I wasn't sure how well ghostdog's code would work. It looked like it should handle only alphabetics with no more than one component on each side of the "@". So I made up a test file (test.txt):
    this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
    addresses inside, 00test@leo.bar.com, a text file with no
    particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
    on the format or locations of the 23skidoo@bar.co.uk addresses inside the file.
    Let's try one at the end joe27@aol.com.
I ran ghostdog's awk script on this and got the output:
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com,
    fname.lname@bar.baz.net
    23skidoo@bar.co.uk
Note that this output has FIVE email addresses, but the file has SEVEN so there is something wrong. The two that are omitted have digits just beside the "@" so it looks like I was close but not quite right on how much awk would match with this RE. It catches everything between spaces into $i whenever it matches /[[:alpha:]]@[[:alpha:]]/
But note that it also caught the comma following the third address "00test@leo.bar.com," which it should not include in the email address.
Here's a Perl one-liner:
    perl -wne'while(/[\w\.]+@[\w\.]+/g){print "$&\n"}' test.txt
This gives the output:
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com
    fname.lname@bar.baz.net
    par72@take.the.bus.au
    23skidoo@bar.co.uk
    joe27@aol.com.
which is almost correct (and does not include the comma following number 3, although it does include the period at the end).
Here's a corrected version:
    perl -wne'while(/[\w\.]+@[\w\.]+\w+/g){print "$&\n"}' test.txt
This yields:
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com
    fname.lname@bar.baz.net
    par72@take.the.bus.au
    23skidoo@bar.co.uk
    joe27@aol.com
I'm sure ghostdog74's awk script could also easily be fixed, but as I said, it's been a long time and I'm not sure how much I want to play with it. ;)
HTH,
Paul
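For anyone who would rather stay with awk, here is one possible fix, offered as a sketch rather than as ghostdog74's own correction. Instead of printing whole whitespace-separated fields, it uses awk's match() in a loop to pull out just the address-like part of each line. The function name extract_emails_awk and the character classes are my own assumptions about what should count as part of an address.

```shell
# Sketch: pull address-like tokens out of each line with awk's match(),
# instead of printing whole whitespace-separated fields.
extract_emails_awk() {
  awk '{
    line = $0
    # The pattern must end in a letter or digit, so a trailing "." or "," is dropped.
    while (match(line, /[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]*[A-Za-z0-9]/)) {
      print substr(line, RSTART, RLENGTH)
      line = substr(line, RSTART + RLENGTH)   # continue after this match
    }
  }' "$@"
}

# Demo on two lines from test.txt:
printf '%s\n' \
  'particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions' \
  "Let's try one at the end joe27@aol.com." | extract_emails_awk
```

Because digits are in the character class, addresses such as par72@take.the.bus.au are no longer skipped. To run it on a file, call extract_emails_awk test.txt.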
Hi.
Thanks for this. I was using it for a while and thought it was wonderful. However, it misses the legitimate hyphen character within emails. Here's an updated version:
    perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt
I also piped it through sort to get a sorted, unique list of emails.
Great catch peripatetic! Thanks for the addition, and welcome to The Scripts!
Guys, can this Perl script be used on websites? Can I just replace the file with a web address, or how else can I get the emails contained in a website?
Let's say www.domain.com/aa.php=1 has some emails saved inside
and www.domain.com/aa.php=2 also has some. How can I make a loop that goes through every aa.php=variable page and gets the emails from all of them?
Thanks in advance, and sorry for my English.
I have a big file with many email addresses. How do I extract only the email addresses? If possible, please mention the software I can use.
How would I use a script like this on a group of files that are in a directory to retrieve email addresses from all of them?
Try to combine the find command with xargs and the perl script given here like this.
find . -name "*.txt" | xargs perl <script given here>
Raghu
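A rough sketch of the directory case follows. Two things in it are my own choices, not part of the suggestion above: grep -E stands in for the Perl one-liner, and find's -exec ... {} + is used instead of a pipe into xargs, because plain xargs splits file names on whitespace. The function name extract_from_dir is hypothetical.

```shell
# Sketch: print address-like strings from every *.txt file under a directory.
# grep -E stands in for the Perl one-liner here (an assumption, not the thread's script).
extract_from_dir() {
  dir=${1:-.}
  # -o prints only the matching part, -h suppresses the file-name prefix.
  find "$dir" -type f -name '*.txt' -exec \
    grep -ohE '[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]*[A-Za-z0-9]' {} + | sort -u
}

# Usage: extract_from_dir /path/to/docs > output.txt
```

The same find invocation should also work with the Perl version by replacing the grep command with perl -wne followed by the quoted one-liner.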
I tried the above example but it didn't work for me.
I got the following error:
C:\Documents and Settings\user\Desktop\abc\trunk\docs>perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}'db_em
ails.txt | sort -u > output.txt
Can't find string terminator "'" anywhere before EOF at -e line 1.
-uThe system cannot find the file specified.
prn 254
Expert 100+
Hi RADEP,
You are apparently trying to do this in a MS Windows environment rather than a *nix environment. The error you are seeing comes from the Windows command-line parser. I am going to assume (which may get me into trouble) that you entered the offending command on a single line. If so, then the only thing that leaps out at me is the lack of space between the final ' and the apparent input file "db_em". Did you copy and paste your example directly from your command window? If so, the first thing I'd try would be to make sure that you do have a space there.
If that does not help, then we have to look a little further. The prior discussion here has been under a Linux/Unix assumption and the *nix shells do parse command lines differently from Windows. The Perl itself should be OK, especially with peripatetic's modification. However, you may have to run it differently. If the windows command-line parser can't handle this as a one-liner, you can always just put the Perl into a file (e.g., "ExtractEmail.pl") and then you should be able to run it that way as:
C:\...>perl ExtractEmail.pl <db_em
or the like.
Let us know if this meets your needs.
Paul
Hey Paul,
I'm in the same boat as RADEP. I'm very much able to use the script on my linux machine, but unable to once I try it on my windows vm.
I tried copying the code verbatim into a .pl file and running it from command line per your suggestion with a similar output to RADEP's experience.
As for the "'" terminator, I have no clue, but I am going to guess that windows will not support the 'sort -u' command near the end. What are your thoughts?
And by the way, thanks for everyone's help in this. It's forums like these that help me get through the work day. :)
Cheers,
Scott
prn 254
Expert 100+
Hi Scott,
I'm afraid I was too lazy the other day.
If you're going to create a file to do the same job, you have to do the read from STDIN explicitly. So the file ExtractEmail.pl could look like this:
    while (<STDIN>) {
        while (/[\w\.\-]+@[\w\.\-]+\w+/g)
            {print "$&\n"}
    }
Then you can invoke it like this:
    C:\...>perl ExtractEmail.pl <test.txt >out.txt
Of course, sorting as in a *nix environment is not available in the native Windows environment. There are several ways to get the capability. You could install Cygwin, which provides the bash shell along with Unix utilities including sort. (Then you should be able to use the original one-liner.) If you search for "windows unix sort" you should find some advice (which I have not tested) on other ports of the sort utility.
HTH,
Paul
I keep getting this error:
    syntax error at email.pl line 1, near "){"
    Can't find string terminator "'" anywhere before EOF at email.pl line 1
and I'm using this code:
    perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt
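The "at email.pl line 1" part of that message suggests the whole command line was saved into email.pl, so Perl is trying to parse "perl -wne'..." as Perl code. If that is what happened, the file should hold only the Perl statements. Here is a minimal sketch, assuming a *nix shell and perl on the PATH; write_email_pl is a hypothetical helper name. Keeping the code in a file also sidesteps shell quoting, which matters on Windows because cmd.exe does not treat single quotes as quoting characters.

```shell
# Sketch: write only the Perl code (not the shell command line) into the script file.
write_email_pl() {
  cat > "$1" <<'EOF'
# Print every address-like match on each input line.
# <> reads the files named on the command line, or STDIN if there are none.
while (<>) {
    while (/[\w.\-]+@[\w.\-]+\w+/g) {
        print "$&\n";
    }
}
EOF
}

# Usage (assumes emails.txt exists):
#   write_email_pl email.pl
#   perl email.pl emails.txt | sort -u > output.txt
```

The quoted 'EOF' delimiter keeps the shell from expanding $& inside the here-document.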
How about this:
suppose the file has lost some spaces or picked up extra letters, so that you get cases like onani12@yahoo.coms@ or lacama@yaho.co
In the first case I want to output onani12@yahoo.com as well as the address the script actually catches, onani12@yahoo.coms (note that it doesn't have the @ at the end).
In the second case I want to fix lacama@yaho.com (the intended domain is certainly a public email provider), but just in case, keep the address we found, lacama@yaho.com,
and add lacama@yahoo.com
Many years ago I had code that did just that, and I'm going to try to find it, but if you have a regular expression or a short piece of code that can do this, that would be wonderful!
Regards
Angelo
Hello,
This one-liner is great! I would love to exclude all email addresses ending in one specific domain from the output (e.g. don't print any address ending with thisdomain.com), is there an easy way to do that? Thanks!
Just pipe the output through grep (or egrep/fgrep) with the "invert" option, -v, to exclude specific domains, e.g. grep -v @DOMAIN
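As a small sketch of that suggestion: the pattern below is anchored between the "@" and the end of the line so that a look-alike domain is not dropped by accident. The function name exclude_domains is made up for illustration, and otherdomain.com is only a second placeholder next to thisdomain.com.

```shell
# Sketch: drop addresses whose domain is on an exclude list.
# thisdomain.com and otherdomain.com are placeholder names.
exclude_domains() {
  # -v inverts the match, -i ignores case, -E enables the (a|b) alternation.
  grep -viE '@(thisdomain\.com|otherdomain\.com)$'
}

printf '%s\n' 'foo@bar.com' 'halle@thisdomain.com' 'bob@notthisdomain.com' | exclude_domains
```

This expects one address per line, as the one-liner prints them. Addresses at subdomains such as mail.thisdomain.com are kept; whether that is wanted depends on the use case.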
Hi,
Great code. Unfortunately,
perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt
catches: joeuser@earthlink.net...User
Can someone correct this???
Many thanks.
Hi,
I know it's been a while since the last post, but I couldn't just leave it without an answer.
I made some small upgrades so that the code matches emails better.
I believe this is what you all were looking for:
    perl -ne'while(/[\w\.\-\_]+@([\w\-\_]+\.)+[A-Za-z]{2,4}/g){print "$&\n"}' *.txt
Have a nice mail extraction!
$ egrep -o [A-Za-z0-9_.]*'[-]'*[A-Za-z0-9_.]*@[A-Za-z0-9_.]*[.][A-Za-z]* < filename
This should work in any Unix version. This should take care of all possibilities.
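One caution with the command as posted: the pattern is mostly unquoted, so the shell may try to glob-expand the brackets and stars before egrep ever sees them. Below is a quoted variant, offered as a sketch and not as the poster's exact expression. It puts "-" inside the character class so that names with several hyphens match, and it requires at least one dot in the domain. The function name extract_emails_grep is hypothetical, and grep -E is used in place of egrep.

```shell
# Sketch: quoted pattern so the shell cannot glob-expand it,
# with "-" allowed anywhere in the local part.
extract_emails_grep() {
  grep -oE '[A-Za-z0-9_.-]+@([A-Za-z0-9_-]+\.)+[A-Za-z]+' "$@"
}

printf '%s\n' \
  'hyphens like j-q-public@foo.bar or underscores like j_q_public@foo.bar' \
  'catches: joeuser@earthlink.net...User so what' | extract_emails_grep
```

Note that this pattern does not limit the length of the final domain label, so it will not cover every case either.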
prn 254
Expert 100+
I haven't really revisited this topic for a while now, but here are a few comments.
TIMTOWTDI, prashantva. :) That is, "There Is More Than One Way To Do It." Your way appears to be almost equivalent to what we had before. The only difference I note will show up in the case of more than one hyphen in a name. Check out the result from the "j-q-public" entry in the output of the perl program below versus the egrep re. Some experimentation with the re should fix that.
I'm not quite sure what Exenfris wants, but it looks like a database of valid domains. I don't think that's practical in a short program, certainly not in a one-liner. I suppose you could add a piece that checks whois for each domain. I'll happily leave that as an exercise for someone else. :) I can imagine adding a routine to check for valid domains to the perl script, but personally, I'm not prepared to try to correct those that turn out to be invalid.
Halle asks for a way to exclude specific domains. Obi Wan suggests just piping the output through grep using "-v". That would work, but how convenient it would be depends a lot on how many domains you want to exclude and how often you want to run the search. Personally, when I start needing more features, I like to document them by including them into a single file, so I've added this feature into the perl script below. (Obviously, a shell script could work equally well.) For convenience the exclude pattern is prominently defined using the regex quote operator.
Carl win points out a problem when an address is followed/terminated by multiple dots. I've included a fix for this below too. I haven't seen a way to solve this one in a single re, but someone clever may be able to do that.
File: ExtractEmail.pl
    #! /usr/bin/perl
    use strict;

    #exclude the following domains:
    my $exclude = qr/thisdomain\.com|otherdomain\.com/o;

    # We need a variable here because we cannot assign to $1
    my $address;

    while (<STDIN>) {
        # (?!\w) ends the match before a non-word character or at end of input,
        # so an address on a final line without a newline is still caught
        while ( /([\w\.\-]+@[\w+\.\-]+\.[A-Za-z]{2,4})(?!\w)/g ) {
            $address = $1;            # use a variable so we can modify if needed
            $address =~ s/\.\..*// ;  # terminate the address at multiple "."s
            print "$address\n" unless $address =~ $exclude;
        }
    }

    exit;
And here's a test file that should cover all the cases that have been discussed above.
File: emailtest.txt
    this is a test file foo@bar.com we are looking for moo@drop.dhcp.bar.com email
    addresses inside, 00test@leo.bar.com, a text file with no
    particular fname.lname@bar.baz.net other par72@take.the.bus.au restrictions
    on the format or locations of the 23skidoo@bar.co.uk addresses inside the
    file. And, let's not forget addresses with hyphens like j-q-public@foo.bar or
    underscores like j_q_public@foo.bar
    Also what about emails like sam.s.schuchts.34@packardbell.net.jp?
    Let's also say we want to exclude addresses like halle@thisdomain.com from a
    specific domain.
    Unfortunately, the simpler regex catches: joeuser@earthlink.net...User so what do we do?
    But it should not have a problem with: jimuser@earthlink.net ... User. right?
    Trailing stuff with hyphens or something else after the end of the actual address
    should be ruled out automatically as in nikt@frombork.pl--bad, right?
    Let's also exclude stuff from otherdomain as in halle@otherdomain.com, OK?
    This version also rules out TLDs that are too long as in foo@bar.topleveldomain.
    Let's try one at the end joe27@aol.com.
    And one at the very end without a final newline paul@mydomain.com
To test this script, you can use:
    ExtractEmail.pl <emailtest.txt
I have tested the script with this data both in a command prompt window on MS Windows and in Linux.
To test prashantva's egrep version for comparison, you can use:
    egrep -o [A-Za-z0-9_.]*'[-]'*[A-Za-z0-9_.]*@[A-Za-z0-9_.]*[.][A-Za-z]* <emailtest.txt
Paul
for i in `cat filename`
do
echo $i >> filename1
done
grep "@" filename
The above code will extract all the email Ids
prn 254
Expert 100+
Let's adjust sun123's script to:
    for i in `cat emailtest.txt`
    do
        echo $i >> filename1
    done
    grep "@" filename1
This now uses the name of the file that we have been using as the original input, and the file to be grepped in the last line has been corrected from "filename" to "filename1", i.e., the output of the loop instead of the original input.
The "for" loop simply writes "words" from the original file to separate lines in an intermediate file.
sun123's corrected script can alternatively be simplified by eliminating the unnecessary temporary file:
    for i in `cat emailtest.txt`
    do
        echo $i | grep "@"
    done
The output from either version is:
    foo@bar.com
    moo@drop.dhcp.bar.com
    00test@leo.bar.com,
    fname.lname@bar.baz.net
    par72@take.the.bus.au
    23skidoo@bar.co.uk
    j-q-public@foo.bar
    j_q_public@foo.bar
    sam.s.schuchts.34@packardbell.net.jp?
    halle@thisdomain.com
    joeuser@earthlink.net...User
    jimuser@earthlink.net
    nikto@frombork.pl--bad,
    halle@otherdomain.com,
    foo@bar.topleveldomain.
    joe27@aol.com.
    paul@mydomain.com
Note that the result incorrectly includes trailing punctuation, and not just dots, and fails to exclude foo@bar.topleveldomain (where "topleveldomain" is not a valid top level domain, being too long). Obviously further filters could be added, but this one is not there yet.
Best Regards,
Paul
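As one example of such a further filter, here is a sketch that keeps the word-splitting idea but trims the obvious leftovers. It is only a partial fix under my own assumptions: tr does the splitting (avoiding the glob expansion that an unquoted `cat` in a for loop is subject to), and two sed expressions cut at a run of dots and strip trailing punctuation. The function name words_with_at is hypothetical.

```shell
# Sketch: split input into one word per line, keep words containing "@",
# then trim what the plain grep "@" leaves behind.
words_with_at() {
  tr -s '[:space:]' '\n' |
    grep '@' |
    sed -e 's/\.\..*$//' -e 's/[^A-Za-z0-9]*$//'
}

printf '%s\n' \
  'addresses inside, 00test@leo.bar.com, a text file with no' \
  'catches: joeuser@earthlink.net...User so what do we do?' \
  'Let us try one at the end joe27@aol.com.' | words_with_at
```

It still does nothing about trailing text joined by hyphens, and it does not check the length of the top level domain, so it remains less strict than the Perl script.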
Excellent script
I found one error: if an email address has two consecutive dots in it, such as
Paul.A..Foothead@mine.com, it will display as Paul.A
I'm using the script with sort and uniq like this:
    perl extractemail.pl < myfile | sort | uniq > emaillist.txt
Running a quick sed first, like this:
    sed -i -e 's/\.\./\./g' myfile
followed by
    perl extractemail.pl < myfile | sort | uniq > emaillist.txt
works great.
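A variation on the same idea, as a sketch: do the dot cleanup in the pipeline so that myfile itself is left untouched. The expression here differs slightly from the one above, which turns a run of three dots into two; this one collapses any run of two or more dots into one. The function name collapse_dots is made up for illustration.

```shell
# Sketch: collapse runs of dots on the fly instead of editing myfile in place.
collapse_dots() {
  sed 's/\.\.\.*/./g' "$@"
}

printf '%s\n' 'Paul.A..Foothead@mine.com' 'joeuser@earthlink.net...User' | collapse_dots

# Usage with the extraction script (sort -u is equivalent to sort | uniq):
#   collapse_dots myfile | perl extractemail.pl | sort -u > emaillist.txt
```

The trade-off is visible in the second demo line: once the dots are collapsed, the trailing "User" is joined to the address by a single dot, so the extractor may pick it up as part of the domain.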