Philip Ronan wrote:
"Dave Anderson" wrote:
Philip Ronan wrote:
I should also mention that your algorithm wouldn't have any success
extracting my email address. I won't say *why* it would fail, but you need
to think a bit harder about what the "obvious alternatives" actually are.
The only URL I've found you giving in this thread is
<http://vzone.virgin.net/phil.ronan/>, and I don't see any trace of an
email address there (obfuscated or not). If you want to challenge me,
kindly provide a URL. I've been arguing based on the example you gave a
while back
OK, I was actually referring to my business site which is here:
<http://www.japanesetranslator.co.uk/>
You could have found this quite easily yourself, but I didn't want to give
it to you straight away just to emphasize the point that your "obvious
alternatives" aren't as obvious as you might think.
I probably could have, but I've got better things to do than spend time
searching the net for a site where you couldn't be bothered to provide a
URL. Given that whether or not *I* find a site is totally irrelevant to
whether or not an address-harvesting tool will find it, your coyness
doesn't accomplish what you seem to think it would.
Now that I've seen the page, I know that what you've done is corrupt the
address in a way which you assume a person will figure out how to
reverse (in this case, by inserting whitespace where it's not allowed).
I just visited another site this morning where the author included a mailto
link with "@" replaced by "_at_". You wouldn't have found that one either.
There are probably thousands of other ways of concealing email addresses in
a similar manner. If you write an algorithm that extracts all of them, then
I expect your false hit rate is going to be extremely high.
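For illustration only, here is a sketch (in Python, with a made-up address) of the "_at_" substitution scheme mentioned above; this is a guess at the general technique, not any particular site's code:

```python
# Hypothetical address; the "@"-to-"_at_" substitution described above.
address = "phil@example.co.uk"
concealed = address.replace("@", "_at_")
print(concealed)  # phil_at_example.co.uk
```

A human reader reverses the substitution mentally; a harvester has to guess which of the many possible substitutions was used, which is exactly where the false-hit problem comes from.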
I'm sure it's what you expect, but it's probably not what would happen.
[Well, extracting *all* of them is unlikely -- but extracting
essentially all of the commonly-used schemes ought to be practical.]
I don't usually pay much attention to how people obfuscate addresses,
but this thread has started me thinking about it. If I wanted to write
an address extractor, I'd start off by doing some research (including
setting a spider loose to bring back a whole bunch of stuff for analysis).
Based only on what I know now, what I'd do is:
1) read each page and convert each entity to its proper character.
2) remove all element start and end tags, saving any attribute-value
strings within them for processing later in the same way as the page
content; for elements whose content is handled out-of-line (e.g., TITLE,
SCRIPT), also save the element's content for processing later.
3) add a single space at the start and end of the text, and collapse
each sequence of whitespace characters into a single space.
4) for each occurrence of '@' or 'at' (upper, lower, or mixed case, or
using non-ASCII Unicode characters which look like 'a' or 't'), reset
the list of valid domains, then search backward for a username candidate
and forward for a hostdomain candidate; skip this occurrence unless
plausible candidates for both are found.
a) in this section, ignore any space character immediately adjacent to
the subject text. First, remove any matching brackets surrounding the
subject text: {} / [] / <> / (), plus analogous non-ASCII Unicode
characters. Also remove any punctuation character which occurs on both
sides of the subject text. Repeat until all such pairs have been
removed, collapsing any sequences of multiple spaces to single spaces.
Initially set a flag to 'true' if the subject text is a punctuation
character, false otherwise; set the flag to true whenever a pair of
punctuation characters is removed.
b) If the flag is false and the character immediately adjacent on either
side of the subject text is not a space, skip this occurrence.
c) scan forward for the first occurrence of '.' or 'dot' (upper, lower,
or mixed case, or using non-ASCII Unicode characters which look
like 'd', 'o', or 't'); if none is found (or if the number of characters
scanned over which are allowed in domain name segments exceeds 63), skip
to scanning for the next 'at'. Process the subject text as in 4(a). If
the flag is false and the character immediately adjacent on either side
of the subject text is not a space, skip to scanning for the next 'at'.
d) Remove all characters not allowed in domain name segments from the
text scanned over in 4(c) and save the result plus '.' as the candidate
hostdomain.
e) repeat 4(c) until it tries to skip to scanning for the next 'at'; for
each successful case, remove all characters not allowed in domain name
segments from the text scanned over in 4(c) and append the result plus
'.' to the candidate hostdomain.
f) scan forward until as many characters allowed in domain name segments
have been scanned as the length of the longest top-level domain.
Whenever at least one such character has been scanned and the next
character is a punctuation character, query DNS for an MX record for the
domain name consisting of the candidate hostdomain plus the scanned
characters with all those not allowed in domain name segments removed.
If the DNS query returns one or more MX records, save that domain name
as a valid domain.
g) if there is at least one entry in the valid domain list, scan
backward from the first non-space character before the 'at' for the
first punctuation character which is not allowed in a username; if at
least one character was scanned over and none of the characters scanned
over were ones not allowed in a username, prefix the characters scanned
over plus '@' to each item in the valid domain list and save each result
as a harvested email address.
That should be fairly efficient, find just about all email addresses
obfuscated using the general "... at ... dot ..." pattern (including
unobfuscated ones), and still have a fairly high percentage of real
addresses. [The algorithm for scanning usernames can probably be
improved; I've run out of steam for now.]
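As a rough sketch (mine, not the algorithm as literally specified), steps 1-3 and a much-simplified version of the step-4 scan might look like this in Python. The MX lookup in 4(f) is stubbed out with a hypothetical `has_mx` so the example runs offline, and the Unicode-lookalike, bracket-stripping, and backward username-scan logic is collapsed into one regex:

```python
import html
import re

# Separators a harvester might accept in place of '@' and '.'.
AT_SEP = r'(?:@|\bat\b|\(at\)|\[at\]|_at_)'
DOT_SEP = r'(?:\.|\bdot\b|\(dot\)|\[dot\]|_dot_)'

# username, an 'at' separator, then host segments joined by 'dot' separators.
ADDR = re.compile(
    r'([A-Za-z0-9._-]+)\s*' + AT_SEP + r'\s*'
    r'([A-Za-z0-9-]+(?:\s*' + DOT_SEP + r'\s*[A-Za-z0-9-]+)+)',
    re.IGNORECASE)

def normalize(page: str) -> str:
    """Steps 1-3: decode entities, drop tags, collapse whitespace."""
    text = html.unescape(page)                 # step 1: entities -> characters
    text = re.sub(r'<[^>]*>', ' ', text)       # step 2: crude tag removal
    return ' ' + ' '.join(text.split()) + ' '  # step 3: pad and collapse

def clean_domain(raw: str) -> str:
    """Rewrite 'dot' separators (and spaced dots) as literal dots."""
    return re.sub(r'\s*' + DOT_SEP + r'\s*', '.', raw, flags=re.IGNORECASE)

def has_mx(domain: str) -> bool:
    """Step 4(f) placeholder: a real harvester would query DNS for an MX
    record here; this stub just accepts a few common endings."""
    return domain.endswith(('.com', '.org', '.co.uk'))

def harvest(page: str) -> list:
    """Simplified step 4: collect candidates whose domain passes the check."""
    found = []
    for m in ADDR.finditer(normalize(page)):
        domain = clean_domain(m.group(2))
        if has_mx(domain):
            found.append(m.group(1) + '@' + domain)
    return found
```

For example, `harvest('<p>phil at example dot co dot uk</p>')` yields the intended address, but `harvest('we met at noon dot com')` yields the false hit "met@noon.com", which is exactly the false-positive trade-off being argued about in this thread.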
If spammers start using tools which harvest entity-encoded addresses,
pages like those will adapt and start producing entity-encoded fake
addresses. Advantage nullified.
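To illustrate how little work step-1-style decoding is (a made-up address, using Python's standard library purely as an example):

```python
import html

# "user@example.com" with '@' and '.' written as numeric character
# references; one standard-library call recovers the plain address.
encoded = "user&#64;example&#46;com"
print(html.unescape(encoded))  # user@example.com
```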
That's very true, but *until* this advantage is nullified, your email
address is vulnerable. And once it's on a spam list, there's no point asking
to have it removed. That's why I said you need to be at least 2 steps ahead.
This applies equally well to "... at ... dot ..." obfuscated addresses
-- both forms are about equally safe (or unsafe), which is why using
JavaScript encoding to avoid the one while still using the other doesn't
make any sense.
Well what I'm saying is that "at/dot" and its variants are much safer
because they are a lot harder to detect reliably.
Harder, yes. Impractical, no. See above.
But surely you have to agree that [spammers'] resources would be better
spent using email harvesting techniques that have a better hit rate.
What matters is what they actually do, not what we think it would be
sensible for them to do. Given their actual use of things like
dictionary attacks, assuming that they'll only use techniques which
produce a very high percentage of real addresses is foolish.
OK, so what *are* they actually doing? I've already pointed out [1] software
that (as far as I can tell) extracts Javascript mailto links and decodes
HTML entities. So we can assume the spammers are doing that already.
[1] <http://groups.google.co.uk/group/com...uthoring.html/
msg/e4ddd8c603db86ed?hl=en>
Since the underlying question is whether spammers are willing to use
harvesting techniques which produce lots of false positives, this is not
relevant. We've *seen* spammers using dictionary attacks (which
necessarily involve a very high fraction of false addresses), so we
*know* that at least some spammers are willing to use such techniques.
Dave