Help beautify ugly heuristic code

I have a function that recognizes PTR records for dynamic IPs. There is
no hard and fast rule for this - every ISP does it differently, and may
change their policy at any time, and use different conventions in
different places. Nevertheless, it is useful to apply stricter
authentication standards to incoming email when the PTR for the IP
indicates a dynamic IP (namely, the PTR record is ignored since it doesn't
mean anything except to the ISP). This is because Windoze Zombies are the
favorite platform of spammers.

Here is the very ugly code so far. It offends me to look at it, but I
haven't had any better ideas. I have lots of test data from mail logs.

# examples we don't yet recognize:
#
# 1Cust65.tnt4.atl4.da.uu.net at ('67.192.40.65', 4588)
# 1Cust200.tnt8.bne1.da.uu.net at ('203.61.67.200', 4144)
# 1Cust141.tnt30.rtm1.nld.da.uu.net at ('213.116.154.141', 2036)
# user64.net2045.mo.sprint-hsd.net at ('67.77.185.64', 3901)
# wiley-268-8196.roadrunner.nf.net at ('205.251.174.46', 4810)
# 221.fib163.satnet.net at ('200.69.163.221', 3301)
# cpc2-ches1-4-0-cust8.lutn.cable.ntl.com at ('80.4.105.8', 61099)
# user239.res.openband.net at ('65.246.82.239', 1392)
# xdsl-2449.zgora.dialog.net.pl at ('81.168.237.145', 1238)
# spr1-runc1-4-0-cust25.bagu.broadband.ntl.com at ('80.5.10.25', 1684)
# user-0c6s7hv.cable.mindspring.com at ('24.110.30.63', 3720)
# user-0c8hvet.cable.mindspring.com at ('24.136.253.221', 4529)
# user-0cdf5j8.cable.mindspring.com at ('24.215.150.104', 3783)
# mmds-dhcp-11-143.plateautel.net at ('63.99.131.143', 4858)
# ca-santaanahub-cuda3-c6b-134.anhmca.adelphia.net at ('68.67.152.134', 62047)
# cbl-sd-02-79.aster.com.do at ('200.88.62.79', 4153)
# h105n6c2o912.bredband.skanova.com at ('213.67.33.105', 3259)

import re

ip3 = re.compile('([0-9]{1,3})[.x-]([0-9]{1,3})[.x-]([0-9]{1,3})')
rehmac = re.compile(
    'h[0-9a-f]{12}[.]|pcp[0-9]{6,10}pcs[.]|no-reverse|S[0-9a-f]{16}[.][a-z]{2}[.]'
)

def is_dynip(host,addr):
    """Return True if hostname is for a dynamic ip.
    Examples:
    >>> is_dynip('post3.fabulousdealz.com','69.60.99.112')
    False
    >>> is_dynip('adsl-69-208-201-177.dsl.emhril.ameritech.net','69.208.201.177')
    True
    >>> is_dynip('[1.2.3.4]','1.2.3.4')
    True
    """
    if host.startswith('[') and host.endswith(']'):
        return True
    if addr:
        if host.find(addr) >= 0: return True
        a = addr.split('.')
        ia = map(int,a)
        m = ip3.search(host)
        if m:
            g = map(int,m.groups())
            if g == ia[1:] or g == ia[:3]: return True
            if g[0] == ia[3] and g[1:] == ia[:2]: return True
            g.reverse()
            if g == ia[1:] or g == ia[:3]: return True
        if rehmac.search(host): return True
        if host.find("%s." % '-'.join(a[2:])) >= 0: return True
        if host.find("w%s." % '-'.join(a[:2])) >= 0: return True
        if host.find("dsl%s-" % '-'.join(a[:2])) >= 0: return True
        if host.find(''.join(a[:3])) >= 0: return True
        if host.find(''.join(a[1:])) >= 0: return True
        x = "%02x%02x%02x%02x" % tuple(ia)
        if host.lower().find(x) >= 0: return True
        z = [n.zfill(3) for n in a]
        if host.find('-'.join(z)) >= 0: return True
        if host.find("-%s." % '-'.join(z[2:])) >= 0: return True
        if host.find("%s." % ''.join(z[2:])) >= 0: return True
        if host.find(''.join(z)) >= 0: return True
        a.reverse()
        if host.find("%s." % '-'.join(a[:2])) >= 0: return True
        if host.find("%s." % '.'.join(a[:2])) >= 0: return True
        if host.find("%s." % a[0]) >= 0 and \
            host.find('.adsl.') > 0 or host.find('.dial-up.') > 0: return True
    return False

if __name__ == '__main__':
    import fileinput
    for ln in fileinput.input():
        a = ln.split()
        if len(a) == 2:
            ip,host = a
            if host.startswith('[') and host.endswith(']'):
                continue # no PTR
            if is_dynip(host,ip):
                print ip,host
Jul 18 '05 #1
On Wed, 08 Dec 2004 16:09:43 -0500, Stuart D. Gathman <st****@bmsi.com>
wrote:
> I have a function that recognizes PTR records for dynamic IPs....
> Here is the very ugly code so far.
> ...
> # examples we don't yet recognize:
> ...


This doesn't help much; post examples of all the possible patterns you have
to match (kind of like the docstring at the beginning, only more
elaborate), otherwise it's hard to know what kind of code you're trying to
implement.
--
Mitja
Jul 18 '05 #2
On Wed, 08 Dec 2004 18:00:06 -0500, Mitja wrote:
> On Wed, 08 Dec 2004 16:09:43 -0500, Stuart D. Gathman <st****@bmsi.com>
> wrote:
>> I have a function that recognizes PTR records for dynamic IPs.... Here
>> is the very ugly code so far.
>> ...
>> # examples we don't yet recognize:
>> ...
>
> This doesn't help much; post example of all the possible patterns you
> have to match (kind of like the docstring at the beginning, only more
> elaborate), otherwise it's hard to know what kind of code you're trying
> to implement.


This is a heuristic, so there is no exhaustive list or hard rule.
However, I have posted 23K+ examples at http://bmsi.com/python/dynip.samp
with DYN appended for examples which the current algorithm classifies
as dynamic.

Here are the last 20 (which my subjective judgement says are correct):

65.112.76.15 usfshlxmx01.myreg.net
201.128.108.41 dsl-201-128-108-41.prod-infinitum.com.mx DYN
206.221.177.128 mail128.tanthillyingyang.com
68.234.254.147 68-234-254-147.stmnca.adelphia.net DYN
63.110.30.30 mx81.goingwiththe-flow.info
62.178.226.189 chello062178226189.14.15.vie.surfer.at DYN
80.179.107.85 80.179.107.85.ispeednet.net DYN
200.204.68.52 200-204-68-52.dsl.telesp.net.br DYN
12.203.156.234 12-203-156-234.client.insightBB.com DYN
200.83.68.217 CM-lconC1-68-217.cm.vtr.net DYN
81.57.115.43 pauguste-3-81-57-115-43.fbx.proxad.net DYN
64.151.91.225 sv2a.entertainmentnewsclips.com
64.62.197.31 teenfreeway.sparklist.com
201.9.136.235 201009136235.user.veloxzone.com.br DYN
66.63.187.91 91.asandox.com
83.69.188.198 st11h07.ptambre.com
66.192.199.217 66-192-199-217.pyramidcollection.org DYN
69.40.166.49 h49.166.40.69.ip.alltel.net DYN
203.89.206.62 smtp.immigrationexpert.ca
80.143.79.97 p508F4F61.dip0.t-ipconnect.de DYN
Jul 18 '05 #3
Regular expressions.

It takes a while to craft the expressions, but this will be more
elegant, more extensible, and considerably faster to compute (matching
compiled re's is fast).

Example using the top five from your function's comments:

.. import re
..
.. # patterns are written in lowercase because is_dynip() lowercases the host
.. host_patterns = [
..     r'^1cust\d+\.tnt\d+\..*\.da\.uu\.net$',
..     r'^user\d+\.net\d+\.mo\.sprint-hsd\.net$',
..     r'^.*\.roadrunner\.nf\.net$',
.. ]
..
.. host_expr = re.compile('|'.join(host_patterns))
..
.. # only implementing host string matching, but you get the idea.
.. def is_dynip(host):
..     # host names are case insensitive
..     host = host.lower()
..     return host_expr.match(host) is not None

>>> is_dynip("1Cust200.tnt8.bne1.da.uu.net")
True
>>> is_dynip("google.com")
False
(the dots preceding code are to fool indentation stripping... please
ignore them)

Jul 18 '05 #4
On Wed, 08 Dec 2004 18:39:15 -0500, Lonnie Princehouse wrote:
> Regular expressions.
>
> It takes a while to craft the expressions, but this will be more
> elegant, more extensible, and considerably faster to compute (matching
> compiled re's is fast).


I'm already doing that with the rehmac regex. I like your idea for making
it more readable, though. Looking for permutations of the IP address
gives much more bang for the line of code than most host only regexes
since it is ISP independent. At least one ISP uses roman numerals to code
the IP for their dynamic addresses! I tried matching a custom regex
computed from the IP, but compiling the regex for each test was too slow.

I could keep adding more patterns, but I was hoping for a tool that
"learns" from a database of preclassified examples how to recognize the
pattern. And the resulting data would be reasonably compact. I don't ask
for much, do I? A Bayesian classifier would have too big of a database, I
think. I've seen neural nets do amazing things with only 100 or so
neurons - a small weight database. But they are slow in software.

I have posted 10K preclassified (by current algorithm) examples here:
http://bmsi.com/python/dynip.samp
Jul 18 '05 #5
On 8 Dec 2004 15:39:15 -0800, Lonnie Princehouse
<fi**************@gmail.com> wrote:
> Regular expressions.
>
> It takes a while to craft the expressions, but this will be more
> elegant, more extensible, and considerably faster to compute (matching
> compiled re's is fast).


I think that this problem is probably a little bit harder. As the OP
noted, each ISP uses a different notation. I think that a better
solution is to use a statistical approach, possibly using a custom
Bayesian filter that could "learn" a little bit about some patterns.

The basic idea is as follows:

-- break the hostname into pieces, using not only the dots, but also
hyphens and underscores in the name.

-- classify each part, using REs to identify common patterns: frequent
strings (com, gov, net, org); normal words (sequences of letters);
normal numbers; combinations of numbers & letters; common substrings
can also be identified (such as isp, in the middle of one of the
strings).

-- check these pieces against the Bayesian filter, pretty much as it's
done for spam.

I think that this approach is promising. It relies on the fact that
real servers usually do not have numbers in their names; however,
exact identification either by a match or a regular expression is very
difficult. I'm willing to try it, but first, more data is needed.
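
A minimal Python 3 sketch of the first two steps above (splitting the hostname and classifying each piece) might look like the following. It is not code from this thread: the function name `token_classes`, the class labels, and the `COMMON` set are hypothetical choices made for illustration, and the Bayesian filter itself is left out.

```python
import re

# Frequent strings mentioned in the post; extend as needed (assumption).
COMMON = {'com', 'gov', 'net', 'org'}

def token_classes(host):
    """Split a hostname on dots, hyphens and underscores and
    label each piece with a coarse class."""
    classes = []
    for piece in re.split(r'[._-]', host.lower()):
        if not piece:
            continue                      # skip empty pieces from "--" or ".."
        if piece in COMMON:
            classes.append('common:' + piece)
        elif piece.isdigit():
            classes.append('number')      # plain numbers
        elif piece.isalpha():
            classes.append('word')        # sequences of letters
        else:
            classes.append('mixed')       # combinations of numbers and letters
    return classes

print(token_classes('adsl-69-208-201-177.dsl.emhril.ameritech.net'))
print(token_classes('sv2a.entertainmentnewsclips.com'))
```

The resulting labels, rather than the raw strings, would be what gets fed to the filter, so that a hostname from an ISP never seen before can still look "numeric" in the same way as known dynamic ones.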

--
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: ca********@gmail.com
mail: ca********@yahoo.com
Jul 18 '05 #6
I don't think a Bayesian classifier is going to be very helpful here,
unless you have tens of thousands of examples to feed it, or unless it
was specially coded to first break addresses into better tokens for
classification (such as alphanumeric strings and numbers).

The series of if host.find(...) lines in is_dynip() is equivalent to a
regular expression, but much more expensive to execute because of all
the list slicing, and it won't benefit from the re module's speedy
native implementation of regular expressions.

Try building a host_expr (as per my previous post) in the following
way:

# suppose dynamic_host_list is a list of all the host strings already
# known to be dynamic.

.. host_patterns = {} # use a dict to guarantee uniqueness. sets would also work.
.. number_expr = re.compile(r"\d+")
.. for dynamic_host in dynamic_host_list:
..     # escape the host first so that '.' only matches a literal dot
..     escaped = re.escape(dynamic_host)
..     pattern = '^' + number_expr.sub(r"\\d+", escaped) + '$'
..     host_patterns[pattern] = True
..
.. host_expr = re.compile('|'.join(host_patterns.keys()))

This will catch any hostname that differs only in numbers from any
other host you've already classified.
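
As a rough, self-contained Python 3 version of the same idea (a sketch only; `build_host_expr` and `is_known_dynamic` are hypothetical names, and the two sample hosts are taken from the list at the top of the thread):

```python
import re

NUMBER = re.compile(r'\d+')

def build_host_expr(dynamic_hosts):
    """Turn known dynamic hostnames into one regex in which every
    run of digits is generalized to \\d+."""
    patterns = set()                      # a set removes duplicate patterns
    for host in dynamic_hosts:
        escaped = re.escape(host.lower())  # '.' and '-' become literals
        # A function as the replacement avoids escape processing of '\d+'.
        patterns.add(NUMBER.sub(lambda m: r'\d+', escaped))
    return re.compile('|'.join('(?:%s)' % p for p in sorted(patterns)))

def is_known_dynamic(host, host_expr):
    # Host names are case insensitive, so compare in lowercase.
    return host_expr.fullmatch(host.lower()) is not None

host_expr = build_host_expr([
    '1Cust65.tnt4.atl4.da.uu.net',
    'user64.net2045.mo.sprint-hsd.net',
])
print(is_known_dynamic('1Cust200.tnt8.atl12.da.uu.net', host_expr))
print(is_known_dynamic('google.com', host_expr))
```

Note that this only generalizes the numbers: a host such as `1Cust200.tnt8.bne1.da.uu.net` would still need its own entry, because `bne` differs from `atl` in letters.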

For IP addresses, you really just need a mechanism to filter blocks of
IP addresses. It might be easiest to first convert them into hex and
then make liberal use of [0-f] in regular expressions.

Jul 18 '05 #7
On Wed, 08 Dec 2004 19:52:53 -0500, Lonnie Princehouse wrote:
> I don't think a Bayesian classifier is going to be very helpful here,
> unless you have tens of thousands of examples to feed it, or unless it

We do have tens of thousands of examples to feed it.

> The series of if host.find(...) lines in is_dynip() is equivalent to a
> regular expression, but much more expensive to execute because of all

It is not equivalent, because the patterns are based on the IP address.
As I mentioned before, I tried building a custom regex from the IP for
each test - but compiling the regex is way too slow to be done for each
test.

> For IP addresses, you really just need a mechanism to filter blocks of
> IP addresses. It might be easiest to first convert them into hex and
> then make liberal use of [0-f] in regular expressions.


The point of the ip address is *not* to recognize ip addresses. The
point is to look for transformations of the ip address in the hostname.
This gives a *huge* bang for the buck. I have been working on this
problem for a while. If the hostname has a transformation of the ip
address - it is (almost certainly) a dynamic address. The ISPs are very
creative in their transformations, using the parts of the ip in various
orders and encoding in hex, base64, decimal with or without zerofill, and
even roman numerals.
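
To make "transformations of the ip address" concrete, here is a small Python 3 sketch. It is a simplification and not the `is_dynip()` code posted earlier: `ip_transformations` and `host_embeds_ip` are hypothetical names, only a few decimal and hex forms are generated, and base64 or roman numeral encodings are not handled.

```python
def ip_transformations(addr):
    """Return a set of strings an ISP might embed in a PTR name
    for the dotted-quad address addr."""
    a = addr.split('.')
    ia = [int(p) for p in a]
    z = [p.zfill(3) for p in a]           # zero-filled decimal, e.g. '009'
    forms = set()
    for sep in ('.', '-', ''):
        forms.add(sep.join(a))            # 69-208-201-177
        forms.add(sep.join(reversed(a)))  # 177-201-208-69
        forms.add(sep.join(z))            # 069-208-201-177
    forms.add('%02x%02x%02x%02x' % tuple(ia))  # hex, e.g. 508f4f61
    return forms

def host_embeds_ip(host, addr):
    host = host.lower()
    return any(form in host for form in ip_transformations(addr))

print(host_embeds_ip('p508F4F61.dip0.t-ipconnect.de', '80.143.79.97'))
print(host_embeds_ip('smtp.immigrationexpert.ca', '203.89.206.62'))
```

Because the candidate strings are built from the address and then looked up with plain substring tests, nothing has to be compiled per message. The unseparated decimal form can produce false positives for addresses made of short octets, so a real implementation would want to be stricter there.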

The regex engine is just not powerful enough to handle parameterized
regexes (that I know of).
Jul 18 '05 #8
Doh! I misread "a" as host instead of ip in your first post. I'm
sorry about that; I really must slow down. Anyhow,

I believe you can still do this with only compiling a regex once and
then performing a few substitutions on the hostname.

Substitutions:

1st byte of IP => (0)
2nd byte of IP => (1)
3rd byte of IP => (2)
4th byte of IP => (3)
and likewise for hex => (x0) (x1) (x2) (x3)

Each host string will possibly map into multiple expansions, esp. if a
number repeats itself in the IP, or if an IP byte is less than 10 (such
that the decimal and hex representations are the same). Zero-padded
and unpadded will both have to be substituted, and it's probably best
not to alter the last two fields in the host name since ISPs can't
change those.

With this scheme, here are a few expansions of (ip,host) tuples:

172.182.240.186 ACB6F0BA.ipt.aol.com
becomes
(x0)(x1)(x2)(x3).ipt.aol.com

67.119.55.77 adsl-67-119-55-77.dsl.lsan03.pacbell.net
becomes
adsl-(0)-(1)-(2)-(3).dsl.lsan03.pacbell.net
adsl-(0)-(1)-(2)-(x1).dsl.lsan03.pacbell.net

81.220.220.143 ip-143.net-81-220-220.henin.rev.numericable.fr
becomes
ip-(3).net-(0)-(1)-(1).henin.rev.numericable.fr
ip-(3).net-(0)-(1)-(2).henin.rev.numericable.fr
ip-(3).net-(0)-(2)-(1).henin.rev.numericable.fr
ip-(3).net-(0)-(2)-(2).henin.rev.numericable.fr

etcetera.

Now you can run a precompiled regular expression against these hostname
permutations, i.e. ".*\(0\).*\(1\).*\(2\).*\(3\).*" would match any
host in which the IP address numbers appeared in the correct order.
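
A minimal Python 3 sketch of this substitution scheme, restricted to decimal bytes, is shown below. The names `expansions` and `IN_ORDER` are hypothetical; the hex placeholders (x0)..(x3) and the rule about leaving the last two fields alone are omitted to keep it short.

```python
import itertools
import re

# Compiled once; matches hosts where bytes 0..3 appear in that order.
IN_ORDER = re.compile(r'.*\(0\).*\(1\).*\(2\).*\(3\).*')

def expansions(ip, host):
    """Replace each decimal number in host that equals a byte of ip
    with the placeholder (n), returning every possible expansion."""
    octets = [int(p) for p in ip.split('.')]
    # With a capturing group, re.split keeps the numbers at odd indexes.
    pieces = re.split(r'(\d+)', host)
    choices = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            labels = ['(%d)' % n for n, o in enumerate(octets)
                      if int(piece) == o]
            choices.append(labels or [piece])   # unrelated numbers stay as-is
        else:
            choices.append([piece])
    return [''.join(combo) for combo in itertools.product(*choices)]

for e in expansions('81.220.220.143',
                    'ip-143.net-81-220-220.henin.rev.numericable.fr'):
    print(e)

print(any(IN_ORDER.match(e) for e in
          expansions('67.119.55.77',
                     'adsl-67-119-55-77.dsl.lsan03.pacbell.net')))
```

A number that repeats in the IP (220 above) yields several expansions, which is why the result is a list rather than a single string. Using `int(piece)` for the comparison means zero-padded and unpadded forms are treated alike.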

There are only a handful of dynamic addresses in your sample data that
don't match a decimal or hexadecimal IP-based pattern, e.g.

68.53.109.99 pcp03902856pcs.nash01.tn.comcast.net
68.147.136.167 s01060050bf91c1e4.cg.shawcable.net

Jul 18 '05 #9
On Wed, 08 Dec 2004 16:09:43 -0500, Stuart D. Gathman <st****@bmsi.com>
wrote:
> I have a function that recognizes PTR records for dynamic IPs. There is
> no hard and fast rule for this - every ISP does it differently, and may
> change their policy at any time, and use different conventions in
> different places. Nevertheless, it is useful to apply stricter
> authentication standards to incoming email when the PTR for the IP
> indicates a dynamic IP (namely, the PTR record is ignored since it
> doesn't mean anything except to the ISP). This is because Windoze Zombies
> are the favorite platform of spammers.


This is roughly it.... you'll have to experiment and find the right
numbers for different pattern matches, maybe even add some extra criteria
etc. I don't have the time for it right now, but I'd be interested to know
how much my code and yours differ in the detection process (i.e. where are
the return values different).

Hope the indentation makes it through alright.

#!/usr/bin/python

import re
reNum = re.compile(r'\d+')
reWord = re.compile(r'(?<=[^a-z])[a-z]+(?=[^a-z])|^[a-z]+(?=[^a-z])')
#words that imply a dynamic ip
dynWords = ('dial','dialup','dialin','adsl','dsl','dyn','dynamic')
#words that imply a static ip
staticWords = ('cable','static')

def isDynamic(host, ip):
    """
    Heuristically checks whether hostname is likely to represent
    a dynamic ip.
    Returns True or False.
    """

    #for easier matching
    ip=[int(p) for p in ip.split('.')]
    host=host.lower()

    #since it's heuristic, we'll give the hostname
    #(de)merits for every pattern it matches further on.
    #based on the value of these points, we'll decide whether
    #it's dynamic or not
    points=0

    #the ip numbers; finding those in the hostname speaks
    #for itself; also include hex and oct representations
    #lowest ip byte is even more suggestive, give extra points
    #for matching that
    for p in ip[:3]:
        #bytes 0, 1, 2
        if (host.find(`p`) != -1) or (host.find(oct(p)[1:]) != -1): points+=20
    #byte 3
    if (host.find(`ip[3]`) != -1) or (host.find(oct(ip[3])[1:]) != -1):
        points+=60
    #it's hard to distinguish hex numbers from "normal"
    #chars, so we simplify it a bit and only search for
    #last two bytes of ip concatenated
    if host.find(hex(ip[2])[2:]+hex(ip[3])[2:]) != -1: points+=60

    #long, seemingly random serial numbers in the hostname are also a hint
    #search for all numbers and "award" points for longer ones
    for num in reNum.findall(host):
        points += min(len(num)**2,60)

    #substrings that are more than just a hint of a dynamic ip
    for word in reWord.findall(host):
        if word in dynWords: points+=30
        if word in staticWords: points-=30

    print '[[',points,']]'
    return points>80

if __name__=='__main__':
    for line in open('dynip.samp').readlines()[:50]:
        (ip,host) = line.rstrip('DYN').split()[:2]
        if host.find('.') != -1:
            print host, ip, ['','DYNAMIC'][isDynamic(host, ip)]
--
Mitja
Jul 18 '05 #10
