String/source code analysis tools

I have a whole bunch of script files in a custom scripting "language" that
were basically copied and pasted all over the place -- a huge mess,
basically.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.

Here is a simple example:

If I have two pieces of code like this:

func1( a, b, c, 13, d, e, f )
func2( x, y, z, z )

and

func1( a, b, c, 55, d, e, f )
func2( x, y, z, x )

I would like to be able to detect the redundancies. This is obviously a
simple example, the real code is worlds messier -- say a 3 line script, each
line has 800 characters, copied 10 times over with slight modifications
among the 800 characters. I'm not exaggerating. So I'm wondering if there
is any code out there that will assist me in refactoring this code.

My feeling is that a general solution to this is intractable, but I thought
I'd ask. I'd probably have to roll my own based on the specifics of the
situation.

It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line. How would I do a diff, token by token? I don't know anything
about what algorithms diffs use.

thanks,
MB

Jul 18 '05 #1
4 Replies


Man, I love Python! After writing this, with about 10 minutes of googling,
I found difflib, which can do diffs token by token. I can probably do what
I want in about 10 lines of code. Wow.

I think the diff is pretty much the best solution -- but if anyone has any
other pointers I would appreciate it. I would have to diff all pairs of
files and I can get a score of how similar they are to each other. So if I
have 10 files I would have to run it 45 times to get all pairs of diffs.
That should be OK since they are small files in general.
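A minimal sketch of the all-pairs scoring idea, assuming the scripts are already loaded into a dict of name to text. The names `tokenize` and `pairwise_similarity`, the file names, and the tokenizer regex are made up for illustration; only `difflib.SequenceMatcher` and `itertools.combinations` come from the standard library.

```python
import difflib
import itertools
import re


def tokenize(text):
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def pairwise_similarity(scripts):
    """Return {(name_a, name_b): ratio} for every unordered pair of scripts."""
    tokens = {name: tokenize(text) for name, text in scripts.items()}
    scores = {}
    for a, b in itertools.combinations(sorted(tokens), 2):
        # autojunk=False keeps frequent tokens such as "," from being ignored.
        matcher = difflib.SequenceMatcher(None, tokens[a], tokens[b], autojunk=False)
        scores[(a, b)] = matcher.ratio()  # 0.0 (nothing shared) to 1.0 (identical)
    return scores


# Illustrative input only; real scripts would be read from disk.
scripts = {
    "one.txt": "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )",
    "two.txt": "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )",
    "three.txt": "other( p, q )",
}
scores = pairwise_similarity(scripts)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

The pairs with the highest ratio are the likeliest copy-and-paste candidates. With 10 files, `itertools.combinations` yields the 45 pairs mentioned above.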

MB

Jul 18 '05 #2

On Thursday 22 April 2004 08:56, Moosebumps wrote:
It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line. How would I do a diff, token by token? I don't know anything
about what algorithms diffs use.


What about difflib? (part of the standard library) You'd have to write your
own tokenization function, but that shouldn't be hard...
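As a rough sketch of what that could look like: the tokenizer below is a guess at a C-like token syntax (identifiers, integers, single punctuation characters) and would need adapting to the real scripting language. `TOKEN_RE`, `tokenize` and `token_diff` are hypothetical names; `SequenceMatcher.get_opcodes()` is the difflib call that reports which token ranges differ.

```python
import difflib
import re

# Assumed token syntax: identifiers, integers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    return TOKEN_RE.findall(source)


def token_diff(old, new):
    """Return (tag, old_tokens, new_tokens) for each non-matching region."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


old = "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )"
new = "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )"
changes = token_diff(old, new)
for change in changes:
    print(change)
```

For the two snippets from the original question, this reports two replacements: `13` became `55`, and the last `z` became `x`. Those differing tokens are the natural candidates for parameters when refactoring the copies into one routine.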

Heiko.

Jul 18 '05 #3

"Moosebumps" <mo********@moosebumps.com> wrote in message
news:j0******************@newssvr27.news.prodigy.c om...
I have a whole bunch of script files in a custom scripting "language" that
were basically copied and pasted all over the place -- a huge mess,
basically.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.


It is not written in Python, but it could be used to do this.
We offer a clone detection tool that works on very large source code bases,
and detects cloned code with "slight modifications".
You'd have to provide a grammar for your 'scripting language'.
See http://www.semanticdesigns.com/Produ...one/index.html.
--
Ira D. Baxter, Ph.D., CTO 512-250-1018
Semantic Designs, Inc. www.semdesigns.com
Jul 18 '05 #4

[Ira Baxter]
"Moosebumps" <mo********@moosebumps.com> wrote in message
news:j0******************@newssvr27.news.prodigy.c om...
I have a whole bunch of script files in a custom scripting
"language" that were basically copied and pasted all over the place
-- a huge mess, basically. I want to clean this up using Python --
and I'm wondering if there is any sort of algorithm for detecting
copied and pasted code with slight modifications.

It is not written in Python, but it could be used to do this. We offer a
clone detection tool that works on very large source code bases, and
detects cloned code with "slight modifications". You'd have to provide a
grammar for your 'scripting language'. See
http://www.semanticdesigns.com/Produ...one/index.html.


Thanks for the reference, I'm saving it for later perusal or study.

Many years ago, because I had a cleanup problem which I presume was similar
to yours, I wrote and then used a tool for this, all in C. I called
it `mdiff' (for "multi-diff"), and it can likely be found within some old
pretest of `Free wdiff' -- I have not really touched `wdiff' in years, though
I am pondering republishing it this summer, if I find some free time.

`mdiff' looks for identical sequences of lines within one or more files
(I used it on many dozens of files at once). One difficulty was
designing a usable way to display the output, which was an
interesting problem in itself. `mdiff' did the job for me, but I do not
really remember the state of this project nor how `mdiff' would behave
if recompiled today. But, as usual with me, if you feel like toying,
just ask for the sources, or look for them on my home web page! :-)
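For readers who want to experiment in Python, here is a small sketch of the general idea of finding identical runs of lines across several files. It is not how `mdiff' works internally (those sources are not shown here); it simply indexes every window of consecutive lines and keeps the ones seen more than once. The function name, the `window` parameter and the sample data are all hypothetical.

```python
from collections import defaultdict


def repeated_line_runs(files, window=3):
    """Map each run of `window` consecutive lines to the places it occurs.

    `files` is a dict of file name to file text. Only runs that occur in
    more than one place are returned.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Strip surrounding whitespace so indentation changes do not hide a copy.
        lines = [line.strip() for line in text.splitlines()]
        for start in range(len(lines) - window + 1):
            run = tuple(lines[start:start + window])
            if any(run):  # skip runs made only of blank lines
                seen[run].append((name, start + 1))  # 1-based line number
    return {run: places for run, places in seen.items() if len(places) > 1}


# Illustrative data only.
files = {
    "a.txt": "x\ny\nz\nw",
    "b.txt": "q\nx\ny\nz",
}
duplicates = repeated_line_runs(files, window=3)
for run, places in duplicates.items():
    print(places, run)
```

This only catches exact line-for-line copies of a fixed length; copies with slight modifications still need a token-level comparison.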

--
François Pinard http://www.iro.umontreal.ca/~pinard

Jul 18 '05 #5
