String/source code analysis tools

I have a whole bunch of script files in a custom scripting "language" that
were copied and pasted all over the place -- a huge mess, basically.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.

For example, a simple case:

If I have two pieces of code like this:

func1( a, b, c, 13, d, e, f )
func2( x, y, z, z )

and

func1( a, b, c, 55, d, e, f )
func2( x, y, z, x )

I would like to be able to detect the redundancies. This is obviously a
simple example, the real code is worlds messier -- say a 3 line script, each
line has 800 characters, copied 10 times over with slight modifications
among the 800 characters. I'm not exaggerating. So I'm wondering if there
is any code out there that will assist me in refactoring this code.

My feeling is that a general solution to this is intractable, but I thought
I'd ask. I'd probably have to roll my own based on the specifics of the
situation.

It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line. How would I do a diff, token by token? I don't know anything
about what algorithms diffs use.

thanks,
MB

Jul 18 '05 #1
Man, I love Python! After writing this, with about 10 minutes of googling,
I found difflib, which can do diffs token by token. I can probably do what I
want with about 10 lines of code. Wow.
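A minimal sketch of such a token-by-token diff, assuming whitespace-separated tokens are good enough for a first pass. The helper name `token_diff` is made up for illustration; `difflib.SequenceMatcher` and `get_opcodes()` are the standard library calls.

```python
import difflib

def token_diff(a_text, b_text):
    """Return (tag, old_tokens, new_tokens) for every region that differs."""
    # Assumption: splitting on whitespace is an adequate tokenizer here.
    a, b = a_text.split(), b_text.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

old = "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )"
new = "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )"

# Print only the tokens that changed between the two snippets.
for change in token_diff(old, new):
    print(change)
```

Because the comparison runs over tokens rather than lines, an 800-character line with one changed argument shows up as a single small change instead of a whole replaced line.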

I think the diff is pretty much the best solution -- but if anyone has any
other pointers I would appreciate it. I would have to diff all pairs of
files and I can get a score of how similar they are to each other. So if I
have 10 files I would have to run it 45 times to get all pairs of diffs.
That should be OK since they are small files in general.
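The all-pairs scoring could look roughly like this. It is a sketch, not a tuned tool: the file names are invented, tokens are again split on whitespace, and the score is `SequenceMatcher.ratio()`, which ranges from 0 to 1.

```python
import difflib
from itertools import combinations

def pairwise_similarity(scripts):
    """Map each unordered pair of script names to a 0..1 similarity ratio."""
    tokens = {name: text.split() for name, text in scripts.items()}
    scores = {}
    # combinations() yields n*(n-1)/2 pairs, so 10 files give 45 comparisons.
    for a, b in combinations(sorted(tokens), 2):
        matcher = difflib.SequenceMatcher(None, tokens[a], tokens[b], autojunk=False)
        scores[(a, b)] = matcher.ratio()
    return scores

# Hypothetical file names and contents, for illustration only.
scripts = {
    "a.txt": "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )",
    "b.txt": "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )",
    "c.txt": "something( else, entirely )",
}

# List the most similar pairs first; these are the refactoring candidates.
ranked = sorted(pairwise_similarity(scripts).items(), key=lambda kv: kv[1], reverse=True)
for pair, score in ranked:
    print(pair, round(score, 2))
```

`SequenceMatcher` can be slow on long sequences, so this only makes sense for small files, as noted above.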

MB

Jul 18 '05 #2
On Thursday, 22 April 2004 08:56, Moosebumps wrote:
It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line. How would I do a diff, token by token? I don't know anything
about what algorithms diffs use.


What about difflib? (part of the standard library) You'd have to write your
own tokenization function, but that shouldn't be hard...
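A tokenization function for this might be sketched as below. The token rules are an assumption, since the thread never describes the custom language's syntax: identifiers, numbers, and any other single non-space character each count as one token.

```python
import re
import difflib

# Hypothetical token rules: identifier | number | any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")

def tokenize(source):
    """Split source text into tokens, ignoring all whitespace."""
    return TOKEN_RE.findall(source)

def similarity(a_src, b_src):
    """Token-level similarity in the range 0..1."""
    matcher = difflib.SequenceMatcher(None, tokenize(a_src), tokenize(b_src),
                                      autojunk=False)
    return matcher.ratio()

print(tokenize("func1( a, b, c, 13, d, e, f )"))
print(similarity("func2( x, y, z, z )", "func2(x,y,z,x)"))
```

Unlike a plain `split()`, this separates punctuation from names, so differences in spacing alone do not lower the score.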

Heiko.

Jul 18 '05 #3
"Moosebumps" <mo********@moosebumps.com> wrote in message
news:j0******************@newssvr27.news.prodigy.c om...
I have a whole bunch of script files in a custom scripting "language" that
were basically copied and pasted all over the place -- a huge mess,
basically.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.


Not in Python, but could be used to do this.
We offer a clone detection tool that works on very large source code bases
and detects cloned code with "slight modifications".
You'd have to provide a grammar for your 'scripting language'.
See http://www.semanticdesigns.com/Produ...one/index.html.
--
Ira D. Baxter, Ph.D., CTO 512-250-1018
Semantic Designs, Inc. www.semdesigns.com
Jul 18 '05 #4
[Ira Baxter]
"Moosebumps" <mo********@moosebumps.com> wrote in message
news:j0******************@newssvr27.news.prodigy.c om...
I have a whole bunch of script files in a custom scripting
"language" that were basically copied and pasted all over the place
-- a huge mess, basically. I want to clean this up using Python --
and I'm wondering if there is any sort of algorithm for detecting
copied and pasted code with slight modifications.

Not in Python, but could be used to do this. We offer a clone
detection tool that works on very large source code bases and detects
cloned code with "slight modifications". You'd have to provide a
grammar for your 'scripting language'. See
http://www.semanticdesigns.com/Produ...one/index.html.


Thanks for the reference, I'm saving it for later perusal or study.

Many years ago, because I had a cleanup problem which I presume was similar
to yours, I wrote and then used a tool for this, all in C. I called
it `mdiff' (for "multi-diff"), and it can likely be found within some old
pretest of `Free wdiff' -- I have not really touched `wdiff' in years, though
I am pondering republishing it this summer, if I find some free time.

`mdiff' looks for identical sequences of lines within one or more files
(I used it on many dozens of files at once). One difficulty was
designing a usable way to display the output, and this was an
interesting problem at least. `mdiff' did the job for me, but I do not
really remember the state of this project nor how `mdiff' would behave
if recompiled today. But, as usual with me, if you feel like toying,
just ask for the sources, or look for them on my home web page! :-)
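The general idea of finding identical runs of lines across files can be sketched in Python. This is not `mdiff' itself, whose algorithm is not described here, only a naive illustration of the technique; the function name and the `min_lines` parameter are invented.

```python
from collections import defaultdict

def repeated_blocks(files, min_lines=3):
    """Find windows of `min_lines` consecutive lines that occur more than once.

    `files` maps a name to its text. Returns {block: [(name, start_line), ...]}
    with 1-based line numbers.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Strip each line so indentation differences do not hide a match.
        lines = [line.strip() for line in text.splitlines()]
        for i in range(len(lines) - min_lines + 1):
            window = tuple(lines[i:i + min_lines])
            if any(window):  # skip windows made only of blank lines
                seen[window].append((name, i + 1))
    return {block: places for block, places in seen.items() if len(places) > 1}

# Hypothetical inputs, for illustration only.
files = {
    "one.txt": "setup\nload a\nload b\nrun\nteardown",
    "two.txt": "other\nload a\nload b\nrun",
}
for block, places in repeated_blocks(files).items():
    print(places, block)
```

It only catches exact repeats of whole lines, and overlapping windows are reported separately rather than merged into one longer block.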

--
François Pinard http://www.iro.umontreal.ca/~pinard

Jul 18 '05 #5
