anyone interested in decompilation

Decompilation is the process of recovering human readable source code
from a program executable. Many decompilers exist for Java and .NET as
the program executables (class files) maintain much of the information
found in the source code. This is not true for machine code
executables however.

In recent years decompilation for machine code has moved from the
domain of crackpots and academic hopefuls to a number of real
technologies that are available to the general public. Decompilers for
machine code now exist which produce output that rivals disassemblers
as a tool for analysing programs for security flaws, malware or just
simply to see how something works. Full source code recovery that is
economically attainable will soon be a reality.

The legal challenges posed by this technology differ from country to
country. As such, much research is being done in secret in countries
that prohibit some uses of the technology, whereas some research is
being done more publicly in countries that have laws which support the
technology (Australia, for example).

Boomerang is an open source decompiler written (primarily) by two
Australian researchers. Open source projects need contributors. If
you have an interest in decompilation, we'd like to hear from you.
We're not only interested in talking to programmers. The project
suffers from a lack of documentation, tutorials and community. There
are many tasks that can be performed by users with minor technical
knowledge.

For more information on machine code decompilation see the Boomerang
web site (http://boomerang.sourceforge.net/). For interesting
technical commentary on machine code decompilation, see my blog
(http://quantumg.blotspot.com/).

Thanks for reading this message,

QuantumG

Aug 2 '06
Martin Ambuhl wrote:
Igmar Palsenberg wrote:
Gernot Frisch wrote:
How would one be so stupid as to put symbolic information in an
executable? I mean: What's the deal of a compiler then?
That remarkable feature is called 'debugging'. You know, when you fire
up your debugger, it knows that at some point, a certain variable
exists, and what it's called.

Against my better judgment, I will dip my toe into this completely
off-topic thread. Your "counterexample" suggests that this "decompiler"
is a remarkably useless thing. If one has an executable in which the
symbolic information needed for us to know that a certain variable
exists and how it's named, then one has a pre-release copy of that
executable. That pre-release version should belong to people with
access to the source code, and those people have no use for this
"decompiler" at all.
You are making assumptions about how software is being delivered. For
example, the Mars Rover software almost certainly had symbolic
information in it as it was running.

A company that doesn't have proper source control measures, or had
disgruntled employees might easily be in a position where a decompiler
would be handy to have.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 3 '06 #51
In article <11**********************@b28g2000cwb.googlegroups.com>,
<we******@gmail.com> wrote:
My question is
*which* compiler and .EXE outputs are you targeting? You are
declaring main as "int main (int argc, char * argv, char * envp)" which
I am pretty sure is GNU/UNIX-only.
I'm not sure what you mean by "GNU/UNIX" in this context.

As a datapoint, the envp parameter is supported by SGI IRIX no
matter whether you are using SGI's MipsPro compilers or gcc or
SGI's older series of compilers, or the third-party commercial
compiler DCC that used to be available for IRIX.

It seems likely to me that the envp parameter is supported on
a variety of Linux; Linux is not UNIX.

Also, main with envp is allowed in at least some MS Windows.
I find some matches when searching for main envp site:microsoft.com
but unfortunately my browser is acting up right now so I can't
post an example link right at the moment.
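
For reference, the three-argument form of main is not required by the C standard; it is listed in the standard's informative annex of common extensions (J.5.1, "Environment arguments"), which fits the range of platforms mentioned above. The sketch below shows how such an envp array is usually walked. The helper name count_env_entries is made up for illustration, and the call from main appears only in a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: walks a NULL-terminated array of "NAME=value"
 * strings, which is the layout envp has where the extension exists. */
static size_t count_env_entries(char *const envp[])
{
    size_t n = 0;

    assert(envp != NULL);
    while (envp[n] != NULL) {
        printf("%s\n", envp[n]);
        n++;
    }
    return n;
}

/* Typical use, on an implementation that provides the extension:
 *
 *     int main(int argc, char *argv[], char *envp[])
 *     {
 *         (void)argc; (void)argv;
 *         count_env_entries(envp);
 *         return 0;
 *     }
 */
```

Note that the declaration quoted earlier has "char * argv, char * envp", one level of indirection short of the usual "char *argv[], char *envp[]".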
--
"It is important to remember that when it comes to law, computers
never make copies, only human beings make copies. Computers are given
commands, not permission. Only people can be given permission."
-- Brad Templeton
Aug 3 '06 #52
we******@gmail.com wrote:
You are making assumptions about how software is being delivered. For
example, the Mars Rover software, almost certainly had symbolic
information in it as it was running.
That is not a useful counterexample. There is no reason to deliver the
Mars Rover software with the necessary symbolic information for either
debugging or "decompiling" unless the source code was available. To
suggest that somehow "decompiling" becomes a useful thing on the basis
of Mars Rover is, frankly, not credible. To hallucinate that
a) the Mars Rover software was delivered with the symbolic information
provided, and
b) the Mars Rover software was delivered without its source code
available, and
c) the process of "decompilation" would be reliable enough, or timely
enough, to make it preferred to simply getting the source code
is to live in the fringes of sanity.
Aug 3 '06 #53
In article <11**********************@b28g2000cwb.googlegroups.com>,
<we******@gmail.com> wrote:
QuantumG wrote:
Decompilation is the process of recovering human readable source code
from a program executable.
The legal challenges posed by this technology differ from country to
country.
It's probably only illegal in the US and Japan (and maybe Canada).
Not that it matters to comp.lang.c, but extracting [see the
reference for full restrictions]
http://laws.justice.gc.ca/en/C-42/23...l#Section-30.6

30.6 It is not an infringement [..]

a) make a single reproduction of
the copy by adapting, modifying or
converting the computer program
or translating it into another
computer language if the person
proves that the reproduced copy is

(i) essential for the compatibility of
the computer program with a
particular computer,
Thus in Canada, decompilation is not inherently illegal, since the
above process might involve decompilation.
--
Okay, buzzwords only. Two syllables, tops. -- Laurie Anderson
Aug 3 '06 #54
QuantumG posted:
Full source code recovery that is economically attainable will soon be a
reality.

The addition of two numbers yields 100.

What are the two numbers?

The project suffers from a lack of documentation, tutorials and
community.

It also suffers from a lack of "reality".

--

Frederick Gotham
Aug 3 '06 #55
QuantumG wrote:
Decompilation is the process of recovering human readable source code
from a program executable. Many decompilers exist for Java and .NET as
the program executables (class files) maintain much of the information
found in the source code. This is not true for machine code
executables however.

In recent years decompilation for machine code has moved from the
domain of crackpots and academic hopefuls to a number of real
technologies that are available to the general public. Decompilers for
machine code now exist which produce output that rivals disassemblers
as a tool for analysing programs for security flaws, malware or just
simply to see how something works. Full source code recovery that is
economically attainable will soon be a reality.

The legal challenges posed by this technology differ from country to
country. As such, much research is being done in secret in countries
that prohibit some uses of the technology, whereas some research is
being done more publicly in countries that have laws which support the
technology (Australia, for example).

Boomerang is an open source decompiler written (primarily) by two
Australian researchers. Open source projects need contributors. If
you have an interest in decompilation, we'd like to hear from you.
We're not only interested in talking to programmers. The project
suffers from a lack of documentation, tutorials and community. There
are many tasks that can be performed by users with minor technical
knowledge.

For more information on machine code decompilation see the Boomerang
web site (http://boomerang.sourceforge.net/). For interesting
technical commentary on machine code decompilation, see my blog
(http://quantumg.blotspot.com/).
You want comp.compilers I think. This comes up once or so per year.

P.S.
You can't turn the DNA of a dead cow back into a cow. That sort of
thing only works in "Jurassic Park" movies.

When you want another cow, the best way to get one is to get a momma
cow and a daddy cow (sometimes known as 'bulls') and let them do their
business.

When you want to get your source code back, if you are using a compiled
language, the best thing is to restore from backup or pull from CVS.

I hope you succeed and make a workable decompiler, despite the known
impossibility of the general solution.

I also recommend that you stick to news:comp.compilers because that is
the arena where this sort of thing has ardent admirers.

Over here, in comp.lang.c we are not terribly interested in it. You
might say, "It's written in C!" but so is Microsoft Word, and
Microsoft Word is not topical here. You might say, "It outputs C
target language!" Which would be doubly interesting if the input were
a COBOL program but in any case, we don't care about that either.

Once you have it all working properly, I promise to give it a look.
Until then, don't go away mad -- just go away.
[If you know that a program was compiled by a particular compiler, I gather
it's possible to do pattern matching on the code idioms it uses to recover
more source than one might expect. And debug symbols help a lot. -John]

Aug 3 '06 #56
<we******@gmail.com> wrote in message
news:11**********************@b28g2000cwb.googlegroups.com...
The point being that WATCOM and Intel's compiler optimizations can
perform some pretty extreme code transformations. Intel does constant
propagation and function cloning, and for static functions WATCOM C/C++
just totally ignores function prologue/epilogue and may inline. You
*could* try to detect which compiler was used to compile the code,
however, it's possible to link with different libraries and compilers
than the original object code compiler.
I think we're all, except maybe for the OP, aware that it's impossible
to get the original source code back, just like using cloning cannot
get the original cow back from hamburger.

However, it is reasonable to think it's possible to generate source
code that can be recompiled to produce a binary that works at least as
well as the one you decompiled. The difference between the generated
source and the original source will vary depending on the extent of
optimizations (if any), presence of symbol information, amount of
preprocessing in the original, etc.

Just reaching this "round trip" goal would be a worthy exercise and
valuable for many purposes, some legal and some not.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

--
Posted via a free Usenet account from http://www.teranews.com

Aug 4 '06 #57

Martin Ambuhl wrote:
That is not a useful counterexample. There is no reason to deliver the
Mars Rover software with the necessary symbolic information for either
debugging or "decompiling" unless the source code was available. To
suggest that somehow "decompiling" becomes a useful thing on the basis
of Mars Rover is, frankly, not credible.
Really, the way you guys are hung up on symbols anyone would figure
you've never read source code written in French or German or whichever
natural language it is that you don't understand. Recreating sensible
symbols is not a problem people have trouble solving (for foreign
languages, or even asm code), so clearly it is not a problem they will
have solving for the output of a decompiler.

QuantumG

Aug 4 '06 #58

we******@gmail.com wrote:
As it should be. While IdaPro is a great tool, it's still too much of a
pain in the butt doing this sort of thing by hand. And there are not
many of us who can do it.
Which can be a good thing for some of us, but it's still a waste of
time if it can be automated.
So I looked at your page and through the examples. My question is
*which* compiler and .EXE outputs are you targeting? You are
declaring main as "int main (int argc, char * argv, char * envp)" which
I am pretty sure is GNU/UNIX-only. Is your plan to support a really
wide range of compilers?
Yes. Boomerang is a general decompiler for machine code. There is
very little compiler specific code in it. Primarily, we're not
interested in decompiling the runtime library code so we use patterns
to recognise it and find the start of the code that is interesting.
So, for example, we can load ELF binaries that were compiled by GCC and
skip all the glibc runtime to find main(). Or we can load an EXE that
was compiled by Microsoft Visual C and skip the libc runtime to find
WinMain() or, for console apps, main(). From there we use all general
algorithms and try to assume as little as possible about what compiler
was used to make the binary, if any was used at all!
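
The idea of recognising runtime start-up code by pattern can be illustrated with a plain byte-signature search. This is only a sketch under stated assumptions: find_signature, the WILDCARD marker and the signature layout are invented for illustration and are not Boomerang's actual code or API.

```c
#include <assert.h>
#include <stddef.h>

#define WILDCARD 0x100  /* hypothetical marker: matches any byte */

/* Returns the offset of the first match of sig[0..sig_len) in
 * code[0..code_len), or (size_t)-1 if there is none. Signature
 * entries are ints so that WILDCARD can sit outside the byte range. */
static size_t find_signature(const unsigned char *code, size_t code_len,
                             const int *sig, size_t sig_len)
{
    assert(code != NULL && sig != NULL);
    if (sig_len == 0 || sig_len > code_len)
        return (size_t)-1;

    for (size_t i = 0; i + sig_len <= code_len; i++) {
        size_t j = 0;
        while (j < sig_len &&
               (sig[j] == WILDCARD || sig[j] == code[i + j]))
            j++;
        if (j == sig_len)
            return i;
    }
    return (size_t)-1;
}
```

Wildcards let one signature cover bytes that vary between builds, such as call displacements. What to do with a match (for example, following a call to reach main()) depends on the runtime being recognised and is left out here.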
The point being that WATCOM and Intel's compiler optimizations can
perform some pretty extreme code transformations. Intel does constant
propagation and function cloning, and for static functions WATCOM C/C++
just totally ignores function prologue/epilogue and may inline. You
*could* try to detect which compiler was used to compile the code,
however, it's possible to link with different libraries and compilers
than the original object code compiler.
Yes, absolutely. We currently don't try to "uninline" anything, and if
two constants are combined by the compiler they will have to be
uncombined by the user after the decompiler has done its job.
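
A small illustration of the inlining case, as a sketch only. The first two functions stand for what the programmer wrote; the third is what a decompiler that does not "uninline" might plausibly emit. The names proc1, param1 and param2 are made up and are not actual Boomerang output.

```c
#include <assert.h>

/* Original source: a small static helper that a compiler may inline. */
static int square(int x)
{
    return x * x;
}

static int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}

/* What a decompiler that does not "uninline" might emit: one function
 * with the helper's body expanded at each call site. */
static int proc1(int param1, int param2)
{
    return param1 * param1 + param2 * param2;
}
```

Both versions compute the same result; recovering the helper is left to whoever reads the output.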

QuantumG

Aug 4 '06 #59
MQ

Frederick Gotham wrote:
The addition of two numbers yields 100.

What are the two numbers?
10 + 10, or 11 + 1, ignoring identities and arithmetic equivalents of
the above.

MQ

Aug 4 '06 #60

Frederick Gotham wrote:
QuantumG posted:
Full source code recovery that is economically attainable will soon be a
reality.


The addition of two numbers yields 100.

What are the two numbers?
This is an extreme example, but it does illustrate a point. If all
you want to do is to find source code for a program that adds two
numbers to get the result 100 (let's assume decimal here :-), then
there are obviously infinitely many such programs. But this is the
point: they are all just as good.

The program that adds 50 and 50 can be written infinitely many ways
too. Any program can be modified by changing only comments and
identifiers and not affect the result. Programs can be written with for
or while loops, you can use array indexes or pointers, there are lots
of variations at the source code level.

This turns up a lot in machine code, of course, since operations like
addition are so overloaded. The function that adds a pointer and an
integer (let's assume the pointer points to something whose size is 1,
for simplicity) is the same as the function that adds two integers.
Identical binary code. So to decompile that function in isolation,
there are three possibilities (ignoring the infinite variations with
identifier names and comments): type the parameters as pointer, int;
int, int; or int, pointer. In a real-world program however, it will be
obvious which of these will mesh with the rest of the program, since
there are type clues all over.
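
The ambiguity can be shown with two functions that differ only in their types. This is a sketch: the function names are invented, and the claim that both compile to the same instructions is an assumption about typical targets where intptr_t and pointers have the same width and are passed the same way, not something C guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* Integer plus integer. */
static intptr_t add_int_int(intptr_t a, intptr_t b)
{
    return a + b;
}

/* Pointer to a one-byte object plus integer. */
static char *add_ptr_int(char *p, intptr_t n)
{
    return p + n;
}
```

Seen in isolation, the machine code for either function gives no hint which signature was intended; the callers do.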

Here is the important point: you can't get the original source code
back, but it doesn't matter. The important thing is to get source code
that encapsulates what the program is doing. In many cases, it is
important that the code can be recompiled. In some cases, it is
important that the code is also readable and maintainable. I can't
think of a situation where you absolutely need the original source
code, or even very close to it. The reason is that the program does
something; there is enough information there for the processor to
execute it. In principle, it should be possible to produce source code
that when compiled does the same thing. We've done it for small
programs, and even for larger programs with a lot of manual help.
The project suffers from a lack of documentation, tutorials and
community.
It also suffers from a lack of "reality".
Frederick Gotham
Well, you are entitled to your opinion, of course. I believe that the
evidence so far shows otherwise.

- emmerik

Aug 4 '06 #61
QuantumG wrote:
Martin Ambuhl wrote:
That is not a useful counterexample. There is no reason to deliver
the Mars Rover software with the necessary symbolic information
for either debugging or "decompiling" unless the source code was
available. To suggest that somehow "decompiling" becomes a useful
thing on the basis of Mars Rover is, frankly, not credible.

Really, the way you guys are hung up on symbols anyone would figure
you've never read source code written in French or German or
whichever natural language it is that you don't understand.
Recreating sensible symbols is not a problem people have trouble
solving (for foreign languages, or even asm code), so clearly it is
not a problem they will have solving for the output of a decompiler.
And for such translation of code written in other natural
languages, my ID2ID utility serves very nicely. It can process a
whole suite of source files, making compatible changes across the
whole set. See:

<http://cbfalconer.home.att.net/download/>

--
Chuck F (cb********@yahoo.com) (cb********@maineline.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE maineline address!
Aug 5 '06 #62

CBFalconer wrote:
And for such translation of code written in other natural
languages, my ID2ID utility serves very nicely. It can process a
whole suite of source files, making compatible changes across the
whole set.
Neat.

Of course, this does bring to the fore an important question: for what
uses of a decompiler do you need to do this? If you're trying to
determine what malware does or look for security flaws you really don't
need maintainable source code.. you just need enough understanding of
the parts that do what you are interested in.

On the other hand, if you've lost your source code and you want to
"recover" it from one of the binaries you have produced from it, you
definitely want maintainable source code. However, you hardly need to
reverse engineer the output of the decompiler do you? After all, if
you're the original programmer, chances are you remember what most of
the identifiers were and can quickly replace the generic ones with
them. Your tool would certainly be better than using a text editor to
do it though!
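
To make "replace the generic ones" concrete, here is a minimal whole-identifier rename in C. It is a text-level sketch under stated assumptions: rename_identifier and is_ident_char are invented names, this is not how ID2ID is implemented, and strings, comments and character constants get no special treatment.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Copies src to out, replacing whole-identifier occurrences of `from`
 * with `to`. Returns the number of replacements, or -1 if out is too
 * small to hold the result and its terminating null. */
static int rename_identifier(const char *src, const char *from,
                             const char *to, char *out, size_t out_size)
{
    size_t from_len = strlen(from), to_len = strlen(to), used = 0;
    int count = 0;

    assert(from_len > 0);
    while (*src != '\0') {
        if (is_ident_char(*src)) {
            /* Measure the whole identifier (or number) starting here. */
            size_t len = 0;
            while (is_ident_char(src[len]))
                len++;
            int match = (len == from_len && memcmp(src, from, len) == 0);
            const char *piece = match ? to : src;
            size_t piece_len = match ? to_len : len;
            if (used + piece_len >= out_size)
                return -1;
            memcpy(out + used, piece, piece_len);
            used += piece_len;
            src += len;
            count += match;
        } else {
            if (used + 1 >= out_size)
                return -1;
            out[used++] = *src++;
        }
    }
    if (out_size == 0)
        return -1;
    out[used] = '\0';
    return count;
}
```

Matching on whole identifiers is what separates this from a plain search and replace: renaming local0 leaves local01 alone.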

QuantumG

Aug 5 '06 #63
"QuantumG" <qg@biodome.org> writes:
CBFalconer wrote:
And for such translation of code written in other natural
languages, my ID2ID utility serves very nicely. It can process a
whole suite of source files, making compatible changes across the
whole set.

Neat.
Not really : you could use SED or a windows equivalent with much more
confidence since there appears to be no source code or decent
documentation. Or use IDE refactoring tools.

But if you are willing to run an unknown exe good luck :-;

Of course, this does bring to the fore an important question: for what
uses of a decompiler do you need to do this? If you're trying to
determine what malware does or look for security flaws you really don't
need maintainable source code.. you just need enough understanding of
the parts that do what you are interested in.

On the other hand, if you've lost your source code and you want to
"recover" it from one of the binaries you have produced from it, you
definitely want maintainable source code. However, you hardly need to
How ironic - maybe this utility can be used after all to help produce
the original source for itself that is lost.
Aug 5 '06 #64
Richard <rg****@gmail.com> writes:
"QuantumG" <qg@biodome.org> writes:
CBFalconer wrote:
And for such translation of code written in other natural
languages, my ID2ID utility serves very nicely. It can process a
whole suite of source files, making compatible changes across the
whole set.

Neat.

Not really : you could use SED or a windows equivalent with much more
confidence since there appears to be no source code or decent
documentation. Or use IDE refactoring tools.

But if you are willing to run an unknown exe good luck :-;
What are you talking about? The zip file contains the complete
sources, along with a "readme.txt" file and a Windows executable, and
a makefile.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Aug 5 '06 #65
em*****@gmail.com wrote:
Frederick Gotham wrote:
QuantumG posted:
Full source code recovery that is economically attainable will soon be a
reality.
The addition of two numbers yields 100.

What are the two numbers?

This is an extreme example, but it does illustrate a point. If all
you want to do is to find source code for a program that adds two
numbers to get the result 100 (let's assume decimal here :-), then
there are obviously infinitely many such programs. But this is the
point: they are all just as good.
No, that _is_ the point: they're not. Suppose you have code that
decompiles to this:

return Var00042 + 100;

Now, what does the 100 stand for? Decompilers say that it doesn't
matter; 100 is 100 is 100, and you have working code, right? Maintenance
programmers say that it matters a lot. Was it:
- a literal constant 100?
- a #defined constant WEEKS_IN_TWO_WORKING_YEARS_MINUS_VACATION?
- two #defined constants WEEKS_IN_TWO_YEARS and VACATION_PER_YEAR, *2?
- three #defined constants NUMBER_OF_YEARS, WEEKS_IN_YEAR, WEEKS_OFF?
- 'd'?

For blindly recompiling the program, it does not matter. For maintaining
it, it matters quite a bit.
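
The list above can be made concrete. The three functions below are different plausible originals for "return Var00042 + 100;". The macro names come from the list; the values given to them and the function names are assumptions chosen so that the arithmetic comes to 100.

```c
#include <assert.h>

#define NUMBER_OF_YEARS 2
#define WEEKS_IN_YEAR   52
#define WEEKS_OFF       2

/* Three plausible originals for "return Var00042 + 100;". */
static int with_literal(int x)
{
    return x + 100;
}

static int with_macros(int x)
{
    return x + NUMBER_OF_YEARS * (WEEKS_IN_YEAR - WEEKS_OFF);
}

static int with_char(int x)
{
    return x + 'd';   /* 'd' is 100 only in ASCII-compatible character sets */
}
```

After preprocessing and constant folding a compiler would typically emit the same addition for all three, which is why the binary alone cannot say which one the programmer wrote.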
Here is the important point: you can't get the original source code
back, but it doesn't matter. The important thing is to get source code
that encapsulates what the program is doing. In many cases, it is
important that the code can be recompiled. In some cases, it is
important that the code is also readable and maintainable. I can't
think of a situation where you absolutely need the original source
code, or even very close to it.
When the code must be readable and maintainable, it must also be obvious
_why_ the code does what it does, and that it can be modified and
customised in the same ways that the original could.

Richard
Aug 7 '06 #66

Richard Bos wrote:
When the code must be readable and maintainable, it must also be obvious
_why_ the code does what it does, and that it can be modified and
customised in the same ways that the original could.
Absolutely. And it's the job of the re-engineer to turn the output of
a decompiler into maintainable code. It takes intelligence to do this
and depends on the kind of maintenance tasks you need to perform. It
is clearly not the duty of the decompiler.

QuantumG

Aug 10 '06 #67
