Javascript Collection, Obfuscation, Crawling?

Hello all,
I am a visiting researcher at a laboratory this summer and my
current task is investigating javascript obfuscation techniques. I am
trying to get a relatively large sample of websites containing
javascript code so I can analyze it and determine if it is:
1) obfuscated
2) malicious

I have a fairly decent idea of what the result will be, but it would
be nice to have statistics on my side. Having said that, I believe it
will be necessary to have a very large sample size to perform my
analysis.

Now for my question: does anyone know if there are any ways to utilize
a web browser or other component to automatically find javascript
samples? Google has not yielded any results, and the code search
merely searches repositories; not exactly what I need.

Short of rolling my own crawler, can anyone offer any suggestions that
will aid me in my task?

Thanks!

Jul 24 '07 #1
Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
> Hello all,
> I am a visiting researcher at a laboratory this summer and my
> current task is investigating javascript obfuscation techniques. I am
> trying to get a relatively large sample of websites containing
> javascript code so I can analyze it and determine if it is:
> 1) obfuscated
> 2) malicious

What is the sense of determining if js is obfuscated?

You would first need a decent definition of obfuscation.

Do you really think, or does your employer, that the level of
"obfuscation" is a measure of probability of maliciousness?

The common understanding on this NG is, methinks, that obfuscation only
deters the users that cannot even read plain js.

> I have a fairly decent idea of what the result will be, but it would
> be nice to have statistics on my side. Having said that, I believe it
> will be necessary to have a very large sample size to perform my
> analysis.

You should do a random pilot and extrapolate, having determined the
randomness with other parameters. A professional statistician looking
over your shoulder is a must here. Do not throw salt into her eyes.

> Now for my question: does anyone know if there are any ways to utilize
> a web browser or other component to automatically find javascript
> samples? Google has not yielded any results, and the code search
> merely searches repositories; not exactly what I need.
>
> Short of rolling my own crawler, can anyone offer any suggestions that
> will aid me in my task?

Someone has to crawl, and if it's not Google, it must be you, meseems.

Building one is not that difficult, just write an XMLHTTP function.

I would use Google with some simple words to quickly get a large number
of URLs and measure the number of bytes between <script and </script> in
the received strings, and check for external .js files.
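
A minimal sketch of the measurement suggested above, assuming the page has already been fetched into a string. The function name `summariseScripts` and the regular expressions are illustrative assumptions, not anything from the thread; it counts characters rather than bytes, and a regex is only a rough stand-in for a real HTML parser.

```javascript
// Hypothetical helper: summarise the script content of one HTML string.
function summariseScripts(html) {
  // Match each <script ...>...</script> element, case-insensitively.
  const scriptRe = /<script\b([^>]*)>([\s\S]*?)<\/script\s*>/gi;
  // Pull a src attribute value (double-quoted, single-quoted or bare).
  const srcRe = /\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;

  let inlineChars = 0;          // characters of inline script (not bytes)
  const externalSources = [];   // values of src attributes
  let match;

  while ((match = scriptRe.exec(html)) !== null) {
    const src = srcRe.exec(match[1]);
    if (src) {
      externalSources.push(src[1] ?? src[2] ?? src[3]);
    } else {
      inlineChars += match[2].length;
    }
  }
  return { inlineChars, externalSources };
}

// Usage: feed it the text of each retrieved page and log the result.
const sample = '<script type="text/javascript">var a=1;</script>' +
               '<script src="lib.js"></script>';
console.log(summariseScripts(sample));
```

The external files listed in `externalSources` would still have to be fetched separately before they can be analysed.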
--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)
Jul 24 '07 #2
On Jul 24, 11:23 am, "Evertjan." <exjxw.hannivo...@interxnl.net>
wrote:
> Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
>
> What is the sense of determining if js is obfuscated?
>
> You would first need a decent definition of obfuscation.
>
> Do you really think, or does your employer, that the level of
> "obfuscation" is a measure of probability of maliciousness?

No, I do not think this, nor does my employer.

> The common understanding on this NG is, methinks, that obfuscation only
> deters the users that cannot even read plain js.

I agree.

>> I have a fairly decent idea of what the result will be, but it would
>> be nice to have statistics on my side. Having said that, I believe it
>> will be necessary to have a very large sample size to perform my
>> analysis.
>
> You should do a random pilot and extrapolate, having determined the
> randomness with other parameters. A professional statistician looking
> over your shoulder is a must here. Do not throw salt into her eyes.

This is a bit assuming, but thank you for the suggestion. Let's just
say that there are enough people in my vicinity to verify my results
and ensure that I perform statistical tests properly. Having said that,
I am no stranger to the field.

>> Now for my question: does anyone know if there are any ways to utilize
>> a web browser or other component to automatically find javascript
>> samples? Google has not yielded any results, and the code search
>> merely searches repositories; not exactly what I need.
>> Short of rolling my own crawler, can anyone offer any suggestions that
>> will aid me in my task?
>
> Someone has to crawl, and if it's not Google, it must be you, meseems.
>
> Building one is not that difficult, just write an XMLHTTP function.

I wasn't really concerned with difficulty; I was just wondering if
someone knew of a method to save me some time. I am currently juggling
multiple projects and this one is a little lower in priority than
others.

> I would use Google with some simple words to quickly get a large number
> of URLs and measure the number of bytes between <script and </script> in
> the received strings, and check for external .js files.

I will probably write my own crawler in conjunction with the Google
API.

Thank you again for your suggestions, but I found many of your
statements assuming and/or loaded. I wish you had asked me
questions for clarification without introducing a bias into the way
you ask said questions; personally, I find that a bit insulting.

--
Steve

Jul 24 '07 #3
Steve H. said the following on 7/24/2007 2:02 PM:
> Hello all,
> I am a visiting researcher at a laboratory this summer and my
> current task is investigating javascript obfuscation techniques. I am
> trying to get a relatively large sample of websites containing
> javascript code so I can analyze it and determine if it is:
> 1) obfuscated
> 2) malicious

The only effective way to know either of those is to sit and study the
code manually and determine whether it is obfuscated and/or malicious.
There is no telltale sign as to whether code is obfuscated or
malicious. I assume you will be doing that yourself? Also, you would
need a pretty good understanding of JS to know what is malicious or not
(and it depends on your definition of malicious).

Is this code malicious?

<script type="text/javascript">
function closeTheWindow(){
  self.close();
}
</script>

Personally, I find it malicious as there is no good use for it, but it
does nothing "malicious" to the user's computer.

As for obfuscated code, if a site has both - obfuscated and plain code -
does it go as obfuscated or not?

> I have a fairly decent idea of what the result will be, but it would
> be nice to have statistics on my side. Having said that, I believe it
> will be necessary to have a very large sample size to perform my
> analysis.

A simple Google search for the four words "and", "but", "or" and "the"
returns 14,140,000,000 pages. Ironically, if I add "OR usenet" it lowers
the results when it would stand to reason that it should leave them the
same or increase them. Gotta love Google.

<URL:
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=+and+OR+but+OR+or+OR+the>

How much larger a sample do you want? I think that search would be fairly
indicative of the web in general as it doesn't skew the results towards
any particular thing other than English pages.

> Now for my question: does anyone know if there are any ways to utilize
> a web browser or other component to automatically find javascript
> samples? Google has not yielded any results, and the code search
> merely searches repositories; not exactly what I need.

You could use a browser to retrieve the pages as a crawler, but the task
of determining whether a page has script in it or not - across a very
large number of pages - would best be left to some other program. An XHR
request running locally doesn't suffer from a cross domain issue so you
could simply feed an IE page a million or so URLs and have it retrieve
each, search it for script, and log the results. You would have to go
back and review the script pages manually though. 3 seconds to do a page
is being very generous for the time it would take to retrieve the page,
search its contents, log the results, read another URL and issue a
request for it. At 3 seconds, a simple million pages would take you 800+
hours to machine process them. Then the time of manually processing
those 1 million pages gets astronomical. Make it 14 billion pages and
your grandchildren wouldn't get it finished with one computer.
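
The arithmetic behind those figures can be checked with a few lines. This is only a back-of-the-envelope sketch: the 3 seconds per page and the page counts come from the post, while the function name `crawlHours` is made up for illustration and a single machine fetching pages one at a time is assumed.

```javascript
// Hours needed to process `pageCount` pages sequentially
// at `secondsPerPage` seconds each.
function crawlHours(pageCount, secondsPerPage) {
  return (pageCount * secondsPerPage) / 3600;
}

// One million pages at 3 seconds each: about 833 hours ("800+ hours").
const millionPagesHours = crawlHours(1e6, 3);

// The 14,140,000,000 pages of the Google search at the same rate,
// converted to years of 365 days: roughly 1,345 years.
const allPagesYears = crawlHours(14.14e9, 3) / (24 * 365);

console.log(Math.round(millionPagesHours), Math.round(allPagesYears));
```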

> Short of rolling my own crawler, can anyone offer any suggestions that
> will aid me in my task?

While I am curious about some hard analysis to see whether it is as
prevalent as I think it is (scripting itself) along with
malicious/obfuscated code, without a very large sample (above a billion
pages) the results would have to be skewed in one direction or the
other, and in the end that makes those statistics useless for a real
world observation.

--
Randy
Chance Favors The Prepared Mind
comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/
Jul 25 '07 #4
On Jul 25, 4:02 am, "Steve H." <steve.c.ha...@gmail.com> wrote:
> Hello all,
> I am a visiting researcher at a laboratory this summer and my
> current task is investigating javascript obfuscation techniques. I am
> trying to get a relatively large sample of websites containing
> javascript code so I can analyze it and determine if it is:
> 1) obfuscated
> 2) malicious
>
> I have a fairly decent idea of what the result will be, but it would
> be nice to have statistics on my side. Having said that, I believe it
> will be necessary to have a very large sample size to perform my
> analysis.
>
> Now for my question: does anyone know if there are any ways to utilize
> a web browser or other component to automatically find javascript
> samples? Google has not yielded any results, and the code search
> merely searches repositories; not exactly what I need.
>
> Short of rolling my own crawler, can anyone offer any suggestions that
> will aid me in my task?

Detecting obfuscated code should be fairly straightforward; look for
the patterns:

function <identifier>
var <identifier>

and compare the amount of white space to character data. If the
average length of identifiers is short (say 2 characters) and the
percentage of white space is very low (say less than 5%, testing will
tell), the code is likely obfuscated.
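
A rough sketch of that heuristic, using the thresholds floated above (average identifier length of about 2 characters, white space below 5%). The function name `minificationHints` and its return shape are assumptions made for illustration, and the regex also matches `var` or `function` inside strings and comments, so the result is a hint rather than a verdict.

```javascript
// Hypothetical heuristic: short declared names plus very little
// white space suggests minified (or obfuscated) source.
function minificationHints(source) {
  // Collect names declared with `function <identifier>` or `var <identifier>`.
  const declRe = /\b(?:function|var)\s+([A-Za-z_$][\w$]*)/g;
  const names = [];
  let match;
  while ((match = declRe.exec(source)) !== null) {
    names.push(match[1]);
  }

  const totalNameLength = names.reduce((sum, name) => sum + name.length, 0);
  const avgIdentifierLength = names.length ? totalNameLength / names.length : 0;

  // Share of all characters that are white space.
  const whitespaceCount = (source.match(/\s/g) || []).length;
  const whitespaceRatio = source.length ? whitespaceCount / source.length : 0;

  return {
    avgIdentifierLength,
    whitespaceRatio,
    // Thresholds are the guesses from the post; real data should tune them.
    looksMinified: names.length > 0 &&
                   avgIdentifierLength <= 2 &&
                   whitespaceRatio < 0.05
  };
}

const readable = 'function computeTotal(price) {\n  var taxRate = 0.2;\n' +
                 '  return price * (1 + taxRate);\n}';
console.log(minificationHints(readable).looksMinified); // false
```

Very short scripts tend to fail the white-space test simply because the few mandatory spaces weigh heavily, so a minimum script length may be worth adding.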

I don't know if you intend to attribute any particular motive to
obfuscation, but when used to minimize identifier lengths and remove
all unnecessary white space (i.e. minification) it can seriously
reduce the size of scripts, providing the benefits of faster downloads
and lower data volume. The fact that obfuscated code is also (very)
difficult to read is seen as a bonus by some, though it should not be
the primary purpose for using it.

For example, Google's map scripts are (or were, I haven't checked
lately) obfuscated, yet within a very short time manually
'de-obfuscated' versions appeared on the web, published by those who
wanted to share how it worked. I expect Google wasn't concerned about
that as they were likely after the minification benefits rather than
attempting to protect their copyright.

As for malicious code, I think you need to know exactly what you are
looking for, e.g. the recently publicised IE and Firefox protocol
handling flaw or the supposed iPhone vulnerability. I think
javascript might be used as a transport to, say, deliver a malicious
object (say applet, animation or image), but it is unlikely that the
script itself will be malicious.
--
Rob

Jul 25 '07 #5
RobG said the following on 7/24/2007 10:32 PM:
> Detecting obfuscated code should be fairly straightforward; look for
> the patterns:
>
> function <identifier>
> var <identifier>

That pattern exists in minimized code but very seldom appears in the raw
code of obfuscated scripts, many of which start with a pattern similar to this:

var x = ".............."
eval(x)
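
A minimal sketch of a check for that shape, assuming the packed payload is a plain string literal assigned with `var` and later passed to `eval` by name. The function name `hasEvalPackerPattern` is hypothetical, and real packers vary a great deal (`unescape`, `document.write`, nested `eval`), so this catches only the simplest form. The source is inspected as text and never executed.

```javascript
// Hypothetical check: is some string-valued `var` later handed to eval()?
function hasEvalPackerPattern(source) {
  // Matches: var <name> = "<string>"  or  var <name> = '<string>'
  const assignRe =
    /\bvar\s+([A-Za-z_$][\w$]*)\s*=\s*(?:"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')/g;
  let match;
  while ((match = assignRe.exec(source)) !== null) {
    // Escape `$`, the only regex-special character allowed in the name.
    const name = match[1].replace(/\$/g, () => '\\$');
    const evalRe = new RegExp('\\beval\\s*\\(\\s*' + name + '\\s*\\)');
    if (evalRe.test(source)) {
      return true;
    }
  }
  return false;
}

console.log(hasEvalPackerPattern('var x = "alert(1)";\neval(x)')); // true
console.log(hasEvalPackerPattern('var x = "hello"; alert(x);'));   // false
```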

--
Randy
Chance Favors The Prepared Mind
comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/
Jul 25 '07 #6
Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
> Thank you again for your suggestions, but I found many of your
> statements assuming and/or loaded. I wish you had asked me
> questions for clarification without introducing a bias into the way
> you ask said questions; personally, I find that a bit insulting.

You were on the asking side, providing not even enough info about your
own presumed qualities, so if you want only niceties, try a paid
helpdesk.

This is usenet, so get used to it, Steve.

>> You should do a random pilot and extrapolate, having determined the
>> randomness with other parameters. A professional statistician looking
>> over your shoulder is a must here. Do not throw salt into her eyes.
>
> This is a bit assuming, but thank you for the suggestion. Let's just
> say that there are enough people in my vicinity to verify my results
> and ensure that I perform statistical tests properly. Having said that,
> I am no stranger to the field.

Again, how could we know you are "no stranger to the field" of
statistics?

In the medical field, where I work, checking your own research statistics
is rightly felt to introduce hidden biases.

>> Do you really think, or does your employer, that the level of
>> "obfuscation" is a measure of probability of maliciousness?
>
> No, I do not think this, nor does my employer.

So why are you [plural] searching for obfuscation at all, if,
as I surmise, you are after malicious code on the web?

===

I think a pilot that is properly conducted, in the statistical sense,
will give you a reasonable idea about the computer time involved to find
enough of the code you are after. Perhaps the main enterprise would take
12 years, or 2 hours of computer time; who is to say without a pilot?
And even then extrapolation, the standard goal of a pilot, remains
dangerous as some hidden timing effect could act exponentially or the
pilot's URL batch could prove to be non-representative on a larger scale.
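
As an illustration of what such a pilot would feed into, here is a naive linear extrapolation. Everything in it is an assumption made for the example: the function name `extrapolateFromPilot`, the field names, and above all the premise that hit rate and time per page stay constant at scale, which is exactly what the caveats above warn may not hold.

```javascript
// Hypothetical linear extrapolation from a pilot crawl.
// pilot = { pagesFetched, pagesWithTarget, elapsedSeconds }
function extrapolateFromPilot(pilot, targetSamples) {
  // Fraction of pilot pages that contained the code of interest.
  const hitRate = pilot.pagesWithTarget / pilot.pagesFetched;
  // Average wall-clock cost of one page in the pilot.
  const secondsPerPage = pilot.elapsedSeconds / pilot.pagesFetched;
  // Pages to fetch to expect `targetSamples` hits (Infinity if no hits).
  const pagesNeeded = targetSamples / hitRate;
  return {
    hitRate,
    pagesNeeded,
    estimatedHours: (pagesNeeded * secondsPerPage) / 3600
  };
}

// Made-up pilot numbers, for illustration only.
const estimate = extrapolateFromPilot(
  { pagesFetched: 1000, pagesWithTarget: 50, elapsedSeconds: 3000 },
  10000
);
console.log(estimate);
```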

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)
Jul 25 '07 #7
RobG wrote:
> I don't know if you intend to attribute any particular motive to
> obfuscation, but when used to minimize identifier lengths and remove
> all unnecessary white space (i.e. minification) it can seriously
> reduce the size of scripts, providing the benefits of faster downloads
> and lower data volume. [...]

I wouldn't be so sure about that. For example, omitting white space
characters tends to require delimiter characters that were otherwise not
needed.
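
One way to see the point, sketched under the assumption that the original code relied on line breaks to end statements: removing the line break makes the code unparseable, and the semicolon that fixes it costs the same one character. The helper name `parses` is made up; it only compiles the text with `new Function` and never runs it.

```javascript
// Returns true when `source` compiles as a function body; nothing is executed.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    return false;
  }
}

const withNewline = 'var a = f()\nvar b = g()';   // relies on the line break
const stripped = 'var a = f()var b = g()';        // line break removed
const withSemicolon = 'var a = f();var b = g()';  // delimiter added instead

console.log(parses(withNewline), parses(stripped), parses(withSemicolon));
console.log(withNewline.length === withSemicolon.length);
```

White space that is pure indentation can still be dropped for free, so the saving depends on how the source was written.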
PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not the
best source of advice on designing systems that use javascript.
-- Richard Cornford, <f8*******************@news.demon.co.uk>
Aug 1 '07 #8
On Jul 24, 7:32 pm, RobG <rg...@iinet.net.au> wrote:
> Detecting obfuscated code should be fairly straightforward; look for
> the patterns:
>
> function <identifier>
> var <identifier>
>
> and compare the amount of white space to character data. If the
> average length of identifiers is short (say 2 characters) and the
> percentage of white space is very low (say less than 5%, testing will
> tell), the code is likely obfuscated.

Also, another test for obfuscation may be to check whether there are any
comments in the script. Comments are usually removed from the source
in compressed/obfuscated code.
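
A small sketch of such a comment check. The function name `countComments` is an assumption, and the regexes are deliberately naive: they do not understand string or regex literals, so `/* ... */` inside a string is miscounted, and a `//` comment that directly follows code with no space before it is missed in order to avoid counting URLs such as `http://`.

```javascript
// Hypothetical check: count block and line comments in a script's text.
function countComments(source) {
  const blockRe = /\/\*[\s\S]*?\*\//g;
  const blockComments = source.match(blockRe) || [];

  // Drop block comments first so their contents are not counted again.
  const withoutBlocks = source.replace(blockRe, '');

  // Require line start or white space before `//`, which skips "http://".
  const lineComments = withoutBlocks.match(/(?:^|\s)\/\/.*$/gm) || [];

  return blockComments.length + lineComments.length;
}

const commented = '/* setup */\nvar a = 1; // counter\n' +
                  'var u = "http://example.com";';
console.log(countComments(commented));           // 2
console.log(countComments('var a=1;var b=2;'));  // 0
```

A count of zero on a long script would be one more hint, alongside the others in this thread, that the code was compressed or obfuscated.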
Aug 1 '07 #9
