Bytes | Software Development & Data Engineering Community

processing raw logs faster

I am trying to process raw IIS log files and insert them into a MySQL
database. I have no problem accomplishing this, but the php code runs
very slow.

Right now, it is processing 10,000 lines of a log file every ~300
seconds. The daily log files have over 500,000 records each, so it
takes hours to process a single day's file.

<?php
error_reporting (E_ALL);

$server = "localhost:3306"; // db host
$user = ""; // db username
$pswd = ""; // db password
$db = ""; // db name

$DB_link = mysql_connect($server, $user, $pswd) OR DIE (mysql_error());

mysql_select_db($db, $DB_link) OR DIE (mysql_error());

$start_time = time();

$handle = fopen("/logs/ex050830.log", "r");
$linecnt = 0;
$totalcnt = 0;
while (!feof($handle)) {

$list = array();
$buffer = fgets($handle, 20000);

if (! preg_match("/^\s*?#/", $buffer) ){
$linecnt++;
$line = split(" ", $buffer);

$stmt = "INSERT INTO logs ( `hit_date` , `hit_time` , `s-sitename` , `s-computername` , " .
"`s-ip` , `cs-method` , `cs-uri-stem` , `cs-uri-query` , `s-port` , `cs-username` , `c-ip` , " .
"`cs-version` , `User-Agent` , `Cookie` , `Referer` , `cs-host` , `sc-status` , `sc-substatus` , " .
"`sc-win32-status` , `sc-bytes` , `cs-bytes` , `time_taken` ) " .
"VALUES ( '".$line[0]."', '".$line[1]."', '".$line[2]."', '".$line[3]."', '".$line[4]."', " .
"'".$line[5]."', '".$line[6]."', '".$line[7]."', '".$line[8]."', '".$line[9]."', " .
"'".$line[10]."', '".$line[11]."', '".$line[12]."', '".$line[13]."', '".$line[14]."', '".$line[15]."', " .
"'".$line[16]."', '".$line[17]."', '".$line[18]."', '".$line[19]."', '".$line[20]."', '".trim($line[21])."' )";
@mysql_query($stmt);

if( $linecnt >= 10000 ){
$totalcnt += $linecnt;
echo "[ ".$totalcnt." ( ".(time() - $start_time)." ) ]\t";
$linecnt = 0;
}

}

}
fclose($handle);

?>

Aug 31 '05 #1
On 31 Aug 2005 12:43:15 -0700, "ch***********@gmail.com"
<ch***********@gmail.com> wrote:
I am trying to process raw IIS log files and insert them into a MySQL
database. I have no problem accomplishing this, but the php code runs
very slow.


Is it the PHP that's slow, or the database updates? Put some timing code in to
get an idea of which bit takes longest, so you can optimise the right part.
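A sketch of such timing instrumentation follows. The in-memory sample and the do_insert() stub are stand-ins for the real file handle and mysql_query() call, and microtime(true) needs PHP 5:

```php
<?php
// Sketch: time the parse phase and the insert phase separately so you
// know which one to optimise. The log sample and do_insert() stub are
// placeholders for the real file handle and mysql_query() call.
function do_insert($line) { /* mysql_query(...) in the real script */ }

$handle = fopen('php://memory', 'r+');
fwrite($handle, "2005-08-30 00:00:01 W3SVC1\n2005-08-30 00:00:02 W3SVC1\n");
rewind($handle);

$parse_time = 0.0;
$insert_time = 0.0;
while (!feof($handle)) {
    $t0 = microtime(true);                  // float seconds; PHP 5+
    $line = explode(" ", fgets($handle, 20000));
    $parse_time += microtime(true) - $t0;

    $t0 = microtime(true);
    do_insert($line);
    $insert_time += microtime(true) - $t0;
}
fclose($handle);
printf("parse: %.4fs  insert: %.4fs\n", $parse_time, $insert_time);
?>
```

Whichever total dominates tells you whether to tune the PHP parsing or the database inserts.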

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Aug 31 '05 #2
You should really rethink your approach. What information do you want
to extract out of the log? Simply sticking every pagehit into the
database will cause MySQL to blow up in no time.

Aug 31 '05 #3
>Is it the PHP that's slow, or the database updates? Put some timing
>code in to get an idea of which bit takes longest, so you can optimise
>the right part.
I removed the SQL insert and the code ran faster, but not by much.
You should really rethink your approach. What information do you want
to extract out of the log? Simply sticking every pagehit into the
database will cause MySQL to blow up in no time.


What would you suggest?

Sep 1 '05 #4
On Wed, 31 Aug 2005 17:27:53 -0700, ch***********@gmail.com wrote:

What would you suggest?


You can get a decent database from http://www.postgresql.org. It's kinda
better suited for heavy OLTP processing than MySQL. Not as good as Oracle
RDBMS, but definitely getting there.

--
http://www.mgogala.com

Sep 1 '05 #5
Mladen Gogala wrote:
On Wed, 31 Aug 2005 17:27:53 -0700, ch***********@gmail.com wrote:

What would you suggest?
You can get a decent database from http://www.postgresql.org. It's kinda
better suited for heavy OLTP processing than MySQL. Not as good as Oracle
RDBMS, but definitely getting there.


No - he wanted it to go faster.

ch***********@gmail.com wrote:
$list = array();
$buffer = fgets($handle, 20000);

if (! preg_match("/^\s*?#/", $buffer) ){
Here's your first problem. Regexes are slow. My PERL REs are a bit
rusty - but that looks a bit suspect anyway. Try coding it without REs.
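A regex-free version of that comment check might look like this. The is_comment() helper name is illustrative; it mirrors what /^\s*?#/ was testing, since W3C extended log comment lines start with '#':

```php
<?php
// Sketch: skip IIS comment lines without a regex. Comment lines start
// with '#' (possibly after whitespace), so a plain string test matches
// what the preg_match("/^\s*?#/", ...) call was checking.
function is_comment($buffer) {
    $trimmed = ltrim($buffer);
    return isset($trimmed[0]) && $trimmed[0] === '#';
}

var_dump(is_comment("#Fields: date time s-sitename"));   // bool(true)
var_dump(is_comment("2005-08-30 00:00:01 W3SVC1"));      // bool(false)
?>
```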
$stmt = "INSERT INTO logs ( `hit_date` , `hit_time` , `s-sitename` , `s-computername` , " .
"`s-ip` , `cs-method` , `cs-uri-stem` , `cs-uri-query` , `s-port` , `cs-username` , `c-ip` , " .
"`cs-version` , `User-Agent` , `Cookie` , `Referer` , `cs-host` , `sc-status` , `sc-substatus` , " .
"`sc-win32-status` , `sc-bytes` , `cs-bytes` , `time_taken` ) " .
Join the strings together - OK it doesn't help the readability - but you
will get some performance benefit. Actually it would be a lot better to
move the invariant parts outside the loop:

$stub="INSERT INTO logs....VALUES( ";
while (!feof($handle)) {
....
$stmt=$stub . "'".$line[0]."', '".$line[1]."', '".$line[2]."',

You could try a more functional approach to generating the VALUES clause -
something like:

$stmt = $stub . "'" . implode("','", $line) . "')";
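Put together, a self-contained sketch of the implode() approach (the column list is abbreviated to two fields, and addslashes() is added to guard against quotes inside log fields):

```php
<?php
// Sketch: one implode() builds the whole VALUES list from the split
// line. Column list abbreviated; addslashes() escapes embedded quotes.
$stub = "INSERT INTO logs (`hit_date`, `hit_time`) VALUES ('";
$line = array("2005-08-30", "23:59:59");
$stmt = $stub . implode("','", array_map('addslashes', $line)) . "')";
echo $stmt, "\n";
// INSERT INTO logs (`hit_date`, `hit_time`) VALUES ('2005-08-30','23:59:59')
?>
```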

This could be more efficient:
if( $linecnt >= 10000 ){
$totalcnt += $linecnt;
echo "[ ".$totalcnt ." ( ". ( time() - $start_time) ." )
]\t";
$linecnt = 0;
}


Instead:

if (!($linecnt % 10000)) {
echo "[ ".$linecnt. " ( ". ( time() - $start_time) ." ) ]\t";
}
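The modulo counter can be exercised standalone; in this sketch the loop bound and the $reports counter are illustrative only:

```php
<?php
// Sketch: progress report via modulo instead of reset-and-accumulate;
// $linecnt alone carries the running total (no separate $totalcnt).
$start_time = time();
$reports = 0;
for ($linecnt = 1; $linecnt <= 30000; $linecnt++) {
    if (!($linecnt % 10000)) {
        echo "[ " . $linecnt . " ( " . (time() - $start_time) . " ) ]\t";
        $reports++;
    }
}
echo "\n";
?>
```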

You should also get a boost by using INSERT DELAYED (assuming your MySQL
version and the table's storage engine support it)

$stub="INSERT DELAYED INTO logs....VALUES( ";
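Beyond INSERT DELAYED, batching many rows into a single multi-row INSERT cuts the per-statement round trips, which is usually the biggest win here. A sketch, where run_query() is a stand-in for mysql_query() and the column list is abbreviated:

```php
<?php
// Sketch: send rows in multi-row batches instead of one INSERT per
// line. run_query() stands in for mysql_query(); two columns only.
$sent = array();
function run_query($sql) { global $sent; $sent[] = $sql; }

$stub = "INSERT DELAYED INTO logs (`hit_date`, `hit_time`) VALUES ";
$batch_size = 500;
$rows = array();

$sample = array(
    array("2005-08-30", "00:00:01"),
    array("2005-08-30", "00:00:02"),
);
foreach ($sample as $line) {
    $rows[] = "('" . implode("','", array_map('addslashes', $line)) . "')";
    if (count($rows) >= $batch_size) {
        run_query($stub . implode(",", $rows));
        $rows = array();
    }
}
if ($rows) {                         // flush the final partial batch
    run_query($stub . implode(",", $rows));
}
echo $sent[0], "\n";
?>
```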

HTH

C.
Sep 1 '05 #6
NC
ch***********@gmail.com wrote:

I am trying to process raw IIS log files and insert them into a MySQL
database. I have no problem accomplishing this, but the php code runs
very slow.


Why do you need PHP code at all? If I remember correctly, IIS log
files are delimited text files that can be imported directly
into MySQL using the LOAD DATA INFILE command...
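A sketch of what that could look like for the OP's file. Since his code splits on spaces, this assumes the space-delimited W3C extended format; the table and column names follow his script, IGNORE 4 LINES skips the usual four '#' header lines (note IIS repeats them if the service restarts mid-file), and the mysql_query() call is left commented out:

```php
<?php
// Sketch: let MySQL bulk-load the log file directly instead of looping
// in PHP. Assumes a space-delimited W3C extended log; column names are
// taken from the original script.
$sql = "LOAD DATA LOCAL INFILE '/logs/ex050830.log'
        INTO TABLE logs
        FIELDS TERMINATED BY ' '
        LINES TERMINATED BY '\\n'
        IGNORE 4 LINES
        (`hit_date`, `hit_time`, `s-sitename`, `s-computername`, `s-ip`,
         `cs-method`, `cs-uri-stem`, `cs-uri-query`, `s-port`,
         `cs-username`, `c-ip`, `cs-version`, `User-Agent`, `Cookie`,
         `Referer`, `cs-host`, `sc-status`, `sc-substatus`,
         `sc-win32-status`, `sc-bytes`, `cs-bytes`, `time_taken`)";
// mysql_query($sql) or die(mysql_error());
echo $sql, "\n";
?>
```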

Cheers,
NC

Sep 1 '05 #7
Colin McKinnon wrote:
Here's your first problem. Regexes are slow. My PERL REs are a bit
rusty - but that looks a bit suspect anyway. Try coding it without REs.


The PCRE extension caches the compiled regex, so the overhead isn't that
high when you keep testing for the same expression. It's highly
unlikely that you can write something more efficient than the expression
/^\s*#/ in plain PHP.

The call to split() definitely should be replaced by explode(),
however. The Posix regex functions do recompile the expression each
time.
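The swap is mechanical, since for a plain single-space delimiter the two calls produce identical results:

```php
<?php
// split() compiles a POSIX regex on every call; explode() does a plain
// string split. For a fixed " " delimiter the output is the same.
$buffer = "2005-08-30 23:59:59 W3SVC1";
$line = explode(" ", $buffer);   // was: split(" ", $buffer)
print_r($line);
?>
```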

Sep 1 '05 #8
On Thu, 01 Sep 2005 09:49:27 +0100, Colin McKinnon wrote:
You can get a decent database from http://www.postgresql.org. It's
kinda better suited for heavy OLTP processing than MySQL. Not as good
as Oracle RDBMS, but definitely getting there.

No - he wanted it to go faster.


Well, that is precisely what he will get, unless he's using MySQL in
single-user mode. Being an Oracle DBA for 15+ years and working for large
companies (my largest database was > 1TB with 3000 concurrent users), I
forgot that there are such databases. MySQL is serious junk which has
problems with transactions (if "dirty reads" are disabled and standard
ANSI-level transaction isolation is enabled, database speed is 25% of the
original), MySQL is prone to lock escalation (many row locks are converted
to a table lock; row locking was only introduced in MySQL v4, so it is very
new), and MySQL is notorious for index corruption and having to rebuild
indexes. Postgres has no lock escalation, normally operates at the
ANSI-standard serialization level, and its indexes have to be rebuilt only on
rare occasions. For large loads, which is precisely what he's describing,
nothing in the world beats Oracle RDBMS with parallel/direct insert.
Postgres is the best free alternative. MySQL is serious junk. The ease
with which you dismissed Postgres tells me that you probably don't know
much about databases.

ch***********@gmail.com wrote:
$list = array();
$buffer = fgets($handle, 20000);

if (! preg_match("/^\s*?#/", $buffer) ){
Here's your first problem. Regexes are slow.


I love unexplained declarations/qualifications like this! In the form of a
verdict, such absolute statements need no explanation. Unfortunately,
this verdict is bound to be struck down on appeal. Regex is a standard
programming mechanism and is not "slow" unless used carelessly. Loops can
also be slow, if used carelessly. Avoid using loops, too? How about switch
statements? I will not even mention PHP5 and the try-catch combination
with exceptions. Creating monstrosities without regex will save you
nothing, because the few instructions saved by refraining from using
regular expressions will be more than offset by the monstrosity of
the program without RE. Your logic leads directly to assembly language
programming - it's without a doubt much faster than PHP, Perl, Java or
anything else. You will not have to suffer from regex. Just buy a few
mathematical books describing the theory of algorithms, including finite
automata, and implement that in assembly language. The speed will be
unbeatable, once you finish the program in the year 2525. Obey the
principles of good programming and you will find that your programs
will work fine, even with RE. Write stupid programs which are
algorithmically incorrect, and not even coding in binary will make them
less stupid.
My PERL REs are a bit
rusty - but that looks a bit suspect anyway. Try coding it without REs.


Some advice is not meant to be followed. This is one such example. The first
thing to do is to profile your program, using Zend Studio if possible, and
find out where the time is spent. That is the most logical thing to do. If
your complaint is that "the program is slow", find out exactly what is
slow and where the time is spent. My experience tells me that the main
cause for slowness is usually inadequate usage of the database. With
MySQL, it's not hard to do. Unless you know the database very well, you
will likely have performance problems. Avoiding RE will then do nothing
but mutilate, scar and disembowel the program.

--
http://www.mgogala.com

Sep 2 '05 #9
