ig*********@gmail.com wrote:
That might be true. Still, the regexp, in spite of being very long, should be
very straightforward.
Here is the regexp (I would understand if no one reads it):
^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:
[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]
+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?$
This is hideous.
It simply matches the line in the log file and has no fancy stuff
involved.
Lots of fancy stuff involved for dubious reasons.
The way I read the file should not matter, as I've tried running the
read-file-line-by-line code separately (with the regexp stuff commented out)
and it ran really fast.
So you're saying it runs very fast when you read a file line by line but
doesn't run fast when you don't? Well yes then of course it matters.
Here is the Python code I've benchmarked:
import re
PATTERN = re.compile(
    r"^(\w{3}\s*\d{1,2}\s*\d{1,2}\:\d{1,2}\:\d{1,2})"
    r"\s*(([^\s]+)\[(\d*)\])\s*([^\s]+)\s*(([^\s]+)\((\d*)\))\:\s(.*?)\n?$")
----
You can set regex matching modes by specifying a special constant as a third
parameter to re.search(). re.I or re.IGNORECASE applies the pattern case
insensitively. re.S or re.DOTALL makes the dot match newlines. re.M or
re.MULTILINE makes the caret and dollar match after and before line breaks in
the subject string. There is no difference between the single-letter and
descriptive options, except for the number of characters you have to type in.
To specify more than one option, "or" them together with the | operator:
re.search("^a", "abc", re.I | re.M).
By default, Python's regex engine only considers the letters A through Z, the
digits 0 through 9, and the underscore as "word characters". Specify the flag
re.L or re.LOCALE to make \w match all characters that are considered letters
given the current locale settings. Alternatively, you can specify re.U or
re.UNICODE to treat all letters from all scripts as word characters. The
setting also affects word boundaries.
----
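The flags described above can be tried out directly. This is only a minimal sketch, assuming Python 3; the sample string is made up and has nothing to do with the log file in question:

```python
import re

text = "first line\nabc\n"

# Without re.M, "^" only anchors at the very start of the string.
no_flags = re.search("^a", text)

# With re.M, "^" also matches right after each line break.
multiline = re.search("^a", text, re.M)

# Flags combine with "|": case-insensitive and multiline together.
combined = re.search("^ABC$", text, re.I | re.M)

# By default "." stops at a newline; re.S (DOTALL) lets it cross one.
dot_default = re.search("first.*abc", text)
dot_all = re.search("first.*abc", text, re.S)

print(no_flags, multiline, combined, dot_default, dot_all)
```

Only the searches that pass re.M find the "abc" on the second line, and only the re.S search lets ".*" run across the line break.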
The above implies that Python's newline mode is *ON* by default. POSIX
regcomp() does NOT have newline mode on by default.
fp = open("some.log", "r")
for line in fp:
    mo = PATTERN.match(line)
fp.close()
Which is line by line.
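To see what per-line matching does with the trailing newline that "for line in fp" leaves on every line, here is a small sketch. It assumes Python 3 and uses a deliberately simplified, hypothetical pattern and made-up log lines, not the real pattern or log from this thread:

```python
import io
import re

# Hypothetical, much shorter pattern:
#   "<timestamp> <host> <prog>[<pid>]: <message>"
SIMPLE = re.compile(
    r"^(\w{3} +\d{1,2} +\d{1,2}:\d{1,2}:\d{1,2}) +(\S+) +(\S+)\[(\d+)\]: (.*)$")

# Made-up sample data standing in for some.log.
fake_log = io.StringIO(
    "Jan  5 10:02:33 myhost sshd[1234]: session opened\n"
    "this line does not match\n"
)

matched = []
for line in fake_log:          # each line still ends with "\n"
    mo = SIMPLE.match(line)
    if mo:
        matched.append(mo.groups())

print(matched)
```

In this sketch the message group comes back without the newline and no "\n?" is needed in the pattern: without re.S the dot does not match a newline, and "$" also matches just before a newline that ends the string.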
And here is the C code:
#include <stdio.h>
#include <stdlib.h>   /* malloc */
#include <string.h>   /* strncpy */
#include <regex.h>
// Just a helper function.
char * read_line(FILE * in) {
    size_t line_len;
    char * buf;
    buf = fgetln(in, &line_len);
    if (!buf) return NULL;
    while (line_len > 0 && (buf[line_len - 1] == (char) 10 ||
           buf[line_len - 1] == (char) 13)) line_len--;
Remove that. There's no reason you should have to do any of that if fgetln()
is doing what it's told.
DESCRIPTION
The fgetln() function returns a pointer to the next line from the stream
referenced by stream. This line is not a C string as it does not end
with a terminating NUL character. The length of the line, including the
final newline, is stored in the memory location to which len points.
(Note, however, that if the line is the last in a file that does not end
in a newline, the returned text will not contain a newline.)
Looks like some BSD4.4 function. Also looks like it operates on a FILE stream
and uses a static pointer of some sort. In short, please just avoid this
function altogether and use fgets().
    char *line = malloc(line_len + 1);
    strncpy(line, buf, line_len);
    line[line_len] = (char) 0;
    return line;
}
Right. So the issue here is that you don't know your maximum line length, which
is probably what led you to find a function like fgetln() in the first place.
This is one area where you've got to either establish a reasonable boundary
size and use that as the size of your temporary buffer, or use fread() and do
buffer management yourself. What this means is that if you do not foresee any
line being longer than, let's say, 1024 characters, use a simple temporary
buffer of 1024 chars and throw an error when you hit the max line length.
int main(int argc, char **argv) {
    regex_t regex;
    int errc = regcomp(&regex,
        "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:"
        "[0-9]{1,2}:[0-9]{1,2}) +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\] +"
        "([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)\\n?$",
        REG_EXTENDED | REG_ICASE);
Add REG_NEWLINE.
Please remove "\\n?" from your regex.
Also, your regex:
^<0 or 1 matches of a formatted string>: <0 or more chars>$
is not the most efficient use of regex, and you should probably examine your
logfile format as well.
However, the main problem, as I see it, is that you did not set REG_NEWLINE.