Bytes IT Community

Search for an expression in a text file

Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?

Please note that I've provided the operating system and compiler info
just for the sake of it. Some people on this group don't like this and
ask to re-post on an appropriate group for Linux. I've still done this
because I never receive a good answer on any other group.

Thanks,
Ritesh

Apr 24 '07 #1
15 Replies


[You should read this with a neutral tone. There's
more to becoming a good programmer than
writing code!]

ritesh <riteshkap...@gmail.com> wrote:
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these
files
Do you have a question about the C language?
(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on
linux. This takes too much time, and kills the
responsiveness of the application.

Is there any function in C or maybe any other way to get
this job done faster?
Most grep implementations are based on publicly available
regex libraries. There is no standard function beyond strstr(),
which is not likely to be as optimised as tailored libraries.

You may even try tools like [f]lex to generate lexical analysers.
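For what it's worth, a minimal strstr()-based scan over one file might look like the sketch below (find_in_file is an invented name, and it assumes no matched line is longer than the buffer):

```c
#include <stdio.h>
#include <string.h>

/* Return the 1-based number of the first line of `path` containing
 * the fixed string `needle`, or 0 if the file cannot be opened or
 * has no match. */
long find_in_file(const char *path, const char *needle)
{
    FILE *fp = fopen(path, "r");
    char line[4096];
    long lineno = 0;

    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        lineno++;
        if (strstr(line, needle) != NULL) {
            fclose(fp);
            return lineno;
        }
    }
    fclose(fp);
    return 0;
}
```

Note that strstr() only handles fixed strings, not regular expressions, which is part of why it can be fast.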
Please note that I've provided the operating system and
compiler info, just for the sake of it.
Hint: If these are at all relevant, there's a very good chance
your post is off-topic in clc.
Some people on this
group don't like this and ask to re-post on an appropriate
group for linux.
Because other groups can answer platform-specific questions
much better, and clc isn't particularly interested in becoming
yet another high noise-to-signal dumping ground for anything
and everything where main is a valid identifier.
I've still done this
because I never receive a good answer on any other group.
That doesn't mean that clc has to rectify your problem.

Have you tried simple web/code searches? Looking for C
source on grep or find/replace should give you ample
sources.

--
Peter

Apr 24 '07 #2

Two points I missed out -

1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

Apr 24 '07 #3


"ritesh" <ri**********@gmail.com> wrote in message
news:11**********************@t38g2000prd.googlegroups.com...
Two points I missed out -

1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.
<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas
Apr 24 '07 #4

ritesh said:
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?
No specific function in C, no, but there is certainly a way. Several
ways, in fact.

One is to slurp grep into your program, to avoid the overhead of
creating a fresh process, complete with environment etc. It should not
be difficult to find the source for grep, so you can turn grep into
grep().

Another possibility is to code the search yourself, using (say)
Knuth-Morris-Pratt or Boyer-Moore, and taking advantage of any data
knowledge you have which might be able to speed up the search.

And of course you can pre-calculate, as someone else suggested - build
an index that gives you key starting-points.

Other possibilities may well exist, and perhaps the others here will
chip in a few ideas when the grumpiness wears off.
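To illustrate the second suggestion, here is a sketch of Boyer-Moore-Horspool, a simplified Boyer-Moore variant (bmh_search is an invented name; no claim is made that this beats a tuned grep):

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: return a pointer to the first occurrence of
 * `pat` (length plen) in `text` (length tlen), or NULL.  The skip
 * table lets the search jump plen bytes past most mismatches. */
const char *bmh_search(const char *text, size_t tlen,
                       const char *pat, size_t plen)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i;

    if (plen == 0 || plen > tlen)
        return NULL;
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = plen;
    for (i = 0; i < plen - 1; i++)
        skip[(unsigned char)pat[i]] = plen - 1 - i;

    i = 0;
    while (i <= tlen - plen) {
        if (memcmp(text + i, pat, plen) == 0)
            return text + i;
        i += skip[(unsigned char)text[i + plen - 1]];
    }
    return NULL;
}
```

The longer the pattern, the bigger the average skip, which is why such algorithms can examine far fewer than tlen bytes.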

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 24 '07 #5

ritesh <ri**********@gmail.com> writes:
I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Apr 24 '07 #6

Keith Thompson said:
ritesh <ri**********@gmail.com> writes:
>>
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job
done faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.
And you haven't defined what you mean by an "expression".
He doesn't have to. C defines it for him. :-)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 24 '07 #7

Richard Heathfield <rj*@see.sig.invalid> writes:
Keith Thompson said:
>ritesh <ri**********@gmail.com> writes:
>>>
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job
done faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.
The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Apr 24 '07 #8

On Apr 24, 3:40 pm, Keith Thompson <k...@mib.org> wrote:
ritesh <riteshkap...@gmail.com> writes:
I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files
(this is being done on Linux using gcc 3.4.2)
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
Is there any function in C or maybe any other way to get this job done
faster?

[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".
By 'expression' I mean a regular expression used by the grep command.

Apr 24 '07 #9

On Apr 24, 4:02 pm, Keith Thompson <k...@mib.org> wrote:
Richard Heathfield <r...@see.sig.invalid> writes:
Keith Thompson said:
ritesh <riteshkap...@gmail.com> writes:
>Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
>Is there any function in C or maybe any other way to get this job
done faster?
[...]
The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]
The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net;
that's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to :(

Apr 24 '07 #10

ritesh wrote:
On Apr 24, 4:02 pm, Keith Thompson <k...@mib.org> wrote:
>>Richard Heathfield <r...@see.sig.invalid> writes:
>>>Keith Thompson said:
>>>>ritesh <riteshkap...@gmail.com> writes:
>>>>>Currently the search is done using the 'grep' utility on linux. This
>takes too much time, and kills the responsiveness of the application.
>>>>>Is there any function in C or maybe any other way to get this job
>done faster?

[...]
>>>>The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
>>>The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]


The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net;
that's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to :(

<OT>
Ask in comp.unix.programmer as the solution to your problem will be way
off topic in comp.lang.c. Some keywords/functions you should look at,
are regcomp() and pthread_create(). Your system probably has man pages
for these functions.

</OT>

HTH
Bjørn
--
Looking for an embeddable web server?
http://www.metasystems.no/products/h...der/index.html
Apr 24 '07 #11

On Apr 24, 12:37 pm, "Jonas" <spamhereple...@gmail.com> wrote:
"ritesh" <riteshkap...@gmail.com> wrote in message

news:11**********************@t38g2000prd.googlegroups.com...
Two points I missed out -
1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.
2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas

I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)
It might help if I could do the same with my text files - but how are
these indices generated? Do you have a pointer to a good web page that
shows this?

thanks

Apr 24 '07 #12

ritesh said:
On Apr 24, 4:02 pm, Keith Thompson <k...@mib.org> wrote:
<snip>
>>
As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?
Here's a very simple example. Let's pretend it really is C code that
you're searching. (Yes, I know, you suggested it's only "similar", but
I need a concrete example here, so I've picked C source as that
example.)

And let's say you know you're looking for a function name in "live"
code. So if you find a logical line whose first non-whitespace
character is a #, you know you can ignore the rest of the line. If you
find a /* pair, you can look specifically for a */ pair without
worrying what comes between. If you're really up to speed, you might
even be able to prune out "dead" code between #if/#endif and the like.
If you find a quote character that isn't a character constant, you can
zoom through to the next unescaped quote character.

And so on and so forth.

Let's take another simple example that isn't C code. Say you have a file
consisting of various kinds of records, with the record type being
noted at the start of the line. If the data you're looking for is
guaranteed only to appear within a particular kind of record, you can
simply check the label and only bother to search the record itself if
it's the right record type.

Many of us tend to focus on writing "generic" software that can handle
any kind of data, and that's all well and good ("write once, use many",
as it were) but, when speed is of the essence, specific knowledge of
the data format can lead to significant speed improvements over generic
algorithms.
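A rough sketch of the first idea - scanning C-like text while skipping block comments and string literals - might look like this (find_ident is an invented name; it ignores // comments, character constants, and preprocessor lines, which a fuller version would also handle):

```c
#include <string.h>

/* Search `src` (C-like source) for the fixed string `name`, skipping
 * block comments and double-quoted string literals.  Returns a
 * pointer to the match, or NULL. */
const char *find_ident(const char *src, const char *name)
{
    size_t nlen = strlen(name);
    const char *p = src;

    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '*') {       /* skip block comment */
            const char *end = strstr(p + 2, "*/");
            if (end == NULL)
                return NULL;                    /* unterminated */
            p = end + 2;
        } else if (*p == '"') {                 /* skip string literal */
            p++;
            while (*p != '\0' && *p != '"') {
                if (*p == '\\' && p[1] != '\0')
                    p++;                        /* skip escaped char */
                p++;
            }
            if (*p == '"')
                p++;
        } else if (strncmp(p, name, nlen) == 0) {
            return p;
        } else {
            p++;
        }
    }
    return NULL;
}
```

The point is not the details but the pruning: large stretches of input are skipped wholesale instead of being compared byte by byte.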

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 24 '07 #13

On 24 Apr 2007 05:27:58 -0700, ritesh <ri**********@gmail.com> wrote:
>As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?
In addition to what Richard wrote, you may be able to take advantage
of knowledge about the data you're searching *for*, as well. For
example, in some cases a simple state machine can provide very fast
searches for particular strings.
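As a tiny illustration of that idea, here is a hand-rolled state machine for one fixed string (contains_for is an invented name; the simple fallback on mismatch happens to be exact for this pattern because 'f' occurs only at its start):

```c
/* Scan `s` for the string "for" with a hard-coded state machine:
 * `state` is the number of pattern characters matched so far, and
 * each input byte is examined exactly once. */
int contains_for(const char *s)
{
    static const char pat[] = "for";
    int state = 0;

    for (; *s != '\0'; s++) {
        if (*s == pat[state])
            state++;
        else
            state = (*s == pat[0]) ? 1 : 0;  /* restart, maybe on 'f' */
        if (state == 3)
            return 1;
    }
    return 0;
}
```

For patterns whose prefixes overlap themselves, the restart logic needs a proper failure function (as in Knuth-Morris-Pratt), but the one-pass structure stays the same.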

--
Al Balmer
Sun City, AZ
Apr 24 '07 #14


"ritesh" <ri**********@gmail.com> wrote in message
news:11**********************@s33g2000prh.googlegroups.com...
On Apr 24, 12:37 pm, "Jonas" <spamhereple...@gmail.com> wrote:
>"ritesh" <riteshkap...@gmail.com> wrote in message

news:11**********************@t38g2000prd.googlegroups.com...
Two points I missed out -
1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.
2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas


I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)
It might help if I could do the same with my text files - but how are
these indices generated? Do you have a pointer to a good web page that
shows this?
Well, for full-blown search engine libraries there are e.g. CLucene
(http://clucene.sourceforge.net/) or Estraier
(http://hyperestraier.sourceforge.net/). I haven't used any of them, though.

If your demands are smaller, you might consider implementing the indexer &
search engine yourself. If the amount of text is small, the index could be
created in RAM, which simplifies things. And if the text files change
infrequently, you would not need any 'delete' functionality -- simply
rebuild the index every once in a while.

Try a web search for "create inverted index" or similar; it seems to give a
few useful hits.
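A toy in-RAM inverted index along those lines might look like this sketch (fixed-size tables, invented names, linear lookup; a real one would use a hash table and dynamic storage):

```c
#include <ctype.h>
#include <string.h>

#define MAX_WORDS 256
#define MAX_POSTS 16

/* One entry per distinct word: the word plus the byte offsets
 * (postings) at which it starts in the indexed text. */
struct entry {
    char word[32];
    long post[MAX_POSTS];
    int  nposts;
};

static struct entry tab[MAX_WORDS];
static int nwords;

static void add_posting(const char *w, long pos)
{
    int k;
    for (k = 0; k < nwords; k++) {
        if (strcmp(tab[k].word, w) == 0) {
            if (tab[k].nposts < MAX_POSTS)
                tab[k].post[tab[k].nposts++] = pos;
            return;
        }
    }
    if (nwords < MAX_WORDS) {
        strcpy(tab[nwords].word, w);
        tab[nwords].post[0] = pos;
        tab[nwords].nposts = 1;
        nwords++;
    }
}

/* Index every alphabetic word of `text` by its byte offset. */
void index_text(const char *text)
{
    long i = 0;
    while (text[i] != '\0') {
        if (!isalpha((unsigned char)text[i])) {
            i++;
            continue;
        }
        long start = i;
        char w[32];
        int len = 0;
        while (isalpha((unsigned char)text[i])) {
            if (len < 31)
                w[len++] = (char)tolower((unsigned char)text[i]);
            i++;
        }
        w[len] = '\0';
        add_posting(w, start);
    }
}

/* Return the posting count for `w`; store its first offset in
 * *first (-1 if absent). */
int lookup(const char *w, long *first)
{
    int k;
    for (k = 0; k < nwords; k++) {
        if (strcmp(tab[k].word, w) == 0) {
            *first = tab[k].post[0];
            return tab[k].nposts;
        }
    }
    *first = -1;
    return 0;
}
```

Once built, a query touches only the index, never the 10K files, which is where the responsiveness win comes from.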

--
Jonas
Apr 24 '07 #15

ritesh wrote:
>
.... snip ...
>
I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)

It might help if I could do the same with my text files - but how
are these indices generated? Do you have a pointer to a good web
page that shows this?
Well, you obviously need at least an array of shorts to define the
files involved, together with an array of longs to identify the
position in the file. This will obviously expand the storage for
short words, such as 'I', 'is', etc. So you probably need a list
of words to ignore. How to handle mistypings is another
consideration.

I would suggest a hash table, based on the word itself, leading to
lists of arrays. Only the tables associated with one file will be
in play at any one moment.

See hashlib for one possible source for the hash table.

<http://cbfalconer.home.att.net/download/>
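For the word hash itself, something like the djb2 sketch below is a common choice (hash_word is an invented name; hashlib's own function may differ):

```c
#define NBUCKETS 1024

/* djb2-style string hash reduced to a bucket index: h = h*33 + c,
 * starting from 5381. */
unsigned long hash_word(const char *w)
{
    unsigned long h = 5381;
    while (*w != '\0')
        h = h * 33 + (unsigned char)*w++;
    return h % NBUCKETS;
}
```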

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>
cbfalconer at maineline.net

--
Posted via a free Usenet account from http://www.teranews.com

Apr 24 '07 #16
