Bytes IT Community

Search for an expression in a text file

Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?

Please note that I've provided the operating system and compiler info
just for the sake of it. Some people on this group don't like this and
ask to re-post on an appropriate group for Linux. I've still done this
because I never receive a good answer on any other group.

Thanks,
Ritesh

Apr 24 '07 #1
15 Replies


[You should read this with a neutral tone. There's
more to becoming a good programmer than
writing code!]

ritesh <riteshkap...@gmail.com> wrote:
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these
files
Do you have a question about the C language?
(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on
linux. This takes too much time, and kills the
responsiveness of the application.

Is there any function in C or maybe any other way to get
this job done faster?
Most grep implementations are based on publicly available
regex libraries. There is no standard function beyond strstr(),
which is not likely to be as optimised as tailored libraries.

You may even try tools like [f]lex to generate lexical analysers.
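For what it's worth, a minimal strstr()-based scan over one file might look like the sketch below (find_in_file is an invented name, and it assumes no matched line is longer than the buffer):

```c
#include <stdio.h>
#include <string.h>

/* Return the 1-based number of the first line of `path` containing
 * the fixed string `needle`, or 0 if the file cannot be opened or
 * has no match. */
long find_in_file(const char *path, const char *needle)
{
    FILE *fp = fopen(path, "r");
    char line[4096];
    long lineno = 0;

    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        lineno++;
        if (strstr(line, needle) != NULL) {
            fclose(fp);
            return lineno;
        }
    }
    fclose(fp);
    return 0;
}
```

Note that strstr() only handles fixed strings, not regular expressions, which is part of why it can be fast.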
Please note that I've provided the operating system and
compiler info, just for the sake of it.
Hint: If these are at all relevant, there's a very good chance
your post is off-topic in clc.
Some people on this
group don't like this and ask to re-post on an appropriate
group for linux.
Because other groups can answer platform-specific questions
much better, and clc isn't particularly interested in becoming
yet another high noise-to-signal dumping ground for anything
and everything where main is a valid identifier.
I've still done this
because I never receive a good answer on any other group.
That doesn't mean that clc has to rectify your problem.

Have you tried simple web/code searches? Looking for C
source on grep or find/replace should give you ample
sources.

--
Peter

Apr 24 '07 #2

Two points I missed out -

1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

Apr 24 '07 #3


"ritesh" <ri**********@gmail.com> wrote in message
news:11**********************@t38g2000prd.googlegroups.com...
Two points I missed out -

1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.
<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas
Apr 24 '07 #4

ritesh said:
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?
No specific function in C, no, but there is certainly a way. Several
ways, in fact.

One is to slurp grep into your program, to avoid the overhead of
creating a fresh process, complete with environment etc. It should not
be difficult to find the source for grep, so you can turn grep into
grep().

Another possibility is to code the search yourself, using (say)
Knuth-Morris-Pratt or Boyer-Moore, and taking advantage of any data
knowledge you have which might be able to speed up the search.

And of course you can pre-calculate, as someone else suggested - build
an index that gives you key starting-points.

Other possibilities may well exist, and perhaps the others here will
chip in a few ideas when the grumpiness wears off.
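To illustrate the second suggestion, here is a sketch of Boyer-Moore-Horspool, a simplified Boyer-Moore variant (bmh_search is an invented name; no claim is made that this beats a tuned grep):

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: return a pointer to the first occurrence of
 * `pat` (length plen) in `text` (length tlen), or NULL.  The skip
 * table lets the search jump plen bytes past most mismatches. */
const char *bmh_search(const char *text, size_t tlen,
                       const char *pat, size_t plen)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i;

    if (plen == 0 || plen > tlen)
        return NULL;
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = plen;
    for (i = 0; i < plen - 1; i++)
        skip[(unsigned char)pat[i]] = plen - 1 - i;

    i = 0;
    while (i <= tlen - plen) {
        if (memcmp(text + i, pat, plen) == 0)
            return text + i;
        i += skip[(unsigned char)text[i + plen - 1]];
    }
    return NULL;
}
```

The longer the pattern, the bigger the average skip, which is why such algorithms can examine far fewer than tlen bytes.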

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 24 '07 #5

ritesh <ri**********@gmail.com> writes:
I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Apr 24 '07 #6

Keith Thompson said:
ritesh <ri**********@gmail.com> writes:
>>
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job
done faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.
And you haven't defined what you mean by an "expression".
He doesn't have to. C defines it for him. :-)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 24 '07 #7

Richard Heathfield <rj*@see.sig.invalid> writes:
Keith Thompson said:
>ritesh <ri**********@gmail.com> writes:
>>>
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job
done faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.
The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Apr 24 '07 #8

On Apr 24, 3:40 pm, Keith Thompson <k...@mib.org> wrote:
ritesh <riteshkap...@gmail.com> writes:
I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files
(this is being done on Linux using gcc 3.4.2)
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
Is there any function in C or maybe any other way to get this job done
faster?

[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".
By 'expression' I mean a regular expression used by the grep command.

Apr 24 '07 #9

On Apr 24, 4:02 pm, Keith Thompson <k...@mib.org> wrote:
Richard Heathfield <r...@see.sig.invalid> writes:
Keith Thompson said:
ritesh <riteshkap...@gmail.com> writes:
>Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
>Is there any function in C or maybe any other way to get this job
done faster?
[...]
The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]
The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net;
that's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to :(

Apr 24 '07 #10

ritesh wrote:
On Apr 24, 4:02 pm, Keith Thompson <k...@mib.org> wrote:
>>Richard Heathfield <r...@see.sig.invalid> writes:
>>>Keith Thompson said:
>>>>ritesh <riteshkap...@gmail.com> writes:
>>>>>Currently the search is done using the 'grep' utility on linux. This
>takes too much time, and kills the responsiveness of the application.
>>>>>Is there any function in C or maybe any other way to get this job
>done faster?

[...]
>>>>The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
>>>The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]


The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net;
that's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to :(

<OT>
Ask in comp.unix.programmer as the solution to your problem will be way
off topic in comp.lang.c. Some keywords/functions you should look at,
are regcomp() and pthread_create(). Your system probably has man pages
for these functions.

</OT>

HTH
Bjørn
--
Looking for an embeddable web server?
http://www.metasystems.no/products/h...der/index.html
Apr 24 '07 #11

On Apr 24, 12:37 pm, "Jonas" <spamhereple...@gmail.com> wrote:
"ritesh" <riteshkap...@gmail.com> wrote in message

news:11**********************@t38g2000prd.googlegroups.com...
Two points I missed out -
1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.
2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas

I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)
It might help if I could do the same with my text files - but how are
these indices generated? Do you have a pointer to a good web page that
shows this?

thanks

Apr 24 '07 #12

ritesh said:
On Apr 24, 4:02 pm, Keith Thompson <k...@mib.org> wrote:
<snip>
>>
As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?
Here's a very simple example. Let's pretend it really is C code that
you're searching. (Yes, I know, you suggested it's only "similar", but
I need a concrete example here, so I've picked C source as that
example.)

And let's say you know you're looking for a function name in "live"
code. So if you find a logical line whose first non-whitespace
character is a #, you know you can ignore the rest of the line. If you
find a /* pair, you can look specifically for a */ pair without
worrying what comes between. If you're really up to speed, you might
even be able to prune out "dead" code between #if/#endif and the like.
If you find a quote character that isn't a character constant, you can
zoom through to the next unescaped quote character.

And so on and so forth.

Let's take another simple example that isn't C code. Say you have a file
consisting of various kinds of records, with the record type being
noted at the start of the line. If the data you're looking for is
guaranteed only to appear within a particular kind of record, you can
simply check the label and only bother to search the record itself if
it's the right record type.

Many of us tend to focus on writing "generic" software that can handle
any kind of data, and that's all well and good ("write once, use many",
as it were) but, when speed is of the essence, specific knowledge of
the data format can lead to significant speed improvements over generic
algorithms.
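A rough sketch of the first idea - scanning C-like text while skipping block comments and string literals - might look like this (find_ident is an invented name; it ignores // comments, character constants, and preprocessor lines, which a fuller version would also handle):

```c
#include <string.h>

/* Search `src` (C-like source) for the fixed string `name`, skipping
 * block comments and double-quoted string literals.  Returns a
 * pointer to the match, or NULL. */
const char *find_ident(const char *src, const char *name)
{
    size_t nlen = strlen(name);
    const char *p = src;

    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '*') {       /* skip block comment */
            const char *end = strstr(p + 2, "*/");
            if (end == NULL)
                return NULL;                    /* unterminated */
            p = end + 2;
        } else if (*p == '"') {                 /* skip string literal */
            p++;
            while (*p != '\0' && *p != '"') {
                if (*p == '\\' && p[1] != '\0')
                    p++;                        /* skip escaped char */
                p++;
            }
            if (*p == '"')
                p++;
        } else if (strncmp(p, name, nlen) == 0) {
            return p;
        } else {
            p++;
        }
    }
    return NULL;
}
```

The point is not the details but the pruning: large stretches of input are skipped wholesale instead of being compared byte by byte.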

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 24 '07 #13

On 24 Apr 2007 05:27:58 -0700, ritesh <ri**********@gmail.com> wrote:
>As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?
In addition to what Richard wrote, you may be able to take advantage
of knowledge about the data you're searching *for*, as well. For
example, in some cases a simple state machine can provide very fast
searches for particular strings.
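As a tiny illustration of that idea, here is a hand-rolled state machine for one fixed string (contains_for is an invented name; the simple fallback on mismatch happens to be exact for this pattern because 'f' occurs only at its start):

```c
/* Scan `s` for the string "for" with a hard-coded state machine:
 * `state` is the number of pattern characters matched so far, and
 * each input byte is examined exactly once. */
int contains_for(const char *s)
{
    static const char pat[] = "for";
    int state = 0;

    for (; *s != '\0'; s++) {
        if (*s == pat[state])
            state++;
        else
            state = (*s == pat[0]) ? 1 : 0;  /* restart, maybe on 'f' */
        if (state == 3)
            return 1;
    }
    return 0;
}
```

For patterns whose prefixes overlap themselves, the restart logic needs a proper failure function (as in Knuth-Morris-Pratt), but the one-pass structure stays the same.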

--
Al Balmer
Sun City, AZ
Apr 24 '07 #14


"ritesh" <ri**********@gmail.com> wrote in message
news:11**********************@s33g2000prh.googlegroups.com...
On Apr 24, 12:37 pm, "Jonas" <spamhereple...@gmail.com> wrote:
>"ritesh" <riteshkap...@gmail.com> wrote in message

news:11**********************@t38g2000prd.googlegroups.com...
Two points I missed out -
1. The text files - are of random form - they don't contain records or
any ordered sequence of characters.
2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>

--
Jonas


I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)
It might help if I could do the same with my text files - but how are
these indices generated? Do you have a pointer to a good web page that
shows this?
Well, for full-blown search engine libraries there are e.g. CLucene
(http://clucene.sourceforge.net/) or Estraier
(http://hyperestraier.sourceforge.net/). I haven't used any of them, though.

If your demands are smaller, you might consider implementing the indexer &
search engine yourself. If the amount of text is small, the index could be
created in RAM, which simplifies things. And if the text files change
infrequently, you would not need any 'delete' functionality -- simply
rebuild the index every once in a while.

Try a web search for "create inverted index" or similar; it seems to give a
few useful hits.
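A toy in-RAM inverted index along those lines might look like this sketch (fixed-size tables, invented names, linear lookup; a real one would use a hash table and dynamic storage):

```c
#include <ctype.h>
#include <string.h>

#define MAX_WORDS 256
#define MAX_POSTS 16

/* One entry per distinct word: the word plus the byte offsets
 * (postings) at which it starts in the indexed text. */
struct entry {
    char word[32];
    long post[MAX_POSTS];
    int  nposts;
};

static struct entry tab[MAX_WORDS];
static int nwords;

static void add_posting(const char *w, long pos)
{
    int k;
    for (k = 0; k < nwords; k++) {
        if (strcmp(tab[k].word, w) == 0) {
            if (tab[k].nposts < MAX_POSTS)
                tab[k].post[tab[k].nposts++] = pos;
            return;
        }
    }
    if (nwords < MAX_WORDS) {
        strcpy(tab[nwords].word, w);
        tab[nwords].post[0] = pos;
        tab[nwords].nposts = 1;
        nwords++;
    }
}

/* Index every alphabetic word of `text` by its byte offset. */
void index_text(const char *text)
{
    long i = 0;
    while (text[i] != '\0') {
        if (!isalpha((unsigned char)text[i])) {
            i++;
            continue;
        }
        long start = i;
        char w[32];
        int len = 0;
        while (isalpha((unsigned char)text[i])) {
            if (len < 31)
                w[len++] = (char)tolower((unsigned char)text[i]);
            i++;
        }
        w[len] = '\0';
        add_posting(w, start);
    }
}

/* Return the posting count for `w`; store its first offset in
 * *first (-1 if absent). */
int lookup(const char *w, long *first)
{
    int k;
    for (k = 0; k < nwords; k++) {
        if (strcmp(tab[k].word, w) == 0) {
            *first = tab[k].post[0];
            return tab[k].nposts;
        }
    }
    *first = -1;
    return 0;
}
```

Once built, a query touches only the index, never the 10K files, which is where the responsiveness win comes from.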

--
Jonas
Apr 24 '07 #15

ritesh wrote:
>
.... snip ...
>
I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)

It might help if I could do the same with my text files - but how
are these indices generated? Do you have a pointer to a good web
page that shows this?
Well, you obviously need at least an array of shorts to define the
files involved, together with an array of longs to identify the
position in the file. This will obviously expand the storage for
short words, such as 'I', 'is', etc. So you probably need a list
of words to ignore. How to handle mistypings is another
consideration.

I would suggest a hash table, based on the word itself, leading to
lists of arrays. Only the tables associated with one file will be
in play at any one moment.

See hashlib for one possible source for the hash table.

<http://cbfalconer.home.att.net/download/>
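For the word hash itself, something like the djb2 sketch below is a common choice (hash_word is an invented name; hashlib's own function may differ):

```c
#define NBUCKETS 1024

/* djb2-style string hash reduced to a bucket index: h = h*33 + c,
 * starting from 5381. */
unsigned long hash_word(const char *w)
{
    unsigned long h = 5381;
    while (*w != '\0')
        h = h * 33 + (unsigned char)*w++;
    return h % NBUCKETS;
}
```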

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>
cbfalconer at maineline.net

--
Posted via a free Usenet account from http://www.teranews.com

Apr 24 '07 #16
