473,385 Members | 1,872 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

searching data for a large set of substrings

C3
I have to process some data in C that is given to me as a char * array. I
have a fairly large number of substrings (well, they're not actually
printable, but let's treat them as strings) that I have to search for in the
data.

I need to keep a count of how often each of these substrings occurs in my
original data and then print it out at the end.

This is a fairly mundane task, but since I have so many substrings, it's a
pain having to #define them all. Can anybody suggest an efficient way of
doing this task?

BTW, I have to use C. If I could use Perl, I'd be in heaven.
cheers,
Nov 14 '05 #1
9 1781
C3 wrote:

I have to process some data in C
that is given to me as a char * array. I
have a fairly large number of substrings (well, they're not actually
printable, but let's treat them as strings)
that I have to search for in the data.

I need to keep a count of how often each of these
substrings occurs in my original data and
then print it out at the end.

This is a fairly mundane task,
but since I have so many substrings, it's a
pain having to #define them all.
Can anybody suggest an efficient way of
doing this task?

BTW, I have to use C. If I could use Perl, I'd be in heaven.

/* BEGIN new.c output */

The substring "string", occurs 4 times.
The substring "C", occurs 3 times.
The substring "kibo", occurs 0 times.
The substring "hen", occurs 1 times.
The substring "of", occurs 5 times.

/* END new.c output */

/* BEGIN new.c */

#include <stdio.h>
#include <string.h>

int main(void)
{
struct substring {
char *string;
long unsigned count;
} substring[] = {{"string"},{"C"},{"kibo"},{"hen"},{"of"}},
*str_ptr;
char *data[] = {
"I have to process some data in C that is given to me "
"as a char * array.\n I have a fairly large number of "
"substrings\n(well, they're not actually printable, "
"but let's treat them as strings)\nthat I have to search "
"for in the data.",
"I need to keep a count of how often each of these "
"substrings occurs in my original data and then print it "
"out at the end.",
"This is a fairly mundane task, but since I have so many "
"substrings, it's a pain having to #define them all.\n"
"Can anybody suggest an efficient way of doing this task?\n",
"BTW, I have to use C.\n"
"If I could use Perl, I'd be in heaven.\ncheers,"
};
size_t nstring, ndata;
char **data_ptr, *ptr;
long unsigned count;

nstring = sizeof substring / sizeof *substring;
for (str_ptr = substring; nstring-- != 0; ++str_ptr) {
count = 0;
data_ptr = data;
ndata = sizeof data / sizeof *data;
while (ndata-- != 0) {
ptr = *data_ptr;
ptr = strstr(*data_ptr, str_ptr -> string);
while (ptr != NULL) {
++count;
ptr = strstr(ptr + 1, str_ptr -> string);
}
++data_ptr;
}
str_ptr -> count = count;
}
puts("/* BEGIN new.c output */\n");
nstring = sizeof substring / sizeof *substring;
for (str_ptr = substring; nstring-- != 0; ++str_ptr) {
printf("The substring \"%s\", occurs %lu times.\n",
str_ptr -> string, str_ptr -> count);
}
puts("\n/* END new.c output */");
return 0;
}

/* END new.c */
--
pete
Nov 14 '05 #2
C3
This has the guts of what I'm after, but I need to give it a fairly long
list of substrings (byte sequences), which can't be done dynamically due to
the struct.

I have already read my substrings into char[] arrays, now I just need to do
the substring matching. Any help you could give me would be appreciated.

cheers,
Nov 14 '05 #3


C3 wrote:
This has the guts of what I'm after, but I need to give it a fairly long
list of substrings (byte sequences), which can't be done dynamically due to
the struct.

I have already read my substrings into char[] arrays, now I just need to do
the substring matching. Any help you could give me would be appreciated.


Write a new function that uses function strstr to do the matching. This
new function will have parameters that will point to the substr array,
a parameter that will describe the size of the array, and a pointer to
the data string that you wish to search. Depending on how you want
to store the results of the search, other parameters may be needed.

Here is an example of a function, PrintResults, that will simply print
the results of the search. (Case issue avoided for simplicity)

#include <stdio.h>
#include <string.h>

void PrintResults(char *substr[], size_t cnt,const char *string);

int main(void)
{
char *substr[] = {"george","bill","adam","a"};
char *teststr = "this is george and this is adam.\n"
"there are many georges and some \n"
"adams, but george is the best because\n"
"he is the mostest. george is the king\n"
"of the jungle";

printf("In the string:\n\n\"%s\"\n\nSearch found:\n",teststr);
PrintResults(substr,(sizeof substr)/(sizeof *substr),teststr);
return 0;
}

void PrintResults(char *substr[], size_t cnt,const char *string)
{
size_t i;
unsigned subcnt;
const char *s1, *s2;

if(!substr || !string || *string == '\0') return;
for(i = 0; i < cnt;i++)
{
subcnt = 0;
if(*substr[i] != '\0')
for(s1 = string;(s2 = strstr(s1,substr[i])); s1 = (s2+1))
subcnt++;
printf("%u occurrances of \"%s\"\n",subcnt,substr[i]);
}
return;
}
--
Al Bowers
Tampa, Fl USA
mailto: xa******@myrapidsys.com (remove the x to send email)
http://www.geocities.com/abowers822/

Nov 14 '05 #4
C3 wrote:

This has the guts of what I'm after, but I need to give it a
fairly long list of substrings (byte sequences), which can't be
done dynamically due to the struct.

I have already read my substrings into char[] arrays, now I just
need to do the substring matching. Any help you could give me
would be appreciated.


What is 'this'? What is 'the struct'? You can compare strings
with strcmp().

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Nov 14 '05 #5
C3 wrote:
This has the guts of what I'm after, but I need to give it a fairly long
list of substrings (byte sequences), which can't be done dynamically due to
the struct.

I have already read my substrings into char[] arrays, now I just need to do
the substring matching. Any help you could give me would be appreciated.

cheers,

Look at lex (which generates C code)
Nov 14 '05 #6
C3
Unfortunately, it appears this won't work for another reason. My substrings
may contain the null character. I have the length, so it shouldn't be a
problem.
Nov 14 '05 #7
Mac
On Mon, 04 Oct 2004 10:08:31 +1000, C3 wrote:
Unfortunately, it appears this won't work for another reason. My substrings
may contain the null character. I have the length, so it shouldn't be a
problem.


I was going to ask about that. I'm glad you figured it out yourself.

I don't know how big the data sets are or how many substrings you have,
but you could just go through the data one character at a time, and for
each byte call memcmp() once for each substring.

char *data, **substring;
long i, j, nsubstrings, *count, data_length, *substring_length;

/* allocate space and populate substrings and so on */
....

for (j=0; j<data_length; j++)
{
for(i=0; i<nsubstrings; i++)
if (memcmp(data[j], substring[i], substring_length[i]) == 0))
++count[i];
}

This is just a sketch of an idea to be taken for what it is worth.

In particular, it might be better to use size_t for some of the types.

--Mac

Nov 14 '05 #8
C3
Thanks for the suggestion. I guess there's no Perlesque q&d way to do what I
need.

cheers,
Nov 14 '05 #9
C3
Thank you. I've always believed this to be the case, for oh so very obvious
reasons. How did you know? :-)

cheers,
char *substr[] = {"george","bill","adam","a"};
char *teststr = "this is george and this is adam.\n"
"there are many georges and some \n"
"adams, but george is the best because\n"
"he is the mostest. george is the king\n"
"of the jungle";

Nov 14 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Michi | last post by:
I was wondering what the best solution is for making large numbers of TEXT (or BLOB?) fields searchable. For example, if I have a forum, what is the best way to be able to search for specific...
3
by: Will McGugan | last post by:
Hi, Is there a simple way of replacing a large number of substrings in a string? I was hoping that str.replace could take a dictionary and use it to replace the occurrences of the keys with the...
5
by: David Wender | last post by:
I want to create a dataview with a sort on multiple columns. However, when I use FindRows, I only want to search some of the columns, not all. Is this possible? I have not been able to make it...
8
by: girish | last post by:
Hi, I want to generate all non-empty substrings of a string of length >=2. Also, each substring is to be paired with 'string - substring' part and vice versa. Thus, gives me , , , , , ] etc....
1
by: Matik | last post by:
Hey, First, sorry if this post appear twice, because, I can not find my post recently send, trying to post it once again. I'm out of ideas, so, I thought, will search help here again :( I'm...
10
by: Tyler | last post by:
Hello All: After trying to find an open source alternative to Matlab (or IDL), I am currently getting acquainted with Python and, in particular SciPy, NumPy, and Matplotlib. While I await the...
8
by: theCancerus | last post by:
Hi All, I am not sure if this is the right place to ask this question but i am very sure you may have faced this problem, i have already found some post related to this but not the answer i am...
7
by: boss1 | last post by:
i have a text file like below: named "rez.text" and i wnt to transfer each column into oracle database which have the same column name as views here.Every time it will goto database atomatecally...
4
by: Costa | last post by:
I am looking for a c/c++ text search engine library that supports: - free text searching - not only beginning of words but substrings as well - wildcard searching - I want strings such as...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.