Bytes | Software Development & Data Engineering Community

Slow String operations...

I'm writing a search engine crawler for indexing local files in C#
My dataset is about 38000 XML files and as of now, I've successfully
parsed the file, and tokenized it.
But surprisingly, the string operations gradually become slower...
The system crunches 8200 files in the first 10 seconds, but manages
only 5000 in the next 10, then 3500 in the next 10, and it keeps
dropping gradually...
It takes about 75 seconds in total for 38000 files, whereas if the
system had kept the speed at which it started, it should
have taken under 50 seconds...
Why do the string operations become progressively slower?

This is my output...
Total files processed so far: 8201
Time taken so far (sec):10.001
Total files processed so far: 13106
Time taken so far (sec):20.002
Total files processed so far: 17661
Time taken so far (sec):30.001
Total files processed so far: 21926
Time taken so far (sec):40.002
Total files processed so far: 26489
Time taken so far (sec):50.018
Total files processed so far: 30703
Time taken so far (sec):60.002
Total files processed so far: 35479
Time taken so far (sec):70.017
Done - 37526 files found!
Time taken so far (sec):74.883
Any help appreciated...
Mugunth
Feb 8 '08 #1

Thank you for your answers...
The calls to Tokenize and StripPunctuations are the string operations.
In the first 10 seconds, they tokenize about 8200 files;
in the second 10 seconds, only about 5000;
and in the third 10 seconds, even fewer...
Nearly all the files are about the same size... but the algorithm gets
progressively slower over time...
This is my strip punctuation code

char[] punctuations = { '#', '!', '*', '-', '"', ',' };
int len = sbFileContents.Length;
for (int i = 0; i < len; i++)
{
    if (Array.IndexOf(punctuations, sbFileContents[i]) >= 0)
    {
        sbFileContents[i] = ' ';
    }
}

This is my tokenize code...

string[] delimiters = { " ", "?", ". " };
string[] strArray = fileContents.ToString()
    .Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

string[] returnArray = new string[strArray.Length];
int count = 0;

PorterStemmer ps = new PorterStemmer();
foreach (string str in strArray)
{
    string word = bStem ? ps.stemTerm(str) : str;

    if (!IsStopWord(word))
        returnArray[count++] = word;
}

// Trim the unused slots; otherwise the tail of the array stays null
// whenever stop words were removed.
Array.Resize(ref returnArray, count);
return returnArray;
Could it be that, as time progresses, the number of garbage
collections grows, and that overhead is what hampers my performance
over time?
Is there any way to set the size of the heap at program start?
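To check whether it's the GC, I was thinking of snapshotting the collection counts around the hot loop, something like this (just a sketch; the inner loop is a stand-in for my real tokenize/strip calls):

```csharp
using System;

class GcCheck
{
    static void Main()
    {
        // Snapshot collection counts before the hot loop.
        int gen0Before = GC.CollectionCount(0);
        int gen2Before = GC.CollectionCount(2);

        // Stand-in for the real per-file work (tokenize + strip).
        for (int i = 0; i < 100000; i++)
        {
            string s = new string('x', 256).Replace('x', ' ');
        }

        // Rising gen-0 counts just mean lots of short-lived garbage;
        // steadily rising gen-2 counts suggest long-lived allocations
        // (e.g. strings held by the index) are piling up.
        Console.WriteLine("Gen0 collections: " + (GC.CollectionCount(0) - gen0Before));
        Console.WriteLine("Gen2 collections: " + (GC.CollectionCount(2) - gen2Before));
    }
}
```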

Regards,
Mugunth
Feb 8 '08 #2

"Mugunth" <mu***********@gmail.com> wrote in message news:64**********************************@d21g2000prf.googlegroups.com...
I'm writing a search engine crawler for indexing local files in C#
My dataset is about 38000 XML files and as of now, I've successfully
parsed the file, and tokenized it.
But, it's surprising to find that, string operations gradually
becoming slower...
The system crunches 8200 files in the first 10 seconds, but is able to
do only 5000 in the next 10, and then 3500 in the next 10 and it
reduces gradually...
I suggest that you recheck your math.
From your output I get the following:

time  total  diff
10    8201   8201
20    13106  4905
30    17661  4555
40    21926  4265
50    26489  4563
60    30703  4214
70    35479  4776

Besides the first data point, it looks quite linear.
If you calculate the number of files processed in each 10-second interval, it ranges from ~4200 to ~4900 with no noticeable dropoff.

I am not sure why the first interval was so much faster, but this is not slowing to a crawl.




Feb 8 '08 #3
On Feb 8, 1:15 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

<snip>
Is it like, as time progresses, the number of Garbage collection calls
are higher and because of that overhead my performance is hampered
over time?
Possible, but I wouldn't expect that to be the problem.

Again though, if you could produce a *complete* program it would make
life a lot easier.
It doesn't need to look at different files - just going through the
same file thousands of times should demonstrate the issue given what
you've been saying.

Jon
Feb 8 '08 #4
Bill Butler's answer makes a good point.

Do the file sizes vary wildly, or are they approximately the same across the
sample? That could account for the differences. Also, look at your
process: if some files need more replacements than others, then the work
done in each 10 seconds is not properly measured by file count.
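For example, you could track the amount of text processed rather than the file count (a sketch with dummy in-memory data; in your program the counter would accumulate each file's actual length):

```csharp
using System;

class ThroughputCount
{
    static void Main()
    {
        // Two dummy "files" of very different sizes; in the real program
        // these would be the contents read from disk.
        string[] fakeFiles = { new string('a', 100), new string('b', 5000) };

        long charsProcessed = 0;
        foreach (string contents in fakeFiles)
        {
            charsProcessed += contents.Length; // the actual unit of work
        }

        // Per-interval character counts reveal whether throughput really
        // drops, even when file counts per interval look uneven.
        Console.WriteLine("Files: " + fakeFiles.Length);
        Console.WriteLine("Chars: " + charsProcessed);
    }
}
```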

Feb 8 '08 #5
On Feb 8, 4:06 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

<snip>
This is the complete source code..
That's very helpful.
But the dataset is huge... which I cannot upload...
Could you give us a single sample file to load 38000 times though?
Is each file big?

Jon
Feb 8 '08 #6
<DOC>
<DOCNO> ABC19981002.1830.0000 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> CAPTION </TXTTYPE>
<TEXT>
The troubling connections between the global economic crisis and
American
jobs. monica Lewinsky and Linda Tripp, private conversations made
public. Gene autry had died, the most famous singing cowboy of them
all. And the artist who sold only one painting in his lifetime and
is an icon today.
</TEXT>
</DOC>

this is one single file...
Feb 8 '08 #7
On Feb 8, 4:17 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

<snip>
Again, in the first second it can parse 3200 files...
but in the last 5 seconds (30-34) it could parse only 3500 files...
My data set is not that disparate...
I would strongly suggest that you modify your code to load a single
file thousands of times. That way you *know* whether the performance
is actually degrading or whether it's just different data.
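Something along these lines (a sketch; the Split call is a stand-in for your real Tokenize/StripPunctuations, and the document is built in memory rather than read from disk):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class SingleFileBenchmark
{
    static void Main()
    {
        // One representative document, reused every iteration, so any
        // slowdown must come from the code rather than from uneven data.
        string contents = string.Concat(Enumerable.Repeat("word ", 200));

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 1; i <= 20000; i++)
        {
            // Replace this with the real Tokenize/StripPunctuations calls.
            string[] tokens = contents.Split(new[] { ' ' },
                StringSplitOptions.RemoveEmptyEntries);

            // Report every 5000 iterations; if the gaps between reports
            // grow, the code really does degrade over time.
            if (i % 5000 == 0)
                Console.WriteLine(i + " iterations: " +
                    sw.Elapsed.TotalSeconds.ToString("F3") + " s");
        }
    }
}
```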

Jon
Feb 8 '08 #8
Try commenting out the different parts, one by one, to see which one
costs the most time. That should reveal the factor.
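Or wrap each phase in its own Stopwatch so the totals show which part dominates (a sketch; the stand-in work below marks where your real parse and tokenize calls would go):

```csharp
using System;
using System.Diagnostics;

class PhaseTimer
{
    static void Main()
    {
        Stopwatch parseTime = new Stopwatch();
        Stopwatch tokenizeTime = new Stopwatch();

        for (int i = 0; i < 1000; i++)
        {
            parseTime.Start();
            // ... the real parse step goes here (stand-in work below) ...
            string doc = new string('x', 1000);
            parseTime.Stop();

            tokenizeTime.Start();
            // ... the real tokenize step goes here (stand-in work below) ...
            string[] tokens = doc.Split('x');
            tokenizeTime.Stop();
        }

        // Whichever total dominates is the phase worth optimizing first.
        Console.WriteLine("Parse:    " + parseTime.ElapsedMilliseconds + " ms");
        Console.WriteLine("Tokenize: " + tokenizeTime.ElapsedMilliseconds + " ms");
    }
}
```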
Feb 21 '08 #9

This thread has been closed and replies have been disabled. Please start a new discussion.
