Slow String operations...

I'm writing a search engine crawler in C# for indexing local files.
My dataset is about 38000 XML files, and so far I've successfully
parsed and tokenized them.
But, surprisingly, the string operations are gradually becoming
slower...
The system crunches 8200 files in the first 10 seconds, but manages
only 5000 in the next 10, then 3500 in the next 10, and it keeps
dropping gradually...
It takes about 75 seconds in total for 38000 files, whereas if the
system had kept up the speed at which it started, it should have
taken under 50 seconds...
Why do string operations become progressively slower?

This is my output...
Total files processed so far: 8201
Time taken so far (sec):10.001
Total files processed so far: 13106
Time taken so far (sec):20.002
Total files processed so far: 17661
Time taken so far (sec):30.001
Total files processed so far: 21926
Time taken so far (sec):40.002
Total files processed so far: 26489
Time taken so far (sec):50.018
Total files processed so far: 30703
Time taken so far (sec):60.002
Total files processed so far: 35479
Time taken so far (sec):70.017
Done - 37526 files found!
Time taken so far (sec):74.883
Any help appreciated...
Mugunth
Feb 8 '08 #1

Thank you for your answers...
The calls to Tokenize and StripPunctuations are the string operations.
In the first 10 seconds, they usually tokenize about 8200 files.
In the second 10 seconds, they tokenize only 5000 files...
and in the third 10 seconds, even fewer...
Nearly all the files are the same size... but the algorithm gets
progressively slower with time...
This is my strip punctuation code:

char[] punctuations = { '#', '!', '*', '-', '"', ',' };
int len = sbFileContents.Length;
for (int i = 0; i < len; i++)
{
    if (sbFileContents[i].CompareTo(punctuations[0]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[1]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[2]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[3]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[4]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[5]) == 0)
    {
        sbFileContents[i] = ' ';
    }
}

This is my tokenize code...
string[] returnArray;
string[] delimiters = { " ", "?", ". " };
int count = 0;
string[] strArray = fileContents.ToString().Split(delimiters,
    StringSplitOptions.RemoveEmptyEntries);

// Sized for the worst case: slots past 'count' stay null
// whenever stop words are dropped.
returnArray = new string[strArray.Length];

PorterStemmer ps = new PorterStemmer();
foreach (String str in strArray)
{
    string word;
    if (bStem)
    {
        word = ps.stemTerm(str);
    }
    else
    {
        word = str;
    }

    if (!IsStopWord(word))
        returnArray[count++] = word;
}
return returnArray;
Could it be that, as time progresses, garbage collection runs more
often, and that overhead hampers my performance over time?
Is there any way to set the size of the heap at program start?

Regards,
Mugunth
Feb 8 '08 #2
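For comparison, here is a minimal, self-contained sketch of the two string operations described in the post above. It is not the poster's code: the class and method names are made up for illustration, the `keep` callback is a hypothetical stand-in for the `!IsStopWord(...)` check, and stemming is left out because `PorterStemmer` is not shown in the thread. It collects tokens in a `List<string>`, so the result has no unused slots.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TextCleaner
{
    // Same six characters as the punctuation array in the post.
    private static readonly HashSet<char> Punctuation =
        new HashSet<char> { '#', '!', '*', '-', '"', ',' };

    // Same delimiters as the Tokenize code in the post.
    private static readonly string[] Delimiters = { " ", "?", ". " };

    // Replaces every punctuation character with a space, in place.
    public static void StripPunctuation(StringBuilder text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (Punctuation.Contains(text[i]))
            {
                text[i] = ' ';
            }
        }
    }

    // Splits the text and keeps only the tokens accepted by 'keep'.
    // 'keep' is a hypothetical stand-in for the stop-word filter.
    public static List<string> Tokenize(string text, Func<string, bool> keep)
    {
        var tokens = new List<string>();
        foreach (string word in text.Split(Delimiters,
                     StringSplitOptions.RemoveEmptyEntries))
        {
            if (keep(word))
            {
                tokens.Add(word);
            }
        }
        return tokens;
    }
}
```

Whether this is any faster than the original is something only a measurement on the real data can show; the sketch is mainly meant to be easy to drop into a small test program.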

"Mugunth" <mu***********@gmail.com> wrote in message news:64**********************************@d21g2000prf.googlegroups.com...
I'm writing a search engine crawler for indexing local files in C#
My dataset is about 38000 XML files and as of now, I've successfully
parsed the file, and tokenized it.
But, it's surprising to find that, string operations gradually
becoming slower...
The system crunches 8200 files in the first 10 seconds, but is able to
do only 5000 in the next 10, and then 3500 in the next 10 and it
reduces gradually...
I suggest that you recheck your math.
From your output I get the following:

time   total    diff
  10    8201    8201
  20   13106    4905
  30   17661    4555
  40   21926    4265
  50   26489    4563
  60   30703    4214
  70   35479    4776

Apart from the first data point, it looks quite linear.
If you calculate the number of files processed in each 10 sec interval, it ranges from ~4200 to ~4900 with no noticeable dropoff.

I am not sure why the first interval was so much faster, but this is not slowing to a crawl.
Feb 8 '08 #3
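The per-interval figures in the reply above can be reproduced with a few lines of code. This is only a sketch of the arithmetic; the class and method names are invented for illustration.

```csharp
using System;

public static class Throughput
{
    // Converts cumulative "files processed so far" totals into the
    // number of files processed within each interval.
    public static int[] PerInterval(int[] cumulative)
    {
        var deltas = new int[cumulative.Length];
        int previous = 0;
        for (int i = 0; i < cumulative.Length; i++)
        {
            deltas[i] = cumulative[i] - previous;
            previous = cumulative[i];
        }
        return deltas;
    }
}
```

Calling `Throughput.PerInterval(new[] { 8201, 13106, 17661, 21926, 26489, 30703, 35479 })` with the totals from the original output gives 8201, 4905, 4555, 4265, 4563, 4214, 4776: one fast first interval followed by a roughly steady rate.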
On Feb 8, 1:15 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

<snip>
Is it like, as time progresses, the number of Garbage collection calls
are higher and because of that overhead my performance is hampered
over time?
Possible, but I wouldn't expect that to be the problem.

Again though, if you could produce a *complete* program it would make
life a lot easier.
It doesn't need to look at different files - just going through the
same file thousands of times should demonstrate the issue given what
you've been saying.

Jon
Feb 8 '08 #4
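A test along the lines suggested above could look roughly like this. It is a sketch under assumptions: the `work` delegate stands for whatever processes one file (nothing in the thread defines it), and the batch size is arbitrary. It runs the same work repeatedly, timing each batch with `Stopwatch` and counting generation-0 collections with `GC.CollectionCount(0)`, so a real slowdown or a rise in garbage collections would show up from batch to batch.

```csharp
using System;
using System.Diagnostics;

public static class RepeatBenchmark
{
    // Runs 'work' 'iterations' times in batches of 'batchSize' and prints,
    // per batch, the elapsed time and the number of gen-0 collections.
    // Returns the elapsed milliseconds of each batch.
    public static long[] Run(Action work, int iterations, int batchSize)
    {
        int batches = (iterations + batchSize - 1) / batchSize;
        var batchMillis = new long[batches];
        var timer = new Stopwatch();
        int done = 0;
        for (int b = 0; b < batches; b++)
        {
            int gen0Before = GC.CollectionCount(0);
            timer.Restart();
            for (int i = 0; i < batchSize && done < iterations; i++, done++)
            {
                work();
            }
            timer.Stop();
            batchMillis[b] = timer.ElapsedMilliseconds;
            Console.WriteLine("batch {0}: {1} ms, gen-0 collections: {2}",
                b, batchMillis[b], GC.CollectionCount(0) - gen0Before);
        }
        return batchMillis;
    }
}
```

For example, `RepeatBenchmark.Run(() => ProcessFile(sampleText), 38000, 5000)` would process one sample 38000 times, where `ProcessFile` and `sampleText` are placeholders for the poster's own code and data. If the batch times stay flat, the slowdown comes from the data rather than from the string operations.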
Bill Butler's answer makes a good point.

Do the file sizes vary wildly, or are they approximately the same across the
sample? That could account for the differences. Also, look at your
process. If some files need more replacements than others, then file count
is not a fair measure of the work done in each 10 seconds.

Feb 8 '08 #5
On Feb 8, 4:06 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

<snip>
This is the complete source code..
That's very helpful.
But dataset is huge.. which I cannot upload...
Could you give us a single sample file to load 38000 times though?
Is each file big?

Jon
Feb 8 '08 #6
<DOC>
<DOCNO> ABC19981002.1830.0000 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> CAPTION </TXTTYPE>
<TEXT>
The troubling connections between the global economic crisis and American
jobs. monica Lewinsky and Linda Tripp, private conversations made
public. Gene autry had died, the most famous singing cowboy of them
all. And the artist who sold only one painting in his lifetime and
is an icon today.
</TEXT>
</DOC>

this is one single file...
Feb 8 '08 #7
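The poster's parsing code is not shown in the thread. As a rough sketch only, and assuming every file is well-formed XML with a single `<TEXT>` element directly under the root as in the sample above, the text to tokenize could be pulled out with LINQ to XML. The class and method names here are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class DocReader
{
    // Returns the trimmed contents of the <TEXT> element directly under
    // the root, or an empty string when there is no such element.
    public static string ExtractText(string xml)
    {
        XElement root = XDocument.Parse(xml).Root;
        XElement text = root.Element("TEXT");
        return text == null ? string.Empty : text.Value.Trim();
    }
}
```

Loading one such file into a string once and passing the extracted text through the string operations many times keeps file I/O and parsing out of the timing.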
On Feb 8, 4:17 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

<snip>
Again, the first second it can parse 3200 files...
the last 5 seconds (30-34) it could parse only 3500 files...
My data set is not this disparate...
I would strongly suggest that you modify your code to load a single
file thousands of times. That way you *know* whether the performance
is actually degrading or whether it's just different data.

Jon
Feb 8 '08 #8
Just comment out, one at a time, the parts that are likely to cost the
most time. You might find the culprit.
Feb 21 '08 #9
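An alternative to commenting code out is to time each stage separately and compare the totals. The following is a sketch of that different technique, not something proposed in the thread; `StageTimer` and the stage names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public sealed class StageTimer
{
    private readonly Dictionary<string, Stopwatch> timers =
        new Dictionary<string, Stopwatch>();

    // Runs 'stage' and adds its elapsed time to the running total for 'name'.
    public void Measure(string name, Action stage)
    {
        Stopwatch timer;
        if (!timers.TryGetValue(name, out timer))
        {
            timer = new Stopwatch();
            timers[name] = timer;
        }
        timer.Start();
        try
        {
            stage();
        }
        finally
        {
            timer.Stop();
        }
    }

    // Total milliseconds accumulated for 'name', or 0 if it never ran.
    public long TotalMilliseconds(string name)
    {
        Stopwatch timer;
        return timers.TryGetValue(name, out timer) ? timer.ElapsedMilliseconds : 0;
    }

    // Prints the accumulated time of every stage measured so far.
    public void Report()
    {
        foreach (KeyValuePair<string, Stopwatch> entry in timers)
        {
            Console.WriteLine("{0}: {1} ms", entry.Key, entry.Value.ElapsedMilliseconds);
        }
    }
}
```

Wrapping each step of the per-file loop, for example `timer.Measure("strip", () => StripPunctuations(...))` with the poster's own methods in place of the placeholder call, and calling `Report()` every few thousand files would show which stage, if any, grows over time.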
