I'm writing a search engine crawler for indexing local files in C#.
My dataset is about 38,000 XML files, and so far I've successfully
parsed and tokenized each file.
But, surprisingly, the string operations gradually become slower...
The system crunches 8,200 files in the first 10 seconds, but manages
only 5,000 in the next 10, then 3,500 in the next 10, and it keeps
slowing down gradually...
It takes about 75 seconds in total for 38,000 files, whereas if the
system had kept the speed it started with, it should have taken
under 50 seconds...
Why do the string operations become progressively slower?
This is my output...
Total files processed so far: 8201
Time taken so far (sec):10.001
Total files processed so far: 13106
Time taken so far (sec):20.002
Total files processed so far: 17661
Time taken so far (sec):30.001
Total files processed so far: 21926
Time taken so far (sec):40.002
Total files processed so far: 26489
Time taken so far (sec):50.018
Total files processed so far: 30703
Time taken so far (sec):60.002
Total files processed so far: 35479
Time taken so far (sec):70.017
Done - 37526 files found!
Time taken so far (sec):74.883
Any help appreciated...
Mugunth
Thank you for your answers...
The calls to Tokenize and StripPunctuations are the string operations.
For the first 10 seconds, they tokenize about 8,200 files.
In the second 10 seconds they tokenize only 5,000 files...
and in the third 10 seconds, even fewer....
Nearly all the files are about the same size... but the algorithm gets
progressively slower over time...
This is my strip punctuation code:

char[] punctuations = { '#', '!', '*', '-', '"', ',' };
int len = sbFileContents.Length;
for (int i = 0; i < len; i++)
{
    if (sbFileContents[i].CompareTo(punctuations[0]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[1]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[2]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[3]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[4]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[5]) == 0)
    {
        sbFileContents[i] = ' ';
    }
}
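As a side note, the same replacement can be done with a single lookup-table test per character instead of up to six CompareTo calls. This is only a sketch (the class and method names here are illustrative, not the poster's actual code):

```csharp
using System;
using System.Text;

class StripDemo
{
    // Same punctuation set as the loop above; a flat lookup table
    // answers "is this punctuation?" in one array access.
    static readonly bool[] IsPunct = BuildTable(new[] { '#', '!', '*', '-', '"', ',' });

    static bool[] BuildTable(char[] chars)
    {
        var table = new bool[char.MaxValue + 1];
        foreach (char c in chars) table[c] = true;
        return table;
    }

    public static void StripPunctuation(StringBuilder sb)
    {
        for (int i = 0; i < sb.Length; i++)
            if (IsPunct[sb[i]]) sb[i] = ' ';
    }

    static void Main()
    {
        var sb = new StringBuilder("a#b,c!");
        StripPunctuation(sb);
        Console.WriteLine(sb); // prints "a b c " - punctuation replaced by spaces
    }
}
```

Either version is O(length of file), so neither explains a slowdown that grows with the number of files processed.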
This is my tokenize code...

string[] returnArray;
string[] delimiters = { " ", "?", ". " };
int count = 0;
string[] strArray = fileContents.ToString()
    .Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
returnArray = new string[strArray.Length];
PorterStemmer ps = new PorterStemmer();
foreach (String str in strArray)
{
    string word;
    if (bStem)
    {
        word = ps.stemTerm(str);
    }
    else
    {
        word = str;
    }
    if (!IsStopWord(word))
        returnArray[count++] = word;
}
return returnArray;
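One detail worth noting in the code above: the delimiter ". " is two characters, which is why the string[] overload of Split is needed rather than the char[] one. A self-contained sketch of the tokenize step (the stop-word set and class name here are made-up stand-ins, since IsStopWord and PorterStemmer were not posted):

```csharp
using System;
using System.Collections.Generic;

class TokenizeDemo
{
    // ". " is a two-character separator, so Split must take string[].
    static readonly string[] Delimiters = { " ", "?", ". " };

    // Hypothetical stand-in for the poster's IsStopWord helper.
    static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "and", "a" };

    public static string[] Tokenize(string fileContents)
    {
        string[] raw = fileContents.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
        var result = new List<string>(raw.Length);
        foreach (string word in raw)
            if (!StopWords.Contains(word))
                result.Add(word);   // stemming step omitted here
        return result.ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join("|", Tokenize("the quick fox. jumps")));
        // prints "quick|fox|jumps"
    }
}
```

Like the punctuation pass, this does a bounded amount of work per file, so by itself it should not get slower as more files are processed.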
Is it that, as time progresses, the number of garbage collection calls
gets higher, and that overhead hampers my performance over time?
Is there any way to set the size of the heap at program start?
Regards,
Mugunth
"Mugunth" <mu***********@gmail.com> wrote in message news:64**************************@d21g2000prf.googlegroups.com...
I'm writing a search engine crawler for indexing local files in C#.
My dataset is about 38,000 XML files, and so far I've successfully
parsed and tokenized each file.
But, surprisingly, the string operations gradually become slower...
The system crunches 8,200 files in the first 10 seconds, but manages
only 5,000 in the next 10, then 3,500 in the next 10, and it keeps
slowing down gradually...
I suggest that you recheck your math.
From your output I get the following:

time   total   diff
  10    8201   8201
  20   13106   4905
  30   17661   4555
  40   21926   4265
  50   26489   4563
  60   30703   4214
  70   35479   4776

Besides the first data point, it looks quite linear.
If you calculate the number of files processed in each 10-second
interval, it ranges from ~4200-4900 with no noticeable dropoff.
I am not sure why the first interval was so much faster, but this is
not slowing to a crawl.
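The per-interval differences in the table can be recomputed mechanically from the posted totals; a small sketch:

```csharp
using System;

class DiffCheck
{
    // Running totals from the posted output, one per 10-second mark.
    static readonly int[] Totals = { 8201, 13106, 17661, 21926, 26489, 30703, 35479 };

    public static int[] Diffs()
    {
        var diffs = new int[Totals.Length];
        int prev = 0;
        for (int i = 0; i < Totals.Length; i++)
        {
            diffs[i] = Totals[i] - prev;  // files processed in this interval
            prev = Totals[i];
        }
        return diffs;
    }

    static void Main()
    {
        int[] d = Diffs();
        for (int i = 0; i < d.Length; i++)
            Console.WriteLine($"{(i + 1) * 10,4} {Totals[i],7} {d[i],6}");
    }
}
```

The output matches the table: after the unusually fast first interval, throughput stays in a narrow band rather than steadily dropping.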
On Feb 8, 1:15 pm, Mugunth <mugunth.ku...@gmail.com> wrote:
<snip>
Is it that, as time progresses, the number of garbage collection calls
gets higher, and that overhead hampers my performance over time?
Possible, but I wouldn't expect that to be the problem.
Again though, if you could produce a *complete* program it would make
life a lot easier.
It doesn't need to look at different files - just going through the
same file thousands of times should demonstrate the issue given what
you've been saying.
Jon
Bill Butler's answer makes a good point.
Do the file sizes vary wildly, or are they approximately the same over
the sample size? This could account for differences. Also, look at your
process. If some files have more replacements than others, then the work
being done in each 10 seconds is not properly measured by file count.
On Feb 8, 4:06 pm, Mugunth <mugunth.ku...@gmail.com> wrote:
<snip>
This is the complete source code..
That's very helpful.
But the dataset is huge... which I cannot upload...
Could you give us a single sample file to load 38000 times though?
Is each file big?
Jon
<DOC>
<DOCNO> ABC19981002.1830.0000 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> CAPTION </TXTTYPE>
<TEXT>
The troubling connections between the global economic crisis and
American jobs. Monica Lewinsky and Linda Tripp, private conversations
made public. Gene Autry had died, the most famous singing cowboy of
them all. And the artist who sold only one painting in his lifetime
and is an icon today.
</TEXT>
</DOC>
this is one single file...
On Feb 8, 4:17 pm, Mugunth <mugunth.ku...@gmail.com> wrote:
<snip>
Again, in the first second it can parse 3,200 files...
in the last 5 seconds (30-34) it could parse only 3,500 files...
My data set is not that disparate...
I would strongly suggest that you modify your code to load a single
file thousands of times. That way you *know* whether the performance
is actually degrading or whether it's just different data.
Jon
Just comment out the parts that cost the most time, one by one. You
might find the factor.