string dictionary and memory issue.

Hi,

I am currently building a lexical analysis component to pull keywords
out of content.
I have a functional first build, but I am having problems
since I am easily loading over 300,000 strings into memory.

During the actual analysis I can reach up to 400 MB of RAM
usage.

My dictionary is currently a tree of nodes, each containing a
hashtable of child nodes. Each node represents one letter of a string
and carries a flag marking the end of a string.
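
A minimal sketch of the structure described above, written in Python for brevity rather than .NET. The names TrieNode, add_word and contains are hypothetical, and a Python dict stands in for the per-node hashtable; the original implementation is not shown in this thread.

```python
class TrieNode:
    """One letter position in the tree; children are keyed by the next letter."""

    def __init__(self):
        self.children = {}    # letter -> TrieNode (stands in for the hashtable)
        self.is_end = False   # flag: a stored string ends at this node


def add_word(root, word):
    """Walk the tree letter by letter, creating nodes that do not exist yet."""
    node = root
    for letter in word:
        child = node.children.get(letter)
        if child is None:
            child = TrieNode()
            node.children[letter] = child
        node = child
    node.is_end = True


def contains(root, word):
    """Return True only if the exact word was added (prefixes do not count)."""
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return False
    return node.is_end


root = TrieNode()
for w in ("car", "cart", "cat"):
    add_word(root, w)
```

Every node carries its own hashtable here, even a leaf with no children, which is one likely source of the memory cost when hundreds of thousands of strings are loaded.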

I chose this structure for speed. Lookups in the tree and loading are
both remarkably fast, but the memory cost is greater than I had
expected.

Could anyone suggest a solution to this?
Should I store my node data as compressed data?
Best Regards,
Alexandre Brisebois

Mar 10 '06 #1
Alexandre Brisebois (www.pointnetsolutions.com) wrote:
I am currently building a lexical analysis component to pull keywords
out of content.
I have a functional first build, but I am having problems
since I am easily loading over 300,000 strings into memory.

During the actual analysis I can reach up to 400 MB of RAM
usage.

My dictionary is currently a tree of nodes, each containing a
hashtable of child nodes. Each node represents one letter of a string
and carries a flag marking the end of a string.


How many entries are in each node? I wouldn't expect there to be that
many, particularly if you're only dealing with ASCII characters. I'd
suggest using a plain array, either as a list of entries (for nodes
without many sub-entries) or an array with nulls in (for those where
traversing a list would be expensive). You could use ArrayList/List<T>
instead of the straight arrays, but call TrimToSize (TrimExcess on
List<T>) on all of them when you've finished loading the list of words
to avoid wastage.
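
A minimal sketch of the array-based node layout suggested above, in Python for brevity rather than C#. Each node keeps its child letters in a sorted string and its children in a tuple of exactly the needed length, so there is no hashtable per node and no spare capacity left to trim. The names CompactNode, compact_add and compact_contains are hypothetical.

```python
from bisect import bisect_left


class CompactNode:
    # __slots__ avoids a per-instance attribute dictionary.
    __slots__ = ("keys", "children", "is_end")

    def __init__(self):
        self.keys = ""        # sorted letters that have a child node
        self.children = ()    # child nodes, in the same order as keys
        self.is_end = False   # a stored word ends at this node


def compact_add(root, word):
    node = root
    for letter in word:
        i = bisect_left(node.keys, letter)
        if i == len(node.keys) or node.keys[i] != letter:
            # Insert the new child at its sorted position; both sequences
            # are rebuilt at exactly the required size.
            child = CompactNode()
            node.keys = node.keys[:i] + letter + node.keys[i:]
            node.children = node.children[:i] + (child,) + node.children[i:]
        node = node.children[i]
    node.is_end = True


def compact_contains(root, word):
    node = root
    for letter in word:
        i = bisect_left(node.keys, letter)
        if i == len(node.keys) or node.keys[i] != letter:
            return False
        node = node.children[i]
    return node.is_end


compact_root = CompactNode()
for w in ("car", "cart", "cat", "dog"):
    compact_add(compact_root, w)
```

Lookup uses a binary search over the node's letters. For nodes with only a handful of children a linear scan would do just as well, and for very dense nodes an array indexed directly by character (with nulls for missing letters) trades memory for speed.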

Jon

Mar 10 '06 #2
Well, each node has between 0 and the number of possible characters
as children.

I tried loading about 117,000 words into a straightforward hashtable.
A 4 MB XML file became 25 MB in memory...

But I still find that extremely large. I am trying to find a way to use
as little RAM as possible
without completely hurting my speed.

I will try the ArrayList and generic List<T> collections.
Best Regards,
Alexandre Brisebois

If you have any other ideas please do let me know.

Mar 10 '06 #3
Alexandre Brisebois (www.pointnetsolutions.com) wrote:
Well, each node has between 0 and the number of possible characters
as children.

I tried loading about 117,000 words into a straightforward hashtable.
A 4 MB XML file became 25 MB in memory...
Is that the total memory of the process? If so, that's not particularly
surprising. The framework creates a fair amount of overhead (try
loading a virtually empty file to see what I mean) and the strings will
all be converted to Unicode, which will double the size of an ASCII
file.
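
A small sketch of the doubling mentioned above, in Python. It compares the same words encoded as ASCII (one byte per character) and as UTF-16 (two bytes per character for ASCII-range text). The sample words are made up for illustration, and the figures cover character data only; per-string object overhead and hashtable buckets come on top and are not modelled here.

```python
def utf16_size(text):
    """Bytes needed for the characters of text in UTF-16, without a BOM."""
    return len(text.encode("utf-16-le"))


# Hypothetical sample words, ASCII only.
words = ["lexical", "analysis", "keyword", "dictionary"]

ascii_bytes = sum(len(w.encode("ascii")) for w in words)
utf16_bytes = sum(utf16_size(w) for w in words)

print(ascii_bytes, utf16_bytes)  # 32 64
```

By the same ratio, 4 MB of pure ASCII text would need about 8 MB for its characters alone once held as two-byte characters.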
But I still find that extremely large. I am trying to find a way to use
as little RAM as possible without completely hurting my speed.


Do you actually have a concrete requirement with respect to memory? On
most modern systems, 25MB is very little. One can waste a lot of time
going for the "best possible" performance instead of performance which
is "good enough".

Jon

Mar 10 '06 #4
Well, I want to try to run this on a shared hosting machine.

So I do want to reduce the amount of memory used.
Though the more I look into this, the more I understand that I will
have to build a distributed system.

I'm still not sure how I'm going to structure it all.

So far my dictionary files contain roughly 200,000 words and are
growing. I also built a thesaurus ADT, so we are talking about another
easy 200,000 words (interlinked by references, not copies),

and the system continuously identifies unknown words which need to be
reviewed.

So all this takes up a lot of resources once it is loaded.

I have never done lexical analysis before, so this is all new to
me and I have not found the best strategy so far,
so I keep looking and trying different things.

Using a database would make everything much simpler,
but I don't have that luxury.

That is the main reason I am looking for a strategy to build
either partial or full thesaurus and dictionary ADTs.
I am currently looking at the partial loading option, where I would
load into a graph only the words needed
for a particular analysis and unload it afterwards.
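
A rough sketch of the partial loading idea, in Python. It assumes, purely for illustration, a dictionary file with one word per line, which is not the XML format mentioned earlier in the thread; load_needed is a hypothetical helper. The file is streamed line by line and only the words that actually occur in the content being analysed are kept in memory.

```python
import io


def load_needed(dictionary_lines, tokens):
    """Return the dictionary words that also occur among the content tokens."""
    wanted = {t.lower() for t in tokens}
    found = set()
    for line in dictionary_lines:      # streamed, never fully loaded
        word = line.strip().lower()
        if word in wanted:
            found.add(word)
    return found


# An in-memory stand-in for a one-word-per-line dictionary file.
dictionary_file = io.StringIO("apple\nbanana\ncherry\n")
tokens = "I ate a Banana and an apple".split()

known = load_needed(dictionary_file, tokens)
unknown = {t.lower() for t in tokens} - known   # candidates for review
```

The working set is then bounded by the size of the content rather than the size of the dictionary, at the price of one pass over the file per analysis.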

I am also looking at maybe merging the thesaurus with the dictionary,
but I am trying to find out whether anything would prevent me from
doing so.

Best Regards,
Alexandre Brisebois

Mar 10 '06 #5
