memory optimization

Evangelista Sami

hello all

i am implementing an application in which a large number of (char *)
is stored into a hash table structure.
As there may be collisions, each (char *) inserted into my hash table
is put into an array. here is the definition of my hash table :

typedef struct
{
short size;
char **elements;
} hash_cell;
hash_cell hash_table[100000];

with this approach, each time i want to store an additional (char *) i
reallocate the array elements and add it an additional element.
this works fine but i have an overhead of a (char *) per element.

Another idea is to use a single (char *) by cell in the hash table and
to encode the size of each (char *) in this one with for instance 16
bits. the definition of this hash table should be :

typedef struct
{
short size;
char *elements;
} hash_cell;
hash_cell hash_table[100000];

For example if in the cell 0 i have three elements "toto", "abc", and
"peter",
i will have :
hash_table[0].elements[0 .. 1] = 4
hash_table[0].elements[2 .. 5] = "toto"
hash_table[0].elements[6 .. 7] = 3
hash_table[0].elements[8 .. 10] = "abc"
hash_table[0].elements[11 .. 12] = 3
hash_table[0].elements[13 .. 17] = "peter"

this makes me save 4 - 2 = 2 bytes if i assume that the size of (char
*) is 4 bytes and that the length of each element is encoded with 2
bytes.

However when i make the "ps aux" command during the execution, it
appears that the whole memory consumed by the first method is less
than with the second method and i cant see why.

Any explanation would be very useful

Sami Evangelista

Nov 14 '05 #1

Subscribe Reply

2347

Eric Sosman

Evangelista Sami wrote:

hello all

i am implementing an application in which a large number of (char *)
is stored into a hash table structure.
As there may be collisions, each (char *) inserted into my hash table
is put into an array. here is the definition of my hash table :

typedef struct
{
short size;
char **elements;
} hash_cell;
hash_cell hash_table[100000];

with this approach, each time i want to store an additional (char *) i
reallocate the array elements and add it an additional element.
this works fine but i have an overhead of a (char *) per element.
You haven't said so, but I'll assume that each `char*' points
to the start of a '\0'-terminated C string and not to a completely
arbitrary chunk of data.
Another idea is to use a single (char *) by cell in the hash table and
to encode the size of each (char *) in this one with for instance 16
bits.
The "size of each (char *)" is constant for any given
implementation. I'll assume that what you actually want to
do is encode the lengths of the C strings -- but using two
bytes to encode the information already provided by the one-
byte '\0' terminator seems a poor trade-off ...
the definition of this hash table should be :

typedef struct
{
short size;
char *elements;
} hash_cell;
hash_cell hash_table[100000];

For example if in the cell 0 i have three elements "toto", "abc", and
"peter",
i will have :
hash_table[0].elements[0 .. 1] = 4
hash_table[0].elements[2 .. 5] = "toto"
hash_table[0].elements[6 .. 7] = 3
hash_table[0].elements[8 .. 10] = "abc"
hash_table[0].elements[11 .. 12] = 3
hash_table[0].elements[13 .. 17] = "peter"

this makes me save 4 - 2 = 2 bytes if i assume that the size of (char
*) is 4 bytes and that the length of each element is encoded with 2
bytes.
Well, let's see. For each string you've discarded one '\0'
byte and added two length bytes, then discarded one `char*'. If
`sizeof(char*)' is four, the net savings is three bytes per string.
However when i make the "ps aux" command during the execution, it
appears that the whole memory consumed by the first method is less
than with the second method and i cant see why.
Some of what follows is speculation: The C language Standard
says nothing about how realloc() and friends are implemented, and
different versions behave differently. Still, a few things stand
out:

In the first version the three strings "toto", "abc", and
"peter" are stored somewhere in memory, along with their trailing
'\0' bytes -- that's fifteen bytes, if I've counted correctly.
To this you add three `char*' pointers (assumed to be four bytes
each, but that's not guaranteed), plus some unknown amount of
memory-management overhead for the `hash_table[n].elements'
array. If the overhead is eight bytes (again, not guaranteed),
the total is thirty-five bytes.

In the second version you *still* have the three strings
"toto", "abc", and "peter" lying around in memory someplace,
and you've made copies of them so you can make the copies
adjacent to one another and interleave them with the lengths.
So: fifteen bytes of original data, twelve more for the copies
(omitting the '\0' terminators), six more for the lengths, and
a presumed eight-byte overhead for the dynamic array. That's
forty-one bytes, six more than the first method used.

You have saved three bytes per string at the cost of making
a duplicate copy of the string's data. If the string is longer
than three bytes, this is a poor trade. (If the strings are
all three bytes long or less, you could use a much more efficient
data structure than a hash table!)
Any explanation would be very useful

Another effect to consider is the operation of realloc().
The first method reallocates small arrays of `char*', while
the second reallocates larger (presumably) arrays of string
data interleaved with lengths. Now, how do you suppose the
realloc() function works its magic? Details differ, but in
typical implementations realloc() will first try to see if
there's enough unused memory immediately following the array
you're resizing, and if so it will simply "claim" that memory
and return the same pointer it was given. If not, it will do
the equivalent of a malloc() of the new size, then copy the
pre-existing data from the old region to the new, and finally
free() the old region and return a pointer to the new. (Again
I emphasize that not all implementations behave this way, but
this is a typical strategy.)

Now, what happens to the old region after it's released?
It becomes available for future allocations -- but note that
your `char*' arrays or concatenated strings are always growing,
never shrinking. The released areas are likely to be too small
to satisfy future allocations, unless you get lucky and happen
to release an abutting region as well. The next malloc() or
realloc() *might* be able to re-use one of these "fragments, "
but it's more likely to need to claim some brand-new memory
because the fragments tend to be too small. And in your second
method the amount of data you're reallocating is probably larger,
so the useless fragments are also larger, the brand-new claimed
areas are larger, and the total memory used is larger.

As I say, lots of speculation. Your mileage may vary.

--
Er*********@sun .com

Nov 14 '05 #2

Evangelista Sami

Eric Sosman <Er*********@su n.com> wrote in message news:<40******* *******@sun.com >...

Evangelista Sami wrote:
hello all

i am implementing an application in which a large number of (char *)
is stored into a hash table structure.
As there may be collisions, each (char *) inserted into my hash table
is put into an array. here is the definition of my hash table :

typedef struct
{
short size;
char **elements;
} hash_cell;
hash_cell hash_table[100000];

with this approach, each time i want to store an additional (char *) i
reallocate the array elements and add it an additional element.
this works fine but i have an overhead of a (char *) per element.

You haven't said so, but I'll assume that each `char*' points
to the start of a '\0'-terminated C string and not to a completely
arbitrary chunk of data.
Another idea is to use a single (char *) by cell in the hash table and
to encode the size of each (char *) in this one with for instance 16
bits.

The "size of each (char *)" is constant for any given
implementation. I'll assume that what you actually want to
do is encode the lengths of the C strings -- but using two
bytes to encode the information already provided by the one-
byte '\0' terminator seems a poor trade-off ...

i may have '\0' in my strings so i can't use to indicate the end of the string

the definition of this hash table should be :

typedef struct
{
short size;
char *elements;
} hash_cell;
hash_cell hash_table[100000];

For example if in the cell 0 i have three elements "toto", "abc", and
"peter",
i will have :
hash_table[0].elements[0 .. 1] = 4
hash_table[0].elements[2 .. 5] = "toto"
hash_table[0].elements[6 .. 7] = 3
hash_table[0].elements[8 .. 10] = "abc"
hash_table[0].elements[11 .. 12] = 3
hash_table[0].elements[13 .. 17] = "peter"

this makes me save 4 - 2 = 2 bytes if i assume that the size of (char
*) is 4 bytes and that the length of each element is encoded with 2
bytes.

Well, let's see. For each string you've discarded one '\0'
byte and added two length bytes, then discarded one `char*'. If
`sizeof(char*)' is four, the net savings is three bytes per string.
However when i make the "ps aux" command during the execution, it
appears that the whole memory consumed by the first method is less
than with the second method and i cant see why.

Some of what follows is speculation: The C language Standard
says nothing about how realloc() and friends are implemented, and
different versions behave differently. Still, a few things stand
out:

In the first version the three strings "toto", "abc", and
"peter" are stored somewhere in memory, along with their trailing
'\0' bytes -- that's fifteen bytes, if I've counted correctly.
To this you add three `char*' pointers (assumed to be four bytes
each, but that's not guaranteed), plus some unknown amount of
memory-management overhead for the `hash_table[n].elements'
array. If the overhead is eight bytes (again, not guaranteed),
the total is thirty-five bytes.

In the second version you *still* have the three strings
"toto", "abc", and "peter" lying around in memory someplace,
and you've made copies of them so you can make the copies
adjacent to one another and interleave them with the lengths.
So: fifteen bytes of original data, twelve more for the copies
(omitting the '\0' terminators), six more for the lengths, and
a presumed eight-byte overhead for the dynamic array. That's
forty-one bytes, six more than the first method used.

You have saved three bytes per string at the cost of making
a duplicate copy of the string's data. If the string is longer
than three bytes, this is a poor trade. (If the strings are
all three bytes long or less, you could use a much more efficient
data structure than a hash table!)

what structure for example ?

Any explanation would be very useful

Another effect to consider is the operation of realloc().
The first method reallocates small arrays of `char*', while
the second reallocates larger (presumably) arrays of string
data interleaved with lengths. Now, how do you suppose the
realloc() function works its magic? Details differ, but in
typical implementations realloc() will first try to see if
there's enough unused memory immediately following the array
you're resizing, and if so it will simply "claim" that memory
and return the same pointer it was given. If not, it will do
the equivalent of a malloc() of the new size, then copy the
pre-existing data from the old region to the new, and finally
free() the old region and return a pointer to the new. (Again
I emphasize that not all implementations behave this way, but
this is a typical strategy.)

Now, what happens to the old region after it's released?
It becomes available for future allocations -- but note that
your `char*' arrays or concatenated strings are always growing,
never shrinking. The released areas are likely to be too small
to satisfy future allocations, unless you get lucky and happen
to release an abutting region as well. The next malloc() or
realloc() *might* be able to re-use one of these "fragments, "
but it's more likely to need to claim some brand-new memory
because the fragments tend to be too small. And in your second
method the amount of data you're reallocating is probably larger,
so the useless fragments are also larger, the brand-new claimed
areas are larger, and the total memory used is larger.

As I say, lots of speculation. Your mileage may vary.

Nov 14 '05 #3

David Resnick

ev******@cnam.f r (Evangelista Sami) wrote in message news:<5f******* *************** ****@posting.go ogle.com>...

hello all

i am implementing an application in which a large number of (char *)
is stored into a hash table structure.
As there may be collisions, each (char *) inserted into my hash table
is put into an array. here is the definition of my hash table :

typedef struct
{
short size;
char **elements;
} hash_cell;
hash_cell hash_table[100000];

with this approach, each time i want to store an additional (char *) i
reallocate the array elements and add it an additional element.
this works fine but i have an overhead of a (char *) per element.

Another idea is to use a single (char *) by cell in the hash table and
to encode the size of each (char *) in this one with for instance 16
bits. the definition of this hash table should be :

typedef struct
{
short size;
char *elements;
} hash_cell;
hash_cell hash_table[100000];

For example if in the cell 0 i have three elements "toto", "abc", and
"peter",
i will have :
hash_table[0].elements[0 .. 1] = 4
hash_table[0].elements[2 .. 5] = "toto"
hash_table[0].elements[6 .. 7] = 3
hash_table[0].elements[8 .. 10] = "abc"
hash_table[0].elements[11 .. 12] = 3
hash_table[0].elements[13 .. 17] = "peter"

this makes me save 4 - 2 = 2 bytes if i assume that the size of (char
*) is 4 bytes and that the length of each element is encoded with 2
bytes.

However when i make the "ps aux" command during the execution, it
appears that the whole memory consumed by the first method is less
than with the second method and i cant see why.

Any explanation would be very useful

Sami Evangelista

I don't have an explaination for you about the memory usage you see
reported by ps.

You could probably save some more space by putting your short into
your string as the first two bytes. On many (32 bit at least)
systems, your short will followed by two padding bytes in the struct
that you could save here. This would save two bytes for occupied
cells and 4 bytes in unoccupied cells (where you would have a NULL
in the char* to implicitly say there are 0 elements). You would
need to do a sizeof (struct hash_cell) to see if this will help.
You could squeeze a bit more space out of the table if you are willing
to accept the limitation that the strings stored are <= 256 bytes in
length and use only one byte there. Or that there will be no more
than 256 strings in a single cell (if there are, your hash table
is presumably either way too small or your hash algorithm is not
doing it's job, assuming that duplicate entries are not being
added).

-David

Nov 14 '05 #4

Eric Sosman

Evangelista Sami wrote:

Eric Sosman <Er*********@su n.com> wrote in message news:<40******* *******@sun.com >...
Evangelista Sami wrote:
hello all

i am implementing an application in which a large number of (char *)
is stored into a hash table structure.
As there may be collisions, each (char *) inserted into my hash table
is put into an array. here is the definition of my hash table :

typedef struct
{
short size;
char **elements;
} hash_cell;
hash_cell hash_table[100000];

with this approach, each time i want to store an additional (char *) i
reallocate the array elements and add it an additional element.
this works fine but i have an overhead of a (char *) per element.

You haven't said so, but I'll assume that each `char*' points
to the start of a '\0'-terminated C string and not to a completely
arbitrary chunk of data.
[...]

i may have '\0' in my strings so i can't use to indicate the end of the string

(Let's avoid the term "string" for such an array: "string"
has a specific meaning in C, and this isn't it.)

If you don't use a terminator, you must have some other way
to determine the array length. Perhaps there's some way to
calculate it from the array contents (for example, consider the
way an LZW-compressed sequence denotes its endpoint), but more
likely there's another piece of "overhead" you failed to mention:
You're storing the array lengths somewhere.

You have saved three bytes per string at the cost of making
a duplicate copy of the string's data. If the string is longer
than three bytes, this is a poor trade. (If the strings are
all three bytes long or less, you could use a much more efficient
data structure than a hash table!)

what structure for example ?

How about a bit-map for single-byte "arrays," another for
two-byte arrays, and another for three-byte arrays? Assuming
an eight-bit byte, this requires 256 + 65536 + 16777216 bits,
or 32 + 8192 + 2097152 bytes. No space required for pointers
or lengths or copies of the array contents -- in fact, you can
discard the "original" arrays after entering them in the bit
map, for even more savings.

Depending on the distribution of the data, it might be
advantageous to use a trie instead of the largest bit-map, or
even instead of all three.

... but comp.lang.c is not a "Data Structures 101" course.

--
Er*********@sun .com

Nov 14 '05 #5

Similar topics

4707

c++ memory optimization

by: BCC | last post by:

Hi, I have a c++ program that creates hundreds of thousands of objects. Eventually, we would like to be able to create millions, but memory is limiting us. For example, 125,000 objects currently takes up 286 megs of memory. This leads to a couple questions... First, do accessor functions take up space? Are there certain tricks that I

C / C++

2311

Memory management strategy

by: ira2402 | last post by:

Hi All, We are developing sw for a small embedded OS and we have limited memory. We are looking for algorithms, links, and articles about this. The goal is efficient utilization of small amount of memory - means - allocation for fixed length blocks / variable length blocks. Thanks.

C / C++

2431

Memory Performance Optimization ???

by: Peter Olcott | last post by:

What can anyone: (1) Tell Me about (2) Provide Links to (3) Recommend Books Pertaining to Memory Performance Optimization? The main thing that I am looking for is rules-of-thumb that maximize the execution speed of time critical applications.

C# / C Sharp

1767

Memory problem in asp.net calling unmanaged DLL

by: monika.saxena | last post by:

Hi all, In one of my projects which is a web based application in asp.net, a third party tool - "Frontline Solver DLL" (It is an unmanaged DLL and ..NET is calling it using the PInvoke) is used. This DLL is used for solving and optimizing some non linear problems. Problem : The problem I am facing is as the solver runs and does the

ASP.NET

1749

Possible memory leak?

by: Tuvas | last post by:

I have a function in a program that works something like this. def load_pic_data(width,heigth,inpdat, filt=TRUE): data='' total=0 tnum=0 size=100 for y in range(0,heigth): row='' for x in range(0,width):

Python

3080

Memory deallocation does not work.

by: christophe.chazeau | last post by:

Hi, I have a problem with a really simple chunk of code which should work but does obviously does not. This chunk of code is just a POC aimed at finding a bug in a larger project in which the same problem seems to occur. Here the deal : when I run this piece of code, I expect all the memory allocated by the "Test" object to be freed but what I observe is that after the second sleep (after all the additions to the vector), the memory...

C / C++

2404

mysql_real_escape_string memory leak?

by: vivek | last post by:

Could someone please help me figure out why the memory usage fluctuates when I use mysql_real_escape_string? I'm finding (what I think are) memory leaks with a few mysql functions in php and I'm trying to figure them all out. This one is pretty vexing. Thanks in advance. Here is example code: class memTest { function __construct() { $con = mysql_connect("***************","*******","****");

PHP

1645

which one requires less memory?

by: khan | last post by:

Since we did not want to make an error in counting of bytes, we used the code char *p; ... p = malloc(strlen("hello")+1); strcpy(p,"hello"); instead of the intended

C / C++

2313

Storing file information in memory

by: deciacco | last post by:

I'm writing a command line utility to move some files. I'm dealing with thousands of files and I was wondering if anyone had any suggestions. This is what I have currently: $arrayVirtualFile = array( 'filename'=>'filename', 'basename'=>'filename.ext', 'extension'=>'ext', 'size'=>0,

PHP

3891

volatile malloc memory?

by: Dan | last post by:

What is the correct way to allocate memory that is defined as volatile so that all standard compilers will not optomise out writes to the memory that are never read again? I have some encryption programs that, depending on the implementation/compiler, part of the code that clears memory at the end of the program gets optomised away.

C / C++

9672

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9519

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

10437

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10214

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10001

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

7538

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

4113

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

3723

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

2920

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General