473,808 Members | 2,745 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Hashing to Disk

Hi,

Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.

Thanks,
James

Dec 1 '05 #1
8 4233
"Foodbank" <v8********@yah oo.com> writes:
Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.


It's not at all clear what you mean by this. Are you talking about
some kind of hash table? If so, please be more specific.

Note that C has no built-in support for hash tables. I suspect you
have more of an algorithm question than a C question. If you can show
us some code, we might be able to help you.

Note also that none of lseek, read, and write exist in standard C
(though they're similar to fseek, fread, and fwrite, which do). "fd"
is probably an abbreviation of "file descriptor", which also does not
exist in standard C (the closest equivalent is the type FILE*). You
might be looking for a Unix-specific newsgroup like
comp.unix.progr ammer.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Dec 1 '05 #2
Keith Thompson wrote
(in article <ln************ @nuthaus.mib.or g>):
"Foodbank" <v8********@yah oo.com> writes:
Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.


It's not at all clear what you mean by this. Are you talking about
some kind of hash table? If so, please be more specific.


It sounds like he might be trying to save a hash table in memory
out to disk for a subsequent execution, but just guessing here.
If so, insert standard warnings about byte order and portability
for objects larger than bytes.

--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw

Dec 1 '05 #3
Yes, this is an algorithm based question. Basically, I'm trying to
save the hash table to file instead of dynamically in memory. So, for
example, I could access the hash table residing in the file at a later
date. I hope that helps.

Thanks,
James

Dec 1 '05 #4
"Foodbank" <v8********@yah oo.com> writes:
Yes, this is an algorithm based question. Basically, I'm trying to
save the hash table to file instead of dynamically in memory. So, for
example, I could access the hash table residing in the file at a later
date. I hope that helps.


It doesn't help much.

First, read this: <http://cfaj.freeshell. org/google/>.

There are a number of ways to represent a hash table. If the
in-memory representation uses pointers, it's going to be difficult to
represent them meaningfully in an external file; probably you can
translate them to some sort of numeric index. If you use binary
files, they won't be readable on different platforms (varying byte
order, varying sizes of fundamental data types, etc.) unless you go to
some extra effort to canonicalize everything. You might consider a
text-based representation.

If you can show us the C code that implements the hash table,
particularly any type declarations, we can probably help you come up
with a disk-based representation.

If your question isn't specific to C, you might try comp.programmin g.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Dec 1 '05 #5
Hi, here is my in-memory hash. Also, I've added the sections to create
and close the file.
As a note, I've removed some of the sections that are irrelevent to the
sections I need help with to save space and for ease of reading.

Thanks,
James

Expand|Select|Wrap|Line Numbers
  1. #include <stdio.h>
  2. #include <string.h>
  3. #include <sys/fcntl.h>
  4. #define HASHTSIZE  20011    // A prime number
  5. #define MAXWORDLEN 30     //maximum word length from text file
  6.  
  7. struct hash {
  8. int     count;
  9. char    word[MAXWORDLEN];
  10. } hash;              // the single reusable hash struct
  11.  
  12. int NUMCOLLISIONS;        //# of collisions
  13. int first_hash(char *);          //first hash function
  14. int second_hash(char *);     //second hash function
  15. int main(int, char **);
  16. int get_word(char *);
  17.  
  18. char word_buff[100];
  19.  
  20. int main(int argc, char *argv[]) {
  21. int h, inc, fd, n;
  22. unlink("hash_file");          // remove the old file if present
  23. fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666);
  24. while(get_word(word_buff) > 0) {
  25. inc = second_hash(word_buff);    // second hash
  26. h = first_hash(word_buff);     // first hash
  27.  
  28. #ifdef     PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
  29. STRATEGY
  30. while(strcmp(word_buff, hashtable[h].word)) {
  31. if(hashtable[h].count == 0) {
  32. strcpy(hashtable[h].word, word_buff);
  33. break;
  34. }
  35. NUMCOLLISIONS++; // wrong word is a collision
  36. }
  37. hashtable[h].count++;      // counts both old & new words
  38. #endif
  39.  
  40. //CODE HELP HERE FOR DISK
  41.  
  42. }
  43.  
  44.  
Dec 1 '05 #6
In article <11************ *********@g47g2 000cwa.googlegr oups.com>,
Foodbank <v8********@yah oo.com> wrote:
Hi, here is my in-memory hash. #include <stdio.h>
#include <string.h>
#include <sys/fcntl.h>
#define HASHTSIZE 20011 // A prime number
#define MAXWORDLEN 30 //maximum word length from text file struct hash {
int count;
char word[MAXWORDLEN];
} hash; // the single reusable hash struct
That's no so bad, you aren't storing pointers.
int NUMCOLLISIONS; //# of collisions
int first_hash(char *); //first hash function
int second_hash(cha r *); //second hash function
int main(int, char **);
int get_word(char *); char word_buff[100]; int main(int argc, char *argv[]) {
int h, inc, fd, n;
unlink("hash_fi le"); // remove the old file if present
fd = open("hash_file ", O_CREAT|O_RDWR| O_EXCL, 0666);
For later user, you will need to add in a call such as:

init_hashtable( fd);
while(get_word( word_buff) > 0) {
inc = second_hash(wor d_buff); // second hash
h = first_hash(word _buff); // first hash

#ifdef PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
STRATEGY
while(strcmp(wo rd_buff, hashtable[h].word)) {
I am going to postulate here that the difficulty is that
hashtable[] might not fit into memory, and you want to pull the
appropriate entry from disk.
if(hashtable[h].count == 0) {
strcpy(hashtabl e[h].word, word_buff);
break;
}
NUMCOLLISIONS++ ; // wrong word is a collision
}
hashtable[h].count++; // counts both old & new words
#endif


Okay, what you can do is something like this:

..... h = first_hash(word _buff); // no change, just establishing context

while ( strcmp(word_buf f, (hashentry = get_hashtable_e ntry(h))->word ) ) {
if (hashentry->count == 0) {
strncpy( hashentry->word, word_buff, MAXWORDLEN );
mark_hashtable_ entry_dirty(h);
break;
}
NUMMCOLLISSIONS ++;

release_hashtab le_entry(h);

/* ... something here that updates h, because the code you show
infinite loops if the count it finds is non-zero */
}

/* your previous comment was that the increment "counts both
old & new words", but as you do not have multiple entries
for the same hash bucket, your hashing loop is going to have
to keep running until you find an entry with count of 0;
you cannot drop out of the while loop as long as the count is
non-zero. That being the case, it would make more sense to
increment the count where you do the strcpy... except increment
is the wrong term, as it is always (by induction) only going
to be changing a count of 0 to a count of 1, never (say)
a count of 19 to a count of 20. And if that's the case, that
it isn't really a count but just an in-use flag, then you could
instead just test the first character of the stored word: if
it is '\0' then the entry is unused.
*/

hashentry->count++;
mark_hashtable_ entry_dirty(h);
release_hashtab le_entry(h);

/* ... any other trailing stuff for the get_word() while loop goes here */
}

flush_hashtable ();

close(fd);

This introduces the functions init_hashtable( ), flush_hashtable (),
get_hashtable_e ntry(), mark_hashtable_ entry_dirty(), and
release_hashtab le_entry()

init_hashtable( ) would record the fd (or FILE*) for future use,
and would initialize an array of entries, each large enough to
hold a disk block, together with an array of disk block numbers and
another of markers as to whether the blocks are "clean" or "dirty".

When get_hashtable_e ntry() is called, the passed entry number is
converted (by knowing the structure size) into a disk block number,
with an integral number of entries packed per block. The cache
table is examined, and if that block is already present, then
mark the block as being "in use" and return a pointer to that particular
entry within the hash block.

If the block is not already present, then see if there is an
available slot in the cache array. If so, then read that disk block
into the slot, perform the appropriate overhead management, mark the
slot in use, return the pointer to the entry.

If no slots are available, find a slot that is not marked as being
in-use and which is not marked as dirty; clear the slot (without
any disk I/O), and do the read-the-block thingy. If there were
no slots which were not marked as dirty, pick one of the slots
that is not in use but is dirty, and write the entry block to disk,
then clear the slot and do the read-the-block thingy.

When release_hashtab le_entry() is called, it needs to undo the
"in-use" marker. In the simple loop logic above, there can only be
a single in-use entry, so only one in-use marker per cache slot would
be needed, but in a slightly more complex case, you would want something
like a bit vector of entry numbers within the disk block, in case
the same disk block was being used by two different active entries.

mark_hashtable_ entry_dirty() is a signal that you have made a
change to the contents of an entry and that the new version will
need to be written to disk eventually. It turns out that you only
need one "dirty" flag per disk block, not one per entry within the
block like you do for "in-use". (Actually there are some more
sophisticated disk storage structures in which it would make sense
to use one dirty flag per table entry, but those have tradeoffs
and make the reading-in of entries more complex. But if you have
a "transactio n processing" setup...)

release_hashtab le_entry() -could- trigger a write to disk if it was the
last active user of that cache slot and the disk slot is marked dirty,
but -usually- you want to postpone disk writes as long as practical,
hoping that other entries in the same disk block will get written to
later, so that when you finally need to release the cache slot for
use with an incoming block, that you might have effectively written
several changes with a single I/O. The immediate-write strategy will
often end up with one I/O per changed entry.

flush_hashtable () runs through and writes out all the dirty disk blocks,
and clears the dirty flags. It would not, though, clear the in-use
flags: that way it is safe to flush anytime (e.g., at periodic
intervals.)
--
Prototypes are supertypes of their clones. -- maplesoft
Dec 1 '05 #7
Hi and thank you for your help. However, I was wondering if there was
a way to do this using lseek, read, and write? I'm trying to get a
grasp on these commands if anyone knows anything about them.

Thanks,
James

Dec 1 '05 #8
In article <11************ *********@z14g2 000cwz.googlegr oups.com>,
Foodbank <v8********@yah oo.com> wrote:
Hi and thank you for your help. However, I was wondering if there was
a way to do this using lseek, read, and write? I'm trying to get a
grasp on these commands if anyone knows anything about them.


"lseek", "read" and "write" cannot be used to "do" an unspecified
person's help for an unspecified task.

Please retry your question, but this time with appropriate quoting
and citations, so that we know what you are talking about. Roughly
every fourth posting to this newsgroup contains instructions on
how to use the googlegroups interface to do quoting.

I can predict part of the answer already, though:

lseek(), read() and write() are not part of standard C; if you
want to use them, you are best off heading over to comp.unix.progr amming .
The standard C equivilents are fseek(), fread() and fwrite(), and
we can help you with those here.
--
I was very young in those days, but I was also rather dim.
-- Christopher Priest
Dec 1 '05 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
2022
by: Pat | last post by:
I want to look for some one-to-one hashing function. In C++, any one-to-one hashing function?
1
3014
by: snowteo | last post by:
Hi,I have to do this exercises can you help me: 1)Write a program to implement exetendible hashing.If the table is small enough to fin in main memory,how does its performance compare with open and closed hasing? 2)A basic program consists of a series of statements,each of which is numbered in ascending order.Control is passed by use of a goto or gosub and a statement number.Write a program that reads in a legal BASIC program and renumbers...
11
3440
by: Wm. Scott Miller | last post by:
Hello all! We are building applications here and have hashing algorithms to secure secrets (e.g passwords) by producing one way hashes. Now, I've read alot and I've followed most of the advice that made sense. One comment I've seen alot about is "securing the hashing routine" but no-one explains how to accomplish this. So how do I secure my hashing routine? Do I use code access security, role based security, ACLs, etc or combination?...
10
2876
by: Dino M. Buljubasic | last post by:
Hi, I am using MD5 to hash my passwords and add them to database as hashed. I have noticed though that some passwords don't get recognized and I suppose that it happen because hashing might introduce some characters in my password that are not handled properly by SQL server then. For example, password 'startreck' works just fine password 'test' does not
19
3847
by: Ole Nielsby | last post by:
How does the GetHashCode() of an array object behave? Does it combine the GetHashCode() of its elements, or does it create a sync block for the object? I want to use readonly arrays as dictionary keys, based on their content, not their identity. Is this feasible using the arrays directly, or do I need to wrap them in a struct that handles GetHashCode and Equal? If so, is such a wrapper present in the standard class library?
8
4580
by: Maya | last post by:
Hello all, I'm using MD5 hashing in my application to give unique values to huge list of items my application receives, originally every item's name was difficult to use as an id for this item although its unique but because it had certain characters and variable lengths I ended up using MD5 hashing of the name.
1
4421
by: Tinku | last post by:
Hi friends I know Static Hashing and i know about Dynamic Hashing, still i have problem to make program with Dynamic Hashing I am new in "C" world, please help me, my problem is: i have to make program in Dynamic hashing i have to store int value in nodes user only enter int value by this value i have to find hash key and make symbol table my struct are
11
2160
by: January Weiner | last post by:
Hello, I need to use a hashing function for relatively short strings (roughly 16 characters or less). The data that needs to be accessed via hash is just a simple int. I just need to look up values in two dimensional matrices each time that I encounter a pair of these strings. I would like to keep the code as simple and standard as possible, but at the same time to have a reasonable performance.
15
3012
by: Vinodh | last post by:
I am reading about hashing techniques. The map data structure available in C++ STL uses hashing techniques?
0
9600
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10633
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10376
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10375
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9198
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5548
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5686
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4331
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3860
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.