473,322 Members | 1,538 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,322 software developers and data experts.

Hashing to Disk

Hi,

Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.

Thanks,
James

Dec 1 '05 #1
8 4188
"Foodbank" <v8********@yahoo.com> writes:
Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.


It's not at all clear what you mean by this. Are you talking about
some kind of hash table? If so, please be more specific.

Note that C has no built-in support for hash tables. I suspect you
have more of an algorithm question than a C question. If you can show
us some code, we might be able to help you.

Note also that none of lseek, read, and write exist in standard C
(though they're similar to fseek, fread, and fwrite, which do). "fd"
is probably an abbreviation of "file descriptor", which also does not
exist in standard C (the closest equivalent is the type FILE*). You
might be looking for a Unix-specific newsgroup like
comp.unix.programmer.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Dec 1 '05 #2
Keith Thompson wrote
(in article <ln************@nuthaus.mib.org>):
"Foodbank" <v8********@yahoo.com> writes:
Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.


It's not at all clear what you mean by this. Are you talking about
some kind of hash table? If so, please be more specific.


It sounds like he might be trying to save a hash table in memory
out to disk for a subsequent execution, but just guessing here.
If so, insert standard warnings about byte order and portability
for objects larger than bytes.

--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw

Dec 1 '05 #3
Yes, this is an algorithm based question. Basically, I'm trying to
save the hash table to file instead of dynamically in memory. So, for
example, I could access the hash table residing in the file at a later
date. I hope that helps.

Thanks,
James

Dec 1 '05 #4
"Foodbank" <v8********@yahoo.com> writes:
Yes, this is an algorithm based question. Basically, I'm trying to
save the hash table to file instead of dynamically in memory. So, for
example, I could access the hash table residing in the file at a later
date. I hope that helps.


It doesn't help much.

First, read this: <http://cfaj.freeshell.org/google/>.

There are a number of ways to represent a hash table. If the
in-memory representation uses pointers, it's going to be difficult to
represent them meaningfully in an external file; probably you can
translate them to some sort of numeric index. If you use binary
files, they won't be readable on different platforms (varying byte
order, varying sizes of fundamental data types, etc.) unless you go to
some extra effort to canonicalize everything. You might consider a
text-based representation.

If you can show us the C code that implements the hash table,
particularly any type declarations, we can probably help you come up
with a disk-based representation.

If your question isn't specific to C, you might try comp.programming.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Dec 1 '05 #5
Hi, here is my in-memory hash. Also, I've added the sections to create
and close the file.
As a note, I've removed some of the sections that are irrelevent to the
sections I need help with to save space and for ease of reading.

Thanks,
James

Expand|Select|Wrap|Line Numbers
  1. #include <stdio.h>
  2. #include <string.h>
  3. #include <sys/fcntl.h>
  4. #define HASHTSIZE  20011    // A prime number
  5. #define MAXWORDLEN 30     //maximum word length from text file
  6.  
  7. struct hash {
  8. int     count;
  9. char    word[MAXWORDLEN];
  10. } hash;              // the single reusable hash struct
  11.  
  12. int NUMCOLLISIONS;        //# of collisions
  13. int first_hash(char *);          //first hash function
  14. int second_hash(char *);     //second hash function
  15. int main(int, char **);
  16. int get_word(char *);
  17.  
  18. char word_buff[100];
  19.  
  20. int main(int argc, char *argv[]) {
  21. int h, inc, fd, n;
  22. unlink("hash_file");          // remove the old file if present
  23. fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666);
  24. while(get_word(word_buff) > 0) {
  25. inc = second_hash(word_buff);    // second hash
  26. h = first_hash(word_buff);     // first hash
  27.  
  28. #ifdef     PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
  29. STRATEGY
  30. while(strcmp(word_buff, hashtable[h].word)) {
  31. if(hashtable[h].count == 0) {
  32. strcpy(hashtable[h].word, word_buff);
  33. break;
  34. }
  35. NUMCOLLISIONS++; // wrong word is a collision
  36. }
  37. hashtable[h].count++;      // counts both old & new words
  38. #endif
  39.  
  40. //CODE HELP HERE FOR DISK
  41.  
  42. }
  43.  
  44.  
Dec 1 '05 #6
In article <11*********************@g47g2000cwa.googlegroups. com>,
Foodbank <v8********@yahoo.com> wrote:
Hi, here is my in-memory hash. #include <stdio.h>
#include <string.h>
#include <sys/fcntl.h>
#define HASHTSIZE 20011 // A prime number
#define MAXWORDLEN 30 //maximum word length from text file struct hash {
int count;
char word[MAXWORDLEN];
} hash; // the single reusable hash struct
That's no so bad, you aren't storing pointers.
int NUMCOLLISIONS; //# of collisions
int first_hash(char *); //first hash function
int second_hash(char *); //second hash function
int main(int, char **);
int get_word(char *); char word_buff[100]; int main(int argc, char *argv[]) {
int h, inc, fd, n;
unlink("hash_file"); // remove the old file if present
fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666);
For later user, you will need to add in a call such as:

init_hashtable(fd);
while(get_word(word_buff) > 0) {
inc = second_hash(word_buff); // second hash
h = first_hash(word_buff); // first hash

#ifdef PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
STRATEGY
while(strcmp(word_buff, hashtable[h].word)) {
I am going to postulate here that the difficulty is that
hashtable[] might not fit into memory, and you want to pull the
appropriate entry from disk.
if(hashtable[h].count == 0) {
strcpy(hashtable[h].word, word_buff);
break;
}
NUMCOLLISIONS++; // wrong word is a collision
}
hashtable[h].count++; // counts both old & new words
#endif


Okay, what you can do is something like this:

..... h = first_hash(word_buff); // no change, just establishing context

while ( strcmp(word_buff, (hashentry = get_hashtable_entry(h))->word ) ) {
if (hashentry->count == 0) {
strncpy( hashentry->word, word_buff, MAXWORDLEN );
mark_hashtable_entry_dirty(h);
break;
}
NUMMCOLLISSIONS++;

release_hashtable_entry(h);

/* ... something here that updates h, because the code you show
infinite loops if the count it finds is non-zero */
}

/* your previous comment was that the increment "counts both
old & new words", but as you do not have multiple entries
for the same hash bucket, your hashing loop is going to have
to keep running until you find an entry with count of 0;
you cannot drop out of the while loop as long as the count is
non-zero. That being the case, it would make more sense to
increment the count where you do the strcpy... except increment
is the wrong term, as it is always (by induction) only going
to be changing a count of 0 to a count of 1, never (say)
a count of 19 to a count of 20. And if that's the case, that
it isn't really a count but just an in-use flag, then you could
instead just test the first character of the stored word: if
it is '\0' then the entry is unused.
*/

hashentry->count++;
mark_hashtable_entry_dirty(h);
release_hashtable_entry(h);

/* ... any other trailing stuff for the get_word() while loop goes here */
}

flush_hashtable();

close(fd);

This introduces the functions init_hashtable(), flush_hashtable(),
get_hashtable_entry(), mark_hashtable_entry_dirty(), and
release_hashtable_entry()

init_hashtable() would record the fd (or FILE*) for future use,
and would initialize an array of entries, each large enough to
hold a disk block, together with an array of disk block numbers and
another of markers as to whether the blocks are "clean" or "dirty".

When get_hashtable_entry() is called, the passed entry number is
converted (by knowing the structure size) into a disk block number,
with an integral number of entries packed per block. The cache
table is examined, and if that block is already present, then
mark the block as being "in use" and return a pointer to that particular
entry within the hash block.

If the block is not already present, then see if there is an
available slot in the cache array. If so, then read that disk block
into the slot, perform the appropriate overhead management, mark the
slot in use, return the pointer to the entry.

If no slots are available, find a slot that is not marked as being
in-use and which is not marked as dirty; clear the slot (without
any disk I/O), and do the read-the-block thingy. If there were
no slots which were not marked as dirty, pick one of the slots
that is not in use but is dirty, and write the entry block to disk,
then clear the slot and do the read-the-block thingy.

When release_hashtable_entry() is called, it needs to undo the
"in-use" marker. In the simple loop logic above, there can only be
a single in-use entry, so only one in-use marker per cache slot would
be needed, but in a slightly more complex case, you would want something
like a bit vector of entry numbers within the disk block, in case
the same disk block was being used by two different active entries.

mark_hashtable_entry_dirty() is a signal that you have made a
change to the contents of an entry and that the new version will
need to be written to disk eventually. It turns out that you only
need one "dirty" flag per disk block, not one per entry within the
block like you do for "in-use". (Actually there are some more
sophisticated disk storage structures in which it would make sense
to use one dirty flag per table entry, but those have tradeoffs
and make the reading-in of entries more complex. But if you have
a "transaction processing" setup...)

release_hashtable_entry() -could- trigger a write to disk if it was the
last active user of that cache slot and the disk slot is marked dirty,
but -usually- you want to postpone disk writes as long as practical,
hoping that other entries in the same disk block will get written to
later, so that when you finally need to release the cache slot for
use with an incoming block, that you might have effectively written
several changes with a single I/O. The immediate-write strategy will
often end up with one I/O per changed entry.

flush_hashtable() runs through and writes out all the dirty disk blocks,
and clears the dirty flags. It would not, though, clear the in-use
flags: that way it is safe to flush anytime (e.g., at periodic
intervals.)
--
Prototypes are supertypes of their clones. -- maplesoft
Dec 1 '05 #7
Hi and thank you for your help. However, I was wondering if there was
a way to do this using lseek, read, and write? I'm trying to get a
grasp on these commands if anyone knows anything about them.

Thanks,
James

Dec 1 '05 #8
In article <11*********************@z14g2000cwz.googlegroups. com>,
Foodbank <v8********@yahoo.com> wrote:
Hi and thank you for your help. However, I was wondering if there was
a way to do this using lseek, read, and write? I'm trying to get a
grasp on these commands if anyone knows anything about them.


"lseek", "read" and "write" cannot be used to "do" an unspecified
person's help for an unspecified task.

Please retry your question, but this time with appropriate quoting
and citations, so that we know what you are talking about. Roughly
every fourth posting to this newsgroup contains instructions on
how to use the googlegroups interface to do quoting.

I can predict part of the answer already, though:

lseek(), read() and write() are not part of standard C; if you
want to use them, you are best off heading over to comp.unix.programming .
The standard C equivilents are fseek(), fread() and fwrite(), and
we can help you with those here.
--
I was very young in those days, but I was also rather dim.
-- Christopher Priest
Dec 1 '05 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: Pat | last post by:
I want to look for some one-to-one hashing function. In C++, any one-to-one hashing function?
1
by: snowteo | last post by:
Hi,I have to do this exercises can you help me: 1)Write a program to implement exetendible hashing.If the table is small enough to fin in main memory,how does its performance compare with open and...
11
by: Wm. Scott Miller | last post by:
Hello all! We are building applications here and have hashing algorithms to secure secrets (e.g passwords) by producing one way hashes. Now, I've read alot and I've followed most of the advice...
10
by: Dino M. Buljubasic | last post by:
Hi, I am using MD5 to hash my passwords and add them to database as hashed. I have noticed though that some passwords don't get recognized and I suppose that it happen because hashing might...
19
by: Ole Nielsby | last post by:
How does the GetHashCode() of an array object behave? Does it combine the GetHashCode() of its elements, or does it create a sync block for the object? I want to use readonly arrays as...
8
by: Maya | last post by:
Hello all, I'm using MD5 hashing in my application to give unique values to huge list of items my application receives, originally every item's name was difficult to use as an id for this item...
1
by: Tinku | last post by:
Hi friends I know Static Hashing and i know about Dynamic Hashing, still i have problem to make program with Dynamic Hashing I am new in "C" world, please help me, my problem is: i have to...
11
by: January Weiner | last post by:
Hello, I need to use a hashing function for relatively short strings (roughly 16 characters or less). The data that needs to be accessed via hash is just a simple int. I just need to look up...
15
by: Vinodh | last post by:
I am reading about hashing techniques. The map data structure available in C++ STL uses hashing techniques?
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.