
Indexing a Text File

Hi,

I'm trying to index a text file by creating the index and data clusters
(basically ISAM). Can anyone help with this? I'm finding very few
resources online for this application.

Thanks,
James

Dec 18 '05 #1
5 Replies


I've gotten some code started below, if anyone can give me some
pointers (and I'm not talking about pointers in C :) ).

Thanks,
James

    #define ELEMPERCLUST 42               //42 elements per cluster
    #define CLUSTERSIZE  1024             //cluster size obviously :)
    #define MAXWORDLEN   22               //maximum word length
    #define NUMLEVELS    3                //number of index levels
    #define WORDLEV      (NUMLEVELS-1)    //words are on level 2 (3-1 = 2)

    struct element {
        int  index;                       // cluster index
        char word[MAXWORDLEN];
    };

    struct cluster {
        int indexclust;                   //Set to 1 for index clust or 0 for words
        struct element elem[ELEMPERCLUST];
    } clust[NUMLEVELS];                   // 2 index levels 0, 1 plus the word level 2

    int fd, clust_pos, nincluster[NUMLEVELS], location[NUMLEVELS];

    void isam_output(int lev, char *word, int index) {
        //Time to write the index and data clusters for words
    }

Dec 18 '05 #2


Foodbank wrote:
> I've gotten some code started below, if anyone can give me some
> pointers (and I'm not talking about pointers in C :) ).

First off, all data structures are useless without code. It's hard to
see what exactly you are trying to achieve and how you intend to
do it.

And I hate trying to give advice from what I _think_ you mean.

> #define ELEMPERCLUST 42               //42 elements per cluster
> #define CLUSTERSIZE  1024             //cluster size obviously :)

So far, so good. Dump the smileys, though. If this is professional
code, don't try to be cute.

> #define MAXWORDLEN   22               //maximum word length
> #define NUMLEVELS    3                //number of index levels

A more useful comment would explain _why_ there are only three levels.
Why not simply use qsort and bsearch? Why require a homegrown indexing
method?
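
To make the qsort/bsearch suggestion concrete, here is a minimal sketch. It assumes the whole word list fits in memory as an array of string pointers, which the original post does not confirm. The names cmp_words, sort_words and find_word are made up for illustration and appear nowhere in the posted code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for an array of `const char *`. qsort and bsearch both pass
   pointers to array elements, so each argument points at a string pointer. */
static int cmp_words(const void *a, const void *b)
{
    const char *const *wa = a;
    const char *const *wb = b;
    return strcmp(*wa, *wb);
}

/* Sort the word table in place so that bsearch can be used on it. */
void sort_words(const char **words, size_t count)
{
    assert(words != NULL || count == 0);
    qsort(words, count, sizeof words[0], cmp_words);
}

/* Return the position of `key` in the sorted table, or -1 if it is absent. */
long find_word(const char *const *words, size_t count, const char *key)
{
    const char *const *hit =
        bsearch(&key, words, count, sizeof words[0], cmp_words);
    return hit ? (long)(hit - words) : -1;
}
```

After sort_words(words, n), a call such as find_word(words, n, "kiwi") gives the position of that word in the sorted array, or -1 if it is missing. Whether this is enough depends on whether the data really has to stay on disk, which the post does not say.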

> #define WORDLEV      (NUMLEVELS-1)    //words are on level 2 (3-1 = 2)
>
> struct element {
>     int  index;                       // cluster index
>     char word[MAXWORDLEN];
> };

This will give you a lot of overhead, since MAXWORDLEN must be large
enough to hold the longest word and most words will be much shorter.
Isn't there a better alternative? Does the content change frequently?

If not, why not record all the words in one big buffer, separated by
'\0', and simply use a char*? Also, why do you store the 'cluster index'
in the element? It's not clear from what you write here, so at least
there should be a comment explaining that.
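
As a rough illustration of the "one big buffer" idea, the sketch below splits a buffer in place and records a char* per word. It assumes the text is already in memory and that words are separated by whitespace. The function name split_words and its parameters are hypothetical, not part of the posted code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Split `buf` in place. The separator that ends each recorded word is
   overwritten with '\0', and a pointer to the start of each word is stored
   in `words`. Returns the number of words recorded, at most `max_words`. */
size_t split_words(char *buf, char **words, size_t max_words)
{
    size_t n = 0;
    char *p = buf;

    assert(buf != NULL && words != NULL);

    while (n < max_words) {
        while (isspace((unsigned char)*p))
            p++;                    /* skip separators before the word */
        if (*p == '\0')
            break;                  /* end of buffer, no more words */
        words[n++] = p;             /* the word starts here */
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;                  /* last word is already terminated */
        *p++ = '\0';                /* terminate the word just recorded */
    }
    return n;
}
```

With this layout each word costs its own length plus one terminator byte in the buffer, plus one pointer in the table, instead of a fixed MAXWORDLEN bytes per element.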

> struct cluster {
>     int indexclust;                   //Set to 1 for index clust or 0 for words
>     struct element elem[ELEMPERCLUST];
> } clust[NUMLEVELS];                   // 2 index levels 0, 1 plus the word level 2

Obviously you know (from the index of 'clust') whether you are dealing
with an 'index clust' or a word, so the first field seems superfluous.
Unless, of course, you are planning to do something incredibly clever
that I do not see.

> int fd, clust_pos, nincluster[NUMLEVELS], location[NUMLEVELS];

It's generally a good idea to define only one variable per line. That
will make your code easier to read. After all, you write in C for the
benefit of humans, not computers.

> void isam_output(int lev, char *word, int index) {
>     //Time to write the index and data clusters for words
> }

How do you signal an error? If you write to a stream, any number of
things can go wrong. Failing silently virtually guarantees BIG
problems. Also, your interface requires you to know the index before
you've written anything. Now you could be planning something incredibly
clever, but from what you post here, I don't think it's a very usable
interface. Usually the index is a _result_ of writing a record.
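
One possible shape for such an interface is sketched below. It is only an assumption about what the writer could look like: it uses a stdio stream opened in binary mode that holds nothing but fixed-size word records, and the name append_word is invented for this example. The index comes back as the return value, and -1 signals an error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXWORDLEN 22   /* same limit as in the quoted code */

/* Append one fixed-size, zero-padded word record to `fp`.
   On success, return the index of the record just written
   (0 for the first record, 1 for the second, and so on).
   On any failure, return -1 so the caller can react. */
long append_word(FILE *fp, const char *word)
{
    char record[MAXWORDLEN] = {0};
    long offset;

    assert(fp != NULL && word != NULL);

    if (strlen(word) >= MAXWORDLEN)
        return -1;                      /* word would not fit in a record */
    strcpy(record, word);

    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    offset = ftell(fp);                 /* byte position where the record goes */
    if (offset < 0)
        return -1;
    if (fwrite(record, sizeof record, 1, fp) != 1)
        return -1;

    return offset / (long)sizeof record;
}
```

Because stdio buffers output, a write error may only show up later, so the caller would still want to check the result of fflush or fclose.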


Dec 19 '05 #3

I appreciate the effort, but all you did was basically criticize my
code instead of pointing me in the correct direction to go. Anyone
else?

Thanks,
James

PS I'll use all the smileys I want :)

Dec 19 '05 #4

On 18 Dec 2005 11:33:32 -0800, "Foodbank" <v8********@yahoo.com>
wrote:
> I've gotten some code started below, if anyone can give me some
> pointers (and I'm not talking about pointers in C :) ).

There is nothing in your code that tells us what you are trying to
accomplish.

I recommend you leave off trivial comments. They actually decrease
readability.


<<Remove the del for email>>
Dec 19 '05 #5

In article <11**********************@o13g2000cwo.googlegroups.com>,
Foodbank <v8********@yahoo.com> wrote:
> I appreciate the effort, but all you did was basically criticize my
> code instead of pointing me in the correct direction to go. Anyone
> else?

Your posting asked for "pointers", and the respondent gave you a
number of pointers as to how your code could be improved and as
to why your existing interface does not appear to suit the stated
purpose.

If that wasn't the kind of pointer that you wanted, then you
could have been more specific.

What is it that you are looking for? Are you looking for research
papers comparing the efficiency of ISAM to other databases? Are
you looking for information on how to optimize ISAM lookups?
Are you looking for a solid description of what ISAM is, but
without code, for the purposes of a "clean-room implementation"
for a commercial product? Are you looking for a public domain
ISAM for use in a commercial product? Are you looking for an ISAM
implementation with a freeware license that could be used in
a commercial product? Are you looking for an ISAM with a freeware
license that would allow you to use it in a non-commercial product?

Or, are you looking for hints on how to code a school assignment?
--
Programming is what happens while you're busy making other plans.
Dec 19 '05 #6
