
# Algorithm for searching sorted strings in binary file

Hi

I have a binary file containing records sorted by Identifiers, which
are strings stored in ascending order. I have to write a routine that
returns the record for a given Identifier.
The logical way would be to read the records once, put them in an STL
container such as a vector, and then use lower_bound to search for a
given identifier. But for some strange reason I'm asked not to use a
container but instead to search the file itself (let's leave it at
that).
Which is the best algorithm for searching sorted strings? I have read a
reference to the Boyer-Moore-Horspool string searching algorithm, but
have no idea of its implementation or speed. I believed that for
sorted strings, binary search is the best technique. Am I wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

Each record is made up of a string identifier and a record pointer:
Record = Identifier (36 bytes, unique)
         Record_pointer (4 bytes)

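For concreteness, the layout described above could be modeled like this (a sketch only; the field names are invented and the packing pragma assumes the file stores the fields back to back with no padding):

```cpp
#include <cstdint>

// Hypothetical in-memory mirror of one on-disk record: a fixed-length
// 36-byte identifier followed by a 4-byte record pointer, with no
// padding between the fields.
#pragma pack(push, 1)
struct Record {
    char     identifier[36];  // fixed width, not necessarily NUL-terminated
    uint32_t record_pointer;  // 4-byte pointer/offset
};
#pragma pack(pop)

// The whole record must be exactly 40 bytes to match the file format.
static_assert(sizeof(Record) == 40, "record must be exactly 40 bytes");
```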
Mar 30 '07 #1
"Hunk" <sa**************@gmail.com> wrote in message
[original question quoted; snipped]
Since the records are fixed length, I believe the old divide-by-two would do
the trick. You need three counters: lower bound, upper bound, current.

At the beginning, Lowerbound = 1, Upperbound = number of records, and
Current = Upperbound / 2.

Read the Current record (seek to its position). If its string is greater
than your search string, Upperbound becomes Current; otherwise Lowerbound
becomes Current. Either way, the new Current becomes
Lowerbound + (Upperbound - Lowerbound) / 2.

Continue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records (2^7), 7 reads are required; with 256,
8 reads; and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.
Mar 30 '07 #2
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
"Hunk" <santosh.udyav...@gmail.com> wrote in message
[quoted text snipped]
Yes, I get you. You are suggesting a binary search. I also had the
same opinion that this would be the best possible approach. But
wouldn't using strcmp be costly? On the other hand, any STL search
would also use a string compare. How about other searching
algorithms?

Mar 30 '07 #3
On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.com> wrote:
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
Yes, I get you. You are suggesting a binary search. I also had the
same opinion that this would be the best possible approach. But
wouldn't using strcmp be costly? On the other hand, any STL search
would also use a string compare.
Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude more
time than the strcmp.
How many searches do you make? Would it be worthwhile to read
the entire file once and create some auxiliary structure
(e.g. a hash table)?

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem) and immediately checking all of the
records in it, so that the next step would be n/2 - epsilon
rather than just n/2.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Mar 30 '07 #4
On Mar 30, 4:33 pm, "James Kanze" <james.ka...@gmail.com> wrote:
On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.com> wrote:
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
[quoted text snipped]
Yep, I would be reading the entire file into memory and then working on
it, so that solves the multiple-read problem. Is binary search still a
viable option in that case? People here are of the opinion not to put
it in any structure such as a hash map, for example. Are there any
other string searching algorithms which would be useful? There would be
a large number of searches, but the already sorted list would never
change.
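Once the file is in memory, std::lower_bound can binary-search the raw 40-byte records in place, so no per-record container is needed. A sketch, again assuming the record layout from the original post (names invented for illustration):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// In-memory mirror of one on-disk record (36-byte id + 4-byte pointer,
// no padding).
#pragma pack(push, 1)
struct Record {
    char     identifier[36];
    uint32_t record_pointer;
};
#pragma pack(pop)

// Comparator for lower_bound: orders a Record against a raw 36-byte key.
struct IdLess {
    bool operator()(const Record& r, const char* key) const {
        return std::memcmp(r.identifier, key, sizeof r.identifier) < 0;
    }
};

// 'records' points at the first record in the buffer read from the file.
// Returns the matching record, or a null pointer if the key is absent.
const Record* find_in_memory(const Record* records, std::size_t count,
                             const char key[36]) {
    const Record* end = records + count;
    const Record* it = std::lower_bound(records, end, key, IdLess());
    if (it != end && std::memcmp(it->identifier, key, 36) == 0) return it;
    return 0;  // not found
}
```

lower_bound never needs equality, only the less-than comparison, which is why a single memcmp-based comparator is enough; the final memcmp just checks that the insertion point actually holds the key.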

Apr 3 '07 #5
