473,240 Members | 1,632 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,240 software developers and data experts.

Algorithm for searching sorted strings in binary file

Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Mar 30 '07 #1
4 5301
"Hunk" <sa**************@gmail.comwrote in message
news:11**********************@b75g2000hsg.googlegr oups.com...
Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)
Since the records are fixed length, I believe the old divide by two would do
the trick. You need 3 counters, lower bound, upper bound, current.

At beginning, lowerbound = 1. Upperbound = number of records. Current =
Upperbound/2.

Read the Current record (seek postion). If the string is greater than your
search string, The Upperbound becomes Current. Current becomes
(Upperbound - Lowerbound) / 2.

Coontinue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records 2^2, 7 reads are required. 256, 8
reads and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.
Mar 30 '07 #2
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
"Hunk" <santosh.udyav...@gmail.comwrote in message

news:11**********************@b75g2000hsg.googlegr oups.com...


Hi
I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------
each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Since the records are fixed length, I believe the old divide by two would do
the trick. You need 3 counters, lower bound, upper bound, current.

At beginning, lowerbound = 1. Upperbound = number of records. Current =
Upperbound/2.

Read the Current record (seek postion). If the string is greater than your
search string, The Upperbound becomes Current. Current becomes
(Upperbound - Lowerbound) / 2.
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search . How about other searching
algo's?
>
Coontinue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records 2^2, 7 reads are required. 256, 8
reads and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.- Hide quoted text -

- Show quoted text -

Mar 30 '07 #3
On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.comwrote:
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search.
Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude moer
time than the strcmp.
How about other searching
algo's?
How many searches do you make? Would it be worthwhile reading
the entire file once, and creating some auxiliary structures
(e.g. a hash table).

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem), and checking immediately all of the
records in it. So that the next step would be n/2 - \epsilon,
rather than just n/2.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Mar 30 '07 #4
On Mar 30, 4:33 pm, "James Kanze" <james.ka...@gmail.comwrote:
On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.comwrote:
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search.

Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude moer
time than the strcmp.
How about other searching
algo's?

How many searches do you make? Would it be worthwhile reading
the entire file once, and creating some auxiliary structures
(e.g. a hash table).

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem), and checking immediately all of the
records in it. So that the next step would be n/2 - \epsilon,
rather than just n/2.

--
James Kanze (GABI Software) email:james.ka...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Yep, i would be reading the entire file into memory and then work on
it. So yes this would solve the multiple file read problem.
Considering this is binary search still a viable option? people here
are of the opinion not to put it in any structure such as hash map for
eg .. any other string searching algo's which would be useful? A large
nos of searches would be there, but no change of the alreayd sorted
list would ever take place

Apr 3 '07 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

13
by: usgog | last post by:
I need to implement a function to return True/false whether String A contains String B. For example, String A = "This is a test"; String B = "is is". So it will return TRUE if String A includes two...
2
by: yee young han | last post by:
I need a fast data structure and algorithm like below condition. (1) this data structure contain only 10,000 data entry. (2) data structure's one entry is like below typedef struct _DataEntry_...
28
by: joshc | last post by:
If I have an array of data that I know to be sorted in increasing order, and the array is less than 50 elements, and I want to find the first element greater than a certain value, is a simple...
6
by: Bernie Yaeger | last post by:
I need a little help with an algorithm. Let's say I have an array with 15 items. I want to find "Fern" where there are "Able", "Me Also", "Zimmerman", etc in no particular order. Forget about...
8
by: Allan Ebdrup | last post by:
What would be the fastest way to search 18,000 strings of an average size of 10Kb, I can have all the strings in memory, should I simply do a instr on all of the strings? Or is there a faster way?...
0
by: JosAH | last post by:
Greetings, I was asked to write a Tip Of the Week; so here goes: a lot of topics are started here in this forum (and a lot of other forums too) mentioning a problem about sorting data. ...
6
by: pj | last post by:
Hi, I 'm currently writing a program that performs transliteration (i.e., converts greek text written using the english alphabet to "pure" greek text using the greek alphabet) as part of my...
3
by: Ahmad Jalil Qarshi | last post by:
Hi, I have a text file having size about 2 GB. The text file format is like: Numeric valueAlphaNumeric values Numeric valueAlphaNumeric values Numeric valueAlphaNumeric values For example...
6
Kelicula
by: Kelicula | last post by:
Why?: One commonly used algorithm is the binary search. If you don't already know it, you should read on. Very helpful. Saves much CPU. Reduces computations exponentially. When searching...
3
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 3 Jan 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). For other local times, please check World Time Buddy In...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 7 Feb 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:30 (7.30PM). In this month's session, the creator of the excellent VBE...
0
by: stefan129 | last post by:
Hey forum members, I'm exploring options for SSL certificates for multiple domains. Has anyone had experience with multi-domain SSL certificates? Any recommendations on reliable providers or specific...
0
Git
by: egorbl4 | last post by:
Скачал я git, хотел начать настройку, а там вылезло вот это Что это? Что мне с этим делать? ...
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: Aftab Ahmad | last post by:
Hello Experts! I have written a code in MS Access for a cmd called "WhatsApp Message" to open WhatsApp using that very code but the problem is that it gives a popup message everytime I clicked on...
0
by: Aftab Ahmad | last post by:
So, I have written a code for a cmd called "Send WhatsApp Message" to open and send WhatsApp messaage. The code is given below. Dim IE As Object Set IE =...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.