473,396 Members | 1,914 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

Algorithm for searching sorted strings in binary file

Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Mar 30 '07 #1
4 5309
"Hunk" <sa**************@gmail.comwrote in message
news:11**********************@b75g2000hsg.googlegr oups.com...
Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)
Since the records are fixed length, I believe the old divide by two would do
the trick. You need 3 counters, lower bound, upper bound, current.

At beginning, lowerbound = 1. Upperbound = number of records. Current =
Upperbound/2.

Read the Current record (seek postion). If the string is greater than your
search string, The Upperbound becomes Current. Current becomes
(Upperbound - Lowerbound) / 2.

Coontinue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records 2^2, 7 reads are required. 256, 8
reads and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.
Mar 30 '07 #2
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
"Hunk" <santosh.udyav...@gmail.comwrote in message

news:11**********************@b75g2000hsg.googlegr oups.com...


Hi
I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------
each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Since the records are fixed length, I believe the old divide by two would do
the trick. You need 3 counters, lower bound, upper bound, current.

At beginning, lowerbound = 1. Upperbound = number of records. Current =
Upperbound/2.

Read the Current record (seek postion). If the string is greater than your
search string, The Upperbound becomes Current. Current becomes
(Upperbound - Lowerbound) / 2.
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search . How about other searching
algo's?
>
Coontinue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records 2^2, 7 reads are required. 256, 8
reads and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.- Hide quoted text -

- Show quoted text -

Mar 30 '07 #3
On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.comwrote:
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search.
Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude moer
time than the strcmp.
How about other searching
algo's?
How many searches do you make? Would it be worthwhile reading
the entire file once, and creating some auxiliary structures
(e.g. a hash table).

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem), and checking immediately all of the
records in it. So that the next step would be n/2 - \epsilon,
rather than just n/2.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Mar 30 '07 #4
On Mar 30, 4:33 pm, "James Kanze" <james.ka...@gmail.comwrote:
On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.comwrote:
On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search.

Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude moer
time than the strcmp.
How about other searching
algo's?

How many searches do you make? Would it be worthwhile reading
the entire file once, and creating some auxiliary structures
(e.g. a hash table).

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem), and checking immediately all of the
records in it. So that the next step would be n/2 - \epsilon,
rather than just n/2.

--
James Kanze (GABI Software) email:james.ka...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Yep, i would be reading the entire file into memory and then work on
it. So yes this would solve the multiple file read problem.
Considering this is binary search still a viable option? people here
are of the opinion not to put it in any structure such as hash map for
eg .. any other string searching algo's which would be useful? A large
nos of searches would be there, but no change of the alreayd sorted
list would ever take place

Apr 3 '07 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

13
by: usgog | last post by:
I need to implement a function to return True/false whether String A contains String B. For example, String A = "This is a test"; String B = "is is". So it will return TRUE if String A includes two...
2
by: yee young han | last post by:
I need a fast data structure and algorithm like below condition. (1) this data structure contain only 10,000 data entry. (2) data structure's one entry is like below typedef struct _DataEntry_...
28
by: joshc | last post by:
If I have an array of data that I know to be sorted in increasing order, and the array is less than 50 elements, and I want to find the first element greater than a certain value, is a simple...
6
by: Bernie Yaeger | last post by:
I need a little help with an algorithm. Let's say I have an array with 15 items. I want to find "Fern" where there are "Able", "Me Also", "Zimmerman", etc in no particular order. Forget about...
8
by: Allan Ebdrup | last post by:
What would be the fastest way to search 18,000 strings of an average size of 10Kb, I can have all the strings in memory, should I simply do a instr on all of the strings? Or is there a faster way?...
0
by: JosAH | last post by:
Greetings, I was asked to write a Tip Of the Week; so here goes: a lot of topics are started here in this forum (and a lot of other forums too) mentioning a problem about sorting data. ...
6
by: pj | last post by:
Hi, I 'm currently writing a program that performs transliteration (i.e., converts greek text written using the english alphabet to "pure" greek text using the greek alphabet) as part of my...
3
by: Ahmad Jalil Qarshi | last post by:
Hi, I have a text file having size about 2 GB. The text file format is like: Numeric valueAlphaNumeric values Numeric valueAlphaNumeric values Numeric valueAlphaNumeric values For example...
6
Kelicula
by: Kelicula | last post by:
Why?: One commonly used algorithm is the binary search. If you don't already know it, you should read on. Very helpful. Saves much CPU. Reduces computations exponentially. When searching...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.