Algorithm for searching sorted strings in binary file

Hunk

Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Mar 30 '07 #1

Subscribe Post Reply

5309

Jim Langston

"Hunk" <sa**************@gmail.comwrote in message
news:11**********************@b75g2000hsg.googlegr oups.com...

Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Since the records are fixed length, I believe the old divide by two would do
the trick. You need 3 counters, lower bound, upper bound, current.

At beginning, lowerbound = 1. Upperbound = number of records. Current =
Upperbound/2.

Read the Current record (seek postion). If the string is greater than your
search string, The Upperbound becomes Current. Current becomes
(Upperbound - Lowerbound) / 2.

Coontinue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records 2^2, 7 reads are required. 256, 8
reads and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.

Mar 30 '07 #2

Hunk

On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:

"Hunk" <santosh.udyav...@gmail.comwrote in message

news:11**********************@b75g2000hsg.googlegr oups.com...

Hi

I have a binary file which contains records sorted by Identifiers
which are strings. The Identifiers are stored in ascending order. I
would have to write a routine to give the record given the Identifier.
The logical way would be to read the record once and put it in an STL
container such as vector and then use lower_bound to search for a
given identifier. But for some strange reason i'm asked to not use a
container but instead search on the file itself(lets leave it at
that).
Which is the best algorithm for searching sorted strings. I have read
refernce of Boyer-Moore-Horspool string searching algorithm .. but
have no idea on its implementation or speed. I believed for sorted
strings , binary search is the best technique. Am i wrong in my
assumptions?
Here's the format of the file
Record Count
----------------------
Record1
----------------------
Record2
--------------------

each record is made up of a string identifier and a record pointer.
Record 1 = Identifier ( 36 bytes) unique
Record_pointer (4 bytes)

Since the records are fixed length, I believe the old divide by two would do
the trick. You need 3 counters, lower bound, upper bound, current.

At beginning, lowerbound = 1. Upperbound = number of records. Current =
Upperbound/2.

Read the Current record (seek postion). If the string is greater than your
search string, The Upperbound becomes Current. Current becomes
(Upperbound - Lowerbound) / 2.

Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search . How about other searching
algo's?

>
Coontinue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records 2^2, 7 reads are required. 256, 8
reads and so on. So even in large files there are not a lot of reads
required. 4 billion records would only require 32 reads.- Hide quoted text -

- Show quoted text -

Mar 30 '07 #3

James Kanze

On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.comwrote:

On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:

Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search.

Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude moer
time than the strcmp.

How about other searching
algo's?

How many searches do you make? Would it be worthwhile reading
the entire file once, and creating some auxiliary structures
(e.g. a hash table).

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem), and checking immediately all of the
records in it. So that the next step would be n/2 - \epsilon,
rather than just n/2.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Mar 30 '07 #4

Hunk

On Mar 30, 4:33 pm, "James Kanze" <james.ka...@gmail.comwrote:

On Mar 30, 11:42 am, "Hunk" <santosh.udyav...@gmail.comwrote:

On Mar 30, 1:27 pm, "Jim Langston" <tazmas...@rocketmail.comwrote:
Yea, i get you. You are suggesting a binary_search method. I also had
the same opinion that this would be the best possible. But using a
strcmp would be costly right? But on the other hand , any other stl
would also use a strcmp to do a search.

Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude moer
time than the strcmp.

How about other searching
algo's?

How many searches do you make? Would it be worthwhile reading
the entire file once, and creating some auxiliary structures
(e.g. a hash table).

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem), and checking immediately all of the
records in it. So that the next step would be n/2 - \epsilon,
rather than just n/2.

--
James Kanze (GABI Software) email:james.ka...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Yep, i would be reading the entire file into memory and then work on
it. So yes this would solve the multiple file read problem.
Considering this is binary search still a viable option? people here
are of the opinion not to put it in any structure such as hash map for
eg .. any other string searching algo's which would be useful? A large
nos of searches would be there, but no change of the alreayd sorted
list would ever take place

Apr 3 '07 #5

by: usgog | last post by:

I need to implement a function to return True/false whether String A contains String B. For example, String A = "This is a test"; String B = "is is". So it will return TRUE if String A includes two...

Java

What is the data structure & algorithm that is fastest searching & sorting.

by: yee young han | last post by:

I need a fast data structure and algorithm like below condition. (1) this data structure contain only 10,000 data entry. (2) data structure's one entry is like below typedef struct _DataEntry_...

C / C++

Search algorithm to use

by: joshc | last post by:

If I have an array of data that I know to be sorted in increasing order, and the array is less than 50 elements, and I want to find the first element greater than a certain value, is a simple...

C / C++

searching algorithm

by: Bernie Yaeger | last post by:

I need a little help with an algorithm. Let's say I have an array with 15 items. I want to find "Fern" where there are "Able", "Me Also", "Zimmerman", etc in no particular order. Forget about...

Visual Basic .NET

Searching 18,000 strings of an average size of 10Kb

by: Allan Ebdrup | last post by:

What would be the fastest way to search 18,000 strings of an average size of 10Kb, I can have all the strings in memory, should I simply do a instr on all of the strings? Or is there a faster way?...

C# / C Sharp

Generic heap sort algorithm A

by: JosAH | last post by:

Greetings, I was asked to write a Tip Of the Week; so here goes: a lot of topics are started here in this forum (and a lot of other forums too) mentioning a problem about sorting data. ...

Java

Best algorithm/data structure combination for word lookup?

by: pj | last post by:

Hi, I 'm currently writing a program that performs transliteration (i.e., converts greek text written using the english alphabet to "pure" greek text using the greek alphabet) as part of my...

C / C++

Reading & Searching a Huge file

by: Ahmad Jalil Qarshi | last post by:

Hi, I have a text file having size about 2 GB. The text file format is like: Numeric valueAlphaNumeric values Numeric valueAlphaNumeric values Numeric valueAlphaNumeric values For example...

C# / C Sharp

Binary Search Algorithm

by: Kelicula | last post by:

Why?: One commonly used algorithm is the binary search. If you don't already know it, you should read on. Very helpful. Saves much CPU. Reduces computations exponentially. When searching...

Algorithms / Advanced Math

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Algorithm for searching sorted strings in binary file

Similar topics