
Reading & Searching a Huge file

Hi,

I have a text file about 2 GB in size. Each line of the file has the format:

Numeric value[space]AlphaNumeric values
Numeric value[space]AlphaNumeric values
Numeric value[space]AlphaNumeric values

For example, consider the following chunk of actual data:

52 SANZP3118.00Q3126T070533N16CEPXT
96 SANZP3118.00Q7294T070534N17CEP
100 SBHPP3200.00Q16000T070534N18CEPXT
150 SBHPP3200.00Q1000T070534N19CEP
153 SBHPP3200.00Q5000070534N20CEP
159 SBHPP3200.00Q3000070534N21CEP
168 SBHPP3200.00Q111000T070534N22CEP
168 SLHGP375.00Q4000T070534N26CEP
210 SBHPP3200.0Q000T070534N23CEP
569 SLHGP375.00Q23000T070534N24CEP
1000 SLHGP375.00Q162000T070534N25CEP
1010 SLHGP375.00Q4000T070534N26CEP

Now what I want is to find a specific numeric value at the beginning of a
line. For example, in the lines above the numeric values are
52, 96, 100, 150, 153, 159, 168, 210, 569, 1000, 1010 (these values are
always sorted). Suppose I have to search for 168. Since the values are
sorted, I could use binary search to find a specific numeric value, but for
that I would have to read all the numeric values sequentially into some
array in memory. The other way is a sequential search, i.e. to read line by
line, fetch the numeric value before the [space], and compare it with the
required one.

What is the best possible way to perform this search on the above-mentioned
file format?
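
Something like this is what I mean by the sequential approach (a rough
sketch in Java, just for illustration; the class and method names are made
up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SequentialScan {
        // Line-by-line approach: read every line, parse the number before
        // the first space, stop when it matches (or passes) the target.
        public static String find(String path, long target) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    long key = Long.parseLong(line.substring(0, line.indexOf(' ')));
                    if (key == target) {
                        return line;   // found a matching line
                    }
                    if (key > target) {
                        break;         // values are sorted, so we can stop early
                    }
                }
            }
            return null;               // not found
        }
    }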

Thanks in anticipation,

Regards,

Ahmad Jalil Qarshi
Mar 14 '08 #1
3 Replies


On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
<ah*********@SPAMhotmail.com> wrote:
> [...]
> What is the best possible way to perform this search on the above-mentioned
> file format?
How many times do you need to perform this search?

If it's just once, then I think you could take advantage of the fact that
the lines are all _almost_ the same length and do a binary search on the
file itself. You'll have to make a guess, then search for the
previous/next end-of-line, then check the numeric value at the beginning
of that line. That's not as efficient as it could be if you had identical
line lengths, but compared to scanning through a 2GB file it should still
be plenty fast.
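
In rough code, the idea looks something like this (a minimal Java sketch,
purely illustrative: it assumes ASCII-compatible text, LF or CR/LF line
endings, and that every line starts with the number followed by a space; the
class and method names are made up):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FileBinarySearch {
        // Binary search over byte offsets in the file itself.  Each probe
        // seeks to the middle of the current range, skips the (possibly
        // partial) line it landed in, and reads the first complete line
        // after that point.
        public static String find(String path, long target) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                long lo = 0, hi = f.length();
                while (lo < hi) {
                    long mid = lo + (hi - lo) / 2;
                    if (keyOfFirstLineAfter(f, mid) >= target) {
                        hi = mid;      // first line past mid already reaches the target
                    } else {
                        lo = mid + 1;  // first line past mid is still below the target
                    }
                }
                // Scan forward a line or two from lo and check for an exact match.
                f.seek(lo);
                if (lo > 0) {
                    f.readLine();      // skip the partial line we landed in
                }
                String line;
                while ((line = f.readLine()) != null) {
                    long key = leadingNumber(line);
                    if (key == target) return line;
                    if (key > target) break;   // passed it: value is not in the file
                }
                return null;
            }
        }

        private static long keyOfFirstLineAfter(RandomAccessFile f, long pos)
                throws IOException {
            f.seek(pos);
            f.readLine();              // move to the next line boundary
            String line = f.readLine();
            return (line == null) ? Long.MAX_VALUE : leadingNumber(line);
        }

        private static long leadingNumber(String line) {
            return Long.parseLong(line.substring(0, line.indexOf(' ')));
        }
    }

RandomAccessFile.readLine reads a byte at a time, which would be slow for a
full scan, but a binary search over a 2 GB file only touches a few dozen
lines, so that hardly matters here.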

If you're going to perform the search on a regular basis, I'd recommend
creating an index for the file. Scan it once, storing just the initial
numeric value and a byte offset within the file representing where that
record's line starts. Then later you can do your binary search on the
index instead of the file. This is, I think, what you're suggesting when
you wrote "read all the numeric values sequentially into some array"; you
didn't mention also storing the file offsets for each record, but
hopefully you realize that's implied since otherwise keeping the values in
an array wouldn't be useful.
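
Sketched out, such an index might look like this (again an illustrative Java
sketch with made-up names; note that for a 2 GB file the index itself can run
to tens of millions of entries, so in practice you might keep only every Nth
one):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class OffsetIndex {
        // Each entry is { leading numeric value, byte offset of the line start }.
        private final List<long[]> entries = new ArrayList<>();

        // One sequential pass over the file, recording each line's key and offset.
        public OffsetIndex(String path) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                long offset = 0;
                String line;
                while ((line = f.readLine()) != null) {
                    long key = Long.parseLong(line.substring(0, line.indexOf(' ')));
                    entries.add(new long[] { key, offset });
                    offset = f.getFilePointer();   // start of the next line
                }
            }
        }

        // Binary search the in-memory index; returns the byte offset of a line
        // that starts with the target value, or -1 if there is none.
        public long offsetOf(long target) {
            int lo = 0, hi = entries.size() - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                long key = entries.get(mid)[0];
                if (key < target) lo = mid + 1;
                else if (key > target) hi = mid - 1;
                else return entries.get(mid)[1];
            }
            return -1;
        }
    }

Once built, offsetOf gives you a byte position to seek to, and you can
re-read the full line from there.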

Of course, keep in mind that this is the sort of thing that databases are
really good at. If you have a way to redirect the data into a database
rather than keeping it in this huge text file, I think that would be the
best way to handle it.

Pete
Mar 14 '08 #2

Thanks Peter & Jon,

I have to read the file on user request. The problem is that I can't create
an index, as the format of the file is fixed. So I think the only option left
is binary search.

Let me explain one more thing which I missed in my previous post. The
numeric value before the [space] can be repeated thousands of times
consecutively, like:
........ ..............................
2073 SANZP311800Q729T070534N17CEP
2345 SBHPP3200.00Q16000T070534N18CEPXT
2345 SBHPP3200.00Q1000T070534N19CEP
2345 SBHPP3200.00Q5000070534N20CEPTP
........ ..............................
........ ..............................
2345 SLHGP375.00Q4000T070534N26CEP
2567 SBHPP3200.0Q000T070534N23CEP
......... ..............................

Now suppose I have to find the first occurrence of the numeric value 2345.
In that case, will binary search help or not?

Regards,

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
<ah*********@SPAMhotmail.comwrote:
>[...]
Now what I want to is to find a specific numeric value at the beginning
of
the line. For example in the above few lines the numeric values are
52,96,100,150,153,159,168,210,569,1000,1010 (these values are always
sorted). Suppose I have to search 168. Since the values are sorted I can
use
the binary search technique to search a specific numeric value. But for
that
I will have to read all the numeric values sequentially into some array
in
memory. The other way is sequential search i.e to read line by line and
fetch the numeric value before [space] and compare it with required one.

What is the best possible way to perform searching in the above mentioned
file format.

How many times do you need to perform this search?

If it's just once, then I think you could take advantage of the fact that
the lines are all _almost_ the same length and do a binary search on the
file itself. You'll have to make a guess, then search for the
previous/next end-of-line, then check the numeric value at the beginning
of that line. That's not as efficient as it could be if you had identical
line lengths, but compared to scanning through a 2GB file it should still
be plenty fast.

If you're going to perform the search on a regular basis, I'd recommend
creating an index for the file. Scan it once, storing just the initial
numeric value and a byte offset within the file representing where that
record's line starts. Then later you can do your binary search on the
index instead of the file. This is, I think, what you're suggesting when
you wrote "read all the numeric values sequentially into some array"; you
didn't mention also storing the file offsets for each record, but
hopefully you realize that's implied since otherwise keeping the values in
an array wouldn't be useful.

Of course, keep in mind that this is the sort of thing that databases are
really good at. If you have a way to redirect the data into a database
rather than keeping it in this huge text file, I think that would be the
best way to handle it.

Pete

Mar 15 '08 #3

On Sat, 15 Mar 2008 02:36:29 -0700, Ahmad Jalil Qarshi
<ah*********@SPAMhotmail.com> wrote:
> Thanks Peter & Jon,
>
> I have to read the file on user request. The problem is that I can't
> create an index, as the format of the file is fixed. So I think the only
> option left is binary search.
Well, a linear scan is always an option. Whether that performs well
enough to justify avoiding the work of implementing a binary search,
that's up to you or whoever's making the decision about how your time is
best used.

If the text file uses an encoding with fixed-size characters, I think that a
binary search isn't actually going to be very hard to implement. But as Jon
points out, if you have to support multiple encodings and/or encodings with
variable-length characters (e.g. UTF-8), then it gets more complicated (a
little... I'm no character-encoding expert, but I'm under the impression that
in any of the encodings, finding an end-of-line marker while scanning
backwards would not be difficult; an LF or CR/LF pair isn't a valid sequence
in any other character's data, is it?).
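
In ASCII or UTF-8, at least, that holds for LF: the byte 0x0A never appears
inside a multi-byte UTF-8 sequence, so scanning backwards for it is safe. As
a rough illustration (Java, made-up names; it assumes the file is ASCII or
UTF-8 text):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class LineStart {
        // Given an arbitrary byte position, back up to the start of the line
        // that contains it.  Works for ASCII and UTF-8 text because the byte
        // 0x0A ('\n') never occurs inside a multi-byte UTF-8 sequence.
        public static long startOfLineContaining(RandomAccessFile f, long pos)
                throws IOException {
            long p = pos;
            while (p > 0) {
                f.seek(p - 1);
                if (f.read() == '\n') {
                    return p;          // the byte before p ends the previous line
                }
                p--;
            }
            return 0;                  // we are inside the very first line
        }
    }

Combined with the seek-to-midpoint idea above, this lets the binary search
land anywhere in the file and still recover a clean line boundary.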

In either case, it's certainly possible. The question is what's the best
use of your time.

I still wonder at the requirement that this has to be done "on demand",
without an opportunity to index the file. That's an awful lot of data to
be searched just once. But if "the powers that be" deem it more sensible to
make you implement a binary search on a variable-line-length text file
than to fix the problem space to better incorporate an indexing- or
database-based solution, I guess that's what you're stuck with.

I have to admit: if these searches happen infrequently enough that indexing the file
doesn't make sense, I wonder if they don't also happen infrequently enough
that just scanning linearly through the file wouldn't be okay. Why does
this need to be fast?
> Let me explain one more thing which I missed in my previous post. The
> numeric value before the [space] can be repeated thousands of times
> consecutively. [...]
>
> Now suppose I have to find the first occurrence of the numeric value 2345.
> In that case, will binary search help or not?
It will help just as much as it would in any other situation where a
binary search can be used. Binary search doesn't imply unique elements,
but if you have duplicated elements that does mean that you have to do
some extra work to scan and find the first matching element, since the
first element the search finds may not be the actual first element in the
sequence.
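
As a rough illustration of that extra work (Java, made-up names; it assumes
ASCII/UTF-8 text with non-empty lines, and that a prior search has already
given you the start offset of some matching line):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FirstOccurrence {
        // Given the start offset of *some* line whose leading value equals key,
        // walk backwards one line at a time while the preceding line has the
        // same key, and return the offset of the first such line.
        public static long firstLineWithKey(RandomAccessFile f, long lineStart, long key)
                throws IOException {
            while (lineStart > 0) {
                long prevStart = startOfPreviousLine(f, lineStart);
                f.seek(prevStart);
                String prev = f.readLine();
                long prevKey = Long.parseLong(prev.substring(0, prev.indexOf(' ')));
                if (prevKey != key) {
                    break;             // the line before this one has a smaller value
                }
                lineStart = prevStart; // same value: keep walking backwards
            }
            return lineStart;
        }

        // Find the start of the line that ends just before 'lineStart'.
        private static long startOfPreviousLine(RandomAccessFile f, long lineStart)
                throws IOException {
            long p = lineStart - 2;    // skip the '\n' that ends the previous line
            while (p > 0) {
                f.seek(p);
                if (f.read() == '\n') {
                    return p + 1;
                }
                p--;
            }
            return 0;                  // the previous line is the first in the file
        }
    }

If the runs of equal values are very long, it can be cheaper to phrase the
binary search itself as a lower-bound search (narrow only on "probe value >=
target", as in the earlier sketch), which lands on the first matching line
directly instead of walking backwards through thousands of duplicates.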

Pete
Mar 15 '08 #4
