
Reading & Searching a Huge file

Hi,

I have a text file about 2 GB in size. Each line of the file has the format:

Numeric value[space]AlphaNumeric values
Numeric value[space]AlphaNumeric values
Numeric value[space]AlphaNumeric values

For example, consider the following chunk of actual data:

52 SANZP3118.00Q3126T070533N16CEPXT
96 SANZP3118.00Q7294T070534N17CEP
100 SBHPP3200.00Q16000T070534N18CEPXT
150 SBHPP3200.00Q1000T070534N19CEP
153 SBHPP3200.00Q5000070534N20CEP
159 SBHPP3200.00Q3000070534N21CEP
168 SBHPP3200.00Q111000T070534N22CEP
168 SLHGP375.00Q4000T070534N26CEP
210 SBHPP3200.0Q000T070534N23CEP
569 SLHGP375.00Q23000T070534N24CEP
1000 SLHGP375.00Q162000T070534N25CEP
1010 SLHGP375.00Q4000T070534N26CEP

Now what I want is to find a specific numeric value at the beginning of a
line. For example, in the lines above the numeric values are
52, 96, 100, 150, 153, 159, 168, 210, 569, 1000, 1010 (these values are
always sorted). Suppose I have to search for 168. Since the values are
sorted, I could use binary search to find a specific numeric value, but for
that I would have to read all the numeric values sequentially into some
array in memory. The other way is a sequential search, i.e. to read line by
line, fetch the numeric value before the [space], and compare it with the
required one.

What is the best possible way to perform this search on the above-mentioned
file format?
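
Something like this is what I mean by the sequential approach (a rough
sketch in Java, just for illustration; the class and method names are made
up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SequentialScan {
        // Line-by-line approach: read every line, parse the number before
        // the first space, stop when it matches (or passes) the target.
        public static String find(String path, long target) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    long key = Long.parseLong(line.substring(0, line.indexOf(' ')));
                    if (key == target) {
                        return line;   // found a matching line
                    }
                    if (key > target) {
                        break;         // values are sorted, so we can stop early
                    }
                }
            }
            return null;               // not found
        }
    }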

Thanks in anticipation,

Regards,

Ahmad Jalil Qarshi
Mar 14 '08 #1
3 Replies


On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
<ah*********@SPAMhotmail.com> wrote:
> [...]
> What is the best possible way to perform this search on the above-mentioned
> file format?
How many times do you need to perform this search?

If it's just once, then I think you could take advantage of the fact that
the lines are all _almost_ the same length and do a binary search on the
file itself. You'll have to make a guess, then search for the
previous/next end-of-line, then check the numeric value at the beginning
of that line. That's not as efficient as it could be if you had identical
line lengths, but compared to scanning through a 2GB file it should still
be plenty fast.
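
In rough code, the idea looks something like this (a minimal Java sketch,
purely illustrative: it assumes ASCII-compatible text, LF or CR/LF line
endings, and that every line starts with the number followed by a space; the
class and method names are made up):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FileBinarySearch {
        // Binary search over byte offsets in the file itself.  Each probe
        // seeks to the middle of the current range, skips the (possibly
        // partial) line it landed in, and reads the first complete line
        // after that point.
        public static String find(String path, long target) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                long lo = 0, hi = f.length();
                while (lo < hi) {
                    long mid = lo + (hi - lo) / 2;
                    if (keyOfFirstLineAfter(f, mid) >= target) {
                        hi = mid;      // first line past mid already reaches the target
                    } else {
                        lo = mid + 1;  // first line past mid is still below the target
                    }
                }
                // Scan forward a line or two from lo and check for an exact match.
                f.seek(lo);
                if (lo > 0) {
                    f.readLine();      // skip the partial line we landed in
                }
                String line;
                while ((line = f.readLine()) != null) {
                    long key = leadingNumber(line);
                    if (key == target) return line;
                    if (key > target) break;   // passed it: value is not in the file
                }
                return null;
            }
        }

        private static long keyOfFirstLineAfter(RandomAccessFile f, long pos)
                throws IOException {
            f.seek(pos);
            f.readLine();              // move to the next line boundary
            String line = f.readLine();
            return (line == null) ? Long.MAX_VALUE : leadingNumber(line);
        }

        private static long leadingNumber(String line) {
            return Long.parseLong(line.substring(0, line.indexOf(' ')));
        }
    }

RandomAccessFile.readLine reads a byte at a time, which would be slow for a
full scan, but a binary search over a 2 GB file only touches a few dozen
lines, so that hardly matters here.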

If you're going to perform the search on a regular basis, I'd recommend
creating an index for the file. Scan it once, storing just the initial
numeric value and a byte offset within the file representing where that
record's line starts. Then later you can do your binary search on the
index instead of the file. This is, I think, what you're suggesting when
you wrote "read all the numeric values sequentially into some array"; you
didn't mention also storing the file offsets for each record, but
hopefully you realize that's implied since otherwise keeping the values in
an array wouldn't be useful.
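
Sketched out, such an index might look like this (again an illustrative Java
sketch with made-up names; note that for a 2 GB file the index itself can run
to tens of millions of entries, so in practice you might keep only every Nth
one):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class OffsetIndex {
        // Each entry is { leading numeric value, byte offset of the line start }.
        private final List<long[]> entries = new ArrayList<>();

        // One sequential pass over the file, recording each line's key and offset.
        public OffsetIndex(String path) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                long offset = 0;
                String line;
                while ((line = f.readLine()) != null) {
                    long key = Long.parseLong(line.substring(0, line.indexOf(' ')));
                    entries.add(new long[] { key, offset });
                    offset = f.getFilePointer();   // start of the next line
                }
            }
        }

        // Binary search the in-memory index; returns the byte offset of a line
        // that starts with the target value, or -1 if there is none.
        public long offsetOf(long target) {
            int lo = 0, hi = entries.size() - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                long key = entries.get(mid)[0];
                if (key < target) lo = mid + 1;
                else if (key > target) hi = mid - 1;
                else return entries.get(mid)[1];
            }
            return -1;
        }
    }

Once built, offsetOf gives you a byte position to seek to, and you can
re-read the full line from there.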

Of course, keep in mind that this is the sort of thing that databases are
really good at. If you have a way to redirect the data into a database
rather than keeping it in this huge text file, I think that would be the
best way to handle it.

Pete
Mar 14 '08 #2

Thanks Peter & Jon,

I have to read the file on user request. The problem is that I can't create
an index, as the format of the file is fixed. So I think the only option left
is binary search.

Let me explain one more thing which I missed in my previous post. The
numeric value before the [space] can be repeated thousands of times
consecutively, like:
........ ..............................
2073 SANZP311800Q729T070534N17CEP
2345 SBHPP3200.00Q16000T070534N18CEPXT
2345 SBHPP3200.00Q1000T070534N19CEP
2345 SBHPP3200.00Q5000070534N20CEPTP
........ ..............................
........ ..............................
2345 SLHGP375.00Q4000T070534N26CEP
2567 SBHPP3200.0Q000T070534N23CEP
......... ..............................

Now suppose I have to find the first occurrence of the numeric value 2345.
In that case, will binary search help or not?

Regards,

"Peter Duniho" <Np*********@nnowslpianmk.comwrote in message
news:op***************@petes-computer.local...
On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
<ah*********@SPAMhotmail.comwrote:
>[...]
Now what I want to is to find a specific numeric value at the beginning
of
the line. For example in the above few lines the numeric values are
52,96,100,150,153,159,168,210,569,1000,1010 (these values are always
sorted). Suppose I have to search 168. Since the values are sorted I can
use
the binary search technique to search a specific numeric value. But for
that
I will have to read all the numeric values sequentially into some array
in
memory. The other way is sequential search i.e to read line by line and
fetch the numeric value before [space] and compare it with required one.

What is the best possible way to perform searching in the above mentioned
file format.

How many times do you need to perform this search?

If it's just once, then I think you could take advantage of the fact that
the lines are all _almost_ the same length and do a binary search on the
file itself. You'll have to make a guess, then search for the
previous/next end-of-line, then check the numeric value at the beginning
of that line. That's not as efficient as it could be if you had identical
line lengths, but compared to scanning through a 2GB file it should still
be plenty fast.

If you're going to perform the search on a regular basis, I'd recommend
creating an index for the file. Scan it once, storing just the initial
numeric value and a byte offset within the file representing where that
record's line starts. Then later you can do your binary search on the
index instead of the file. This is, I think, what you're suggesting when
you wrote "read all the numeric values sequentially into some array"; you
didn't mention also storing the file offsets for each record, but
hopefully you realize that's implied since otherwise keeping the values in
an array wouldn't be useful.

Of course, keep in mind that this is the sort of thing that databases are
really good at. If you have a way to redirect the data into a database
rather than keeping it in this huge text file, I think that would be the
best way to handle it.

Pete

Mar 15 '08 #3

On Sat, 15 Mar 2008 02:36:29 -0700, Ahmad Jalil Qarshi
<ah*********@SPAMhotmail.com> wrote:
> Thanks Peter & Jon,
>
> I have to read the file on user request. The problem is that I can't
> create an index, as the format of the file is fixed. So I think the only
> option left is binary search.
Well, a linear scan is always an option. Whether that performs well
enough to justify avoiding the work of implementing a binary search,
that's up to you or whoever's making the decision about how your time is
best used.

If the text file uses an encoding with fixed-size characters, I think that a
binary search isn't actually going to be very hard to implement. But as Jon
points out, if you have to support multiple encodings and/or encodings with
variable-length characters (e.g. UTF-8), then it gets more complicated (a
little... I'm no character-encoding expert, but I'm under the impression that
in any of the encodings, finding an end-of-line marker while scanning
backwards would not be difficult; an LF or CR/LF pair isn't a valid sequence
in any other character's data, is it?).
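
In ASCII or UTF-8, at least, that holds for LF: the byte 0x0A never appears
inside a multi-byte UTF-8 sequence, so scanning backwards for it is safe. As
a rough illustration (Java, made-up names; it assumes the file is ASCII or
UTF-8 text):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class LineStart {
        // Given an arbitrary byte position, back up to the start of the line
        // that contains it.  Works for ASCII and UTF-8 text because the byte
        // 0x0A ('\n') never occurs inside a multi-byte UTF-8 sequence.
        public static long startOfLineContaining(RandomAccessFile f, long pos)
                throws IOException {
            long p = pos;
            while (p > 0) {
                f.seek(p - 1);
                if (f.read() == '\n') {
                    return p;          // the byte before p ends the previous line
                }
                p--;
            }
            return 0;                  // we are inside the very first line
        }
    }

Combined with the seek-to-midpoint idea above, this lets the binary search
land anywhere in the file and still recover a clean line boundary.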

In either case, it's certainly possible. The question is what's the best
use of your time.

I still wonder at the requirement that this has to be done "on demand",
without an opportunity to index the file. That's an awful lot of data to
be searched just once. But if "the powers that be" deem it more sensible to
make you implement a binary search on a variable-line-length text file
than to fix the problem space to better incorporate an indexing- or
database-based solution, I guess that's what you're stuck with.

I have to admit: if these searches happen infrequently enough that indexing the file
doesn't make sense, I wonder if they don't also happen infrequently enough
that just scanning linearly through the file wouldn't be okay. Why does
this need to be fast?
> Let me explain one more thing which I missed in my previous post. The
> numeric value before the [space] can be repeated thousands of times
> consecutively. [...]
>
> Now suppose I have to find the first occurrence of the numeric value 2345.
> In that case, will binary search help or not?
It will help just as much as it would in any other situation where a
binary search can be used. Binary search doesn't imply unique elements,
but if you have duplicated elements that does mean that you have to do
some extra work to scan and find the first matching element, since the
first element the search finds may not be the actual first element in the
sequence.
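
As a rough illustration of that extra work (Java, made-up names; it assumes
ASCII/UTF-8 text with non-empty lines, and that a prior search has already
given you the start offset of some matching line):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FirstOccurrence {
        // Given the start offset of *some* line whose leading value equals key,
        // walk backwards one line at a time while the preceding line has the
        // same key, and return the offset of the first such line.
        public static long firstLineWithKey(RandomAccessFile f, long lineStart, long key)
                throws IOException {
            while (lineStart > 0) {
                long prevStart = startOfPreviousLine(f, lineStart);
                f.seek(prevStart);
                String prev = f.readLine();
                long prevKey = Long.parseLong(prev.substring(0, prev.indexOf(' ')));
                if (prevKey != key) {
                    break;             // the line before this one has a smaller value
                }
                lineStart = prevStart; // same value: keep walking backwards
            }
            return lineStart;
        }

        // Find the start of the line that ends just before 'lineStart'.
        private static long startOfPreviousLine(RandomAccessFile f, long lineStart)
                throws IOException {
            long p = lineStart - 2;    // skip the '\n' that ends the previous line
            while (p > 0) {
                f.seek(p);
                if (f.read() == '\n') {
                    return p + 1;
                }
                p--;
            }
            return 0;                  // the previous line is the first in the file
        }
    }

If the runs of equal values are very long, it can be cheaper to phrase the
binary search itself as a lower-bound search (narrow only on "probe value >=
target", as in the earlier sketch), which lands on the first matching line
directly instead of walking backwards through thousands of duplicates.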

Pete
Mar 15 '08 #4
