
Dealing with large text files

Hello all:

I have a situation where I need to read a text file containing several
million rows (insurance eligibility files). In addition to sequential
operations, I also need to support a 'seek' on the file. The file itself is
not in a fixed-field format, and each line can be a different length. I
obviously don't want to simply start at the top of the file and read lines
until I hit the requested index.

What other options do I have?
Nov 14 '07 #1
6 Replies


You can open a file stream that is seekable, but you haven't specified how
you want to "seek" in the file. How do you know what you're looking for?

--
HTH,

Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP

"DCW" <DC*@discussions.microsoft.comwrote in message
news:D1**********************************@microsof t.com...
Hello all:

I have a situation where I need to read a text file containing several
million rows (insurance eligibility files). In addition to sequential
operations, I also need to support a 'seek' on the file. The file itself is
not in a fixed-field format, and each line can be a different length. I
obviously don't want to simply start at the top of the file and read lines
until I hit the requested index.

What other options do I have?

Nov 14 '07 #2

Oops ... sorry.

This is for a library that will have a few overloads of the seek method. A
typical seek might involve the calling context asking for the line at index
position 567778. I've done similar things in the past, but I had the luxury
of fixed-size file formats where I could determine the length of each line
and position the file pointer directly (LENGTH_PER_LINE * INDEX). Of course,
that works with fixed-length lines but will be a problem when the file fields
are 'delimited' and thus variable length.
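
For what it's worth, the fixed-length case really is just arithmetic on the
offset; a minimal C# sketch, assuming every line occupies exactly lineLength
bytes including the line terminator (the names here are illustrative):

using System.IO;
using System.Text;

class FixedLengthSeek
{
    // Every line occupies exactly lineLength bytes (data plus terminator),
    // so line N starts at byte N * lineLength.
    static string ReadLineAt(string path, long index, int lineLength)
    {
        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            stream.Seek(index * lineLength, SeekOrigin.Begin);

            byte[] buffer = new byte[lineLength];
            int read = stream.Read(buffer, 0, lineLength);

            // Drop the trailing CR/LF before returning the line itself.
            return Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n');
        }
    }
}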

I am aware of seekable file streams but determining how to position the file
pointer is the biggest issue imo. There is a little more to it than I've
described but this is the main issue.

Prior to performing the seek operation I do have some info about the file.
Namely, I do a pre-parse that determines the count of lines (although not the
length of each line of course :)).

I'm thinking a possible solution might be to create a collection during
the pre-parse that stores each new line's starting byte position in a generic
dictionary, so when the caller seeks I can look up the requested index, jump
to that starting position, and read until I hit the next dictionary item's
starting position.

Any thoughts?
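
Roughly what I have in mind, as a C# sketch (names are illustrative; a
List<long> stands in for the dictionary because the line indexes are dense,
but the idea is the same):

using System.Collections.Generic;
using System.IO;
using System.Text;

// Rough sketch: pre-parse the file once to record the byte offset where each
// line starts, then seek straight to a line on request.
public class LineIndexedFile
{
    private readonly string _path;
    private readonly List<long> _lineOffsets = new List<long>();

    public LineIndexedFile(string path)
    {
        _path = path;
        BuildIndex();
    }

    public int LineCount
    {
        get { return _lineOffsets.Count; }
    }

    // Pre-parse: walk the file byte by byte, noting where every line begins.
    private void BuildIndex()
    {
        using (FileStream stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            _lineOffsets.Add(0);                  // line 0 starts at byte 0
            long position = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n' && position < stream.Length)
                    _lineOffsets.Add(position);   // next line starts just past the '\n'
            }
        }
    }

    // Seek: jump to the recorded offset and read up to the next line's offset.
    public string ReadLineAt(int index)
    {
        long start = _lineOffsets[index];
        long end = (index + 1 < _lineOffsets.Count)
            ? _lineOffsets[index + 1]
            : new FileInfo(_path).Length;

        using (FileStream stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            stream.Seek(start, SeekOrigin.Begin);
            byte[] buffer = new byte[end - start];
            stream.Read(buffer, 0, buffer.Length);
            return Encoding.UTF8.GetString(buffer).TrimEnd('\r', '\n');
        }
    }
}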


"Kevin Spencer" wrote:
You can open a file stream that is seekable, but you haven't specified how
you want to "seek" in the file. How do you know what you're looking for?

--
HTH,

Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP
Nov 14 '07 #3


"DCW" <DC*@discussions.microsoft.comwrote in message
news:4B**********************************@microsof t.com...
Opps ... sorry.

This is for a library that will have a few overloads of the seek
method. A
typical seek might involve the calling context asking for the line at
index
position 567778. I've done similar things like this in the past but
I've had
the luxury of fixed-size file formats where I can determine the length
of
each line and seek using the position of the file pointer ((NUM_LINES
*
LENGTH_PER_LINE) * INDEX). Of course, this concept works with the
fixed
length lines but will be a problem when the file fields are
'delimited' and
thus variable length.

I am aware of seekable file streams but determining how to position
the file
pointer is the biggest issue imo. There is a little more to it than
I've
described but this is the main issue.

Prior to performing the seek operation I do have some info about the
file.
Namely, I do a pre-parse that determines the count of lines (although
not the
length of each line of course :)).

I'm thinking a possible solution might be to create a collection
during
pre-parse that stores in each new line's byte position in a generic
dictionary so when the caller seeks I can just go to the position
requested,
pull the starting position from the dictionary and seek till I hit the
next
dictionary item's starting position.
Basically you need to create your own index into the file: for each line in
the file, record an (index, offset) pair.
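
Using an index built that way (like the sketch earlier in the thread), the
lookup side might look like this; the path and index here are just examples:

using System;

class Demo
{
    static void Main()
    {
        // Path is hypothetical; LineIndexedFile is the sketch from earlier in the thread.
        LineIndexedFile eligibility = new LineIndexedFile(@"C:\data\eligibility.txt");

        Console.WriteLine(eligibility.LineCount);          // lines counted during the pre-parse
        Console.WriteLine(eligibility.ReadLineAt(567778)); // the line at the requested index
    }
}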
Nov 14 '07 #4

DCW wrote:
[...]
I'm thinking a possible solution might be to create a collection during
the pre-parse that stores each new line's starting byte position in a generic
dictionary, so when the caller seeks I can look up the requested index, jump
to that starting position, and read until I hit the next dictionary item's
starting position.

Any thoughts?
The solution you suggest here strikes me as being the simplest and
probably the most efficient way of solving your problem.

Chris.
Nov 14 '07 #5

Thanks, guys. I was really looking for confirmation, but if someone had a
novel approach I'd never thought of, that would have been welcome too. Either
way, I do appreciate the responses.

D
"Chris Shepherd" wrote:
DCW wrote:
[...]
I'm thinking a possible solution might be to create a collection during
the pre-parse that stores each new line's starting byte position in a generic
dictionary, so when the caller seeks I can look up the requested index, jump
to that starting position, and read until I hit the next dictionary item's
starting position.

Any thoughts?

The solution you suggest here strikes me as being the simplest and
probably the most efficient way of solving your problem.

Chris.
Nov 14 '07 #6

I would import the thing into SQL Server (SQL Compact, SQL Express,
whatever..) and do your operations on that. When you're done, just drop the
database.

You could import the data through code, or via an SSIS package.

It's going to (obviously) depend on your use cases. Importing the file will
take a while, so it comes down to how many search operations you're going to
be doing versus the time to import the file into SQL.
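
A rough sketch of what that might look like in code; the connection string,
table name, and column names are illustrative, and for millions of rows you
would want SqlBulkCopy or SSIS rather than the row-by-row inserts shown here:

using System.Data;
using System.Data.SqlClient;
using System.IO;

class SqlImportSketch
{
    const string ConnectionString =
        @"Data Source=.\SQLEXPRESS;Initial Catalog=EligibilityScratch;Integrated Security=True";

    // Load each line into a table keyed by its line index.
    static void Import(string path)
    {
        using (SqlConnection conn = new SqlConnection(ConnectionString))
        {
            conn.Open();

            new SqlCommand(
                "CREATE TABLE EligibilityLines (LineIndex INT PRIMARY KEY, LineText NVARCHAR(MAX))",
                conn).ExecuteNonQuery();

            using (StreamReader reader = new StreamReader(path))
            {
                SqlCommand insert = new SqlCommand(
                    "INSERT INTO EligibilityLines (LineIndex, LineText) VALUES (@i, @t)", conn);
                insert.Parameters.Add("@i", SqlDbType.Int);
                insert.Parameters.Add("@t", SqlDbType.NVarChar);

                int index = 0;
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    insert.Parameters["@i"].Value = index++;
                    insert.Parameters["@t"].Value = line;
                    insert.ExecuteNonQuery();
                }
            }
        }
    }

    // The "seek" then becomes a keyed lookup.
    static string ReadLineAt(int index)
    {
        using (SqlConnection conn = new SqlConnection(ConnectionString))
        {
            conn.Open();
            SqlCommand select = new SqlCommand(
                "SELECT LineText FROM EligibilityLines WHERE LineIndex = @i", conn);
            select.Parameters.AddWithValue("@i", index);
            return (string)select.ExecuteScalar();
        }
    }
}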

--
Chris Mullins

"DCW" <DC*@discussions.microsoft.comwrote in message
news:D1**********************************@microsof t.com...
Hello all:

I have a situation where I need to read a text file containing several
million rows (insurance eligibility files). In addition to sequential
operations, I also need to support a 'seek' on the file. The file itself is
not in a fixed-field format, and each line can be a different length. I
obviously don't want to simply start at the top of the file and read lines
until I hit the requested index.

What other options do I have?

Nov 14 '07 #7
