473,216 Members | 1,302 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,216 software developers and data experts.

reading large text files in reverse - optimization doubts

Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse. The number of characters I want to read at a time is
insignificant. I'm confused as to how best to do it. Upon browsing
through this group and other sources on the web, it seems that there
are many ways to do it.
Some suggest that simply fseek'ing to 8K bytes before the end of file,
and going backwards is the way. In this case, am I guaranteed best
results if I fseek 8192 bytes behind my last position simply because
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?

Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay.

Any ideas on how to optimize for speed? I'll head off to do some
profiling on Linux and WinXP.. but it needs to be as portable+fast on
as many OS's as possible - so I need you guys' expertise :)

Thanks in advance,
Raj

Nov 15 '05 #1
6 6269
Rajorshi Biswas wrote:
Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse. The number of characters I want to read at a time is
insignificant. I'm confused as to how best to do it. Upon browsing
through this group and other sources on the web, it seems that there
are many ways to do it.
This is not, strictly speaking, a C question. There's nothing in the
language to answer a question like this.
Some suggest that simply fseek'ing to 8K bytes before the end of file,
and going backwards is the way. In this case, am I guaranteed best
results if I fseek 8192 bytes behind my last position simply because
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?
It might be. Or not. If your system page size happens to be 8K, it might
be a good size to read.
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?
Just because. I sincerely doubt anyone has calculated that these numbers
are optimal. Clearly you'll want a buffer "as large as possible", but
not so large that you're wasting I/O time reading data you're not
touching for a long time. Hence a few system pages worth of data seems
reasonable.
Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay.
Yes, Win32 has memory-mapped files. To make matters easier, however,
there is of course no function called "mmap". I recommend using MinGW
(Cygwin's too slow), which will allow you to pretend you're using a
POSIX system, more or less. Otherwise you'll have to write your own
wrappers around CreateFileMapping() and MapViewOfFile[Ex]().
Any ideas on how to optimize for speed? I'll head off to do some
profiling on Linux and WinXP.. but it needs to be as portable+fast on
as many OS's as possible - so I need you guys' expertise :)

Reading a file in reverse will be one of the most inefficient operations
on any OS, I'd wager. If you use the standard I/O library, you're at the
mercy of both the platform and the C library used -- they can make a lot
of difference individually and in tandem. The "best" buffer size,
whether fread()ing or mmap()ing, will depend on the OS, the library, the
file, the time needed to process the data read, the system load...
Profiling is indeed the way to go here. It's too easy to predict
"reasonable" outcomes that will turn out to be entirely wrong in practice.

Clean code that uses either fseek()/fread() or mmap() should run
"reasonably well" on most platforms. The only thing that might trip you
up is an implementation that hits pathological performance when reading
in reverse, but I'd expect those to be rare.

S.
Nov 15 '05 #2
"Rajorshi Biswas" <ra*************@gmail.com> wrote:
# Hi folks,
# Suppose I have a large (1 GB) text file which I want to read in
# reverse. The number of characters I want to read at a time is
# insignificant. I'm confused as to how best to do it. Upon browsing
# through this group and other sources on the web, it seems that there
# are many ways to do it.
# Some suggest that simply fseek'ing to 8K bytes before the end of file,
# and going backwards is the way. In this case, am I guaranteed best
# results if I fseek 8192 bytes behind my last position simply because
# BUFSIZ = 8192 ? In other words is 8192 an optimal number ?

stdio implementations often don't like fseek and tend to discard
the entire buffer and refill after a fseek. Theoretically you
can seek to the end, fseek back one character and read it, fseek
back two characters and read the second last, etc, all the way
to the beginning. However it's likely stdio will be reading in
as much of BUFSIZ as possible with each get character, so you
work around buffering instead of with it.

# Others suggest that mmap'ing some 20-30K at a time into RAM is the way
# to go... (but why 20-30K) ?

Not all virtual memories can map in a gigabyte.

# Any ideas on how to optimize for speed? I'll head off to do some
# profiling on Linux and WinXP.. but it needs to be as portable+fast on
# as many OS's as possible - so I need you guys' expertise :)

You can turn off bufferring with stdio, and then do your own
page frame management. Same thing as mmap, but slower and a
greater nuisance.

You can read in blocks front to back, reverse the block in
memory, and then write the blocks to a secondary file back
to front. You then have a transpose of the original file
that you can read in the forward direction and make the
kernel and stdio happier.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Wow. A sailboat.
Nov 15 '05 #3
On 24 Sep 2005 06:07:46 -0700, in comp.lang.c , "Rajorshi Biswas"
<ra*************@gmail.com> wrote:
Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse.
Actually, I'd recommend you review why on earth you have text files
that large. Logfiles perhaps? Some redesign of the creating app is in
order, to break its logs into sensible sizes.
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?
Yes and no. It depends completely on your h/w. 8K may be a memory
page, or a disk sector size or something, in which case maybe its
optimal. .
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?
Same answer.
Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay
mmap is be a nonstandard function so theres no knowing. If you can
find a compiler portable across your required h/w and osen, you may
also have a portable version of this function
Any ideas on how to optimize for speed?


Read up on the three rules of optimization.
--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
Nov 15 '05 #4
In article <43***********************@news.xs4all.nl>
Skarmander <in*****@dontmailme.com> wrote:
Reading a file in reverse will be one of the most inefficient operations
on any OS, I'd wager.


That would depend on the OS and file system. A file system
implemented via a B-tree with sideways links (B+ or "B-link" or
any of those variations) could be read backwards quite quickly.
A file system designed for use on a DVR/PVR might also use a doubly
linked list, something like the old DOS FAT file system (i.e.
cluster or extent based) except with links going both directions.

The stdio I wrote for BSD (found in 4.4BSD derivatives) also
attempts to optimize fseek() calls on files opened in read-only
mode, so that the sequence:

/* ensure that file is open and contains at least 1 byte */

fseek(fp, -1L, SEEK_END);
for (;;) {
c = getc(fp);
/* do something with c, e.g., putc(c) */
if (ftell(fp) == 0)
break;
fseek(fp, -2L, SEEK_CUR);
}

behaves in a reasonably fast manner, reading one underlying file
system block at a time and otherwise working entirely within the
stdio buffer.

Reading the blocks backwards is not particularly hard on a Unix-like
file system (with inodes containing direct and indirect blocks),
although it does not play well with the OS's read-ahead mechanism.
It *is* particularly hard on the DOS FAT file system, which --
through a design aimed at saving space on 360 kilobyte floppies --
makes it impossible to find block N of any given file without first
finding blocks zero through N-1 inclusive. Modern file systems
(NTFS, ext2, ext3, Reiser, etc) are as good as Version 7 Unix
though, having finally caught up to the 1970s. :-) Some are even
worth of the 1980s, handling files larger than four gigabytes! :-)

(To explain the amusement here, I should add that UFS -- the 4.2BSD
file system that came out in 1983 or 1984 or so -- did all this,
way back then. The newer Linux file systems do have other merits,
but it does seem odd that only recently have other desktop OSes
acquired capabilities we had back in the mid-1980s.)
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (4039.22'N, 11150.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Nov 15 '05 #5
Hmm...
Thanks for the inputs guys. Well I tried an implementation with
fseek/fread with a read chunk size of 8k, also with 16k.. both are
reasonably fast.. at least on Linux. One related doubt I had was
whether fseek and mmap have any advantage over fseek and fread. For the
sake of portability, I discarded the lseek/mmap route, but was just
curious whether mmap would provide any better results.

<ducking for cover>
.... so - profiling is the only way to find out, eh ?
</ducking for cover>

Nov 15 '05 #6
Chris Torek wrote:
In article <43***********************@news.xs4all.nl>
Skarmander <in*****@dontmailme.com> wrote:
Reading a file in reverse will be one of the most inefficient operations
on any OS, I'd wager.

That would depend on the OS and file system. A file system
implemented via a B-tree with sideways links (B+ or "B-link" or
any of those variations) could be read backwards quite quickly.
A file system designed for use on a DVR/PVR might also use a doubly
linked list, something like the old DOS FAT file system (i.e.
cluster or extent based) except with links going both directions.


In the days of 9 track tape, maybe even 7 track tape, it was not
uncommon for tape drives to have the ability to read a tape
backwards. The data was loaded into memory from high address down
to low address, such that the block came out normally in memory.

This ability has been lost in many of the currently popular
drives, especially in helical scan drives.

It was mostly useful in the case of external sorts on tape,
where it saved a rewind between writing the tape and reading
the data back for the next pass.

-- glen

Nov 15 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
by: Jay | last post by:
I have a very large text file (being read by a CGI script on a web server), and I get memory errors when I try to read the whole file into a list of strings. The problem is, I want to read the file...
20
by: sahukar praveen | last post by:
Hello, I have a question. I try to print a ascii file in reverse order( bottom-top). Here is the logic. 1. Go to the botton of the file fseek(). move one character back to avoid the EOF. 2....
50
by: Michael Mair | last post by:
Cheerio, I would appreciate opinions on the following: Given the task to read a _complete_ text file into a string: What is the "best" way to do it? Handling the buffer is not the problem...
4
by: Matthew Crema | last post by:
Hello, Say I have 1000 text files and each is a list of 32768 integers. I have written a C program to read this data into a large matrix. I am using fopen in combination with fscanf to read...
1
by: Hutty | last post by:
I have a program that open text files and compares them, however, when reading files larger than 500kb the programs seems to bomb. I get re-directed to "page not found". Any idea how to get...
2
by: gauravkhanna | last post by:
Hi All I need some help for the below problem: Scenario We need to send large binary files (audio file of about 10 MB or so) from the client machine (.Net Windows based application, located...
10
by: zahy[dot]bnaya[At]gmail[dot]com | last post by:
Hi, I am trying to come up with a c style string reverser, I want it to take 1 argument Altough I would never do this in real life. Is there a way to do it? I wrote this function that fails : ...
9
by: Bill Woessner | last post by:
Suppose I have a structure, foo, which is a POD. I would like to read and write it to disk as follows: std::ofstream outs; foo bar; outs.write(reinterpret_cast<char*>(&bar), sizeof(foo));...
1
by: akalmand | last post by:
Hi there, I am writing a code to read some data from the text files. The number of text files is not fixed and could be more that 15. the length of each file is large... close to 100,000 on an...
1
isladogs
by: isladogs | last post by:
The next online meeting of the Access Europe User Group will be on Wednesday 6 Dec 2023 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, Mike...
0
by: veera ravala | last post by:
ServiceNow is a powerful cloud-based platform that offers a wide range of services to help organizations manage their workflows, operations, and IT services more efficiently. At its core, ServiceNow...
0
by: VivesProcSPL | last post by:
Obviously, one of the original purposes of SQL is to make data query processing easy. The language uses many English-like terms and syntax in an effort to make it easy to learn, particularly for...
3
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 3 Jan 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). For other local times, please check World Time Buddy In...
0
by: mar23 | last post by:
Here's the situation. I have a form called frmDiceInventory with subform called subfrmDice. The subform's control source is linked to a query called qryDiceInventory. I've been trying to pick up the...
0
by: abbasky | last post by:
### Vandf component communication method one: data sharing ​ Vandf components can achieve data exchange through data sharing, state sharing, events, and other methods. Vandf's data exchange method...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 7 Feb 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:30 (7.30PM). In this month's session, the creator of the excellent VBE...
0
Git
by: egorbl4 | last post by:
Скачал я git, хотел начать настройку, а там вылезло вот это Что это? Что мне с этим делать? ...
0
by: MeoLessi9 | last post by:
I have VirtualBox installed on Windows 11 and now I would like to install Kali on a virtual machine. However, on the official website, I see two options: "Installer images" and "Virtual machines"....

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.