parsing an ifstream to get some specific text

Hi,
I have some ASCII files that contain formatted text, and I want to
read only certain sections of a file.
To do that, I index the sections (each one starts with a .START line
in the file) by their position in the file.
Then, for a particular section, I parse only that section.

The file is something like,

... DATAS
....
.START
...
....
.START
...
.....
etc.
I need to parse the data between two .START markers only when that
section is needed. I don't load all of the data into memory at once, as
the file is big, 4 MB to 20 MB in size.
To mark all of the .START lines I parse the file once, only checking for
.START and recording its position. When the detailed data is actually
needed, I seek to the recorded position and parse from there.

For quick parsing, I do

    std::string currentLine;
    while (std::getline(_stream, currentLine)) {
        // utils::trim removes whitespace from the front and back.
        currentLine = utils::trim(currentLine);
        if (currentLine == ".START") {
            // Position just after the .START line, where the section body begins.
            _pos.push_back(_stream.tellg());
        }
    }

But this code runs slower than I expected. Can anything be done better
here, such as some buffering in the stream?

abir

Jan 8 '07 #1

The input stream already does its own buffering. Your operating system
probably caches files as well, so buffering should not be the problem.

I have some ideas which could help:
- You should parse the file in one pass. It is faster than two-pass
parsing, and it also lets you read data from standard input or pipes.
- Where do you store the positions (what is the type of _pos)? It should
be a list, queue or stack, not a vector.
- You can treat the input as a binary file (on many systems this is no
different from a text file, but on Windows, for example, it is), use the
read method to fill a buffer and search for ".START" on your own. [In
fact I do not believe it will make a big difference.]
- Although any assumption like "something will probably not exceed xyz
MB of memory" is wrong, you can load the data into memory and process it
there (20 MB is not that much unless you are working on an embedded
system).
- You can use a system-dependent solution: a memory-mapped file.
Jan 8 '07 #2

Ondra Holub wrote:
> - You should parse the file in one pass. It is faster than two-pass
> parsing, and it also lets you read data from standard input or pipes.
> - Where do you store the positions (what is the type of _pos)? It should
> be a list, queue or stack, not a vector.

_pos is a std::vector<pos_type>. Again, pos_type is usually int, so _pos
can also be treated as a std::vector<int>.
I am using a pseudo two-pass parse. In the first pass I only mark the
location (in bytes, as returned by tellg()) of each .START. The second
pass is only needed when someone wants to parse the data between two
.START markers, so I can quickly go to the marked location using seekg().
Usually, with an XML type of file, I can quickly jump to a particular
element without going into the details of the other elements. Here the
format is somewhat different, so I am making a positional reference (in
bytes) for the sections marked by .START and storing them for later
parsing.
Here the I/O is done twice, but loading a whole 20 MB file is even
slower. And the second I/O operation may not cover the whole file; for
example, I may parse only one section out of 20 sections marked by
.START.

> - You can treat the input as a binary file (on many systems this is no
> different from a text file, but on Windows, for example, it is), use the
> read method to fill a buffer and search for ".START" on your own. [In
> fact I do not believe it will make a big difference.]

My primary system is Windows :(
I have an estimate, in bytes, of how much buffer I may need to reach the
next .START. Can the buffer size be set for the stream, or is it entirely
implementation dependent / OS dependent?

> - Although any assumption like "something will probably not exceed xyz
> MB of memory" is wrong, you can load the data into memory and process it
> there (20 MB is not that much unless you are working on an embedded
> system).

This is what I want, but in an automated way, i.e. instead of loading a
fixed number of bytes into a buffer myself, let the stream load the bytes
under the hood. As you mentioned, it may be doing that already. I only
want to control the size.

> - You can use a system-dependent solution: a memory-mapped file.

I don't know of any C++ library for it. As far as I know, Boost does not
provide a memory-mapped file either.
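
The pseudo two-pass approach described in this post could look roughly like the sketch below. The function names index_sections and read_section are invented for illustration, and trim is only a stand-in for utils::trim, whose real behaviour is not shown in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for utils::trim (assumed behaviour: strip leading and
// trailing whitespace).
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return std::string();
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Pass 1: remember the stream position right after every .START line.
std::vector<std::streampos> index_sections(std::istream& in) {
    std::vector<std::streampos> pos;
    std::string line;
    while (std::getline(in, line)) {
        if (trim(line) == ".START") {
            const std::streampos here = in.tellg();
            // tellg() reports -1 if the marker was the very last line
            // without a trailing newline; such an empty section is skipped.
            if (here != std::streampos(-1)) pos.push_back(here);
        }
    }
    return pos;
}

// Pass 2: seek to section n and collect its lines up to the next
// .START (or the end of the stream).
std::vector<std::string> read_section(std::istream& in,
                                      const std::vector<std::streampos>& pos,
                                      std::size_t n) {
    std::vector<std::string> lines;
    in.clear();            // pass 1 leaves eofbit/failbit set
    in.seekg(pos.at(n));   // at() throws if n is out of range
    std::string line;
    while (std::getline(in, line)) {
        if (trim(line) == ".START") break;
        lines.push_back(line);
    }
    return lines;
}
```

For a real file the stream would be a std::ifstream. On Windows it is probably safer to open it with std::ios::binary, so that the positions returned by tellg() are plain byte offsets; the trim step then also removes the '\r' left at the end of each line.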

Jan 8 '07 #3
toton wrote:
>> - Where do you store the positions (what is the type of _pos)? It should
>> be a list, queue or stack, not a vector.
> _pos is a std::vector<pos_type>. Again, pos_type is usually int, so _pos
> can also be treated as a std::vector<int>.

Yes, a vector works from a functional point of view, but it may be less
efficient for this kind of use: a vector holds a preallocated block of
memory, and when that is exceeded it must reallocate, which may mean
copying the items from the old block to the new one. A list does not need
to do that. That's why I suggested not using a vector.
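
The reallocation described here can be observed by watching capacity() while pushing elements. The sketch below is not part of the original advice; it only illustrates the effect, and also shows that reserve() is one way to avoid it when a rough upper bound on the number of sections is known. The function name count_reallocations is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pushes n values into v and counts how often the capacity changed,
// i.e. how often the vector had to move to a new block of memory.
std::size_t count_reallocations(std::vector<long>& v, std::size_t n) {
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = v.capacity();
        v.push_back(static_cast<long>(i));
        if (v.capacity() != before) ++reallocations;
    }
    return reallocations;
}
```

Calling it on an empty vector reports at least one reallocation, while a vector on which reserve(n) was called first reports none for the same n pushes.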

> I am using a pseudo two-pass parse. In the first pass I only mark the
> location (in bytes, as returned by tellg()) of each .START. The second
> pass is only needed when someone wants to parse the data between two
> .START markers, so I can quickly go to the marked location using seekg().
> Usually, with an XML type of file, I can quickly jump to a particular
> element without going into the details of the other elements.

It is similar to parsing XML. XML is usually parsed either with a
DOM-like parser or with a SAX parser.

DOM (typically) loads the whole document into memory and then works with
it. You can then simply access any element, but all the data is held in
memory. It is simpler to work with, but less efficient for large
documents.

SAX (typically) reads the document and, while reading, calls methods
that process the data just read. It is not as simple to use as DOM, but
it is more efficient for large documents.
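
Applied to the .START format, a SAX-like reader might be sketched as below. Everything here is hypothetical: SectionHandler, parse_stream and OneSection are names invented for the example, and trim again stands in for utils::trim.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for utils::trim (assumed to strip surrounding whitespace).
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return std::string();
    return s.substr(first, s.find_last_not_of(ws) - first + 1);
}

// Callback interface in the spirit of SAX: the reader pushes events to it.
class SectionHandler {
public:
    virtual ~SectionHandler() {}
    virtual void start_section(std::size_t index) = 0;
    virtual void line(std::size_t index, const std::string& text) = 0;
};

// Single pass over the stream; nothing is stored by the reader itself.
void parse_stream(std::istream& in, SectionHandler& handler) {
    std::string text;
    std::size_t count = 0;  // number of .START markers seen so far
    while (std::getline(in, text)) {
        if (trim(text) == ".START") {
            handler.start_section(count++);
        } else if (count > 0) {
            handler.line(count - 1, text);
        }
        // Lines before the first .START are ignored in this sketch.
    }
}

// Example handler that keeps only the lines of one wanted section.
class OneSection : public SectionHandler {
public:
    explicit OneSection(std::size_t wanted) : wanted_(wanted) {}
    void start_section(std::size_t) {}
    void line(std::size_t index, const std::string& text) {
        if (index == wanted_) lines.push_back(text);
    }
    std::vector<std::string> lines;
private:
    std::size_t wanted_;
};
```

This version reads the whole stream once and never seeks, so it also works on pipes or standard input. The trade-off is that every request for another section means another full pass, unless the handler stores what it needs.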

> Here the format is somewhat different, so I am making a positional
> reference (in bytes) for the sections marked by .START and storing them
> for later parsing.
> Here the I/O is done twice, but loading a whole 20 MB file is even
> slower. And the second I/O operation may not cover the whole file; for
> example, I may parse only one section out of 20 sections marked by
> .START.
>> - You can treat the input as a binary file (on many systems this is no
>> different from a text file, but on Windows, for example, it is), use the
>> read method to fill a buffer and search for ".START" on your own. [In
>> fact I do not believe it will make a big difference.]
> My primary system is Windows :(
> I have an estimate, in bytes, of how much buffer I may need to reach the
> next .START. Can the buffer size be set for the stream, or is it entirely
> implementation dependent / OS dependent?

You could work with filebuf directly (or implement your own class derived
from streambuf), but I do not think it would be useful (too much effort
for little effect).

If you do not use C files (FILE* from stdio.h or cstdio), you should
disable the synchronization of C++ iostreams with FILE* by calling
std::ios_base::sync_with_stdio(false). If you do that, you take on the
responsibility that nobody uses FILE* for your files (not even a library).
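
On the question of controlling the buffer size: the standard hook is pubsetbuf on the stream's filebuf. The sketch below assumes an implementation that honours a buffer supplied before the file is opened; the standard leaves the effect largely implementation-defined, so this needs to be checked and measured on the target compiler. The function name count_lines_buffered is made up for the example. Note also that std::ios_base::sync_with_stdio(false) is specified for the standard streams (std::cin, std::cout and so on), so it may have no effect on a std::ifstream.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Counts the lines of a file while asking the stream to use a buffer
// of the given size.
std::size_t count_lines_buffered(const char* path, std::size_t buffer_size) {
    assert(buffer_size > 0);

    // Declared before the stream so that it outlives it.
    std::vector<char> buffer(buffer_size);

    std::ifstream in;
    // Hand the buffer over before open(); doing it later may be ignored.
    in.rdbuf()->pubsetbuf(&buffer[0],
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path, std::ios::in | std::ios::binary);

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) ++lines;
    return lines;
}
```

The same getline loop as in the original question can be dropped in where the lines are counted, which makes it easy to time a few buffer sizes against each other.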

>> - Although any assumption like "something will probably not exceed xyz
>> MB of memory" is wrong, you can load the data into memory and process it
>> there (20 MB is not that much unless you are working on an embedded
>> system).
> This is what I want, but in an automated way, i.e. instead of loading a
> fixed number of bytes into a buffer myself, let the stream load the bytes
> under the hood. As you mentioned, it may be doing that already. I only
> want to control the size.
>> - You can use a system-dependent solution: a memory-mapped file.
> I don't know of any C++ library for it. As far as I know, Boost does not
> provide a memory-mapped file either.

There is no such standard C++ library. You have to use the API of your OS
or a library that supports many platforms and wraps the
platform-dependent code in its functions (for example ACE).

Jan 8 '07 #4
