parsing an ifstream to get some specific text

Hi,
I have some ASCII files containing formatted text, and I want to read
only certain sections of each file.
To do that, I index the sections (each one is marked by .START in the
file) by their position in the file.
Then, for a particular section, I parse only that section.

The file looks something like this:

... DATAS
....
.START
...
....
.START
...
.....
etc.
I need to parse the data between two .START markers, and only when that
section is needed. I don't load all of the data into memory at once,
because the files are big, 4 MB to 20 MB in size.
To mark all of the .START lines I scan the file once, checking only for
.START and recording each position. When the detailed data is actually
needed, I seek to the recorded position and parse from there.

For the quick indexing pass, I do:
while (_stream) {
    std::string currentLine;
    getline(_stream, currentLine);
    // utils::trim removes whitespace from the front and back
    currentLine = utils::trim(currentLine);
    if (currentLine == ".START") {
        // position just after the marker line
        _pos.push_back(_stream.tellg());
    }
}
But this code runs slower than I expect. Can anything be done better
here, such as some buffering in the stream?

abir

Jan 8 '07 #1

The input stream already buffers its reads, and your operating system
probably caches files as well, so buffering should not be the problem.

I have some ideas which could help:
- Parse the file in one pass. It is faster than two-pass parsing, and it
also lets you read the data from standard input or pipes.
- Where do you store the positions (what is the type of _pos)? It should
be a list, queue or stack, not a vector.
- You can treat the input as a binary file (on many systems this is no
different from a text file, but on Windows, for example, it is), use the
read method to fill a buffer, and search for ".START" on your own. [In
fact, I do not believe it will make a big difference.]
- Although any assumption like "something will probably not exceed xyz
MB of memory" is wrong, you can load the data into memory and process it
there (20 MB is not much unless you are working on an embedded system).
- You can use a system-dependent solution: a memory-mapped file.
Jan 8 '07 #2

Ondra Holub wrote:
> The input stream already buffers its reads, and your operating system
> probably caches files as well, so buffering should not be the problem.
>
> I have some ideas which could help:
> - Parse the file in one pass. It is faster than two-pass parsing, and it
> also lets you read the data from standard input or pipes.
> - Where do you store the positions (what is the type of _pos)? It should
> be a list, queue or stack, not a vector.
_pos is std::vector<pos_type>. Again, pos_type is usually int, so _pos
can also be treated as std::vector<int>.
I am using a pseudo two-pass approach. In the first pass I only record
the location (in bytes, as returned by tellg()) of each .START. The
second pass is needed only when someone wants to parse the data between
two .START markers, so I can jump straight to the recorded location
using seekg().
With an XML-like file I can usually jump quickly to a particular
element without going into the details of the other elements. Here the
format is somewhat different, so I build a positional reference (in
bytes) for the sections marked by .START and store it for later
parsing.
The I/O is done twice this way, but loading a whole 20 MB file is even
slower. And the second I/O pass may not cover the whole file; for
example, I may parse only one section out of 20 marked by .START.
> - You can treat the input as a binary file (on many systems this is no
> different from a text file, but on Windows, for example, it is), use the
> read method to fill a buffer, and search for ".START" on your own. [In
> fact, I do not believe it will make a big difference.]
My primary system is Windows :(
I have an estimate of how many bytes of buffer I need to reach the next
.START. Can that be set for the stream somehow, or is it entirely
implementation-dependent / OS-dependent?
> - Although any assumption like "something will probably not exceed xyz
> MB of memory" is wrong, you can load the data into memory and process it
> there (20 MB is not much unless you are working on an embedded system).
This is what I want, but in an automated way: instead of loading a fixed
number of bytes into a buffer myself, let the stream load the bytes
under the hood. As you mentioned, it may be doing that already; I only
want to control the size.
> - You can use a system-dependent solution: a memory-mapped file.
I don't know of any C++ library for it; Boost does not seem to provide
memory-mapped files either.

Jan 8 '07 #3
toton wrote:
>> - Where do you store the positions (what is the type of _pos)? It should
>> be a list, queue or stack, not a vector.
> _pos is std::vector<pos_type>. Again, pos_type is usually int, so _pos
> can also be treated as std::vector<int>.
Yes, a vector works from a functional point of view, but it may be less
efficient for this kind of use: a vector holds a preallocated block of
memory, and when that is exceeded it must reallocate, which may mean
copying the items from the old block to the new one. A list does not
need this. That's why I suggested not using a vector.
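
If the reallocation described above is the concern, a std::vector can also sidestep it by reserving capacity before the first push_back; growth is amortized constant time in any case. A minimal sketch, assuming a rough guess of the average section size is available (make_offset_table and both of its parameters are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: size the offset table once from the file size and a
// guessed average section size, so later push_back calls within that
// estimate do not reallocate.
std::vector<std::size_t> make_offset_table(std::size_t file_bytes,
                                           std::size_t approx_section_bytes) {
    std::vector<std::size_t> offsets;
    if (approx_section_bytes > 0) {
        offsets.reserve(file_bytes / approx_section_bytes + 1);
    }
    return offsets;
}
```

If the guess turns out too low, the vector simply grows as usual, so the estimate only needs to be in the right ballpark.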
> I am using a pseudo two-pass approach. In the first pass I only record
> the location (in bytes, as returned by tellg()) of each .START. The
> second pass is needed only when someone wants to parse the data between
> two .START markers, so I can jump straight to the recorded location
> using seekg().
> With an XML-like file I can usually jump quickly to a particular
> element without going into the details of the other elements.
This is similar to parsing XML. XML is usually parsed with either a
DOM-like parser or a SAX parser.

DOM (typically) loads the whole document into memory and then works with
it. You can then simply access any element, but all the data is held in
memory. It is simpler to work with, but less efficient for large
documents.

SAX (typically) reads the document and, while reading, calls methods
that process the data just read. It is not as simple to use as DOM, but
it is more efficient for large documents.
> Here the format is somewhat different, so I build a positional reference
> (in bytes) for the sections marked by .START and store it for later
> parsing.
> The I/O is done twice this way, but loading a whole 20 MB file is even
> slower. And the second I/O pass may not cover the whole file; for
> example, I may parse only one section out of 20 marked by .START.
>> - You can treat the input as a binary file (on many systems this is no
>> different from a text file, but on Windows, for example, it is), use the
>> read method to fill a buffer, and search for ".START" on your own. [In
>> fact, I do not believe it will make a big difference.]
> My primary system is Windows :(
> I have an estimate of how many bytes of buffer I need to reach the next
> .START. Can that be set for the stream somehow, or is it entirely
> implementation-dependent / OS-dependent?
You could deal with filebuf (implementing your own class derived from
streambuf), but I do not think it would be worthwhile (too much effort
for no big effect).

If you do not use C files (FILE* from stdio.h or cstdio), you should
disable synchronization of C++ iostreams with FILE* by calling
std::ios_base::sync_with_stdio(false). If you do, it becomes your
responsibility to make sure that nobody (not even a library) uses FILE*
for your files.
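
On the question of controlling the buffer size: a stream buffer can be offered a caller-supplied buffer through pubsetbuf, although what the implementation does with it is implementation-defined, and it is usually done before the file is opened. Note also that sync_with_stdio concerns the standard streams (std::cin, std::cout and so on), so it matters mainly when the data arrives on standard input. A hedged sketch, with open_with_buffer being a name invented here:

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: open `path` for binary reading after offering the
// stream a caller-owned buffer. The buffer must outlive the stream, and the
// implementation is free to ignore the request.
bool open_with_buffer(std::ifstream& in, const std::string& path,
                      std::vector<char>& buffer) {
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path.c_str(), std::ios::in | std::ios::binary);
    return in.is_open();
}

// Only relevant when reading from std::cin: stop syncing with C stdio.
void unsync_standard_streams() {
    std::ios_base::sync_with_stdio(false);
}
```

Whether a larger buffer helps at all depends on the library and the OS, so it is worth timing with and without it.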
>> - Although any assumption like "something will probably not exceed xyz
>> MB of memory" is wrong, you can load the data into memory and process it
>> there (20 MB is not much unless you are working on an embedded system).
> This is what I want, but in an automated way: instead of loading a fixed
> number of bytes into a buffer myself, let the stream load the bytes
> under the hood. As you mentioned, it may be doing that already; I only
> want to control the size.
>> - You can use a system-dependent solution: a memory-mapped file.
> I don't know of any C++ library for it; Boost does not seem to provide
> memory-mapped files either.
There is no such standard C++ library. You have to use the API of your
OS, or a library that supports many platforms and wraps the
platform-dependent code in its functions (for example ACE).

Jan 8 '07 #4
