Bytes | Software Development & Data Engineering Community

parsing a file..

I need to parse a file which has about 2000 lines, and I'm being
told that reading the file in ASCII would be a slower way to do it, and
so I need to resort to binary by reading it in large chunks. Can
anyone please explain what all this is about?
Mar 14 '08
ri*****@cogsci.ed.ac.uk (Richard Tobin) wrote:
In article <fr**********@registered.motzarella.org>,
Richard <de***@gmail.com> wrote:
Reading in large chunks is unrelated to whether it's binary or
ascii.
I would question that statement. Reading in binary will be a LOT faster,
if it's the same platform, for the same number of reads.

I didn't say whether it's in binary is unrelated to *speed*.

I meant: there are two separate issues; whether you read it in large
chunks, and whether it's binary. You can read either text or binary
in small or large chunks. Each of these choices will separately affect
the speed.
Besides, he _has_ a text file. Yes, it's a lot larger than a binary file
would be, and therefore slower to read. But the fact that the _file_ is
text is not the OP's doing. Reading this file as text or as binary won't
make a large difference. _Writing_ it as a binary file would have; but
that's not something the OP can do.

Richard
Mar 14 '08 #11

"Chris Dollin" <ch**********@hp.com> wrote in message
news:fr**********@news-pa1.hpl.hp.com...
Richard wrote:
>ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
>>In article
<4e***************************@s19g2000prg.googlegroups.com>,
broli <Br*****@gmail.com> wrote:

I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii.

I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.
Quick test, one file, 2000 lines, each line with two floats (1.12345
and 7.890), about 28Kb total.

One single big-enough fread:

real 0m0.002s
user 0m0.000s
sys 0m0.001s

Repeat fscanf( ... "%lf %lf" ... ) until EOF:

real 0m0.004s
user 0m0.002s
sys 0m0.002s

Yes, in this test it's twice as slow. The data file is probably
cached (it's been read several other times already as I /cough/
My own tests:

(A) 100,000 lines of text, each with 3 doubles (2900000 bytes):

2.1 seconds to read a number at a time, using sscanf() (but I use a wrapper
or two with some extra overhead)

(B) The same data as 300,000 doubles written as binary (2400000 bytes):

0.8 seconds to read a number at a time, using fread() 8 bytes at a time

(C) Same binary data as (B)

0.004 seconds to read as a single block into memory (possibly straight into
the array or whatever datastructure is used). Using fread() on 2400000
bytes.

So about 200-500 times faster in binary mode, when done properly.

--
Bart

Mar 14 '08 #12
ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
In article <fr**********@registered.motzarella.org>,
Richard <de***@gmail.com> wrote:
>>Reading in large chunks is unrelated to whether it's binary or
ascii.
>>I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.

I didn't say whether it's in binary is unrelated to *speed*.
I'm not sure that parses :-;
>
I meant: there are two separate issues; whether you read it in large
chunks, and whether it's binary. You can read each of text or binary
in small or large chunks. Each of these choices will separately affect
the speed.
Yes, I agree.
>
-- Richard
Mar 14 '08 #13
Bartc wrote:
) My own tests:
)
) (A) 100,000 lines of text, each with 3 doubles (2900000 bytes):
)
) 2.1 seconds to read a number at a time, using sscanf() (but I use a wrapper
) or two with some extra overhead)
)
) (B) The same data as 300,000 doubles written as binary (2400000 bytes):
)
) 0.8 seconds to read a number at a time, using fread() 8 bytes at a time
)
) (C) Same binary data as (B)
)
) 0.004 seconds to read as a single block into memory (possibly straight into
) the array or whatever datastructure is used). Using fread() on 2400000
) bytes.
)
) So about 200-500 times faster in binary mode, when done properly.

Have you tried reading the text file into memory as a single block
and then using sscanf() to parse it ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Mar 14 '08 #14
Chris Dollin <ch**********@hp.com> writes:
Richard wrote:
>ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
>>In article <4e*********************************@s19g2000prg.googlegroups.com>,
broli <Br*****@gmail.com> wrote:

I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii.

I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.
>> Perhaps they meant that character-at-a-time reading with
getchar() is slow, which it is on some systems. You can perfectly
well use fread() on text files.

The text file will be larger. There is a need to parse the ascii text
into the destination formats.

It will be slower in the great majority of cases.

Quick test, one file, 2000 lines, each line with two floats (1.12345
and 7.890), about 28Kb total.

One single big-enough fread:

real 0m0.002s
user 0m0.000s
sys 0m0.001s

Repeat fscanf( ... "%lf %lf" ... ) until EOF:

real 0m0.004s
user 0m0.002s
sys 0m0.002s

Yes, in this test it's twice as slow. The data file is probably
cached (it's been read several other times already as I /cough/
debugged my code). It includes program start-up time (I just did
`time ./a.out` to get the numbers) so the actual reading time will
be less.

Myself I wouldn't count that as "LOTS faster" for binary data,
but doubtless there are applications where it is so counted;
I don't think the OPs case is one of them, and it does look as
though he's reading a text file anyway.
Then why not take the static noise out? Make the file a lot bigger and
report back.

But even these results do indicate quite a large percentage difference.

And we do not know how often this data sample is written or read. It
could be thousands of times an hour, leading to considerable unnecessary
overhead if using ASCII over binary.
Mar 14 '08 #15
"Bartc" <bc@freeuk.com> writes:
"Chris Dollin" <ch**********@hp.com> wrote in message
news:fr**********@news-pa1.hpl.hp.com...
>Richard wrote:
>>ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:

In article
<4e**************************@s19g2000prg.googlegroups.com>,
broli <Br*****@gmail.com> wrote:

>I need to parse a file which has about 2000 lines and I'm getting
>told that reading the file in ascii would be a slower way to do it and
>so i need to resort to binary by reading it in large chunks. Can any
>one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii.

I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.
>Quick test, one file, 2000 lines, each line with two floats (1.12345
and 7.890), about 28Kb total.

One single big-enough fread:

real 0m0.002s
user 0m0.000s
sys 0m0.001s

Repeat fscanf( ... "%lf %lf" ... ) until EOF:

real 0m0.004s
user 0m0.002s
sys 0m0.002s

Yes, in this test it's twice as slow. The data file is probably
cached (it's been read several other times already as I /cough/

My own tests:

(A) 100,000 lines of text, each with 3 doubles (2900000 bytes):

2.1 seconds to read a number at a time, using sscanf() (but I use a wrapper
or two with some extra overhead)

(B) The same data as 300,000 doubles written as binary (2400000 bytes):

0.8 seconds to read a number at a time, using fread() 8 bytes at a time

(C) Same binary data as (B)

0.004 seconds to read as a single block into memory (possibly straight into
the array or whatever datastructure is used). Using fread() on 2400000
bytes.

So about 200-500 times faster in binary mode, when done properly.
I'm surprised this is even being contested.
Mar 14 '08 #16

"Willem" <wi****@stack.nl> wrote in message
news:sl*******************@snail.stack.nl...
Bartc wrote:
) My own tests:
)
) (A) 100,000 lines of text, each with 3 doubles (2900000 bytes):
)
) 2.1 seconds to read a number at a time, using sscanf() (but I use a wrapper
) or two with some extra overhead)
)
) (B) The same data as 300,000 doubles written as binary (2400000 bytes):
)
) 0.8 seconds to read a number at a time, using fread() 8 bytes at a time
)
) (C) Same binary data as (B)
)
) 0.004 seconds to read as a single block into memory (possibly straight into
) the array or whatever datastructure is used). Using fread() on 2400000
) bytes.
)
) So about 200-500 times faster in binary mode, when done properly.

Have you tried reading the text file into memory as a single block
and then using sscanf() to parse it ?
No. I would imagine it would add a second or so to the time.

However, I left out the word 'apparently' when quoting the 200+ speed-up for
the binary block. I'm sure the disk cache has a big effect here, unless my
harddrive has a 600MB/sec transfer rate.

--
Bart
Mar 14 '08 #17
In article <fr**********@registered.motzarella.org>,
Richard <de***@gmail.com> wrote:
>I didn't say whether it's in binary is unrelated to *speed*.
>I'm not sure that parses :-;
I didn't say { { whether it's in binary } is unrelated to { *speed* } }.

-- Richard
--
:wq
Mar 14 '08 #18
Richard wrote:
Chris Dollin <ch**********@hp.com> writes:
>Myself I wouldn't count that as "LOTS faster" for binary data,
but doubtless there are applications where it is so counted;
I don't think the OPs case is one of them, and it does look as
though he's reading a text file anyway.

Then why not take the static noise out? Make the file a lot bigger and
report back.
Because 2000 lines was the OP's file size, and for that file size
and context the difference in timing is unimportant; and because,
life being finite, I'd already spent what time I had available.
But even these results do indicate quite a large % difference .....

And we do not know how often this data sample is written or read. I
could be thousands of times an hour leading to considerable unnecessary
overhead if using ascii over binary.
Yes, and it could be once a day. Or a week. And for all we know -- hey,
if you can invent facts, so can I -- his code will be run on machines
with different floating-point formats, making binary transfer a clear
road to the Pit and text transfer more of a Dragons of Bel'kwinith thing.

--
"Creation began." - James Blish, /A Clash of Cymbals/

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

Mar 14 '08 #19
Richard wrote:
"Bartc" <bc@freeuk.com> writes:
>So about 200-500 times faster in binary mode, when done properly.

I'm surprised this is even being contested.
It's not being contested; it's being /quantified/, which is part of
deciding whether whatever is the right thing to do.

[You can drive along the M4 at 70mph or at 120mph [1]; the latter is
certainly faster.]

[1] And, Just In Case Someone Suspects A Weasel, at a whole bunch of
other speeds as well, including at times 0; I don't /think/ I've
ever had to go negative, though.

--
"It was the dawn of the third age of mankind." /Babylon 5/

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

Mar 14 '08 #20
