parsing a file..

I need to parse a file which has about 2000 lines, and I'm being
told that reading the file in ascii would be a slower way to do it, so
I need to resort to binary by reading it in large chunks. Can anyone
please explain what this is all about?
Mar 14 '08 #1
broli said:

> I need to parse a file which has about 2000 lines and I'm getting
> told that reading the file in ascii would be a slower way to do it and
> so i need to resort to binary by reading it in large chunks. Can any
> one please explain what is all this about ?

Someone's pulling your leg. 2000 lines of text is nothing. Just write the
program so that it's clear, correct, and easy to understand. Then, if and
only if it's too slow (and you should define the "fast enough"/"too slow"
boundary before you start writing the program), it's time to think about
how it might be made faster.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Mar 14 '08 #2
broli said:

<snip>
> But then I
> was told that "normally we don't read scientific data in ascii for
> accuracy and speed concerns" which made me wonder what was so wrong ?

The statement!

> I could parse 2000 lines in hardly any time and there was no problem
> with ascii either.

Right. Someone's pulling your leg, or is overly concerned with efficiency
at the expense of development time and clarity. That isn't to say that
efficiency isn't important. But let's just pretend, for the sake of
argument, that you write it /both/ ways, and then you measure. You
discover that the "binary" technique takes 0.025 seconds to process the
2000 data groups, whereas the "text" version takes 0.075 seconds - three
times slower! Surely this is a triumph for binary!

Yeah, right, but who cares? You press ENTER, and then it takes you 0.1
seconds to look up at the screen, and everything's finished, no matter
which one you ran.

Write it clear, simple, and correct. Then worry about speed if and only if
you have to.
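
The "write it both ways, then measure" step can be sketched as below. This is only an illustration, not code from the thread: `time_once`, `struct sum_job`, and `run_sum_job` are hypothetical names, and the summing loop merely stands in for whichever reader is being measured. Note that `clock()` reports CPU time, so time spent waiting on the disk may not show up.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: run work(arg) once and return the CPU time it
   used, in seconds, as reported by clock(). */
double time_once(void (*work)(void *), void *arg)
{
    clock_t start, stop;

    assert(work != NULL);
    start = clock();
    work(arg);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload (an assumption for illustration): add 1..n. */
struct sum_job {
    long n;      /* input: upper bound */
    long total;  /* output: 1 + 2 + ... + n */
};

void run_sum_job(void *arg)
{
    struct sum_job *job = arg;
    long i;

    job->total = 0;
    for (i = 1; i <= job->n; i++) {
        job->total += i;
    }
}
```

Calling `time_once` once on the "text" reader and once on the "binary" reader, against the "fast enough" limit chosen beforehand, gives the comparison described above.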

--
Richard Heathfield <http://www.cpax.org.uk >
Mar 14 '08 #3
In article <4e**********************************@s19g2000prg.googlegroups.com>,
broli <Br*****@gmail.com> wrote:

> I need to parse a file which has about 2000 lines and I'm getting
> told that reading the file in ascii would be a slower way to do it and
> so i need to resort to binary by reading it in large chunks. Can any
> one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii. Perhaps they meant that character-at-a-time reading with
getchar() is slow, which it is on some systems. You can perfectly
well use fread() on text files.
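
A minimal sketch of using fread() on a text file: it pulls everything left on a stream into one buffer, a chunk at a time. The helper name `read_all` and the 4096-byte starting size are assumptions made for this example, not anything prescribed in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything remaining on fp into one
   NUL-terminated heap buffer using fread(). Returns NULL on allocation
   or read failure; the caller frees the result. */
char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0, got;
    char *buf, *tmp;

    assert(fp != NULL && len != NULL);

    buf = malloc(cap + 1);          /* +1 leaves room for the terminator */
    if (buf == NULL) {
        return NULL;
    }
    while ((got = fread(buf + used, 1, cap - used, fp)) > 0) {
        used += got;
        if (used == cap) {          /* buffer full: double it */
            tmp = realloc(buf, cap * 2 + 1);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[used] = '\0';
    *len = used;
    return buf;
}
```

The result is still text; the numbers in it have to be converted afterwards, whatever the chunk size.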

-- Richard

--
:wq
Mar 14 '08 #4
Chris Dollin said:

> Richard Heathfield wrote:
> <snip>
>> Someone's pulling your leg. 2000 lines of text is nothing. Just write
>> the program so that it's clear, correct, and easy to understand. Then,
>> if and only if it's too slow (and you should define the "fast
>> enough"/"too slow" boundary before you start writing the program), it's
>> time to think about how it might be made faster.
>
> I agree that speed is unlikely to be a factor -- but accuracy may be.

Possibly, but that comes under correctness, not performance.

<snip>

> After all, if they want to read those 2000 lines 1000 times per second
> ...

....and that is covered by "fast enough/too slow". Again, I would emphasise
that the first priority is to make the program *clear* (because it's
easier to make a clear program correct than to make a correct program
clear). The second priority (and a sine qua non, obviously) is to make the
program *correct*. When and only when it works, it's time to worry about
speed. (This obviously does *not* mean that one should intentionally adopt
gross algorithmic inefficiencies.)

--
Richard Heathfield <http://www.cpax.org.uk >
Mar 14 '08 #5
Richard Heathfield,

There are many modules involved in my software package and this is
just one of them. My software would also involve a huge number of
calculations, searching, memory allocation, etc., but the thing is
that I have to parallelize the code to run on different
machines anyway. Even if speed is an issue, I doubt that reading a
file in ascii or "binary" would make a huge impact overall.
Mar 14 '08 #6
broli said:

<snip>
> But when I use fgets() then wouldn't I get a string
> of characters (also many tabs, null character etc) ?

Yes.

> Wouldn't it be a
> difficult task to convert an array of characters into double type
> floating numbers again ?

I don't see that you have any choice. If what you've described is correct,
the numbers are already in text form. Converting is easy enough, though,
using strtod.
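
A minimal sketch of that conversion, assuming each line holds whitespace-separated numbers; `parse_doubles` is a hypothetical helper name. strtod() skips leading whitespace (tabs included) and reports where it stopped, so the loop just keeps converting from that point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse up to max whitespace-separated doubles
   from one line of text. Returns how many values were stored in out[]. */
size_t parse_doubles(const char *line, double *out, size_t max)
{
    size_t count = 0;

    assert(line != NULL && out != NULL);

    while (count < max) {
        char *end;
        double value = strtod(line, &end);

        if (end == line) {
            break;              /* no conversion: end of line or junk */
        }
        out[count++] = value;
        line = end;             /* continue after the number just read */
    }
    return count;
}
```

A line obtained from fgets() can be passed in directly; the trailing newline simply ends the loop.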

> I think using fread will make it very fast
> (considering that it allows you to read as many bytes of data at a
> time as you want) but once again I'm not very adept at file handling
> just at the beginning stages.

It's very likely that the input stream is buffered, so it won't actually
make much, if any, difference.
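
To illustrate the buffering point: with a buffered stream, a line-at-a-time fgets() loop is typically served from the library's buffer, and setvbuf() can request a larger one if measurement ever shows it matters. The names `use_large_buffer` and `count_lines` are assumptions for this sketch, and the loop assumes lines shorter than 255 characters.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: ask stdio for a fully buffered stream of the
   given size. Must be called before any other operation on fp.
   Returns 0 on success, nonzero if the request was not honoured. */
int use_large_buffer(FILE *fp, size_t size)
{
    assert(fp != NULL);
    /* Passing NULL lets the library allocate the buffer itself. */
    return setvbuf(fp, NULL, _IOFBF, size);
}

/* Count lines with fgets(). Assumes every line fits in the array;
   a longer line would be counted once per partial read. */
long count_lines(FILE *fp)
{
    char line[256];
    long n = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        n++;
    }
    return n;
}
```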

--
Richard Heathfield <http://www.cpax.org.uk >
Mar 14 '08 #7
ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:

> In article <4e**********************************@s19g2000prg.googlegroups.com>,
> broli <Br*****@gmail.com> wrote:
>> I need to parse a file which has about 2000 lines and I'm getting
>> told that reading the file in ascii would be a slower way to do it and
>> so i need to resort to binary by reading it in large chunks. Can any
>> one please explain what is all this about ?
>
> Reading in large chunks is unrelated to whether it's binary or
> ascii.

I would question that statement. Reading in binary will be a LOT faster,
if it's the same platform, for reading in the same NUMBER of
readings.

> Perhaps they meant that character-at-a-time reading with
> getchar() is slow, which it is on some systems. You can perfectly
> well use fread() on text files.

The text file will be larger. There is a need to parse the ascii text
into the destination formats.

It will be slower in the great majority of cases.

> -- Richard
Mar 14 '08 #8
Richard wrote:

> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
>> In article <4e**********************************@s19g2000prg.googlegroups.com>,
>> broli <Br*****@gmail.com> wrote:
>>> I need to parse a file which has about 2000 lines and I'm getting
>>> told that reading the file in ascii would be a slower way to do it and
>>> so i need to resort to binary by reading it in large chunks. Can any
>>> one please explain what is all this about ?
>>
>> Reading in large chunks is unrelated to whether it's binary or
>> ascii.
>
> I would question that statement. Reading in binary will be a LOT faster
> ,if its the same platform. for reading in the same NUMBER of
> readings.
>
>> Perhaps they meant that character-at-a-time reading with
>> getchar() is slow, which it is on some systems. You can perfectly
>> well use fread() on text files.
>
> The text file will be larger. There is a need to parse the ascii text
> into the destination formats.
>
> It will be slower in the great majority of cases.

Quick test, one file, 2000 lines, each line with two floats (1.12345
and 7.890), about 28Kb total.

One single big-enough fread:

real 0m0.002s
user 0m0.000s
sys 0m0.001s

Repeat fscanf( ... "%lf %lf" ... ) until EOF:

real 0m0.004s
user 0m0.002s
sys 0m0.002s

Yes, in this test it's twice as slow. The data file is probably
cached (it's been read several other times already as I /cough/
debugged my code). It includes program start-up time (I just did
`time ./a.out` to get the numbers) so the actual reading time will
be less.

Myself I wouldn't count that as "LOTS faster" for binary data,
but doubtless there are applications where it is so counted;
I don't think the OPs case is one of them, and it does look as
though he's reading a text file anyway.
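
The test program itself was not posted. A rough sketch of what the fscanf() half might look like, assuming two doubles per line, is below; `read_pairs` is a hypothetical name, and the running sums are there only so the converted values are actually used.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reconstruction: read pairs of doubles with fscanf()
   until a pair can no longer be converted (normally at end of file).
   Returns the number of pairs read and stores the column totals. */
long read_pairs(FILE *fp, double *sum_a, double *sum_b)
{
    double a, b;
    long pairs = 0;

    assert(fp != NULL && sum_a != NULL && sum_b != NULL);

    *sum_a = 0.0;
    *sum_b = 0.0;
    while (fscanf(fp, "%lf %lf", &a, &b) == 2) {
        *sum_a += a;
        *sum_b += b;
        pairs++;
    }
    return pairs;
}
```

Timing a program built around a loop like this with `time ./a.out`, as described above, includes start-up cost as well as the reading itself.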

--
"Creation began." - James Blish, /A Clash of Cymbals/

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

Mar 14 '08 #9
In article <fr**********@registered.motzarella.org>,
Richard <de***@gmail.com> wrote:

>> Reading in large chunks is unrelated to whether it's binary or
>> ascii.
>
> I would question that statement. Reading in binary will be a LOT faster
> ,if its the same platform. for reading in the same NUMBER of
> readings.

I didn't say whether it's in binary is unrelated to *speed*.

I meant: there are two separate issues; whether you read it in large
chunks, and whether it's binary. You can read each of text or binary
in small or large chunks. Each of these choices will separately affect
the speed.
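
To make the "binary" side of that distinction concrete, here is a minimal sketch that stores doubles in their in-memory form and reads them back with a single fread() call. The function names are hypothetical. A file written this way skips the text conversion, but it is only readable on a platform with the same double representation and byte order.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write n doubles in their raw in-memory form.
   Returns the number of values actually written. */
size_t write_binary_doubles(FILE *fp, const double *in, size_t n)
{
    assert(fp != NULL && in != NULL);
    return fwrite(in, sizeof *in, n, fp);
}

/* Hypothetical helper: read up to max raw doubles in one fread() call.
   Returns the number of complete values read. */
size_t read_binary_doubles(FILE *fp, double *out, size_t max)
{
    assert(fp != NULL && out != NULL);
    return fread(out, sizeof *out, max, fp);
}
```

The chunk size is the `max` argument and is chosen independently of the format, which is the separation of issues described above. The stream should be opened in binary mode ("rb"/"wb") for this kind of data.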

-- Richard
Mar 14 '08 #10
