parsing a file..

broli

I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all
this about ?

Mar 14 '08 #1

Subscribe Reply

2521

1
2
3
>
Last »

Richard Heathfield

broli said:

I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all this about ?

Someone's pulling your leg. 2000 lines of text is nothing. Just write the
program so that it's clear, correct, and easy to understand. Then, if and
only if it's too slow (and you should define the "fast enough"/"too slow"
boundary before you start writing the program), it's time to think about
how it might be made faster.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Mar 14 '08 #2

Richard Heathfield

broli said:

<snip>

But then I
was told that " normally we don't read scientific data in ascii for
accuracy and speed concerns" which made me wonder what was so wrong ?

The statement!

I could parse 2000 lines in hardly any time and there was no problem
with ascii either.

Right. Someone's pulling your leg, or is overly concerned with efficiency
at the expense of development time and clarity. That isn't to say that
efficiency isn't important. But let's just pretend, for the sake of
argument, that you write it /both/ ways, and then you measure. You
discover that the "binary" technique takes 0.025 seconds to process the
2000 data groups, whereas the "text" version takes 0.075 seconds - three
times slower! Surely this is a triumph for binary!

Yeah, right, but who cares? You press ENTER, and then it takes you 0.1
seconds to look up at the screen, and everything's finished, no matter
which one you ran.

Write it clear, simple, and correct. Then worry about speed if and only if
you have to.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Mar 14 '08 #3

Richard Tobin

In article <4e************ *************** *******@s19g200 0prg.googlegrou ps.com>,
broli <Br*****@gmail. comwrote:

>I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii. Perhaps they meant that character-at-a-time reading with
getchar() is slow, which it is on some systems. You can perfectly
well use fread() on text files.

-- Richard

--
:wq

Mar 14 '08 #4

Richard Heathfield

Chris Dollin said:

Richard Heathfield wrote:

<snip>

>>
Someone's pulling your leg. 2000 lines of text is nothing. Just write
the program so that it's clear, correct, and easy to understand. Then,
if and only if it's too slow (and you should define the "fast
enough"/"too slow" boundary before you start writing the program), it's
time to think about how it might be made faster.

I agree that speed is unlikely to be a factor -- but accuracy may be.

Possibly, but that comes under correctness, not performance.

<snip>

After all, if they want to read those 2000 lines 1000 times per second
...

....and that is covered by "fast enough/too slow". Again, I would emphasise
that the first priority is to make the program *clear* (because it's
easier to make a clear program correct than to make a correct program
clear). The second priority (and a sine qua non, obviously) is to make the
program *correct*. When and only when it works, it's time to worry about
speed. (This obviously does *not* mean that one should intentionally adopt
gross algorithmic inefficiencies. )

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Mar 14 '08 #5

broli

Richard HeathField,

There are many modules involved in my software package and this is
just one of them. My software would also involve huge number of
calculations, searching, memory allocation etc etc but the thing is
that I have to parallelize the software code to run on different
machines anyway. Even if speed is an issue, I doubt that reading a
file in ascii or "binary" would make a huge impact overall.

Mar 14 '08 #6

Richard Heathfield

broli said:

<snip>

But when I use fgets() then wouldn't I get a string
of characters (also many tabs, null character etc) ?

Yes.

Wouldn't it be a
difficult task to convert an array of characters into double type
floating numbers again ?

I don't see that you have any choice. If what you've described is correct,
the numbers are already in text form. Converting is easy enough, though,
using strtod.

I think using fread will make it very fast
(considering that it allows you to read as many bytes of data at a
time as you want) but once again I'm not very adept at file handling
just at the begginign stages.

It's very likely that the input stream is buffered, so it won't actually
make much, if any, difference.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Mar 14 '08 #7

Richard

ri*****@cogsci. ed.ac.uk (Richard Tobin) writes:

In article <4e************ *************** *******@s19g200 0prg.googlegrou ps.com>,
broli <Br*****@gmail. comwrote:

>>I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii.

I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.

Perhaps they meant that character-at-a-time reading with
getchar() is slow, which it is on some systems. You can perfectly
well use fread() on text files.

The text file will be larger. There is a need to parse the ascii text
into the destination formats.

It will be slower in the great majority of cases.

>
-- Richard

Mar 14 '08 #8

Chris Dollin

Richard wrote:

ri*****@cogsci. ed.ac.uk (Richard Tobin) writes:

>In article <4e************ *************** *******@s19g200 0prg.googlegrou ps.com>,
broli <Br*****@gmail. comwrote:

>>>I need to parse a file which has about 2000 lines and I'm getting
told that reading the file in ascii would be a slower way to do it and
so i need to resort to binary by reading it in large chunks. Can any
one please explain what is all this about ?

Reading in large chunks is unrelated to whether it's binary or
ascii.

I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.

> Perhaps they meant that character-at-a-time reading with
getchar() is slow, which it is on some systems. You can perfectly
well use fread() on text files.

The text file will be larger. There is a need to parse the ascii text
into the destination formats.

It will be slower in the great majority of cases.

Quick test, one file, 2000 lines, each line with two floats (1.12345
and 7.890), about 28Kb total.

One single big-enough fread:

real 0m0.002s
user 0m0.000s
sys 0m0.001s

Repeat fscanf( ... "%lf %lf" ... ) until EOF:

real 0m0.004s
user 0m0.002s
sys 0m0.002s

Yes, in this test it's twice as slow. The data file is probably
cached (it's been read several other times already as I /cough/
debugged my code). It includes program start-up time (I just did
`time ./a.out` to get the numbers) so the actual reading time will
be less.

Myself I wouldn't count that as "LOTS faster" for binary data,
but doubtless there are applications where it is so counted;
I don't think the OPs case is one of them, and it does look as
though he's reading a text file anyway.

--
"Creation began." - James Blish, /A Clash of Cymbals/

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

Mar 14 '08 #9

Richard Tobin

In article <fr**********@r egistered.motza rella.org>,
Richard <de***@gmail.co mwrote:

>Reading in large chunks is unrelated to whether it's binary or
ascii.

>I would question that statement. Reading in binary will be a LOT faster
,if its the same platform. for reading in the same NUMBER of
readings.

I didn't say whether it's in binary is unrelated to *speed*.

I meant: there are two separate issues; whether you read it in large
chunks, and whether it's binary. You can read each of text or binary
in small or large chunks. Each of these choices will separately affect
the speed.

-- Richard
--
:wq

Mar 14 '08 #10

Similar topics

3661

XML file parsing with SAX

by: Willem Ligtenberg | last post by:

I decided to use SAX to parse my xml file. But the parser crashes on: File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception xml.sax._exceptions.SAXParseException: NCBI_Entrezgene.dtd:8:0: error in processing external entity reference This is caused by: <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">

Python

3960

XML file parsing/validating with xerces-j

by: Cigdem | last post by:

Hello, I am trying to parse the XML files that the user selects(XML files are on anoher OS400 system called "wkdis3"). But i am permenantly getting that error: Directory0: \\wkdis3\ROOT\home Canonicalpath-Directory4: \\wkdis3\ROOT\home\bwe\ You selected the file named AAA.XML getXmlAlgorithmDocument(): IOException Not logged in

.NET Framework

3509

Parsing complex xml file with C#

by: Pir8 | last post by:

I have a complex xml file, which contains stories within a magazine. The structure of the xml file is as follows: <?xml version="1.0" encoding="ISO-8859-1" ?> <magazine> <story> <story_id>112233</story_id> <pub_name>Puleen's Publication</pub_name> <pub_code>PP</pub_code> <edition_date>20031201</edition_date>

C# / C Sharp

2465

file parsing algorithms in vb.net?

by: Christoph Bisping | last post by:

Hello! Maybe someone is able to give me a little hint on this: I've written a vb.net app which is mainly an interpreter for specialized CAD/CAM files. These files mainly contain simple movement and drawing instructions like "move to's" and "change color's" optionally followed by one or more numeric (int or float) arguments. My problem is that the parsing algorithm I've currently implemented is extremely slow.

Visual Basic .NET

4868

Parsing an HTML table with XML

by: Rick Walsh | last post by:

I have an HTML table in the following format: <table> <tr><td>Header 1</td><td>Header 2</td></tr> <tr><td>1</td><td>2</td></tr> <tr><td>3</td><td>4</td></tr> <tr><td>5</td><td>6</td></tr> </table> With an XSLT styles sheet, I can use for-each to grab the values in

.NET Framework

4387

parsing an ifstream to get some specific text

by: toton | last post by:

Hi, I have some ascii files, which are having some formatted text. I want to read some section only from the total file. For that what I am doing is indexing the sections (denoted by .START in the file) with the location. And for a particular section I parse only that section. The file is something like, .... DATAS

C / C++

1992

Need help with parsing a multilined log file into objects

by: Paulers | last post by:

Hello, I have a log file that contains many multi-line messages. What is the best approach to take for extracting data out of each message and populating object properties to be stored in an ArrayList? I have tried looping through the logfile using regex, if statements and flags to find the start and end of each message but I do not see a good time in this process to create a new instance of my Message object. While messing around with...

Visual Basic .NET

4516

Command language parsing - how formal to get?

by: Chris Carlen | last post by:

Hi: Having completed enough serial driver code for a TMS320F2812 microcontroller to talk to a terminal, I am now trying different approaches to command interpretation. I have a very simple command set consisting of several single letter commands which take no arguments. A few additional single letter commands take arguments:

C / C++

2835

Would a lack of line breaks in a doc cause parsing problems ?

by: charliefortune | last post by:

I am fetching some product feeds with PHP like this $merch = substr($key,1); $feed = file_get_contents($_POST); $fp = fopen("./feeds/feed".$merch.".txt","w+"); fwrite ($fp,$feed); fclose ($fp); and then parsing them with PHP's native parsing functions. This is succesful for most of the feeds, but a couple of them claim to be

.NET Framework

3619

HTML File Parsing

by: Felipe De Bene | last post by:

I'm having problems parsing an HTML file with the following syntax : <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'> <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH> <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%' BGCOLOR='#c0c0c0'>Date</TH> and so on.... whenever I feed the parser with such file I get the error :

Python

9688

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

10268

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10247

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

10031

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

7571

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6809

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

5467

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5593

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

2941

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General