parse two field file


Which way would you guys recommend to best parse a multiline file which contains
two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev	usbfs
	ext3
nodev	fuse
	vfat
	ntfs
nodev	binfmt_misc
	udf
	iso9660

The first field can be "empty" and consist of only a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

The field I am really interested in is the second one : any hints & tips
appreciated as to do this in the most efficient manor.

--
Dec 17 '06 #1
Richard wrote:
>
> Which way would you guys recommend to best parse a multiline file
> which contains two fields separated by a tab? In this case it's the
> Linux /proc/filesystems file, a sample of which I have included below:
>
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
>
> The first field can be "empty" and consist of only a single tab
> character. The separator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?
>
> The field I am really interested in is the second one : any hints
> & tips appreciated as to do this in the most efficient manor.

Use toksplit. Call with tokchar set to '\t'. Std C code follows:

/* ------- file toksplit.h ----------*/
#ifndef H_toksplit_h
#  define H_toksplit_h

#  ifdef __cplusplus
   extern "C" {
#  endif

#include <stddef.h>

/* copy over the next token from an input string, after
   skipping leading blanks (or other whitespace?). The
   token is terminated by the first appearance of tokchar,
   or by the end of the source string.

   The caller must supply sufficient space in token to
   receive any token, Otherwise tokens will be truncated.

   Returns: a pointer past the terminating tokchar.

   This will happily return an infinity of empty tokens if
   called with src pointing to the end of a string. Tokens
   will never include a copy of tokchar.

   released to Public Domain, by C.B. Falconer.
   Published 2006-02-20. Attribution appreciated.
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh);     /* length token can receive */
                                      /* not including final '\0' */

#  ifdef __cplusplus
   }
#  endif
#endif
/* ------- end file toksplit.h ----------*/

/* ------- file toksplit.c ----------*/
#include "toksplit.h"

/* copy over the next token from an input string, after
   skipping leading blanks (or other whitespace?). The
   token is terminated by the first appearance of tokchar,
   or by the end of the source string.

   The caller must supply sufficient space in token to
   receive any token, Otherwise tokens will be truncated.

   Returns: a pointer past the terminating tokchar.

   This will happily return an infinity of empty tokens if
   called with src pointing to the end of a string. Tokens
   will never include a copy of tokchar.

   A better name would be "strtkn", except that is reserved
   for the system namespace. Change to that at your risk.

   released to Public Domain, by C.B. Falconer.
   Published 2006-02-20. Attribution appreciated.
   Revised 2006-06-13
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh)      /* length token can receive */
                                      /* not including final '\0' */
{
   if (src) {
      while (' ' == *src) src++;

      while (*src && (tokchar != *src)) {
         if (lgh) {
            *token++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tokchar == *src)) src++;
   }
   *token = '\0';
   return src;
} /* toksplit */

#ifdef TESTING
#include <stdio.h>

#define ABRsize 6 /* length of acceptable token abbreviations */

/* ---------------- */

static void showtoken(int i, char *tok)
{
   putchar(i + '1'); putchar(':');
   puts(tok);
} /* showtoken */

/* ---------------- */

int main(void)
{
   char teststring[] = "This is a test, ,, abbrev, more";

   const char *t, *s = teststring;
   int i;
   char token[ABRsize + 1];

   puts(teststring);
   t = s;
   for (i = 0; i < 4; i++) {
      t = toksplit(t, ',', token, ABRsize);
      showtoken(i, token);
   }

   puts("\nHow to detect 'no more tokens' while truncating");
   t = s; i = 0;
   while (*t) {
      t = toksplit(t, ',', token, 3);
      showtoken(i, token);
      i++;
   }

   puts("\nUsing blanks as token delimiters");
   t = s; i = 0;
   while (*t) {
      t = toksplit(t, ' ', token, ABRsize);
      showtoken(i, token);
      i++;
   }
   return 0;
} /* main */

#endif
/* ------- end file toksplit.c ----------*/
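
For the /proc/filesystems case, a caller might look roughly like the sketch below. The helper name second_field is hypothetical, and a compact copy of toksplit is repeated only so the fragment compiles on its own. It assumes each line comes from fgets() and so may end in a newline, which toksplit does not remove.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compact copy of toksplit, repeated so this fragment compiles alone. */
static const char *toksplit(const char *src, char tokchar,
                            char *token, size_t lgh)
{
   if (src) {
      while (' ' == *src) src++;
      while (*src && (tokchar != *src)) {
         if (lgh) {
            *token++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tokchar == *src)) src++;
   }
   *token = '\0';
   return src;
}

/* Hypothetical helper: copy the second tab-separated field of line
   into out, which must hold outlen characters plus the final '\0'. */
void second_field(const char *line, char *out, size_t outlen)
{
   const char *rest;

   assert(line != NULL && out != NULL);
   rest = toksplit(line, '\t', out, outlen); /* field 1, may be empty */
   toksplit(rest, '\t', out, outlen);        /* field 2 overwrites it */
   out[strcspn(out, "\r\n")] = '\0';         /* drop the fgets() newline */
}
```

Because toksplit skips only leading blanks, not tabs, a line that starts with a tab yields an empty first token, so the second call always sees the second field. A line with no tab at all leaves out empty.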

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

Dec 17 '06 #2

"Richard" <rg****@gmail.com> wrote in message
news:jv************@gmail.com...
>
> Which way would you guys recommend to best parse a multiline file which
> contains
> two fields separated by a tab? In this case it's the
> Linux /proc/filesystems file, a sample of which I have included below:
>
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
>
> The first field can be "empty" and consist of only a single tab
> character. The separator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?
>
> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.

The input format is slightly quirky, so the best solution is to call fgets()
to read a line and then parse it yourself.

int checkheader(char *str)

can check whether the string is a header or not by looking for the tab or
counting whitespace.

parseheader(char *str, char *field1, char *field2)

will pull out the fields for you. Make sure you reject over-long strings.
Then the data fields only contain one string.

However,

void trim(char *str)

which removes leading and trailing whitespace, is a good function to have.

So too is

int checkblank(char *str)

which checks for strings that consist entirely of whitespace characters.
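
As a rough sketch, the last two helpers could be written as follows. Only the prototypes come from this post. The bodies, and the use of isspace() to decide what counts as whitespace, are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Remove leading and trailing whitespace from str, in place. */
void trim(char *str)
{
    char *start = str;
    size_t len;

    assert(str != NULL);
    while (isspace((unsigned char)*start))
        start++;
    len = strlen(start);
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        len--;
    memmove(str, start, len);   /* regions may overlap, so not memcpy */
    str[len] = '\0';
}

/* Return 1 if str is empty or consists only of whitespace, else 0. */
int checkblank(char *str)
{
    assert(str != NULL);
    while (*str != '\0') {
        if (!isspace((unsigned char)*str))
            return 0;
        str++;
    }
    return 1;
}
```

Note that trim() also removes a leading tab, so on this file it is best applied to a field after the line has been split, not to the whole line.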
--
www.personal.leeds.ac.uk/~bgy1mm
freeware games to download.
Dec 17 '06 #3
"Malcolm" <re*******@btinternet.com> writes:

> "Richard" <rg****@gmail.com> wrote in message
> news:jv************@gmail.com...
>>
>> Which way would you guys recommend to best parse a multiline file which
>> contains
>> two fields separated by a tab? In this case it's the
>> Linux /proc/filesystems file, a sample of which I have included below:
>>
>> nodev	usbfs
>> 	ext3
>> nodev	fuse
>> 	vfat
>> 	ntfs
>> nodev	binfmt_misc
>> 	udf
>> 	iso9660
>>
>> The first field can be "empty" and consist of only a single tab
>> character. The separator is a tab.
>>
>> Is sscanf best suited to this? Or use strtok/strtok_r?
>>
>> The field I am really interested in is the second one : any hints & tips
>> appreciated as to do this in the most efficient manor.
>
> The input format is slightly quirky, so the best solution is to call fgets()
> to read a line and then parse it yourself.
>
> int checkheader(char *str)
>
> can check whether the string is a header or not by looking for the tab or
> counting whitespace.
>
> parseheader(char *str, char *field1, char *field2)
>
> will pull out the fields for you. Make sure you reject over-long strings.
> Then the data fields only contain one string.
>
> However,
>
> void trim(char *str)
>
> which removes leading and trailing whitespace, is a good function to have.
>
> So too is
>
> int checkblank(char *str)
>
> which checks for strings that consist entirely of whitespace
> characters.

I just did sscanf("%s%s",f1,f2) in the end.
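
Spelled out, that call might look something like the sketch below. The helper name fs_name, the buffer size and the %63s widths are assumptions added for illustration. It relies on the return value of sscanf() to tell one field from two.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 64   /* assumed upper bound on a field, including '\0' */

/* Hypothetical helper: store the filesystem name from one line of
   /proc/filesystems in name (at least FIELD_MAX bytes).
   Returns 1 on success, 0 if the line holds no field at all. */
int fs_name(const char *line, char *name)
{
    char f1[FIELD_MAX], f2[FIELD_MAX];
    int n;

    assert(line != NULL && name != NULL);
    n = sscanf(line, "%63s%63s", f1, f2);
    if (n == 2)
        strcpy(name, f2);   /* "nodev<TAB>usbfs": the name is field 2 */
    else if (n == 1)
        strcpy(name, f1);   /* "<TAB>ext3": %s skipped the leading tab */
    else
        return 0;           /* blank line: sscanf returned 0 or EOF */
    return 1;
}
```

This is a field-counting shortcut: it cannot tell "\tfoo" from "foo", and it silently ignores anything after the second field.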

--
Dec 17 '06 #4
Richard wrote:
> Which way would you guys recommend to best parse a multiline file which contains
> two fields separated by a tab? In this case it's the
> Linux /proc/filesystems file, a sample of which I have included below:
>
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
>
> The first field can be "empty" and consist of only a single tab
> character. The separator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?

strtok(..., "\t") will give the same result for "\tfoo"
and "\t\tfoo\t" and "foo". If you *know* that the input has
two tab-separated fields and that only the first (never the
second) can be empty, you can get this to work: If strtok()
finds two fields they are #1 and #2, but if it finds only
one it is #2 with #1 empty.

However, it makes me queasy to put that much faith in an
input source I don't control programmatically. Who knows?
Maybe in six months somebody will extend the format, adding
an optional third field. If that happened, then the field-
counting approach would misinterpret "\tfoo\tbar" as if it
were "foo\tbar". It would be better to adopt a method that
would complain about "\tfoo\tbar" than to be fooled by it.

fgets() plus sscanf() is a possibility, but it's a bit
tricky to use: The obvious "%s\t%s" will not do what you
want. (The first "%s" will skip any leading white space,
leaving you in the same hole as the strtok() approach, and
the "\t" will match any amount of any kind of white space,
tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
a little better, but still wouldn't be fully satisfactory:
It would match the prefix of "foo\tbar baz goozle frobnitz"
without any warning of the trailing junk. You could use
"%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
consumed the entire string ...

... but wouldn't it be simpler just to pick the line
apart for yourself? Read it in with fgets(), use strchr()
to find the first tab (syntax error if there isn't one), and
the first (possibly empty) field is everything from the start
to just before the tab. Then start just after the tab and use
strchr() again to find the terminating '\n'; the second field
is everything from just after the tab to just before the '\n'
(syntax error if its length is zero). You can use strcspn()
to check that the second field contains no white space and
squawk if it does (somebody added a third field you don't
understand).
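
The steps in the last paragraph could be sketched roughly as below. The function name split_line and its return convention are assumptions made for illustration. It splits the buffer in place, so the caller passes the array that fgets() filled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: split one line, as read by fgets(), in place.
   On success returns 0 and points *f1 and *f2 at the two fields.
   Returns -1 on a syntax error, leaving the line unmodified. */
int split_line(char *line, char **f1, char **f2)
{
    char *tab, *nl;

    assert(line != NULL && f1 != NULL && f2 != NULL);

    tab = strchr(line, '\t');
    if (tab == NULL)
        return -1;              /* no tab at all */
    nl = strchr(tab + 1, '\n');
    if (nl == NULL)
        return -1;              /* line too long or not terminated */
    if (nl == tab + 1)
        return -1;              /* second field is empty */

    /* The first white space after the tab must be the newline itself;
       anything earlier means an extra field we do not understand. */
    if (tab + 1 + strcspn(tab + 1, " \t\r\f\v\n") != nl)
        return -1;

    *tab = '\0';                /* terminate field 1 (possibly empty) */
    *nl = '\0';                 /* terminate field 2 */
    *f1 = line;
    *f2 = tab + 1;
    return 0;
}
```

With this, "\tfoo\tbar\n" is rejected instead of being mistaken for "foo\tbar\n".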

> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.

The "most efficient manor" is the house of Usher. Resist
this unnecessary impulse for efficiency, lest your program meet
the same fate as did that storied manse.

(In other words: How long is this file, anyhow? How many
times will you scan its contents? If you sped up the scanning
by a factor of four hundred twenty gazillion, how much faster
would the program as a whole run? If you give your SUV a coat
of wax, will you improve its fuel economy by making it slipperier
or harm it by adding weight?)

--
Eric Sosman
es*****@acm-dot-org.invalid
Dec 17 '06 #5
On Sun, 17 Dec 2006 01:10:16 +0100, Richard <rg****@gmail.com> wrote:
> Which way would you guys recommend to best parse a multiline file
> which contains two fields separated by a tab? In this case it's the
> Linux /proc/filesystems file, a sample of which I have included below:
>
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
>
> The first field can be "empty" and consist of only a single tab
> character. The separator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?

strtok() is not so nice, because it tries to modify the string you pass
to it. I would probably use strcspn() for this, with something like:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 256

static void doline(char *buf, size_t bufsize);

int
main(void)
{
    char buf[MAXLINE];
    FILE *fp;

    /*
     * Add code here that opens /proc/filesystems file, instead of using
     * `stdin' as the input file.
     */
    fp = stdin;

    clearerr(fp);
    while (fgets(buf, sizeof buf, fp) != NULL) {
        doline(buf, sizeof buf);
    }
    if (ferror(fp) != 0) {
        perror("fgets");
        exit(EXIT_FAILURE);
    }
    /*
     * Add code here that closes the open file referenced by `fp'.
     */

    return EXIT_SUCCESS;
}

static void
doline(char *buf, size_t bufsize)
{
    char *field;
    size_t pos, pos2, fieldsize;

    assert(buf != NULL && bufsize > 0);
    (void)bufsize;

    /* Offset of the first TAB, or of the final '\0' if there is none. */
    pos = strcspn(buf, "\t");
    if (buf[pos] == '\0') {
        fprintf(stderr,
            "warning: no TAB in `%s', skipping this line\n", buf);
        return;
    }
    /* Length of the second field, up to the next TAB or end of string. */
    pos2 = strcspn(buf + pos + 1, "\t");

    fieldsize = pos2 + 1;
    field = malloc(fieldsize);
    if (field == NULL) {
        perror("malloc");
        return;
    }
    strncpy(field, buf + pos + 1, fieldsize - 1);
    field[fieldsize - 1] = '\0';
    /* Cut the copy at the first newline or carriage return. */
    field[strcspn(field, "\n\r")] = '\0';
    printf("%s\n", field);
    free(field);
}

The trick is to use strcspn() to find the part of the original
string you are interested in; then you can do whatever you
like with that part. In this particular program, I temporarily
allocate a new string buffer, copy the original contents into it,
print the buffer and release its memory. Any other way you can
think of to use this substring is fine too :)

Dec 27 '06 #6
On Sun, 17 Dec 2006 10:37:28 -0500, Eric Sosman
<es*****@acm-dot-org.invalid> wrote:

> Richard wrote:
>> Which way would you guys recommened to best parse a multiline file which contains
>> two fields seperated by a tab. <snip>
> strtok(..., "\t") will [lose empty fields]

Right.

> fgets() plus sscanf() is a possibility, but it's a bit
> tricky to use: The obvious "%s\t%s" will not do what you
> want. (The first "%s" will skip any leading white space,
> leaving you in the same hole as the strtok() approach, and
> the "\t" will match any amount of any kind of white space,
> tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
> a little better, but still wouldn't be fully satisfactory:

Not enough better. If the first field is empty and thus the first
%[^\t] matches nothing, *scanf stops and doesn't do the %*1[\t]s.

This is effectively the same problem of the people who periodically
try to use {,f}scanf to replace <ILLEGAL> fflush (input) </>.
(Some people, including IIRC Dan Pop, have recommended e.g.
    if( scanf ("%*[^\n]%*1[\n]") < 2 ) getchar ();
but I consider that too much uglier than the obvious, though slightly
longer and possibly slightly less efficient
    while( (ch = getchar()) != EOF && ch != '\n' ) ;
etc.)

Plus unbounded %[...] or %s risks buffer overflow and resulting UB.
You should specify a length at most one less than the buffer size.
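
Putting those two points together, a bounded sscanf() version might look like the sketch below. The name scan_fields and the 64-byte field limit are assumptions for illustration. The empty first field is handled separately, since %[^\t] cannot match zero characters, and %n is used to check that nothing but the newline is left over.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 64   /* assumed buffer size; the widths below are 63 */

/* Hypothetical helper: f1 and f2 must each hold FIELD_MAX bytes.
   Returns 0 on success, -1 if the line is malformed. */
int scan_fields(const char *line, char *f1, char *f2)
{
    int used = 0;

    assert(line != NULL && f1 != NULL && f2 != NULL);

    if (line[0] == '\t') {
        /* Empty first field: step over the tab by hand. */
        f1[0] = '\0';
        if (sscanf(line + 1, "%63[^ \t\n]%n", f2, &used) != 1)
            return -1;
        used += 1;              /* count the tab we skipped */
    } else {
        if (sscanf(line, "%63[^\t\n]%*1[\t]%63[^ \t\n]%n",
                   f1, f2, &used) != 2)
            return -1;
    }

    /* Only the end of the string or a lone newline may remain. */
    if (line[used] != '\0' && strcmp(line + used, "\n") != 0)
        return -1;
    return 0;
}
```

The second field uses a scanset instead of %s so that no leading white space is skipped. A field longer than 63 characters is reported as malformed, because the leftover characters fail the final check.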

> It would match the prefix of "foo\tbar baz goozle frobnitz"
> without any warning of the trailing junk. You could use
> "%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
> consumed the entire string ...
>
> ... but wouldn't it be simpler just to pick the line
> apart for yourself? Read it in with fgets(), use strchr()
> to find the first tab <snip>

Yes.

> The "most efficient manor" is the house of Usher. Resist
> this unnecessary impulse for efficiency, lest your program meet
> the same fate as did that storied manse.

Yes. Or even the hundred-year shay, IIRC grade school. <G>

- David.Thompson1 at worldnet.att.net
Jan 3 '07 #7
