Bytes | Developer Community
fgets() and embedded null characters

Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.

So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
at all after the fact if there are embedded nulls in the input)?

2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not reinvent
this wheel.

Thanks,

David Mathog
ma****@caltech.edu
Nov 14 '05 #1


David Mathog wrote:
Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.
As an aside, a file containing '\0' characters is not
suitable for reading with a text stream. Section 7.19.2
paragraph 2 describes the "expected form" of a text stream:
printing characters and a small group of control characters,
plus a few other conventions. If you write a '\0' to a
text stream it's not guaranteed that you can read it back,
not even if you use getc().

If the data can include '\0' (more generally, if it
doesn't follow the expected conventions for text), you can
use a binary stream. But then one must question the wisdom
of using fgets(), which is specifically designed for textual
input in units of lines. The Standard doesn't prohibit using
fgets() with a binary stream, but fread() might be better.
So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
at all after the fact if there are embedded nulls in the input)?
"It was just one of those things,
Just one of those crazy flings,
One weird design to raise Hell with strings,
Just one of those things."

.... and there are plenty of other examples of library functions
that echo back what you already know instead of telling you
something useful. The folks who invented fgets() (and gets(),
and strcat(), and ...) lacked our twenty-twenty hindsight.
2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not reinvent
this wheel.


I don't know of a function with quite this specification,
although somebody may have written one (it seems everybody
eventually writes himself an fgets() replacement). If you
wind up rolling your own, I'd suggest getc() instead of fgetc().
Also, while conditions A-D seem entirely reasonable, point E
seems more involved than it needs to be: it would seem that
most calls would need to be accompanied by a bunch of bit-testing,
increasing the "clunkiness" of the interface. Note that the
feof() and ferror() functions can already discriminate cases
E1 and E4; is it really worthwhile to call out E2 separately?
Absent dynamic allocation you need *some* way of discriminating
between "line too long" and "line that just fits but ends with
EOF instead of newline," but perhaps a simple convention about
using or not using the last spot in the buffer might handle it
with a slimmer interface.
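For instance, one such convention can be sketched as follows (line_was_truncated is a hypothetical helper, not something from the thread): sacrifice the last slot as a sentinel and inspect it after fgets() returns.

```c
#include <stdio.h>

/* Convention: poison the last slot before calling fgets().
   fgets() writes a '\0' there only when it consumed the whole
   buffer, so if the sentinel survives, the line fit comfortably.
   Returns 1 when the buffer was filled without a newline. */
int line_was_truncated(char *buf, size_t size, FILE *fp)
{
    if (size < 2)
        return 0;
    buf[size - 1] = 'x';              /* any non-'\0' sentinel */
    if (fgets(buf, (int)size, fp) == NULL)
        return 0;                     /* EOF or read error */
    return buf[size - 1] == '\0' && buf[size - 2] != '\n';
}
```

Note it still reports a final line that exactly fills the buffer and ends at EOF as truncated; telling those apart needs a follow-up read, which is precisely the ambiguity mentioned above.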

--
Er*********@sun.com

Nov 14 '05 #2
David Mathog wrote:

Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.

So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
at all after the fact if there are embedded nulls in the input)?
Because somebody wrote it that way about 30 years ago, and a change
would break all sorts of existing code.

2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not
reinvent this wheel.


Bingo. Except you would be well advised to use getc rather than
fgetc. BTW, if your files have '\0' (nul, not null) chars in them,
they are not textfiles, and you will need to face the non-portable
treatment of line endings.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
Nov 14 '05 #3
Eric Sosman wrote:
As an aside, a file containing '\0' characters is not
suitable for reading with a text stream. Section 7.19.2
paragraph 2 describes the "expected form" of a text stream:
printing characters and a small group of control characters,
plus a few other conventions. If you write a '\0' to a
text stream it's not guaranteed that you can read it back,
not even if you use getc().


Sure. Unfortunately in the real world I sometimes encounter
files that do contain embedded null characters but are
otherwise normal text files.

Both responses so far said to use getc instead of fgetc,
is that for speed?

Here's a first pass at this function. Before everybody jumps
on the name please note that super_fgets()
doesn't imply that it is better than fgets(), just that it does more.
And no I have not tested it very thoroughly yet.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF) and SFG_EMBEDDED_NULL. It does not correct these,
just warns that they exist. The test for the trailing \r is nearly
free but the test for embedded NULL will slow things down a bit,
though less, I think, than testing for the embedded null characters
after this routine returns, since the character will already be
loaded in a CPU register.
/* super_fgets() status bits, put in a header file */
#define SFG_EOF 1 /* input terminated by End of File */
#define SFG_EOL 2 /* input terminated by End of line (\n) */
#define SFG_CRLF 4 /* input terminated by CRLF (\r\n) \r remains! */
#define SFG_EMBEDDED_NULL 8 /* embedded NULL characters are present */
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
#define SFG_READERROR 32 /* unrecoverable read error */

/* super_fgets is implemented at the fgetc level. It does the following:
A: reads from a stream (like fgets)
B: accepts a preallocated buffer (like fgets)
C: accepts the size of that preallocated buffer (like fgets)
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)
Input is terminated by either EOL (\n) or EOF.
E: sets the position of the terminating null =
number of characters read (size_t)
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)
Limitations: not a drop-in fgets() replacement!

*/

void super_fgets(char *string, size_t size, FILE *stream,
                 size_t *cterm, unsigned int *status){

    size_t icterm;        /* internal cterm value */
    unsigned int istatus; /* internal status value */
    size_t lastslot;      /* the last character cell in the buffer */
    int readthis;         /* the character which was read */

    icterm   = 0;
    istatus  = 0;
    lastslot = size-1;

    while(1){

        if(icterm == lastslot){
            istatus |= SFG_BUFFER_OVERFLOW;
            break;
        }

        readthis = fgetc(stream);

        if(readthis == EOF){
            /* either the end of the file or a
               read error, figure out which */
            if(feof(stream)){ istatus |= SFG_EOF;       }
            else            { istatus |= SFG_READERROR; }
            break;
        }

        if(readthis == '\n'){
            /* LF is a line terminator, return what has been read so far.
               NOTE, the \n is NOT returned!!!  On \r\n terminated input
               files the trailing \r may be present, check and
               signal that too. */
            istatus |= SFG_EOL;
            if( (icterm>0) && (string[icterm-1]=='\r') ) istatus |= SFG_CRLF;
            break;
        }

        /* warn about embedded null characters */
        if(readthis == '\0') istatus |= SFG_EMBEDDED_NULL;

        string[icterm] = readthis;
        icterm++;

    }
    string[icterm] = '\0';
    *status = istatus;
    *cterm  = icterm;
    return;
}
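With cterm in hand, the embedded nulls can at last be handled after the fact; a minimal sketch (scrub_nulls is a hypothetical helper name) of the kind of repair that strlen()-based code cannot do:

```c
#include <stddef.h>

/* Replace every '\0' in the first n bytes of s with a space and
   return how many were replaced.  Only possible because n -- the
   true number of bytes read -- is known; strlen() would stop at
   the first embedded nul. */
size_t scrub_nulls(char *s, size_t n)
{
    size_t i, fixed = 0;
    for (i = 0; i < n; i++) {
        if (s[i] == '\0') {
            s[i] = ' ';
            fixed++;
        }
    }
    return fixed;
}
```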
Regards,

David Mathog
ma****@caltech.edu
Nov 14 '05 #4
In article <d1**********@naig.caltech.edu>,
David Mathog <ma****@caltech.edu> wrote:

[much snippage]
Here's a first pass at [a fgets replacement]
....and some of my comments after a first pass at reading it.

It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)
The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
A different name (possibly SFG_BUFFER_FULL) would be more accurate for
this one, since you don't actually overflow the buffer (unless you're
given a too-small size).
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)
fgets does do this; it reads at most (size-1) bytes and always gives
back a '\0'-terminated string.
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)
void super_fgets(char *string, size_t size, FILE *stream,
size_t *cterm, unsigned int *status){
It would probably be better to return the status instead of storing it
through a pointer you get. That would allow the common idiom of doing
something like:
--------
if((ret=super_fgets(buf,sizeof buf,stdin,&last))!=SFG_EOL)
{
/*Something's not quite right, deal with it*/
}
--------
(or even:
--------
if(super_fgets(buf,sizeof buf,stdin,&last)!=SFG_EOL)
{
/*This trivial sample program isn't worth doing proper error-handling
in, as that would clutter the program and obscure the real point
*/
exit(EXIT_FAILURE);
}
--------
) rather than having to do (no harder, really, but less common and
therefore less immediately recognizable)
--------
super_fgets(buf,sizeof buf,stdin,&last,&ret);
if(ret!=SFG_EOL)
{
/*Something's not quite right, deal with it*/
}
--------

while(1){

if(icterm == lastslot){
istatus |= SFG_BUFFER_OVERFLOW;
break;
}
My first reaction when I see something like this is that the check for
whether to break should be folded into the loop condition, since things
are often clearer that way, but it's not immediately clear whether that
can be done here without making readability worse rather than better.
/* LF is a line terminator, return what has been read so far,
NOTE, the \n is NOT returned!!!


Which is "better" depends more on your own preferences than anything
else, since either way you're providing enough information to trivially
reconstruct the other, but my preference would be to leave a '\n'
that's read from the stream in the buffer rather than removing it, so
that what the caller fgets is exactly what was read from the stream.
(This makes read-and-dump simpler, and if you're already doing more
than read-and-dump with it it's trivial to add removing the newline,
if you don't want it, to what you're doing.)
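For what it's worth, fgets() itself follows the keep-the-newline convention, which is what keeps the plain read-and-dump loop this small (a sketch; dump_lines is a hypothetical name):

```c
#include <stdio.h>

/* Copy in to out; because fgets() leaves the '\n' in the buffer,
   nothing has to be re-added on output.  Returns the number of
   buffers written (which equals lines only when every line fits). */
long dump_lines(FILE *in, FILE *out)
{
    char buf[256];
    long count = 0;
    while (fgets(buf, sizeof buf, in) != NULL) {
        fputs(buf, out);
        count++;
    }
    return count;
}
```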
dave

--
Dave Vandervies dj******@csclub.uwaterloo.ca
If Dennis Ritchie starts trolling in here, you can be certain we'll
killfile him too.
--Joona I Palaste in comp.lang.c
Nov 14 '05 #5
Dave Vandervies wrote:
In article <d1**********@naig.caltech.edu>,
David Mathog <ma****@caltech.edu> wrote:

[much snippage]
Here's a first pass at [a fgets replacement]


...and some of my comments after a first pass at reading it.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)


The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.


That only applies on a DOS/Windows type system. On Unix, even with the
file opened as a text stream, the CR will still be left on since Unix
uses just an LF to indicate the end of line.

However, in this case, since the file may have nul characters, I would
read it as binary anyway. For the types of SW I deal with I would treat
LF and CRLF identically, but that may not be appropriate for the OP.

<snip>
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.
Nov 14 '05 #6
David Mathog wrote:
Eric Sosman wrote:
As an aside, a file containing '\0' characters is not
suitable for reading with a text stream. Section 7.19.2
paragraph 2 describes the "expected form" of a text stream:
printing characters and a small group of control characters,
plus a few other conventions. If you write a '\0' to a
text stream it's not guaranteed that you can read it back,
not even if you use getc().

Sure. Unfortunately in the real world I sometimes encounter
files that do contain embedded null characters but are
otherwise normal text files.

Both responses so far said to use getc instead of fgetc,
is that for speed?


Yes. The idea is that getc may be a macro that can evaluate its
argument more than once, whereas fgetc (even if implemented as a macro
as well as a function) has to behave like a function. So getc can take
more shortcuts and be more efficient.
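The flip side of that licence is that getc() may expand its stream argument more than once, so the argument must not carry side effects; a cautionary sketch (read_first_bytes is a hypothetical helper):

```c
#include <stdio.h>

/* Read the first byte of each stream.  getc(files[i]) is safe
   because the argument has no side effects; getc(files[i++])
   would not be, since a macro getc may evaluate it twice.
   fgetc(files[i++]) would be fine -- fgetc must behave like a
   function. */
void read_first_bytes(FILE *files[], int nfiles, int out[])
{
    for (int i = 0; i < nfiles; i++)
        out[i] = getc(files[i]);
}
```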
Here's a first pass at this function. Before everybody jumps
on the name please note that super_fgets()
doesn't imply that it is better than fgets(), just that it does more.
And no I have not tested it very thoroughly yet.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.
You could use fread and buffer stuff yourself, but then people could not
mix calls to your super_fgets and the standard functions. The same
applies to non-standard lower-level functions such as the read function
provided as an extension on some implementations. So using getc is
probably the best way.

Anyway, I would be very surprised if the test for whether it was an end
of file or a read error would be a significant factor in the performance.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)
No chance of old Mac text files (or whichever system it was that used
LFCR) for the line termination?
and SFG_EMBEDDED_NULL. It does not correct these,
just warns that they exist. The test for the trailing \r is nearly
free but the test for embedded NULL will slow things down a bit,
though less, I think, than testing for the embedded null characters
after this routine returns, since the character will already be
loaded in a CPU register.
I agree that, for your purpose, this is probably the way to go.
/* super_fgets() status bits, put in a header file */
#define SFG_EOF 1 /* input terminated by End of File */
#define SFG_EOL 2 /* input terminated by End of line (\n) */
I would not bother with a status code for reaching the end of line as
this is the normal situation. Unless you are worried about whether the
last line in the file was terminated by a newline or by EOF.
#define SFG_CRLF 4 /* input terminated by CRLF (\r\n) \r remains! */
#define SFG_EMBEDDED_NULL 8 /* embedded NULL characters are present */
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
#define SFG_READERROR 32 /* unrecoverable read error */

/* super_fgets is implemented at the fgetc level. It does the following:
A: reads from a stream (like fgets)
B: accepts a preallocated buffer (like fgets)
C: accepts the size of that preallocated buffer (like fgets)
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)
Input is terminated by either EOL (\n) or EOF.
E: sets the position of the terminating null =
number of characters read (size_t)
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)
Limitations: not a drop in fgets() replacement!

*/

void super_fgets(char *string, size_t size, FILE *stream,
size_t *cterm, unsigned int *status){
I would make the function return the status rather than passing a
pointer to the status variable.
size_t icterm; /* internal cterm value */
unsigned int istatus; /* internal status value */
I would not bother with having these as local variables. I would just
work with *cterm (and *status if you want to return status that way).
The compiler should be able to handle optimising the accesses.
size_t lastslot; /* the last character cell in the buffer */
I would not bother with this variable either.
int readthis; /* the character which was read */

icterm = 0;
istatus = 0;
lastslot = size-1;

while(1){
Why not use a for so that initialisation and increment can be
encapsulated at one point?
if(icterm == lastslot){
istatus |= SFG_BUFFER_OVERFLOW;
break;
}
readthis=fgetc(stream);

if(readthis == EOF){
/* either the end of the file or a
read error, figure out which */
if(feof(stream)){ istatus |= SFG_EOF; }
else { istatus |= SFG_READERROR; }
break;
}

if(readthis == '\n'){
Why separate if statements, especially as they are mutually exclusive?
either "else if" or doing something with a switch would be clearer in my
opinion.
/* LF is a line terminator, return what has been read so far,
NOTE, the \n is NOT returned!!! On \r\n terminated input
files the trailing \r may be present, check and
signal that too. */

istatus |= SFG_EOL;
if( (icterm>0) && (string[icterm-1]=='\r')) istatus |= SFG_CRLF;
break;
}

/* warn about embedded null characters */
if(readthis == '\0')istatus |= SFG_EMBEDDED_NULL;

string[icterm] = readthis;
icterm++;

}
string[icterm]='\0';
*status = istatus;
*cterm = icterm;
return;

}


Starting from yours I would as a first hack change it to:

unsigned int super_fgets(char *string, size_t size, FILE *stream,
                         size_t *cterm)
{
    int readthis;            /* the character which was read */
    unsigned int status = 0;

    for (*cterm = 0; *cterm < size-1; ++*cterm) {

        readthis = fgetc(stream);

        switch (readthis) {
        case EOF:
            /* either the end of the file or a
               read error, figure out which */

            string[*cterm] = '\0';

            if (feof(stream))
                return status | SFG_EOF;
            else
                return status | SFG_READERROR;

        case '\n':
            /* LF is a line terminator, return what has been read so
               far.
               NOTE, the \n is NOT returned!!!  On \r\n terminated input
               files the trailing \r may be present, check and
               signal that too. */

            string[*cterm] = '\0';

            if ((*cterm > 0) && (string[*cterm - 1] == '\r'))
                status |= SFG_CRLF;

            return status | SFG_EOL;

        case '\0':
            status |= SFG_EMBEDDED_NULL;
            /* fall through to default case */

        default:
            string[*cterm] = readthis;
            break;
        }
    }

    string[*cterm] = '\0';
    return status | SFG_BUFFER_OVERFLOW;
}

This is completely untested, but I think a bit tidier than yours. If I
was designing from scratch, and not doing it late at night, I might do
it differently.
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.
Nov 14 '05 #7
In article <hs************@brenda.flash-gordon.me.uk>,
Flash Gordon <sp**@flash-gordon.me.uk> wrote:
Dave Vandervies wrote:

The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.


That only applies on a DOS/Windows type system. On Unix, even with the
file opened as a text stream, the CR will still be left on since Unix
uses just an LF to indicate the end of line.


If it's on a unix system and has a CR, it's not a well-formed text file.
(In that case, it's most likely an incorrectly imported file from
another system.)

(I believe older MacOS systems used CR-only as their line delimiter.
A MacOS program opening such a file in text mode would get the appropriate
translation to '\n' for end-of-line done for it by the library; copying
the file to a Unixish (or DosWindowsish) system without translating
appropriately would give you something other than a well-formed text
file, just as copying between Unixish and DosWindowsish systems without
translating line-break conventions would.)
dave

--
Dave Vandervies dj******@csclub.uwaterloo.ca
If Dennis Ritchie starts trolling in here, you can be certain we'll
killfile him too.
--Joona I Palaste in comp.lang.c
Nov 14 '05 #8
David Mathog wrote:
Eric Sosman wrote:
As an aside, a file containing '\0' characters is not
suitable for reading with a text stream. Section 7.19.2
paragraph 2 describes the "expected form" of a text stream:
printing characters and a small group of control characters,
plus a few other conventions. If you write a '\0' to a
text stream it's not guaranteed that you can read it back,
not even if you use getc().
Sure. Unfortunately in the real world I sometimes encounter
files that do contain embedded null characters but are
otherwise normal text files.


Then they are not text files. They are binary files, and should be
so treated.

Both responses so far said to use getc instead of fgetc,
is that for speed?
Yes. getc can be a macro, and can operate directly on the system
buffers, and thus avoid the overhead of a system function call.

Here's a first pass at this function. Before everybody jumps
on the name please note that super_fgets()
doesn't imply that it is better than fgets(), just that it does more.
And no I have not tested it very thoroughly yet.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.
It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF) and SFG_EMBEDDED_NULL. It does not correct these,
just warns that they exist. The test for the trailing \r is nearly
free but the test for embedded NULL will slow things down a bit.
However less I think than testing for the embedded null characters
after this routine is called, since the character will already be
loaded in a CPU register.

/* super_fgets() status bits, put in a header file */
#define SFG_EOF 1 /* input terminated by End of File */
#define SFG_EOL 2 /* input terminated by End of line (\n) */
#define SFG_CRLF 4 /* input terminated by CRLF (\r\n) \r remains! */
#define SFG_EMBEDDED_NULL 8 /* embedded NULL characters are present */
#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */
#define SFG_READERROR 32 /* unrecoverable read error */
Better to define an enumeration, so those values and names will
show up in a debugger.

typedef enum sfgRESULT {SFGOK, SFGEOF, SFGEOL, SFGCRLF=4,
SFGNULL=8, ..... } SFGRESULT;

/* super_fgets is implemented at the fgetc level. It does the following:
A: reads from a stream (like fgets)
B: accepts a preallocated buffer (like fgets)
C: accepts the size of that preallocated buffer (like fgets)
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)
fgets always terminates with a '\0'. It omits the \n when the full
line doesn't fit the buffer.
Input is terminated by either EOL (\n) or EOF.
Bad idea. EOF doesn't fit into a char. That's why getc etc. return
int. What about a Mac file, which terminates lines with \r and no
\n? What about systems that don't terminate lines with anything?
E: sets the position of the terminating null =
number of characters read (size_t)
F: sets a status integer where the bits are as
defined in the table far above (SFG_*)

Limitations: not a drop in fgets() replacement!


Too complex an interface. You yourself will forget how to call it
in a short while. Hell, I can never remember whether strcpy copies
to or from the first parameter. KISS. See:

<http://cbfalconer.home.att.net/download/ggets.zip>

where I managed to make the interface simple enough for even me to
remember it. Then explain why your routine is a significant
advance on fread. Quickly now, does the file parameter come first
or last? Why?

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
Nov 14 '05 #9
Dave Vandervies wrote:
In article <d1**********@naig.caltech.edu>,
David Mathog <ma****@caltech.edu> wrote:

[much snippage]
Here's a first pass at [a fgets replacement]
...and some of my comments after a first pass at reading it.

It has two warning fields: SFG_CRLF, indicating the presence of
a CRLF (vs. a LF)


The way you state it makes it seem as if you're aware of this, but
it's worth explicitly noting that if you're reading a well-formed text
file that was opened correctly (that is, not in binary mode), you
won't see CRLF line endings.

#define SFG_BUFFER_OVERFLOW 16 /* input buffer full */

A different name (possibly SFG_BUFFER_FULL) would be more accurate for
this one, since you don't actually overflow the buffer (unless you're
given a too-small size).
D: terminates the characters read with a '\0' in all cases
(unlike fgets on a read that won't fit into the buffer)


fgets does do this; it reads at most (size-1) bytes and always gives
back a '\0'-terminated string.


ITYM almost always, because when fgets returns NULL, the string is not
guaranteed to be nul terminated (much like gets or scanf() with %s).
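Given that, a cheap defensive habit (a sketch; safer_fgets is a hypothetical wrapper, not a standard function) is to guarantee the buffer is a valid string even on failure:

```c
#include <stdio.h>

/* Like fgets(), but on EOF or error the buffer becomes "" instead
   of being left indeterminate. */
char *safer_fgets(char *buf, int size, FILE *fp)
{
    char *r = fgets(buf, size, fp);
    if (r == NULL && size > 0)
        buf[0] = '\0';
    return r;
}
```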

Nov 14 '05 #10
David Mathog wrote:
Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.
Some time ago, I was involved in a flamefest in this newsgroup that
essentially claimed that your situation did not exist in the real
world. That is to say, they think you are lying.
So two questions:

1. Why did the folks who wrote fgets() have a successful
read return a pointer to the storage buffer (which the
calling routine already knew in any case) instead of the
number of characters read (which often cannot be determined
at all after the fact if there are embedded nulls in the input)?
The C language designers were "hackers". It did the job they wanted it
to do at the time. They had no consideration for future ramifications.
In fact they claim they had no expectation that C would be used
outside of Unix and some utilities for said OS. gets() is of course
far worse, but thinking about that will help you understand just how
little forethought was put into the design of the C language.

Thankfully in the 3 decades since they first designed this language,
the ANSI C committee have fixed all the warts of the language and no
longer do we have to endure the embarrassment that is the C language
library and ... oh wait, sorry no, that was just a dream I had,
nevermind.
2. Can somebody please supply a pointer to a function
written in ANSI C that:

A) reads from a stream (like fgets)
B) stores to a preallocated buffer (like fgets)
C) accepts the size of the buffer (like fgets)
D) returns the number of characters read (unlike fgets)
E) sets read status, ideally in an integer combining
status bits more or less like these:
1 EOF
2 LINETOOBIG (instead of having to check the last byte)
4 READERROR (any other kind of READ error)
(read status = 1 with a nonzero returned length would
not be an error, it just indicates that all input data
has been consumed.)

If need be I can roll my own from fgetc, but I'd rather not reinvent
this wheel.


Ask, and ye shall receive:

http://www.azillionmonkeys.com/qed/userInput.html

It doesn't exactly do the things you ask for, but rather is a general
enough framework for you to easily write your own callbacks and
wrappers which have exactly the behaviors you desire.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Nov 14 '05 #11
we******@gmail.com writes:
David Mathog wrote:
Every so often one of my fgets() based programs encounters
an input file containing embedded nulls. fgets is happy to
read these but the embedded nulls subsequently cause problems
elsewhere in the program. Since fgets() doesn't return
the number of characters read it is pretty tough to handle
the embedded nulls once they are in the buffer.


Some time ago, I was involved in a flamefest in this newsgroup that
essentially claimed that your situation did not exist in the real
world. That is to say, they think you are lying.


I don't recall any such discussion here. Can you provide a citation?
Could you have misinterpreted something?

It doesn't surprise me that a program using fgets() would have
problems reading a file with embedded nuls. (Arguably such a file is
not a text file, and you therefore shouldn't be using fgets() to read
it, but determining that before you try to read the file can be
difficult.)
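If an extra pass over the file is acceptable, that determination can be made up front; a sketch (stream_has_nul is a hypothetical name, and the stream should be opened in binary mode so no translation hides anything):

```c
#include <stdio.h>

/* Return 1 if the stream contains a nul byte, 0 if not, -1 on a
   read error.  Rewinds the stream when done (rewind also clears
   the end-of-file and error indicators). */
int stream_has_nul(FILE *fp)
{
    int c, found = 0;
    while ((c = getc(fp)) != EOF) {
        if (c == '\0') {
            found = 1;
            break;
        }
    }
    if (!found && ferror(fp))
        found = -1;
    rewind(fp);
    return found;
}
```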

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #12
Dave Vandervies wrote:
In article <hs************@brenda.flash-gordon.me.uk>,
Flash Gordon <sp**@flash-gordon.me.uk> wrote:
Dave Vandervies wrote:
The way you state it makes it seem as if you're aware of this, but it's
worth explicitly noting that if you're reading a well-formed text file
that was opened correctly (that is, not in binary mode), you won't see
CRLF line endings.


That only applies on a DOS/Windows type system. On Unix, even with the
file opened as a text stream, the CR will still be left on since Unix
uses just an LF to indicate the end of line.


If it's on a unix system and has a CR, it's not a well-formed text file.
(In that case, it's most likely an incorrectly imported file from
another system.)


The OP has already said that the files may contain null characters and
so are badly formed, therefore it is entirely possible that the CRs in
the file are another sign of badly formed files.
(I believe older MacOS systems used CR-only as their line delimiter.
Possibly. I'm sure there is one system which used LFCR as well.
A MacOS program opening such a file in text mode would get the appropriate
translation to '\n' for end-of-line done for it by the library; copying
the file to a Unixish (or DosWindowsish) system without translating
appropriately would give you something other than a well-formed text
file, just as copying between Unixish and DosWindowsish systems without
translating line-break conventions would.)


I've made a lot of use of FTP in binary and text modes between *nix and
Windows so I know what is meant to be done. However, the OP knows he is
dealing with badly formed text files, so we cannot infer what system he
is on based on what the files contain. In his position I would open the
files in binary mode and handle line termination myself.
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.
Nov 14 '05 #13
On Thu, 17 Mar 2005 08:17:18 +0000, Flash Gordon
<sp**@flash-gordon.me.uk> wrote:
Dave Vandervies wrote:
In article <hs************@brenda.flash-gordon.me.uk>,
Flash Gordon <sp**@flash-gordon.me.uk> wrote:

If it's on a unix system and has a CR, it's not a well-formed text file.
(In that case, it's most likely an incorrectly imported file from
another system.)


The OP has already said that the files may contain null characters and
so are badly formed, therefore it is entirely possible that the CRs in
the file are another sign of badly formed files.


Have you considered that the files may be on a shared filesystem,
accessed by both Unix and Win/DOS? Since NFS, for example, has no idea
of what is accessing the files or whether they are supposed to be 'text'
or 'binary' it can't do any translation.

Real World(tm) programs which are well-written will take account of
that, and where they can they will handle all common line endings (CR,
LF, CRLF, possibly LFCR) transparently (Vim, for instance, will try to
recognise the type of file, and will allow the user to change it if it
gets it wrong).
(I believe older MacOS systems used CR-only as their line delimiter.


Possibly. I'm sure there is one system which used LFCR as well.
A MacOS program opening such a file in text mode would get the appropriate
translation to '\n' for end-of-line done for it by the library; copying
the file to a Unixish (or DosWindowsish) system without translating
appropriately would give you something other than a well-formed text
file, just as copying between Unixish and DosWindowsish systems without
translating line-break conventions would.)


I've made a lot of use of FTP in binary and text modes between *nix and
Windows so I know what is meant to be done. However, the OP knows he is
dealing with badly formed text files, so we cannot infer what system he
is on based on what the files contain. In his position I would open the
files in binary mode and handle line termination myself.


With the NUL characters it's messy, because any string containing them
will be bound to have problems elsewhere. For just Win/DOS and *ix
files the easiest thing is to open in text mode, where CRLF will either
be replaced by \n automatically or will give \r\n, and detect the latter
and replace it. If input is in a loop it's easy enough to detect CR as
a line ending as well. Note that in the following I (a) define CR and
LF as the explicit hex ASCII values; (b) don't return a flag as to which
line ending was present; (c) don't differentiate between EOF and error.
Those things, if wanted, are left as an exercise for the reader. I've
included the test program, a hex dump of my test file, and the output...

#include <stdio.h>
#include <ctype.h>

#define CR 0x0D
#define LF 0x0A

/**
 * Get a line from the input stream *fp, up to size-1 characters, nul
 * terminated. Handles line terminators LF, CR, CRLF and LFCR,
 * treats them all as \n. Returns number of characters in the
 * buffer, not including the trailing NUL (0x00) character (returns
 * zero for EOF at start of line).
 * @param fp file pointer to open input stream.
 * @param buff pointer to buffer.
 * @param size size of input buffer, including trailing nul.
 * @return number of characters read.
 */
size_t getLine(FILE *fp, char *buff, size_t size)
{
    size_t n = 0;
    int c;
    while (n + 1 < size && (c = getc(fp)) != EOF)
    {
        if (c == CR)
        {
            if ((c = getc(fp)) != LF && c != EOF)
                ungetc(c, fp);
            buff[n++] = '\n';
            break;
        }
        else if (c == LF)
        {
            if ((c = getc(fp)) != CR && c != EOF)
                ungetc(c, fp);
            buff[n++] = '\n';
            break;
        }
        buff[n++] = c;
    }
    buff[n] = '\0';
    return n;
}

int main(int argc, char **argv)
{
    int i;
    for (i = 1; i < argc; i++)
    {
        FILE *fp = fopen(argv[i], "r");
        if (fp)
        {
            char buff[16];
            int n;
            while ((n = getLine(fp, buff, 16)) > 0)
            {
                char *p = buff;
                printf("%4d:", n);
                while (*p)
                    if (isprint((unsigned char)*p))
                        printf(" %c", *p++);
                    else
                        printf(" 0x%.2X", *p++);
                printf("\n");
            }
            fclose(fp);
        }
    }
    return 0;
}

$ cc -pedantic -W -Wall getline.c

$ xdump /tmp/test
00000000: 4C 69 6E 65 20 74 65 72 6D 69 6E 61 74 65 64 20 |Line terminated |
00000010: 62 79 20 4C 46 0A 54 65 72 6D 20 77 69 74 68 20 |by LF.Term with |
00000020: 43 52 0D 54 65 72 6D 20 77 69 74 68 20 43 52 4C |CR.Term with CRL|
00000030: 46 0D 0A 54 65 72 6D 20 77 69 74 68 20 4C 46 43 |F..Term with LFC|
00000040: 52 0A 0D 4C 61 73 74 20 6C 69 6E 65 0A -- -- -- |R..Last line.___|

$ ./a.out /tmp/test
15: L i n e t e r m i n a t e d
7: b y L F 0x0A
13: T e r m w i t h C R 0x0A
15: T e r m w i t h C R L F 0x0A
15: T e r m w i t h L F C R 0x0A
10: L a s t l i n e 0x0A

(My getLine() also doesn't object to embedded NUL characters, although
the main program above will treat them as end of string as written; vim
won't let me insert them into a file and I couldn't be bothered to fire
up a hex editor to do it.)
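And since the OP also asked for status bits, here is one way to bolt them onto the same approach (a sketch: the RD_* names and exact semantics are my invention, following the OP's numbering, and LINETOOBIG is simply "buffer filled with no line terminator seen", with no peeking ahead; CR/CRLF handling as in getLine() could be folded in the same way):

```c
#include <stdio.h>

#define RD_EOF        1   /* all input consumed */
#define RD_LINETOOBIG 2   /* buffer filled before a line terminator */
#define RD_READERROR  4   /* ferror() reported a read error */

/* Read one '\n'-terminated line, like fgets(), but return the count
 * of characters stored (embedded '\0's included, trailing '\0'
 * excluded) and set *status to a combination of the RD_* bits. */
size_t getLineStatus(FILE *fp, char *buff, size_t size, int *status)
{
    size_t n = 0;
    int c = 0;

    *status = 0;
    while (n + 1 < size && (c = getc(fp)) != EOF)
    {
        buff[n++] = (char)c;
        if (c == '\n')
            break;
    }
    if (c == EOF)
        *status |= ferror(fp) ? RD_READERROR : RD_EOF;
    else if (n > 0 && n + 1 >= size && buff[n - 1] != '\n')
        *status |= RD_LINETOOBIG;
    buff[n] = '\0';
    return n;
}
```

As the OP noted, a nonzero return with RD_EOF set is not an error; it just means the last line had no trailing newline.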

Chris C
Nov 14 '05 #14
In article <sl******************@ccserver.keris.net>,
Chris Croughton <ch***@keristor.net> wrote:
:Have you considered that the files may be on a shared filesystem,
:accessed by both Unix and Win/DOS? Since NFS, for example, has no idea
:of what is accessing the files or whether they are supposed to be 'text'
:or 'binary' it can't do any translation.

NFS was also historically prone to dropping a block of nulls into
the middle of what was expected to be a text file, especially mailboxes
(locking issues...)
--
"This was a Golden Age, a time of high adventure, rich living and
hard dying... but nobody thought so." -- Alfred Bester, TSMD
Nov 14 '05 #15
Flash Gordon wrote:
David Mathog wrote:

Both responses so far said to use getc instead of fgetc,
is that for speed?

Yes. The idea is that getc will be a macro that can evaluate its
parameter more than once where as fgetc (even if implemented as a macro
as well as a function) has to behave like a function. So getc can take
more shortcuts and be more efficient.


Tested a few platforms and found only one where getc was faster
than fgetc (gcc 3.2.2 on Solaris 8). In all other cases they ran
at the same speed.

Ideally it would read at an even lower level than (f)getc so that the
secondary tests for EOF vs. read error wouldn't be necessary.

You could use fread and buffer stuff yourself, but then people could not
mix calls to your super_fgetc and the standard functions.


Hmm. Don't really want to read past the EOL with fread, which means
using it 1 char at a time like fgetc. Tried that and it ran at less
than half the speed of fgetc (gcc 3.3.2 on linux). Not surprising since
it is not usually called that way. The reason
I tried it is that when pure binary files were fed through
a pipe on windows xp to a tiny fgetc testbed they terminated
with an EOF only a few bytes into the program:

testfgetc <drivers.cab

where the loop being tested was:

while(fgetc(stdin) != EOF){}

So I tried replacing that with this loop:

while(fread(&readchar,1,1,stdin) !=0){}

and it terminated at the exact same place. A little googling
found some references to ^Z in the input stream having this effect.
Great. So there is apparently an intrinsic problem passing
binary data through Windows XP pipes. On linux and Solaris
both forms happily read through a pure binary file in a
pipe without throwing a premature EOF.

The pipe was eliminated on Windows by using an explicit:

fin=fopen(filename,"rb")

at which point fgetc and getc and fread all were able to read the binary
file correctly. Unfortunately at that point fgetc/getc stopped treating
CR LF as a line terminator and returned both characters instead of
just a single "\n". That's ugly enough on Windows but should
be truly hideous indeed for some of the text file formats provided by
RMS on VMS.

I'm not going to worry about an fgets() that can
read a "line" containing arbitrary binary characters in all situations
on all platforms. For now I just need one that can handle embedded
null characters in files which are otherwise valid text files. The one
posted here can apparently do that on both Windows and linux/Solaris.

Thanks,

David Mathog
ma****@caltech.edu
Nov 14 '05 #16
David Mathog <ma****@caltech.edu> writes:
[SNIP]
Hmm. Don't really want to read past the EOL with fread, which means
using it 1 char at a time like fgetc. Tried that and it ran at less
than half the speed of fgetc (gcc 3.3.2 on linux). Not surprising since
it is not usually called that way. The reason
I tried it is that when pure binary files were fed through
a pipe on windows xp to a tiny fgetc testbed they terminated
with an EOF only a few bytes into the program:

testfgetc <drivers.cab

where the loop being tested was:

while(fgetc(stdin) != EOF){}

So I tried replacing that with this loop:

while(fread(&readchar,1,1,stdin) !=0){}

and it terminated at the exact same place. A little googling
found some references to ^Z in the input stream having this effect.
Great. So there is apparently an intrinsic problem passing
binary data through Windows XP pipes. On linux and Solaris
both forms happily read through a pure binary file in a
pipe without throwing a premature EOF.


Sure, because stdin is a text stream, not a binary stream. If you
want to read binary data on stdin, you *might* be able to use
freopen(). It's implementation-defined whether this is allowed (and
I may be missing something else, since I've never tried this).

It works on Unix-like systems because they don't make a strong
distinction between text and binary files. EOF, for either text or
binary files, is marked by the end of the file, not by any special
character.
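For what it's worth, C99 does give freopen() a hook for this: a null filename asks the implementation to change the mode of the existing stream, and whether the change succeeds is implementation-defined, so the return value must be checked. A sketch (make_binary and count_bytes are illustrative names, not library functions):

```c
#include <stdio.h>

/* Try to put an already-open stream (e.g. stdin) into binary mode.
 * C99 permits freopen(NULL, mode, stream); whether the mode change
 * actually succeeds is implementation-defined, hence the check.
 * Returns nonzero on success. */
int make_binary(FILE *stream)
{
    return freopen(NULL, "rb", stream) != NULL;
}

/* Count every byte on a stream; in binary mode a ^Z (0x1A) in the
 * data no longer looks like end-of-file on DOS/Windows. */
long count_bytes(FILE *stream)
{
    long n = 0;
    while (getc(stream) != EOF)
        n++;
    return n;
}
```

On implementations where make_binary(stdin) succeeds, the testfgetc loop above should read straight through the ^Z in drivers.cab.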

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #17
On Thu, 17 Mar 2005 08:17:18 +0000, Flash Gordon <sp**@flash-gordon.me.uk>
wrote:
Possibly. I'm sure there is one system which used LFCR as well.


Ever wonder why CRLF is traditional? It's because the time it took for
the print head to actually return to the left margin usually was longer
than the time it took to feed the paper up a line, so CRLF saved time.

--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up
Per the FCA, this address may not be added to any commercial mail list
Nov 14 '05 #18
In article <ln************@nuthaus.mib.org>,
Keith Thompson <ks***@mib.org> wrote:
:Sure, because stdin is a text stream, not a binary stream. If you
:want to read binary data on stdin, you *might* be able to use
:freopen(). It's implementation-defined whether this is allowed

freopen() silently ignores failures to close the existing file,
and always opens the new file provided that appropriate access
exists (and that the file exists, or can be created, as appropriate).
freopen() does a full close first.

I suspect you may have been thinking of fdopen() instead of freopen().
--
Warning: potentially contains traces of nuts.
Nov 14 '05 #19
In article <5a********************************@4ax.com>,
Kevin D. Quitt <KQ****@IEEInc.com> wrote:
:Ever wonder why CRLF is traditional? It's because the time it took for
:the print head to actually return to the left margin usually was longer
:than the time it took to feed the paper up a line, so CRLF saved time.

But CR followed by a printable character was supposed to return to
margin and then print the new character at the beginning of the line.
Therefore the mechanism that implemented that had to have a look-ahead --
and that being the case, LFCR could have worked just as well.
--
Usenet is like a slice of lemon, wrapped around a large gold brick.
Nov 14 '05 #20
In article <d1**********@canopus.cc.umanitoba.ca>,
Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:
In article <5a********************************@4ax.com>,
Kevin D. Quitt <KQ****@IEEInc.com> wrote:
:Ever wonder why CRLF is traditional? It's because the time it took for
:the print head to actually return to the left margin usually was longer
:than the time it took to feed the paper up a line, so CRLF saved time.

But CR followed by a printable character was supposed to return to
margin and then print the new character at the beginning of the line.
Therefore the mechanism that implemented that had to have a look-ahead --
and that being the case, LFCR could have worked just as well.


How would that need lookahead? Anything that actually prints would have
to wait until the print head finishes moving, but feeding the paper could
be done in parallel, so it would just need something that would block on
printing characters until the CR finished but would be able to apply a LF
(or other paper movement, such as VT or FF) while the head was moving.
dave

--
Dave Vandervies dj******@csclub.uwaterloo.ca
The compiler could wait until the next local sunrise to produce its output if
it wants to issue a diagnostic, or produce output immediately if it doesn't;
then the sunrise could be considered a diagnostic. --Keith Thompson in CLC
Nov 14 '05 #21
On 17 Mar 2005 20:41:56 GMT, Walter Roberson
<ro******@ibd.nrc-cnrc.gc.ca> wrote:
In article <5a********************************@4ax.com>,
Kevin D. Quitt <KQ****@IEEInc.com> wrote:
:Ever wonder why CRLF is traditional? It's because the time it took for
:the print head to actually return to the left margin usually was longer
:than the time it took to feed the paper up a line, so CRLF saved time.

But CR followed by a printable character was supposed to return to
margin and then print the new character at the beginning of the line.
Therefore the mechanism that implemented that had to have a look-ahead --
and that being the case, LFCR could have worked just as well.


No, it simply meant that if you had CR followed by x then you needed to
delay sending x (often by inserting NUL), so printing LF CR x would have
needed to do it as LF CR NUL x whereas you could do CR LF x and save
time. Some systems would even adjust the number of NUL characters
according to the length of the previous line, so you could get:

1 CR LF
123 CR LF
1234567890ASDFGHJK CR LF NUL
1234567890QWERTYUIOPASDFGHJKLZXCVBNM1234567890 CR LF NUL NUL

etc. Mechanisms often jammed or did strange things if you tried to
print too fast (I remember printing words backwards during the carriage
return period when I got it wrong), there was no "flow control" on most
teleprinters in that sense (DC1 through DC3 were used for paper tape
control, mostly, so that the computer could control the tape reader and
punch, they weren't sent automatically to the computer).

There were also printers where the carriage could print in both
directions, so you could either wait for the carriage to return to the
left or you could send a "reverse direction" code and start printing
backwards. For that the computer had to do the buffering itself and
output characters in reverse order...

Chris C
Nov 14 '05 #22


Dave Vandervies wrote:
In article <d1**********@canopus.cc.umanitoba.ca>,
Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:
In article <5a********************************@4ax.com>,
Kevin D. Quitt <KQ****@IEEInc.com> wrote:
:Ever wonder why CRLF is traditional? It's because the time it took for
:the print head to actually return to the left margin usually was longer
:than the time it took to feed the paper up a line, so CRLF saved time.

But CR followed by a printable character was supposed to return to
margin and then print the new character at the beginning of the line.
Therefore the mechanism that implemented that had to have a look-ahead --
and that being the case, LFCR could have worked just as well.


How would that need lookahead? Anything that actually prints would have
to wait until the print head finishes moving, but feeding the paper could
be done in parallel, so it would just need something that would block on
printing characters until the CR finished but would be able to apply a LF
(or other paper movement, such as VT or FF) while the head was moving.


<off-topic>

Some printing terminals had no ability to block incoming
characters while waiting for the mechanical components to get
into position. It was the sender's responsibility to insert
a momentary pause after sending a "slow" control code -- and
can you guess how the pauses were implemented on machines that
often didn't have clocks like those we've become accustomed to?
Yes, folks: you inserted a bunch of '\0' characters and let the
transmission line do the timing for you ...

</off-topic>

--
Er*********@sun.com

Nov 14 '05 #23
ro******@ibd.nrc-cnrc.gc.ca (Walter Roberson) writes:
In article <ln************@nuthaus.mib.org>,
Keith Thompson <ks***@mib.org> wrote:
:Sure, because stdin is a text stream, not a binary stream. If you
:want to read binary data on stdin, you *might* be able to use
:freopen(). It's implementation-defined whether this is allowed

freopen() silently ignores failures to close the existing file,
and always opens the new file provided that appropriate access
exists (and the file exists or as appropriate.) freopen() does
a full close() first.

I suspect you may have been thinking of fdopen() instead of freopen().


No, I was thinking of freopen(); since fdopen() isn't standard C, I
probably wouldn't have mentioned it here. On the other hand, fdopen()
might be a solution, though not a 100% portable one. On the other
other hand, I've already exceeded the limits of my expertise, so
perhaps I'll just stop talking now.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #24
Eric Sosman <er*********@sun.com> writes:
[...]
<off-topic>

Some printing terminals had no ability to block incoming
characters while waiting for the mechanical components to get
into position. It was the sender's responsibility to insert
a momentary pause after sending a "slow" control code -- and
can you guess how the pauses were implemented on machines that
often didn't have clocks like those we've become accustomed to?
Yes, folks: you inserted a bunch of '\0' characters and let the
transmission line do the timing for you ...

</off-topic>


<still-off-topic>
Unix tty software still supports this kind of thing. "man tty" and/or
"man termio" for details.
</still-off-topic>

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #25
> we******@gmail.com writes:
...
Some time ago, I was involved in a flamefest in this newsgroup
that essentially claimed that your situation did not exist in
the real world. That is to say, they think you are lying.

Keith Thompson wrote:
I don't recall any such discussion here. Can you provide a citation?
Could you have misinterpreted something?


Paul Hsieh has even accused Dave Thompson of being unconstructive.

Need I say more...

--
Peter

Nov 14 '05 #26
Eric Sosman wrote:

Dave Vandervies wrote:
In article <d1**********@canopus.cc.umanitoba.ca>,
Walter Roberson <ro******@ibd.nrc-cnrc.gc.ca> wrote:

In article <5a********************************@4ax.com>,
Kevin D. Quitt <KQ****@IEEInc.com> wrote:
:Ever wonder why CRLF is traditional? It's because the time it took for
:the print head to actually return to the left margin usually was longer
:than the time it took to feed the paper up a line, so CRLF saved time.

But CR followed by a printable character was supposed to return to
margin and then print the new character at the beginning of the line.
Therefore the mechanism that implemented that had to have a look-ahead --
and that being the case, LFCR could have worked just as well.


How would that need lookahead? Anything that actually prints would have
to wait until the print head finishes moving, but feeding the paper could
be done in parallel, so it would just need something that would block on
printing characters until the CR finished but would be able to apply a LF
(or other paper movement, such as VT or FF) while the head was moving.

<off-topic>

Some printing terminals had no ability to block incoming
characters while waiting for the mechanical components to get
into position. It was the sender's responsibility to insert
a momentary pause after sending a "slow" control code -- and
can you guess how the pauses were implemented on machines that
often didn't have clocks like those we've become accustomed to?
Yes, folks: you inserted a bunch of '\0' characters and let the
transmission line do the timing for you ...

</off-topic>


<still off-topic>
But the '\0' itself was completely ignored. Blank Tape. '\177' (DEL) or
"all holes punched" was also completely ignored. I see no valid case for
NUL ('\0') in any text file. If it does exist, the I/O system should
ignore it in text mode. fgets() should never see '\0' in a text stream.
</still off-topic>
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #27
Joe Wright wrote:

<still off-topic>
But the '\0' itself was completely ignored. Blank Tape. '\177' (DEL) or
"all holes punched" was also completely ignored. I see no valid case for
NUL ('\0') in any text file. If it does exist, the I/O system should
ignore it in text mode. fgets() should never see '\0' in a text stream.
</still off-topic>


Do you consider a unicode file a text file? The C standard probably
doesn't but unfortunately a lot of people do and they mail them to me.
The first indication that you've got one is when "cat" works but
"grep" won't find any of the words that "cat" shows. On a Windows
system these act just like a text file: notepad, wordpad, and
DOS level TYPE and FIND all show the same thing, with no overt
indication that the file contains unicode.

Run "od -c" on a unicode file and you'll find that it starts with
a Byte Order Mark (FE FF). After that every other byte is null.
Take out the BOM and the null characters and you've got an ASCII file,
assuming it was originally written in english. Somewhat ironically
all of the ones I've seen so far have only LF EOLs after being processed
like this. This is for UTF-16. There's also
UTF-32 but thankfully nobody has sent me one of those yet.
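The stripping step described above is only a few lines of C. A sketch (it assumes, as above, a UTF-16 file whose characters are all ASCII; anything with real non-ASCII content needs a proper decoder such as iconv, and the function name is mine):

```c
#include <stdio.h>

/* Reduce an ASCII-only UTF-16 stream to plain bytes: drop the BOM
 * (0xFE/0xFF bytes, which cannot occur elsewhere in ASCII-only
 * UTF-16 data) and the zero half of each 16-bit code unit.
 * Both streams should be opened in binary mode. */
void utf16_to_ascii(FILE *in, FILE *out)
{
    int c;
    while ((c = getc(in)) != EOF)
    {
        if (c == 0xFE || c == 0xFF)   /* byte-order mark bytes */
            continue;
        if (c != 0x00)                /* null half of the code unit */
            putc(c, out);
    }
}
```

This works the same whether the BOM is FE FF or FF FE, since both bytes are simply skipped.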

Regards,

David Mathog
ma****@caltech.edu
Nov 14 '05 #28
Peter Nilsson wrote:
.... snip ...
Paul Hsieh has even accused Dave Thompson of being unconstructive.


Hsieh and I have recently had some words in comp.programming. They
started when I stated that some of his code was unnecessarily
non-portable, and grew rapidly from there.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
Nov 14 '05 #29
On Thu, 17 Mar 2005 21:23:26 +0000 (UTC),
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:


How would that need lookahead? Anything that actually prints would have
to wait until the print head finishes moving, but feeding the paper could
be done in parallel, so it would just need something that would block on
printing characters until the CR finished but would be able to apply a LF
(or other paper movement, such as VT or FF) while the head was moving.


We are talking about old mechanical teletypewriters with virtually no
character buffering whatsoever. Thus, when a character is received it
had better be printed immediately before the next arrives, and if the
carriage hasn't fully returned to the beginning of the line then the
character will be printed wherever the carriage happens to be; usually
in the middle of the line.
Villy
Nov 14 '05 #30
David Mathog <ma****@caltech.edu> wrote:
Joe Wright wrote:
<still off-topic>
But the '\0' itself was completely ignored. Blank Tape. '\177' (DEL) or
"all holes punched" was also completely ignored. I see no valid case for
NUL ('\0') in any text file. If it does exist, the I/O system should
ignore it in text mode. fgets() should never see '\0' in a text stream.
</still off-topic>
Do you consider a unicode file a text file? The C standard probably
doesn't but unfortunately a lot of people do and they mail them to me.


The C Standard doesn't consider anything a text file or not; it leaves
that up to the implementation. If you manage to get hold of an
implementation that can open both extended-ASCII and Unicode files as
text files, and decode them correctly, that's fine according to the
Standard.
Actually, the Standard does say one thing: all the characters of the
basic character set must be positive, and the null character must have
value 0. Both ASCII (and all variations on it I know) and Unicode have
this property, so both can be used.
The first indication that you've got one is when "cat" works but
"grep" won't find any of the words that "cat" shows. On a Windows
system these act just like a text file: notepad, wordpad, and
DOS level TYPE and FIND all show the same thing, with no overt
indication that the file contains unicode.


That's because AFAIK newer versions of MS Windows use Unicode under the
hood, and convert MS-ASCII files on the fly. To those programs, Unicode
files _are_ text files even in the C meaning of the word.

Richard
Nov 14 '05 #31
In article <sl****************@station02.ohout.pharmapartners.nl>,
Villy Kruse <nobody> wrote:
On Thu, 17 Mar 2005 21:23:26 +0000 (UTC),
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:

How would that need lookahead? Anything that actually prints would have
to wait until the print head finishes moving, but feeding the paper could
be done in parallel, so it would just need something that would block on
printing characters until the CR finished but would be able to apply a LF
(or other paper movement, such as VT or FF) while the head was moving.


We are talking about old mechanical teletypewriters with virtually no
character buffering whatsoever. Thus, when a character is received it
had better be printed immediately before the next arrives, and if the
carriage hasn't fully returned to the beginning of the line then the
character will be printed wherever the carriage happens to be; usually
in the middle of the line.


I was assuming (incorrectly, as noted elsethread) the ability to tell
the other end of the link "Don't send me any more characters until I've
had a chance to deal with the last one you sent".
This isn't so much buffering as extending the length of the "store"
part of a store-and-forward mechanism.
dave

--
Dave Vandervies dj******@csclub.uwaterloo.ca
You know, you should never say something like "No one is expecting..." on
Usenet. It is just too easy to disprove. The preferred form is "No one in his
right mind is expecting..." --Stephan H.M.J. Houben in comp.lang.scheme
Nov 14 '05 #32


Dave Vandervies wrote:
In article <sl****************@station02.ohout.pharmapartners.nl>,
Villy Kruse <nobody> wrote:
On Thu, 17 Mar 2005 21:23:26 +0000 (UTC),
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:

We are talking about old mechanical teletypewriters with virtually no
character buffering whatsoever. Thus, when a character is received it
had better be printed immediately before the next arrives, and if the
carriage hasn't fully returned to the beginning of the line then the
character will be printed wherever the carriage happens to be; usually
in the middle of the line.


I was assuming (incorrectly, as noted elsethread) the ability to tell
the other end of the link "Don't send me any more characters until I've
had a chance to deal with the last one you sent".
This isn't so much buffering as extending the length of the "store"
part of a store-and-forward mechanism.


<off-topic>

It's buffering, because it takes time for the "Please
stop" request to get back to the sender and for the sender
to act upon it, and during that time the characters keep
on coming.

Another method was to have the sender stop of its own
volition after sending the CR, until the terminal sent a
"Go ahead" when it was once again ready. This eliminated
the latency of the "Please stop" method, but at the cost
of some extra electronics in the terminal -- non-negligible
in the days of discrete-component circuit boards. It was
also prone to assorted synchronization deadlocks, where
each side was waiting for the other to utter something on
a silent (often half-duplex) line ...

Ah, those were the days! Less rosy by far than fading
memory paints them, but there's absolutely no denying that
they were "days."

</off-topic>

Nov 14 '05 #33
Dave Vandervies wrote:
Villy Kruse <nobody> wrote:
Dave Vandervies <dj******@csclub.uwaterloo.ca> wrote:

How would that need lookahead? Anything that actually prints
would have to wait until the print head finishes moving, but
feeding the paper could be done in parallel, so it would just
need something that would block on printing characters until
the CR finished but would be able to apply a LF (or other paper
movement, such as VT or FF) while the head was moving.


We are talking about old mechanical teletypewriters with virtually
no character buffering whatsoever. Thus, when a character is
received it had better be printed immediately before the next arrives,
and if the carriage hasn't fully returned to the beginning of the
line then the character will be printed wherever the carriage
happens to be; usually in the middle of the line.


I was assuming (incorrectly, as noted elsethread) the ability to
tell the other end of the link "Don't send me any more characters
until I've had a chance to deal with the last one you sent".
This isn't so much buffering as extending the length of the
"store" part of a store-and-forward mechanism.


With a 33 Teletype there was no store. The CR simply released a
catch, and a spring sent the carriage hurtling left, to eventually
be caught by a dashpot. Other actions could happen during the
hurtling, such as line feeding, or pounding out a character on the
fly. Hurtling termination involved shaking of the system, and
stand walking down the floor. At some point, lacking positional
maintenance, this was likely to break the electrical connections.

You could be fairly confident that the hurtling was done after one
spare character time, much more so if you also sent a nul to gobble
up another 100 millisecs.

I think there was one transistor in the system. It was large and
powerful, and I forget what it was for.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #34
On Fri, 18 Mar 2005 17:42:47 -0500, Eric Sosman
<er*********@sun.com> wrote:
<off-topic>

It's buffering, because it takes time for the "Please
stop" request to get back to the sender and for the sender
to act upon it, and during that time the characters keep
on coming.
Plus if you are sending as well you have to interrupt that to insert the
request. Unless you use the RTS/CTS/DTR/DCD lines, but they depended
(with modems) on actually switching the carrier off (and hence not
sending anything).
Another method was to have the sender stop of its own
volition after sending the CR, until the terminal sent a
"Go ahead" when it was once again ready. This eliminated
the latency of the "Please stop" method, but at the cost
of some extra electronics in the terminal -- non-negligible
in the days of discrete-component circuit boards. It was
also prone to assorted synchronization deadlocks, where
each side was waiting for the other to utter something on
a silent (often half-duplex) line ...
Electronics? I was talking about things like the Teletype(R) Model 33
ASR, which was totally mechanical. Inserting an X-OFF character
automatically would have been a real pain, there was no buffering at the
terminal at all.
Ah, those were the days! Less rosy by far than fading
memory paints them, but there's absolutely no denying that
they were "days."
Yup, they were days. And weeks...

(But at least you could read output as it appeared. 10 or possibly
30cps is readable, these things which scroll off the screen before you
even see that they've started are a pain...)
</off-topic>


Chris C
Nov 14 '05 #35
David Mathog wrote:
Joe Wright wrote:
<still off-topic>
But the '\0' itself was completely ignored. Blank Tape.
'\177' (DEL) or "all holes punched" was also completely
ignored. I see no valid case for NUL ('\0') in any text
file. If it does exist, the I/O system should ignore it
in text mode. fgets() should never see '\0' in a text
stream.
</still off-topic>
Do you consider a Unicode file a text file? The C standard probably
doesn't, but unfortunately a lot of people do, and they mail them to
me.


Yeah, it's called "globalization". Anyway, the best that the C library
can do is read it as binary; then you are on your own for decoding the
Unicode. Ironically, the wchar_t stuff is not a portable solution.
The first indication that you've got one is when "cat" works
but "grep" won't find any of the words that "cat" shows. On a
Windows system these act just like a text file: notepad,
wordpad, and DOS level TYPE and FIND all show the same thing,
with no overt indication that the file contains unicode.
Sounds like you need a better grep? :)
Run "od -c" on a unicode file and you'll find that it starts with
a Byte Order Mark (FE FF). After that every other byte is null.
Well, technically a UTF-16 file may start with either FE FF or FF FE,
and any of the octets that follow it may be NUL -- the encoding really
is a mapping to 16-bit values.
Take out the BOM and the null characters and you've got an ASCII
file, assuming it was originally written in English.
Better yet, convert it to UTF-8 and it remains as much ASCII as
required, while not losing any non-English characters. If you are
consistently seeing every other byte as NUL, then the author (or the
program the author is using) has almost certainly chosen a very
sub-optimal encoding (they should choose UTF-8 instead.)
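As a rough illustration of the "take out the BOM and the nulls" recipe discussed above, here is a minimal C sketch. The function name and interface are mine, not anything standard: it checks for a UTF-16 BOM (FE FF or FF FE), extracts the low bytes, and bails out if the content isn't plain ASCII underneath.

```c
#include <stddef.h>

/* Extract ASCII bytes from a buffer holding BOM-prefixed UTF-16 text.
 * Returns the number of bytes written to 'out', or 0 if there is no
 * UTF-16 BOM or the text is not pure ASCII under the encoding.
 * 'out' must be at least n/2 bytes. (Hypothetical helper, not from
 * the thread or any library.) */
size_t utf16_to_ascii(const unsigned char *in, size_t n, char *out)
{
    size_t i, o = 0;
    int big_endian;

    if (n < 2)
        return 0;
    if (in[0] == 0xFE && in[1] == 0xFF)
        big_endian = 1;              /* FE FF: big-endian BOM    */
    else if (in[0] == 0xFF && in[1] == 0xFE)
        big_endian = 0;              /* FF FE: little-endian BOM */
    else
        return 0;                    /* no BOM: leave it alone   */

    for (i = 2; i + 1 < n; i += 2) {
        unsigned hi = big_endian ? in[i]     : in[i + 1];
        unsigned lo = big_endian ? in[i + 1] : in[i];
        if (hi != 0 || lo > 0x7F)    /* not plain ASCII after all */
            return 0;
        out[o++] = (char)lo;
    }
    return o;
}
```

A real converter would of course go to UTF-8 rather than reject non-ASCII, but this is enough to un-mangle the "every other byte is null" files described above.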
[...] Somewhat ironically
all of the ones I've seen so far have only LF EOLs after being
processed like this. This is for UTF-16. There's also
UTF-32 but thankfully nobody has sent me one of those yet.


UTF-32 is mostly a "theoretical" transfer format. It's commonly used
internally within a program to simplify text data manipulation (and can
sometimes be mapped to "wchar_t"); however, nobody would ever use it as
a format for storing or sending a file. The reason is that UTF-16 is
never longer than UTF-32, and UTF-8 is often shorter than both (but
can sometimes be longer than UTF-16.)
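To make those size relationships concrete, here is a small sketch (the helper names are mine) of the bytes each encoding form needs per code point: UTF-8 uses 1-4 bytes, UTF-16 uses 2 or 4 (a surrogate pair above the BMP), and UTF-32 always uses 4.

```c
/* Bytes needed to encode one Unicode code point (cp <= 0x10FFFF)
 * in each transfer format. Helper names are illustrative only. */
int utf8_len(unsigned long cp)
{
    if (cp < 0x80)    return 1;   /* ASCII                    */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* rest of the BMP          */
    return 4;                     /* supplementary planes     */
}

int utf16_len(unsigned long cp)
{
    return cp < 0x10000 ? 2 : 4;  /* 4 = surrogate pair       */
}

int utf32_len(unsigned long cp)
{
    (void)cp;
    return 4;                     /* always one 32-bit unit   */
}
```

So for ASCII-heavy text UTF-8 wins, for BMP text like the Euro sign (U+20AC) UTF-16 is shorter, and neither ever exceeds UTF-32's fixed four bytes.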

---
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Nov 14 '05 #36