Bytes IT Community

how to delete a character in a file ?

Hi all,

I'm currently developing a tool to convert text files between linux,
windows and mac.

The end of a line is coded by 2 characters in windows, and only one in
unix & mac. So I have to delete a character at each end of a line.

car = fgetc(myFile);
while (car != EOF) {
    if (car == 13) {
        car2 = fgetc(myFile);
        if (car2 == 10) {
            // fseek of 2 characters
            // delete a character
            // overwrite the second character
        }
    }
    car = fgetc(myFile);  /* read the next character, otherwise the loop never ends */
}

How can I do that? Is there a function that I can use? I can't find
one in stdio.h.

thx in advance,

Jerem.
Nov 14 '05 #1
26 Replies


S!mb@ <S!mb@nop> wrote:
I'm currently developing a tool to convert text files between linux,
windows and mac.

the end of a line is coded by 2 characters in windows, and only one in
unix & mac. So I have to delete a character at each end of a line.

car = fgetc(myFile);
while (car != EOF) {
if (car == 13) {
Better use '\r' instead of some "magic" values.
car2 = fgetc(myFile) ;
if (car2 == 10) {
And that would be '\n'. BTW, when you open the file in text mode
you may never "see" the '\r' and '\n' as two separate characters
if the "\r\n" combination is the end of line marker on the system.
// fseek of 2 characters
// delete a character
// overwrite the second character

How can I do that? Is there a function that I can use? I can't find
one in stdio.h


See the FAQ, section 19.14. In short, you can't delete something from
the middle of a file, you have to copy everything except the stuff you
don't want to a new file.
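That copy-everything approach can be sketched in portable C like this; the function name and the stream-to-stream shape are illustrative choices, not code from the FAQ. Both streams are assumed to be open in binary mode:

```c
#include <stdio.h>

/* Sketch: copy stream 'in' to stream 'out', turning CR LF pairs into a
 * single LF. A lone CR (the classic Mac line ending) also becomes LF. */
int to_unix_eol(FILE *in, FILE *out)
{
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (c == '\r') {
            int next = fgetc(in);
            if (next != '\n' && next != EOF)
                ungetc(next, in);   /* lone CR: push the byte back */
            c = '\n';
        }
        if (fputc(c, out) == EOF)
            return -1;              /* write error */
    }
    return ferror(in) ? -1 : 0;
}
```

After converting into a temporary file, the caller would remove the original and rename the copy over it.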
Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Nov 14 '05 #2

In article <2m************@uni-berlin.de>, Je***********@physik.fu-berlin.de
wrote:
S!mb@ <S!mb@nop> wrote:
I'm currently developing a tool to convert text files between linux,
windows and mac.

(..)
if (car == 13) {


Better use '\r' instead of some "magic" values.


For traditional MacOS compilers, '\r' tends to be 10,
and '\n' tends to be 13. This illustrates that when dealing
with binary files in a non-native format, it is best to use magic
values. OTOH, when dealing with local text files, '\n' is
best, of course.
François Grieu
Nov 14 '05 #3


"S!mb@" <S!mb@nop> wrote in message
news:40***********************@news.free.fr...
Hi all,

I'm currently developing a tool to convert text files between linux,
windows and mac.

the end of a line is coded by 2 characters in windows, and only one in
unix & mac. So I have to delete a character at each end of a line.

car = fgetc(myFile);
while (car != EOF) {
if (car == 13) {
car2 = fgetc(myFile) ;
if (car2 == 10) {
// fseek of 2 characters
// delete a character
// overwrite the second character
}
}
}

How can I do that? Is there a function that I can use? I can't find
one in stdio.h.

thx in advance,

Jerem.


Well, there is already a tool, dos2unix (and vice versa). Why reinvent the
wheel? Think of something new.

--
Winners don't do different things, they do things differently.

Madhur Ahuja
India

Homepage : http://madhur.netfirms.com
Email : madhur<underscore>ahuja<at>yahoo<dot>com

Nov 14 '05 #4

Francois Grieu <fg****@francenet.fr> wrote:
In article <2m************@uni-berlin.de>, Je***********@physik.fu-berlin.de
wrote:
S!mb@ <S!mb@nop> wrote:
> I'm currently developing a tool to convert text files between linux,
> windows and mac.

(..)
> if (car == 13) {


Better use '\r' instead of some "magic" values.

For traditional MacOS compilers, '\r' tends to be 10,
and '\n' tends to be 13. This illustrates that when dealing
with binary files in a non-native format, it is best to use magic
values. OTOH, when dealing with local text files, '\n' is
best, of course.


I don't believe that, they were also using ASCII. AFAIR on "classical"
MacOS the end of line marker was simply "\n\r" (i.e. the other way
round compared to DOSish systems), but that doesn't make '\r' (i.e. CR)
== 0xA and '\n' (LF) == 0xD.
Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Nov 14 '05 #5

On Tue, 20 Jul 2004 00:06:11 +0530, "Madhur Ahuja" <df@df.com> wrote:
Well, there is already a tool, dos2unix (and vice versa). Why reinvent the
wheel? Think of something new.

--
Winners don't do different things, they do things differently.


Didn't you just negate your own comment? <G>.

Maybe the OP is doing it differently.

--
Al Balmer
Balmer Consulting
re************************@att.net
Nov 14 '05 #6

<Je***********@physik.fu-berlin.de> wrote in message news:2m************@uni-berlin.de...
Francois Grieu <fg****@francenet.fr> wrote:
In article <2m************@uni-berlin.de>, Je***********@physik.fu-berlin.de
wrote:
S!mb@ <S!mb@nop> wrote:
> I'm currently developing a tool to convert text files between linux,
> windows and mac.
(..)
> if (car == 13) {

Better use '\r' instead of some "magic" values.

For traditional MacOS compilers, '\r' tends to be 10,
and '\n' tends to be 13. This illustrates that when dealing
with binary files in a non-native format, it is best to use magic
values. OTOH, when dealing with local text files, '\n' is
best, of course.


I don't believe that, they were also using ASCII.


Believe it, although it wasn't as hard and fast a rule as Francois makes it out to be. Many
implementations (e.g. Metrowerks) allowed the programmer to optionally swap the values of
'\n' and '\r' for text streams. Choosing '\n' == 0x0D meant that text streams were
unencumbered with EOL translations.

The standard states that '\n' is an implementation defined value (whether on ASCII based
platforms or not) precisely for support of such systems.

[OT: That said, third party mac compilers had no support for command line arguments, since
Apple's MPW was the only environment that actually provided the notion of a 'shell'. So
compilers were not exactly conforming in the strictest sense.

Compiling command line programs generally involved including a ccommand(&argv) call from
main. Curiously, every development tool that I used (I've never used MPW) got the runtime
startup for command line programs 'wrong' since a main signature of...

int main(int argc, char **argv)

...invariably meant that argc and argv were located below the stack. (The int was returned
in register D0, so that didn't matter.) Fortunately the memory was the top of the
'application globals', a location 'reserved' by Apple, but never used AFAIK!]
AFAIR on "classical"
MacOS the end of line marker was simply "\n\r"


The end of line marker was a lone <CR> (0x0D).

I have no idea whether Mac OS X uses Linux (<LF> 0x0A) linebreaks or not.

--
Peter
Nov 14 '05 #7

>I'm currently developing a tool to convert text files between linux,
windows and mac.

the end of a line is coded by 2 characters in windows, and only one in
unix & mac. So I have to delete a character at each end of a line.
The portable way to make such changes is to copy the file and
make changes as you go. There is no portable way to shorten a file
to a length greater than zero except by truncating it to zero length
and then writing new contents for it. Functions such as ftruncate(),
chsize(), and suck() are not portable ANSI C.

Making changes in-place in a file should be done carefully. If
your program crashes partway through, it may leave an unrecoverable
mess.

car = fgetc(myFile);
while (car != EOF) {
if (car == 13) {
car2 = fgetc(myFile) ;
if (car2 == 10) {
// fseek of 2 characters
// delete a character
// overwrite the second character
}
}
}

How can I do that? Is there a function that I can use? I can't find
one in stdio.h.


A function which deletes a character out of a gigabyte file by
copying all but one character of the file may run very slowly
(although it is possible to write such a function portably if you've
got space for a copy of the file). If it's called once per line,
it could get REALLY, REALLY slow.

Gordon L Burditt
Nov 14 '05 #8

>The end of line marker was a lone <CR> (0x0D).

I have no idea whether Mac OS X uses Linux (<LF> 0x0A) linebreaks or not.


It does, although I prefer to call them UNIX linebreaks.

Gordon L. Burditt
Nov 14 '05 #9

Gordon Burditt wrote:
The end of line marker was a lone <CR> (0x0D).

I have no idea whether Mac OS X uses Linux (<LF> 0x0A) linebreaks or not.

It does, although I prefer to call them UNIX linebreaks.

Gordon L. Burditt


ok ;)

And what about the other characters on OS X?
I mean characters between 128 and 255. Do they use the Unix or the Mac
encoding?

e.g. £ is 0xA3 (163) on Mac and 0x9C (156) on Unix. What about on OS X?
Nov 14 '05 #10

"Peter Nilsson" <ai***@acay.com.au> wrote:
[OT: That said, third party mac compilers had no support for command line arguments, since
Apple's MPW was the only environment that actually provided the notion of a 'shell'. So
compilers were not exactly conforming in the strictest sense.
There's no reason why not having a command line would make an
implementation non-conforming. It would mean that the first argument to
main() would always be 0 or 1, but that's all.
Compiling command line programs generally involved including a ccommand(&argv) call from
main.


That, however, _would_ make it non-conforming.

Richard
Nov 14 '05 #11

> Well, there is already a tool, dos2unix and vice versa. Why reinvent the
wheel. Think something new.


I had a look on Google to find a tool, but I didn't find an interesting one.
Most of them only convert LF and CR characters, but I need to convert
also characters above 128. I also need the source code to adapt the
interface to my program.

But if you know of well-coded and powerful tools, I am interested.

Jerem.

Nov 14 '05 #12

S!mb@ <S!mb@nop> wrote:
> Well, there is already a tool, dos2unix and vice versa. Why reinvent the
wheel. Think something new.
I had a look on Google to find a tool, but I didn't find an interesting one.
Most of them only convert LF and CR characters, but I need to convert
also characters above 128. I also need the source code to adapt the
interface to my program.


That's not as simple as you seem to imagine - there are several different
standard interpretations (plus an even larger set of non-standard ones) for
the characters in that range. Just do a google search for e.g. "iso-8859"
to see just a few ways that range has been used. And there already exists
a tool for that purpose, it's called "recode".
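The core of any such single-byte conversion is a 256-entry remapping table, sketched below. The one sample entry (0x9C -> 0xA3, echoing the thread's pound-sign values) is purely illustrative, not a verified charset mapping:

```c
#include <stddef.h>

/* Build a remapping table: start from the identity map, then override
 * the entries that differ between the two 8-bit character sets. The
 * 0x9C -> 0xA3 entry is a hypothetical example, not a real mapping. */
void make_table(unsigned char table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    table[0x9C] = 0xA3;   /* e.g. pound sign in one set -> the other */
}

/* Transcode n bytes from 'in' to 'out' through the table. */
void transcode(const unsigned char *in, unsigned char *out, size_t n,
               const unsigned char table[256])
{
    for (size_t i = 0; i < n; i++)
        out[i] = table[in[i]];
}
```

A real converter such as recode ships complete, vetted tables for each charset pair; only the table contents differ from this sketch.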

Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Nov 14 '05 #13

"S!mb@" <S!mb@nop> wrote in message news:40***********************@news.free.fr...
Gordon Burditt wrote:
The end of line marker was a lone <CR> (0x0D).

I have no idea whether Mac OS X uses Linux (<LF> 0x0A) linebreaks
or not.
It does, although I prefer to call them UNIX linebreaks.


ok ;)

And what about the other characters on OS X?


What about them?
I mean characters between 128 and 255. Do they use the Unix or the Mac
encoding?
They use whatever coding the program that wrote them used.
e.g. £ is 0xA3 (163) on Mac and 0x9C (156) on Unix. What about on OS X?


Either system would be (and I presume is) capable of interpreting the given text file
under a given charset. Even within a C implementation you may be able to switch between
locales to interpret the same file differently under two different codings.

--
Peter
Nov 14 '05 #14

"Richard Bos" <rl*@hoekstra-uitgeverij.nl> wrote in message
news:40****************@news.individual.net...
"Peter Nilsson" <ai***@acay.com.au> wrote:
[OT: That said, third party mac compilers had no support for command
line arguments, since Apple's MPW was the only environment that
actually provided the notion of a 'shell'. So compilers were not
exactly conforming in the strictest sense.


There's no reason why not having a command line would make an
implementation non-conforming. It would mean that the first argument to
main() would always be 0 or 1, but that's all.


But the implementations I used didn't support that signature for main; what
you got for argc and argv was unspecified!
Compiling command line programs generally involved including a
ccommand(&argv) call from main.


That, however, _would_ make it non-conforming.


The call would make a program not _strictly_ conforming, although it may be
(and was) conforming. The behaviour of such programs says nothing about the
_implementation's_ conformance.

--
Peter
Nov 14 '05 #15

"Peter Nilsson" <ai***@acay.com.au> wrote:
"Richard Bos" <rl*@hoekstra-uitgeverij.nl> wrote in message
news:40****************@news.individual.net...
"Peter Nilsson" <ai***@acay.com.au> wrote:
[OT: That said, third party mac compilers had no support for command
line arguments, since Apple's MPW was the only environment that
actually provided the notion of a 'shell'. So compilers were not
exactly conforming in the strictest sense.


There's no reason why not having a command line would make an
implementation non-conforming. It would mean that the first argument to
main() would always be 0 or 1, but that's all.


But the implementations I used didn't support that signature for main; what
you got for argc and argv was unspecified!


Ah, but that's a different matter. If int main(int argc, char **argv) is
not supported, _that_ does mean that the implementation does not conform
to the Standard, at least if it claims to be a hosted implementation.
But not having a command line doesn't make this inevitable.
Compiling command line programs generally involved including a
ccommand(&argv) call from main.


That, however, _would_ make it non-conforming.


The call would make a program not _strictly_ conforming, although it may be
(and was) conforming. The behaviour of such programs says nothing about the
_implementation's_ conformance.


Well, yes, it does; ccommand is reserved for the programmer, not for the
implementation.

Richard
Nov 14 '05 #16

> And that would be '\n'. BTW, when you open the file in text mode
you may never "see" the '\r' and '\n' as two separate characters
if the "\r\n" combination is the end of line marker on the system.


When I use a hexadecimal editor, I "see" both characters.
that's why my program tries to read 2 characters (with 2 fgetc).

in fact, this works perfectly on linux (compiled with gcc), but on
windows (with Borland bcc32 compiler), my program doesn't detect \r\n as
two separate characters, as you told me.

So... how can I detect "\r\n", the EOL in windows?

Jerem
Nov 14 '05 #17

S!mb@ <S!mb@nop> wrote:
And that would be '\n'. BTW, when you open the file in text mode
you may never "see" the '\r' and '\n' as two separate characters
if the "\r\n" combination is the end of line marker on the system.
When I use a hexadecimal editor, I "see" both characters.
That's why my program tries to read 2 characters (with 2 fgetc).

In fact, this works perfectly on linux (compiled with gcc), but on
windows (with Borland bcc32 compiler), my program doesn't detect \r\n as
two separate characters, as you told me.

So... how can I detect "\r\n", the EOL in windows?


On Windows, when you have opened the file in text mode, the "\r\n"
sequence will be returned as a single '\n' because in text mode it
signifies the EOL - and in order to make dealing with text files as
portable as possible the C functions return a '\n' for whatever
the EOL character or character sequence is on the system the
program is running on (as long as the file has been opened in text
mode). So, the obvious solution is to open the file in binary mode
(i.e. with "rb" as the second argument to fopen() when you want to
open the file for reading) whenever you need to see what's really
in the file without some handling of special characters (the
character with the numeric equivalent of 0x1A is another of those
characters that have a special meaning for text files on Windows).

The "problem" does not seem to exist for you on Linux because there
the character signifying an EOL is identical to the '\n' the C
functions are returning, so on Linux (and other Unices) there isn't
any difference between opening a file in text or binary mode.
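A minimal way to check what is really stored on disk, following the "rb" advice above; the helper name and shape are illustrative, not from the thread:

```c
#include <stdio.h>

/* Count the raw CR (0x0D) bytes in a file. Because the file is opened
 * with "rb", no end-of-line translation happens, so every byte stored
 * on disk is visible - on Windows just as on Unix. */
long count_cr(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* "b": binary, no EOL translation */
    long n = 0;
    int c;

    if (fp == NULL)
        return -1;
    while ((c = fgetc(fp)) != EOF)
        if (c == '\r')
            n++;
    fclose(fp);
    return n;
}
```

Opened with "r" instead, the same function would report 0 on Windows for a CR-LF file, because the library strips the CRs before the program sees them.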

Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Nov 14 '05 #18

In article <2m************@uni-berlin.de>, Je***********@physik.fu-berlin.de
wrote:
For traditional MacOS compilers, '\r' tends to be 10,
and '\n' tends to be 13. This illustrates that when dealing
with binary files in a non-native format, it is best to use magic
values. OTOH, when dealing with local text files, '\n' is
best, of course.


I don't believe that, they were also using ASCII. AFAIR on "classical"
MacOS the end of line marker was simply "\n\r" (i.e. the other way
round compared to DOSish systems), but that doesn't make '\r' (i.e. CR)
== 0xA and '\n' (LF) == 0xD.


OT: I am 100% positive that traditional MacOS (up to and including
MacOS9) uses the byte with value 13 to separate text lines, with no 10.
You can check that this is the encoding used in e.g.
<ftp://ftp.apple.com/developer/+LICENSE_READ_ME_FIRST>
This is how the traditional MacOS version of gzip decompresses text files.
This is the encoding used by e.g. TeachText and SimpleText, and all
versions of Microsoft Word when dealing with text files, and so on.

Getting back on topic: Apple's own C compilers, part of MPW Shell, indeed
define '\n' as 13, and '\r' as 10. This is NOT an option (contrary to
other compilers). This causes no porting problems with most code.

[OT: there are headaches when moving files across a network. The
worst is that for diacriticals such as eacute encoded on a byte, Apple
has used FOUR different encodings on the Apple2, Lisa, Traditional MacOS,
and MacOSX; and none of these is the same as in DOS].
François Grieu
Nov 14 '05 #19

"Peter Nilsson" <ai***@acay.com.au> wrote:
Francois Grieu <fg****@francenet.fr> wrote:

For traditional MacOS compilers, '\r' tends to be 10,
and '\n' tends to be 13. This illustrates that when dealing
with binary files in a non-native format, it is best to use magic
values. OTOH, when dealing with local text files, '\n' is
best, of course.


I don't believe that, they were also using ASCII.


Believe it, although it wasn't a hard and fast rule that Francois
makes it out to be. Many implementations (e.g. Metrowerks) allowed the
programmer to optionally swap the values of '\n' and '\r' for text
streams. Choosing '\n' == 0x0D meant that text streams were
unencumbered with EOL translations.


It sounds like you are describing conversion of '\n' to '\r' and vice
versa when a stream is open in text mode, which would be quite normal.
In fact it's the reason for having text mode and binary mode.

The OP claimed that '\r' was actually 10, ie. the following:

printf("%d\n", '\r');

would print 10. This is a totally different claim (which also
implies that the system is non-ASCII).
I'd have to see it to believe it.
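The disputed values are trivial to print. On any ASCII-based implementation today this reports 10 for '\n' and 13 for '\r'; the classic Mac compilers discussed above would reportedly have shown them swapped:

```c
#include <stdio.h>

/* ISO C leaves the numeric values of '\n' and '\r' implementation-
 * defined; this helper just reports what the current compiler uses. */
void print_eol_values(void)
{
    printf("'\\n' = %d, '\\r' = %d\n", '\n', '\r');
}
```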
Nov 14 '05 #20

Trying to clarify things.

In article <fg**************************@individual.net> Francois Grieu <fg****@francenet.fr> writes:
In article <2m************@uni-berlin.de>, Je***********@physik.fu-berlin.de
wrote:
I don't believe that, they were also using ASCII. AFAIR on "classical"
MacOS the end of line marker was simply "\n\r" (i.e. the other way
round compared to DOSish systems), but that doesn't make '\r' (i.e. CR)
== 0xA and '\n' (LF) == 0xD.

Wrong. On the Mac the line marker is a CR, not a LF. And there is just a
single character there. (This entirely conforms to the ASCII standard of
that time which states that when only a single character is used to mark
the end of a line, the CR should be used. Unix violated this.)
OT: I am 100% positive that traditional MacOS (up to and including
MacOS9) uses the byte with value 13 to separate text lines, with no 10.
You are right.
Getting back on topic: Apple's own C compilers, part of MPW Shell, indeed
define '\n' as 13, and '\r' as 10. This is NOT an option (contrary to
other compilers). This causes no porting problems with most code.
It had better define '\n' as 13, because that is the way it is.
However, when transporting text files from the Mac to Unix I always
do a 10 <-> 13 interchange and it comes out just as I want...
[OT: there are headaches when moving files across a network. The
worst is that for diacriticals such as eacute encoded on a byte, Apple
has used FOUR different encodings on the Apple2, Lisa, Traditional MacOS,
and MacOSX; and none of these is the same as in DOS].


MacOSX uses ISO 8859-1, I think.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #21

Je***********@physik.fu-berlin.de wrote:
So, the obvious solution is to open the file in binary mode
(i.e. with "rb" as the second argument to fopen() when you want to
open the file for reading)

ok, it works perfectly !

thx.

I had to open the file for reading AND the file for writing with "b",
now, all bytes are detected.
(opening a file for writing without "b" will write CR LF under Windows
when it was asked to write character number 10).
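The writing-side effect can be checked the same way; here is a sketch (names are mine, not from the thread) that writes one '\n' in binary mode and counts the bytes that end up in the file:

```c
#include <stdio.h>

/* Write a single '\n' with "wb" and return the file's length in bytes.
 * In binary mode the byte 0x0A is stored verbatim, so the answer is 1
 * on every platform; with "w" on Windows the library would store "\r\n". */
long write_newline_binary(const char *path)
{
    FILE *fp = fopen(path, "wb");
    long n = 0;

    if (fp == NULL)
        return -1;
    fputc('\n', fp);
    fclose(fp);

    fp = fopen(path, "rb");         /* read back in binary mode too */
    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}
```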

Jerem
Nov 14 '05 #22

Je***********@physik.fu-berlin.de wrote:

thx another time for this tip (the "b" parameter), coz it is not
mentioned in the man page of fopen:

"The character b has no effect, but is allowed for ISO C standard
conformance. Opening a file with read mode (r as the first character in
the mode argument) fails if the file does not exist or cannot be read."

http://www.opengroup.org/onlinepubs/...xsh/fopen.html

jerem.
Nov 14 '05 #23

S!mb@ <S!mb@nop> wrote:
thx another time for this tip (the "b" parameter), coz it is not
mentioned in the man page of fopen:

"The character b has no effect, but is allowed for ISO C standard
conformance. Opening a file with read mode (r as the first character in
the mode argument) fails if the file does not exist or cannot be read."

http://www.opengroup.org/onlinepubs/...xsh/fopen.html


Well, you shouldn't use UNIX specifications when you want to know
about the properties of a function under Windows. Under UNIX it
doesn't make a difference because there the EOL marker is the
single character '\n' (ASCII 0x0A). But under Windows the
"\r\n" -> '\n' translation takes place when reading text files and the
reverse when writing. But, of course, you normally won't find
that mentioned in e.g. a UNIX man page. The only really relevant
piece of information is the C standard, where you find

A text stream is an ordered sequence of characters...
...Characters may have to be added, altered, or deleted
on input and output to conform to differing conventions for
representing text in the host environment. Thus, there need
not be a one-to-one correspondence between the characters in
a stream and those in the external representation...

where "text stream" is what you read from or write to a FILE* you
got from e.g. fopen() in text mode and the "external representation"
is the way the file looks on the disk.

Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Nov 14 '05 #24

In <I1********@cwi.nl> "Dik T. Winter" <Di********@cwi.nl> writes:
Wrong. On the Mac the line marker is a CR, not a LF. And there is just a
single character there. (This entirely conforms to the ASCII standard of
that time which states that when only a single character is used to mark
the end of a line, the CR should be used. Unix violated this.)

^^^^^^^^^^^^^^^^^^
From http://www.campusprogram.com/referen.../new_line.html

0A, respectively. (In the 1958 version of ASCII, for which Multics
was originally designed, there was a separate newline (NL) character
in addition to CR and LF. The 1963 revision combined NL with LF,
and Multics followed suit. Unix followed the Multics practice, and
later systems followed Unix.)

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #25

"S!mb@" <S!mb@nop> wrote in message
news:40***********************@news.free.fr...
Je***********@physik.fu-berlin.de wrote:
So, the obvious solution is to open the file in binary mode
(i.e. with "rb" as the second argument to fopen() when you want to
open the file for reading)


I had to open the file for reading AND the file for writing with "b",
now, all bytes are detected.
(opening a file for writing without "b" will write CR LF under Windows
when it was asked to write character number 10).


I find it amusing that when I needed to convert Linux files to MS Windows OS
files I wrote the following program and it works!
================================================
/* lf2crlf.c
Copies lines from stdin to stdout, changing the Linefeed character
to Carriage Return / Linefeed.
*/
#include <stdio.h>
int main (void)
{
    int ch;   /* int, not char, so EOF can be told apart from a real byte */

    /* read in chars until EOF */
    while ( (ch = getc(stdin)) != EOF )
        putc (ch, stdout);
    return 0;
}
================================================
--
Mabden
Nov 14 '05 #26

Da*****@cern.ch (Dan Pop) wrote...

(In the 1958 version of ASCII, for which Multics
was originally designed, there was a separate newline (NL) character
in addition to CR and LF. The 1963 revision combined NL with LF,
and Multics followed suit. Unix followed the Multics practice, and
later systems followed Unix.)


Two corrections. 1968, not 1958.

Second, there was not a separate NL in addition to CR and LF.
In proposed-revised-1968-ASCII, the code we used,
there were NL and CR. I think the standard said something like
"CR followed by NL should work, and if you use only one character
for a line separator, use NL."

If I remember correctly, Teletypes used CR and LF for
news and message transmission, and whether a TTY did a CR when
it received the LF code was a per-machine option. The Multics
TTY Device Interface Module converted the storage code of LF
into whatever was needed to accomplish a new line: for Teletypes
I think it sent CR, LF, and some number of delay characters.
Nov 14 '05 #27
