text and binary files confusion

hi friends,

i've been having this confusion for about a year, i want to know the
exact difference between text and binary files.

using the fwrite function in c, i wrote a 2-byte integer to a file in
binary mode.
as i understand it, notepad opens a file and, for each byte it reads,
converts that byte from ascii to the corresponding character and
displays it on screen..
so that's what i did: i wrote 2 bytes (an integer) using fwrite, and
since an ascii character is 1 byte, i expected 2 characters to be
displayed in notepad..
the first character displayed correctly but not the second.

to add to my confusion about text and binary, some FTP servers running
on Linux require html files to be uploaded in 'ascii mode' and binary
files in 'binary mode'.
Both are ordinary files consisting of a sequential series of bytes
after all, so why a separate mode?
Any insight into this confusion would be greatly appreciated.

thanks a lot for your time.

- joel

Mar 13 '06 #1
Firstly, don't worry: this is something that trips a lot of people up.

I was about to pen a few lines and then decided not to because (a) I
could not think of an eloquent way of doing it and (b) like most things,
someone else has done it first. The secret of being a great engineer
is not knowing how to do something, but knowing that there may be a
better way and knowing how to locate that better way :-;

Here:

http://en.wikipedia.org/wiki/Binary_and_text_files

There is one key part which might confuse you (not knowing your
familiarity with ascii text) and that is:

"Text files are files where most bytes (or short sequences of bytes)
represent ordinary readable characters such as letters,"

The short sequence of bytes is important: look up Unicode and DBCS
(double-byte character sets).

--
Debuggers : you know it makes sense.
http://heather.cs.ucdavis.edu/~matlo...g.html#tth_sEc
Mar 13 '06 #2
On 13-03-2006, jo*******@gmail.com <jo*******@gmail.com> wrote:
> i've been having this confusion for about a year, i want to know the
> exact difference between text and binary files.

To my knowledge, the main difference is the interpretation
of the '\n' character.
In text mode, '\n' is an 'end of line' indication, mapped
to '\n', '\r\n' or '\r' depending on the platform's
end-of-line convention.
In binary mode, '\n' is '\n'.
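
The mapping described above is easy to observe for yourself. Here is a minimal sketch (the helper name written_size is made up for illustration) that writes the same four characters in a given mode and then counts, in binary mode, how many bytes actually ended up in the file:

```c
#include <stdio.h>

/* Write the 4 characters "a\nb\n" to `path` using the given fopen mode
   ("w" for text, "wb" for binary), then re-open the file in binary mode
   and return the number of bytes actually stored, or -1 on error. */
long written_size(const char *path, const char *mode)
{
    FILE *f = fopen(path, mode);
    if (f == NULL)
        return -1;
    fputs("a\nb\n", f);
    if (fclose(f) != 0)
        return -1;

    /* Binary mode shows the bytes exactly as they sit on disk. */
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long n = 0;
    while (fgetc(f) != EOF)
        n++;
    fclose(f);
    return n;
}
```

On Windows, written_size("t.txt", "w") should normally come back as 6, because each '\n' is stored as '\r' '\n', while written_size("t.bin", "wb") gives 4. On Linux both calls give 4, since text and binary mode behave the same there.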

> using the fwrite function in c, i wrote 2 bytes of integers in binary
> mode.
> according to me, notepad opens files and each byte of the file
> read, it converts that byte from ascii to its correct character and
> displays it on screen..

Assuming that your encoding is ASCII, yes.

> so that's what i did, i wrote 2 bytes (an integer) using fwrite and
> since ascii is 1 byte, i expected 2 characters to be displayed in
> notepad..
> the first character displayed correctly but not the second.


Perhaps it was not a 'displayable' value.

Marc Boyer
Mar 13 '06 #3
> Perhaps it was not a 'displayable' value.

the value was displayable im sure, because the ascii codes
i wrote in binary to the file were:
40H
and 41H
2 bytes.
and i expected it to display AB
but it displayed A@

That's what confuses me..

- joel

Mar 17 '06 #4
also how does notepad detect the encoding, after all its just
a sequence of bytes, there's nothing in the file that says im
encoded in ascii or ebcdic...

joel

Mar 17 '06 #5

jo*******@gmail.com wrote:
> > Perhaps it was not a 'displayable' value.
>
> the value was displayable im sure, because the ascii codes
> i wrote in binary to the file were:
> 40H
> and 41H
> 2 bytes.
> and i expected it to display AB

Wrong.

> but it displayed A@

That's what it's supposed to display. ASCII 'A' is 41H.
ASCII '@' is 40H.

> That's what confuses me..

But the computer is not confused. It does what you tell it.
When what you get is not what you expect, always consider
that your expectations may be wrong.
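
A guess at why the order came out as A@, since the code wasn't posted: if the two bytes were written as one 16-bit integer with the value 0x4041, then on a little-endian machine (any x86 PC running Windows) fwrite stores the low byte first, so the file holds 41H 40H. The sketch below copies the in-memory bytes of a 16-bit value in the same order fwrite would write them. The function names are made up for illustration, and the value 0x4041 is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Copy the in-memory bytes of a 16-bit value into out[0..1], in the same
   order that fwrite(&value, sizeof value, 1, f) would put them in a file. */
void bytes_as_stored(uint16_t value, unsigned char out[2])
{
    memcpy(out, &value, sizeof value);
}

/* Returns 1 on a little-endian machine (low byte stored first), else 0. */
int is_little_endian(void)
{
    unsigned char b[2];
    bytes_as_stored(0x0001, b);
    return b[0] == 0x01;
}
```

With bytes_as_stored(0x4041, b) on a little-endian machine, b[0] is 41H ('A') and b[1] is 40H ('@'), which matches what Notepad showed. To get a fixed order regardless of the machine, write the bytes one at a time, e.g. from an unsigned char array.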



Mar 17 '06 #6

jo*******@gmail.com wrote:
> also how does notepad detect the encoding, after all its just
> a sequence of bytes, there's nothing in the file that says im
> encoded in ascii or ebcdic...

Notepad assumes it's in ASCII. If you actually used EBCDIC
and opened it in notepad, you'd _really_ be confused.



Mar 17 '06 #7
Me
jo*******@gmail.com wrote:
> hi friends,
>
> i've been having this confusion for about a year, i want to know the
> exact difference between text and binary files.

As far as the C standard is concerned, there are a few differences: a
binary file may be padded with trailing null characters, so you can't
always get its exact size; the file position reported for a text file is
an opaque value rather than a byte count; an implementation only has to
support text lines of up to 254 characters (including the newline); and
whether the last line of a text file needs a terminating '\n' is
implementation-defined. This
is a summary, check the standard for the real list. So basically
writing a file in text mode then opening it in binary mode isn't
guaranteed to even give you anything meaningful or work at all (imagine
an implementation that marks each file as text or binary, so that a file
is identified by both its filename and this attribute).

On many implementations, the above doesn't apply and all you have to
worry about is how the implementation stores the newline character. Since
you're on Windows, here is the convention for text files (treating the
text file as binary here):

BOM(optional)
line1 newline
....
lineN newline(optional)
EOF(optional)

the BOM is to handle unicode files, it can be one of:

0xEF 0xBB 0xBF (UTF-8 BOM)
0xFF 0xFE (UTF-16LE BOM)
0xFE 0xFF (UTF-16BE BOM)

If there is no BOM, then it's up to the software opening it to figure
out the encoding of the file somehow.
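
Checking for the three BOMs listed above takes only a few comparisons. A minimal sketch (the enum and the function name detect_bom are invented for this example) that looks at the first bytes of a buffer read in binary mode:

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE };

/* Inspect the first bytes of a buffer and report which of the three
   BOMs it starts with, if any. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;
    return BOM_NONE;   /* no BOM: the caller has to guess the encoding */
}
```

Note that this only covers the BOMs in the list. A file starting with FF FE 00 00 could also be UTF-32LE, which this sketch would report as UTF-16LE.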

Newline is the '\r' '\n' sequence of characters.

Lines are composed of characters. For UTF-16, a character is either
2 bytes or 4 bytes, depending on whether it needs a surrogate pair. For
UTF-8, characters are 1, 2, 3, or 4 bytes. (And on top of all this, you
have to deal with an arbitrary number of combining characters.) You
should read up on Unicode, UTF-8, and UTF-16 because this whole issue
of characters and glyphs is confusing when somebody like me uses loose
language like this. If it's not a Unicode file, it most likely uses
the code page configured on the system. Generally, Western European
locales use 1 byte per character and East Asian locales use multiple bytes
to encode characters.
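
To make the byte counts above concrete, here is a small sketch that computes how much space one Unicode code point takes in each encoding. The function names are made up for this example, and "character" here means a single code point, ignoring combining characters.

```c
/* Number of bytes UTF-8 needs for one Unicode code point, or 0 if the
   value is not encodable (a surrogate, or above U+10FFFF). */
int utf8_length(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Number of 16-bit code units UTF-16 needs: 1 inside the Basic
   Multilingual Plane, 2 (a surrogate pair) above it, 0 if not encodable. */
int utf16_units(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0xFFFF)   return 1;
    if (cp <= 0x10FFFF) return 2;
    return 0;
}
```

For example, 'A' (U+0041) is 1 byte in UTF-8 and 1 unit (2 bytes) in UTF-16, the euro sign (U+20AC) is 3 bytes in UTF-8 and 1 unit in UTF-16, and anything above U+FFFF is 4 bytes in both.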

EOF is the ASCII ctrl+Z code (0x1A). You won't find this except when
opening an ancient DOS file off a floppy or something.

When opening a file in text mode, most of this should be transparent to
you if your program and the C runtime were carefully designed, i.e. the
above should pretty much be a concern for the C runtime implementors or
programmers that want to handle all of this themselves.

Here's some homework for you:

On the C side, read 7.19, 7.24, and 7.25 in the C standard. Make sure
you know what the following do and how they fit together:

mbstate_t
fwide
fwrite
fputs
fputws
mbtowc
mbstowcs
setlocale
wcstombs
wctomb
mblen

On the windows side, read:

GetACP
MultiByteToWideChar
WideCharToMultiByte
http://blogs.msdn.com/oldnewthing/ar...08/389527.aspx
http://blogs.msdn.com/oldnewthing/ar...31/144893.aspx
http://blogs.msdn.com/oldnewthing/ar...29/457483.aspx
http://blogs.msdn.com/oldnewthing/ar.../24/95235.aspx
http://blogs.msdn.com/michkap/archiv...gory/8717.aspx

On the Unicode side, read:

http://www.unicode.org/faq/utf_bom.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://catch22.net/tuts/ (the articles about his text editor)
http://en.wikipedia.org/wiki/ISO/IEC_8859
http://en.wikipedia.org/wiki/Unicode

After all this, you should be way more advanced about files than most C
programmers.

> using the fwrite function in c, i wrote 2 bytes of integers in binary
> mode.
> according to me, notepad opens files and each byte of the file
> read, it converts that byte from ascii to its correct character and
> displays it on screen..
> so that's what i did, i wrote 2 bytes (an integer) using fwrite and
> since ascii is 1 byte, i expected 2 characters to be displayed in
> notepad..
> the first character displayed correctly but not the second.

Notepad likely uses the winapi function IsTextUnicode to determine the
encoding of the file. Windows supports the ASCII codepage but it is
very rarely used. Yours is most likely set to this:

http://en.wikipedia.org/wiki/Windows-1252

> to add to my confusion of text and binary, some FTP servers running
> on Linux require html files to be uploaded in 'ascii mode' and binary
> files in 'binary mode'.
> Both are ordinary files consisting of a sequential series of bytes
> after all, then why a seperate mode?


I don't know about FTP but I think it also allows EBCDIC to be
transferred as well. I doubt you can send any arbitrary text file from
one machine to another, because that depends heavily on the source and
destination character sets, so obviously FTP can only transfer it
losslessly between computers that have some sort of mapping between them
and the FTP server is aware of this mapping.
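
For what it's worth, the usual practical reason for FTP's ASCII mode is line endings rather than character sets. In ASCII type (RFC 959) the sender converts its local end-of-line to CR LF on the wire and the receiver converts CR LF to its own convention, while binary (image) type sends the bytes untouched. That is also why a binary file sent in ASCII mode can arrive corrupted. A rough sketch of the sending side's conversion on a Unix-like host follows; the function name lf_to_crlf is made up for illustration and is not taken from any FTP implementation.

```c
#include <stddef.h>

/* Copy `in` to `out`, writing every '\n' as "\r\n".
   `out` must have room for up to 2 * in_len bytes.
   Returns the number of bytes written to `out`. */
size_t lf_to_crlf(const unsigned char *in, size_t in_len, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == '\n')
            out[o++] = '\r';     /* insert CR before every LF */
        out[o++] = in[i];
    }
    return o;
}
```

Run over an image or an executable, the same loop would insert a 0x0D before every 0x0A byte that happens to occur in the data, which is exactly the kind of damage binary mode avoids.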

Mar 17 '06 #8
On 17-03-2006, jo*******@gmail.com <jo*******@gmail.com> wrote:
> also how does notepad detect the encoding,

Does it?

> after all its just
> a sequence of bytes, there's nothing in the file that says im
> encoded in ascii or ebcdic...

No, but, in general, a 'family of platforms'
(like Win*, AIX*, AS*) uses the same encoding. That
is to say, I do not know of any Win* running EBCDIC.
So, notepad can assume ASCII is used.

Nevertheless, nowadays, in non-English-speaking countries,
people are using iso-* encodings, UTF-8, perhaps
UTF-16 and others...
Being French, I often have problems opening UTF-8
files with iso-latin* editors, and so on.

There are some heuristics used to guess the
encoding. Some editors (like [X]emacs) use some,
and it often works.
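
One common heuristic of this kind, shown here only as an illustration and not as what any particular editor does, is to test whether the bytes form well-formed UTF-8. Text in an iso-latin encoding that contains accented letters almost never passes this test by accident. The function name looks_like_utf8 is made up for this sketch.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
   Rejects overlong forms, surrogates and values above U+10FFFF. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;            /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest legal value */

        if (c < 0x80) { i++; continue; }          /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;           /* stray continuation or invalid lead byte */

        if (len - i <= extra)
            return 0;            /* sequence cut off at end of buffer */
        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;        /* not a 10xxxxxx continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

The French word "café" saved as UTF-8 ends in C3 A9 and passes; saved as iso-latin-1 it ends in a lone E9 and fails. Pure ASCII passes too, which is harmless because ASCII reads the same in both encodings.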

Marc Boyer
Mar 17 '06 #9
Marc Boyer <Ma********@enseeiht.yahoo.fr.invalid> wrote:
> On 17-03-2006, jo*******@gmail.com <jo*******@gmail.com> wrote:
> > also how does notepad detect the encoding,
>
> Does it?
>
> > after all its just
> > a sequence of bytes, there's nothing in the file that says im
> > encoded in ascii or ebcdic...
>
> No, but, in general, a 'family of platforms'
> (like Win*, AIX*, AS*) uses the same encoding. That
> is to say, I do not know of any Win* running EBCDIC.
> So, notepad can assume ASCII is used.


Not on newer versions of Windows, it can't. More would be off-topic,
except to say that
a. the detection used is easily writable, correctly, in ISO C and
b. _if_ the implementation uses UTF-16 for wchar_t, so is the rest of
the editor.

Richard
Mar 17 '06 #10
Marc Boyer said:
> On 17-03-2006, jo*******@gmail.com <jo*******@gmail.com> wrote:
> > also how does notepad detect the encoding,
>
> Does it?
>
> > after all its just
> > a sequence of bytes, there's nothing in the file that says im
> > encoded in ascii or ebcdic...
>
> No, but, in general, a 'family of platforms'
> (like Win*, AIX*, AS*) uses the same encoding. That
> is to say, I do not know of any Win* running EBCDIC.
> So, notepad can assume ASCII is used.


Windows can still emulate MS-DOS, quite probably to the extent that it can
run IBM's DisplayWrite software, which uses [1] EBCDIC encoding (for, would
you believe, mainframe compatibility).
[1] Or, at least, used. I freely admit that my information is over a decade
old.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Mar 17 '06 #11
