How to determine stream type?

Given a file, how do I know if it's ascii or unicode or binary? And how
do I know if it's rtf or html or etc? In other words, how do I find the
stream type or mime type?
(No, file extension cannot be the answer)

Thanks

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Nov 15 '05 #1
<Kaki <--NO-->> wrote:
Given a file, how do I know if it's ascii or unicode or binary? And how
do I know if it's rtf or html or etc? In other words, how do I find the
stream type or mime type?
(No, file extension cannot be the answer)


There's no way of doing it, basically. A stream is just a sequence of
bytes, and it's perfectly possible to have a stream of bytes which is a
valid document when viewed from more than one perspective (e.g. a text
file in two different encodings).
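To illustrate the point about one byte sequence being a valid document from more than one perspective, here is a minimal Python sketch. The sample bytes are an assumption chosen for illustration; they decode without error as both UTF-8 and Latin-1, and give different text each time.

```python
# The same five bytes are a valid text "document" in two encodings.
data = b"caf\xc3\xa9"

as_utf8 = data.decode("utf-8")      # two bytes \xc3\xa9 become one character
as_latin1 = data.decode("latin-1")  # every byte maps to exactly one character

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
print(as_utf8 == as_latin1)
```

Neither decode raises an error, so nothing in the bytes themselves says which reading was intended.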

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #2

"Kaki" <--NO--> wrote in message
news:#Y**************@TK2MSFTNGP10.phx.gbl...

Given a file, how do I know if it's ascii or unicode or binary ?
And how do I know if it's rtf or html or etc? In other words,
how do I find the stream type or mime type?

(No, file extension cannot be the answer)


Only large-system operating systems such as VMS [DEC / Compaq] and MVS [IBM]
make any formal distinction between file types. In these systems there are
even physical differences between file types in so far as they are stored
differently, and are accessed with different code routines.

Under operating systems such as DOS / Windows-family, and *NIX / Linux, a
'file' is merely a named, persistent collection of bytes. The only ways to
tell whether a file contains data to be interpreted as text or as binary
are conventions such as file extension usage [e.g. '.txt' indicates a text
file], and schemes such as searching files for 'magic numbers' [i.e. byte
sequences known to uniquely identify file types], the latter heavily used
in the *NIX / Linux world. [These systems also make distinctions between
things like sockets and devices at the operating system level, but this
hardly helps in identifying file types.]
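As a rough sketch of the 'magic number' scheme described above, the Python below compares the leading bytes of a file against a small table of well-known signatures. The table, the labels, and the name `sniff_magic` are illustrative assumptions, not part of any library, and the list is far from exhaustive.

```python
from typing import Optional

# A few well-known leading byte sequences. Illustrative only, not exhaustive.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"{\\rtf", "application/rtf"),
]


def sniff_magic(head: bytes) -> Optional[str]:
    """Return a guessed type for the leading bytes, or None if nothing matches."""
    for signature, label in MAGIC_NUMBERS:
        if head.startswith(signature):
            return label
    return None


print(sniff_magic(b"{\\rtf1\\ansi Hello"))
print(sniff_magic(b"just some text"))
```

In practice you would pass in the first few dozen bytes read from the file in binary mode. A match is still only a guess: nothing stops an unrelated file from starting with the same bytes.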

Thus, the answer is: there is no way of guaranteeing what a file's 'type'
actually is. All you can do is adhere to some convention, and hope that
everyone else follows suit. When attempting to access a particular file you
would check to ensure that the data read in conforms to the expected pattern
/ format for that file type.

For example, an HTML file could be expected to contain a <HTML> tag
somewhere near the start of the file, while many proprietary file formats
[e.g. MS Excel, Word etc] would sport a byte collection known as a 'header'
containing 'fields' with version information and the like. If, in reading
such files, the expected tags are found, or 'sensible' values for each
field are read in, then you can be reasonably sure [though not absolutely
certain] that the 'correct' file type has been accessed.
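The HTML check described above might look something like this in Python. It is a heuristic sketch only: the function name, the 1024-byte window, and the assumption of an ASCII-compatible encoding are mine, so a UTF-16 encoded page would not be recognised.

```python
def looks_like_html(head: bytes, window: int = 1024) -> bool:
    """Heuristic: is there an <html tag or an HTML doctype near the start?"""
    sample = head[:window].lower()
    return b"<html" in sample or b"<!doctype html" in sample


print(looks_like_html(b"<HTML><BODY>hi</BODY></HTML>"))
print(looks_like_html(b"plain text, nothing else"))
# A false positive: plain text that merely mentions the tag.
print(looks_like_html(b"A note about the <html> tag"))
```

The third call shows why the result is 'reasonably sure' rather than certain: ordinary text that mentions the tag passes the same test.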

Note that I made no mention of 'streams' which are nothing more than
program objects that are temporarily connected or linked to file(s) for
purposes of file data access / updating. Now, it might be possible for such
objects to report information about the file, or the current connection /
linkage status. However, when first establishing a link to a
specified file, such objects can merely make the checks mentioned earlier to
ascertain the 'correctness' of the file.

I'm not sure this is the type of response you were after, but the rather
general nature of your query seemed to warrant it. Additionally, it is the
type of issue that transcends any one programming language / environment.

I hope this helps.

Anthony Borla
Nov 15 '05 #3
Anthony Borla <aj*****@bigpond.com> wrote:
Under operating systems such as DOS / Windows-family, and *NIX / Linux, a
'file' is merely a named, persistent collection of bytes


Actually that's not true - a file has other attributes under all of the
above. Under Windows a file may be read-only, or hidden, with various
security attributes. Under NT-based systems it may also have alternate
"streams" (not to be confused with the .NET concept of a stream) which
may give additional information. Some Linux file-systems have metadata
too.

A plain Stream in .NET terms, however, has none of this - that really
*is* just a sequence of bytes. Derived types may add more information,
as you've said.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #4

"Kaki" <--NO--> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
Given a file, how do I know if it's ascii or unicode or binary? And how
do I know if it's rtf or html or etc? In other words, how do I find the
stream type or mime type?
(No, file extension cannot be the answer)

Thanks


Although it's not possible to be certain, enough tests should allow you to
figure out what it is (within a limited domain). There is a method[1] that
comes with Internet Explorer that can test for 26 different types[2],
according to the docs. It's not perfect, but it's the safest bet you have.
As for unicode/ascii differentiation: unless you find byte order marks and
are reasonably sure it's text, not binary, it's not possible to say. Above
all, you should do your best to keep track of type upon loading, but these
should allow you to do some very basic checks.

1.
http://msdn.microsoft.com/library/de...mefromdata.asp
2.
http://msdn.microsoft.com/library/de...appendix_a.asp
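For the byte order mark check mentioned above, a minimal Python sketch could use the BOM constants from the standard `codecs` module. The function name `detect_bom` and the returned labels are assumptions for illustration. A missing BOM says nothing either way, and a match only means something if you are already fairly sure the data is text.

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(detect_bom(codecs.BOM_UTF8 + b"hello"))
print(detect_bom(b"hello"))
```

Checking the UTF-32 marks first is a design choice, not a guarantee: UTF-16 LE text whose first character after the mark is U+0000 would be misread as UTF-32 LE.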
Nov 15 '05 #5
Hopefully we'll see this potentially nice feature in framework v1.2 and
beyond...

I hadn't really considered the issue, but I do side with the original poster
in that there SHOULD be a common code base that can determine the type of
stream. And since MIME is becoming a convenient standard, so be it.
--
Eric Newton
C#/ASP Application Developer
http://ensoft-software.com/
er**@cc.ensoft-software.com [remove the first "CC."]

"Daniel O'Connell" <onyxkirx@--NOSPAM--comcast.net> wrote in message
news:eO**************@tk2msftngp13.phx.gbl...
Nov 15 '05 #6

"Eric Newton" <er**@cc.ensoft-software.com> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
Hopefully we'll see this potentially nice feature in framework v1.2 and
beyond...

I hadn't really considered the issue, but I do side with the original poster
in that there SHOULD be a common code base that can determine the type of
stream. And since MIME is becoming a convenient standard, so be it.

It has its ups, but it is still, unfortunately, mostly a guess. Outside of
creating standard formats (for example, an xml document that had a <format>
tag), this will always be a guess, and bad luck could result in an incorrect
detection.
I suspect that it should be fairly trivial to get a good guess between image
formats, sgml-derived, xml and other text formats, and perhaps other RIFF
type objects, but more complicated, proprietary binary formats are probably
out of the question. Text encoding is also an issue because, with the
exception of some forms of unicode, there is no marker, only text data.

However, a managed implementation would be of value, especially if you could
plug in your own recognizers. Even if it's not provided in the 1.2/2.0
framework, it is something an independent developer could write.
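A sketch of what 'plug in your own recognizers' could look like, written in Python rather than managed code to keep it short. Everything here (`TypeSniffer`, `recognize_riff`, `recognize_xml`, the labels returned) is hypothetical and only shows the shape of the idea; the one format detail relied on is the RIFF layout of a 4-byte "RIFF" tag, a 4-byte size, then a 4-byte form type.

```python
from typing import Callable, List, Optional

# A recognizer takes the leading bytes and returns a label, or None to pass.
Recognizer = Callable[[bytes], Optional[str]]


class TypeSniffer:
    """Asks each registered recognizer in turn; the first answer wins."""

    def __init__(self) -> None:
        self._recognizers: List[Recognizer] = []

    def register(self, recognizer: Recognizer) -> None:
        self._recognizers.append(recognizer)

    def guess(self, head: bytes, default: str = "application/octet-stream") -> str:
        for recognizer in self._recognizers:
            result = recognizer(head)
            if result is not None:
                return result
        return default


def recognize_riff(head: bytes) -> Optional[str]:
    # RIFF container: b"RIFF", 4-byte size, 4-byte form type.
    if len(head) >= 12 and head[:4] == b"RIFF":
        return {b"WAVE": "audio/x-wav", b"AVI ": "video/x-msvideo"}.get(head[8:12])
    return None


def recognize_xml(head: bytes) -> Optional[str]:
    # Only catches documents that begin with an XML declaration.
    if head.lstrip().startswith(b"<?xml"):
        return "text/xml"
    return None


sniffer = TypeSniffer()
sniffer.register(recognize_riff)
sniffer.register(recognize_xml)

print(sniffer.guess(b"RIFF\x24\x00\x00\x00WAVEfmt "))
print(sniffer.guess(b"<?xml version='1.0'?><doc/>"))
print(sniffer.guess(b"\x00\x01\x02"))
```

Registration order matters, since the first recognizer to answer wins, and anything nobody recognises falls back to a generic binary label. It remains a guess, as said above.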
Nov 15 '05 #7
