On 2007-11-18 19:45:15 -0800, Tom <Th********@earthlink.net> said:
> Peter -- Thanks. Your comments have me thinking outside the match box
> in which I was stuck. I'm now digging into the RichTextBoxStreamType
> enumeration > UnicodePlainText.

If you do that, won't you limit your input to Unicode files?
I think that one approach would be to use a StreamReader to
automatically detect the encoding of the file for you, and then read
the first 1K or so, counting how many characters return true for the
Char.IsLetterOrDigit method and comparing that to the total number of
characters.
It still won't be perfect, but you should be able to come up with a
reasonably good heuristic for what proportion of alphanumeric
characters you would expect to see in a text file.
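
Here is a rough sketch of that test, not a finished implementation. The
names TextSniffer, AlphanumericRatio and LooksLikeText are made up for
illustration, and the 0.6 threshold is only an assumed starting point that
you would tune against your own files. Note that StreamReader's detection
relies on a byte order mark; without one it falls back to the encoding you
pass in (UTF-8 here).

```csharp
using System;
using System.IO;
using System.Text;

static class TextSniffer
{
    // Fraction of the sampled characters that are letters or digits.
    // sampleSize = 1024 is an assumption based on "the first 1K or so".
    public static double AlphanumericRatio(Stream stream, int sampleSize = 1024)
    {
        // The 'true' argument asks StreamReader to detect UTF-8/UTF-16/UTF-32
        // from a byte order mark; with no BOM it uses the UTF-8 fallback.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            var buffer = new char[sampleSize];
            int read = reader.ReadBlock(buffer, 0, buffer.Length);
            if (read == 0) return 0.0; // empty input: nothing to judge

            int alnum = 0;
            for (int i = 0; i < read; i++)
            {
                if (char.IsLetterOrDigit(buffer[i])) alnum++;
            }
            return (double)alnum / read;
        }
    }

    // threshold = 0.6 is a guess, not a measured value.
    public static bool LooksLikeText(Stream stream, double threshold = 0.6)
        => AlphanumericRatio(stream) >= threshold;
}
```

You would call it with something like
TextSniffer.LooksLikeText(File.OpenRead(path)). An empty file comes back
as "not text" in this sketch, which may or may not be what you want.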
Of course, you can still include the user in the determination. For
example, run the above test and if the file passes go ahead and use it,
but if it fails provide the user with a chance to override your
analysis. You could even do this just as you suggest: provide a brief
preview of the initial part of the file to the user so that they can
visually decide whether it's a file they want treated as text.
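
For the preview, something along these lines might do. PreviewBuilder and
BuildPreview are hypothetical names, and the 200-character default is an
arbitrary choice; the idea is just to show the start of the file with
control characters masked so a binary file doesn't garble the display.

```csharp
using System;
using System.IO;
using System.Text;

static class PreviewBuilder
{
    // Reads up to maxChars characters and replaces control characters
    // (other than tab, CR and LF) with '.' so the result is safe to show.
    public static string BuildPreview(TextReader reader, int maxChars = 200)
    {
        var buffer = new char[maxChars];
        int read = reader.ReadBlock(buffer, 0, buffer.Length);

        var sb = new StringBuilder(read);
        for (int i = 0; i < read; i++)
        {
            char c = buffer[i];
            bool keep = !char.IsControl(c) || c == '\t' || c == '\r' || c == '\n';
            sb.Append(keep ? c : '.');
        }
        return sb.ToString();
    }
}
```

You could put the returned string in a message box or a read-only text
box next to a "treat as text anyway?" prompt when the automatic test
fails.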
Caveat: I have basically no experience with non-alphabetic languages,
and I don't know if in a non-alphabetic language a word character would
be considered a "letter" for the purpose of the above test. If that's
important to you, you'll want to verify that and/or find a form of
classification that will correctly detect those characters as text.
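
One way to verify it is to ask .NET directly how it classifies a few
sample characters. The sketch below uses a made-up helper, LetterCheck.Describe,
and the sample characters are just ones I picked. As far as the
documentation goes, Char.IsLetter covers the OtherLetter category, which
is where CJK ideographs, kana and Hangul syllables fall, but it is worth
running this against the characters you actually care about.

```csharp
using System;
using System.Globalization;

static class LetterCheck
{
    // Reports the code point, Unicode category and IsLetterOrDigit result
    // for a single UTF-16 code unit.
    public static string Describe(char c)
        => $"U+{(int)c:X4} {CharUnicodeInfo.GetUnicodeCategory(c)} " +
           $"IsLetterOrDigit={char.IsLetterOrDigit(c)}";

    // Sample characters (an arbitrary selection): Latin A, a CJK ideograph,
    // hiragana A, a Hangul syllable, and a comma for contrast.
    public static readonly char[] Samples =
        { 'A', '\u6F22', '\u3042', '\uD55C', ',' };
}
```

Looping over LetterCheck.Samples and writing Describe(c) to the console
shows at a glance which characters the ratio test would count as text.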
Pete