Testing File Format

Tom
Hi all,

I am looking for a smart way, within a C# method, to make sure a file
is indeed a text file and not binary.

For example: Will "thisMysteryFile.dat" be legible if opened in a
RichTextBox ... or is it a binary file?

I have searched various methods in the string class and am having no
luck.

Under consideration >>

Open the file in a binary reader and test either the first 1000 chars
or up to end of file, whichever comes first. If any char is less than
32 or greater than 127 ... flag the file as binary.

If not binary -> open in a RichTextBox
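
A minimal sketch of the check under consideration, for reference. The class and method names (AsciiSniffer, LooksBinary) are made up for illustration. One deliberate deviation from the description above: tab, line feed and carriage return are all below 32 but appear in nearly every text file, so the sketch lets them through.

```csharp
using System;
using System.IO;

static class AsciiSniffer
{
    // Hypothetical helper: reads up to maxBytes from the start of the file
    // and applies the byte-range test below.
    public static bool LooksBinary(string path, int maxBytes)
    {
        byte[] buffer = new byte[maxBytes];
        int read;
        using (FileStream stream = File.OpenRead(path))
        {
            // A single Read may return fewer bytes than requested;
            // that is acceptable for a quick sniff of the file header.
            read = stream.Read(buffer, 0, buffer.Length);
        }
        return LooksBinary(buffer, read);
    }

    // Returns true if any of the first 'count' bytes is below 32 or above 127,
    // ignoring tab (9), line feed (10) and carriage return (13).
    public static bool LooksBinary(byte[] data, int count)
    {
        for (int i = 0; i < count; i++)
        {
            byte b = data[i];
            bool whitespace = b == 9 || b == 10 || b == 13;
            if (!whitespace && (b < 32 || b > 127))
                return true;
        }
        return false;
    }
}
```

Usage would be something like AsciiSniffer.LooksBinary("thisMysteryFile.dat", 1000). Note that this only recognises plain ASCII; any file containing accented or non-Latin characters would be flagged as binary.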

Can anyone tell me a more efficient way to accomplish this task?

Thanks !!
Nov 19 '07 #1
7 Replies


The first problem I see with the "under consideration" method is that there
are so many legitimate characters (mostly in languages other than English)
that will fall outside your ASCII code range. Unicode (which can certainly be
the contents of a "text file") supports 65536 characters.
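
To make the point concrete, here is a small sketch that counts how many bytes of an encoded string fall outside the 32..127 range proposed in the question. The class and method names are hypothetical; Encoding.UTF8 and Encoding.Unicode (UTF-16) are the standard .NET encodings.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Counts the bytes that the "less than 32 or greater than 127" rule
    // would treat as evidence of a binary file.
    public static int CountOutsideAsciiRange(byte[] data)
    {
        int count = 0;
        foreach (byte b in data)
        {
            if (b < 32 || b > 127)
                count++;
        }
        return count;
    }
}
```

For example, the UTF-8 bytes of "caf\u00E9" (the word "cafe" with an accented e) contain two out-of-range bytes, and the UTF-16 bytes of plain "cafe" contain four, because every second byte is zero. Either file would be rejected by a byte-range test even though both are ordinary text.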

--Peter
"Inside every large program, there is a small program trying to get out."
http://www.eggheadcafe.com
http://petesbloggerama.blogspot.com
http://www.blogmetafinder.com

Nov 19 '07 #2

Tom
Peter -- Thanks. Your comments have me thinking outside the match box
in which I was stuck. I'm now digging into the RichTextBoxStreamType
enumeration -> UnicodePlainText.

I'll experiment with this enumeration and see if loading a binary data
file throws an exception. All this RichTextBox stuff is new for me ...
so I have a lot to learn for sure.

Perhaps a restricted load of a tiny size for a preview and then have
control buttons with "Load Full File" or "Clear RichTextBox" options?

Avoiding the accidental loading of a huge binary data file is part of
my objective. The other part of the objective is read only viewing the
small parameter data file as part of a data run initialization.

I am always amazed at how another's input can cause me to refocus.
Darn trees ruining my view of the forest!! LOL

Have a great day. Thanks again!

-- Tom

Nov 19 '07 #3

On 2007-11-18 19:45:15 -0800, Tom <Th********@earthlink.net> said:
> Peter -- Thanks. Your comments have me thinking outside the match box
> in which I was stuck. I'm now digging into the RichTextBoxStreamType
> enumeration -> UnicodePlainText.

If you do that, won't you limit your input to Unicode files?

I think that one approach would be to use a StreamReader to
automatically detect the encoding of the file for you, and then read
the first 1K or so, counting how many characters return true for the
Char.IsLetterOrDigit method and comparing that to the total number of
characters.

It still won't be perfect, but you should be able to come up with a
reasonably good heuristic for the ratio of alphanumeric characters to
total characters that you would expect to see in a text file.
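
A rough sketch of that approach, under a few assumptions: the names TextHeuristic, AlphanumericRatio and LooksLikeText are invented here, and the threshold is something you would have to tune against your own files. Also note that StreamReader's automatic detection is based on byte order marks; a file without one is decoded with the fallback encoding passed to the constructor.

```csharp
using System;
using System.IO;
using System.Text;

static class TextHeuristic
{
    // Reads up to maxChars characters and returns the fraction of them
    // for which Char.IsLetterOrDigit is true (0.0 for an empty input).
    public static double AlphanumericRatio(TextReader reader, int maxChars)
    {
        char[] buffer = new char[maxChars];
        int read = reader.ReadBlock(buffer, 0, buffer.Length);
        if (read == 0)
            return 0.0;

        int alphanumeric = 0;
        for (int i = 0; i < read; i++)
        {
            if (char.IsLetterOrDigit(buffer[i]))
                alphanumeric++;
        }
        return (double)alphanumeric / read;
    }

    // Hypothetical wrapper: opens the file, lets StreamReader look for a
    // byte order mark, and compares the ratio for the first 1K characters
    // against a caller-supplied threshold.
    public static bool LooksLikeText(string path, double threshold)
    {
        using (StreamReader reader = new StreamReader(path, Encoding.Default, true))
        {
            return AlphanumericRatio(reader, 1024) >= threshold;
        }
    }
}
```

Because AlphanumericRatio takes a TextReader, it can be tried on a StringReader before pointing it at real files. Spaces and punctuation count toward the total, so even clean English prose will score well below 1.0.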

Of course, you can still include the user in the determination. For
example, run the above test and if the file passes go ahead and use it,
but if it fails provide the user with a chance to override your
analysis. You could even do this just as you suggest: provide a brief
preview of the initial part of the file to the user so that they can
visually decide whether it's a file they want treated as text.
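
The preview idea could look roughly like this. PreviewLoader and its methods are hypothetical names; the only framework calls used are TextReader.ReadBlock and Char.IsControl. Control characters other than tab, CR and LF are replaced with '.' so that a binary file produces a readable (if obviously wrong) preview rather than garbage in the control.

```csharp
using System;
using System.IO;
using System.Text;

static class PreviewLoader
{
    // Reads at most maxChars characters from the reader, so a huge file
    // is never loaded in full just to show the first few lines.
    public static string ReadPreview(TextReader reader, int maxChars)
    {
        char[] buffer = new char[maxChars];
        int read = reader.ReadBlock(buffer, 0, buffer.Length);
        return new string(buffer, 0, read);
    }

    // Replaces control characters (except tab, CR, LF) with '.' for display.
    public static string MakeDisplayable(string text)
    {
        StringBuilder result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool keep = c == '\t' || c == '\r' || c == '\n' || !char.IsControl(c);
            result.Append(keep ? c : '.');
        }
        return result.ToString();
    }
}
```

The preview string can then be assigned to the text control, with the "Load Full File" decision left to the user.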

Caveat: I have basically no experience with non-alphabetic languages,
and I don't know if in a non-alphabetic language a word character would
be considered a "letter" for the purpose of the above test. If that's
important to you, you'll want to verify that and/or find a form of
classification that will correctly detect those characters as text.
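
On the caveat above: a quick way to check is to ask the framework for a character's Unicode category. The helper below is only an illustration (LetterCheck and Describe are made-up names); Char.GetUnicodeCategory and Char.IsLetterOrDigit are the real methods.

```csharp
using System;
using System.Globalization;

static class LetterCheck
{
    // Formats the code point, Unicode category and IsLetterOrDigit result
    // for a single UTF-16 code unit.
    public static string Describe(char c)
    {
        UnicodeCategory category = char.GetUnicodeCategory(c);
        return string.Format("U+{0:X4} {1} letterOrDigit={2}",
            (int)c, category, char.IsLetterOrDigit(c));
    }
}
```

LetterCheck.Describe('\u4E2D') (a CJK ideograph) gives "U+4E2D OtherLetter letterOrDigit=True", because IsLetterOrDigit accepts the OtherLetter category as well as upper- and lowercase letters. CJK punctuation such as '\u3002' is classified as OtherPunctuation and does not count, just like an English full stop.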

Pete

Nov 19 '07 #4

> Unicode (which can certainly
> be the contents of a "text file") supports 65536 characters.

Unicode goes up to U+10FFFF, which is a bit more than one million code points.
Other than that, very good warning :-)
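
For C# this matters because a char is a 16-bit UTF-16 code unit, so code points above U+FFFF take two chars (a surrogate pair). A small sketch, with CodePoints and CountCodePoints as invented names, and Char.IsSurrogatePair and Char.ConvertFromUtf32 as the real framework methods:

```csharp
using System;

static class CodePoints
{
    // Counts Unicode code points rather than UTF-16 code units by
    // treating each valid surrogate pair as a single character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }
}
```

For the string "a" + char.ConvertFromUtf32(0x10FFFF), Length is 3 while CountCodePoints returns 2. Any per-char test (including Char.IsLetterOrDigit on a single char) sees the two halves of such a pair separately.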
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Nov 19 '07 #5

Tom
Pete --

Thank you! I am new to C# and I am exploring StreamReader a.s.a.p.

I work only in the English language and am not developing programs for
global distribution. Your methodology seems solid to this newb. Usage
of Char.IsLetterOrDigit would effectively provide some language
independence. That independence makes for a MUCH better tool than what
I had been focused upon.

Very, very thought provoking!

Again, thanks. -- Tom

Nov 19 '07 #6

Tom
Hey folks --

I've been rethinking my usage of RichTextBox long and hard. At first
it seemed the new do-it-all magic class, and for some tasks it is just
that! But accidentally opening a huge file from a ListView selection is
painfully slow and consumes resources like there's no tomorrow. Ouch.

What I really crave is a Text Viewer class without editing capability:
one that only loads a screen's worth of text at a time, where the
scroll-bar thumb is sized to reflect the file size and the placement of
the thumb loads just that section of the data file. Like Petzold's
painting-with-text example from Programming Windows 95 ... only in .NET
2.0 C# and integrated with a simpler TextBox? Or another text viewing
control that is more appropriate.
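
The core of such a viewer, independent of any control, is reading only the slice of the file that corresponds to the thumb position. The sketch below assumes the thumb position is available as a fraction between 0.0 and 1.0; FileWindow and ReadWindow are hypothetical names, not an existing class.

```csharp
using System;
using System.IO;

static class FileWindow
{
    // Maps a thumb position (0.0 = top, 1.0 = bottom) to a byte offset
    // and reads at most windowBytes from there. Only that slice of the
    // stream is ever held in memory.
    public static byte[] ReadWindow(Stream stream, double thumbFraction, int windowBytes)
    {
        double fraction = Math.Min(1.0, Math.Max(0.0, thumbFraction));
        long maxStart = Math.Max(0L, stream.Length - windowBytes);
        long offset = (long)(fraction * maxStart);

        stream.Seek(offset, SeekOrigin.Begin);

        byte[] buffer = new byte[windowBytes];
        int total = 0;
        while (total < windowBytes)
        {
            int n = stream.Read(buffer, total, windowBytes - total);
            if (n == 0)
                break; // end of stream
            total += n;
        }
        Array.Resize(ref buffer, total);
        return buffer;
    }
}
```

The thumb size would then be proportional to windowBytes divided by the stream length. Two things this sketch ignores: the offset can land in the middle of a line, and with a multi-byte encoding it can land in the middle of a character, so a real viewer would need to resynchronise on the next line break before decoding.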

I'm still searching for such a Text Viewer. A search on "Thumb Size
.NET 2.0" led me to some graphics intensive TrackBarRenderer,
trackRectangle, thumbRectangle, etc. usage that goes way beyond the
WinForms book and C# Instructional Texts that I have. Certainly
steepening my learning curve!

My guess is someone has already duplicated that Petzold example in C#
2.0 and that I would learn more and faster from studying a guru's
coding than creating my own.

If anyone can point me towards such a useful, compact, and also
complex tool ... I would be without doubt grateful.

Thanks. -- Tom


Nov 19 '07 #7

On 2007-11-19 06:25:57 -0800, Tom <Th********@earthlink.net> said:
> [...]
> I'm still searching for such a Text Viewer. A search on "Thumb Size
> .NET 2.0" led me to some graphics intensive TrackBarRenderer,
> trackRectangle, thumbRectangle, etc. usage that goes way beyond the
> WinForms book and C# Instructional Texts that I have. Certainly
> steepening my learning curve!
>
> My guess is someone has already duplicated that Petzold example in C#
> 2.0 and that I would learn more and faster from studying a guru's
> coding than creating my own.
>
> If anyone can point me towards such a useful, compact, and also
> complex tool ... I would be without doubt grateful.

I'm not familiar with Petzold's examples, so I can't comment on that.
As far as what you're asking about, I'm not aware of a specific
text-box implementation that does what you're talking about. It
wouldn't be that hard to do, at least for the basic implementation
(duplicating the full functionality of the TextBoxBase classes would be
harder, but it sounds like you only need a minimal subset).

Interestingly, taking a suggestion from a different thread -- in which
someone suggested using a ListBox to implement a console-output-like
control -- you could use the DataGridView in a similar way, taking
advantage of its "VirtualMode" mechanism. Using that, the control
handles all of the display and you provide the code that virtualizes
the data rather than having it all in memory at once.
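
What "virtualizing the data" might mean in practice: keep only an index of where each line starts, and read a line from disk when the grid asks for it. The LineIndex class below is a hypothetical sketch, not an existing component. It assumes an encoding in which the byte 0x0A only ever means line feed (true for ASCII, ANSI code pages and UTF-8, not for UTF-16).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

sealed class LineIndex
{
    private readonly Stream stream;
    private readonly Encoding encoding;
    private readonly List<long> lineStarts = new List<long>();
    private readonly long length;

    // One pass over the stream records the byte offset of every line start.
    // Use a buffered stream (FileStream is buffered) since this reads byte by byte.
    public LineIndex(Stream stream, Encoding encoding)
    {
        this.stream = stream;
        this.encoding = encoding;

        stream.Seek(0, SeekOrigin.Begin);
        lineStarts.Add(0);
        long position = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            position++;
            if (b == '\n')
                lineStarts.Add(position);
        }
        length = position;

        // A trailing newline (or an empty file) leaves a start with no content.
        if (lineStarts[lineStarts.Count - 1] == length)
            lineStarts.RemoveAt(lineStarts.Count - 1);
    }

    public int Count
    {
        get { return lineStarts.Count; }
    }

    // Reads and decodes a single line on demand, without its line terminator.
    public string GetLine(int index)
    {
        long start = lineStarts[index];
        long end = index + 1 < lineStarts.Count ? lineStarts[index + 1] : length;
        byte[] buffer = new byte[(int)(end - start)];

        stream.Seek(start, SeekOrigin.Begin);
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0)
                break;
            total += n;
        }
        return encoding.GetString(buffer, 0, total).TrimEnd('\r', '\n');
    }
}
```

With a single-column DataGridView, the hookup would be along the lines of setting VirtualMode to true, setting RowCount to index.Count, and handling CellValueNeeded by assigning index.GetLine(e.RowIndex) to e.Value. That wiring is untested here and the styling concerns mentioned in this post still apply.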

It could be overkill -- the DataGridView control has lots of stuff in
it that would be of no value for this purpose -- and you might have
trouble getting it to look just right, since the DataGridView does have
a specific look and I don't know if you could get rid of the elements
that would be distracting in this use.

But hey, when you're hacking stuff, you can't be picky. :)

Pete

Nov 19 '07 #8
