Testing File Format

Tom

Hi all,

I am looking for a smart way to make sure, from within a C# method,
that a file is indeed a text file and not binary.

For example: Will "thisMysteryFile.dat" be legible if opened in a
RichTextBox ... or is it a binary file?

I have searched various methods in the string class and am having no
luck.

Under consideration:

Open the file in a binary reader and then test either the first 1000
chars or until end of file; if any char is less than 32 or greater
than 127, flag the file as binary.

If it is not binary, open it in a RichTextBox.
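
A minimal sketch of the check described above, for reference. The class and method names (AsciiCheck, LooksLikeAsciiText) are made up for illustration. One assumption goes beyond the description: tab, line feed and carriage return are allowed, because they are below 32 and appear in nearly every text file, so a strict "less than 32" test would flag ordinary multi-line text as binary.

```csharp
using System;
using System.IO;

static class AsciiCheck
{
    // Hypothetical helper: returns true when the first maxBytes bytes of the
    // file are all in the range 32..127, or are tab, line feed or carriage return.
    public static bool LooksLikeAsciiText(string path, int maxBytes)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            for (int i = 0; i < maxBytes; i++)
            {
                int b = stream.ReadByte();
                if (b < 0)
                {
                    break; // end of file reached before maxBytes
                }

                // Assumption: these three control characters count as text.
                bool isAllowedControl = b == 9 || b == 10 || b == 13;
                bool isInRange = b >= 32 && b <= 127;
                if (!isAllowedControl && !isInRange)
                {
                    return false; // flag as binary
                }
            }
            return true;
        }
    }
}
```

A call such as AsciiCheck.LooksLikeAsciiText("thisMysteryFile.dat", 1000) would then decide whether to load the file. Any text containing accented or non-Latin characters fails this test, since those bytes are above 127.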

Can anyone tell me a more efficient way to accomplish this task?

Thanks !!

Nov 19 '07 #1

Peter Bromberg [C# MVP]

The first problem I see with the "under consideration" method is that there
are so many legitimate characters (mostly in languages other than English)
that will fall outside your ASCII code range. Unicode (which can certainly be
the contents of a "text file") supports 65536 characters.

--Peter
"Inside every large program, there is a small program trying to get out."
http://www.eggheadcafe.com
http://petesbloggerama.blogspot.com
http://www.blogmetafinder.com

Nov 19 '07 #2

Tom

Peter -- Thanks. Your comments have me thinking outside the match box
in which I was stuck. I'm now digging into the RichTextBoxStreamType
enumeration, specifically UnicodePlainText.

I'll experiment with this enumeration and see if loading a binary data
file throws an exception. All this RichTextBox stuff is new for me ...
so I have a lot to learn for sure.

Perhaps a restricted load of a tiny size for a preview and then have
control buttons with "Load Full File" or "Clear RichTextBox" options?

Avoiding the accidental loading of a huge binary data file is part of
my objective. The other part of the objective is read-only viewing of a
small parameter data file as part of a data run initialization.

I am always amazed at how another's input can cause me to refocus.
Darn trees ruining my view of the forest!! LOL

Have a great day. Thanks again!

-- Tom

Nov 19 '07 #3

Peter Duniho

On 2007-11-18 19:45:15 -0800, Tom <Th********@earthlink.net> said:

> Peter -- Thanks. Your comments have me thinking outside the match box
> in which I was stuck. I'm now digging into the RichTextBoxStreamType
> enumeration >UnicodePlainText.

If you do that, won't you limit your input to Unicode files?

I think that one approach would be to use a StreamReader to
automatically detect the encoding of the file for you, and then read
the first 1K or so, counting how many characters return true for the
Char.IsLetterOrDigit method and comparing that to the total number of
characters.

It still won't be perfect, but you should be able to come up with a
reasonably good heuristic regarding what the ratio of alphanumeric
characters to other characters you would expect to see in a text file.
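
A rough sketch of that idea follows. The names (TextSniffer, AlphanumericRatio) are hypothetical, and the threshold to compare the ratio against is left to experiment. Note that, as far as I know, StreamReader's automatic detection only recognizes a byte order mark (UTF-8, UTF-16, UTF-32); without one it falls back to UTF-8, so files in other encodings need the encoding passed explicitly.

```csharp
using System;
using System.IO;

static class TextSniffer
{
    // Hypothetical helper: fraction of the first maxChars characters for which
    // Char.IsLetterOrDigit returns true. Returns 0 for an empty file.
    public static double AlphanumericRatio(string path, int maxChars)
    {
        // 'true' asks StreamReader to look for a byte order mark.
        using (StreamReader reader = new StreamReader(path, true))
        {
            char[] buffer = new char[maxChars];
            int read = reader.ReadBlock(buffer, 0, buffer.Length);
            if (read == 0)
            {
                return 0.0;
            }

            int alphanumeric = 0;
            for (int i = 0; i < read; i++)
            {
                if (char.IsLetterOrDigit(buffer[i]))
                {
                    alphanumeric++;
                }
            }
            return (double)alphanumeric / read;
        }
    }
}
```

Something like TextSniffer.AlphanumericRatio(path, 1024) compared against a cutoff you pick from your own sample files would be the whole test. Bytes that do not decode come back as the replacement character U+FFFD, which is not a letter or digit, so binary data tends to score low.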

Of course, you can still include the user in the determination. For
example, run the above test and if the file passes go ahead and use it,
but if it fails provide the user with a chance to override your
analysis. You could even do this just as you suggest: provide a brief
preview of the initial part of the file to the user so that they can
visually decide whether it's a file they want treated as text.

Caveat: I have basically no experience with non-alphabetic languages,
and I don't know if in a non-alphabetic language a word character would
be considered a "letter" for the purpose of the above test. If that's
important to you, you'll want to verify that and/or find a form of
classification that will correctly detect those characters as text.
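
One way to verify this is a small check like the sketch below (the class and method names are made up for illustration). To my knowledge .NET places CJK ideographs, kana and Hangul syllables in UnicodeCategory.OtherLetter, which Char.IsLetter treats as letters, so they should pass.

```csharp
using System;

static class LetterCheck
{
    // Hypothetical helper: true when the sample is non-empty and every
    // UTF-16 code unit in it is a letter or digit per Char.IsLetterOrDigit.
    public static bool AllLettersOrDigits(string sample)
    {
        foreach (char c in sample)
        {
            if (!char.IsLetterOrDigit(c))
            {
                return false;
            }
        }
        return sample.Length > 0;
    }
}
```

Calling LetterCheck.AllLettersOrDigits("\u65E5\u672C\u8A9E") (the word for "Japanese language", written in three ideographs) is a quick test. Characters outside the Basic Multilingual Plane are stored as surrogate pairs, and the single-char overload returns false for each half; the Char.IsLetterOrDigit(string, int) overload handles those.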

Pete

Nov 19 '07 #4

Mihai N.

> Unicode (which can certainly be the contents of a "text file")
> supports 65536 characters.

Unicode goes up to 10FFFF, which is a bit more than one million.
Other than that, very good warning :-)
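
For C# the practical consequence is that a char is a 16-bit UTF-16 code unit, so code points above FFFF take two chars. A small sketch, with hypothetical names (UnicodeRange, Utf16Length):

```csharp
using System;

static class UnicodeRange
{
    // Highest Unicode code point: 0x10FFFF, which is 1,114,111 in decimal.
    public const int MaxCodePoint = 0x10FFFF;

    // Number of .NET chars (UTF-16 code units) needed for one code point.
    public static int Utf16Length(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }
}
```

UnicodeRange.Utf16Length(0x41) is 1, while UnicodeRange.Utf16Length(0x10000) is 2. Any per-char test on a string therefore sees the two halves of a surrogate pair separately.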
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Nov 19 '07 #5

Tom

Pete --

Thank you! I am new to C# and I am exploring StreamReader a.s.a.p.

I work only in the English language and am not developing programs for
global distribution. Your methodology seems solid to this newb. Usage
of Char.IsLetterOrDigit would effectively provide some language
independence. That independence makes for a MUCH better tool than what
I had been focused upon.

Very, very thought provoking!

Again, thanks. -- Tom

Nov 19 '07 #6

Tom

Hey folks --

I've been rethinking my usage of RichTextBox long and hard. At first
it seemed the do-it-all new magic class. For some tasks it is just that!
But accidentally opening a huge file from a ListView selection is
painfully slow and consumes resources like no tomorrow. Ouch.

What I really crave is a Text Viewer class without editing capability:
one that only loads a screen's worth of text at a time, where the thumb
is sized to reflect the file size and the placement of the thumb loads
just that section of the data file. Like Petzold's painting with text
example from Programming Windows 95 ... only in .Net 2.0 C# and
integrated with a simpler TextBox? Or another text viewing control
that is more appropriate.
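
The file-reading half of such a viewer can be sketched roughly as below. This is only an illustration under assumptions: the names (FileWindow, OffsetForScroll, ReadWindow) are invented, the scroll position is mapped linearly onto the file length, and the caller supplies the encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class FileWindow
{
    // Hypothetical helper: map a scroll position (0..scrollMax) to a byte offset.
    public static long OffsetForScroll(long fileLength, int scrollValue, int scrollMax)
    {
        if (scrollMax <= 0)
        {
            return 0;
        }
        return fileLength * scrollValue / scrollMax;
    }

    // Reads up to byteCount bytes starting at offset and decodes them.
    // Only this window is ever held in memory, never the whole file.
    public static string ReadWindow(string path, long offset, int byteCount, Encoding encoding)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            stream.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[byteCount];
            int total = 0;
            int n;
            // Read may return fewer bytes than requested, so loop until done.
            while (total < byteCount && (n = stream.Read(buffer, total, byteCount - total)) > 0)
            {
                total += n;
            }
            return encoding.GetString(buffer, 0, total);
        }
    }
}
```

A scroll handler would compute the offset from the thumb position and display ReadWindow(path, offset, 4096, Encoding.ASCII) or similar. The offset can land in the middle of a line, or in the middle of a multi-byte character for encodings such as UTF-8, so a real viewer would need to back up to the previous line break.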

I'm still searching for such a Text Viewer. A search on "Thumb Size
.Net 2.0" led me to some graphics intensive TrackBarRenderer,
trackRectangle, thumbRectangle, etc. usage that goes way beyond the
WinForms book and C# Instructional Texts that I have. Certainly
steepening my learning curve!

My guess is someone has already duplicated that Petzold example in C#
2.0 and that I would learn more and faster from studying a guru's
coding than creating my own.

If anyone can point me towards such a useful, compact, and also
complex tool ... I would be without doubt grateful.

Thanks. -- Tom

Nov 19 '07 #7

Peter Duniho

On 2007-11-19 06:25:57 -0800, Tom <Th********@earthlink.net> said:

> [...]
> I'm still searching for such a Text Viewer. A search on "Thumb Size
> .Net 2.0" led me to some graphics intensive TrackBarRenderer,
> trackRectangle, thumbRectangle, etc. usage that goes way beyond the
> WinForms book and C# Instructional Texts that I have. Certainly
> steepening my learning curve!
>
> My guess is someone has already duplicated that Petzold example in C#
> 2.0 and that I would learn more and faster from studying a guru's
> coding than creating my own.
>
> If anyone can point me towards such a useful, compact, and also
> complex tool ... I would be without doubt grateful.

I'm not familiar with Petzold's examples, so I can't comment on that.
As far as what you're asking about, I'm not aware of a specific
text-box implementation that does what you're talking about. It
wouldn't be that hard to do, at least for the basic implementation
(duplicating the full functionality of the TextBoxBase classes would be
harder, but it sounds like you only need a minimal subset).

Interestingly, taking a suggestion from a different thread -- in which
someone suggested using a ListBox to implement a console-output-like
control -- you could use the DataGridView in a similar way, taking
advantage of its "VirtualMode" mechanism. Using that, the control
handles all of the display and you provide the code that virtualizes
the data rather than having it all in memory at once.
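
The data side of that could look something like the sketch below. LineIndex is a hypothetical class, not part of the framework. It assumes an ASCII or UTF-8 file, where the byte 0x0A only ever means a line feed, and it reopens the file for each line, which a real implementation would avoid.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: remembers the byte offset where each line starts so a
// single line can be fetched on demand without loading the whole file.
sealed class LineIndex
{
    private readonly string path;
    private readonly List<long> lineStarts = new List<long>();
    private readonly long length;

    public LineIndex(string path)
    {
        this.path = path;
        using (FileStream stream = File.OpenRead(path))
        {
            length = stream.Length;
            if (length > 0)
            {
                lineStarts.Add(0);
            }
            long position = 0;
            int b;
            while ((b = stream.ReadByte()) >= 0)
            {
                position++;
                // A new line starts after each '\n' unless it is the last byte.
                if (b == '\n' && position < length)
                {
                    lineStarts.Add(position);
                }
            }
        }
    }

    public int LineCount
    {
        get { return lineStarts.Count; }
    }

    // Reads one line from disk, without its trailing line break.
    public string GetLine(int row)
    {
        long start = lineStarts[row];
        long end = row + 1 < lineStarts.Count ? lineStarts[row + 1] : length;
        byte[] buffer = new byte[end - start];
        using (FileStream stream = File.OpenRead(path))
        {
            stream.Seek(start, SeekOrigin.Begin);
            int total = 0;
            int n;
            while (total < buffer.Length && (n = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += n;
            }
            return Encoding.UTF8.GetString(buffer, 0, total).TrimEnd('\r', '\n');
        }
    }
}
```

To hook it up, you would set the grid's VirtualMode property to true, give it one column, set RowCount to index.LineCount, and in a CellValueNeeded handler assign e.Value = index.GetLine(e.RowIndex). I have not tried this with a very large file, so treat it as a starting point.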

It could be overkill -- the DataGridView control has lots of stuff in
it that would be of no value for this purpose -- and you might have
trouble getting it to look just right, since the DataGridView does have
a specific look and I don't know if you could get rid of the elements
that would be distracting in this use.

But hey, when you're hacking stuff, you can't be picky. :)

Pete

Nov 19 '07 #8
