is it text file?

Can you write a program to tell whether a file is a text file or not?
Jan 20 '07 #1
willakawill
1,646 1GB
can u write a program to tell whether a file is a text file or not
Hi
The only thing that normally distinguishes a text file is the .txt suffix, but you can give that suffix to any file you like. There is nothing you can check to prove that a file absolutely is or is not a text file. It is just bytes of data on disk. There are some properties common to most text files, but none that holds for every one. So the answer is no.
Jan 20 '07 #2
Banfa
9,065 Expert Mod 8TB
I think I would disagree with you willakawill.

A text file is one that contains only printable characters plus newlines. So opening the file and checking every character to make sure it fits that criterion would be the test I would put forward.
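
For illustration, the test described above might look roughly like this in C++. This is only a sketch under some assumptions: the names looksLikeText and fileLooksLikeText are made up for the example, "printable" means printable ASCII as reported by std::isprint in the default "C" locale, and any whitespace character (not just newline) is allowed.

```cpp
#include <cctype>
#include <fstream>
#include <iterator>
#include <string>

// Returns true if every byte is printable ASCII or whitespace
// (space, tab, CR, LF, ...). A single other byte makes the data "binary".
bool looksLikeText(const std::string& bytes)
{
    for (unsigned char c : bytes) {
        if (!std::isprint(c) && !std::isspace(c)) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: reads the whole file in binary mode and applies
// the same test. Returns false if the file cannot be opened.
bool fileLooksLikeText(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) {
        return false;
    }
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return looksLikeText(bytes);
}
```

Note that under these assumptions a file in UTF-8 or another encoding that uses bytes above 127 would be reported as binary, so the definition of "printable" would need widening for such files.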
Jan 22 '07 #3
Hi,
I am not sure about Windows, but Unix provides a utility called "file" which reports the properties of a file. Using the "system" function, we can execute this command from a C program.
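
A rough sketch of that idea follows, assuming a POSIX system with the file utility installed. It uses popen instead of system, because system only hands back the exit status while popen lets the program read what the command prints. The helper names shellQuote and describeFile are invented for this example.

```cpp
#include <stdio.h>   // popen and pclose are POSIX, declared here
#include <string>

// Wraps a path in single quotes for the shell, escaping embedded quotes.
std::string shellQuote(const std::string& path)
{
    std::string quoted = "'";
    for (char c : path) {
        if (c == '\'') {
            quoted += "'\\''";   // close quote, escaped quote, reopen quote
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Runs `file <path>` and returns whatever it prints.
// Returns an empty string if the command could not be started.
std::string describeFile(const std::string& path)
{
    const std::string command = "file " + shellQuote(path);
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        return "";
    }
    std::string output;
    char buffer[256];
    while (fgets(buffer, sizeof buffer, pipe) != nullptr) {
        output += buffer;
    }
    pclose(pipe);
    return output;
}
```

The caller would then look for a word such as "text" in the returned description. The exact wording of that description depends on the version of file installed, so treat any string matching as a heuristic.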
Jan 23 '07 #4
willakawill
1,646 1GB
I think I would disagree with you willakawill.

A text file is one that contains only printable characters plus newlines. So opening the file and checking every character to make sure it fits that criterion would be the test I would put forward.
And you are 100% certain that no other file would contain bytes of data that, individually, could be looked at as 13 or 65 or 10 or null?
A long could be stored in a binary file and look exactly like that.
Jan 23 '07 #5
Banfa
9,065 Expert Mod 8TB
And you are 100% certain that no other file would contain bytes of data that, individually, could be looked at as 13 or 65 or 10 or null?
A long could be stored in a binary file and look exactly like that.
I am certain that other files could and would contain some bytes in the printable range.

In very rare cases, I guess a binary file could contain only values in the range of printable characters. However, I would expect most binary files to contain at least one character outside the printable range on probability alone, and it only takes one such character to make the file binary and not text.

Extension is not a good test of what a file contains, as it is just 3 arbitrary letters. My test will at least classify the file as either

Binary

or

Could be Text

I guess what you could do is count the number of characters of each type and the occurrence of spaces. A text file should be split into words, so you probably ought to get a space every 7 or 8 characters on average. Based on the frequency of different letters, you may be able to take a guess at the language.
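
As a sketch of the counting idea, the following C++ gathers a few statistics in one pass. The struct and function names (CharStats, collectStats, charsPerSpace, classify) are hypothetical, and the character classes assume ASCII in the default "C" locale.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string>

struct CharStats {
    std::size_t total = 0;
    std::size_t spaces = 0;         // the ' ' character only
    std::size_t letters = 0;        // A-Z and a-z
    std::size_t nonPrintable = 0;   // neither printable nor whitespace
    std::array<std::size_t, 26> letterCounts{};  // case-insensitive a..z
};

CharStats collectStats(const std::string& bytes)
{
    CharStats s;
    for (unsigned char c : bytes) {
        ++s.total;
        if (c == ' ') {
            ++s.spaces;
        }
        if (std::isalpha(c)) {
            ++s.letters;
            ++s.letterCounts[std::tolower(c) - 'a'];
        }
        if (!std::isprint(c) && !std::isspace(c)) {
            ++s.nonPrintable;
        }
    }
    return s;
}

// Average number of characters per space; 0 if there are no spaces.
double charsPerSpace(const CharStats& s)
{
    return s.spaces == 0 ? 0.0 : static_cast<double>(s.total) / s.spaces;
}

// The two-way classification from the post: one bad byte means binary.
std::string classify(const CharStats& s)
{
    return s.nonPrintable > 0 ? "Binary" : "Could be Text";
}
```

The letterCounts array could then be compared against published letter frequencies for a candidate language. That comparison is left out here because the frequency tables would have to come from an outside source.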
Jan 23 '07 #6
willakawill
1,646 1GB
This code:
    #include <fstream>
    using namespace std;

    typedef char charar[10];  // each entry is a fixed 10-byte slot

    void writeBinary()
    {
        // Bytes after the end of each word are zero-filled ('\0').
        charar arSt[6] = {"hubble", "bubble", "toil", "and", "trouble", "hahaha"};

        ofstream ofst;
        ofst.open("c:\\data\\testbinary.bin", ios::out | ios::binary);
        ofst.write(reinterpret_cast<char *>(arSt), sizeof arSt);  // 6 * 10 = 60 bytes
        ofst.close();
    }
Produces a binary file. Not a text file. It can be displayed by notepad but it is not a text file. In notepad it looks like:

    hubble    bubble    toil      and       trouble   hahaha 
and it can be read straight back into the array as a binary file
    // Reads the 60 bytes straight back into the caller's array.
    void readBinary(charar (&arSt)[6])
    {
        ifstream ifst;
        ifst.open("c:\\data\\testbinary.bin", ios::in | ios::binary);
        ifst.read(reinterpret_cast<char *>(arSt), sizeof arSt);
        ifst.close();
    }
This is a text file:
    hubble,
    bubble,
    toil,
    and,
    trouble,
    hahaha,
but it does not contain any spaces.
Therefore, I contend, looking for printable characters or spaces does not tell us if a file is a text file or not.
Jan 23 '07 #7
Banfa
9,065 Expert Mod 8TB
Produces a binary file. Not a text file. It can be displayed by notepad but it is not a text file. In notepad it looks like:

    hubble    bubble    toil      and       trouble   hahaha 
In what sense is this not a text file? If I gave it to you without showing you the code that created it, how would you classify it?

This is a text file:
    hubble,
    bubble,
    toil,
    and,
    trouble,
    hahaha,
but it does not contain any spaces.
Therefore, I contend, looking for printable characters or spaces does not tell us if a file is a text file or not.
OK, the algorithm was a little simplified. You would expect a text file to consist of words and therefore to have a word separator every 6 to 8 characters on average. I simplified word separator to space, but actually a word separator is any punctuation mark or any white space.
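
A minimal sketch of that wider definition of a word separator, assuming ASCII and the default "C" locale. The function names isWordSeparator and averageWordLength are invented for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// A separator is any whitespace or punctuation character.
bool isWordSeparator(unsigned char c)
{
    return std::isspace(c) || std::ispunct(c);
}

// Average length of the runs of non-separator characters.
// Returns 0 if the text contains no words at all.
double averageWordLength(const std::string& text)
{
    std::size_t words = 0;
    std::size_t wordChars = 0;
    bool inWord = false;
    for (unsigned char c : text) {
        if (isWordSeparator(c)) {
            inWord = false;
        } else {
            if (!inWord) {
                ++words;       // first character of a new word
                inWord = true;
            }
            ++wordChars;
        }
    }
    return words == 0 ? 0.0 : static_cast<double>(wordChars) / words;
}
```

With this definition the comma-separated list quoted above counts as six words even though it contains no spaces, because the commas and newlines act as separators.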

However, a text file does not have to contain words. It is a file that contains only printable characters. Here is a text file:

    keiuwefhjwer
    wswfiwefhwef iowef weijh  wefwefwe
    wefwefoewoij 0943rt345 .@:{£Q(&FOfgwr
    ewrgf-0-sxvb@~£{=0@43p;'#tl35y-0[9;llgr
It means nothing to you but it is text because all the characters are printable.

Being a text file has nothing to do with your ability to read it and everything to do with the values of the characters that appear in the file. The reason is that a text file (consisting only of printable characters) can be reliably transferred over pretty much any communication medium, but a binary file that may have strange control codes in it may cause some (poorly written) transport protocols to break because of the presence of the binary values.

It does not matter if you can read a text file or not; all that is required is that all characters are in the range of printable characters.


Text is not a classification of readability, it is a classification of allowable character value.
Jan 23 '07 #8
willakawill
1,646 1GB
Text is not a classification of readability, it is a classification of allowable character value.
And my point is this;
Do you think this is what the op was asking?
Jan 23 '07 #9
DeMan
1,806 1GB
IMHO - I think Banfa is quite right. The original question was
can u write a program to tell whether a file is a text file or not

The point (I think) Banfa may be trying to make is exactly your last comment
Do you think this is what the op was asking?
By highlighting the fact that different people call different things a text file, and that in its broadest definition a text file is a file that includes only characters within the readable character set (and whitespace), he is not only giving a possible answer to the question, he is probably prompting a more specific question if his answer is not what was expected.

Unlike people, computers can't very easily interpret (or even hypothesise) the intentions of another individual (otherwise my code would always work because the computer could work out what I was TRYING to do).
Jan 23 '07 #10
willakawill
1,646 1GB
IMHO - I think Banfa is quite right. The original question was



The point (I think) Banfa may be trying to make is exactly your last comment


By highlighting the fact that different people call different things a text file, and that in its broadest definition a text file is a file that includes only characters within the readable character set (and whitespace), he is not only giving a possible answer to the question, he is probably prompting a more specific question if his answer is not what was expected.

Unlike people, computers can't very easily interpret (or even hypothesise) the intentions of another individual (otherwise my code would always work because the computer could work out what I was TRYING to do).
Except, in this case, we do have the op to correct us
Jan 23 '07 #11
DeMan
1,806 1GB
Which is exactly why Banfa's post is adequate
- if the op disagrees he can better clarify his question - interestingly he hasn't.

he is not only giving a possible answer to the question, he is probably prompting a more specific question if his answer is not what was expected.
Jan 23 '07 #12
Banfa
9,065 Expert Mod 8TB
And my point is this;
Do you think this is what the op was asking?
I have no idea what the op was asking because he has given no context.

In this case I have chosen to interpret what he has asked literally rather than try to guess what he meant (much like a computer).

A text file has a very specific meaning in programming and I chose to use that meaning because this is a programming forum and there is no information to indicate that the OP meant any different and they did use the term.

I do not think it is worth the time trying to guess if people mean what they have written or something else, I choose to assume they mean what they have written and that if they didn't they will correct themselves when they get an answer that isn't quite what they are expecting.



However, just to satisfy you, assuming that the OP meant

"Is it possible to detect if a file contains readable text of a specific language (say English)?" then the answer is:

It is jolly hard. With the use of a spelling checker you may be able to confirm that a file contains only words (or attempts at words) from a given language, but ensuring that they have been put together into coherent sentences is probably beyond the capabilities of a computer unless you have a very large pot of money and time to throw at the task.
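
To make the spelling-checker half of that concrete, here is a small C++ sketch that reports what fraction of the words in a piece of text appear in a word list. Everything about it is an assumption made for illustration: the name knownWordFraction is invented, the dictionary is whatever set of lower-case words the caller supplies, and a word is taken to be a run of ASCII letters. It says nothing about whether the words form coherent sentences.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Fraction of alphabetic words found in `dictionary` (lower-case entries).
// Returns 0 if the text contains no words.
double knownWordFraction(const std::string& text,
                         const std::unordered_set<std::string>& dictionary)
{
    std::size_t words = 0;
    std::size_t known = 0;
    std::string word;

    // Counts the word collected so far, if any, and starts a new one.
    auto flush = [&]() {
        if (!word.empty()) {
            ++words;
            if (dictionary.count(word) > 0) {
                ++known;
            }
            word.clear();
        }
    };

    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        } else {
            flush();   // any non-letter ends the current word
        }
    }
    flush();           // the text may end in the middle of a word
    return words == 0 ? 0.0 : static_cast<double>(known) / words;
}
```

A caller might treat a fraction close to 1 as "probably this language" and a fraction close to 0 as "probably not", with the threshold being a judgment call.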
Jan 23 '07 #13
willakawill
1,646 1GB
Which is exactly why Banfa's post is adequate
- if the op disagrees he can better clarify his question - interestingly he hasn't.
So ask him.
This forum is educational for all, including those of us who dare to post answers. Making assumptions about the original question is not necessary and not useful. I have often replied to an op with questions for clarification only to see others rushing in with answers and clearly stating that they are making assumptions. How is that useful? Whatever happened to dialog?

I made this mistake myself when answering this question.

It takes all sorts :)
Jan 23 '07 #14
