is_ascii() or is_binary() for files?

Is there a way to determine whether a file is plain ascii text or not
using standard C++?
Jul 5 '08 #1
"Brad" wrote:
> Is there a way to determine whether a file is plain ascii text or not
> using standard C++?
No. It's in the eye of the beholder. You can make a very good guess by
counting control characters that wouldn't likely be in text. But the
possibility exists that a binary file might not have any of them either.
Jul 5 '08 #2
Brad <br**@16systems.com> writes:
> Is there a way to determine whether a file is plain ascii text or not
> using standard C++?
Sure, just read its contents and look for any byte that's > 127. If
you find one, the file's contents are not plain ASCII.

sherm--

--
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net
Jul 5 '08 #3
On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
> Brad <b...@16systems.com> writes:
>> Is there a way to determine whether a file is plain ascii text or not
>> using standard C++?
>
> Sure, just read its contents and look for any byte that's > 127. If
> you find one, the file's contents are not plain ASCII.
if he tries to test a text file which contains non-English text, he
will fail!!
because non-English chars are > 127
Jul 5 '08 #4
Medvedev wrote:
> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
>> Brad <b...@16systems.com> writes:
>>> Is there a way to determine whether a file is plain ascii text or not
>>> using standard C++?
>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.
>
> if he tries to test a text file which contains non-English text, he
> will fail!!
> because non-English chars are > 127
OP specified ASCII, not non-English text.
Jul 5 '08 #5
On Jul 5, 11:45 am, Medvedev <3D.v.Wo...@gmail.com> wrote:
> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
>> Brad <b...@16systems.com> writes:
>>> Is there a way to determine whether a file is plain ascii text or not
>>> using standard C++?
>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.
>
> if he tries to test a text file which contains non-English text, he
> will fail!!
> because non-English chars are > 127
sorry man, you are right
I found that non-English chars are represented with a negative sign,
and a binary file is one whose bytes MAY BE > 127,
as a byte can hold 256 bit patterns

source:
http://www.cs.umd.edu/class/sum2003/.../asciiBin.html
Jul 5 '08 #6
Medvedev <3D********@gmail.com> writes:
> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
>> Brad <b...@16systems.com> writes:
>>> Is there a way to determine whether a file is plain ascii text or not
>>> using standard C++?
>>
>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.
>
> if he tries to test a text file which contains non-English text, he
> will fail!!
Exactly as it should.
> because non-English chars are > 127
In other words, they're not plain ASCII. :-)

sherm--

Jul 5 '08 #7
On Jul 5, 9:45 pm, Medvedev <3D.v.Wo...@gmail.com> wrote:
> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
>> Brad <b...@16systems.com> writes:
>>> Is there a way to determine whether a file is plain ascii text or not
>>> using standard C++?
>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.
> if he tries to test a text file which contains non-English
> text, he will fail!! because non-English chars are > 127
ASCII is a seven bit code, so no characters are greater than
127 in it.

Of course, just because you don't find any characters greater
than 127 doesn't mean that it is ASCII. It could still be ISO
8859-1, or UTF-8, in which, by chance, none of the characters
happen to be greater than 127. (Or it could be that plain char
is signed on your machine, in which case, it can't contain a
value greater than 127, regardless of the encoding:-).)
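The signed-char point can be illustrated with a small sketch. It assumes the data has already been read into a std::string; count_high_bytes is a hypothetical helper name, and the range-based for loop needs C++11 or later:

```cpp
#include <cstddef>
#include <string>

// Counts bytes with the high bit set. The cast matters: where plain char
// is signed, a byte such as 0xE9 is held as a negative value, so the
// comparison `c > 127` on a plain char would never be true.
std::size_t count_high_bytes(const std::string& data)
{
    std::size_t n = 0;
    for (char c : data) {
        if (static_cast<unsigned char>(c) > 127) {
            ++n;
        }
    }
    return n;
}
```

Converting each byte to unsigned char before comparing gives the same answer whether plain char is signed or unsigned on the platform.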

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Jul 5 '08 #8
Stefan Ram wrote:
> Brad <br**@16systems.com> writes:
>> Is there a way to determine whether a file is plain ascii text
>> or not using standard C++?
>
> If someone can define in words when a file is deemed to be
> »a plain ascii text« without ambiguity and for each possible
> file, I am sure that then this newsgroup will be able to
> help to implement a test for it in C++.
> ...
Thanks for all the responses. The program recurses through a directory
processing files. I do not know beforehand what type of files the
program may encounter. The processing is simply reading the file and
passing its content to a regular expression to search for certain strings.

Binary files cause problems, so I thought if I could just skip them and
only read ASCII and perhaps UTF-8 encoded files, things would be better.
That led to my initial question. Later I could learn how to deal with
binary files that I may want to search like PDF and MS Office documents.
Just curious if standard C++ had some built-in function that made this easy.

Thanks again,

Brad
Jul 6 '08 #9
Sam
Brad writes:
> That led to my initial question. Later I could learn how to deal with
> binary files that I may want to search like PDF and MS Office documents.
> Just curious if standard C++ had some built-in function that made this easy.
No. The only 'built-in' function of any kind is one to test if a single
character belongs in a given character class: isascii() and its equivalents.
It's up to you to scan the entire contents of the file, to classify it.

In POSIX, you might be able to get away with opening a file, stat()ing its
contents, to get the file's size, mmap-ing the file into memory, then using
std::find_if() to search for non-ascii bytes. Of course, if you hit a 4gb
file, that might cause ...problems.
Jul 6 '08 #10
On 2008-07-06 02:48, Brad wrote:
> Stefan Ram wrote:
>> Brad <br**@16systems.com> writes:
>>> Is there a way to determine whether a file is plain ascii text
>>> or not using standard C++?
>>
>> If someone can define in words when a file is deemed to be
>> »a plain ascii text« without ambiguity and for each possible
>> file, I am sure that then this newsgroup will be able to
>> help to implement a test for it in C++.
>> ...
>
> Thanks for all the responses. The program recurses through a directory
> processing files. I do not know beforehand what type of files the
> program may encounter. The processing is simply reading the file and
> passing its content to a regular expression to search for certain strings.
>
> Binary files cause problems, so I thought if I could just skip them and
> only read ASCII and perhaps UTF-8 encoded files, things would be better.
> That led to my initial question. Later I could learn how to deal with
> binary files that I may want to search like PDF and MS Office documents.
> Just curious if standard C++ had some built-in function that made this easy.
The simplest way to solve your problem is probably to impose some
additional constraints, such as requiring that text files have a name
ending with ".txt" or that you only guarantee correct operation if no
non-ASCII files are in the directory.

If you are running on a POSIX system you can also use the 'file' program
which tries to figure out what kind of contents a file has.

--
Erik Wikström
Jul 6 '08 #11
On Jul 6, 3:52 am, Sam <s...@email-scan.com> wrote:
> Brad writes:
>> That led to my initial question. Later I could learn how to
>> deal with binary files that I may want to search like PDF
>> and MS Office documents. Just curious if standard C++ had
>> some built-in function that made this easy.
> No. The only 'built-in' function of any kind is one to test if
> a single character belongs in a given character class:
> isascii() and its equivalents. It's up to you to scan the
> entire contents of the file, to classify it.
There is no isascii function, and the other isxxx functions are
locale dependent (and don't really work for narrow characters
anyway). There are heuristics for "guessing" the type of
contents of a file, but they're just that, heuristics, and none
are 100% certain.

Most systems have various conventions which may reveal the type,
but those are also just conventions, and individual files may
actually violate them: you can give a text file a name ending
with .exe under Windows, and there's nothing to prevent a binary
file from starting with something that looks like
"<!DOCTYPE..." on any system.
> In POSIX, you might be able to get away with opening a file,
> stat()ing its contents, to get the file's size, mmap-ing the
> file into memory, then using std::find_if() to search for
> non-ascii bytes. Of course, if you hit a 4gb file, that might
> cause ...problems.
Under most Unix systems, you'd probably read the first N bytes
(maybe 512, although that's a lot more than would typically be
necessary), and then exploit magic. For that matter,
*generally*, reading the first 512 bytes, then looking for
characters outside the set 0x07-0x0D and 0x20-0x7E, is probably
a pretty good heuristic; the probability of your guessing wrong
is pretty slim (but of course, it will treat non-ascii text
files as binary).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Jul 6 '08 #12
On Jul 6, 11:18 am, Erik Wikström <Erik-wikst...@telia.com> wrote:
> On 2008-07-06 02:48, Brad wrote:
> If you are running on a POSIX system you can also use the
> 'file' program which tries to figure out what kind of contents
> a file has.
Note that the information output by file is not guaranteed to be
correct (except in specific cases: the file doesn't exist, isn't
a regular file, or is empty). (On the other hand, it also works
under Windows, if you've installed it correctly.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Jul 6 '08 #13
Sherman Pendley wrote:
> Sure, just read its contents and look for any byte that's > 127. If
> you find one, the file's contents are not plain ASCII.
Actually there are certain characters with values < 32 which can be a
sign of a non-ASCII file if present, 0 being the most prominent one.
Jul 6 '08 #14
On Jul 6, 4:58 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
> Sherman Pendley wrote:
>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.
> Actually there are certain characters with values < 32 which
> can be a sign of a non-ASCII file if present, 0 being the most
> prominent one.
Technically, 0 is the encoding of the character nul in ASCII.
ASCII defines "characters" for all encodings in the range 0-127.

Practically, I don't think he really means ASCII per se, but
rather text encoded using ASCII. Or rather files that can be
interpreted as such---it's been years since I've seen a file
encoded as "ASCII" (but a lot of files created as ISO 8859-1 or
UTF-8 can probably be read as ASCII, if the file only contains
characters from the basic character set).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Jul 6 '08 #15
James Kanze wrote:
> (but a lot of files created as ISO 8859-1 or
> UTF-8 can probably be read as ASCII, if the file only contains
> characters from the basic character set).
UTF-8 has been specifically designed so that if the highest bit of any
byte is set, you know you can't interpret that character as a simple
ASCII one, so in this case the check is rather easy.
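Since the original goal was to accept ASCII and perhaps UTF-8 files, the same bit layout can be used for a rough structural check. This is only a sketch: is_structurally_utf8 is a hypothetical name, the data is assumed to be in a std::string already, and the check looks at lead and continuation byte patterns without rejecting every overlong form or UTF-16 surrogate:

```cpp
#include <cstddef>
#include <string>

// Structural check only: every byte must be plain ASCII, or a valid lead
// byte followed by the right number of continuation bytes (10xxxxxx).
bool is_structurally_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (b < 0x80)                    extra = 0;  // plain ASCII byte
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;          // stray continuation or invalid lead

        if (s.size() - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;    // not 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

A pure ASCII file passes trivially, while ISO 8859-1 text with accented letters usually fails, because a lone high byte is rarely followed by a valid continuation byte.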
Jul 7 '08 #16
On Jul 7, 3:04 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
> James Kanze wrote:
>> (but a lot of files created as ISO 8859-1 or
>> UTF-8 can probably be read as ASCII, if the file only contains
>> characters from the basic character set).
> UTF-8 has been specifically designed so that if the highest
> bit of any byte is set, you know you can't interpret that
> character as a simple ASCII one, so in this case the check is
> rather easy.
The same is true of the ISO 8859 encodings. I don't know of any
machines still using ASCII, but most do use either one of the
ISO 8859 encodings, or UTF-8. And most of those that don't also
follow this rule. So as long as all of the characters in the
file are in the basic execution character set, as defined by the
standard, you can read it as if it were ASCII. There are a few
additional characters which don't cause problems either: $, or @
for example.

The problem with doing so, of course, is that whatever tool
generated the file might have inserted the word "naïve" (or
anything else with a special character: a true less than or
equals sign, or the section sign §, or the name of someone)
somewhere near the end, so even reading the first 512 bytes
won't reveal it.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Jul 7 '08 #17
