Identifying File type by reading files

This is not really a Python-centric question; however, I am using
Python to solve this problem (as of now), so I thought it appropriate to
pose the question here.

I have some functions that search for files that contain certain
strings, and if the files found to have these strings do not already
have a filename extension (such as '.doc' or '.xls'), the function will
append one to the files and rename them. So, if a file named 'report'
was found to have the string 'Microsoft' and the string
'Word.Document.' (notice the '.' at the end of both words) and it does
not already have an extension, then a rename would take place that
would name the file 'report.doc'.
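
The rename step described above could look roughly like this in current Python 3. It is only a sketch, not the original functions: the helper name `add_extension_if_missing` is hypothetical, and refusing to overwrite an existing file is an assumption.

```python
import os

def add_extension_if_missing(path, ext):
    """Rename `path` to `path + ext` only when it has no extension yet.

    Returns the resulting path (unchanged if nothing was renamed).
    Hypothetical helper, for illustration only.
    """
    _, current_ext = os.path.splitext(path)
    if current_ext:
        return path              # already has an extension; leave it alone
    new_path = path + ext
    if os.path.exists(new_path):
        return path              # assumption: never overwrite an existing file
    os.rename(path, new_path)
    return new_path
```

Called as `add_extension_if_missing('report', '.doc')`, it renames 'report' to 'report.doc' and leaves a file that already has an extension untouched.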

These functions work very well on most files (98% guessed correctly).
However, I would like the functions to be more precise (100%). So,
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
wonder that there must be a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .docs, .ppts are
MS documents to it.

Are there certain sets of binary data that are unique to files that
would be a better way of identifying them? For example, on the N line
of a MS doc file beginning at position X a binary string that is L
digits in length that begins with B and ends with E will *ALWAYS* be
present... someone tell me that I'm not dreaming and that something
like the above example exists???
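
Fixed byte sequences of this kind do exist for many formats and are usually called magic numbers or file signatures. The sketch below checks a few well-known ones at the start of the data. The table is illustrative and far from complete, and the function name `guess_extension` is hypothetical. Legacy .doc, .xls and .ppt files all share the same OLE2 container signature, so the sketch falls back to searching for the name of the main stream inside the container. That search is a heuristic, not a real parse of the container, and it can be fooled (for example by a Word file with an embedded spreadsheet).

```python
# Bytes found at the very start of a file, with the extension they suggest.
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\xff\xd8\xff", ".jpg"),
]

# Signature shared by legacy MS Office files (OLE2 compound documents).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Main stream names inside an OLE2 container. The container stores
# names as UTF-16LE, so they are encoded that way before searching.
OLE2_STREAMS = [
    ("WordDocument".encode("utf-16-le"), ".doc"),
    ("Workbook".encode("utf-16-le"), ".xls"),
    ("PowerPoint Document".encode("utf-16-le"), ".ppt"),
]

def guess_extension(data):
    """Return a likely extension for the given file contents, or None."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    if data.startswith(OLE2_MAGIC):
        # Heuristic: look for a stream name anywhere in the raw bytes.
        for name, ext in OLE2_STREAMS:
            if name in data:
                return ext
    return None
```

Checking the start of the file first keeps the test cheap, and a stray 'JFIF' or 'PDF-1.' in the middle of some other file no longer counts as a match.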

A few of my string searches today:

doc = string.find(file(os.path.join(root,fname), 'rb').read(),
'Word.Document.')
xls = string.find(file(os.path.join(root,fname), 'rb').read(),
'Excel.Sheet.')
pdf = string.find(file(os.path.join(root,fname), 'rb').read(),
'PDF-1.')
jpg = string.find(file(os.path.join(root,fname), 'rb').read(), 'JFIF')
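
For comparison, here is a sketch of the same searches in current Python 3, where `string.find` and `file()` no longer exist. It reads each file once rather than once per marker and returns every extension whose marker occurs. The names `MARKERS` and `candidate_extensions` are hypothetical; the marker strings are the ones listed above.

```python
# Marker strings from the searches above, as bytes because the file is
# opened in binary mode.
MARKERS = {
    ".doc": b"Word.Document.",
    ".xls": b"Excel.Sheet.",
    ".pdf": b"PDF-1.",
    ".jpg": b"JFIF",
}

def candidate_extensions(path):
    """Return every extension whose marker string occurs in the file."""
    with open(path, "rb") as handle:
        data = handle.read()          # read once, search several times
    return [ext for ext, marker in MARKERS.items() if marker in data]
```

Using `in` also avoids having to compare the result of `find` against -1, which is what `find` returns when there is no match (0 means a match at the very start).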

Any suggestions or information that better describes how to positively
ID files w/o the possibility of mistake would be very helpful to me. As
of now, some of my files, though not many (~ 2%) will be given the
wrong extension, but the logic of the functions is such that they
append any extension that probably applies to the file so at that
point it is a simple process of elimination to determine which
extension is actually the correct one. Normally, I never have more
than 2 unique extensions attached to the same file.

Thank you!!!
Jul 18 '05 #1
hokiegal99:
what should I look for in a file to determine whether or not it is a
MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
list of some of the strings I use to ID files, but I can't help but
wonder that there must be a more precise way of doing this. I know of
the Unix 'file' command. It is not very useful for me as it doesn't
distinguish between MS Office documents... all .xls, .docs, .ppts are
MS documents to it.


That likely means you have an incomplete 'magic' file. This is the
file used by the 'file' command to figure out the file type. Take a
look at http://www.unixhideout.com/freebsd/share/misc/magic for
a more complete (I think) version.

That's dated 1995 and is close to the one on my Mac. It doesn't support
the newer MS Word and Excel formats. I'm having trouble
finding the most recent, definitive version. One link pointed me
to ftp://ftp.astron.com/pub/file/ but I haven't investigated it further.

There's also a pymagic, http://thomas.mangin.me.uk/software/python.html
which may help for a pure Python implementation of 'file'.
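
If the installed 'file' command and its magic file are good enough, it can also be called from Python directly. A minimal sketch, assuming a 'file' executable is on the PATH (the helper name `file_description` is hypothetical):

```python
import shutil
import subprocess

def file_description(path):
    """Return what `file --brief` prints for `path`, or None.

    None is returned when no `file` executable is found or when it
    prints nothing.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        errors="replace",   # tolerate odd bytes in the description
        check=False,
    )
    return result.stdout.strip() or None
```

How specific the returned description is depends entirely on the magic file that the local 'file' uses.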

Andrew
da***@dalkescientific.com
Jul 18 '05 #2
