Determining when a file is an Open Office Document

Silly question, but here goes... what's a good way to determine when a
file is an Open Office document? I could look at the file extension, but
it seems there would be a better way. VI shows this info in the files:

mimetypeapplication/vnd.oasis.opendocument.textPK
mimetypeapplication/vnd.oasis.opendocument.presentationPK
etc.

Not really a Python specific question but, how do you guys do this sort
of thing? I've figured out how to break out the content.xml file in the
new OOo XML format, and do re searching and matching on that, now I just
need a fast, reliable way to determine when I need to do that versus
just reading the file.

Thanks,
Tubby
Jan 19 '07 #1
tubby <tu***@bandaheart.com> writes:
> Silly question, but here goes... what's a good way to determine when
> a file is an Open Office document? I could look at the file
> extension, but it seems there would be a better way.

Yes, the name of a file may be useful for communicating with humans
about that file's intended use, but is a lousy, unreliable way to make
a definite statement about the actual contents of the file.

The Unix 'file' command determines the type of a file by its contents,
not its name. This functionality is essentially a database of "magic"
byte patterns mapping to file types, and is provided by a library
called "libmagic", distributed with most GNU/Linux distributions.

<URL:http://packages.debian.org/testing/source/file>

There is a Python interface to the "magic" functionality. It's in
Debian; I'm not sure if it's part of the "magic" code base, or written
separately to interface with it. Either way, you can get the source
for those packages and find out more.

<URL:http://packages.debian.org/unstable/python/python-magic>
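
To make the "database of magic byte patterns" idea concrete, here is a minimal pure-Python sketch. It does not use libmagic or python-magic; the table `MAGIC_SIGNATURES` and the function names are hypothetical, and the handful of signatures is hand-picked for illustration (the real database is far larger and supports much richer tests).

```python
# Hypothetical, hand-picked signature table: (offset, byte pattern, label).
# This only illustrates the lookup idea; it is not libmagic's database.
MAGIC_SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"PK\x03\x04", "ZIP archive"),
]


def guess_type(data: bytes) -> str:
    """Return a best-guess label for the leading bytes of a file."""
    for offset, pattern, label in MAGIC_SIGNATURES:
        if data[offset:offset + len(pattern)] == pattern:
            return label
    return "unknown"


def guess_file_type(path: str) -> str:
    """Read just enough of the file to test every signature in the table."""
    with open(path, "rb") as handle:
        return guess_type(handle.read(16))
```

Note that a match only says the file starts with the expected bytes; it is a guess about the type, not proof of it.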

--
\ "If life deals you lemons, why not go kill someone with the |
`\ lemons (maybe by shoving them down his throat)." -- Jack Handey |
_o__) |
Ben Finney

Jan 19 '07 #2
tubby wrote:
> Silly question, but here goes... what's a good way to determine when a
> file is an Open Office document? I could look at the file extension, but
> it seems there would be a better way. VI shows this info in the files:
>
> mimetypeapplication/vnd.oasis.opendocument.textPK

It's a ZIP archive. The info you've found is the file name
"mimetype", the uncompressed contents of that file
"application/vnd.oasis.opendocument.text", and part of the ZIP magic
number "PK". You should be able to use the "zipfile" module to check
to see if the file is a ZIP file, if it has a member named "mimetype", and
if the contents of that member match one of the OpenOffice MIME types.
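
A minimal sketch of the three checks described above, using only the standard "zipfile" module. The function name and the `ODF_MIME_PREFIX` constant are hypothetical, and the sketch assumes that every MIME type of interest starts with "application/vnd.oasis.opendocument." (older formats with other MIME types would need their own entries).

```python
import zipfile

# Assumption: all wanted types share this prefix, as in the examples quoted
# above (…opendocument.text, …opendocument.presentation).
ODF_MIME_PREFIX = "application/vnd.oasis.opendocument."


def opendocument_mimetype(path):
    """Return the declared OpenDocument MIME type, or None if there is none."""
    # Check 1: is it a ZIP archive at all?
    if not zipfile.is_zipfile(path):
        return None
    try:
        with zipfile.ZipFile(path) as archive:
            # Check 2: does it have a member named "mimetype"?
            declared = archive.read("mimetype").decode("ascii", "replace").strip()
    except (KeyError, zipfile.BadZipFile):
        # KeyError: no such member. BadZipFile: the archive is damaged.
        return None
    # Check 3: does the member's content look like an OpenDocument type?
    return declared if declared.startswith(ODF_MIME_PREFIX) else None
```

For example, `opendocument_mimetype("report.odt")` would return the declared type for a genuine text document and `None` for an ordinary ZIP file or a plain text file.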

Ross Ridge

Jan 19 '07 #3
Ross Ridge wrote:
> tubby wrote:
>> Silly question, but here goes... what's a good way to determine when a
>> file is an Open Office document? I could look at the file extension, but
>> it seems there would be a better way. VI shows this info in the files:
>>
>> mimetypeapplication/vnd.oasis.opendocument.textPK
>
> It's a ZIP archive.

Thanks, I used this approach:

import zipfile
if zipfile.is_zipfile(filename):
    ...

Now, if only I could do something like that on PDF files :)
Jan 19 '07 #5
On Fri, 19 Jan 2007 12:22:04 +1100, Ben Finney wrote:
> tubby <tu***@bandaheart.com> writes:
>> Silly question, but here goes... what's a good way to determine when
>> a file is an Open Office document? I could look at the file
>> extension, but it seems there would be a better way.
>
> Yes, the name of a file may be useful for communicating with humans
> about that file's intended use, but is a lousy, unreliable way to make
> a definite statement about the actual contents of the file.
>
> The Unix 'file' command determines the type of a file by its contents,
> not its name. This functionality is essentially a database of "magic"
> byte patterns mapping to file types,

Ah, another lousy, unreliable way to make a definite statement about the
actual contents of a file. Looking at magic bytes inside a file is hardly
bullet-proof (although file seems to be moderately reliable in practice,
at least under Linux).

Simple example: is the file consisting of the two bytes "\x09\x0A" meant to
be a text file with a tab and a newline, or a binary file consisting of a
single two-byte int? There's no way to tell just from the contents.
It's a circular problem: to be sure what the file is ("it's a two-byte
int") one has to understand the contents ("the integer 2314", if read
big-endian) -- but you can only understand the contents if you know what
the file is.
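
The ambiguity is easy to reproduce with the standard "struct" module. This is only an illustration; the variable names are made up, and the three readings shown are not the only possible ones.

```python
import struct

data = b"\x09\x0a"  # the two bytes from the example above

# Reading 1: ASCII text, i.e. a tab followed by a newline.
as_text = data.decode("ascii")

# Reading 2: one unsigned 16-bit integer, big-endian (0x090A).
(as_big_endian,) = struct.unpack(">H", data)

# Reading 3: the same integer type, little-endian (0x0A09).
(as_little_endian,) = struct.unpack("<H", data)

print(repr(as_text), as_big_endian, as_little_endian)
# prints: '\t\n' 2314 2569
```

Nothing in the bytes themselves says which reading was intended; that knowledge has to come from outside the file.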

There are only two ways out of this vicious circle:

(1) Have the creator of the file unambiguously label it. Some file systems
associate file-type metadata to files (e.g. Classic Apple Macintosh did
that), but sadly the main file systems in use today do not.

(2) Make an educated guess from various heuristics and conventions. The
old DOS 8.3 naming system is one such convention, and modern operating
systems tend to follow it. The Unix "file" utility's database of magic
bytes is such a heuristic.
--
Steven.

Jan 19 '07 #6
tubby wrote:
> Now, if only I could do something like that on PDF files :)

PDF files should begin with "%PDF-" followed by a version number, e.g.
"%PDF-1.4". The PDF Reference notes that Adobe Acrobat Reader is a bit
more flexible about what it will accept:

13. Acrobat viewers require only that the header appear
somewhere within the first 1024 bytes of the file.
14. Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m

So identifying PDF files is pretty easy. If you want to examine the
contents of a PDF file you're better off using Postscript, Ghostscript
specifically, since PDF is essentially Postscript with a special
dictionary of commands.
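
A minimal sketch of the header check described above. The names `pdf_version` and `strict` are hypothetical. With `strict=True` the header must be the very first thing in the data; with `strict=False` the sketch follows the two Acrobat notes quoted above and accepts either header form anywhere in the first 1024 bytes. It only reports what the header claims and does not validate the rest of the file.

```python
import re

# "%PDF-M.m", e.g. "%PDF-1.4"
PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")
# Alternative form accepted by Acrobat viewers: "%!PS-Adobe-N.n PDF-M.m"
PS_STYLE_HEADER = re.compile(rb"%!PS-Adobe-\d+\.\d+ PDF-(\d+\.\d+)")


def pdf_version(data: bytes, strict: bool = True):
    """Return the PDF version claimed by the header, or None if none is found."""
    if strict:
        match = PDF_HEADER.match(data)
    else:
        head = data[:1024]
        match = PDF_HEADER.search(head) or PS_STYLE_HEADER.search(head)
    return match.group(1).decode("ascii") if match else None


def pdf_version_of_file(path: str, strict: bool = True):
    """Convenience wrapper that reads only the first 1024 bytes of a file."""
    with open(path, "rb") as handle:
        return pdf_version(handle.read(1024), strict=strict)
```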

Ross Ridge

Jan 19 '07 #7
On Fri, 19 Jan 2007, Steven D'Aprano wrote:
> On Fri, 19 Jan 2007 12:22:04 +1100, Ben Finney wrote:
>> tubby <tu***@bandaheart.com> writes:
>>> Silly question, but here goes... what's a good way to determine
>>> when a file is an Open Office document? I could look at the file
>>> extension, but it seems there would be a better way.
>> <snip>
>> The Unix 'file' command determines the type of a file by its
>> contents, not its name. This functionality is essentially a
>> database of "magic" byte patterns mapping to file types,
>
> Ah, another lousy, unreliable way to make a definite statement about
> the actual contents of a file. Looking at magic bytes inside a file
> is hardly bullet-proof (although file seems to be moderately
> reliable in practice, at least under Linux).
>
> Simple example: is the file consisting of two bytes "\x09\x0A" meant
> to be a text file with a tab and a newline, or a binary file
> consisting of a single two-byte int? There's no way to tell just
> from the contents.

And see, for example, the problem that development versions of Emacs are
(were?) having with C files that started with #define and were then
treated as graphics files!

http://thread.gmane.org/gmane.emacs....23/focus=65228
Robert
--
La grenouille songe..dans son château d'eau
Links and things http://rmstar.blogspot.com/
Jan 19 '07 #8
On Fri, 19 Jan 2007 12:48:14 -0800, Ross Ridge wrote:
> tubby wrote:
>> Now, if only I could do something like that on PDF files :)
>
> PDF files should begin with "%PDF-" followed by a version number, e.g.
> "%PDF-1.4". The PDF Reference notes that Adobe Acrobat Reader is a bit
> more flexible about what it will accept:
>
> 13. Acrobat viewers require only that the header appear
> somewhere within the first 1024 bytes of the file.
> 14. Acrobat viewers also accept a header of the form
> %!PS-Adobe-N.n PDF-M.m
>
> So identifying PDF files is pretty easy.

Sure. MIS-identifying PDF files is pretty easy. Identifying them is not.
Consider this example:

$ cat not_a_pdf
%PDF-1.4
This is not a pdf file.
$ file not_a_pdf
not_a_pdf: PDF document, version 1.4

Is there a security vulnerability buried in the detection of file types by
magic bytes? I don't know, but I wouldn't be surprised if there were.

Here's another example:

$ cat not_a_gif.txt
GIF89a is the header used to define a GIF file.
$ file not_a_gif.txt
not_a_gif.txt: GIF image data, version 89a, 26912 x 8307
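
The odd dimensions in that output are not random: a GIF header is a 6-byte signature followed by the width and height as little-endian 16-bit integers, so the text right after "GIF89a" gets read as numbers. A small sketch with the standard "struct" module, assuming the file holds exactly the line shown by cat:

```python
import struct

# Assumed contents of not_a_gif.txt, as shown by "cat" above.
data = b"GIF89a is the header used to define a GIF file.\n"

# 6-byte signature, then little-endian unsigned 16-bit width and height.
signature, width, height = struct.unpack("<6sHH", data[:10])

print(signature, width, height)
# prints: b'GIF89a' 26912 8307
```

The bytes " i" (0x20, 0x69) become the width 0x6920 = 26912, and "s " (0x73, 0x20) become the height 0x2073 = 8307.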

Any file system that doesn't have file type metadata is reduced to
guessing the type of the file, and guesses can be wrong. As heuristics go,
"look at the characters after the dot in the file name" is not that much
worse than "look at the bytes at offset X through Y inside the file", and
has the significant advantage that it is visible and easy to change for
the end user.
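
For comparison, the "characters after the dot" heuristic is available in the standard library as the "mimetypes" module. The wrapper name below is hypothetical; the point of the sketch is that the file is never opened, so the answer depends on the name alone.

```python
import mimetypes


def guess_from_name(filename):
    """Guess a MIME type from the file name only; the file is never read."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# The name decides the answer, whatever the file actually contains.
print(guess_from_name("not_a_pdf.pdf"))
print(guess_from_name("no_extension"))
```

Renaming the file changes the guess, which is exactly the visibility and ease of change described above, and also exactly why the guess can be wrong.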

--
Steven.

Jan 20 '07 #9
Ross Ridge wrote:
> So identifying PDF files is pretty easy.

Steven D'Aprano wrote:
> Sure. MIS-identifying PDF files is pretty easy. Identifying them is not.
> Consider this example:

Your contrived example doesn't show how a PDF file would be
misidentified; it only shows how a file deliberately made to look like
a PDF file would be "misidentified". Since that was the intent of
crafting such a file, I don't see the problem.

> Is there a security vulnerability buried in the detection of file types by
> magic bytes? I don't know, but I wouldn't be surprised if there were.

There's only a security vulnerability if you choose to trust a file
based on its assumed file type. Since PDF files generally aren't
trusted, it's not likely to be an issue for whatever application tubby
has in mind.

> Any file system that doesn't have file type metadata is reduced to
> guessing the type of the file, and guesses can be wrong.

File type metadata can also be wrong. You can give any file a .PDF
extension and Windows will believe it's a PDF file. On Mac OS, if a file
has the signature "CARO"/"PDF ", the system will believe it's a PDF file
regardless of its contents. Metadata doesn't make programs any less
vulnerable to deliberate attempts to fool them.

Ross Ridge

Jan 20 '07 #10
