
How to determine the MIME type without the file name extension?

Hi all,
I had a filesystem crash, and when I recovered the data the files had
random names without extensions. I decided to write a script to
determine each file's extension and create a new file with that
extension.
---
Method 1:
# File extension utility.

import os
import mimetypes
import shutil

def main():
    for root, dirs, files in os.walk(r'C:\Senthil\test'):
        for each in files:
            fname = os.path.join(root, each)
            print(fname)
            # guess_type() looks only at the name, so it returns
            # (None, None) when the file has no extension.
            mtype, entype = mimetypes.guess_type(fname)
            if mtype is None:
                continue
            fext = mimetypes.guess_extension(mtype)
            if fext is not None:
                newname = fname + fext
                print(newname)
                try:
                    shutil.copyfile(fname, newname)
                except (IOError, os.error) as why:
                    print("Can't copy %s to %s: %s" % (fname, newname, why))

if __name__ == "__main__":
    main()

----
The problem I faced with this script is that if the file name has no
extension, mimetypes.guess_type(filename) fails!
How do I get around this problem?
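
For context: mimetypes.guess_type only inspects the name and never opens the file, which is why it has nothing to say about a name without an extension. Below is a minimal sketch of detection by content instead, assuming Python 3. The SIGNATURES table and the sniff_extension name are illustrative and not part of any library, and the table is nowhere near complete.

```python
import mimetypes

# Illustrative table (an assumption, far from complete):
# leading bytes of a file -> extension to append.
SIGNATURES = (
    (b"%PDF-", ".pdf"),
    (b"{\\rtf", ".rtf"),
    (b"PK\x03\x04", ".zip"),
)

def sniff_extension(path):
    """Return an extension guessed from the first bytes of the file, or None."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, ext in SIGNATURES:
        if head.startswith(magic):
            return ext
    return None

# guess_type looks only at the name, so an extensionless name yields nothing.
print(mimetypes.guess_type("a1b2c3"))  # (None, None)
```

A table like this only covers formats with a fixed prefix; the file command used in Method 2 applies a much larger database of the same kind.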

As it was a Linux box, I tried using the file command to get the work done.
----
Method 2:

import os
import re
import shutil
import subprocess

def detext(filename):
    # Run file(1) on the name; an argument list avoids shell quoting
    # problems with odd file names (os.popen3 is gone in Python 3).
    proc = subprocess.Popen(['file', filename], stdout=subprocess.PIPE,
                            universal_newlines=True)
    fileoutput = proc.communicate()[0]
    rtf = re.compile('Rich Text Format data')
    doc = re.compile('Microsoft Office Document')
    pdf = re.compile('PDF')

    if rtf.search(fileoutput) is not None:
        shutil.copyfile(filename, filename + '.rtf')

    if doc.search(fileoutput) is not None:
        shutil.copyfile(filename, filename + '.doc')

    if pdf.search(fileoutput) is not None:
        shutil.copyfile(filename, filename + '.pdf')

def main():
    for root, dirs, files in os.walk(os.getcwd()):
        for each in files:
            fname = os.path.join(root, each)
            detext(fname)

if __name__ == '__main__':
    main()

----
But the problem with using file was that it recognized both .xls (MS Excel)
and .doc (MS Word) files as Microsoft Word Document only. I need to separate
the .xls and .doc files, and I don't know if file will be helpful here.
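
One possible way to tell the two apart, sketched here under assumptions: legacy .doc and .xls files share the same OLE2 container format, which is why they look alike at the start of the file, but the container's directory holds a stream named "WordDocument" for Word and "Workbook" for Excel 97 and later, stored as UTF-16LE. The ole_extension name and the MARKERS table are illustrative. Searching the raw bytes is a crude heuristic (an embedded object or matching text inside the document can fool it) and is not a real parser of the format.

```python
# Signature at the start of every OLE2 compound file.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Stream name to look for -> extension. Word is checked first because a
# Word file with an embedded spreadsheet can also contain "Workbook".
MARKERS = (
    ("WordDocument", ".doc"),
    ("Workbook", ".xls"),
    ("PowerPoint Document", ".ppt"),
)

def ole_extension(path):
    """Guess .doc/.xls/.ppt for an OLE2 file; return None for anything else."""
    with open(path, "rb") as handle:
        data = handle.read()  # whole file in memory: fine for a sketch
    if not data.startswith(OLE_MAGIC):
        return None
    for name, ext in MARKERS:
        # Directory entry names are stored as UTF-16LE.
        if name.encode("utf-16-le") in data:
            return ext
    return None
```

Running this on a handful of files whose type is already known would be a sensible check before trusting it on the recovered data.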

--
If the first approach of mimetypes works, it would be great!
Has anyone faced this problem? How did you solve it?

thanks,
Senthil

http://phoe6.livejournal.com

Jul 18 '06 #1
3 Replies


Phoe6 wrote:
> Hi all,
> I had a filesystem crash and when I retrieved the data back
> the files had random names without extension. I decided to write a
> script to determine the file extension and create a new file with
> extension.
> [...]
> but the problem with using file was it recognized both .xls (MS Excel)
> and .doc (MS Word) as Microsoft Word Document only. I need to separate
> the .xls and .doc files, I don't know if file will be helpful here.

You may want to try the gnome.vfs module:

info = gnome.vfs.get_file_info(filename,
                               gnome.vfs.FILE_INFO_GET_MIME_TYPE)
info.mime_type  # mime type

If all of your documents are .xls and .doc, you could also use one of
the CLI tools that convert .doc to text, like catdoc. These tools will
fail on an .xls document, so you can run one and check its output: .doc
files would output a lot, .xls files would output an error or nothing.
The gnome.vfs module is probably your best bet though :-)
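
The catdoc check described above could be sketched roughly as follows, assuming Python 3.5 or later and that catdoc is installed and on the PATH. The converts_to_text name and its command parameter are illustrative. Whether catdoc really stays silent or fails on a given .xls file is the claim made above, so it is worth trying on a few known files first.

```python
import subprocess

def converts_to_text(filename, command=("catdoc",)):
    """Return True if the converter exits cleanly and prints something."""
    try:
        proc = subprocess.run(
            list(command) + [filename],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # converter not installed or not runnable
    return proc.returncode == 0 and bool(proc.stdout.strip())
```

A file for which this returns True would then be treated as .doc, and the remaining Office files as candidates for .xls.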

Additionally, I would re-organize your program a bit. Something like:

import os
import subprocess

# (extension, text expected somewhere in the output of file)
types = (
    ('rtf', 'Rich Text Format data'),
    ('doc', 'Microsoft Office Document'),
    ('pdf', 'PDF'),
    ('txt', 'ASCII English text'),
)

def get_magic(filename):
    # universal_newlines=True gives text rather than bytes on Python 3
    pipe = subprocess.Popen(['file', filename], stdout=subprocess.PIPE,
                            universal_newlines=True)
    output = pipe.stdout.read()
    pipe.wait()
    return output

def detext(filename):
    fileoutput = get_magic(filename)
    for ext, pattern in types:
        if pattern in fileoutput:
            return ext
    return None

def allfiles(path):
    # walk the path that was passed in, not always the current directory
    for root, dirs, files in os.walk(path):
        for each in files:
            fname = os.path.join(root, each)
            yield fname

def fixnames(path):
    for fname in allfiles(path):
        extension = detext(fname)
        print("%s %s" % (fname, extension))  # ....

def main():
    path = os.getcwd()
    fixnames(path)

if __name__ == '__main__':
    main()

Short functions that just do one thing are always best.

To change that to use gnome.vfs, just change the types list to be a
dictionary like
types = {
    'application/msword': 'doc',
    'application/vnd.ms-powerpoint': 'ppt',
}

and then

def get_mime(filename):
    info = gnome.vfs.get_file_info(filename,
                                   gnome.vfs.FILE_INFO_GET_MIME_TYPE)
    return info.mime_type

def detext(filename):
    mime_type = get_mime(filename)
    return types.get(mime_type)
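
Where gnome.vfs is not available, the same dictionary lookup could be driven by the file command instead. This is only a sketch under assumptions: Python 3.5 or later and a version of file that supports the --brief and --mime-type options. The names MIME_TO_EXT and mime_from_file_command are illustrative, and the extra dictionary entries are additions for the example. Whether the installed file separates Excel from Word this way depends on its magic database.

```python
import subprocess

# MIME type -> extension; extend as needed.
MIME_TO_EXT = {
    "application/msword": "doc",
    "application/vnd.ms-powerpoint": "ppt",
    "application/vnd.ms-excel": "xls",
    "application/pdf": "pdf",
}

def mime_from_file_command(filename):
    """Ask the file command for the bare MIME type of filename."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", filename],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()

def detext(filename, get_mime=mime_from_file_command):
    """Map the detected MIME type to an extension, or None if unknown."""
    return MIME_TO_EXT.get(get_mime(filename))
```

Passing the MIME lookup in as get_mime keeps detext independent of the tool used, so another detector can be swapped in without touching the mapping.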

--
- Justin

Jul 18 '06 #2

Justin Azoff wrote:
> Additionally, I would re-organize your program a bit. Something like:

Thanks Justin, that was helpful. It is helping me learn Python
programming.

Thanks,
Senthil

Jul 18 '06 #3
