Check to see if file contents are text

I am trying my hand at indexing a folder and then searching its contents. I am using Lupy as a base because of its simplicity.

The problem is that it indexes PNG, JPEG, and other binary files as if they were text documents, which slows everything down. How do I check whether the contents of a file are actually text? Simply filtering on the .txt extension won't work, since I also want to index HTML and other documents.
Jul 27 '07 #1
bartonc
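I came up with a solution like this for checking whether a password was encrypted. It just checks whether the data contains any control or non-ASCII bytes, so it will also flag UTF-8 or Latin-1 text that uses accented characters. You'd need to open each file in binary mode, grab the data (or a chunk of it), then:

    import re
    # Control bytes other than tab, LF and CR, plus DEL and everything above 0x7f.
    ReCheck = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]")
    # data is a bytes object read from the file; search() scans the whole chunk.
    if ReCheck.search(data):
        print("Has non-ascii content")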
Could work..
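
For the folder-indexing case, here is a rough sketch of the same idea applied per file. It is only a heuristic, not anything Lupy provides: the names `looks_like_text` and `iter_text_files`, the 4096-byte sample size, and the 5% threshold are all assumptions picked for illustration. It treats a NUL byte as a sure sign of binary data and otherwise tolerates a small share of control bytes, while letting bytes above 0x7f through so UTF-8 or Latin-1 text is not rejected.

```python
import os

# Control bytes that plain text rarely contains (tab, LF, FF and CR are allowed),
# plus DEL. Bytes above 0x7f are deliberately not listed here.
_SUSPECT = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0C, 0x0D)) + b"\x7f"


def looks_like_text(path, chunk_size=4096, max_suspect_ratio=0.05):
    """Guess whether a file is text by sampling its first chunk_size bytes."""
    with open(path, "rb") as f:  # binary mode: no decoding, no newline translation
        chunk = f.read(chunk_size)
    if not chunk:
        return True  # assumption: an empty file is harmless to index
    if b"\x00" in chunk:
        return False  # NUL bytes almost never occur in 8-bit text encodings
    # Count suspect bytes by deleting them and comparing lengths.
    suspect = len(chunk) - len(chunk.translate(None, _SUSPECT))
    return suspect / len(chunk) <= max_suspect_ratio


def iter_text_files(root):
    """Yield paths under root that pass the heuristic above."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if looks_like_text(path):
                yield path
```

You would then feed only the paths from `iter_text_files("some/folder")` to the indexer. One known gap: UTF-16 text contains NUL bytes, so this sketch would skip it as binary.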
Jul 27 '07 #2
