Using PIL to find separator pages

I have a project that I'd like to solicit some advice
on from this group. I have millions of pages of scanned
documents, with each page in an individual .JPG file.
When the documents were scanned, the people who did
the scanning put a colored (hot pink) separator page
between the individual documents. Is there a way to
use PIL to scan through the individual files, look at
some small section of each page, and determine whether
it is a separator page by comparing its color to the
separator page color? I realize that this would be some
sort of percentage match, where 100% would be a perfect
match and any lower number would indicate that the page
is less likely to be a separator page.

Thanks in advance for any thoughts or advice.

Regards,
Larry Bates
May 31 '07 #1
I suspect the easiest way would be to select a few small patches of each
image, average the color values of the pixels in each patch, and then
convert the averages to hue rather than comparing raw RGB.

If the hue is close enough to the one you want (you could include
saturation and intensity too, if you felt like it) across several areas
of the page, that would be a hit for a separator.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

May 31 '07 #2
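
For readers following along, the patch-and-hue idea in the post above could be sketched roughly as follows. This is not code from the thread. The function names, the 15-degree hue tolerance, the 0.3 saturation floor, and the hot-pink reference value (255, 105, 180) are all assumptions chosen for illustration, and the patch boxes would need tuning against real scans.

```python
import colorsys


def rgb_to_hue_sat(rgb):
    """Convert an 8-bit (R, G, B) triple to (hue in degrees, saturation 0-1)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0, s


def hue_distance(h1, h2):
    """Smallest angle between two hues; hue wraps around at 360 degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def looks_like_separator(patch_means, target_rgb, hue_tol=15.0, min_sat=0.3):
    """True if every patch average is close in hue to the target color.

    The saturation floor rejects white or grey patches, whose hue is
    meaningless. Both thresholds are illustrative guesses.
    """
    target_hue, _ = rgb_to_hue_sat(target_rgb)
    for mean in patch_means:
        hue, sat = rgb_to_hue_sat(mean)
        if sat < min_sat or hue_distance(hue, target_hue) > hue_tol:
            return False
    return True


def read_patch_means(path, boxes):
    """Average RGB of each (left, upper, right, lower) box in an image file."""
    # Imported here so the pure helpers above can be used without PIL installed.
    from PIL import Image, ImageStat

    with Image.open(path) as im:
        rgb = im.convert("RGB")
        return [tuple(ImageStat.Stat(rgb.crop(box)).mean) for box in boxes]
```

A call might look like `looks_like_separator(read_patch_means("page0001.jpg", boxes), (255, 105, 180))`, where `page0001.jpg` and `boxes` are placeholders. Comparing hue rather than raw RGB should make the check less sensitive to scanner brightness, at the cost of needing the saturation guard.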
I used GraphicsMagick for a similar situation. Once installed, you can
run `gm identify` to return all sorts of useful information about the
images. In my case I had Python call `gm` to identify the number of
colors in each image, then inspect the output and handle the image
accordingly. I'll bet PIL could do a similar thing, but in my case I
was examining DPX files, which PIL can't handle. Either approach will
most likely take a bit of time unless you spread the work over several
machines.

~Sean

Jun 1 '07 #3
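
Calling `gm` from Python, as described in the post above, might look something like the sketch below. Sean did not post his command line, so the use of `-format "%k"` (GraphicsMagick's unique-color count) is an assumption about what he queried, and the helper names are made up for this example. It also assumes the `gm` executable is on the PATH.

```python
import subprocess


def gm_identify_command(path, fmt="%k"):
    """Build the argument list for `gm identify`.

    "%k" is assumed here to request the number of unique colors.
    """
    return ["gm", "identify", "-format", fmt, path]


def count_colors(path):
    """Run GraphicsMagick on one file and return the reported color count."""
    result = subprocess.run(
        gm_identify_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gm exits with an error
    )
    return int(result.stdout.strip())
```

Passing the arguments as a list avoids shell quoting problems with odd filenames. A color count alone will not pick out a pink page, so this would be a pre-filter at best.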
Steve,

I'm completely lost on how to proceed. I don't know how to average color
values, normalize to hue... Any guidance you could give would be greatly
appreciated.

Thanks in advance,
Larry
Jun 1 '07 #4
I'd like to help but I don't have any sample code to hand. Maybe someone
who does could give you more of a clue. Let's hope so, anyway ...

regards
Steve

Jun 1 '07 #6
I think I've come up with something that will work. I use PIL's
Image.getcolors() to get the colors and take the top 10 colors of my
background page. I then calculate the average of the R, G, B
components. That becomes my reference. Then I read a page and
make the same calculation. I then calculate the absolute value
of the difference in R, G, and B between the two averages. Summing
those together gives something like the average difference between
the two average colors (at least that is what I think it does).
This seems to give me small numbers when the pages are the same
and large numbers when they are different. It isn't super fast,
but it is working.

Thanks for pushing me in the right direction.

-Larry
Jun 2 '07 #7
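
Larry did not post his code, so the following is only a guess at what the approach in the post above could look like. It assumes the "top 10 colors" are the ten most frequent entries from `Image.getcolors()` and that their R, G, B values are averaged without weighting. The `match_percent` helper is an added illustration of the "percentage match" from the original question, scaled against the largest possible difference of 765 (3 x 255).

```python
def top_colors_average(color_counts, top_n=10):
    """Average the R, G, B of the top_n most frequent colors.

    color_counts is a list of (count, (r, g, b)) pairs, the shape that
    Image.getcolors() returns for an RGB image.
    """
    top = sorted(color_counts, key=lambda item: item[0], reverse=True)[:top_n]
    n = len(top)
    return tuple(sum(rgb[i] for _, rgb in top) / n for i in range(3))


def color_difference(avg_a, avg_b):
    """Sum of absolute per-channel differences (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(avg_a, avg_b))


def match_percent(diff):
    """Map a difference of 0..765 onto a 100..0 percent score."""
    return 100.0 * (1 - diff / 765.0)


def page_average(path, top_n=10):
    """Compute the top-colors average for one image file."""
    # Imported here so the pure helpers above can be used without PIL installed.
    from PIL import Image

    with Image.open(path) as im:
        rgb = im.convert("RGB")
        # getcolors() returns None if the image has more than maxcolors
        # distinct colors; width * height is a safe upper bound.
        counts = rgb.getcolors(maxcolors=rgb.width * rgb.height)
    return top_colors_average(counts, top_n)
```

The reference would be computed once with `page_average()` on a known separator scan, then compared against each page with `color_difference()`. Counting every color of a full-page JPEG is the slow part, which is probably why this "isn't super fast".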
Well done! Thanks for letting me know that the basic approach was correct.

regards
Steve

Jun 2 '07 #8
Oh, by the way: instead of averaging over the *whole page*, now average
over (say) eight samples, each of 10 x 10 pixels or thereabouts. They
should all be roughly the same, and they should all be close to the
color of the separator page.

Seems to me (again, without bothering to actually write the code, which
you are far more motivated to do than I anyway) that the much smaller
amount of arithmetic will compensate for any loss in accuracy (which I
surmise will anyway be trivial if the separator is something like DayGlo
green or yellow).

Or maybe it's already running "fast enough", and you are already on to
the next job?

regards
Steve

Jun 2 '07 #9
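
The eight-sample variant suggested in the post above was never written out in the thread. One way it might look is sketched below. The 2 x 4 grid layout, the function names, and the `max_diff=60` threshold are assumptions for illustration only; the difference measure is the same sum of absolute R, G, B differences Larry described.

```python
def sample_boxes(width, height, size=10):
    """Return eight size x size boxes laid out on a 2-column, 4-row grid.

    Each box is (left, upper, right, lower), centered at 1/4 and 3/4 of
    the width and at 1/5 .. 4/5 of the height.
    """
    boxes = []
    for row in (1, 2, 3, 4):
        for col in (1, 3):
            left = width * col // 4 - size // 2
            upper = height * row // 5 - size // 2
            boxes.append((left, upper, left + size, upper + size))
    return boxes


def all_samples_match(means, reference_rgb, max_diff=60):
    """True if every sample average is within max_diff of the reference."""
    return all(
        sum(abs(a - b) for a, b in zip(mean, reference_rgb)) <= max_diff
        for mean in means
    )


def read_sample_means(path, size=10):
    """Average RGB of each of the eight sample boxes in an image file."""
    # Imported here so the pure helpers above can be used without PIL installed.
    from PIL import Image, ImageStat

    with Image.open(path) as im:
        rgb = im.convert("RGB")
        return [
            tuple(ImageStat.Stat(rgb.crop(box)).mean)
            for box in sample_boxes(rgb.width, rgb.height, size)
        ]
```

This touches only 800 pixels per page for the comparison, although the JPEG still has to be decoded first, so the saving is in the arithmetic rather than the file I/O. Requiring all eight samples to match also guards against an ordinary page that happens to contain one pink region.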
