Using PIL to find separator pages

I have a project that I wanted to solicit some advice
on from this group. I have millions of pages of scanned
documents with each page in an individual .JPG file.
When the documents were scanned the people that did
the scanning put a colored (hot pink) separator page
between the individual documents. I was wondering if
there was any way to utilize PIL to scan through the
individual files, look at some small section on the
page, and determine if it is a separator page by
somehow comparing the color to the separator page
color? I realize that this would be some sort of
percentage match where 100% would be a perfect match
and any number lower would indicate that it was less
likely that it was a separator page.

Thanks in advance for any thoughts or advice.

Regards,
Larry Bates
May 31 '07 #1
I suspect the easiest way would be to select a few small patches of each
image and average the color values of the pixels, then convert that
average to hue rather than comparing raw RGB, since hue is much less
sensitive to how light or dark a scan came out.

Close enough to the hue you want (and you could include saturation and
intensity too, if you felt like it) across several areas of the page
would be a hit for a separator.
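
A minimal sketch of the patch-and-hue idea, not code posted in the thread. It assumes the modern PIL fork (Pillow) and the standard-library colorsys module. The function names, the tolerances, and the target color are hypothetical; measure the real target color from one of your own separator scans.

```python
import colorsys
from PIL import Image, ImageStat

def patch_hsv(img, box):
    """Average the pixels in box (left, upper, right, lower); return (h, s, v) in 0..1."""
    patch = img.crop(box).convert("RGB")
    r, g, b = ImageStat.Stat(patch).mean  # per-band averages, 0..255
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

def hue_distance(h1, h2):
    """Distance between two hues on the color wheel (hue wraps around at 1.0)."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)

def looks_like_separator(img, boxes, target_rgb, hue_tol=0.05, min_sat=0.3):
    """True if every patch is saturated and close in hue to target_rgb.

    hue_tol and min_sat are illustrative guesses, not tuned values.
    """
    target_h, _, _ = colorsys.rgb_to_hsv(*(c / 255.0 for c in target_rgb))
    for box in boxes:
        h, s, _ = patch_hsv(img, box)
        # Low saturation means white/grey paper, whose hue is meaningless.
        if s < min_sat or hue_distance(h, target_h) > hue_tol:
            return False
    return True
```

The saturation check is there because plain white paper has no meaningful hue, so comparing hue alone could give false hits on ordinary pages.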

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

May 31 '07 #2
I used GraphicsMagick for a similar situation. Once installed, you can
run 'gm identify' to return all sorts of useful information about the
images. In my case I had Python call 'gm' to identify the number of
colors in each image, then inspect the output and handle the image
accordingly. I'll bet PIL could do a similar thing, but in my case I
was examining DPX files, which PIL can't handle. Either approach will
most likely take a bit of time unless you spread the work over several
machines.
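
For the "PIL could do a similar thing" guess, here is a rough sketch of counting distinct colors with Image.getcolors(), assuming Pillow. It is not a port of the 'gm' workflow described above, and count_colors is a hypothetical helper name.

```python
from PIL import Image

def count_colors(img):
    """Return the number of distinct RGB colors in a PIL image."""
    rgb = img.convert("RGB")
    width, height = rgb.size
    # getcolors() returns None when the image has more than maxcolors colors,
    # so allow as many colors as there are pixels.
    colors = rgb.getcolors(maxcolors=width * height)
    return len(colors)
```

Bear in mind that JPEG compression tends to smear a flat colored page into many slightly different colors, so a raw color count may separate pages less cleanly on .JPG scans than on lossless formats.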

~Sean

Jun 1 '07 #3
Steve,

I'm completely lost on how to proceed. I don't know how to average color
values, normalize to hue... Any guidance you could give would be greatly
appreciated.

Thanks in advance,
Larry
Jun 1 '07 #4
I'd like to help but I don't have any sample code to hand. Maybe someone
who does could give you more of a clue. Let's hope so, anyway ...

regards
Steve

Jun 1 '07 #6
I think I've come up with something that will work. I use PIL's
Image.getcolors() to get the colors and take the top 10 colors of my
background page. I then calculate the average of the R, G, B
components. That becomes my reference. Then I read a page and
make the same calculation. I then calculate the absolute value
of the difference between the two averages for each of R, G, and B.
Summing those gives something like the overall difference between
the two average colors (at least that is what I think it does).
This seems to give me small numbers when the pages are the same
and large numbers when they are different. It isn't super fast
but it is working.
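
One possible reading of that description, sketched with Pillow. The thread does not show the actual code, so the helper names are hypothetical, and weighting each color by its pixel count is an assumption. match_percent is an extra that maps the difference onto the 0 to 100% scale mentioned in the original question.

```python
from PIL import Image

def average_top_colors(img, top=10):
    """Count-weighted average (r, g, b) of the `top` most frequent colors."""
    rgb = img.convert("RGB")
    width, height = rgb.size
    # getcolors() yields [(count, (r, g, b)), ...]; the large maxcolors
    # keeps it from returning None on images with many colors.
    colors = rgb.getcolors(maxcolors=width * height)
    colors.sort(key=lambda item: item[0], reverse=True)
    top_colors = colors[:top]
    total = sum(count for count, _ in top_colors)
    return tuple(
        sum(count * color[i] for count, color in top_colors) / total
        for i in range(3)
    )

def color_difference(avg_a, avg_b):
    """Sum of absolute per-channel differences: 0 (identical) up to 765."""
    return sum(abs(a - b) for a, b in zip(avg_a, avg_b))

def match_percent(avg_a, avg_b):
    """Rescale the difference so that 100 means a perfect match."""
    return 100.0 * (1.0 - color_difference(avg_a, avg_b) / 765.0)
```

The 765 is simply 3 x 255, the largest possible sum of per-channel differences between two 8-bit RGB colors. Compute the reference once from a known separator page and reuse it for every page you test.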

Thanks for pushing me in the right direction.

-Larry
Jun 2 '07 #7
Well done! Thanks for letting me know that the basic approach was correct.

regards
Steve

Jun 2 '07 #8
Oh, by the way: instead of averaging over the *whole page*, now average
over (say) eight samples each of 10 x 10 pixels or thereabouts. They
should all be roughly the same, and they should all be close to the
separator color.

Seems to me (again, without bothering to actually write the code, which
you are far more motivated to do than I anyway) that the much smaller
amount of arithmetic will compensate for any loss in accuracy (which I
surmise will anyway be trivial if the separator is something like DayGlo
green or yellow).
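
A rough sketch of the eight-sample variant, assuming Pillow. Where the samples go is not specified above, so spacing them along the page diagonal is an assumption, as are the function names and the max_diff threshold.

```python
from PIL import Image, ImageStat

def sample_boxes(size, n=8, patch=10):
    """Return n boxes of patch x patch pixels spread along the page diagonal."""
    width, height = size
    boxes = []
    for i in range(n):
        # Fractions 1/(n+1) .. n/(n+1) keep every patch fully inside the page.
        frac = (i + 1) / (n + 1)
        left = int(frac * (width - patch))
        upper = int(frac * (height - patch))
        boxes.append((left, upper, left + patch, upper + patch))
    return boxes

def is_separator(img, reference_rgb, max_diff=60, n=8, patch=10):
    """True if every sampled patch averages close to reference_rgb.

    max_diff is an illustrative threshold on the summed R, G, B difference.
    """
    for box in sample_boxes(img.size, n, patch):
        mean = ImageStat.Stat(img.crop(box).convert("RGB")).mean
        diff = sum(abs(m - r) for m, r in zip(mean, reference_rgb))
        if diff > max_diff:
            return False  # one patch off-color is enough to reject the page
    return True
```

With eight patches of 10 x 10 pixels, only 800 pixels per page are averaged, although each JPEG still has to be opened and decoded, which may well dominate the running time.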

Or maybe it's already running "fast enough", and you are already on to
the next job?

regards
Steve

Jun 2 '07 #9