I have a project that I wanted to solicit some advice
on from this group. I have millions of pages of scanned
documents with each page in an individual .JPG file.
When the documents were scanned the people that did
the scanning put a colored (hot pink) separator page
between the individual documents. I was wondering if
there was any way to utilize PIL to scan through the
individual files, look at some small section on the
page, and determine if it is a separator page by
somehow comparing the color to the separator page
color? I realize that this would be some sort of
percentage match where 100% would be a perfect match
and any number lower would indicate that it was less
likely that it was a separator page.
Thanks in advance for any thoughts or advice.
Regards,
Larry Bates
Larry Bates wrote:
[...]
I suspect the easiest way would be to select a few small patches of each
image and average the color values of the pixels, then normalize to hue
rather than RGB.
Close enough to the hue you want (and you could include saturation and
intensity too, if you felt like it) across several areas of the page
would be a hit for a separator.
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------
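A minimal sketch of the patch-and-hue idea Steve describes, not code from the thread. It assumes Pillow (the maintained PIL fork) and the standard colorsys module. The hot pink reference of RGB (255, 105, 180), the tolerance, the minimum saturation, and all function names are illustrative assumptions; the real separator color would need to be measured from an actual scan.

```python
import colorsys

from PIL import Image, ImageStat

# Assumed reference: the CSS "hotpink" value. Measure a real separator scan instead.
HOT_PINK_RGB = (255, 105, 180)


def rgb_to_hue_sat(rgb):
    """Convert an (R, G, B) triple in 0..255 to (hue in degrees, saturation 0..1)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0, s


def patch_hue_sat(img, box):
    """Average the pixels inside box = (left, top, right, bottom), then convert."""
    mean_rgb = ImageStat.Stat(img.crop(box).convert("RGB")).mean
    return rgb_to_hue_sat(mean_rgb)


def hue_distance(h1, h2):
    """Shortest distance between two hue angles; hue wraps around at 360."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def is_separator_by_hue(img, boxes, target_rgb=HOT_PINK_RGB,
                        tolerance=15.0, min_saturation=0.3):
    """True when every sampled patch is saturated and close to the target hue."""
    target_hue, _ = rgb_to_hue_sat(target_rgb)
    for box in boxes:
        hue, sat = patch_hue_sat(img, box)
        # White or grey paper has almost no saturation, so its hue is meaningless.
        if sat < min_saturation or hue_distance(hue, target_hue) > tolerance:
            return False
    return True
```

Usage might look like is_separator_by_hue(Image.open("page0001.jpg"), boxes), where the file name is hypothetical and boxes is a list of small rectangles spread over the page. Comparing hue rather than raw RGB is meant to make the test less sensitive to how bright or dark a particular scan came out.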
On May 31, 10:01 am, Larry Bates <larry.ba...@websafe.com> wrote:
[...]
I used GraphicsMagick for a similar situation. Once installed you can
run `gm identify` to return all sorts of useful information about the
images. In my case I had Python call `gm` to identify the number of
colors in each image, then inspect the output and handle the image
accordingly. I'll bet PIL could do a similar thing, but in my case I
was examining DPX files, which PIL can't handle. Either approach will
most likely take a bit of time unless you spread the work over several
machines.
~Sean
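For the PIL side of Sean's suggestion, counting the distinct colors in an image could be sketched roughly like this. It is not Sean's code (he called `gm` from Python), and the function name is made up for illustration. It assumes Pillow and relies on Image.getcolors(), which returns None when the image holds more than maxcolors distinct colors.

```python
from PIL import Image


def count_colors(img):
    """Return the number of distinct RGB values in the image."""
    rgb = img.convert("RGB")
    width, height = rgb.size
    # Allow for the worst case of every pixel being a different colour,
    # so that getcolors() never returns None.
    colors = rgb.getcolors(maxcolors=width * height)
    return len(colors)
```

JPEG compression and scanner noise will probably inflate the count a good deal on real pages, so any cutoff would have to be found by experiment on actual scans.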
Steve Holden wrote:
[...]
Steve,
I'm completely lost on how to proceed. I don't know how to average color
values, normalize to hue... Any guidance you could give would be greatly
appreciated.
Thanks in advance,
Larry
Larry Bates wrote:
[...]
I'd like to help but I don't have any sample code to hand. Maybe someone
who does could give you more of a clue. Let's hope so, anyway ...
regards
Steve
Steve Holden wrote:
[...]
I think I've come up with something that will work. I use PIL
Image.getcolors() to get colors and take the top 10 colors of my
background page. I then calculate the average of the R, G, B
components. That becomes my reference. Then I read a page and
make the same calculation. I then calculate the absolute value
of the difference of R, G, B of the two values. Summing those
gives something like the average difference between
the two average colors (at least that is what I think it does).
This seems to give me small numbers when the pages are the same
and large numbers when they are different. It isn't super fast
but it is working.
Thanks for pushing me in the right direction.
-Larry
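Larry did not post his code, so the following is only one possible reading of his description, assuming Pillow. The function names are invented, and taking an unweighted per-channel mean of the ten most frequent colors is a guess at what "average of the R, G, B components" means. The match_percent helper is an extra assumption that maps the difference onto the percentage scale asked about in the original question.

```python
from PIL import Image


def top_color_average(img, top_n=10):
    """Mean (R, G, B) of the top_n most frequent colours in the image."""
    rgb = img.convert("RGB")
    width, height = rgb.size
    # getcolors() yields (count, (r, g, b)) pairs; maxcolors is set high
    # enough that it never returns None.
    colors = rgb.getcolors(maxcolors=width * height)
    most_common = sorted(colors, key=lambda item: item[0], reverse=True)[:top_n]
    n = len(most_common)
    return tuple(sum(color[i] for _count, color in most_common) / n
                 for i in range(3))


def color_difference(avg_a, avg_b):
    """Sum of the absolute per-channel differences (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(avg_a, avg_b))


def match_percent(difference):
    """Map a difference in 0..765 (3 channels x 255) onto 100%..0%."""
    return 100.0 * (1.0 - difference / 765.0)
```

The reference would be computed once from a known separator page, and each scanned page compared against it. The threshold that separates "small" from "large" differences has to come from trying it on real scans.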
Larry Bates wrote:
[...]
Well done! Thanks for letting me know that the basic approach was correct.
regards
Steve
Steve Holden wrote:
[...]
Oh, by the way: instead of averaging over the *whole page*, now average
over (say) eight samples each of 10 x 10 pixels or thereabouts. They
should all be roughly the same, and they should all be close to the
separator color.
Seems to me (again, without bothering to actually write the code, which
you are far more motivated to do than I anyway) that the much smaller
amount of arithmetic will compensate for any loss in accuracy (which I
surmise will anyway be trivial if the separator is something like DayGlo
green or yellow).
Or maybe it's already running "fast enough", and you are already on to
the next job?
regards
Steve
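Steve deliberately left the code unwritten, so this is just a rough sketch of the eight-sample idea, assuming Pillow. The 2 x 4 layout of the patches, the threshold of 60, and the function names are arbitrary choices for illustration. It also assumes pages are much larger than the patches, as full-page scans would be.

```python
from PIL import Image, ImageStat


def sample_boxes(width, height, patch=10):
    """Eight patch rectangles laid out as 2 columns x 4 rows (arbitrary layout)."""
    xs = [width // 3, 2 * width // 3]
    ys = [height * k // 5 for k in range(1, 5)]
    half = patch // 2
    return [(x - half, y - half, x - half + patch, y - half + patch)
            for y in ys for x in xs]


def patch_average(img, box):
    """Mean (R, G, B) of the pixels inside box = (left, top, right, bottom)."""
    return tuple(ImageStat.Stat(img.crop(box).convert("RGB")).mean)


def is_separator(img, reference_rgb, threshold=60.0, patch=10):
    """True when all eight patches are close to the reference colour."""
    width, height = img.size
    for box in sample_boxes(width, height, patch):
        avg = patch_average(img, box)
        diff = sum(abs(a - b) for a, b in zip(avg, reference_rgb))
        if diff > threshold:
            return False  # one patch that is clearly off rejects the page
    return True
```

Only 800 pixels are averaged per page here, so nearly all the remaining time is likely to go into decoding the JPEG itself rather than into the arithmetic.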