Extract Image From PDF

Hi all

Does anybody please know a way to extract an Image from a pdf file and save
it as a TIFF?

I have used a scanner to scan documents which are then placed on a server,
but I need to extract the image of the document (just the first page if
there are multiple pages) and save it as a TIFF so I can then use the
Tesseract OCR to get the text in the image.

I think the company I work for may have a license for Adobe Acrobat
Professional, if that provides a way to do this from my .NET application.

Thank you for your help.

Kind Regards,
Steve
Aug 22 '08 #1
5 Replies


I don't know of a .NET way exactly, but you can check out Ghostscript,
which can read a PDF and save it as a TIFF. I think you can specify which
pages to convert. You can call Ghostscript from the command line with your
parameters using Process.Start.

hth,

Rick
Aug 22 '08 #2

Hi Rick

Thanks for that. I have downloaded and installed Ghostscript. I have a demo
app that can execute Ghostscript with command line parameters, and at the
moment I can only get the revision number and a thumbnail view of the first
page (JPEG) based on the content I have found.

Do you know the parameters I would need to extract the image on the first
page to a TIFF please? I can't seem to find these anywhere :o(

Here are the args I found to generate a jpeg based on a pdf document:

Dim astrArgs(6) As String 'Upper bound 6 gives the 7 elements (0 to 6) used below
astrArgs(0) = "pdf2jpg" 'The First Parameter is Ignored
astrArgs(1) = "-dNOPAUSE"
astrArgs(2) = "-dBATCH"
astrArgs(3) = "-dSAFER"
astrArgs(4) = "-sDEVICE=jpeg"
astrArgs(5) = "-sOutputFile=C:\Thumbnail.jpg"
astrArgs(6) = "C:\MyPDFDoc.pdf"

Thanks for your help!

Regards,
Steve
Aug 23 '08 #3

I run mine from .NET with Process.Start like this:

process.StartInfo.Arguments = String.Format( _
    "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg3 -sOutputFile=""{0}"" ""{1}""", _
    Path.ChangeExtension(fileName, ".tif"), fileName)

If you want to read only the first page, add -dFirstPage=1 and -dLastPage=1
(see http://web.mit.edu/ghostscript/www/Use.htm ).
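
Putting those switches together, a minimal sketch of the whole call might look like the following. The module name, the function names and the gsExePath parameter are made up for illustration. gsExePath has to point at the Ghostscript console executable on your machine (often gswin32c.exe on 32-bit Windows, but check your own install).

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO

Module GhostscriptFirstPage

    ' Builds the argument string that renders only page 1 of a PDF to a TIFF.
    Function BuildFirstPageArgs(ByVal pdfPath As String, ByVal tiffPath As String) As String
        Return String.Format( _
            "-dSAFER -dBATCH -dNOPAUSE -dFirstPage=1 -dLastPage=1 " & _
            "-sDEVICE=tiffg3 -sOutputFile=""{0}"" ""{1}""", _
            tiffPath, pdfPath)
    End Function

    ' Runs Ghostscript, waits for it to finish and returns its exit code.
    ' gsExePath is an assumption: the caller supplies the local executable path.
    Function ConvertFirstPage(ByVal gsExePath As String, ByVal pdfPath As String) As Integer
        Dim tiffPath As String = Path.ChangeExtension(pdfPath, ".tif")
        Dim startInfo As New ProcessStartInfo(gsExePath, BuildFirstPageArgs(pdfPath, tiffPath))
        startInfo.UseShellExecute = False
        startInfo.CreateNoWindow = True
        Using gs As Process = Process.Start(startInfo)
            gs.WaitForExit()
            Return gs.ExitCode
        End Using
    End Function

End Module
```

Waiting for the process to exit before touching the TIFF avoids handing a half-written file to the OCR step.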

I am slightly confused about what you really want. I understood you want to
convert the entire first page to a TIFF file and then use an OCR program to
read the text. If you only want to extract an image from the first page, I'm
not sure this would work. I don't know of a facility to extract an image
from a PDF. You might check iTextSharp, which can create and read PDFs. If
you know the name of the image, you may be able to extract it.

Also, if you want a TIFF file, why does the code you posted generate a JPEG?

Rick
Aug 23 '08 #4

Thank you, I'm generating tiff files now.

The pdf is an image of a scanned document. I would like to get the text of
the scanned image. I looked into OCR, and came across Tesseract. To my
knowledge, Tesseract can (only) read a tiff file and extract the text. If I
open up a document in Adobe Pro and save the scanned image as a tiff,
tesseract does read most of it quite well, but my problem is that I have to
automate the process and can't open up the documents and manually save the
images, so I need something to extract the scanned image in the pdf file and
save it as a tiff so tesseract can read it. I tried iTextSharp already but I
get an error "PDF header signature not found", which I'm guessing is a
problem with the way the scanner creates the pdf files and iTextSharp can't
open it.
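
One way to test that guess is to look for the "%PDF-" marker near the start of the file. The sketch below is only a diagnostic, not how iTextSharp itself checks. The module and function names are invented, and the 1024-byte window is an assumption about how far in a lenient reader might look.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module PdfHeaderCheck

    ' Returns the byte offset of the "%PDF-" marker within the first
    ' maxBytes bytes of the file, or -1 if it does not appear there.
    Function FindPdfHeaderOffset(ByVal filePath As String, _
                                 Optional ByVal maxBytes As Integer = 1024) As Integer
        Dim bytes(maxBytes - 1) As Byte
        Dim count As Integer
        Using fs As FileStream = File.OpenRead(filePath)
            count = fs.Read(bytes, 0, bytes.Length)
        End Using
        ' Latin-1 maps each byte to exactly one character, so an index in the
        ' decoded string is also a byte offset in the file.
        Dim head As String = Encoding.GetEncoding("iso-8859-1").GetString(bytes, 0, count)
        Return head.IndexOf("%PDF-", StringComparison.Ordinal)
    End Function

End Module
```

A result of 0 means the header is where a strict parser would expect it. A positive result would suggest the scanner writes extra bytes in front of the PDF data, and -1 would suggest the file is not a plain PDF at all.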

I found some sample code that creates a jpeg, which is what I posted, but I
didn't know how to create a tiff file, but I see that it's just a case of
changing the -sDEVICE parameter to the one you are using.

Unfortunately, the resulting tiff image is not great quality and tesseract
makes many errors when trying to read it, so I have to find another way or
give up :o(
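
Before giving up, two Ghostscript settings may be worth trying. The tiffg3 device writes 1-bit fax-encoded output, and the -r switch sets the rendering resolution. The sketch below requests 300 dpi from the 8-bit tiffgray device instead. Both defaults are guesses rather than tested values, and the module and function names are invented. Whether a given Tesseract build accepts the resulting bit depth and compression needs checking.

```vbnet
Imports System

Module GhostscriptOcrArgs

    ' Builds Ghostscript arguments for page 1 at a chosen resolution and device.
    ' dpi = 300 and device = "tiffgray" are assumptions to experiment with.
    Function BuildOcrArgs(ByVal pdfPath As String, ByVal tiffPath As String, _
                          Optional ByVal dpi As Integer = 300, _
                          Optional ByVal device As String = "tiffgray") As String
        Return String.Format( _
            "-dSAFER -dBATCH -dNOPAUSE -dFirstPage=1 -dLastPage=1 " & _
            "-r{0} -sDEVICE={1} -sOutputFile=""{2}"" ""{3}""", _
            dpi, device, tiffPath, pdfPath)
    End Function

End Module
```

Keeping the resolution and device as parameters makes it easy to try a few combinations against the same scan and compare the OCR output.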

Thank you for your help. If you know of any other way to do what I'm trying
to do, I'd love to know! I don't mind paying a small amount for commercial
software that can extract images from PDF docs and that I can use in .NET,
but I haven't found any yet that doesn't cost hundreds or even thousands
of dollars.
Aug 23 '08 #5

http://www.foolabs.com/xpdf/

pdfimages - Portable Document Format (PDF) image extractor
(version 3.02)
Aug 23 '08 #6
