Code to Extract Text from PDF

I have also posted this question in the Visual Basic 2005 and Visual
Basic .Net 2005 discussion groups.

Hi. I am developing an application/web page with VB.Net that will
populate a SQL database from text extracted from PDF documents.
However, I am having a difficult time finding or developing the
appropriate code to convert the PDF streams into text strings. Has
anyone developed code to convert PDFs to text?

I was able to write a Perl script that calls a PDF-to-text
conversion application, but I am having difficulty writing a
similar shell command in VB. Any ideas?

Once I have the text strings, I can parse the data easily into the
SQL database tables.

Jul 2 '08 #1
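1. Get this to convert PDF to text:
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip
2. Use this sub:
Sub Pdf2Txt(ByVal options As String, ByVal pdfFile As String, ByVal txtFile As String)
    ' Make sure pdfFile and txtFile include their full paths.
    ' Each path is wrapped in double quotes so that a path containing spaces stays one argument.
    Dim arguments As String = options & " """ & pdfFile & """ """ & txtFile & """"
    ' pdftotext.exe must be on the PATH or in the working directory; otherwise give its full path here.
    System.Diagnostics.Process.Start("pdftotext.exe", arguments)
End Sub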
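
For a web page it usually helps to wait until the converter has finished before reading the text file. The following is only a sketch of that idea using ProcessStartInfo; the module name, the Quote helper and the exePath parameter are illustrative names that do not appear in the post above, and the exit-code convention (0 for success) is an assumption about pdftotext.

```vbnet
Imports System.Diagnostics

Module PdfToTextRunner
    ' Wraps a path in double quotes so that spaces do not split it into several arguments.
    Function Quote(ByVal path As String) As String
        Return """" & path & """"
    End Function

    ' Runs the converter and returns its exit code (assumed: 0 means success).
    ' exePath is a hypothetical parameter: the full path to pdftotext.exe on your machine.
    Function RunPdfToText(ByVal exePath As String, ByVal options As String, _
                          ByVal pdfFile As String, ByVal txtFile As String) As Integer
        Dim info As New ProcessStartInfo()
        info.FileName = exePath
        info.Arguments = options & " " & Quote(pdfFile) & " " & Quote(txtFile)
        info.UseShellExecute = False   ' start the exe directly, not through the shell
        info.CreateNoWindow = True     ' no console window, which suits a server process

        Using proc As Process = Process.Start(info)
            proc.WaitForExit()         ' the text file is only complete after the process exits
            Return proc.ExitCode
        End Using
    End Function
End Module
```

Called as RunPdfToText(fullPathToExe, "-layout", pdfPath, txtPath), a non-zero return value is a hint that the text file should not be parsed.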


"SteveB" <st**********@usbank.com> wrote in message
news:a9**********************************@34g2000hsh.googlegroups.com...
Jul 2 '08 #2
I have tried many free libraries and had mixed results.
The only reliable avenue was using the Aspose library.

http://www.aspose.com/categories/fil...a/default.aspx

My steps using Aspose:
1. Init the library, open the file.
2. Loop through each page.
3. Collect the page data, massage it, and post it to the database.

A typical PDF file for me has 1,500 pages with no forms and 20+ elements per
page to extract.
The average time per document to extract, massage the data, and pass it to the
database is 10-15 seconds.
Aspose can read each document in 3 seconds total.

The downside is that it costs money, yet it has been a great investment
because it has served me well on multiple projects.
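
The same page-by-page loop can also be sketched on top of pdftotext output instead of a commercial library. This is not Aspose's API; it assumes the text file was produced by pdftotext with its default behaviour of separating pages with a form feed character, and "report.txt" and the module name are placeholders.

```vbnet
Imports System
Imports System.IO

Module PageLoop
    ' Splits converter output into pages on the form feed character (code 12).
    ' Completely empty pages are dropped, so page numbers can shift if the PDF has blank pages.
    Function SplitPages(ByVal text As String) As String()
        Return text.Split(New Char() {ChrW(12)}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' "report.txt" is a placeholder for a file written by pdftotext.
        Dim pages() As String = SplitPages(File.ReadAllText("report.txt"))
        For i As Integer = 0 To pages.Length - 1
            ' Collect/massage/post the page data here; this sketch only reports the page size.
            Console.WriteLine("Page " & (i + 1).ToString() & ": " & pages(i).Length.ToString() & " characters")
        Next
    End Sub
End Module
```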

Jul 7 '08 #3
On Jul 2, 5:05 pm, "Gillard" <gillard_geor...@hotmail.com> wrote:
I tried your suggestion, and the app works great from a command line.
However, when I try to call pdftotext as you suggested, I keep getting
this exception:

System.ComponentModel.Win32Exception was unhandled by user code
ErrorCode=-2147467259
Message="The system cannot find the file specified"
Source="System"
StackTrace:
   at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
   at System.Diagnostics.Process.Start()
   at System.Diagnostics.Process.Start(ProcessStartInfo startInfo)
   at System.Diagnostics.Process.Start(String fileName)
   at _Default.Pdf2Txt(String options, String pdffile, String textfile) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 48
   at _Default.Submit1_Click(Object sender, EventArgs e) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 27
   at System.Web.UI.WebControls.Button.OnClick(EventArgs e)
   at System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument)
   at System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument)
   at System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument)
   at System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData)
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)

This is my code:

Protected Sub Submit1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Submit1.Click

    Dim Path As String = System.IO.Path.GetDirectoryName(File1.PostedFile.FileName)
    Dim FileName As String
    Dim MyText() As String
    Dim NewFileName As String
    Dim DataPath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Data\"
    Dim ArchivePath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Archive\"
    Dim MMM As String = MonthName(Month(Now()), True)
    Dim YYYY As String = Year(Now())

    'Create new archive directory.
    My.Computer.FileSystem.CreateDirectory(ArchivePath & YYYY & "\" & MMM)
    ArchivePath = ArchivePath & YYYY & "\" & MMM & "\"

    System.IO.Directory.SetCurrentDirectory(DataPath)

    If Not File1.PostedFile Is Nothing And File1.PostedFile.ContentLength > 0 Then
        For Each oneFile As String In My.Computer.FileSystem.GetFiles(Path, FileIO.SearchOption.SearchTopLevelOnly, "*.pdf")
            FileName = System.IO.Path.GetFileName(oneFile)
            MyText = Split(FileName, ".")
            NewFileName = MyText(0) & ".txt"
            movepdffile(oneFile, DataPath & FileName)
            Pdf2Txt("-layout", DataPath & FileName, DataPath & NewFileName)
        Next oneFile
    Else
        MsgBox("Please select the file(s) to upload.")
    End If
    'Insert code here to:
    'Convert .pdf documents into .txt documents with additional code to
    'import data into the Float Reg CC database.
    'Move .pdf files from working directory to archive directory and delete .txt files.
    'My.Computer.FileSystem.MoveFile(DataPath & FileName, ArchivePath & FileName, True)

    'My.Computer.FileSystem.DeleteFile(DataPath & NewFileName)
End Sub

Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
    Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
    Dim cmd As String = ("'" & exe & "' " & options & " '" & pdffile & "' '" & textfile & "'")
    MsgBox(cmd)
    System.Diagnostics.Process.Start(cmd)
End Sub

Sub movepdffile(ByVal origin As String, ByVal destination As String)
    Try
        My.Computer.FileSystem.MoveFile(origin, destination, False)
    Catch Exc As Exception
        MsgBox("Error: " & Exc.Message)
    End Try
    MsgBox("Move is successful.")
End Sub

I believe I can make this work, but I am missing something minor....
Jul 11 '08 #4
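
A likely cause, judging from the stack trace in post #4: the one-argument Process.Start(String fileName) treats the whole string as the name of the file to launch, so Windows looks for a file whose name includes the options and both paths and reports "The system cannot find the file specified". Single quotes are also not a quoting character for Windows command-line arguments; double quotes are. A minimal sketch of one possible fix follows, assuming the exe location given in the post; the module name is illustrative, and this is not a tested drop-in replacement.

```vbnet
Imports System.Diagnostics

Module Pdf2TxtFix
    Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
        ' Location taken from post #4; adjust to wherever pdftotext.exe really lives.
        Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
        ' The program and its arguments are passed separately, so the exe path needs no quoting.
        ' Each file path is wrapped in double quotes because the paths contain spaces.
        Dim args As String = options & " """ & pdffile & """ """ & textfile & """"

        Using proc As Process = Process.Start(exe, args)
            If proc IsNot Nothing Then
                proc.WaitForExit()   ' wait so the .txt file is complete before it is read
            End If
        End Using
    End Sub
End Module
```

Passing the program name and the arguments as two separate parameters is the smallest change from the posted code; the rest of Submit1_Click can stay as it is.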
