Code to Extract Text from PDF

I have posted this question in the Visual Basic 2005 and Visual
Basic .Net 2005 discussion groups, also.

Hi. I am developing an application/web page with VB.Net that will
populate a SQL database from text extracted from PDF documents.
However, I am having a difficult time finding or developing the
appropriate code to convert the PDF streams into text strings. Has
anyone developed code to convert PDFs to text?

I was able to write a Perl script that calls a PDF-to-text
conversion application, but I am having difficulty writing a
similar shell command in VB. Any ideas?

Once I have the text strings, I can easily parse the data into the
SQL database tables.

Jul 2 '08 #1
1. Get this to convert PDF to text:
   ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip
2. Use this sub:

Sub Pdf2Txt(ByVal options As String, ByVal pdfFile As String, ByVal txtFile As String)
    Dim arguments As String = options & " " & pdfFile & " " & txtFile
    'make sure to provide the path with the pdfFile and the txtFile
    System.Diagnostics.Process.Start("pdftotext.exe", arguments)
End Sub
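
One caveat with the sub above: the arguments are joined with plain spaces, so a path that itself contains spaces (for example, anything under "Documents and Settings") would be split into several arguments. A minimal sketch of one way around that, assuming the usual Windows convention of wrapping each path in double quotes. QuoteArg and BuildPdfToTextArgs are hypothetical helper names, not part of xpdf or the .NET Framework, and the sample paths are made up:

```vb
Module PdfToTextArgs
    ' Hypothetical helper: wraps a path in double quotes so that embedded
    ' spaces do not split it into several command-line arguments.
    Function QuoteArg(ByVal path As String) As String
        Return """" & path & """"
    End Function

    ' Hypothetical helper: builds the argument string passed to pdftotext.
    Function BuildPdfToTextArgs(ByVal options As String, ByVal pdfFile As String, ByVal txtFile As String) As String
        Return options & " " & QuoteArg(pdfFile) & " " & QuoteArg(txtFile)
    End Function

    Sub Main()
        ' Illustrative paths only.
        Console.WriteLine(BuildPdfToTextArgs("-layout", "D:\My Data\in.pdf", "D:\My Data\out.txt"))
    End Sub
End Module
```

The resulting string can be passed as the second argument to Process.Start("pdftotext.exe", arguments) in place of the unquoted one.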


"SteveB" <st**********@usbank.com> wrote in message
news:a9**********************************@34g2000hsh.googlegroups.com...
Jul 2 '08 #2
I have tried many free libraries and had mixed results.
The only reliable avenue was the Aspose library.

http://www.aspose.com/categories/fil...a/default.aspx

My steps using Aspose:
1. Init library, open file.
2. Loop through each page.
3. Collect page data, massage it, and post it to the database.

A typical PDF file for me has 1,500 pages with no forms and 20+ elements per
page to extract.
Average time per document to extract, massage the data, and pass it to the
database is 10-15 seconds.
Aspose can read each document in 3 seconds total.

Downside: it costs money, yet it has been a great investment because it has
served me well on multiple projects.

Jul 7 '08 #3
On Jul 2, 5:05 pm, "Gillard" <gillard_geor...@hotmail.com> wrote:
I tried your suggestion, and the app works great from a command line.
However, when I try to call pdftotext as you suggested, I keep getting
this exception:

System.ComponentModel.Win32Exception was unhandled by user code
  ErrorCode=-2147467259
  Message="The system cannot find the file specified"
  Source="System"
  StackTrace:
    at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
    at System.Diagnostics.Process.Start()
    at System.Diagnostics.Process.Start(ProcessStartInfo startInfo)
    at System.Diagnostics.Process.Start(String fileName)
    at _Default.Pdf2Txt(String options, String pdffile, String textfile) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 48
    at _Default.Submit1_Click(Object sender, EventArgs e) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 27
    at System.Web.UI.WebControls.Button.OnClick(EventArgs e)
    at System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument)
    at System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument)
    at System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument)
    at System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData)
    at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)

This is my code:

Protected Sub Submit1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Submit1.Click

    Dim Path As String = System.IO.Path.GetDirectoryName(File1.PostedFile.FileName)
    Dim FileName As String
    Dim MyText() As String
    Dim NewFileName As String
    Dim DataPath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Data\"
    Dim ArchivePath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Archive\"
    Dim MMM As String = MonthName(Month(Now()), True)
    Dim YYYY As String = Year(Now())

    'Create new archive directory.
    My.Computer.FileSystem.CreateDirectory(ArchivePath & YYYY & "\" & MMM)
    ArchivePath = ArchivePath & YYYY & "\" & MMM & "\"

    System.IO.Directory.SetCurrentDirectory(DataPath)

    If Not File1.PostedFile Is Nothing And File1.PostedFile.ContentLength > 0 Then
        For Each oneFile As String In My.Computer.FileSystem.GetFiles(Path, FileIO.SearchOption.SearchTopLevelOnly, "*.pdf")
            FileName = System.IO.Path.GetFileName(oneFile)
            MyText = Split(FileName, ".")
            NewFileName = MyText(0) & ".txt"
            movepdffile(oneFile, DataPath & FileName)
            Pdf2Txt("-layout", DataPath & FileName, DataPath & NewFileName)
        Next oneFile
    Else
        MsgBox("Please select the file(s) to upload.")
    End If
    'Insert code here to:
    'Convert .pdf documents into .txt documents with additional code to
    'import data into the Float Reg CC database.
    'Move .pdf files from working directory to archive directory and delete .txt files.
    'My.Computer.FileSystem.MoveFile(DataPath & FileName, ArchivePath & FileName, True)

    'My.Computer.FileSystem.DeleteFile(DataPath & NewFileName)
End Sub
Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
    Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
    Dim cmd As String = ("'" & exe & "' " & options & " '" & pdffile & "' '" & textfile & "'")
    MsgBox(cmd)
    System.Diagnostics.Process.Start(cmd)
End Sub
Sub movepdffile(ByVal origin As String, ByVal destination As String)
    Try
        My.Computer.FileSystem.MoveFile(origin, destination, False)
        'Report success only when the move did not throw.
        MsgBox("Move is successful.")
    Catch Exc As Exception
        MsgBox("Error: " & Exc.Message)
    End Try
End Sub

I believe I can make this work, but I am missing something minor....
Jul 11 '08 #4
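
The exception in post #4 is consistent with how Process.Start(String) behaves: the single-string overload treats its entire argument as the file name, so the whole command line (executable, options, and both paths) is looked up as one file. Single quotes are also not treated as quoting characters by Windows command-line parsing. A minimal sketch of one possible fix follows, passing the executable and its arguments separately through ProcessStartInfo. RunPdfToText is a hypothetical name, the executable path is the one given in the post, and the sketch has not been tested against that setup:

```vb
Imports System.Diagnostics
Imports System.IO

Module PdfToTextRunner
    ' Hypothetical wrapper; the exe path is the one quoted in the post above.
    Function RunPdfToText(ByVal options As String, ByVal pdfFile As String, ByVal txtFile As String) As Integer
        Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
        If Not File.Exists(exe) Then
            Throw New FileNotFoundException("pdftotext.exe not found", exe)
        End If

        Dim psi As New ProcessStartInfo()
        psi.FileName = exe                 ' executable only
        ' Arguments go in their own property; paths are wrapped in double quotes.
        psi.Arguments = options & " """ & pdfFile & """ """ & txtFile & """"
        psi.UseShellExecute = False        ' start the exe directly, not via the shell
        psi.CreateNoWindow = True

        Using proc As Process = Process.Start(psi)
            proc.WaitForExit()             ' the .txt file is complete only after exit
            Return proc.ExitCode
        End Using
    End Function
End Module
```

A call such as RunPdfToText("-layout", DataPath & FileName, DataPath & NewFileName) would replace the Pdf2Txt call in the loop. Waiting for the process to exit matters if the text file is parsed immediately afterwards. In an ASP.NET site, the account the worker process runs under presumably also needs permission to execute the program and to write to the data folder.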
