convert .pdf files to .txt files

Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries, and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:
from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text(contents):
    # Walk the nested content lists and yield the text of each Text item.
    for item in contents:
        if isinstance(item, list):
            for i in contents_to_text(item):
                yield i
        elif isinstance(item, Text):
            yield item.text

doc = PDFDocument("/home/dave/pruebas_ficheros/carlos.pdf")
n_pages = doc.count_pages()
text = []

# The loop asks for pages 1 through n_pages.
for n_page in range(1, n_pages + 1):
    print("Page %d" % n_page)
    page = doc.read_page(n_page)
    contents = page.read_contents().contents
    text.extend(contents_to_text(contents))

print("".join(text))

The problem is that with some PDFs it runs words together, and in
Spanish the accents ("acentos") come out as strange characters.
For example, "camión" becomes cami/86n and
"IMPLEMENTACIÓN" becomes "IMPLEMENTA CI?".
If someone knows how to use pdftools and can help me, I would be
very happy.

Another thing is that I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information in it.
Sorry for my English.
Thanks for everything.

Jun 10 '06 #1
Davor wrote:
Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.


If you have 'xpdf' installed on your system, the
'pdftotext' command will be available.

To convert a PDF to text from Python, use a system call.
For example:

import os
os.system("pdftotext -layout my_pdf_file.pdf")

This will create the file 'my_pdf_file.txt'.
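
The same call can also be sketched with the standard subprocess module, which gives a bit more control than os.system. This is only a sketch: the helper names build_pdftotext_command and pdf_to_text are made up for illustration, and it assumes Python 3 and that the pdftotext program is on the PATH. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, and check=True raises an error if pdftotext reports a failure.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path=None):
    """Build the argument list for pdftotext (hypothetical helper)."""
    command = ["pdftotext", "-layout", pdf_path]
    if txt_path is not None:
        # pdftotext accepts an optional output file name as its last argument.
        command.append(txt_path)
    return command


def pdf_to_text(pdf_path, txt_path=None):
    """Run pdftotext; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Example call (not executed here): pdf_to_text("my_pdf_file.pdf")
```

Without a second argument, the output file name is left to pdftotext, as in the os.system example above.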

Regards,
Baiju M

Jun 10 '06 #2

If you don't already have xpdf, you can get it here:

http://glyphandcog.com/Xpdf.html

Install it and then try what Baiju said; it should work.
I've used it and it's good, which is why I say it should work. If you
have any problems, post here again.

-------------------------------------------------------------------------------------------
Vasudev Ram
Independent software consultant
Personal site: http://www.geocities.com/vasudevram
PDF conversion tools: http://sourceforge.net/projects/xtopdf
-------------------------------------------------------------------------------------------

Jun 10 '06 #3
Davor wrote:
Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries, and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:
[...]
for n_page in range(1, n_pages + 1):
    print("Page %d" % n_page)
    page = doc.read_page(n_page)
    contents = page.read_contents().contents
    text.extend(contents_to_text(contents))

print("".join(text))

The problem is that with some PDFs it runs words together, and in
Spanish the accents ("acentos") come out as strange characters.
For example, "camión" becomes cami/86n and
"IMPLEMENTACIÓN" becomes "IMPLEMENTA CI?".

pdftools just extracts the textual data in the file and stores it in
Text instances - it doesn't try to interpret or decode the text. I'd
like to fix the library so that it does try and decode the text
properly and put it into unicode strings, but I don't have the time
right now.

Remember that text can be stored in PDF files in many different
ways, and that the text cannot always be extracted in its original
form.
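
To illustrate what "decoding" means here: the same raw bytes turn into different characters depending on which encoding is assumed. The sketch below is plain Python 3 and does not use pdftools. The byte string and the choice of Latin-1 are assumptions made for the demonstration only; a real PDF may use other encodings or custom font mappings.

```python
# Raw bytes as an extractor might hand them over (assumed to be Latin-1 here).
raw = b"cami\xf3n"

# Decoding with the matching encoding recovers the accented letter: "camión".
as_latin1 = raw.decode("latin-1")

# Decoding as ASCII cannot map byte 0xF3, so it is replaced by the
# Unicode replacement character U+FFFD.
as_ascii = raw.decode("ascii", errors="replace")
```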
If someone knows how to use pdftools and can help me, I would be
very happy.

Another thing is that I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information in it.

You need to do something like this:

with open("myfilename", "w") as f:
    f.write("".join(text))
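
If the text contains accented characters, it can help to choose the output encoding explicitly when saving. A minimal sketch, assuming Python 3 and that the chunks are already-decoded strings; the function name save_text and the file name output.txt are placeholders, not part of pdftools:

```python
import os
import tempfile


def save_text(chunks, filename, encoding="utf-8"):
    """Join the extracted chunks and write them to a text file."""
    with open(filename, "w", encoding=encoding) as handle:
        handle.write("".join(chunks))


# Example: write to a temporary directory and read the file back.
target = os.path.join(tempfile.mkdtemp(), "output.txt")
save_text(["cami\u00f3n", "\n"], target)
with open(target, encoding="utf-8") as handle:
    saved = handle.read()
```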

Sorry for my English.


Don't worry about it. It's much better than my Spanish will ever be.

Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you
need better than pdftools can at the moment.

David

Jun 10 '06 #4
Thanks for everything you wrote; it will be very useful to me. In the end I
used the code below, and the file I pass in is converted to .txt in the
directory where the file is located. With documents written in Spanish
it gives no problems with accents in words like "camión" or
"introducción", which was very important to me. Thanks!

import os
os.system("pdftotext -layout my_pdf_file.pdf")

# This will create the file 'my_pdf_file.txt'.

Jun 14 '06 #5
