Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries, and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:
from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text(contents):
    for item in contents:
        if isinstance(item, type([])):
            for i in contents_to_text(item):
                yield i
        elif isinstance(item, Text):
            yield item.text

doc = PDFDocument("/home/dave/pruebas_ficheros/carlos.pdf")
n_pages = doc.count_pages()
text = []
for n_page in range(1, (n_pages+1)):
    print "Page", n_page
    page = doc.read_page(n_page)
    contents = page.read_contents().contents
    text.extend(contents_to_text(contents))
print "".join(text)
The problem is that on some PDFs it joins words together, and in
Spanish the accents ("acentos") come out wrong:
words like "camión" become "cami/86n", and
"IMPLEMENTACIÓN" becomes "IMPLEMENTA CI?", with strange
characters.
If someone knows how to use pdftools and can help me, I would be
very happy.
Another thing: I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information in it.
Sorry for my English.
Thanks for everything.
Davor wrote: Hi, my name is David. I need to read information from .pdf files and convert it to .txt files, and I have to do this in Python.
If you have 'xpdf' installed on your system,
the 'pdftotext' command will be available.
To convert a PDF to text from Python, use a system call.
For example:
import os
os.system("pdftotext -layout my_pdf_file.pdf")
This will create the file 'my_pdf_file.txt'.
Regards,
Baiju M
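A slightly more defensive variant of the call above, as a sketch for Python 3. It uses the standard subprocess module instead of os.system, passes the arguments as a list so that file names with spaces survive, and asks pdftotext for UTF-8 output through its -enc option. The helper names build_pdftotext_command and pdf_to_text are made up for illustration, and the sketch assumes pdftotext is on the PATH.

```python
import os
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path=None):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    if txt_path is None:
        # Same default as pdftotext itself: swap the extension for .txt.
        txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # -layout keeps the physical layout; -enc UTF-8 sets the output encoding.
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path=None):
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    command = build_pdftotext_command(pdf_path, txt_path)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(command, check=True)
    return command[-1]
```

Calling pdf_to_text("my_pdf_file.pdf") should then write my_pdf_file.txt next to the PDF, as in the os.system version.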
If you don't already have xpdf, you can get it here: http://glyphandcog.com/Xpdf.html
Install it and then try what Baiju said; it should work.
I've used it and it's good, which is why I say it should work. If you have any
problems, post here again.
-------------------------------------------------------------------------------------------
Vasudev Ram
Independent software consultant
Personal site: http://www.geocities.com/vasudevram
PDF conversion tools: http://sourceforge.net/projects/xtopdf
-------------------------------------------------------------------------------------------
Davor wrote: Hi, my name is David. I need to read information from .pdf files and convert it to .txt files, and I have to do this in Python. I have been looking for Python libraries, and pdftools seems to be the solution, but I do not know how to use it well. This is the example that I found on the internet:
[...]
for n_page in range(1, (n_pages+1)):
    print "Page", n_page
    page = doc.read_page(n_page)
    contents = page.read_contents().contents
    text.extend(contents_to_text(contents))
print "".join(text)
The problem is that on some PDFs it joins words together, and in Spanish the accents ("acentos") come out wrong: words like "camión" become "cami/86n", and "IMPLEMENTACIÓN" becomes "IMPLEMENTA CI?", with strange characters.
pdftools just extracts the textual data in the file and stores it in
Text instances - it doesn't try to interpret or decode the text. I'd
like to fix the library so that it does try and decode the text
properly and put it into unicode strings, but I don't have the time
right now.
Remember that text can be stored in PDF files in many different
ways, and that the text cannot always be extracted in its original
form.
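To see why undecoded text can show up as strange characters, here is a small Python 3 sketch. It is not what pdftools does internally, and PDF fonts can use their own encodings; it only assumes two common encodings, UTF-8 and Latin-1, to show that the same word becomes different bytes and that decoding with the wrong one garbles the accent.

```python
word = "camión"

# The same word stored under two different encodings gives different bytes.
latin1_bytes = word.encode("latin-1")   # b'cami\xf3n'
utf8_bytes = word.encode("utf-8")       # b'cami\xc3\xb3n'

# Decoding with the matching encoding recovers the word.
assert latin1_bytes.decode("latin-1") == word
assert utf8_bytes.decode("utf-8") == word

# Decoding with the wrong encoding produces garbled output.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # camiÃ³n
```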
If someone knows how to use pdftools and can help me, I would be very happy.
Another thing: I can see the text read from the .pdf on the screen, but I do not know how to create a .txt file and save this information in it.
You need to do something like this:
with open("myfilename", "w") as f:
    f.write("".join(text))
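In Python 3 the same idea can be sketched with an explicit encoding, so that accented characters are written predictably. The function name save_text and the file name are placeholders, and the text list here is made-up content standing in for the strings collected by the extraction loop.

```python
import os
import tempfile


def save_text(chunks, path, encoding="utf-8"):
    """Join the extracted strings and write them to a text file."""
    # The with-block closes the file even if the write fails.
    with open(path, "w", encoding=encoding) as handle:
        handle.write("".join(chunks))


# Example with made-up content in place of real extracted text.
text = ["IMPLEMENTACIÓN ", "del ", "camión"]
out_path = os.path.join(tempfile.gettempdir(), "myfilename.txt")
save_text(text, out_path)
```

Reading the file back with the same encoding should return the joined string unchanged.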
Sorry for my English.
Don't worry about it. It's much better than my Spanish will ever be.
Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you
need better than pdftools can at the moment.
David
Thanks for everything you wrote; it will be very useful to me. In the end I
used this code, and the file I give it is converted to .txt in the
directory where the file is located. With documents written in Spanish
it gives no problems with accents in words like "camión" or
"introducción", which was very important to me. Thanks!
import os
os.system("pdftotext -layout my_pdf_file.pdf")
# This will create the file 'my_pdf_file.txt'.