Read a PDF and print content in console

freddieMaize
Hi All,

I'm wondering whether a PDF can be read and its content written into a txt file. For now I'm just printing to System.out. Below is my attempt:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadPdf {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream docContents = new ByteArrayOutputStream();
        // try-with-resources closes the stream even if a read fails
        try (FileInputStream fis = new FileInputStream("c:\\zoutput.pdf")) {
            byte[] buffer = new byte[16384];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                docContents.write(buffer, 0, bytesRead);
            }
        }
        // Decodes the raw file bytes as UTF-8 text and prints them
        System.out.println(docContents.toString("UTF-8"));
    }
}

Sorry if the question is silly.

Freddie
Aug 17 '10 #1
Oralloy
Freddie,

PDF files are a mixture of text and binary data, depending on whether there's any compression.

The basic (think "hello world") PDF file is a plain text file - you can find a copy of the file specification document at adobe.com.

What are you trying to accomplish, may I ask?
Aug 17 '10 #2
Sure, you can ask.

The actual purpose is that I'm trying to index documents into a search engine, for which I need to read the contents of a PDF (and also other formats like docx, doc, ppt, pptx, and the list goes on). All the content read should be put into a String, which is then pushed to the search engine. Currently we are using Apache Tika for this, but I was wondering whether a simple ByteArrayOutputStream could solve the issue.

Thanks for responding.

Freddie
Aug 18 '10 #3
Oralloy
Well, if the search engine is the one doing the parsing, then you should be fine with a ByteArrayOutputStream.
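If the bytes are only being handed on to something else that parses them, a minimal sketch could read the file without decoding it at all. The class name `RawPdfBytes` and the method `readRaw` are made up for illustration, and the path is the one from the original post:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawPdfBytes {
    /** Reads the whole file into memory as bytes, with no text decoding. */
    public static byte[] readRaw(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        // Path from the original post; adjust it for your machine.
        Path pdf = Paths.get("c:\\zoutput.pdf");
        byte[] contents = readRaw(pdf);
        System.out.println("Read " + contents.length + " bytes");
    }
}
```

This does the same job as the read loop with a ByteArrayOutputStream, and the byte array can be passed along unchanged.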

I'm not sure why you're converting to UTF-8 in your output, though. Be forewarned that PDF documents may contain binary data, so converting them to UTF-8 will damage their contents. The binary data is usually images and sounds, which might not be critical to you. However, the document specification also allows sequences of arbitrary text objects to be compressed, so the text itself may not be readable in the raw bytes.
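To see the damage concretely, here is a small sketch that decodes bytes as UTF-8 and encodes them back. The class name `Utf8RoundTrip` and the sample byte values are made up for illustration; they are not taken from a real PDF:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    /** Decodes bytes as UTF-8, then encodes the resulting String back to bytes. */
    public static byte[] roundTrip(byte[] original) {
        // Invalid UTF-8 sequences are replaced with U+FFFD during decoding
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Plain ASCII, like the text parts of a PDF, survives the round trip
        byte[] ascii = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // Arbitrary binary bytes that are not valid UTF-8
        byte[] binary = {(byte) 0x89, (byte) 0xFF, (byte) 0xD8, 0x00, 0x41};

        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // prints true
        System.out.println(Arrays.equals(binary, roundTrip(binary))); // prints false
    }
}
```

Once the bytes have gone through a String, the original binary data cannot be recovered, which is why the raw bytes should be kept as they are if anything downstream needs to parse the file.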

Good Luck!
Aug 18 '10 #4
Thanks for responding!

Well, I need to parse it myself, since that way we can customize the search engine better. It would get complicated if I explained exactly what I'm trying to do, and it isn't really necessary here.

As for UTF-8, there was no specific reason. It was one of the trials that ended in an error :)

Anyway, I'm trying my best and will be sure to post back if I find a solution. Thanks, Oralloy and all.

Freddie
Aug 19 '10 #5
Oralloy
Freddie,

Try to avoid parsing PDF files yourself, if you can. They are not difficult to manipulate, but they are complex and have more than a few gotchas. Go buy a good tool to do it for you. The money you spend will be money well spent.

Anyway, good luck with your quest.
Aug 19 '10 #6