help in C/C++ to read .DOC & PDF files

hi
i am writing a C program which can read TEXT , PDF,.DOC files
the program is to :
count the number of words,
lines,
characters and the frequency of each word and the phrases count in the
file and gives the output in EXCEL
THIS program is working very fine for TXT ( text ) files
but
i need some help :: how to RUN this program to read PDF & .DOC
files
i can't paste the source code as it's too big
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

so that i can implement it in my program
any suggestions would be of great help
thanx in advance for the help

Jun 4 '06 #1
steve wrote:
hi
i am writing a C program which can read TEXT , PDF,.DOC files
the program is to :
count the number of words,
lines,
characters and the frequency of each word and the phrases count in the
file and gives the output in EXCEL
THIS program is working very fine for TXT ( text ) files
but
i need some help :: how to RUN this program to read PDF & .DOC
files

Why all the shouting?
i cant paste the source code as its too big
so please give me a sample C program which can read a DOC file or
a PDF file and


Simple? You'll be lucky.

Have a look at the xpdf or openoffice source to see why.

--
Ian Collins.
Jun 4 '06 #2
"steve" writes:

i am writing a C program which can read TEXT , PDF,.DOC files
the program is to :
count the number of words,
lines,
characters and the frequency of each word and the phrases count in the
file and gives the output in EXCEL
THIS program is working very fine for TXT ( text ) files
but
i need some help :: how to RUN this program to read PDF & .DOC
files
i cant paste the source code as its too big
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

so that i can implement it in my program
any suggestions would be of great help
thanx in advance for the help


This could really eat up the time if you insist it be highly automated. The
easiest, and highly manual, way is to export the .doc file as a text file
and then operate on that file. I don't know for sure, but it seems
reasonable that you might be able to do something similar with the .pdf
file. There may be third party - I don't mean to exclude freeware or
shareware - conversion programs.
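
Once the document has been exported to plain text, the counting step is the same as for any .txt file. Here is a minimal sketch in C, assuming a word is any run of non-whitespace bytes and a line ends at '\n'. The names `text_counts` and `count_text` are made up for this example and are not taken from the original poster's program.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Totals for one text stream. */
struct text_counts {
    unsigned long lines;  /* number of '\n' bytes seen */
    unsigned long words;  /* runs of non-whitespace bytes */
    unsigned long chars;  /* every byte read, whitespace included */
};

/* Read the stream to EOF and return its line, word and character counts. */
struct text_counts count_text(FILE *in)
{
    struct text_counts c = {0, 0, 0};
    int ch;
    int in_word = 0;

    assert(in != NULL);
    while ((ch = getc(in)) != EOF) {
        c.chars++;
        if (ch == '\n')
            c.lines++;
        if (isspace(ch)) {
            in_word = 0;          /* whitespace ends the current word */
        } else if (!in_word) {
            in_word = 1;          /* first byte of a new word */
            c.words++;
        }
    }
    return c;
}
```

To use it, open the exported file with fopen(path, "r"), pass the FILE pointer to count_text, and print the three fields. Word frequencies would need a table keyed by word on top of this, which is left out here.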
Jun 4 '06 #3
ben
In article <4e*************@individual.net>, Ian Collins
<ia******@hotmail.com> wrote:
steve wrote:
hi
i am writing a C program which can read TEXT , PDF,.DOC files
the program is to :
count the number of words,
lines,
characters and the frequency of each word and the phrases count in the
file and gives the output in EXCEL
THIS program is working very fine for TXT ( text ) files
but
i need some help :: how to RUN this program to read PDF & .DOC
files

Why all the shouting?
i cant paste the source code as its too big
so please give me a sample C program which can read a DOC file or
a PDF file and


Simple? You'll be lucky.

Have a look at the xpdf or openoffice source to see why.


steve,

the internal format of pdf is *really* complex in my opinion. even if
you're a brilliant programmer there's still an awful lot of hoops to
jump through, so i think it would require a lot of work. pdf's format spec
is over 1000 pages long. xpdf, as mentioned above, includes a command line
utility called pdftotext, which takes a pdf as input and outputs
plain text. it works on its own and doesn't require all the GUI stuff
round it. i suggest you get hold of pdftotext, which is in xpdf, whose
source is available for free, and make use of that. for the .doc format
i don't know.
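
One way to use xpdf's text extractor (the command is named `pdftotext`) from a C program is to run it as a separate process and then read the text file it writes. The sketch below assumes `pdftotext` is installed and on the PATH, and that `system()` hands the command to a POSIX-style shell, so single quotes are used to protect the file names. The helper names `build_pdftotext_command` and `pdf_to_text_file` are hypothetical, and paths that begin with '-' are not handled.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Append one character, always leaving room for the terminating NUL. */
static int put(char *buf, size_t cap, size_t *len, char c)
{
    if (*len + 1 >= cap)
        return -1;
    buf[(*len)++] = c;
    buf[*len] = '\0';
    return 0;
}

/* Append a plain string. */
static int put_str(char *buf, size_t cap, size_t *len, const char *s)
{
    for (; *s; ++s)
        if (put(buf, cap, len, *s) != 0)
            return -1;
    return 0;
}

/* Append s wrapped in single quotes; an embedded ' becomes '\'' . */
static int put_quoted(char *buf, size_t cap, size_t *len, const char *s)
{
    if (put(buf, cap, len, '\'') != 0)
        return -1;
    for (; *s; ++s) {
        int rc = (*s == '\'') ? put_str(buf, cap, len, "'\\''")
                              : put(buf, cap, len, *s);
        if (rc != 0)
            return -1;
    }
    return put(buf, cap, len, '\'');
}

/* Build "pdftotext 'in.pdf' 'out.txt'" in buf. Returns 0, or -1 if buf is too small. */
int build_pdftotext_command(char *buf, size_t cap, const char *pdf, const char *txt)
{
    size_t len = 0;

    if (cap == 0)
        return -1;
    buf[0] = '\0';
    if (put_str(buf, cap, &len, "pdftotext ") != 0 ||
        put_quoted(buf, cap, &len, pdf) != 0 ||
        put(buf, cap, &len, ' ') != 0 ||
        put_quoted(buf, cap, &len, txt) != 0)
        return -1;
    return 0;
}

/* Run the converter. Returns 0 if the command reported success, -1 otherwise. */
int pdf_to_text_file(const char *pdf, const char *txt)
{
    char cmd[4096];

    if (build_pdftotext_command(cmd, sizeof cmd, pdf, txt) != 0)
        return -1;
    return system(cmd) == 0 ? 0 : -1;
}
```

After pdf_to_text_file("report.pdf", "report.txt") returns 0, the existing text-file code can open report.txt with fopen and process it unchanged. The file names here are only examples.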
Jun 4 '06 #4

"steve" <ad********@gmail.com> wrote
hi
i am writing a C program which can read TEXT , PDF,.DOC files
the program is to :
count the number of words,
lines,
characters and the frequency of each word and the phrases count in the
file and gives the output in EXCEL
THIS program is working very fine for TXT ( text ) files
but
i need some help :: how to RUN this program to read PDF & .DOC
files
i cant paste the source code as its too big
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.
Jun 8 '06 #5
Malcolm said:

"steve" <ad********@gmail.com> wrote
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.
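
The experiment can be approximated in C. The sketch below is not the real strings(1); it assumes the file has already been read into memory, treats any byte accepted by isprint() as printable, and counts the printable runs of at least min_len bytes that contain a given word. The function name `count_string_hits` is made up for this example. PDF page content is commonly stored in compressed streams, so the words on the page usually do not appear as literal byte runs, and a search like this tends to come back empty.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count maximal printable runs of at least min_len bytes that contain needle. */
size_t count_string_hits(const unsigned char *data, size_t n,
                         size_t min_len, const char *needle)
{
    size_t hits = 0;
    size_t start = 0;               /* first byte of the current run */
    size_t nlen = strlen(needle);
    size_t i, k;

    for (i = 0; i <= n; ++i) {
        /* i == n acts as a sentinel that closes the final run */
        if (i < n && isprint(data[i]))
            continue;

        /* data[start..i) is a maximal printable run */
        if (i - start >= min_len && nlen > 0) {
            for (k = start; k + nlen <= i; ++k) {
                if (memcmp(data + k, needle, nlen) == 0) {
                    hits++;
                    break;          /* count each run at most once */
                }
            }
        }
        start = i + 1;
    }
    return hits;
}
```

Calling count_string_hits(buf, size, 4, "printf") on the bytes of a PDF gives a rough equivalent of piping strings output through grep, with 4 being the default minimum run length used by strings(1).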
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Jun 8 '06 #6
"Richard Heathfield" wrote:
Malcolm said:

"steve" <ad********@gmail.com> wrote
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.


I couldn't tell if that was a serious post or not. I thought it might be
"humour".
Jun 8 '06 #7
"Richard Heathfield" <in*****@invalid.invalid> wrote
Malcolm said:

"steve" <ad********@gmail.com> wrote
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.

The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.

We then apply the Viterbi algorithm. Essentially this runs the model in
reverse over the input, determining which mode is more likely to have
generated the sequence, and where the transition points must be.

Now if the embedded English strings are reasonably long, so the chance of
transition to gibberish is not too high, the algorithm will regard short
stretches of gibberish like C identifiers as English, on the balance of
probability. That is not to say that it will be perfect - if a string is
ended with a C identifier then the algorithm might well assign the
identifier to the gibberish. But it should do a reasonable job.

I'll try to find time to implement one.
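
For readers who want to see the shape of such a decoder, here is a small two-state Viterbi sketch in C. It is not the poster's implementation, and nothing in it is trained: the log-probabilities are hand-picked round numbers (assumptions made for this example), the "English" state simply favours letters, spaces and common punctuation, and the "gibberish" state treats all 256 byte values as equally likely. It labels raw bytes only, so it does nothing about the compressed or encrypted content raised elsewhere in the thread.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { GIBBERISH = 0, TEXT = 1, NSTATES = 2 };

/* Hand-picked log-probabilities, not trained values. */
static const double LOG_STAY = -0.05;    /* about ln(0.95): remain in the same state */
static const double LOG_SWITCH = -3.0;   /* about ln(0.05): change state */
static const double LOG_START = -0.69;   /* about ln(0.5): either state may start */

static double log_trans(int from, int to)
{
    return from == to ? LOG_STAY : LOG_SWITCH;
}

/* Log-probability of state emitting byte b. */
static double log_emit(int state, unsigned char b)
{
    int textlike = isalpha(b) || b == ' ' || b == '\n' ||
                   (b != 0 && strchr(".,;:'\"!?-", b) != NULL);

    if (state == GIBBERISH)
        return -5.55;                    /* about ln(1/256): uniform over bytes */
    return textlike ? -4.2 : -7.6;       /* TEXT strongly prefers text-like bytes */
}

/* Label each of the n bytes as TEXT or GIBBERISH along the most likely path.
   Returns 0 on success, -1 if memory for the backpointers cannot be allocated. */
int viterbi_label(const unsigned char *data, size_t n, unsigned char *labels)
{
    double score[NSTATES], next[NSTATES];
    unsigned char *back;   /* back[i * NSTATES + s]: best predecessor of s at byte i */
    size_t i;
    int s, state;

    if (n == 0)
        return 0;
    back = malloc(n * NSTATES);
    if (back == NULL)
        return -1;

    for (s = 0; s < NSTATES; ++s)
        score[s] = LOG_START + log_emit(s, data[0]);

    /* Forward pass: best score of any path ending in each state at byte i. */
    for (i = 1; i < n; ++i) {
        for (s = 0; s < NSTATES; ++s) {
            double from_g = score[GIBBERISH] + log_trans(GIBBERISH, s);
            double from_t = score[TEXT] + log_trans(TEXT, s);
            int prev = from_t > from_g ? TEXT : GIBBERISH;

            next[s] = (prev == TEXT ? from_t : from_g) + log_emit(s, data[i]);
            back[i * NSTATES + s] = (unsigned char)prev;
        }
        memcpy(score, next, sizeof score);
    }

    /* Backward pass: follow the backpointers from the best final state. */
    state = score[TEXT] > score[GIBBERISH] ? TEXT : GIBBERISH;
    for (i = n; i-- > 0; ) {
        labels[i] = (unsigned char)state;
        if (i > 0)
            state = back[i * NSTATES + state];
    }
    free(back);
    return 0;
}
```

After the call, the bytes whose label is TEXT are the candidate English runs. With these particular numbers a run needs roughly five text-like bytes in a row before switching states pays off, which matches the behaviour described above where short stretches are absorbed into whichever state surrounds them.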

--
Buy my book 12 Common Atheist Arguments (refuted)
$1.25 download or $7.20 paper, available www.lulu.com/bgy1mm

Jun 9 '06 #8
"Malcolm" <re*******@btinternet.com> writes:
"Richard Heathfield" <in*****@invalid.invalid> wrote
Malcolm said:
"steve" <ad********@gmail.com> wrote
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.

The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.

[...]

And is this supposed to work if the content is encrypted?

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Jun 10 '06 #9
Keith Thompson <ks***@mib.org> wrote:
"Malcolm" <re*******@btinternet.com> writes:
"Richard Heathfield" <in*****@invalid.invalid> wrote
Malcolm said:
"steve" <ad********@gmail.com> wrote
so please give me a sample C program which can read a DOC file or
a PDF file and
//*
print the same text in EXCEL as the output*//

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.

Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.

The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.

[...]

And is this supposed to work if the content is encrypted?


Or (even more likely than encrypted) compressed?
Jun 10 '06 #10

"Keith Thompson" <ks***@mib.org> wrote
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.

[...]

And is this supposed to work if the content is encrypted?

Depends on the quality of the encryption.
By Markov modelling of English / non-English encrypted texts you might be
able to distinguish between them. It would obviously work if the encryption
was a substitution cipher.
I suspect that decent encryption would require a much more sophisticated
attack.
--
Buy my book 12 Common Atheist Arguments (refuted)
$1.25 download or $7.20 paper, available www.lulu.com/bgy1mm

Jun 10 '06 #11
