extracting text from powerpoint file

hi,
I decided to extract the text from some powerpoint files. The results have
thrown up some questions.

When I use the 'char *valid' character array (in the program below) to
choose the characters to write in the new file... the result is totally
different to when I use the line with isalpha() and isdigit().

Yes .. There are more valid characters in the valid array but this is not
the problem .. Using it, I see extra spaces in the new file and it is more
difficult to read (in notepad there appears to be a space between each
character .. in wordpad there are boxes between characters).. why?

anyone care to investigate and enlighten me? .. the code is below; all you
need to do is comment and uncomment to achieve the differences I am talking
about

To use the program (with MS Windows) all you need to do is drag the file you
want to process onto the .exe file

cheers
cw

the program:
############

#include<stdio.h>
#include<stdlib.h> /* exit, system */
#include<string.h> /* strchr */
#include<ctype.h>

void writeFile(FILE *infile,FILE *outfile);

int main(int argc, char *argv[])
{
FILE *outfile = NULL; //the file to write to
FILE *infile = NULL; //the file to read

if((argc<2)||((infile=fopen(argv[1],"rb"))==NULL)||((outfile=fopen("new.txt","wb"))==NULL))
{
printf("error opening file - fatal error - goodbye");
getchar();
exit(1);
}
writeFile(infile,outfile);
fflush(stdout);
system("pause");
return 0;
}

void writeFile(FILE *infile,FILE *outfile)
{
char *valid =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
" \n.;:<>?/|\\!\"£$%^&*()_-=+,#~[]{}";

int byte;

while(1)
{
byte = fgetc(infile);/*read one byte*/
if(feof(infile)){break;}/*break from while at end of file*/

/*if(strchr(valid,byte))*/
if((isalpha(byte))||(isdigit(byte))||(byte==' ')||(byte == '\n'))
{
fputc(byte,outfile);
}
else
{ }

}
}

############
Nov 15 '05 #1
"code_wrong" <ta*@tac.ouch.co.uk> wrote:
<snip>
When I use the 'char *valid' character array (in the program below) to
choose the characters to write in the new file... the result is totally
different to when I use the line with isalpha() and isdigit().

Yes .. There are more valid characters in the valid array but this is not
the problem .. Using it, I see extra spaces in the new file and it is more
difficult to read (in notepad there appears to be a space between each
character .. in wordpad there are boxes between characters).. why?
<snip>
void writeFile(FILE *infile,FILE *outfile)
{
char *valid =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
" \n.;:<>?/|\\!\"£$%^&*()_-=+,#~[]{}";
You'd be better off declaring the array static, but that's not the
problem.
int byte;

while(1)
{
byte = fgetc(infile);/*read one byte*/
if(feof(infile)){break;}/*break from while at end of file*/

/*if(strchr(valid,byte))*/
I've only skimmed over your code, and won't comment on style flaws, but
the line above (the one giving you trouble, if uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string, so strchr(valid,0)
returns a non-null pointer and every 0 byte read from the file gets
written to the output. You want something like:

if( byte && strchr(valid,byte)) {
fputc(byte,outfile);
}
else
{ }

}
}


Best regards
--
Irrwahn Grausewitz (ir*******@freenet.de)
welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
clc frequent answers: http://benpfaff.org/writings/clc
Nov 15 '05 #2

"Irrwahn Grausewitz" <ir*******@freenet.de> wrote in message
news:e4********************************@4ax.com...

snip
I've only skimmed over your code, and won't comment on style flaws, but
the line above (the one giving you trouble, if uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string. You want something
like:

if( byte && strchr(valid,byte))


snip

Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
So I guess the program reads a null character in the file and writes it to
the output file ...

wonder why there are so many null characters in the powerpoint file (every
second character) ....interesting

cheers
cw


Nov 15 '05 #3

"code_wrong" <ta*@tac.ouch.co.uk> wrote in message
news:43**********@mk-nntp-2.news.uk.tiscali.com...

"Irrwahn Grausewitz" <ir*******@freenet.de> wrote in message
news:e4********************************@4ax.com...

snip
I've only skimmed over your code, and won't comment on style flaws, but
the line above (the one giving you trouble, if uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string. You want something
like:

if( byte && strchr(valid,byte))


snip

Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
So I guess the program reads a null character in the file and writes it to
the output file ...

wonder why there are so many null characters in the powerpoint file (every
second character) ....interesting


Well, it's a 'binary' file (as opposed to 'plain text'), in which embedded
zero characters are common. Your remark about 'every second character'
makes me guess that perhaps (at least part of) the data might be stored
as multibyte or 'wide' characters (e.g. Unicode). You might want to look
into that possibility.

-Mike
Nov 15 '05 #4
