extracting text from powerpoint file

hi,
I decided to extract the text from some powerpoint files. The results have
thrown up some questions.

When I use the 'char *valid' character array (in the program below) to
choose the characters to write in the new file... the result is totally
different to when I use the line with isalpha() and isdigit().

Yes .. There are more valid characters in the valid array but this is not
the problem .. Using it, I see extra spaces in the new file and it is more
difficult to read (in notepad there appears to be a space between each
character .. in wordpad there are boxes between characters).. why?

Anyone care to investigate and enlighten me? The code is below; all you
need to do is comment out one if line and uncomment the other to reproduce
the difference I am talking about.

To use the program (on MS Windows), just drag the file you want to process
onto the .exe file.

cheers
cw

the program:
############

#include<stdio.h>
#include<stdlib.h> /* exit, system */
#include<string.h> /* strchr */
#include<ctype.h>

void writeFile(FILE *infile,FILE *outfile);

int main(int argc, char *argv[])
{
FILE *outfile = NULL; //the file to write to
FILE *infile = NULL; //the file to read

if((argc<2)||((infile=fopen(argv[1],"rb"))==NULL)||((outfile=fopen("new.txt","wb"))==NULL))
{
printf("error opening file - fatal error - goodbye");
getchar();
exit(1);
}
writeFile(infile,outfile);
fclose(infile);
fclose(outfile);
fflush(stdout);
system("pause");
return 0;
}

void writeFile(FILE *infile,FILE *outfile)
{
char *valid =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
"\n.;:<>?/|\\!\"$%^&*()_-=+,#~[]{}";

int byte;

while(1)
{
byte = fgetc(infile);/*read one byte*/
if(feof(infile)){break;}/*break from while at end of file*/

/*if(strchr(valid,byte))*/
if((isalpha(byte))||(isdigit(byte))||(byte==' ')||(byte == '\n'))
{
fputc(byte,outfile);
}
else
{ }

}
}

############
Nov 15 '05 #1
3 Replies


"code_wrong" <ta*@tac.ouch.co.uk> wrote:
<snip>
When I use the 'char *valid' character array (in the program below) to
choose the characters to write in the new file... the result is totally
different to when I use the line with isalpha() and isdigit().

Yes .. There are more valid characters in the valid array but this is not
the problem .. Using it, I see extra spaces in the new file and it is more
difficult to read (in notepad there appears to be a space between each
character .. in wordpad there are boxes between characters).. why?
<snip>
void writeFile(FILE *infile,FILE *outfile)
{
char *valid =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
"\n.;:<>?/|\\!\"$%^&*()_-=+,#~[]{}";
You'd be better off declaring the array static, but that's not the
problem.
int byte;

while(1)
{
byte = fgetc(infile);/*read one byte*/
if(feof(infile)){break;}/*break from while at end of file*/

/*if(strchr(valid,byte))*/
I've only skimmed over your code, and won't comment on style flaws, but
the line above (the one giving you trouble if uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string, so strchr(valid,0)
returns a non-null pointer. You want something like:

if( byte && strchr(valid,byte)) {
fputc(byte,outfile);
}
else
{ }

}
}


Best regards
--
Irrwahn Grausewitz (ir*******@freenet.de)
welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
clc frequent answers: http://benpfaff.org/writings/clc
Nov 15 '05 #2
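
For anyone who wants to see the strchr behaviour in isolation, here is a minimal sketch. The helper name is_valid_byte is made up for illustration and is not part of the posted program; it just wraps the guarded test suggested above.

```c
#include <string.h>

/* Hypothetical helper (not in the original program).
   Returns 1 if byte is a non-zero character that occurs in valid,
   otherwise 0. The byte != 0 guard is needed because strchr treats
   the terminating '\0' of valid as part of the string, so
   strchr(valid, 0) returns a pointer to the terminator, not NULL. */
int is_valid_byte(const char *valid, int byte)
{
    return byte != 0 && strchr(valid, byte) != NULL;
}
```

With this in place, the test in the read loop could be written as if(is_valid_byte(valid,byte)), which rejects the zero bytes that the unguarded strchr call lets through.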


"Irrwahn Grausewitz" <ir*******@freenet.de> wrote in message
news:e4********************************@4ax.com...

snip
I've only skimmed over your code, and won't comment on style flaws, but
the line above (the one giving you trouble if uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string. You want something
like:

if( byte && strchr(valid,byte))


snip

Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
So I guess the program reads a null character in the file and writes it to
the output file ...

wonder why there are so many null characters in the powerpoint file (every
second character) ....interesting

cheers
cw


Nov 15 '05 #3
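
One way to check the "every second character" observation is to count the zero bytes at even and odd offsets of a chunk read from the file. The sketch below is only a diagnostic idea; the function name count_zero_bytes and its parameters are assumptions made for illustration, and the offsets are relative to the start of whatever buffer is passed in.

```c
#include <stddef.h>

/* Hypothetical diagnostic helper (not in the original program).
   Counts the zero bytes found at even and at odd offsets of buf.
   A strong imbalance between the two counts hints that the data
   contains 2-byte characters whose other byte is zero. */
void count_zero_bytes(const unsigned char *buf, size_t len,
                      size_t *even_zeros, size_t *odd_zeros)
{
    size_t i;

    *even_zeros = 0;
    *odd_zeros = 0;
    for (i = 0; i < len; i++) {
        if (buf[i] == 0) {
            if (i % 2 == 0)
                (*even_zeros)++;
            else
                (*odd_zeros)++;
        }
    }
}
```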


"code_wrong" <ta*@tac.ouch.co.uk> wrote in message
news:43**********@mk-nntp-2.news.uk.tiscali.com...
Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
So I guess the program reads a null character in the file and writes it to
the output file ...

wonder why there are so many null characters in the powerpoint file (every
second character) ....interesting


Well, it's a 'binary' file (as opposed to 'plain text'), in which embedded
zero characters are common. Your remark about 'every second character'
makes me guess that perhaps (at least part of) the data might be stored
as multibyte or 'wide' characters (e.g. Unicode). You might want to look
into that possibility.

-Mike
Nov 15 '05 #4
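
If Mike's guess is right, a rough way to pull readable text out of such data is to treat the bytes as 2-byte units and keep a unit only when its high byte is zero and its low byte is printable. The sketch below assumes little-endian 2-byte characters (low byte first) that start at an even offset of the buffer; neither assumption is confirmed anywhere in this thread, and the name extract_wide_ascii is made up for illustration.

```c
#include <stddef.h>
#include <ctype.h>

/* Hypothetical helper (not in the original program).
   Assumes in holds 2-byte little-endian characters starting at
   offset 0. Copies each unit whose high byte is 0 and whose low
   byte is printable ASCII or '\n' into out, and terminates out
   with '\0'. Returns the number of characters written, not
   counting the terminator. */
size_t extract_wide_ascii(const unsigned char *in, size_t len,
                          char *out, size_t out_cap)
{
    size_t i;
    size_t n = 0;

    if (out_cap == 0)
        return 0;

    for (i = 0; i + 1 < len && n + 1 < out_cap; i += 2) {
        unsigned char lo = in[i];     /* character code */
        unsigned char hi = in[i + 1]; /* zero for plain ASCII text */

        if (hi == 0 && (isprint(lo) || lo == '\n'))
            out[n++] = (char)lo;
    }
    out[n] = '\0';
    return n;
}
```

Text that happens to start at an odd offset would be missed by this loop, so a real extractor would have to locate the text records first rather than scan the whole file blindly.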
