Convert some files from html to plaintext

Luca Villa
#1: Nov 11 '07
I have many html files named like these:

c:\dir\femo-black.html
c:\dir\loren-white.html
c:\dir\spark-white.html
c:\dir\kim-black.html
c:\dir\paul-white.html

How can I convert only the files named "c:\dir\*-white.html" to
plaintext files named c:\dir\(original filename)-text.txt?

Is there a PHP module that does a good-quality conversion from HTML to
plaintext?


ZeldorBlat
#2: Nov 11 '07

On Nov 11, 1:05 pm, Luca Villa <lucavi...@cashette.com> wrote:
> I have many html files named like these:
>
> c:\dir\femo-black.html
> c:\dir\loren-white.html
> c:\dir\spark-white.html
> c:\dir\kim-black.html
> c:\dir\paul-white.html
>
> How can I convert only the files named "c:\dir\*-white.html" to
> plaintext files named c:\dir\(original filename)-text.txt?
>
> Is there a PHP module that does a good quality conversion HTML to
> plaintext?

See this:

<http://www.php.net/strip_tags>
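
A minimal sketch of the strip_tags approach applied to the original question. It assumes that "paul-white.html" should become "paul-white-text.txt" (the question leaves the exact naming open), and the function names textNameFor and convertWhiteFiles are made up for illustration. Note that strip_tags only removes markup; it does not reproduce layout such as blank lines after headings.

```php
<?php
// Hypothetical helper: maps "paul-white.html" to "paul-white-text.txt".
// Assumption: the ".html" extension is dropped before "-text.txt" is added.
function textNameFor(string $htmlPath): string
{
    return preg_replace('/\.html$/i', '', $htmlPath) . '-text.txt';
}

// Converts every *-white.html file in $dir and returns how many were written.
function convertWhiteFiles(string $dir): int
{
    $count = 0;
    // Forward slashes also work on Windows, e.g. 'c:/dir'.
    foreach (glob($dir . '/*-white.html') ?: [] as $htmlPath) {
        $html = file_get_contents($htmlPath);
        if ($html === false) {
            continue; // unreadable file, skip it
        }
        // Remove the tags, then turn entities such as &amp; back into characters.
        $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if (file_put_contents(textNameFor($htmlPath), $text) !== false) {
            $count++;
        }
    }
    return $count;
}

// Example call (directory taken from the question):
// convertWhiteFiles('c:/dir');
```

Files that do not match *-white.html, such as kim-black.html, are left untouched.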

Luca Villa
#3: Nov 11 '07

> See this:
>
> <http://www.php.net/strip_tags>

Isn't there something of higher quality, like the rendering engine of
the text-mode browser Lynx?

Oli Thissen
#4: Nov 11 '07

On Nov 11, 8:58 pm, Luca Villa <lucavi...@cashette.com> wrote:
> > See this:
> >
> > <http://www.php.net/strip_tags>
>
> Isn't there something of higher quality, like the rendering engine of
> the textual browser Lynx?

I guess you don't simply want to remove all the tags. Rather, you want
to make sure that the content of your <h1> element is followed by an
empty line, that your <p> elements are indented, and so on.

This might seem like overkill, but if all of your files have the
same structure, you might want to create an XSLT stylesheet and have PHP
transform the files into whatever structure you prefer.

Check out the PHP manual here http://www.php.net/ref.xsl and maybe
this tutorial on XSLT http://www.w3schools.com/xsl/
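
To make the idea concrete, here is a rough sketch of such a transformation. It assumes PHP's dom and xsl extensions are installed; the stylesheet, the constant TEXT_STYLESHEET and the function htmlToTextViaXslt are illustrative inventions that only handle <h1> and <p>, so a real stylesheet would need many more rules.

```php
<?php
// Requires PHP's dom and xsl extensions (an assumption about the installation).
// Illustrative stylesheet: <h1> text is followed by a blank line, <p> text is
// indented by four spaces, and all other text is dropped.
const TEXT_STYLESHEET = <<<'XSL'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="h1">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="p">
    <xsl:text>    </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>
XSL;

// Hypothetical helper: parses (possibly sloppy) HTML and applies the stylesheet.
function htmlToTextViaXslt(string $html): string
{
    // Collect parser warnings quietly; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xsl = new DOMDocument();
    $xsl->loadXML(TEXT_STYLESHEET);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    return (string) $proc->transformToXml($doc);
}
```

Loading the files with DOMDocument::loadHTML rather than as XML means they do not have to be well-formed XHTML for the stylesheet to be applied.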

Oli

Luca Villa
#5: Nov 11 '07

Oli, there are ready-made open source converters available, such as
Lynx, Links, ELinks, w3m, etc.
I don't think it makes sense to rewrite in XSLT what others have
already built over many years of work.
I was hoping that PHP had an integrated solution for this, like the
engine of one of the text browsers mentioned...

Ulf Kadner
#6: Nov 12 '07

Luca Villa wrote:
> Oli, there are ready and open source converters available like Lynx,
> Links, ELinks, W3M etc...
> I think that it's not the case to re-write with XSLT what's it's
> already done by others with many years of work.
> I hoped that PHP had an integrated solution for this, like the engine
> of one of the mentioned textual browsers...

That is usually very simple. Install Lynx on your server and call it
with one of PHP's command execution functions:

http://php.net/exec
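
Calling Lynx through exec could look roughly like this. The sketch assumes a "lynx" binary is on the PATH; its -dump option writes the rendered page to standard output and -nolist leaves out the trailing list of links. The function names lynxDumpCommand and lynxToText are invented for illustration.

```php
<?php
// Builds the command line. Both the binary name and the path are escaped so
// that spaces or shell metacharacters in file names cannot break the command.
function lynxDumpCommand(string $htmlPath, string $lynx = 'lynx'): string
{
    return escapeshellarg($lynx) . ' -dump -nolist ' . escapeshellarg($htmlPath);
}

// Runs Lynx on one file and returns the rendered text, or null on failure.
function lynxToText(string $htmlPath): ?string
{
    $lines = [];
    $status = 0;
    exec(lynxDumpCommand($htmlPath), $lines, $status);
    if ($status !== 0) {
        return null; // Lynx missing or the file could not be rendered
    }
    return implode("\n", $lines) . "\n";
}

// Example call (file name taken from the question):
// file_put_contents('c:/dir/paul-white-text.txt', lynxToText('c:/dir/paul-white.html'));
```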

You don't have any other options without a lot of work...

So long, Ulf

--
_,
_(_p Ulf [Kado] Kadner
\<_)
^^

Luca Villa
#7: Nov 12 '07

> Usually very simple. Install Lynx on youre server and call Lynx by one
> of the command executing functions of PHP:

That's the road I'm following, but calling an external program
thousands of times (I need to process thousands of files) is not very
efficient...

Ulf Kadner
#8: Nov 13 '07

Luca Villa wrote:
> > Usually very simple. Install Lynx on youre server and call Lynx by one
> > of the command executing functions of PHP:
>
> That's the road I'm following, but calling an external program
> thousands of times (I need to process thousand of files) is not much
> efficient...

Sure, it's no performance wonder :-)

You'd be better off writing a shell script that reads all the resources
from a file (maybe dynamically generated) and processes them with Lynx
in a loop. That's faster.
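
The dynamically generated list file could be produced by PHP itself. A rough sketch, with a made-up function name (writeFileList) and an assumed list format of one path per line; the shell loop that feeds each line to Lynx is not shown here.

```php
<?php
// Writes the paths of all *-white.html files in $dir to $listPath, one per
// line, and returns the number of paths written.
function writeFileList(string $dir, string $listPath): int
{
    // glob() returns the matches sorted alphabetically by default.
    $paths = glob($dir . '/*-white.html') ?: [];
    $body = $paths ? implode("\n", $paths) . "\n" : '';
    file_put_contents($listPath, $body);
    return count($paths);
}

// Example call (directory taken from the question, list name is arbitrary):
// writeFileList('c:/dir', 'c:/dir/white-files.txt');
```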

So long, Ulf

--
_,
_(_p Ulf [Kado] Kadner
\<_)
^^