How to filter the words in HTML ?

ChianHsieh@gmail.com
#1: Oct 19 '06
Hi,

I have a problem: I want to filter out all the words in an HTML document.

Example:

Before Filter:
<div id="pp">hello man <br/>Thank's for your answer. </div>

After Filter:
<div id="pp"><br/></div>

What I want is to preserve all the HTML tags but remove the words.
Are there any good packages or classes for this, or any suggestions?
Thank you very much.


Pedro Graca
#2: Oct 19 '06

re: How to filter the words in HTML ?


ChianHsieh@gmail.com wrote:
> Example:
>
> Before Filter:
> <div id="pp">hello man <br/>Thank's for your answer. </div>
>
> After Filter:
> <div id="pp"><br/></div>

I love good challenges :)


<?php
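// Return only the markup in $x (tags, comments, doctype), joined by $sep.
function get_html($x, $sep=' ') {
  $inbrackets = false; // currently between '<' and '>'
  $inquotes = false;   // currently inside a double-quoted attribute value
  $html = '';
  $l = strlen($x);
  for ($i = 0; $i < $l; ++$i) {
    $y = substr($x, $i, 1);
    // a double quote inside a tag opens or closes an attribute value
    if (($inbrackets) && ($y == '"')) {
      $inquotes = !$inquotes;
    }
    // an unquoted '<' starts a tag; separate it from the previous one
    if ((!$inquotes) && ($y == '<')) {
      if ($i > 0) {
        $html .= $sep;
      }
      $inbrackets = true;
    }
    if ($inbrackets) {
      $html .= $y;
    }
    // an unquoted '>' ends the tag
    if ((!$inquotes) && ($y == '>')) {
      $inbrackets = false;
    }
  }
  return $html;
}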

$data = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>example</title></head>
<body>
<div id="pp">
<a name="funky><name"></a>
<!-- *VALID*^^*HTML* -->
hello man<br/>
Thank's for your answer.
</div>
</body>
</html>
HTML;

$html = get_html($data, "\n");
echo $html;
?>

Or you could try a regular expression (but I'm not sure you could do one
that accepts all valid HTML).
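
For what it's worth, here is a rough sketch of what such a regular expression might look like. The function name get_html_regex is made up for illustration, and the pattern only assumes that a tag is a '<', followed by quoted strings or other characters, followed by a '>'. It is not a claim that this covers all valid HTML.

```php
<?php
// Hypothetical helper (not from any library): keep only tag-like
// substrings of $x and join them with $sep.
function get_html_regex($x, $sep = ' ') {
    // A "tag" here is '<', then any mix of double-quoted strings,
    // single-quoted strings, or characters that are neither a quote
    // nor '>', then the closing '>'.
    $pattern = '/<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>/';
    preg_match_all($pattern, $x, $matches);
    return implode($sep, $matches[0]);
}

// Usage: the example from the original question, with no separator.
echo get_html_regex(
    '<div id="pp">hello man <br/>Thank\'s for your answer. </div>',
    ''
);
```

It copes with a '>' inside a quoted attribute value, but a comment containing a lone apostrophe or quote will not be matched correctly, which is the kind of case that makes me doubt a regex can handle everything.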

--
File not found: (R)esume, (R)etry, (R)erun, (R)eturn, (R)eboot
p.lepin@ctncorp.com
#3: Oct 20 '06

re: How to filter the words in HTML ?



ChianHsieh@gmail.com wrote:
> I face some problem that I want to filter the all words
> in HTML.
>
> Before Filter:
> <div id="pp">hello man <br/>Thank's for your answer.
> </div>
>
> After Filter:
> <div id="pp"><br/></div>

Forget regexes. As the saying goes, 'You cannot parse HTML
with regexes'. There's also no reason to write your own
HTML parser -- there already are more than enough of those.

XSLT was meant exactly for this type of processing, and it
doesn't really care what you're processing, as long as it's
a DOMDocument.

Using PHP5's DOM and XSL modules:

<?php
$xml_str =
'<div id="pp"><p>hello man <br/>Thank\'s for your ' .
'answer. </div>' ;
$xsl_str =
'<xsl:stylesheet ' .
' xmlns:xsl="http://www.w3.org/1999/XSL/Transform" ' .
' version="1.0">' .
' <xsl:template match="node()|@*">' .
' <xsl:copy>' .
' <xsl:apply-templates select="node()|@*"/>' .
' </xsl:copy>' .
' </xsl:template>' .
' <xsl:template match="html">' .
' <xsl:apply-templates/>' .
' </xsl:template>' .
' <xsl:template match="body">' .
' <result>' .
' <xsl:apply-templates/>' .
' </result>' .
' </xsl:template>' .
' <xsl:template match="text()"/>' .
' </xsl:stylesheet>' ;

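// loadHTML() and loadXML() are instance methods, so create the
// documents first instead of calling them statically
$xml = new DOMDocument ( ) ;
$xml -> loadHTML ( $xml_str ) ;
$xsl = new DOMDocument ( ) ;
$xsl -> loadXML ( $xsl_str ) ;
$xform = new XSLTProcessor ( ) ;
$xform -> importStylesheet ( $xsl ) ;
$result = $xform -> transformToDoc ( $xml ) ;
header ( 'Content-type: text/xml' ) ;
print ( $result -> saveXML ( ) ) ;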
?>

If you're using real XHTML (as opposed to mumbo jumbo tag
soup pretending to be XHTML), it's even better, because you
don't have to pretend you're processing XML. XHTML *is*
XML.
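
As a minimal sketch of that last point, assuming the input is a well-formed XHTML fragment with a single root element, the text nodes can also be dropped with plain DOM and XPath calls instead of a stylesheet. The function name strip_text_nodes is hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helper: parse a well-formed XHTML fragment as XML and
// remove every text node, keeping elements and attributes.
function strip_text_nodes($xhtml) {
    $doc = new DOMDocument();
    // loadXML() returns false (and emits warnings) on malformed input.
    if (!$doc->loadXML($xhtml)) {
        return false;
    }
    $xpath = new DOMXPath($doc);
    // Copy the matches into an array before removing anything.
    $nodes = iterator_to_array($xpath->query('//text()'));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    // Serialize only the root element, without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

// Usage: the example from the original question, written as XHTML.
echo strip_text_nodes(
    '<div id="pp">hello man <br/>Thank\'s for your answer. </div>'
);
```

This only works when the markup really is XML. For tag soup, loadHTML() would still be needed, as in the XSLT example.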

--
Pavel Lepin
