>I've tried
>'<script[\S\s]*/script>'
but that didn't work properly. I'm fairly basic in my knowledge of
Python, so I'm still trying to learn re.
What pattern would work?
I use re.compile("<script.*?</script>",re.DOTALL)
for scripts. I strip this out first since my tag stripping re will
strip out script tags as well hope this was of help.
This won't catch various alterations of
<
script
>
doEvil()
<
/
script
>
which is valid html/xhtml.
For less valid html, but still attemptable, one might find
something like
<scrip<script>hah</script>t>doEvil()</script>
which, if you nuke your pattern, leaves the valid but unwanted
<script>doEvil()</script>
I'd propose that it's better to use something such as
BeautifulSoup that actually parses the HTML, and then skim
through it whitelisting the tags you plan to allow, and skipping
the emission of any tags that don't make the whitelist.
-tkc