"louis" <lo************@gmail.com> wrote:
It's painful, but you can loop thru every character. Use the ASCII
function to identify and remove any non-alphanumerics (0-9 or a-z or
A-Z or space or period or comma ...).
I've written software (as a standalone utility, not in the context of
SQL) that goes the other way, but can't offer anything that helps in
this direction. I can offer some advice though.
You need to be a bit more careful than outlined above. In RTF the
backslash "\" and brace characters "{}" are reserved. RTF is
essentially a markup language and the "tags" start with a backslash,
and can contain alphanumerics (typically alphas and then - optionally
- numerics). Braces are used to delimit sections. Some sections
such as those in the header (info and font tables) can be entirely
discarded from the visible output.
Braces and backslashes in the text are escaped - IIRC - with a
backslash. Using this, you could indeed convert most of the RTF
to text.
This approach would recover most text, but some features such as
lists might come out strange, and it wouldn't be formatted nicely,
unless you wanted to honour the \par and \line tags to give line
formatting, but in the context of insertion into a database I guess
that's the most you's want to do.
Also, depending on the source of the RTF you may be dealing with easy
to parse snippets, as opposed to a fully-featured document.
I should imagine that programming the above in SQL would be - as you
rightly point out - quite painful.
--
HTML-to-text and markup removal with Detagger
http://www.jafsoft.com/detagger/