I've done a Google Groups search, read the 1,000-odd articles on this
group currently present on my news server, and read the FAQ at
http://www.htmlhelp.org/faq/html/all.txt, but I can't find any hint of
an answer to this question, so here goes....

I'm attempting to construct a Lua script to convert basic HTML into a
form which can be imported by Impression Publisher (a popular DTP
application for my platform). I've already come to the conclusion
that I'm attempting to reinvent the wheel, in that my program
essentially has to be able to parse the HTML itself in order to output
similar markup in Impression's DDF (Document Description Format) - thus
I'm practically ending up trying to write a Web browser, and all that
implies ;-)

However, my problem at the moment is that my simplistic approach to
linefeed characters isn't working. (My attempt to construct a regular
expression that will match any and all of \n, \r, \t and the space
character isn't currently working either, but that's another problem.)
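
For the pattern itself, here is a minimal sketch, assuming Lua 5.1 or
later (where strings have a gsub method). Lua patterns are not full
regular expressions, but a bracketed set followed by + should match any
run of the four characters. The name collapse_whitespace is made up for
illustration.

```lua
-- Collapse every run of space, tab, CR and LF into a single space.
-- The escapes \t, \r and \n are expanded by Lua's string literal
-- syntax, so the set contains the actual control characters.
local function collapse_whitespace(s)
  -- The extra parentheses drop gsub's second result (the match count).
  return (s:gsub("[ \t\r\n]+", " "))
end

print(collapse_whitespace("<p>\r\n\t<i>\n  Some text\n</i>"))
-- prints: <p> <i> Some text </i>
```

The class %s would also work, but it matches a few extra characters
(vertical tab and form feed in the default locale), so the explicit set
is closer to what is described above.
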
I think I must have misunderstood how browsers cope with whitespace.
What I'm doing at the moment is performing a preliminary scan of the
entire document to convert all whitespace into single space characters,
then attempting to convert the tags as I come to them. The problem
comes with HTML like this:

<blockquote>
<p>
<i>
Some text....
</i>
</p>
<p> [etc]

Using my simplistic approach, the output becomes

<blockquote> <p> <i> Some text...

which is then translated into

{"Blockquote" on} \n {italic on} Some text

with the result that my 'paragraph' now starts with two spaces:

  Some text (in italics)

Browsers don't do this, so obviously I'm interpreting something wrong.

(I realise that there is a whole can of worms related to <pre> formatted
text which I have as yet not even attempted to consider, but I'd like to
get this one right first....)

Do I need to collapse the whitespace *last*? (If so, I'm going to
have to do a separate pass to cope with <BR> and <P> tags, I think.)
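
One way the "collapse last" idea might look, as a rough sketch only: it
assumes that whitespace should be collapsed while text is being output,
and that any whitespace at the start of a line (after a paragraph or
line break) should simply be dropped. That is my working assumption
about what browsers do, not something taken from a specification. The
names new_writer, text, markup and line_break are all hypothetical, and
{italic off} is a guess at the closing markup. Assumes Lua 5.1 or later.

```lua
-- Hypothetical output helper: whitespace is collapsed as text is
-- emitted, not in a preliminary pass over the whole document.
local function new_writer()
  local w = { out = {}, at_line_start = true, pending_space = false }

  -- Emit character data found between two tags.
  function w:text(s)
    if s == "" then return end
    -- Remember leading whitespace, but do not output it yet.
    if s:find("^[ \t\r\n]") then self.pending_space = true end
    for word in s:gmatch("[^ \t\r\n]+") do
      -- A space is written only between words, never at a line start.
      if self.pending_space and not self.at_line_start then
        self.out[#self.out + 1] = " "
      end
      self.out[#self.out + 1] = word
      self.at_line_start = false
      self.pending_space = true  -- later words in this run are separated
    end
    -- After the run, a space is owed only if it ended in whitespace.
    self.pending_space = s:find("[ \t\r\n]$") ~= nil
  end

  -- Inline markup (e.g. {italic on}); leaves the whitespace state alone.
  function w:markup(s)
    self.out[#self.out + 1] = s
  end

  -- Paragraph or line break: any owed whitespace is discarded.
  function w:line_break(s)
    self.out[#self.out + 1] = s or "\n"
    self.at_line_start = true
    self.pending_space = false
  end

  function w:result()
    return table.concat(self.out)
  end

  return w
end

-- The fragment from the question, fed in tag by tag.
local w = new_writer()
w:markup('{"Blockquote" on}')   -- <blockquote>
w:text("\n")
w:line_break("\n")              -- <p>
w:text("\n")
w:markup("{italic on}")         -- <i>
w:text("\nSome text....\n")
w:markup("{italic off}")        -- </i> (placeholder markup)
w:text("\n")
w:line_break("\n")              -- </p>
print(w:result())
```

With this arrangement the paragraph begins directly with {italic on}Some
text.... and no leading spaces, and <BR> or <P> can be handled in the
same pass by calling line_break when the tag is met.
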
--
Harriet Bazley == Loyaulte me lie ==
Positive: Mistaken at the top of one's voice.