Parsing linefeeds correctly

I've done a Google Groups search, read the 1,000-odd articles on this
group currently present on my news server, and read the FAQ at
http://www.htmlhelp.org/faq/html/all.txt, but I can't find any hint of
an answer to this question, so here goes....
I'm attempting to construct a Lua script to convert basic HTML into a
form which can be imported by Impression Publisher (a popular DTP
application for my platform). I've already come to the conclusion
that I'm attempting to reinvent the wheel, in that my program
essentially has to be able to parse the HTML itself in order to output
similar markup in Impression's DDF (Document Description Format) - thus
I'm practically ending up trying to write a Web browser, and all that
implies ;-)

However, my problem at the moment is that my simplistic approach to
linefeed characters isn't working. (My attempt to construct a regular
expression that will match any and all of \n, \r, \t and the space
character isn't currently working either, but that's another problem.)
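
For reference, a character class covering exactly those four characters can be written as shown below. This is only a minimal sketch, given in Python rather than Lua; the names `WHITESPACE_RUN` and `collapse_whitespace` are made up for illustration. The same bracket class, `[ \t\r\n]+`, should in principle also work as a Lua pattern passed to `string.gsub`, but that is an assumption and is not tested here.

```python
import re

# One or more of: space, tab, carriage return, linefeed.
WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text):
    """Replace every run of whitespace with one space (hypothetical helper)."""
    return WHITESPACE_RUN.sub(" ", text)


print(repr(collapse_whitespace("<p>\r\n\t<i>\nSome   text....\n</i>")))
# prints '<p> <i> Some text.... </i>'
```

Note that this reproduces the preliminary scan described next, including the stray spaces it leaves between tags.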

I think I must have misunderstood how browsers cope with whitespace.
What I'm doing at the moment is performing a preliminary scan of the
entire document to convert all whitespace into single space characters,
then attempting to convert the tags as I come to them. The problem
comes with HTML like this:

<blockquote>
<p>
<i>
Some text....
</i>
</p>
<p> [etc]

Using my simplistic approach, the output becomes
<blockquote> <p> <i> Some text...

which is then translated into {"Blockquote" on} \n {italic on} Some text

with the result that my 'paragraph' now starts with two spaces:
  Some text (in italics)

Browsers don't do this, so obviously I'm interpreting something wrong.
(I realise that there is a whole can of worms related to <pre> formatted
text which I have as yet not even attempted to consider, but I'd like to
get this one right first....)

Do I need to collapse the whitespace *last*? (If so, I think I'm going
to have to do a separate pass to cope with <BR> and <P> tags.)
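
One way to picture collapsing the whitespace while walking the document, rather than in a preliminary scan, is sketched below. It is in Python rather than Lua and rests on several assumptions: tags are well formed, <pre> is ignored entirely, and `BLOCK_TAGS` is a small hypothetical list of tags that start a new line of output. All names are illustrative. The idea is that whitespace is only remembered as a pending space, which is written when more text follows on the same line and dropped when a block-level tag or <BR> comes first. That is roughly the effect browsers produce, though it is not claimed to be their exact algorithm.

```python
import re

# Assumption: an illustrative, incomplete set of tags that begin a new line.
BLOCK_TAGS = {"blockquote", "p", "br", "div", "li", "h1", "h2", "h3"}

TOKEN = re.compile(r"<[^>]*>|[^<]+")            # one tag, or one run of text
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")
SPLIT_WS = re.compile(r"([ \t\r\n]+)")          # keeps the whitespace pieces


def normalise(html):
    """Collapse whitespace token by token instead of in a first pass."""
    out = []
    at_line_start = True     # no text written yet since the last block tag
    pending_space = False    # whitespace seen but not yet written
    held_tags = []           # inline tags seen after the pending space

    for token in TOKEN.findall(html):
        if token.startswith("<"):
            m = TAG_NAME.match(token)
            if m and m.group(1).lower() in BLOCK_TAGS:
                # Block boundary: drop the pending space, keep the held tags.
                out.extend(held_tags)
                held_tags.clear()
                out.append(token)
                at_line_start = True
                pending_space = False
            elif pending_space:
                held_tags.append(token)   # decide about the space later
            else:
                out.append(token)
            continue

        for piece in SPLIT_WS.split(token):
            if not piece:
                continue
            if piece[0] in " \t\r\n":
                # Whitespace at the start of a line is ignored outright.
                if not at_line_start:
                    pending_space = True
            else:
                if pending_space:
                    out.append(" ")
                out.extend(held_tags)
                held_tags.clear()
                out.append(piece)
                at_line_start = False
                pending_space = False

    out.extend(held_tags)    # trailing whitespace is simply dropped
    return "".join(out)


sample = "<blockquote>\n<p>\n<i>\nSome text....\n</i>\n</p>\n<p>"
print(normalise(sample))
# prints <blockquote><p><i>Some text....</i></p><p>
```

With this approach the tag translation can run over the normalised output, or be folded into the same loop, so a separate pass for <BR> and <P> may not be needed.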

--
Harriet Bazley == Loyaulte me lie ==

Positive: Mistaken at the top of one's voice.
Jul 23 '05 #1