Text parser

Hi all,

I'm working on a little app that will go through a text file (right now a
"rich text" document) and parse it into a pseudo-HTML that our Flash
programmers can use in their presentation.

I'm having a lot of trouble, because the RTF format is quite complicated.
At first we thought there was no "nesting" of formatting, but
every once in a while it seems like there is. Also, depending on the
complexity of the original document, we may end up with lots of
undecipherable syntax. In other words, it's not as simple as:

{\b this is bold text}{\b\i this is bold and italicised text}

because every once in a while you'll have something like:

{\b bold text}\d\adsfaa\adsagd\aeaqwewe\a\\\\asdf\\{\b\i this is bold and
italicised text}{{/das/dd /d More text} /d/as///jh/}

So there's no way to easily break the content into just the {\format text}
definitions.

It all means something, I'm sure, but rather than try to re-work the whole
RTF spec into my format, I was hoping there was a simpler format
that the text could be saved as before parsing. The originals are Word
documents. The target pre-parsing format simply needs to include line
breaks, bolding, italicising, and underlining. All other formatting can go
out the window.

There are commercial components that handle RTF -> HTML, but that's not
really what I need, and I would have to re-parse their output anyway.

Is there a format that does this? Or does anyone have any good ideas?
Thanks for any input,

MCD
Nov 15 '05 #1
You could probably do this with a custom clipboard format, but I expect
that would be as much work as you are already facing.

You will probably have better luck posting this in one of the SDK groups,
such as microsoft.public.win32.programmer.ui or maybe
microsoft.public.platformsdk.shell. The folks there may be more familiar
with this than straight dotnet programmers.

Thank you for choosing the MSDN Managed Newsgroups,

John Eikanger
Microsoft Developer Support
--------------------
| From: "Big D" <a@a.com>
| Subject: Text parser
| Date: Tue, 2 Mar 2004 14:44:24 -0700

Nov 15 '05 #2
This is a kludge, but I thought I'd throw it out there if you are strapped
for ideas. The RichTextBox control has two members that may be of use to
you, once you load your rtf into the control:

RichTextBox.Select(int start, int length)

and the property

RichTextBox.SelectionFont

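Between those two, you could programmatically figure out what has been made
bold, underlined, or italicised: select a run of text, then check the Bold,
Italic, and Underline properties of the Font that SelectionFont returns.
You could also pick out line breaks from the control's text.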

It may not be an efficient algorithm, but it will be easy to write.
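
As a rough illustration of that idea, here is a minimal sketch, not a tested implementation. It assumes a Windows Forms project, and the class name RtfStyleScanner, the method names ToPseudoHtml and AppendTags, and the output tags (<b>, <i>, <u>, <br>) are all made up for the example, since the actual pseudo-HTML format isn't specified in this thread. It also assumes that line breaks show up as '\n' in the control's Text and that Text indices line up with Select positions.

```csharp
using System;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

static class RtfStyleScanner
{
    // Only these three styles are tracked; everything else is ignored.
    const FontStyle Tracked = FontStyle.Bold | FontStyle.Italic | FontStyle.Underline;

    // Loads the RTF into an off-screen RichTextBox and walks it one character
    // at a time, emitting tags whenever the bold/italic/underline state changes.
    // The tag names are assumptions, not a known target format.
    public static string ToPseudoHtml(string rtf)
    {
        using (RichTextBox box = new RichTextBox())
        {
            box.Rtf = rtf;
            string text = box.Text;
            StringBuilder output = new StringBuilder();
            FontStyle current = FontStyle.Regular;

            for (int i = 0; i < text.Length; i++)
            {
                box.Select(i, 1);
                Font font = box.SelectionFont;   // can be null for a mixed selection
                FontStyle style = (font != null ? font.Style : FontStyle.Regular) & Tracked;

                if (style != current)
                {
                    AppendTags(output, current, true);   // close the previous run
                    AppendTags(output, style, false);    // open the new run
                    current = style;
                }

                // Assumption: the control reports line breaks as '\n'.
                if (text[i] == '\n')
                    output.Append("<br>");
                else
                    output.Append(text[i]);
            }

            AppendTags(output, current, true);   // close whatever is still open
            return output.ToString();
        }
    }

    // Opens tags in the order b, i, u and closes them in reverse order,
    // so the emitted tags always nest properly.
    static void AppendTags(StringBuilder output, FontStyle style, bool closing)
    {
        string[] names = { "b", "i", "u" };
        FontStyle[] flags = { FontStyle.Bold, FontStyle.Italic, FontStyle.Underline };

        for (int n = 0; n < names.Length; n++)
        {
            int k = closing ? names.Length - 1 - n : n;
            if ((style & flags[k]) != 0)
                output.Append(closing ? "</" : "<").Append(names[k]).Append('>');
        }
    }
}
```

Calling RtfStyleScanner.ToPseudoHtml(File.ReadAllText(path)) on a saved .rtf file would be one way to try it. Selecting one character at a time is slow on large documents, which is the inefficiency mentioned above; the upside is that the control does all of the RTF parsing, so nested groups and unknown control words never have to be handled by hand.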

Erik
Nov 15 '05 #3
