Markup to Text

Hello grp:

I have a situation that I was hoping someone might be able to suggest a
solution for. I am retrieving HTML from a URL and storing it in SQL Server.
Our web service supplies this data to our clients via a web service that is
itself a client of the ws, and to integration clients as XML data (the HTML
is wrapped in CDATA). We have an integration client who cannot accept HTML
embedded in the XML, for whatever reason. Because of the large volume this
client generates, we are tasked with coming up with a robust solution to
convert the HTML markup to an equivalent text representation. In the short
term, we are stripping the HTML formatting with regex replacements, but this
solution is neither robust nor particularly effective, because table
structures do not translate well. I have been looking for an 'HTML Stripper'
tool, but searching Google hasn't yielded much. Most of these tools are
either GUI-based or require files for input. Neither option will work for us,
since this needs to run in a 'service' context without user interaction.

Does anyone know of an effective way to handle the markup, or of a COM
object library that provides support for HTML tables?

Some things we have tried to date include the HTMLDocument class (formatting
is not preserved), XHTML conversion followed by an XSLT transform (effective
but not very efficient) and, as already mentioned, regex.

Any help is much appreciated,

Alex
Nov 15 '05 #1
Have you considered using the Internet Explorer component framework to load
the HTML and provide an object model that you can program against?
Internet Explorer should be able to provide most of the functionality needed
to extract the data you require. There may also be HTML parsers available as
ready-made components; I am sure there are some ActiveX parsers.
If there aren't, have you also considered building the parser yourself?
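
To give an idea of what building the parser might involve, here is a minimal sketch in Python (not the COM/.NET environment described in the question, so treat it as an illustration of the approach only). It uses the standard library's html.parser and works string in, string out, with no files or GUI. The names HtmlToText and html_to_text, and the choice to render each table row as one line with cells separated by " | ", are assumptions made for this example. Nested tables, colspan/rowspan and column alignment are not handled.

```python
from html.parser import HTMLParser


class HtmlToText(HTMLParser):
    """Hypothetical helper: collects text; each table row becomes one line."""

    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []   # finished output lines
        self.buf = []     # text fragments of the current line or cell
        self.row = None   # list of cell texts while inside a <tr>, else None
        self.skip = 0     # nesting depth of script/style elements

    def _flush_text(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _end_line(self):
        text = self._flush_text()
        if text:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip += 1
        elif tag == "tr":
            self._end_line()
            self.row = []
        elif tag in self.BLOCK_TAGS and self.row is None:
            self._end_line()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag in ("td", "th") and self.row is not None:
            self.row.append(self._flush_text())
        elif tag == "tr" and self.row is not None:
            self.lines.append(" | ".join(self.row))
            self.row = None
        elif tag in self.BLOCK_TAGS and self.row is None:
            self._end_line()

    def handle_data(self, data):
        if not self.skip:
            self.buf.append(data)


def html_to_text(html):
    """Convert an HTML string to plain text, one output line per block or row."""
    parser = HtmlToText()
    parser.feed(html)
    parser.close()
    parser._end_line()  # emit any trailing text not closed by a block tag
    return "\n".join(parser.lines)
```

Calling html_to_text() on the HTML string pulled from the database returns the plain-text version, so it can run inside a service without user interaction. Because the parser tracks where rows and cells begin and end, the table structure survives in a way that is hard to achieve with regex replacements alone.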

Nick.

Nov 15 '05 #2
