Is there a need for semi-automatic or automatic text-to-XML conversion?

Hi, I'm new to this newsgroup and I'd like to ask some questions and
get opinions on this subject.

I'm developing an application focused on a very specific task: cleaning
and labelling text documents with user-defined structural tags (title,
cite, date, paragraph, itemList, ...). It performs the typical
pre-processing tasks needed in computational linguistics to prepare
large corpora for statistical tools.

But I'm worried that this field is too small/specific. I chose it
because it's a field that I know and where I have some contacts, *but* I'm
not sure whether university research departments are able to spend
money on software, or whether they are too used to the free/open
source world.

For this reason I'm looking for other fields where the task of
adding structural labels to text is needed (specifically, converting
unstructured, format-oriented documents to structured,
function-oriented XML documents). Maybe some area of publishing, but I
think publishers will not be interested in "small" desktop applications.

Has any of you worked for, or heard of, a business with
this kind of need? Do you think there is demand for legacy
document conversion in small businesses?

Some info about the application:

- Importing from the main document formats (TXT, HTML, RTF, others?).
- GUI-based interactive labelling (active learning techniques,
similar to OCR programs).
- Interactive labelling is used to "train" the program by automatic
induction of statistical rules (based on textual, lexical,
typographical and structural properties of the block).
- After training, the labeller can be used for batch processing in a
fully autonomous mode.
- Exporting to user-defined XML (any standard? DocBook? TEI?)
- A lot of small cleaning and normalization tasks: removing headers,
de-hyphenation, reconstruction of paragraphs with broken lines,
removing non-textual or decorative elements (such as ASCII art), ...
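
Two of the cleaning tasks in the list above, de-hyphenation and the reconstruction of paragraphs with broken lines, can be sketched in a few lines of Python. This is only a minimal illustration, not the application's actual implementation: the function names `dehyphenate`, `rebuild_paragraphs` and `clean` are hypothetical, and the sketch assumes that a blank line marks a paragraph break and that every hyphen at the end of a line is a line-break hyphen.

```python
import re


def dehyphenate(text):
    """Join words split by a hyphen at a line break ('conver-\\nsion' -> 'conversion').

    Assumption: every end-of-line hyphen is a soft hyphen. A real tool would
    need a lexicon to keep true compounds such as 'format-oriented'.
    """
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def rebuild_paragraphs(text):
    """Split on blank lines, then join the broken lines inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs


def clean(text):
    """Run both normalization steps and return a list of paragraphs."""
    return rebuild_paragraphs(dehyphenate(text))


sample = "Legacy docu-\nment conver-\nsion is\nneeded.\n\nSecond para-\ngraph."
print(clean(sample))
# ['Legacy document conversion is needed.', 'Second paragraph.']
```

De-hyphenation runs first so that the joined words are already whole when the lines of a paragraph are merged. Removing headers or ASCII art would need further rules or the trained labeller itself, and is not covered by this sketch.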

I think that legacy document conversion may be a need for many
businesses, but I haven't been able to find them. Maybe some of you can
give me a clue?

Thanks very much in advance.

Dec 23 '05 #1

Francesc wrote:
> Hi, I'm new to this newsgroup and I'd like to ask some questions and
> get opinions on this subject.
>
> EDITED FOR BREVITY
>
> I think that legacy document conversion may be a need for many
> businesses, but I haven't been able to find them. Maybe some of you can
> give me a clue?
>
> Thanks very much in advance.


You might want to look at companies like Exegenix (www.exegenix.com).

There are a number of vendors who provide XML conversion software.

Dec 27 '05 #2
Thanks for the link. This program seems to be very similar to what I'm
doing but, as is often the case, the company is focused on "big services". I wonder
why there are no "small desktop applications" to help taggers
automate their labelling tasks.

The good news is that if there is a market to support a big company like
Exegenix, surely there is a market to support a small company like mine. :)

Francesc

Dec 28 '05 #3
Francesc wrote:
> I'm worried that this field is too small/specific. I chose it
> because it's a field that I know and where I have some contacts, *but* I'm
> not sure whether university research departments are able to spend
> money on software, or whether they are too used to the free/open
> source world.

Yes Francesc, such research efforts are undertaken by universities and
publishing houses around the world. The single most important reason
is that XML is customizable to a much greater extent than HTML.
Many such organizations have spent (and are still spending)
money and effort on this front. But I haven't heard of any commercial
application that can convert text to XML in one go.

> For this reason I'm looking for other fields where the task of
> adding structural labels to text is needed (specifically, converting
> unstructured, format-oriented documents to structured,
> function-oriented XML documents). Maybe some area of publishing, but I
> think publishers will not be interested in "small" desktop applications.

Now, that sounds interesting, and yes, there may be publications (I am
not sure on this front) that need to convert unstructured
information to XML. But again, isn't XML format-oriented itself? The
basic purpose of text-to-XML conversion (in publishing houses and
universities) is that the XML-ized documents are added to a data bank, from
which they can be searched and sorted. I believe this is
possible only by structuring them.
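
To make the point about searchable, structured documents concrete, here is a minimal Python sketch using the standard-library `xml.etree.ElementTree` module. It assumes the labeller has already produced (label, text) pairs; the function name `blocks_to_xml`, the `document` root element and the sample data are hypothetical, and the element names simply reuse the tags mentioned in the original post rather than following DocBook or TEI.

```python
import xml.etree.ElementTree as ET


def blocks_to_xml(blocks):
    """Build a <document> element from (label, text) pairs.

    Assumption: each label is a valid XML element name chosen by the user.
    """
    root = ET.Element("document")
    for label, text in blocks:
        child = ET.SubElement(root, label)
        child.text = text
    return root


# Hypothetical output of the labelling step.
blocks = [
    ("title", "Legacy document conversion"),
    ("date", "2005-12-23"),
    ("paragraph", "First paragraph."),
    ("paragraph", "Second paragraph."),
]

doc = blocks_to_xml(blocks)
xml_string = ET.tostring(doc, encoding="unicode")

# Once the text is structured, it can be queried by function, not by layout.
title = doc.findtext("title")
paragraphs = [p.text for p in doc.findall("paragraph")]
print(title)       # Legacy document conversion
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

The same queries would be hard to express against the raw text, which is the practical argument for structuring documents before adding them to a data bank.
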

> Has any of you worked for, or heard of, a business with
> this kind of need? Do you think there is demand for legacy
> document conversion in small businesses?

I have also been part of one such organization.

> - A lot of small cleaning and normalization tasks: removing headers,
> de-hyphenation, reconstruction of paragraphs with broken lines,
> removing non-textual or decorative elements (such as ASCII art), ...

These issues can be taken care of by defining a macro in MS Word. The
flow would then go through MS Word itself (Word to text, text
to XML).

> I think that legacy document conversion may be a need for many
> businesses....

The most likely candidates (in my perception) would be publications that
place heavy emphasis on typography (those can be typography-oriented
too), or those that prefer keeping their text in undefined/spontaneous
structures. Such magazines may not be archiving/publishing their issues for
recall/research purposes.

Thanking you,

Manu Stanley

A journey of a thousand miles must begin with a single step.

Dec 29 '05 #4
