DOC -> TXT component

Hi all,

I am trying to extract text from an uploaded DOC file so that I can run
some regexes on the text to fill in some textboxes on the ASP page. I
know that Writely (www.writely.com) does this, but I think they use C#
or something special to preserve the formatting of the uploaded DOC
file.

My questions:

(a) Does anybody know of a way to extract text from DOC files? Any
component out there that's cheap?

(b) Is there any way to be able to preview an uploaded DOC file? Maybe
convert it into XML and use some default styles or something.

(c) If DOC->Text isn't possible, is DOC->RTF possible? Since RTF is
ASCII, something could still be done..

Thanks a lot for your time / any response.

Vince

Nov 1 '06 #1

"LtCommander" <Lt*********@gmail.com> wrote in message
news:11*********************@e64g2000cwd.googlegroups.com...
> (a) Does anybody know of a way to extract text from DOC files? Any
> component out there that's cheap?

The cheapest option is the Office DLLs. It is also the heaviest on
performance, and there are potential licensing issues. There are libraries
that go from text to Word, but I am not certain about the other direction
(I am sure they are out there, but I do not know of any personally).

> (b) Is there any way to be able to preview an uploaded DOC file? Maybe
> convert it into XML and use some default styles or something.

I am sure there is a component (componentsource.com) or an open source
library (sourceforge.net is a good resource, as is codeplex.com).

> (c) If DOC->Text isn't possible, is DOC->RTF possible? Since RTF is
> ASCII, something could still be done..

Pulling text out of the DOC is the issue. DOC -> RTF is easy enough with the
Office libs, but that brings back the weight of having Office components on
your system.

I would look at third-party libraries. I know there are components like Word
Writer (not sure if it goes both ways, however). I would also look at the
open source community. You might still have to pay (depending on the license),
but you should be able to work out a reasonable deal.

--
Gregory A. Beamer
MVP; MCP: +I, SE, SD, DBA
http://gregorybeamer.spaces.live.com

*************************************************
Think outside of the box!
*************************************************
Nov 1 '06 #2
Hi,

Cowboy (Gregory A. Beamer) wrote:
> Cheapest is the office DLLs. It is also the most perf heavy and there is a
> potential of licensing issues.

Also, MS recommends not installing Office on a server. Worth reading in
this context:
http://support.microsoft.com/default...en-us%3B257757

Cheers,
Olaf
--
My .02: www.Resources.IntuiDev.com
Nov 1 '06 #3

Thanks Greg and Olaf.
Having the Office libraries on the server is certainly the last resort.
I don't even think that our host provider will allow that. I will look
at componentsource.com to see if I can find something. If something else
suddenly strikes you, please let me know.
Thanks again for your help.

Vince

Nov 2 '06 #4
Hi,

LtCommander wrote:
> Having the Office libraries on the server is certainly the last resort.
> I don't even think that our host provider will allow that. I will look
> at componentsource.com to see if I can find something. If something else
> suddenly strikes you, please let me know.

I know this might not really apply to your current problem, but just FYI:
You don't need Office installed on the server to actually *create*
Office files. There are a couple of ways to create files (at least starting
with Office 2003); for Word that would be e.g. via HTML (streamed as .doc to
the client), XML templates (e.g. a Word template saved as XML, with
placeholders in it that your code replaces), or XML/XSLT.

For the latter, check out this KB article:
http://support.microsoft.com/Default.aspx?id=311461

However, if you actually need to *extract* text, then you'll probably be
forced to use some third-party component. Why is it that you get DOCs
uploaded (as opposed to plain text)? That is, could you change that to
"readable" text?

Cheers,
Olaf
--
My .02: www.Resources.IntuiDev.com
Nov 2 '06 #5
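
The XML-template idea from #5 could be sketched roughly like this. It is only an illustration in Python, not code from the thread: the `{{NAME}}`-style placeholder syntax, the tiny template, and the `fill_template` helper are all made up for the example, and the namespace is assumed to be the Word 2003 XML one. The only part taken from the post is the general approach (a Word document saved as XML, with placeholders replaced by code).

```python
import re
from xml.sax.saxutils import escape

# Hypothetical template: a tiny Word 2003 XML document with {{...}} placeholders.
# A real template would be a document saved from Word as XML.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Name: {{NAME}}</w:t></w:r></w:p>
    <w:p><w:r><w:t>Employer: {{EMPLOYER}}</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>
"""

# Placeholder syntax is an assumption for this sketch: {{UPPER_CASE_KEY}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def fill_template(template, values):
    """Replace each {{KEY}} in the template with the XML-escaped value for KEY."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key}")
        # Escape &, < and > so user-supplied text cannot break the XML.
        return escape(str(values[key]))
    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    print(fill_template(TEMPLATE, {"NAME": "Vince", "EMPLOYER": "Smith & Sons"}))
```

The escaping step matters more than it looks: a value such as "Smith & Sons" would otherwise produce a file that Word refuses to open.
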
Thanks Olaf. That's because we are dealing with resumes. Job seekers
normally upload DOC files as opposed to TXT or ASCII files. We want to
be able to fill in most of the online textboxes, like Name, Age and so
on, by extracting the text from the DOC file and running a regex on
it. So far, I have found some components that claim to do this but nothing
really concrete. I have to keep looking!
If you can think of something else, please let me know. I'll continue
my component research and see if I can find something useful. Many
charge over USD 900!!

Cheers!
Vince

Nov 3 '06 #6
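
For the regex step described in #6 (once the text is out of the DOC, by whatever means), something along these lines might be a starting point. This is a sketch in Python under strong assumptions: it only handles resumes that contain explicit "Label: value" lines, and the field names, patterns, and the `extract_fields` helper are hypothetical. Real resumes vary a lot, so expect to fall back to empty textboxes often.

```python
import re

# Hypothetical patterns; they assume explicit "Label: value" lines in the text.
FIELD_PATTERNS = {
    "name": re.compile(r"^\s*Name\s*[:\-]\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE),
    "age": re.compile(r"^\s*Age\s*[:\-]\s*(\d{1,3})\b", re.IGNORECASE | re.MULTILINE),
    # Deliberately loose e-mail pattern: good enough to pre-fill a textbox,
    # not a validator.
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_fields(text):
    """Return a dict of field name -> first match, or None when nothing matched."""
    fields = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            fields[field] = None
        else:
            # Use the capture group when the pattern has one, else the whole match.
            fields[field] = match.group(1) if pattern.groups else match.group(0)
    return fields


if __name__ == "__main__":
    sample = "Curriculum Vitae\nName: Jane Doe\nAge: 29\nE-mail: jane.doe@example.com\n"
    print(extract_fields(sample))
```

Returning None for a missing field keeps the page logic simple: any textbox whose value is None is just left blank for the job seeker to fill in.
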
Hi,

LtCommander wrote:
> If you can think of something else, please let me know.

Sorry, I guess I don't have much to offer on extracting...

> I'll continue my component research and see if I can find something
> useful. Many charge over USD 900!!

I'd appreciate it if you posted here once you've found your solution.

Cheers,
Olaf
--
My .02: www.Resources.IntuiDev.com
Nov 3 '06 #7
Sure. Thanks for your help and I'll post my findings here if they are
useful.

Cheers!

Nov 4 '06 #8
Olaf,
I found this:
http://www.convertspot.com/convertwordtotextcom.htm
They charge USD 250 and I haven't tried them yet. I am looking for
cheaper alternatives first..

Vince

Nov 4 '06 #9
Hi Olaf,
Just wanted to tell you that the component works like a charm. Now the
mission is to find a cheaper one!!
Vince
Nov 4 '06 #10
Hi,

LtCommander wrote:
> just wanted to tell you that the component works like a charm. Now, the
> mission is to find a cheaper one!!

The question is what you want to do with the component. If you plan on
integrating it into an application that you sell, you'd have to buy the
$1200 license. If all the tool does is extract the readable text portion
of a .doc (without e.g. tables or field names, etc.), then I guess you could
dig into the format and work that out yourself. Also, I'm wondering
what such a tool would do with the hidden stuff. For example, create a new
DOC, type in some text and save it. Then overwrite part or all of that text
and save again. When you open that document with a text editor, you'll find
that the original text (after the first save) is still there unless you
removed the "hidden data" (check out ...
http://www.microsoft.com/downloads/d...displaylang=en
...).

Cheers,
Olaf
--
My .02: www.Resources.IntuiDev.com
Nov 4 '06 #11
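
To see the effect described in #11 without opening the file in a text editor, a crude scan for readable runs in the raw bytes is enough. The sketch below is in Python and is not a parser for the DOC format: it simply assumes the text shows up either as single-byte characters or as UTF-16LE, and the `find_text_runs` helper and the minimum run length are choices made for this example. It will also pick up font names, metadata and other noise, and it ignores non-Latin text entirely.

```python
import re

# Shortest run worth reporting; an arbitrary choice for this sketch.
MIN_LEN = 4

# Runs of printable ASCII bytes.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)
# The same characters stored as UTF-16LE (each one followed by a zero byte).
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def find_text_runs(data):
    """Return readable text runs found in raw bytes, ordered by file offset."""
    runs = []
    for match in ASCII_RUN.finditer(data):
        runs.append((match.start(), match.group().decode("ascii")))
    for match in UTF16_RUN.finditer(data):
        runs.append((match.start(), match.group().decode("utf-16-le")))
    return [text for _, text in sorted(runs)]


if __name__ == "__main__":
    # Synthetic bytes standing in for a file; with a real upload you would
    # read the file in binary mode and pass its contents instead.
    data = (b"\x00\x01old draft text\x00\xff\xfe"
            + "new text".encode("utf-16-le")
            + b"\x02\x03")
    for run in find_text_runs(data):
        print(run)
```

If stale text from an earlier save appears in the output of a scan like this, a commercial extraction component may well return it too, which is worth checking before trusting its output for resume fields.
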
Thanks Olaf. You know, resumes are pretty standard stuff, maybe tables
here and there but certainly no hidden text or comments. What we want
to do is:

a) Get the job seeker to upload his resume, and fill in most of the
usual boxes (name, age, sex, DOB...) ourselves. So the incentive to
upload the resume includes having an online profile filled in for the
seeker automatically.

I did not realize that we would have to buy the 1200 thing. That's
crazy! Thanks! I am sure we won't be using the component now!

Thanks for the article, Olaf. Hidden text / comments aren't going to be
that big a problem compared to the regex stuff we would need to extract
various fields like name and so on from the text file. At 1200, I
think this feature of ours has to be aborted.

Nov 5 '06 #12
Hi,

LtCommander wrote:
> thanks olaf. you know, resumes are pretty standard stuff, maybe tables
> here and there but certainly no hidden text or comments. what we want
> to do is to:
>
> a) get the job seeker to upload his resume and fill up most of the
> usual boxes (name, age, sex, dob..) ourselves. so, the incentive to
> upload the resume includes filling up an online profile for the seeker
> automatically

If that means that the files being uploaded to your server will be DOCs
that were downloaded from the same server before, meaning that you created
those DOCs, then create them as XML! In that case, you receive an "XML'd"
DOC which you can parse without any other component.

> i did not realize that we would have to buy the 1200 thing. that's
> crazy! thanks! i am sure we won't be using the component now!

That license would only make sense if you needed to deploy the component on
more than 4 servers, as it's 250.-/server.

Cheers,
Olaf
--
My .02: www.Resources.IntuiDev.com
Nov 6 '06 #13
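
Parsing an "XML'd" DOC as suggested in #13 needs nothing beyond a standard XML parser. Here is a minimal sketch in Python, assuming the upload is Word 2003 XML and that its body text sits in `w:t` elements inside `w:p` paragraphs under the namespace shown. The `extract_paragraphs` helper and the sample document are made up for the example; it does nothing for binary .doc uploads, which is the case the thread is really stuck on.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for Word 2003 XML documents; other Word XML flavours
# use different namespaces.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def extract_paragraphs(xml_bytes):
    """Return the text of each w:p paragraph, joining the w:t runs inside it."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    # Hypothetical upload: two paragraphs, the first split across two runs.
    sample = (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
        b"<w:body>"
        b"<w:p><w:r><w:t>Name: </w:t></w:r><w:r><w:t>Jane Doe</w:t></w:r></w:p>"
        b"<w:p><w:r><w:t>Age: 29</w:t></w:r></w:p>"
        b"</w:body>"
        b"</w:wordDocument>"
    )
    for line in extract_paragraphs(sample):
        print(line)
```

Joining the runs per paragraph matters because Word often splits one visible line into several runs, and a regex applied to the individual runs would miss values that straddle them.
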
We would create our resume (the blanks that the user fills in on our
site) as XML, but we don't want to bother a first-time user with
filling in all of that. So the incentive for them to upload
their DOC would be that the most common textboxes get filled in
automatically. So...
I have been told about some 50-buck component that's supposed to do the
same thing. I have to test that next!

Cheers

Nov 7 '06 #14
