Editor to clean up MS Word-generated HTML table

I have a very large html table created by MS Word, saved as its "Web
Page, Filtered" file type. Every html table cell has lots of
formatting tags. Most of the file size is that formatting.

Is there a free or inexpensive editor that can quickly remove all
formatting to minimize the file size?

I tried a few freeware editors, but wasn't able to find a way to clean
it up.
Thanks,

Greg

Oct 24 '07 #1
On 2007-10-24, Greg Lovern wrote:
> I have a very large html table created by MS Word, saved as its "Web
> Page, Filtered" file type. Every html table cell has lots of
> formatting tags. Most of the file size is that formatting.
>
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?
>
> I tried a few freeware editors, but wasn't able to find a way to clean
> it up.

Use "lynx -dump" to extract the text, then mark it up in any text
editor.

--
Chris F.A. Johnson <http://cfaj.freeshell.org>
===================================================================
Author:
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Oct 24 '07 #2
Greg Lovern wrote:
> I have a very large html table created by MS Word, saved as its "Web
> Page, Filtered" file type. Every html table cell has lots of
> formatting tags. Most of the file size is that formatting.
>
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?

First--don't!

1) In Word select the table:
2) Convert table to text and use tabs for the table cells
3) Use Word's Search and Replace feature:
3a) Find what: ^t
Replace with: </td></td>
Replace all
3b) Find what: ^p
Replace with: </td></tr>^p<tr><td>
Replace all
4) Add to the beginning of your formal table:
<table>
<tr><td>
5) Add to end:
</table>
6) Select all and paste into your template HTML with any text editor.
Style to taste...
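
For anyone who would rather script the tab-to-table steps above than run them through Word's Search and Replace, here is a minimal Python sketch. It assumes the table has already been converted to tab-delimited text, one row per line, and that no cell contains a line break. The function name `tsv_to_html_table` is made up for illustration.

```python
import html


def tsv_to_html_table(text: str) -> str:
    """Turn tab-delimited text (one row per line) into a bare HTML table."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines so no empty rows are emitted
        cells = line.split("\t")
        # Escape &, < and > so cell text cannot break the markup.
        tds = "".join(f"<td>{html.escape(cell)}</td>" for cell in cells)
        rows.append(f"<tr>{tds}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    sample = "Name\tQty\nWidget\t3\nNut & bolt\t12\n"
    print(tsv_to_html_table(sample))
```

Unlike the find-and-replace route, this closes every row and cell itself, so nothing has to be added by hand at the start or end of the table.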

--
Take care,

Jonathan
-------------------
LITTLE WORKS STUDIO
http://www.LittleWorksStudio.com
Oct 24 '07 #3
In article <4e***************************@NAXS.COM>,
"Jonathan N. Little" <lw*****@centralva.net> wrote:

> Greg Lovern wrote:
> > I have a very large html table created by MS Word, saved as its "Web
> > Page, Filtered" file type. Every html table cell has lots of
> > formatting tags. Most of the file size is that formatting.
> >
> > Is there a free or inexpensive editor that can quickly remove all
> > formatting to minimize the file size?
>
> First--don't!

Agreed - if at all possible, avoid using Word to generate any html.

> 1) In Word select the table:
> 2) Convert table to text and use tabs for the table cells
> 3) Use Word's Search and Replace feature:
> 3a) Find what: ^t
> Replace with: </td></td>

I think you mean </td><td>?

As an alternative, the OP could look at something like
Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

Depending on the flavour of OS and tastes/talents of the
user, there's always grep of course...
Oct 24 '07 #4
Greg Lovern wrote:
> I have a very large html table created by MS Word, saved as its "Web
> Page, Filtered" file type. Every html table cell has lots of
> formatting tags. Most of the file size is that formatting.
>
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?

I wrote a Win32 program called xtag that might work for you:
www.industrologic.com/basic/

Oct 24 '07 #5
Greg Lovern wrote:
> Then xtag.exe crashed while I was writing this.

Whoops! Oh, well, I asked for feedback, and I got it didn't I?

If you are in a hurry you might try splitting the file into
smaller files and running them through it.

Send me your file if you want and I'll see what the problem is.
pe**@industrologic.com
Oct 24 '07 #6
On Oct 24, 3:31 pm, "Jonathan N. Little" <lws4...@centralva.net>
wrote:
> Then you need to convert the table to text formatted at tabs.
> "Table Convert Table to Text..." with "Separate text with Tabs"

Thanks, but because there are carriage returns (thousands of them)
within table cells, converting to text, then later trying to convert
back to a table, mangles it.

I found that nvu will remove some of the formatting. After doing that
to the small file, I was able to completely clean it manually in
notepad. I'm going to try again with the large file. I had tried that
with the large file before but it seemed like hardly any of the
strings to delete were duplicated. This time I'll try running it
through nvu's cleanup first.

Once clean, I'll work with them going forward in nvu. I found that nvu
adds absolutely no formatting to the table, at least after removing
some formatting with its settings.

Time to catch the bus now; I'll be back on this tomorrow morning.

Thanks to all for the help.
Thanks,

Greg

Oct 24 '07 #7
Greg Lovern wrote:
> On Oct 24, 3:31 pm, "Jonathan N. Little" <lws4...@centralva.net>
> wrote:
> > Then you need to convert the table to text formatted at tabs.
> > "Table Convert Table to Text..." with "Separate text with Tabs"
>
> Thanks, but because there are carriage returns (thousands of them)
> within table cells, converting to text, then later trying to convert
> back to a table, mangles it.

Sounds like the original source data is far too large for a single
webpage. You should break it up into smaller, logical, "digestible" pages.
--
Take care,

Jonathan
-------------------
LITTLE WORKS STUDIO
http://www.LittleWorksStudio.com
Oct 24 '07 #8
On Oct 23, 11:37 pm, Greg Lovern <gr...@gregl.net> wrote:
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?

Is there anything out there that will do the html equivalent of
Notepad -- remove all formatting, leaving only the bare html table and
its bare text contents?
Thanks,

Greg
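
One way to get that "Notepad for html" effect is a short script. The following is a rough sketch using only Python's standard-library `html.parser`, not a tool anyone in this thread used. It keeps the table tags (minus their attributes) and the text, and drops everything else. The names `TableStripper`, `strip_formatting`, `KEEP` and `SKIP_CONTENT` are made up for illustration, and the choice of which tags to keep is an assumption you may want to adjust.

```python
import re
from html import escape
from html.parser import HTMLParser

# Tags kept in the output; anything else (p, span, font, b, ...) is dropped.
KEEP = {"table", "thead", "tbody", "tfoot", "tr", "th", "td"}
# Elements whose text content is discarded along with the tags.
SKIP_CONTENT = {"head", "style", "script"}


class TableStripper(HTMLParser):
    """Collects bare table markup and text, ignoring all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif not self.skip_depth and tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes deliberately dropped
        elif not self.skip_depth and tag == "br":
            self.out.append("<br>")  # keep explicit line breaks inside cells

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs, then re-escape &, < and >.
            self.out.append(escape(re.sub(r"\s+", " ", data), quote=False))


def strip_formatting(markup: str) -> str:
    parser = TableStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


if __name__ == "__main__":
    word_html = (
        '<table class=MsoNormalTable border=1 cellspacing=0>'
        '<tr style="height:12pt"><td width=100 valign=top>'
        '<p class=MsoNormal><span style="font-size:10.0pt">Widget</span></p>'
        '</td><td width=60><p class=MsoNormal><b>3</b></p></td></tr></table>'
    )
    print(strip_formatting(word_html))
    # <table><tr><td>Widget</td><td>3</td></tr></table>
```

Note that paragraph tags inside a cell are dropped along with the rest, so a cell holding several paragraphs ends up as one run of text. If those breaks matter, one option is to emit `<br>` when a `</p>` is seen.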

Oct 25 '07 #9
On Nov 7, 10:15 pm, Rob Hick <rsjh...@googlemail.com> wrote:
> [...]
> A very nice little utility. I used it for something entirely
> different (to clean SPSS HTML output) and it worked great. I did have
> to run it in firefox though because in IE just pasting the text in
> made IE crash (there was a lot of text ~1mb)!
>
> So thanks RobG

Glad it got some use. :-)

--
Rob

Nov 8 '07 #10
