Editor to clean up MS Word-generated HTML table

I have a very large html table created by MS Word, saved as its "Web
Page, Filtered" file type. Every html table cell has lots of
formatting tags. Most of the file size is that formatting.

Is there a free or inexpensive editor that can quickly remove all
formatting to minimize the file size?

I tried a few freeware editors, but wasn't able to find a way to clean
it up.
Thanks,

Greg

Oct 24 '07 #1
10 Replies


On 2007-10-24, Greg Lovern wrote:
> I have a very large html table created by MS Word, saved as its "Web
> Page, Filtered" file type. Every html table cell has lots of
> formatting tags. Most of the file size is that formatting.
>
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?
>
> I tried a few freeware editors, but wasn't able to find a way to clean
> it up.

Use "lynx -dump" to extract the text, then mark it up in any text
editor.

--
Chris F.A. Johnson <http://cfaj.freeshell.org>
================================================== =================
Author:
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Oct 24 '07 #2

Greg Lovern wrote:
> I have a very large html table created by MS Word, saved as its "Web
> Page, Filtered" file type. Every html table cell has lots of
> formatting tags. Most of the file size is that formatting.
>
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?

First--don't!

1) In Word, select the table.
2) Convert table to text and use tabs for the table cells
3) Use Word's Search and Replace feature:
3a) Find what: ^t
Replace with: </td></td>
Replace all
3b) Find what: ^p
Replace with: </td></tr>^p<tr><td>
Replace all
4) Add to the beginning of your former table:
<table>
<tr><td>
5) Add to end:
</table>
6) Select all and paste into your template HTML with any text editor.
Style to taste...

--
Take care,

Jonathan
-------------------
LITTLE WORKS STUDIO
http://www.LittleWorksStudio.com
Oct 24 '07 #3
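
The tab-to-cell replacement described in post #3 can also be scripted. Below is a minimal Python sketch, assuming the table has already been converted to tab-separated text with one row per line. The function name tsv_to_html_table is made up for illustration. It wraps each row and cell in one pass, so no stray opening row is left over at the end, and it escapes &, < and > in the cell text. Like the manual approach, it assumes there are no line breaks inside cells.

```python
import html


def tsv_to_html_table(text: str) -> str:
    """Turn tab-separated lines into a bare HTML table (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip completely empty lines so no empty <tr> is emitted
        # One <td> per tab-separated field, with special characters escaped.
        cells = "".join(
            f"<td>{html.escape(cell)}</td>" for cell in line.split("\t")
        )
        rows.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    sample = "Name\tQty\nWidgets & bolts\t3\n"
    print(tsv_to_html_table(sample))
```

The result can be pasted into a template page and styled with CSS instead of per-cell formatting.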

In article <4e***************************@NAXS.COM>,
"Jonathan N. Little" <lw*****@centralva.net> wrote:
> Greg Lovern wrote:
>> I have a very large html table created by MS Word, saved as its "Web
>> Page, Filtered" file type. Every html table cell has lots of
>> formatting tags. Most of the file size is that formatting.
>>
>> Is there a free or inexpensive editor that can quickly remove all
>> formatting to minimize the file size?
>
> First--don't!

Agreed - if at all possible, avoid using Word to generate any html.

> 1) In Word, select the table.
> 2) Convert table to text and use tabs for the table cells
> 3) Use Word's Search and Replace feature:
> 3a) Find what: ^t
> Replace with: </td></td>

I think you mean </td><td>?

As an alternative, the OP could look at something like
Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

Depending on the flavour of OS and tastes/talents of the
user, there's always grep of course...
Oct 24 '07 #4

Greg Lovern wrote:
> I have a very large html table created by MS Word, saved as its "Web
> Page, Filtered" file type. Every html table cell has lots of
> formatting tags. Most of the file size is that formatting.
>
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?

I wrote a Win32 program that might work for you. It is called xtag:
www.industrologic.com/basic/

Oct 24 '07 #5

Greg Lovern wrote:
> Then xtag.exe crashed while I was writing this.

Whoops! Oh, well, I asked for feedback, and I got it, didn't I?

If you are in a hurry you might try splitting the file into
smaller files and running them through it.

Send me your file if you want and I'll see what the problem is.
pe**@industrologic.com
Oct 24 '07 #6

On Oct 24, 3:31 pm, "Jonathan N. Little" <lws4...@centralva.net>
wrote:
> Then you need to convert the table to text formatted at tabs.
> "Table Convert Table to Text..." with "Separate text with Tabs"

Thanks, but because there are carriage returns (thousands of them)
within table cells, converting to text, then later trying to convert
back to a table, mangles it.

I found that Nvu will remove some of the formatting. After doing that
to the small file, I was able to completely clean it manually in
Notepad. I'm going to try again with the large file. I had tried that
with the large file before, but hardly any of the strings to delete
seemed to be duplicated. This time I'll try running it through Nvu's
cleanup first.

Once clean, I'll work with them going forward in Nvu. I found that Nvu
adds absolutely no formatting to the table, at least after removing
some formatting with its settings.

Time to catch the bus now; I'll be back on this tomorrow morning.

Thanks to all for the help.
Thanks,

Greg

Oct 24 '07 #7

Greg Lovern wrote:
> On Oct 24, 3:31 pm, "Jonathan N. Little" <lws4...@centralva.net>
> wrote:
>> Then you need to convert the table to text formatted at tabs.
>> "Table Convert Table to Text..." with "Separate text with Tabs"
>
> Thanks, but because there are carriage returns (thousands of them)
> within table cells, converting to text, then later trying to convert
> back to a table, mangles it.

Sounds like the original source data is far too large for a single
webpage. You should break it up into smaller, logical, "digestible" pages.

--
Take care,

Jonathan
-------------------
LITTLE WORKS STUDIO
http://www.LittleWorksStudio.com
Oct 24 '07 #8

On Oct 23, 11:37 pm, Greg Lovern <gr...@gregl.net> wrote:
> Is there a free or inexpensive editor that can quickly remove all
> formatting to minimize the file size?

Is there anything out there that will do the html equivalent of
Notepad -- remove all formatting, leaving only the bare html table and
its bare text contents?

Thanks,

Greg

Oct 25 '07 #9
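
One way to get the "html equivalent of Notepad" asked about above is a small script. This is a rough sketch using Python's standard html.parser module, which is not a tool anyone in the thread mentions; the names TableStripper and strip_to_bare_table are hypothetical. It keeps only table, tr, th and td tags, throws away every attribute (including colspan and rowspan, which you may want to keep), and drops the contents of style and script blocks. Text outside the table is passed through unchanged.

```python
from html import escape
from html.parser import HTMLParser

# Tags kept in the output; everything else (p, span, font, ...) is dropped.
KEEP = {"table", "tr", "th", "td"}
# Elements whose text content should not appear in the output at all.
SKIP_CONTENT = {"style", "script"}


class TableStripper(HTMLParser):
    """Collects bare table markup and text, discarding all formatting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes deliberately discarded

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape &, < and > because the parser has decoded them.
            self.out.append(escape(data, quote=False))


def strip_to_bare_table(source: str) -> str:
    parser = TableStripper()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    word_html = (
        '<table class="MsoNormalTable" border=1><tr>'
        '<td width=100 style="padding:0in"><p class=MsoNormal>'
        '<span style="font-size:10.0pt">A &amp; B</span></p></td>'
        "</tr></table>"
    )
    print(strip_to_bare_table(word_html))
```

Because it streams through the file rather than matching strings, line breaks inside cells do not matter here.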

On Nov 7, 10:15 pm, Rob Hick <rsjh...@googlemail.com> wrote:
[...]
> A very nice little utility. I used it for something entirely
> different (to clean SPSS HTML output) and it worked great. I did have
> to run it in firefox though because in IE just pasting the text in
> made IE crash (there was a lot of text ~1mb)!
>
> So thanks RobG

Glad it got some use. :-)

--
Rob

Nov 8 '07 #10

On 24 Oct, 18:09, Greg Lovern <gr...@gregl.net> wrote:
> On Oct 24, 3:31 pm, "Jonathan N. Little" <lws4...@centralva.net>
> wrote:
>> Then you need to convert the table to text formatted at tabs.
>> "Table Convert Table to Text..." with "Separate text with Tabs"
>
> Thanks, but because there are carriage returns (thousands of them)
> within table cells, converting to text, then later trying to convert
> back to a table, mangles it.
>
> I found that nvu will remove some of the formatting. After doing that
> to the small file, I was able to completely clean it manually in
> notepad. I'm going to try again with the large file.

Greg,

Nvu 1.0 and KompoZer 0.77 won't do miracles. HTML Tidy won't either. I
know lots of advanced text editors (open source, multi-platform, free,
etc.) that can "find and replace" a string of text with whatever you
want, including control characters like carriage returns.

http://en.wikipedia.org/wiki/Text_ed...ch_and_replace

It is best to avoid the HTML export features of FrontPage and MS Word.

Regards, Gérard
P.S. Note that KompoZer 0.77 is more recent and more advanced than
Nvu 1.0, with more bug fixes.

Nov 9 '07 #11
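
The find-and-replace approach can also be sketched in Python with the re module. The attribute and tag lists below are guesses at what Word's filtered HTML typically contains, not an exhaustive list, and the function name is made up. Regular expressions are not a real HTML parser, so treat this as a starting point and check the output.

```python
import re

# Any tag, so that attribute removal only touches text inside < ... >.
TAG_RE = re.compile(r"<[^>]+>")
# Formatting attributes to delete (assumed list), quoted or unquoted values.
ATTR_RE = re.compile(
    r"""\s+(?:class|style|width|height|valign|align|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)
# Inline wrappers that carry only formatting (assumed list).
WRAPPER_RE = re.compile(r"</?(?:span|font|o:p)\b[^>]*>", re.IGNORECASE)


def search_and_replace_cleanup(source: str) -> str:
    # 1) Strip the listed attributes from every tag.
    cleaned = TAG_RE.sub(lambda m: ATTR_RE.sub("", m.group(0)), source)
    # 2) Remove the formatting-only wrapper tags, keeping their text.
    cleaned = WRAPPER_RE.sub("", cleaned)
    # 3) Fold carriage returns and newlines into single spaces.
    cleaned = re.sub(r"[\r\n]+", " ", cleaned)
    return cleaned


if __name__ == "__main__":
    cell = (
        '<td width=100 valign=top style="padding:0in">\r\n'
        "<span lang=EN-US>Hello</span></td>"
    )
    print(search_and_replace_cleanup(cell))
```

The same three replacements could be run by hand in any editor that supports regular-expression search and replace.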