Remove spaces and line wraps from html?

RiGGa

Hi,

I have a html file that I need to process and it contains text in this
format:

<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456</SPAN></TD></TR>

(Note: the split over two lines is exactly as it appears in the source file.)

I would like to use Python (or anything else, really) to get it all on one
line, i.e.:

<TD><SPAN class=xf id=EmployeeNo title="Employee Number">0123456</SPAN></TD></TR>

The reason I would like to do this is so that it is easier to pull the
information out of the file. I am interested in the contents of the title=
field and the data immediately after the > (in this case 0123456). I have
written a basic Python program to handle this, but in its current form it
goes wrong when the tag is split over two lines, as in my first example.

Hope this all makes sense.

Any help appreciated.

Jul 18 '05 #1

Paramjit Oberoi

> I have a html file that I need to process and it contains text in this
> format:

Try:

http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")

Jul 18 '05 #2

RiGGa

Paramjit Oberoi wrote:

>> I have a html file that I need to process and it contains text in this
>> format:
>
> Try:
>
> http://groups.google.com/groups?q=HT...2.05.55.384482
> 40hotmail.com&rnum=1
> (or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python so I don't yet know how to use
that example :(

Jul 18 '05 #3

Paramjit Oberoi

>> http://groups.google.com/groups?q=HT...ail.com&rnum=1
>> (or search c.l.p for "HTMLPrinter")
>
> Thanks, I forgot to mention I am new to Python so I don't yet know how to
> use that example :(

Python has a HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
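from HTMLParser import HTMLParser  # Python 2 module; in Python 3 it is html.parser

class MyHTMLParser(HTMLParser):
    # Called once for every opening tag, e.g. <title>
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    # Called once for every closing tag, e.g. </title>
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

my_parser = MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

# Feed the text to the parser; the handlers above fire as tags are found
my_parser.feed(html_data)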
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.
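
For the tag in the original question, "the rest" might look something like the sketch below. It assumes Python 3, where the module is imported as html.parser rather than HTMLParser. SpanExtractor and its results attribute are made-up names for illustration, not part of the library. The parser copes with the tag being split over two lines, so the file does not need to be joined onto one line first.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class SpanExtractor(HTMLParser):
    """Collect (title, text) pairs from <span title=...> elements.

    The class and attribute names are hypothetical, for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.results = []      # finished (title, text) tuples
        self._title = None     # title of the span currently being read
        self._chunks = []      # pieces of text seen inside that span

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "span":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a span with a title.
        if self._title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            text = "".join(self._chunks).strip()
            self.results.append((self._title, text))
            self._title = None


# The tag is split over two lines, as in the original question.
html_data = '''<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456</SPAN></TD></TR>'''

parser = SpanExtractor()
parser.feed(html_data)
parser.close()
print(parser.results)  # [('Employee Number', '0123456')]
```

To process a real file, the same parser could be fed the result of reading that file instead of the html_data string.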

HTH,
-param

Jul 18 '05 #4

RiGGa

Thank you!! That was just the kind of help I was
looking for.

Best regards

Rigga

Jul 18 '05 #5

RiGGa

I have just tried your example exactly as you typed
it (copy and paste) and I get a syntax error every time
I run it. It always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R

Jul 18 '05 #6

RiGGa

Ignore that, I retyped it manually and it now works. It must have been a hidden
character that my IDE didn't like.

Thanks again for your help, no doubt I will post back later with more
questions :)

Thanks
R

Jul 18 '05 #7

Peter Otten

RiGGa wrote:

> I have just tried your example exactly as you typed
> it (copy and paste) and I get a syntax error every time
> I run it, it always fails at the line starting:
>
> def handle_starttag(self, tag, attrs):
>
> And the error message shown in the command line is:
>
> DeprecationWarning: Non-ASCII character '\xa0'
>
> What does this mean?

You get a deprecation warning when your source code contains non-ASCII
characters and you have no encoding declared (see PEP 263 for details).
Those characters have a different meaning depending on the encoding, which
makes the code ambiguous.

However, what's really going on in your case is that (some) space characters
in the source code were replaced by chr(160), which happens sometimes with
newsgroup postings for reasons unknown to me. What makes that nasty is that
chr(160) looks just like the normal space character.
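
Because those characters are invisible in most editors, it can help to have a script point them out. The following is only a sketch in Python 3; find_nbsp is a made-up helper name and the sample string is invented for the demonstration.

```python
def find_nbsp(source_text):
    """Return (line_number, column) pairs for every chr(160) in source_text."""
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\xa0":  # chr(160), the non-breaking space
                hits.append((line_no, col))
    return hits


# Invented sample: the second line is indented with chr(160) instead of spaces.
sample = "class A:\n\xa0\xa0\xa0\xa0def f(self):\n        pass\n"
print(find_nbsp(sample))  # [(2, 1), (2, 2), (2, 3), (2, 4)]
```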

If you run the following from the command line with a space after python
(replace xxx.py with the source file and yyy.py with the name of the new
cleaned-up file), Paramjit's code should work as expected.

python -c 'file("yyy.py","w").write(file("xxx.py").read().replace(chr(160),chr(32)))'
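
The file() builtin used in that one-liner no longer exists in Python 3, so here is a rough equivalent as a small script. It is a sketch under assumptions: replace_nbsp is a made-up name, and the file is assumed to be readable as latin-1, where byte 160 is the non-breaking space. The temporary directory is only there to make the demonstration self-contained.

```python
import os
import tempfile


def replace_nbsp(src_path, dst_path, encoding="latin-1"):
    """Copy src_path to dst_path, turning chr(160) into an ordinary space."""
    with open(src_path, encoding=encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text.replace(chr(160), chr(32)))


# Demonstration on a throwaway file that is indented with chr(160).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "xxx.py")
    dst = os.path.join(tmp, "yyy.py")
    with open(src, "w", encoding="latin-1") as f:
        f.write("def f():\n\xa0\xa0\xa0\xa0return 1\n")
    replace_nbsp(src, dst)
    with open(dst, encoding="latin-1") as f:
        cleaned = f.read()

print(repr(cleaned))  # 'def f():\n    return 1\n'
```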

Peter

Jul 18 '05 #8
