Remove spaces and line wraps from html?

Hi,

I have an HTML file that I need to process, and it contains text in this
format:

<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456</SPAN></TD></TR>

(Note: the split over two lines is exactly as it appears in the source file.)

I would like to use Python (or anything else really) to have it all on one
line i.e.

<TD><SPAN class=xf id=EmployeeNo title="Employee
Number">0123456</SPAN></TD></TR>

(Note this has wrapped to the 2nd line)

The reason I would like to do this is to make it easier to pull the
information out of the file. I am interested in the contents of the title=
attribute and the data immediately after the > (in this case 0123456). I have
written a basic Python program to handle this, but in its current form it
goes wrong when the tag is split over two lines, as in my first example.

Hope this all makes sense.

Any help appreciated.
Jul 18 '05 #1
> I have a html file that I need to process and it contains text in this
> format:


Try:

http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")
Jul 18 '05 #2
Paramjit Oberoi wrote:
>> I have a html file that I need to process and it contains text in this
>> format:
> Try:
>
> http://groups.google.com/groups?q=HT...2.05.55.384482
> 40hotmail.com&rnum=1
> (or search c.l.p for "HTMLPrinter")

Thanks. I forgot to mention that I am new to Python, so I don't yet know how
to use that example :(
Jul 18 '05 #3
>> http://groups.google.com/groups?q=HT...ail.com&rnum=1
>>
>> (or search c.l.p for "HTMLPrinter")
>
> Thanks, I forgot to mention I am new to Python so I don't yet know how to
> use that example :(

Python has an HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
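from HTMLParser import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser):
    # Called once for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    # Called once for every closing tag, e.g. </html>
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

my_parser = MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

# feed() walks the text and calls the handler methods above as it goes
my_parser.feed(html_data)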
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.
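
For anyone reading this on Python 3: the module is now called html.parser and print is a function. Below is a minimal sketch of how the original question (the title= value plus the text right after the >) might be handled this way. The class name TitleAndTextParser and its results attribute are made up for illustration, and the sketch assumes SPAN tags are not nested inside each other. Because the parser works on tags rather than lines, the line wrap inside the tag should not matter.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect (title, text) pairs for every SPAN that has a title attribute."""

    def __init__(self):
        super().__init__()
        self.results = []           # finished (title, text) pairs
        self._current_title = None  # title of the SPAN we are inside, if any
        self._chunks = []           # text pieces seen inside that SPAN

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "span":
            title = dict(attrs).get("title")
            if title is not None:
                self._current_title = title
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a titled SPAN.
        if self._current_title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current_title is not None:
            text = "".join(self._chunks).strip()
            self.results.append((self._current_title, text))
            self._current_title = None


# The snippet from the original question, line wrap included.
html_data = """<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456</SPAN></TD></TR>"""

parser = TitleAndTextParser()
parser.feed(html_data)
parser.close()
print(parser.results)
```

Each entry in parser.results pairs a title with the text found inside that SPAN, so no line-joining step is needed beforehand.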

HTH,
-param
Jul 18 '05 #4

Thank you!! that was just the kind of help I was
looking for.

Best regards

Rigga
Jul 18 '05 #5

I have just tried your example exactly as you typed
it (copy and paste) and I get a syntax error every time
I run it. It always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R
Jul 18 '05 #6

Ignore that. I retyped it manually and it now works; it must have been a
hidden character that my IDE didn't like.

Thanks again for your help, no doubt I will post back later with more
questions :)

Thanks
R
Jul 18 '05 #7
RiGGa wrote:
> I have just tried your example exactly as you typed
> it (copy and paste) and I get a syntax error every time
> I run it, it always fails at the line starting:
>
> def handle_starttag(self, tag, attrs):
>
> And the error message shown in the command line is:
>
> DeprecationWarning: Non-ASCII character '\xa0'
>
> What does this mean?

You get a deprecation warning when your source code contains non-ASCII
characters and you have no encoding declared (read PEP 263 for details).
Those characters have a different meaning depending on the encoding, which
makes the code ambiguous.

However, what's really going on in your case is that (some) space characters
in the source code were replaced by chr(160), which happens sometimes with
newsgroup postings for reasons unknown to me. What makes that nasty is that
chr(160) looks just like the normal space character.
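
Since chr(160) (the no-break space, U+00A0) cannot be told apart from a normal space by eye, a short script can point out where it hides. This is only a sketch in Python 3; the function name find_nbsp and the sample string are invented for illustration.

```python
def find_nbsp(source_text):
    """Return (line_number, column) pairs for every U+00A0 in the text."""
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\u00a0":
                hits.append((line_no, col))
    return hits


# Two no-break spaces standing in for the indentation of the second line.
sample = (
    "class MyHTMLParser(HTMLParser):\n"
    "\u00a0\u00a0def handle_starttag(self, tag, attrs):\n"
)
print(find_nbsp(sample))
```

Line and column numbers are 1-based, so they can be matched against what an editor displays.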

If you run the following from the command line with a space after python
(replace xxx.py with the source file and yyy.py with the name of the new
cleaned-up file), Paramjit's code should work as expected.

python -c 'file("yyy.py","w").write(file("xxx.py").read().replace(chr(160),chr(32)))'
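
The file() builtin used above exists only in Python 2. A rough Python 3 equivalent might look like the sketch below. The function name replace_nbsp is made up, and the latin-1 default is an assumption chosen because chr(160) is a single byte there; pass a different encoding if your file uses one.

```python
import tempfile
from pathlib import Path


def replace_nbsp(src, dst, encoding="latin-1"):
    """Copy src to dst, turning every no-break space (U+00A0) into a plain space."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(text.replace("\u00a0", " "), encoding=encoding)


# Small self-check using throwaway files standing in for xxx.py and yyy.py.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "xxx.py"
    dst = Path(tmp) / "yyy.py"
    src.write_text("def f():\n\u00a0\u00a0return 1\n", encoding="latin-1")
    replace_nbsp(src, dst)
    cleaned = dst.read_text(encoding="latin-1")

print(repr(cleaned))
```

Working on decoded text rather than raw bytes avoids damaging multi-byte encodings such as UTF-8, where the no-break space is stored as two bytes.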

Peter

Jul 18 '05 #8
