473,386 Members | 1,819 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Remove spaces and line wraps from html?

Hi,

I have a html file that I need to process and it contains text in this
format:

<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456</SPAN></TD></TR>

(Note split over two lines is as it appears in the source file.)

I would like to use Python (or anything else really) to have it all on one
line i.e.

<TD><SPAN class=xf id=EmployeeNo title="Employee
Number">0123456</SPAN></TD></TR>

(Note this has wrapped to the 2nd line)

Reason I would like to do this is so it is easier to pull back the
information from the file, I am interested in the contents of the title=
field and the data immediately after the > (in this case 0123456). I have
a basic Python program I have written to handle this however with the
script in its current format it goes wrong when its split over a line like
my first example.

Hope this all makes sense.

Any help appreciated.
Jul 18 '05 #1
7 3051
> I have a html file that I need to process and it contains text in this
format:


Try:

http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")
Jul 18 '05 #2
Paramjit Oberoi wrote:
I have a html file that I need to process and it contains text in this
format:
Try:

http://groups.google.com/groups?q=HT...2.05.55.384482
40hotmail.com&rnum=1
(or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python so I dont yet know how to use
that example :(
Jul 18 '05 #3
>> http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")


Thanks, I forgot to mention I am new to Python so I dont yet know how to
use that example :(


Python has a HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
print "Encountered the beginning of a %s tag" % tag
def handle_endtag(self, tag):
print "Encountered the end of a %s tag" % tag

my_parser=MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.

HTH,
-param
Jul 18 '05 #4
Paramjit Oberoi wrote:
http://groups.google.com/groups?q=HT...ail.com&rnum=1
(or search c.l.p for "HTMLPrinter")


Thanks, I forgot to mention I am new to Python so I dont yet know how to
use that example :(


Python has a HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
print "Encountered the beginning of a %s tag" % tag
def handle_endtag(self, tag):
print "Encountered the end of a %s tag" % tag

my_parser=MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.

HTH,
-param

Thank you!! that was just the kind of help I was
looking for.

Best regards

Rigga
Jul 18 '05 #5
RiGGa wrote:
Paramjit Oberoi wrote:

http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python so I dont yet know how to
use that example :(


Python has a HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

print "Encountered the beginning of a %s tag" % tag
def handle_endtag(self, tag):
print "Encountered the end of a %s tag" % tag

my_parser=MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.

HTH,
-param

Thank you!! that was just the kind of help I was
looking for.

Best regards

Rigga

I have just tried your example exacly as you typed
it (copy and paste) and I get a syntax error everytime
I run it, it always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R
Jul 18 '05 #6
RiGGa wrote:
RiGGa wrote:
Paramjit Oberoi wrote:
>

http://groups.google.com/groups?q=HT...ail.com&rnum=1
>
> (or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python so I dont yet know how
to use that example :(

Python has a HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

print "Encountered the beginning of a %s tag" % tag
def handle_endtag(self, tag):
print "Encountered the end of a %s tag" % tag

my_parser=MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.

HTH,
-param

Thank you!! that was just the kind of help I was
looking for.

Best regards

Rigga

I have just tried your example exacly as you typed
it (copy and paste) and I get a syntax error everytime
I run it, it always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R

Ignore that, I retyped it manually and it now works, must have been a hidden
chatracter that my IDE didnt like.

Thanks again for your help, no doubt I will post back later with more
questions :)

Thanks
R
Jul 18 '05 #7
RiGGa wrote:
I have just tried your example exacly as you typed
it (copy and paste) and I get a syntax error everytime
I run it, it always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?


You get a deprecation warning when your source code contains non-ascii
characters and you have no encoding declared (read the PEP for details).
Those characters have a different meaning depending on the encoding, which
makes the code ambiguous.

However, what's really going on in your case is that (some) space characters
in the source code were replaced by chr(160), which happens sometimes with
newsgroup postings for reasons unknown to me. What makes that nasty is that
chr(160) looks just like the normal space character.

If you run the following from the command line with a space after python
(replace xxx.py with the source file and yyy.py with the name of the new
cleaned-up file), Paramjit's code should work as expected.

python-c'file("yyy.py","w").write(file("xxx.py").read().r eplace(chr(160),chr(32)))'

Peter

Jul 18 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: qwweeeit | last post by:
For a python code I am writing I need to remove all strings definitions from source and substitute them with a place-holder. To make clearer: line 45 sVar="this is the string assigned to sVar"...
6
by: Nathan Sokalski | last post by:
I am using ASP to read code from a text file that I am displaying on my page. Because I do not want the code from the text file to be executed, I used the Server.HTMLEncode() method to display it...
4
by: Purdy | last post by:
I have an asp.net application. i export to excel, in exporting to excel i use an xslt to define the columns and look. on one of the fields i need the word 'qty' but i need it to look like ' ...
3
by: Aaron | last post by:
is there a way to remove all white spaces in the html output? ie. everything as one line.
2
by: Alan Silver | last post by:
Hello, VWD Express seems obsessed with inserting loads of spaces at the start of each line whenever you drag controls into the source view. I guess this is MS' idea of making the source code...
4
by: M O J O | last post by:
Hi, I'm using a RichTextBox with WordWrap=True and MultiLine=False. When the text is to long, it fills more lines. How do I get the number of lines the text uses? Thanks!
7
by: WALDO | last post by:
I wrote a console application that basically consumes arguments and starts other command line apps via System.Process. Let's call it XCompile for now. I wrote a Visual basic add-in that does pretty...
135
by: Xah Lee | last post by:
Tabs versus Spaces in Source Code Xah Lee, 2006-05-13 In coding a computer program, there's often the choices of tabs or spaces for code indentation. There is a large amount of confusion about...
1
by: buggtb | last post by:
Hi Guys, I've been given the joyous task of updating some very old scripts at work and I could do with a little help. Our Unix system dumps text files to pseudo spoolers that our windows...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.