Bytes | Software Development & Data Engineering Community

Remove spaces and line wraps from html?

Hi,

I have an HTML file that I need to process, and it contains text in this
format:

<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456 </SPAN></TD></TR>

(Note: the split over two lines is exactly as it appears in the source file.)

I would like to use Python (or anything else, really) to get it all on one
line, i.e.

<TD><SPAN class=xf id=EmployeeNo title="Employee Number">0123456 </SPAN></TD></TR>

The reason I would like to do this is to make it easier to pull the
information out of the file. I am interested in the contents of the title=
field and the data immediately after the > (in this case 0123456). I have
written a basic Python program to handle this, but in its current form it
goes wrong when the tag is split over two lines, as in my first example.

Hope this all makes sense.

Any help appreciated.
Jul 18 '05 #1
> I have a html file that I need to process and it contains text in this
> format:

Try:

http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")
Jul 18 '05 #2
Paramjit Oberoi wrote:
>> I have a html file that I need to process and it contains text in this
>> format:
> Try:
>
> http://groups.google.com/groups?q=HT...2.05.55.384482
> 40hotmail.com&rnum=1
> (or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python, so I don't yet know how to use
that example :(
Jul 18 '05 #3
>> http://groups.google.com/groups?q=HT...ail.com&rnum=1
>>
>> (or search c.l.p for "HTMLPrinter")
>
> Thanks, I forgot to mention I am new to Python so I don't yet know how to
> use that example :(

Python has an HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
# Python 2 code: the module is named HTMLParser and print is a statement.
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called once for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag
    # Called once for every closing tag, e.g. </html>
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

my_parser = MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

# feed() parses the text and calls the handler methods above
my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.
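
For anyone reading this on Python 3, here is a rough sketch of how the same approach might pull out the title= value and the text after the >, using the sample from the first post. The module has moved to html.parser in Python 3; the class name SpanExtractor and its results list are made up for this illustration and are not part of the library. It assumes the values of interest always sit in a SPAN tag that carries a title attribute.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class SpanExtractor(HTMLParser):
    """Collect (title, text) pairs from <span> tags that carry a title attribute.

    Hypothetical helper class for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.results = []    # finished (title, text) pairs
        self._title = None   # title of the span currently open, if any
        self._chunks = []    # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == "span":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a span we are tracking.
        if self._title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            text = "".join(self._chunks).strip()
            self.results.append((self._title, text))
            self._title = None


# The sample from the first post, including the line break inside the tag.
sample = (
    '<TD><SPAN class=xf id=EmployeeNo\n'
    'title="Employee Number">0123456 </SPAN></TD></TR>'
)

parser = SpanExtractor()
parser.feed(sample)
parser.close()
print(parser.results)
```

Because the parser works on tags rather than lines, the line break inside the SPAN tag should not matter, so there is no need to join the lines first.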

HTH,
-param
Jul 18 '05 #4
Thank you!! That was just the kind of help I was
looking for.

Best regards

Rigga
Jul 18 '05 #5
I have just tried your example exactly as you typed
it (copy and paste) and I get a syntax error every time
I run it. It always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R
Jul 18 '05 #6
Ignore that, I retyped it manually and it now works. It must have been a hidden
character that my IDE didn't like.

Thanks again for your help, no doubt I will post back later with more
questions :)

Thanks
R
Jul 18 '05 #7
RiGGa wrote:
> I have just tried your example exactly as you typed
> it (copy and paste) and I get a syntax error every time
> I run it, it always fails at the line starting:
>
> def handle_starttag(self, tag, attrs):
>
> And the error message shown in the command line is:
>
> DeprecationWarning: Non-ASCII character '\xa0'
>
> What does this mean?

You get a deprecation warning when your source code contains non-ASCII
characters and you have no encoding declared (read PEP 263 for details).
Those characters have a different meaning depending on the encoding, which
makes the code ambiguous.

However, what's really going on in your case is that some space characters
in the source code were replaced by chr(160), the non-breaking space, which
sometimes happens with newsgroup postings for reasons unknown to me. What
makes that nasty is that chr(160) looks just like the normal space character.

If you run the following from the command line, with a space after python
(replace xxx.py with the source file and yyy.py with the name of the new
cleaned-up file), Paramjit's code should work as expected.

python -c 'file("yyy.py","w").write(file("xxx.py").read().replace(chr(160),chr(32)))'
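
The one-liner above relies on file(), which only exists in Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. The function names clean_source and clean_file are made up for this illustration, and the utf-8 default is an assumption: a file saved from an old newsgroup post may well need encoding="latin-1" instead.

```python
NBSP = "\xa0"  # chr(160), the non-breaking space


def clean_source(text):
    """Return text with every non-breaking space replaced by a plain space."""
    return text.replace(NBSP, " ")


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Read src_path, replace non-breaking spaces, write the result to dst_path.

    The encoding default is an assumption; pass the file's real encoding.
    """
    with open(src_path, encoding=encoding) as src:
        cleaned = clean_source(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(cleaned)


# A line that looks fine on screen but hides a non-breaking space.
broken = "def handle_starttag(self,\xa0tag, attrs):"
print(repr(broken))
print(repr(clean_source(broken)))
```

Calling clean_file("xxx.py", "yyy.py") would then play the same role as the command line above. Printing with repr() is a quick way to spot the hidden character, since it shows up as \xa0 instead of looking like a space.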

Peter

Jul 18 '05 #8
