Remove spaces and line wraps from html?

Hi,

I have an HTML file that I need to process, and it contains text in this
format:

<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456 </SPAN></TD></TR>

(Note split over two lines is as it appears in the source file.)

I would like to use Python (or anything else, really) to put it all on one
line, i.e.:

<TD><SPAN class=xf id=EmployeeNo title="Employee Number">0123456 </SPAN></TD></TR>

The reason I would like to do this is to make it easier to pull the
information out of the file. I am interested in the contents of the title=
field and the data immediately after the > (in this case 0123456). I have
written a basic Python program to handle this, but in its current form the
script goes wrong when the tag is split over two lines, as in my first
example.

Hope this all makes sense.

Any help appreciated.
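
For the literal request (getting everything onto one line), a minimal sketch is to collapse every run of whitespace, line breaks included, into a single space. This assumes Python 3; the function name collapse_whitespace is made up for illustration, and the approach treats the HTML as plain text, so it would also alter whitespace inside elements such as <pre> where it matters.

```python
import re


def collapse_whitespace(html_text):
    """Replace every run of spaces, tabs, and line breaks with one space."""
    return re.sub(r"\s+", " ", html_text).strip()


# The fragment from the question, split over two lines as in the source file.
wrapped = '<TD><SPAN class=xf id=EmployeeNo\ntitle="Employee Number">0123456 </SPAN></TD></TR>'

print(collapse_whitespace(wrapped))
# -> <TD><SPAN class=xf id=EmployeeNo title="Employee Number">0123456 </SPAN></TD></TR>
```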
Jul 18 '05 #1
> I have a html file that I need to process and it contains text in this
> format:

Try:

http://groups.google.com/groups?q=HT...ail.com&rnum=1

(or search c.l.p for "HTMLPrinter")
Jul 18 '05 #2
Paramjit Oberoi wrote:
>> I have a html file that I need to process and it contains text in this
>> format:
>
> Try:
>
> http://groups.google.com/groups?q=HT...2.05.55.38448240hotmail.com&rnum=1
>
> (or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python, so I don't yet know how to
use that example :(
Jul 18 '05 #3
>> http://groups.google.com/groups?q=HT...ail.com&rnum=1
>>
>> (or search c.l.p for "HTMLPrinter")
>
> Thanks, I forgot to mention I am new to Python so I dont yet know how to
> use that example :(


Python has an HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
# Written for Python 2 (HTMLParser module, print statement).
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called once for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    # Called once for every closing tag, e.g. </html>
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

my_parser = MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

# Feed the text to the parser; the handlers above fire as tags are found.
my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag

You'll be able to figure out the rest using the
documentation and some experimentation.
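
As a starting point for that experimentation, here is a rough sketch of how the same approach could pull out the title= value and the text after the > from the fragment in the original question. It assumes Python 3, where the module is html.parser rather than HTMLParser; the class name SpanExtractor and its records attribute are invented for this example and are not part of the library.

```python
from html.parser import HTMLParser  # Python 3 location of the HTMLParser class


class SpanExtractor(HTMLParser):
    """Collect (title, text) pairs from <span> tags that have a title attribute."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished (title, text) pairs
        self._title = None   # title of the span currently open, if any
        self._chunks = []    # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase.
        if tag == "span":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a span we are tracking.
        if self._title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            text = "".join(self._chunks).strip()
            self.records.append((self._title, text))
            self._title = None


# The fragment from the question, including the line break inside the tag.
html_data = '<TD><SPAN class=xf id=EmployeeNo\ntitle="Employee Number">0123456 </SPAN></TD></TR>'

parser = SpanExtractor()
parser.feed(html_data)
parser.close()
print(parser.records)
# -> [('Employee Number', '0123456')]
```

Because the parser works on tags rather than lines, the line break inside the SPAN tag needs no special handling, so the file does not have to be joined onto one line first.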

HTH,
-param
Jul 18 '05 #4
Thank you!! That was just the kind of help I was
looking for.

Best regards

Rigga
Jul 18 '05 #5
I have just tried your example exactly as you typed
it (copy and paste) and I get a syntax error every time
I run it. It always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R
Jul 18 '05 #6
Ignore that, I retyped it manually and it now works; it must have been a
hidden character that my IDE didn't like.

Thanks again for your help, no doubt I will post back later with more
questions :)

Thanks
R
Jul 18 '05 #7
RiGGa wrote:
> I have just tried your example exactly as you typed
> it (copy and paste) and I get a syntax error every time
> I run it, it always fails at the line starting:
>
> def handle_starttag(self, tag, attrs):
>
> And the error message shown in the command line is:
>
> DeprecationWarning: Non-ASCII character '\xa0'
>
> What does this mean?


You get a deprecation warning when your source code contains non-ASCII
characters and you have no encoding declared (read PEP 263 for details).
Those characters have a different meaning depending on the encoding, which
makes the code ambiguous.

However, what's really going on in your case is that (some) space characters
in the source code were replaced by chr(160), the no-break space, which
happens sometimes with newsgroup postings for reasons unknown to me. What
makes that nasty is that chr(160) looks just like the normal space character.
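
Since the two characters look identical on screen, one way to confirm the diagnosis is to list where the chr(160) characters sit. The following is only a sketch, assuming Python 3 and a source file already read into a string; find_nbsp is a made-up helper name.

```python
def find_nbsp(source_text):
    """Return (line_number, column) pairs, both 1-based, for every chr(160)."""
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\u00a0":  # same character as chr(160)
                hits.append((line_no, col))
    return hits


# Hypothetical pasted code with one no-break space hiding after "def".
sample = "class MyHTMLParser(HTMLParser):\n    def\u00a0handle_starttag(self, tag, attrs):\n"

print(find_nbsp(sample))
# -> [(2, 8)]
```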

If you run the following from the command line (note the space after python;
replace xxx.py with the source file and yyy.py with the name of the new
cleaned-up file), Paramjit's code should work as expected.

python -c 'file("yyy.py","w").write(file("xxx.py").read().replace(chr(160),chr(32)))'
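
The one-liner relies on the file() built-in, which no longer exists in Python 3. A rough Python 3 equivalent might look like the sketch below. The helper name clean_file is invented here, and the latin-1 default is an assumption about how the pasted file was saved (under latin-1 the byte 0xA0 decodes to the no-break space); pass a different encoding if your editor saved the file as something else.

```python
import os
import tempfile

NBSP = "\u00a0"  # the character chr(160), a no-break space


def clean_file(src_path, dst_path, encoding="latin-1"):
    """Copy src_path to dst_path, turning every no-break space into a space.

    Returns the number of characters replaced. The default encoding is an
    assumption, not something the original one-liner specifies.
    """
    # newline="" keeps the file's line endings exactly as they were.
    with open(src_path, encoding=encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(text.replace(NBSP, " "))
    return text.count(NBSP)


# Small demonstration using throwaway files named like those in the post.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "xxx.py")
    dst = os.path.join(tmp, "yyy.py")
    with open(src, "w", encoding="latin-1", newline="") as f:
        f.write("def handle_starttag(self,\u00a0tag,\u00a0attrs):\n")
    replaced = clean_file(src, dst)
    with open(dst, encoding="latin-1", newline="") as f:
        cleaned = f.read()

print(replaced, repr(cleaned))
# -> 2 'def handle_starttag(self, tag, attrs):\n'
```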

Peter

Jul 18 '05 #8
