A better webpage filter

For the past few days I've been experimenting with a construct that enables
me to send the source code of the web page I'm reading through a Python
script and then into a new tab in Mozilla. The new tab is opened
automatically, so the process feels very natural, although there's a lot of
reading, filtering and writing behind the scenes.

I want to do three things with this post:

A) Explain the process so that people can try it for themselves and say
"Hey stupid, I've been doing the same thing with greasemonkey for ages",
or maybe "You're great, this is easy to see, since the crux of the
biscuit is the apostrophe." Both kinds of comments are very welcome.

B) Explain why I want such a thing.

C) If this approach still holds up after all of the above, ask for help
writing a better Python htmlfilter.py

So here we go:

A) Explain the process

We need :

- mozilla firefox http://en-us.www.mozilla.com/en-US/
- add-on viewsourcewith https://addons.mozilla.org/firefox/394/
- batch file (on windows):
(htmfilter.bat)
    d:\python25\python.exe D:\Python25\Scripts\htmlfilter.py "%1" > out.html
    start out.html
- a python script:
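#htmlfilter.py

import sys

def htmlfilter(fname, skip = []):
    # Read the saved page source.
    f = file(fname)
    data = f.read()
    f.close()
    # Collect (start, end) index pairs for every <...> tag.
    L = []
    j = None
    for i,x in enumerate(data):
        if x == '<':
            j = i
        elif x == '>' and j is not None:
            # Ignore a stray '>' that has no opening '<' before it.
            L.append((j,i))
            j = None
    # Work backwards so earlier indices stay valid while tags are replaced.
    R = list(data)
    for i,j in reversed(L):
        s = data[i:j+1]
        for x in skip:
            if x in s:
                R[i:j+1] = ' '
                break
    return ''.join(R)

def test():
    # Expects exactly one argument: the file holding the page source.
    if len(sys.argv) == 2:
        skip = ['div','table']
        fname = sys.argv[1].strip()
        print htmlfilter(fname,skip)

if __name__=='__main__':
    test()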

Now install the htmlfilter.py file in your Python scripts dir and adapt
the batchfile to point to it.
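
To see what the filter does without going through Firefox, here is a minimal sketch of the same tag scan, working on a string instead of a file. It is written for Python 3 (the script above targets Python 2.5), and the names strip_tags and sample are made up for this illustration.

```python
def strip_tags(data, skip=()):
    """Replace every <...> tag whose text contains a word from skip with a space."""
    spans = []
    start = None
    for i, ch in enumerate(data):
        if ch == '<':
            start = i
        elif ch == '>' and start is not None:
            spans.append((start, i))
            start = None
    chars = list(data)
    # Replace from the end so the earlier indices stay valid.
    for start, end in reversed(spans):
        tag = data[start:end + 1]
        if any(word in tag for word in skip):
            chars[start:end + 1] = ' '
    return ''.join(chars)


# Hypothetical sample input: only the div tags are replaced, the <p> survives.
sample = '<div class="ad"><p>Hello</p></div>'
print(repr(strip_tags(sample, skip=['div', 'table'])))
```

Note that the test is a plain substring match on the whole tag, so a tag such as <p class="divider"> is also dropped because its attribute happens to contain 'div'.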

To use the viewsourcewith add-on to open the batchfile: Go to some
webpage, left click and view the source with the batchfile.

B) Explain why I want such a thing.

OK maybe this should have been the thing to start with, but hey it's
such an interesting technique it's almost a waste not to give it a chance
before my idea is dissed :-)

Most web pages I visit lately are taking so much room for ads (even with
adblocker installed) that the mere 20 columns of text that are available
for reading are slowing me down unacceptably. I have tried clicking
'print this' or 'printer friendly' or using 'no style' from the mozilla
menu and switching back again for other pages but it was tedious to say
the least. Every webpage has different conventions. In the end I just
started editing web pages' source code by hand, cutting out the beef and
saving it as a html file with only text, no scripts or formatting. But
that was also not very satisfying because raw web pages are *big*.

Then I found out I often could just replace all 'table' or 'div'
elements with a space and the page -although not very html compliant any
more- still loads and often the text looks a lot better. This worked for
at least 50 percent of the pages and restored my autonomy and
independence in reading web pages! (Which I do a lot by the way, maybe
for most people the problem is not very irritating, because they don't
read as much? Tell me that too, I want to know :-)

C) Ask for help writing a better Python htmlfilter.py

Please. You see the code for yourself, this must be done better :-)
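
One possible direction, offered as a sketch rather than a finished answer: let the standard library's html.parser do the tag recognition, so that the decision to drop a tag is based on the tag name and not on a substring match. This assumes Python 3; the class name TagDropper and the function html_filter are invented for this example, and processing instructions and other rare constructs are simply not handled.

```python
from html.parser import HTMLParser


class TagDropper(HTMLParser):
    """Re-emit a page, replacing start/end tags named in skip with a space."""

    def __init__(self, skip):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.skip = {name.lower() for name in skip}
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(' ' if tag in self.skip else self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are passed through as written.
        self.out.append(' ' if tag in self.skip else self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(' ' if tag in self.skip else '</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def html_filter(text, skip=('div', 'table')):
    """Return text with the tags listed in skip replaced by spaces."""
    parser = TagDropper(skip)
    parser.feed(text)
    parser.close()
    return ''.join(parser.out)
```

With this version html_filter('<div><p class="divider">Hi</p></div>') keeps the paragraph tag, because only the tag name is compared against the skip list.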

A.
Mar 24 '07 #1
On Sat, 24 Mar 2007 15:45:41 -0300, Anton Vredegoor
<an*************@gmail.com> wrote:
> For the past few days I've been experimenting with a construct that enables
> me to send the source code of the web page I'm reading through a Python
> script and then into a new tab in Mozilla. The new tab is opened
> automatically, so the process feels very natural, although there's a lot of
> reading, filtering and writing behind the scenes.
>
> I want to do three things with this post:
>
> A) Explain the process so that people can try it for themselves and say
> "Hey stupid, I've been doing the same thing with greasemonkey for ages",
> or maybe "You're great, this is easy to see, since the crux of the
> biscuit is the apostrophe." Both kinds of comments are very welcome.

I use the Opera browser: http://www.opera.com
Among other things (like having tabs for ages!):
- enable/disable tables and divs (like you do)
- enable/disable images with a keystroke, or only show cached images.
- enable/disable CSS
- banner suppressing (aggressive)
- enable/disable scripting
- "fit to page width" (for those annoying sites that insist on using a
fixed width of about 400 pixels, less than 1/3 of my actual screen size)
- apply your custom CSS or javascript on any page
- edit the page source and *refresh* the original page to reflect your
changes

All of this makes for very smooth web navigation, especially on a slow
computer or slow connection.

--
Gabriel Genellina

Mar 24 '07 #2
Gabriel Genellina wrote:
> I use the Opera browser: http://www.opera.com
> Among other things (like having tabs for ages!):
> - enable/disable tables and divs (like you do)
> - enable/disable images with a keystroke, or only show cached images.
> - enable/disable CSS
> - banner suppressing (aggressive)
> - enable/disable scripting
> - "fit to page width" (for those annoying sites that insist on using a
> fixed width of about 400 pixels, less than 1/3 of my actual screen size)
> - apply your custom CSS or javascript on any page
> - edit the page source and *refresh* the original page to reflect your
> changes
>
> All of this makes for very smooth web navigation, especially on a slow
> computer or slow connection.

Thanks! I forgot about that one. It does what I want natively so I will
go that route for now. Still I think there must be some use for my
method of filtering. It's just too good to not have some use :-) Maybe
in the future -when web pages will add new advertisement tactics faster
than web browser builders can change their toolbox or instruct their
users. After all, I was editing the filter script on one screen and
another screen was using the new filter as soon as I had saved it.

Maybe someday someone will write a GUI where one can click some radio
buttons that would define what goes through and what not. Possibly such
a filter could be collectively maintained on a live webpage with an
update frequency of a few seconds or something. Just to make sure we're
prepared for the worst :-)

A.
Mar 24 '07 #3
Anton Vredegoor <an*************@gmail.com> writes:
> [...]
> Most web pages I visit lately are taking so much room for ads (even
> with adblocker installed) that the mere 20 columns of text that are
> available for reading are slowing me down unacceptably. I have tried
> [...]

http://webcleaner.sourceforge.net/

Not actually tried it myself, though did browse some of the code once
or twice -- does some clever stuff.

Lots of other Python-implemented HTTP proxies, some of which are
relevant (though AFAIK all less sophisticated than webcleaner), are
listed on Alan Kennedy's nice page here:

http://xhaus.com/alan/python/proxies.html

A surprising amount of diversity there.

John
Mar 25 '07 #4
John J. Lee wrote:
> http://webcleaner.sourceforge.net/

Thanks, I will look into it sometime. Essentially my problem has been
solved by switching to opera, but old habits die hard and I find myself
using Mozilla and my little script more often than would be logical.

Maybe the idea of having a *Python* script open at all times to which
all content goes through is just too tempting. I mean if there's some
possible irritation on a site theoretically I could just write a
specific function to get rid of it. This mental setting works as a
placebo on my web browsing experience so that the actual problems don't
always even need to be solved ... I hope I'm not losing all traditional
programmers here in this approach :-)

> Not actually tried it myself, though did browse some of the code once
> or twice -- does some clever stuff.
>
> Lots of other Python-implemented HTTP proxies, some of which are
> relevant (though AFAIK all less sophisticated than webcleaner), are
> listed on Alan Kennedy's nice page here:
>
> http://xhaus.com/alan/python/proxies.html
> A surprising amount of diversity there.

At least now I know what general category seems to be nearest to my
solution so thanks again for that. However my solution is not really
doing anything like the programs on this page (although it is related to
removing ads), instead it tries to modulate a copy of the page after
it's been saved on disk. This removes all kinds of links and enables one
to definitely and finally reshape the form the page will take. As such
it is more concerned with the metaphysical image the page makes on the
user's brain and less with the actual content or the security aspects.

One thing I noticed though on that (nice!) Alan Kennedy page is that
there was a script that was so small that it didn't even have a homepage
but instead it just relied on a google groups post! I guess you can see
that I liked that one :-)

My filter is even smaller. I've tried to make it smaller still by
removing the batch file and using webbrowser.open (some cStringIO object)
but that didn't work on windows.
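
A likely reason, for anyone trying the same thing: webbrowser.open expects a URL string, not a file-like object, so an in-memory buffer has nothing the browser can load. A minimal sketch of a workaround, assuming Python 3, is to write the filtered text to a temporary .html file and open that file's URL. The function names write_temp_page and show_in_browser are made up for this example.

```python
import tempfile
import webbrowser
from pathlib import Path


def write_temp_page(html_text):
    """Save html_text to a temporary .html file and return its path."""
    # delete=False keeps the file on disk after it is closed,
    # so the browser can still read it.
    with tempfile.NamedTemporaryFile(
        'w', suffix='.html', delete=False, encoding='utf-8'
    ) as handle:
        handle.write(html_text)
    return Path(handle.name)


def show_in_browser(html_text):
    """Open html_text in a new browser tab via a temporary file."""
    path = write_temp_page(html_text)
    # as_uri() requires an absolute path, hence resolve().
    return webbrowser.open_new_tab(path.resolve().as_uri())
```

Calling show_in_browser(filtered_text) would then replace both the output redirect and the "start out.html" line of the batch file. The temporary files are not cleaned up here.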

regards,

A.
Mar 26 '07 #5
On Mon, 26 Mar 2007 06:06:00 -0300, Anton Vredegoor
<an*************@gmail.com> wrote:
> Thanks, I will look into it sometime. Essentially my problem has been
> solved by switching to opera, but old habits die hard and I find myself
> using Mozilla and my little script more often than would be logical.
>
> Maybe the idea of having a *Python* script open at all times to which
> all content goes through is just too tempting. I mean if there's some
> possible irritation on a site theoretically I could just write a
> specific function to get rid of it. This mental setting works as a

If you don't mind using JavaScript instead of Python, UserJS is for you:
http://www.opera.com/support/tutorials/userjs/

--
Gabriel Genellina

Mar 26 '07 #6
Gabriel Genellina wrote:
> If you don't mind using JavaScript instead of Python, UserJS is for you:
> http://www.opera.com/support/tutorials/userjs/

My script loads a saved copy of a page and uses it to open an extra tab
with a filtered view. It also works when javascript is disabled.

A.
Mar 26 '07 #7
