Question concerning this list

Thomas Ploch

Hello fellow pythonists,

I have a question concerning posting code on this list.

I want to post source code of a module, which is a homework for
university (yes yes, I know, please read on...).

It is a web crawler (which I will *never* let out into the wide world)
which uses regular expressions (and yes, I know, that's not good either). I
have finished it (as far as I can), but since I need a good mark to
actually finish the course, I am wondering if I can post the code, and I
am wondering if anyone of you can review it and give me possible hints
on how to improve things.

So is this O.K.? Or is this a blatantly idiotic idea?

I hope I am not the idiot of the month right now...

Thanks in advance,
Thomas

P.S.:

I might give some of my Christmas chocolate away as a donation to this
list... :-)

Dec 31 '06 #1

Steven D'Aprano

On Sun, 31 Dec 2006 02:03:34 +0100, Thomas Ploch wrote:

Hello fellow pythonists,

I have a question concerning posting code on this list.

I want to post source code of a module, which is a homework for
university (yes yes, I know, please read on...).

So long as you understand your university's policy on collaborations.

It is a web crawler (which I will *never* let out into the wide world)

If you post it on Usenet, you will have let it out into the wide world.
People will see it. Some of those people will download it. Some of them
will run it. And some of them will run it, uncontrolled, on the WWW.

Out of curiosity, if your web crawler isn't going to be used on the web,
what were you intending to use it on?

which uses regular expressions (and yes, I know, that's not good either).

Regexes are just a tool. Sometimes they are the right tool for the job.
Sometimes they aren't.

I have finished it (as far as I can), but since I need a good mark to
actually finish the course, I am wondering if I can post the code, and I
am wondering if anyone of you can review it and give me possible hints
on how to improve things.

That would be collaborating. What's your university's policy on
collaborating? Are you allowed to do so, if you give credit? Is it
forbidden?

It probably isn't a good idea to post a great big chunk of code and expect
people to read it all. If you have more specific questions than "how can
I make this better?", that would be good. Unless the code is fairly
short, it might be better to just post a few extracted functions and see
what people say about them, and then you can extend that to the rest of
your code.

--
Steven.

Dec 31 '06 #2

Thomas Ploch

Steven D'Aprano wrote:

On Sun, 31 Dec 2006 02:03:34 +0100, Thomas Ploch wrote:

>Hello fellow pythonists,

I have a question concerning posting code on this list.

I want to post source code of a module, which is a homework for
university (yes yes, I know, please read on...).

So long as you understand your university's policy on collaborations.

Well, my prof wants collaboration, but I think he meant it as a way of
getting students to bond with each other and make social contacts. He
just said that he will reject copy & paste work and submissions that
have nothing to do with the topic (when we laughed, he said we couldn't
imagine what sometimes gets handed in).

>It is a web crawler (which I will *never* let out into the wide world)

If you post it on Usenet, you will have let it out into the wide world.
People will see it. Some of those people will download it. Some of them
will run it. And some of them will run it, uncontrolled, on the WWW.

Out of curiosity, if your web crawler isn't going to be used on the web,
what were you intending to use it on?

It's a final homework, as I mentioned above, and it shouldn't be used
anywhere but on our university server for testing (unless request
throttling, i.e. at most two fetches per second, and handling of
'robots.txt' are implemented). But you are right about the Usenet thing,
I hadn't thought about that, so I won't post the whole code.
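A minimal sketch of the two safeguards mentioned here, assuming Python 3 and only the standard library. The `Throttle` class, the `make_robot_parser` helper and the sample robots.txt text are made up for illustration; a real crawler would download robots.txt from each server.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_robot_parser(robots_text):
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

class Throttle:
    """Allow at most `rate` fetches per second by sleeping between calls."""

    def __init__(self, rate=2.0):
        self.min_interval = 1.0 / rate
        self.last_fetch = None

    def wait(self):
        now = time.monotonic()
        if self.last_fetch is not None:
            remaining = self.min_interval - (now - self.last_fetch)
            if remaining > 0:
                time.sleep(remaining)
        self.last_fetch = time.monotonic()
```

Before each request the crawler would call `throttle.wait()` and skip any URL for which `parser.can_fetch(user_agent, url)` returns False.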

>which uses regular expressions (and yes, I know, that's not good either).

Regexes are just a tool. Sometimes they are the right tool for the job.
Sometimes they aren't.

Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.' (Because of
nested Elements? Can't remember) So I think he wants us to use regexes
to learn them. He is pointing to HTMLParser though.
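For comparison, here is roughly what the HTMLParser route looks like. This is only a sketch in Python 3, where the module lives at `html.parser` (in the Python 2 of this thread it was the `HTMLParser` module); the `TagCollector` class is a made-up name.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (tag name, attribute dict) pairs for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))

collector = TagCollector()
collector.feed('<a href="/x" class=\'y\'>link</a><!-- <b id="hidden"> -->')
print(collector.tags)   # [('a', {'href': '/x', 'class': 'y'})]
```

Note that the tag inside the comment is not reported, because the parser routes comments to a separate callback.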

>I have finished it (as far as I can), but since I need a good mark to
actually finish the course, I am wondering if I can post the code, and I
am wondering if anyone of you can review it and give me possible hints
on how to improve things.

It probably isn't a good idea to post a great big chunk of code and expect
people to read it all. If you have more specific questions than "how can
I make this better?", that would be good. Unless the code is fairly
short, it might be better to just post a few extracted functions and see
what people say about them, and then you can extend that to the rest of
your code.

You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)
- Registering handlers for different MIME types, so that callbacks fire
only for specific Content-Types (a lot of work and complex checks)
- Handling different encodings correctly.
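The last two points could be sketched along these lines, assuming Python 3. The `HANDLERS` registry, the `handles` decorator and the `dispatch` function are hypothetical names, not part of any library; only the `email.message.Message` header parsing is standard.

```python
from email.message import Message

HANDLERS = {}   # maps a MIME type such as "text/html" to a callback

def handles(content_type):
    """Decorator that registers a callback for one MIME type."""
    def register(func):
        HANDLERS[content_type] = func
        return func
    return register

@handles("text/html")
def handle_html(text):
    return "html document, %d characters" % len(text)

@handles("text/plain")
def handle_plain(text):
    return "plain text, %d characters" % len(text)

def dispatch(content_type_header, body, default_charset="utf-8"):
    """Decode `body` (bytes) and pass it to the matching handler, if any."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    handler = HANDLERS.get(msg.get_content_type())
    if handler is None:
        return None                      # no callback for this type
    charset = msg.get_content_charset() or default_charset
    try:
        text = body.decode(charset, errors="replace")
    except LookupError:                  # unknown charset name
        text = body.decode(default_charset, errors="replace")
    return handler(text)
```

Letting the header parser pick apart `text/html; charset=...` avoids hand-written checks, and the fallback charset is an assumption that would need tuning for real pages.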

I will follow your suggestions and post my code concerning specifically
these problems, and not the whole chunk.

Thanks,
Thomas

Dec 31 '06 #3

Marc 'BlackJack' Rintsch

In <ma************ *************** ************@python.org>, Thomas Ploch
wrote:

Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.' (Because of
nested Elements? Can't remember) So I think he wants us to use regexes
to learn them. He is pointing to HTMLParser though.

The problem is that much of the HTML in the wild is nominally written in
a structured markup language but is in many cases broken. If you just
search for some words or patterns that appear somewhere in the documents,
then regular expressions are good enough. If you want to actually *parse*
HTML "from the wild", better use the BeautifulSoup_ parser.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
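A small illustration of the deque suggestion (the URLs are placeholders). `list.pop(0)` has to shift every remaining element, so it costs O(n) per call, while a deque pops from either end in constant time.

```python
from collections import deque

pending = deque()
pending.append("http://example.com/a")
pending.append("http://example.com/b")
pending.append("http://example.com/c")

first = pending.popleft()   # queue behaviour: oldest item first
last = pending.pop()        # stack behaviour: newest item first

print(first)           # http://example.com/a
print(last)            # http://example.com/c
print(list(pending))   # ['http://example.com/b']
```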

Ciao,
Marc 'BlackJack' Rintsch

Dec 31 '06 #4

Thomas Ploch

Marc 'BlackJack' Rintsch schrieb:

In <ma************ *************** ************@python.org>, Thomas Ploch
wrote:

>Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.' (Because of
nested Elements? Can't remember) So I think he wants us to use regexes
to learn them. He is pointing to HTMLParser though.

Problem is that much of the HTML in the wild is written in a structured
markup language but it's in many cases broken. If you just search some
words or patterns that appear somewhere in the documents then regular
expressions are good enough. If you want to actually *parse* HTML "from
the wild" better use the BeautifulSoup_ parser.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

Yes, I know about BeautifulSoup. But as I said it should be done with
regexes. I want to extract tags, and their attributes as a dictionary of
name/value pairs. I know that most of HTML out there is *not* validated
and bollocks.

This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)
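As a rough sketch of how patterns like the ones above might be applied to get tags with their attributes as a dictionary (Python 3; the function name `extract_tags` and the module-level constants are made up for illustration):

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME)
ATTR = re.compile(
    r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')" % NAME)

def extract_tags(source):
    """Return a list of (tag name, {attribute: value}) for each start tag."""
    result = []
    for tag in TAG.finditer(source):
        attrs = {}
        for attr in ATTR.finditer(tag.group("attr")):
            # Strip the surrounding quote characters from the value.
            attrs[attr.group("attrName")] = attr.group("value")[1:-1]
        result.append((tag.group("name"), attrs))
    return result

print(extract_tags('<a href="/x" title=\'t\'>text</a>'))
# [('a', {'href': '/x', 'title': 't'})]
```

Two limits follow directly from the patterns: unquoted attribute values are skipped, and a `>` inside a quoted value ends the tag match early.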

>You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.

Which of the two would you recommend for handling large queues with fast
response times?

Thomas

Dec 31 '06 #5

Marc 'BlackJack' Rintsch

In <ma************ *************** ************@python.org>, Thomas Ploch
wrote:

This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

Have you tested this with tags inside comments?

>>You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.

Which of the two would you recommend for handling large queues with fast
response times?

`Queue.Queue` builds on `collections.deque` and is thread safe. Speed-wise
I don't think this makes a difference, as most of the time is spent on I/O
and parsing. So if you make your spider multi-threaded to gain some speed,
go with `Queue.Queue`.
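A minimal multi-threaded sketch of that advice, assuming Python 3, where the module is spelled `queue` rather than `Queue`. The names `fetch_placeholder`, `worker` and `crawl_all` are hypothetical, and the "fetch" is only a stand-in for real network I/O.

```python
import queue
import threading

def fetch_placeholder(url):
    # Stand-in for a real download; a crawler would do network I/O here.
    return len(url)

def worker(tasks, results):
    """Take URLs from `tasks` until a None sentinel arrives."""
    while True:
        url = tasks.get()
        try:
            if url is None:          # sentinel: no more work
                return
            results.put((url, fetch_placeholder(url)))
        finally:
            tasks.task_done()

def crawl_all(urls, num_workers=3):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():       # safe: all workers have finished
        url, size = results.get()
        collected[url] = size
    return collected
```

The queue does the locking, so the workers never touch a shared list directly.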

Ciao,
Marc 'BlackJack' Rintsch

Dec 31 '06 #6

Thomas Ploch

Marc 'BlackJack' Rintsch schrieb:

In <ma************ *************** ************@python.org>, Thomas Ploch
wrote:

>This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

Have you tested this with tags inside comments?

No, but I already see your point: it will parse _all_ tags, even those
that are commented out. I am thinking about how to solve this. I will
probably just take the chunks between comments and feed them to the
regular expressions.
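One way to sketch the "chunks between comments" idea is to delete the comments first and run the tag pattern on what remains. This assumes Python 3; `COMMENT` and `tag_names` are made-up names, and the tag pattern is a condensed form of the one posted earlier.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)   # non-greedy, spans lines
TAG = re.compile(r"<(?P<name>[A-Za-z_][A-Za-z0-9_.:-]*)(?P<attr>[^>]*)>")

def tag_names(source):
    """Return start-tag names found outside of HTML comments."""
    visible = COMMENT.sub("", source)
    return [match.group("name") for match in TAG.finditer(visible)]

html = '<p>shown</p><!-- <img src="x.png"> --><br>'
print(tag_names(html))                                   # ['p', 'br']
print([m.group("name") for m in TAG.finditer(html)])     # ['p', 'img', 'br']
```

An unterminated comment, or markup inside a script block, would still slip through, so this is a partial fix at best.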

>>>You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)
If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
Which of the two would you recommend for handling large queues with fast
response times?

`Queue.Queue` builds on `collections.deque` and is thread safe. Speed-wise
I don't think this makes a difference, as most of the time is spent on I/O
and parsing. So if you make your spider multi-threaded to gain some speed,
go with `Queue.Queue`.

I think I will go for collections.deque (since I have no intention of
making it multi-threaded) and keep several queues in a list, one for each
server, so that one server is finished before the crawler moves on,
instead of being directed to the next one straight away (Is this a good
approach?).
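The one-queue-per-server plan might look something like this sketch (Python 3.7+, since it relies on dicts keeping insertion order). The `HostQueues` class is hypothetical, and it uses a dict keyed by host rather than a plain list so that a URL can find its queue directly.

```python
from collections import deque
from urllib.parse import urlsplit

class HostQueues:
    """Keep one FIFO queue per host and drain hosts one after another."""

    def __init__(self):
        self.queues = {}   # host -> deque of URLs, in first-seen order

    def add(self, url):
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def next_url(self):
        """Return the next URL of the current host, or None when done."""
        while self.queues:
            host = next(iter(self.queues))
            pending = self.queues[host]
            if pending:
                return pending.popleft()
            del self.queues[host]      # host finished, move to the next
        return None
```

Whether finishing one server at a time is wise depends on the throttling: hitting a single host in a tight loop is exactly the case where a delay between requests matters most.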

Thanks a lot,
Thomas

Dec 31 '06 #7

John Nagle

Thomas Ploch wrote:

Marc 'BlackJack' Rintsch schrieb:

>>In <ma************ *************** ************@python.org>, Thomas Ploch
wrote:

>>>Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.'

Very true. HTML is LALR(0), that is, you can parse it without
looking ahead. Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using that to
drive a single state machine. The first character-level parser
yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.

Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive. In Python, at least
you can index through a string.
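The character-at-a-time approach described here could be sketched as a two-state tokenizer. This is a deliberately tiny illustration in Python 3 with a made-up function name; it ignores comments, quoted `>` characters inside attribute values, and everything else a real HTML tokenizer has to handle.

```python
def tokenize(source):
    """Yield ('tag', text) and ('text', text) tokens, one character at a time."""
    state = "text"
    buffer = []
    for ch in source:
        if state == "text":
            if ch == "<":
                if buffer:
                    yield ("text", "".join(buffer))
                buffer = []
                state = "tag"
            else:
                buffer.append(ch)
        else:  # state == "tag"
            if ch == ">":
                yield ("tag", "".join(buffer))
                buffer = []
                state = "text"
            else:
                buffer.append(ch)
    if state == "tag":
        # Unterminated tag: hand the leftover back as plain text.
        yield ("text", "<" + "".join(buffer))
    elif buffer:
        yield ("text", "".join(buffer))

print(list(tokenize("<p>Hi <b>there</b></p>")))
# [('tag', 'p'), ('text', 'Hi '), ('tag', 'b'), ('text', 'there'), ('tag', '/b'), ('tag', '/p')]
```

A grammar-level parser would then consume these tokens, for instance to match opening and closing tags.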

John Nagle

Dec 31 '06 #8

Thomas Ploch

John Nagle schrieb:

Very true. HTML is LALR(0), that is, you can parse it without
looking ahead. Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using that to
drive a single state machine. The first character-level parser
yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.

Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive. In Python, at least
you can index through a string.

John Nagle

I take it with LALR(0) you mean that HTML is a language created by a
Chomsky-0 (regular language) Grammar?

Thomas

Dec 31 '06 #9

Diez B. Roggisch

Thomas Ploch schrieb:

John Nagle schrieb:
> Very true. HTML is LALR(0), that is, you can parse it without
looking ahead. Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using that to
drive a single state machine. The first character-level parser
yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.

Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive. In Python, at least
you can index through a string.

John Nagle

I take it with LALR(0) you mean that HTML is a language created by a
Chomsky-0 (regular language) Grammar?

Nope.

LALR is a context free grammar parsing technique.

Regular expressions can't express languages like

a^n b^n

but something like

<div><div></div></div>

is <div>^2</div>^2
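To make the point concrete, here is a small sketch in Python 3 (the `LOOSE` pattern and `balanced` function are made-up names). A pattern such as `(<div>)*(</div>)*` cannot force the two counts to be equal, whereas a simple counter can.

```python
import re

# Accepts any number of opening tags followed by any number of closing
# tags; it has no way to require that the two numbers match.
LOOSE = re.compile(r"(<div>)*(</div>)*")

def balanced(source):
    """Check nesting of <div> tags with a depth counter."""
    depth = 0
    for token in re.findall(r"</?div>", source):
        if token == "<div>":
            depth += 1
        else:
            depth -= 1
            if depth < 0:        # closing tag without a matching opener
                return False
    return depth == 0

good = "<div><div></div></div>"
bad = "<div><div></div>"
print(bool(LOOSE.fullmatch(good)), balanced(good))   # True True
print(bool(LOOSE.fullmatch(bad)), balanced(bad))     # True False
```

The regex is still useful for finding the individual tags; it is the counting that needs something more than a regular expression.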

Diez

Jan 1 '07 #10
