Taking data from a text file to parse html page

DH
Hi,

I'm trying to strip the HTML and other useless junk from an HTML page.
I'd like to create something like an automated text editor that
takes keywords from a text file and removes them from the HTML page
(replacing those words in the page with blank space). I'm new to Python
and could use a little push in the right direction. Any ideas on how to
implement this?

Thanks!

Aug 24 '06 #1
DH,
Could you be more specific in describing what you have and what you want? You are addressing people, many of whom are good at
stripping useless junk once you tell them what 'useless junk' is.
It also helps to post some of the data that you need to process and a sample of the same data as it should look once it is
processed.

Frederic

Aug 24 '06 #2
See Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/
It will parse even badly formed HTML and allow you to extract or change
information as you wish.

-Larry Bates
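
For readers who want to see the parser approach without installing anything, here is a minimal sketch using the standard library's html.parser module rather than Beautiful Soup. It is only an illustration of "let a parser drop the tags and keep the text"; the class name TextExtractor and the helper html_to_text are made up for this example, and the sample string is a shortened version of the quote used later in the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags; the tags themselves are dropped."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &quot; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = '<div><p><em>&quot;Python has been an important part of Google.&quot;</em></p></div>'
print(html_to_text(sample))
```

A parser like this removes all markup; it does not know about a custom keyword list, so it solves only the "strip the HTML" half of the question.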
Aug 24 '06 #3
DH
Frederic,
Good points...

I have a plain text file containing the HTML fragments and words
(keywords) that I want removed from the HTML file. After processing,
the result would be saved as a plain text file.

So the program would import the keywords, remove them from the HTML
file and save the result as something.txt.

I would post the data but it's secret. I can post an example:

index.html (html page)

"
<div><p><em>&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;</em></p>
<p>-- Peter Norvig, <a class="reference"
"
replace.txt (keywords)
"
<div id="quote" class="homepage-box">

<div><p><em>&quot;

&quot;</em></p>

<p>-- Peter Norvig, <a class="reference"

"

something.txt(file after editing)

"

Python has been an important part of Google since the beginning, and
remains so as the system grows and evolves.
"
Larry,

I've looked into using BeautifulSoup but came to the conclusion that my
idea would work better in the end.
Thanks for the help.
Aug 24 '06 #4
DH wrote:
I'm trying to strip the HTML and other useless junk from an HTML page.
I'd like to create something like an automated text editor that
takes keywords from a text file and removes them from the HTML page
(replacing those words in the page with blank space)
[...]
I've looked into using BeautifulSoup but came to the conclusion that my
idea would work better in the end.
You could use BeautifulSoup anyway for the junk-removal part and then do
your magic. Even if it is not exactly what you want, it is a good idea to
try to reuse modules that are good at what they do.

--
Roberto Bonvallet
Aug 24 '06 #5
reading and writing files is described in the tutorial; see

http://pytut.infogami.com/node9.html

(scroll down to "Reading and Writing Files")

to do the replacement, you can use repeated calls to the "replace" method

http://pyref.infogami.com/str.replace

but that may cause problems if the replacement text contains things that
should be replaced. for an efficient way to do a "parallel" replace, see:

http://effbot.org/zone/python-replace.htm#multiple
</F>
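
To make the caveat about repeated replace calls concrete, here is a small sketch of a one-pass ("parallel") replace built on the standard re module. It is written for this thread as an illustration, not copied from the linked page; the function name multi_replace and the swap table are assumptions chosen to show the pitfall.

```python
import re


def multi_replace(text, mapping):
    """Replace every key of `mapping` in a single pass over `text`."""
    # Longest keys first, so a short key cannot shadow a longer one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


swap = {"a": "b", "b": "a"}

# Chained str.replace: the second call rewrites the output of the first
chained = "ab".replace("a", "b").replace("b", "a")
print(chained)                    # aa

# Single pass: each position of the input is matched at most once
print(multi_replace("ab", swap))  # ba
```

When every replacement is the empty string, as in the keyword-removal task here, chained calls are usually safe; the one-pass version matters once replacements can themselves contain keywords.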

Aug 24 '06 #6
DH
I found this
http://groups.google.com/group/comp....0ac6b1ac8cff51

Credit Jeremy Moles
-----------------------------------------------

finds = ("{", "}", "(", ")")
lines = open("foo.txt", "r").readlines()

# str.replace returns a new string, so the result has to be stored
for i, line in enumerate(lines):
    for find in finds:
        line = line.replace(find, "")
    lines[i] = line

print(lines)

-----------------------------------------------

I want something like
-----------------------------------------------

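# one keyword per line in replace.txt; blank lines are skipped
finds = [kw.rstrip("\n") for kw in open("replace.txt") if kw.strip()]
lines = open("foo.txt", "r").readlines()

# str.replace returns a new string, so the result has to be stored
for i, line in enumerate(lines):
    for find in finds:
        line = line.replace(find, "")
    lines[i] = line

print(lines)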

-----------------------------------------------
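
A fuller sketch of that idea, under the assumption that replace.txt holds one keyword (or HTML fragment) per line and that blank lines are separators. The function names load_keywords, remove_keywords and strip_file are invented for this example; the file names are the ones used earlier in the thread.

```python
def load_keywords(path):
    """Return the non-blank lines of the keyword file, without line endings."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\r\n") for line in fh if line.strip()]


def remove_keywords(text, keywords):
    """Delete every occurrence of every keyword from text."""
    # Longest first, so a keyword contained in a longer one is not removed early
    for kw in sorted(keywords, key=len, reverse=True):
        text = text.replace(kw, "")  # str.replace returns a new string
    return text


def strip_file(html_path, keyword_path, out_path):
    """Read the page, remove the keywords, write the result as plain text."""
    keywords = load_keywords(keyword_path)
    with open(html_path, encoding="utf-8") as fh:
        cleaned = remove_keywords(fh.read(), keywords)
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(cleaned)


# Hypothetical usage with the file names from the posts above:
# strip_file("index.html", "replace.txt", "something.txt")
```

Working on the whole file contents instead of line by line means a keyword is matched wherever it occurs, but a keyword can never span a line break because each keyword is itself a single line.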

Aug 24 '06 #7
You may also want to look at this stream editor:

http://cheeseshop.python.org/pypi/SE/2.2%20beta

It allows multiple replacements in a definition format of utmost simplicity:
>>> your_example = '''
<div><p><em>&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;</em></p>
<p>-- Peter Norvig, <a class="reference"
'''
>>> import SE
>>> Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
''')
>>> print Tag_Stripper (your_example)
&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;
-- Peter Norvig, <a class="reference"

Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):

>>> Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
"<a class\="reference"=" # *** This deletes the fragment
# "-- Peter Norvig, <a class\="reference"=" # Or like this if Peter Norvig has to go too
''')
>>> print Tag_Stripper (your_example)
&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;
-- Peter Norvig,

&quot; you can either translate or delete:

>>> Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
"<a class\="reference"=" # This deletes the fragment
# "-- Peter Norvig, <a class=\\"reference\\"=" # Or like this if Peter Norvig has to go too
htm2iso.se # This is a file (contained in the SE package) that translates all ampersand codes.
# Naming the file is all you need to do to include the replacements which it defines.
''')
>>> print Tag_Stripper (your_example)
'Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
'
-- Peter Norvig,

If instead of "htm2iso.se" you write "&quot;=" you delete it and your output will be:

Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.

-- Peter Norvig,

Your Tag_Stripper also does files:
>>> print Tag_Stripper ('my_file.htm', 'my_file_without_tags')
'my_file_without_tags'
A stream editor is not a substitute for a parser. It does, however, handle simple translation jobs like this one more
economically, where a parser would do a lot of work which you don't need.

Regards

Frederic
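
For readers without SE installed, roughly the same three steps (delete comments, delete tags, delete a known fragment, then translate entities) can be sketched with the standard library. This is not how SE works internally and makes no claim about SE's API; the names COMMENT, TAG, FRAGMENTS and strip_tags are invented here, and html.unescape stands in for the htm2iso.se translation only as an approximation.

```python
import html
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # comments, even if they contain tags
TAG = re.compile(r"<[^>]*>")                    # any complete <...> tag, possibly multi-line
FRAGMENTS = ('<a class="reference"',)           # leftover pieces to delete literally


def strip_tags(text):
    """Remove comments, tags and known fragments, then decode entities."""
    text = COMMENT.sub("", text)  # comments first, so no "-->" remainder is left behind
    text = TAG.sub("", text)
    for fragment in FRAGMENTS:
        text = text.replace(fragment, "")
    return html.unescape(text)    # &quot; becomes ", &amp; becomes &, and so on


your_example = '''<div><p><em>&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;</em></p>
<p>-- Peter Norvig, <a class="reference"
'''
print(strip_tags(your_example))
```

As with the stream editor, this is pattern matching and not parsing: it is adequate for simple clean-up jobs and will misbehave on input where "<" or ">" appear inside attribute values or scripts.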
Aug 24 '06 #8
DH
SE looks very helpful... I'm having a hell of a time installing it
though:

-----------------------------------------------------------------------------------------

foo@foo:~/Desktop/SE-2.2$ sudo python SETUP.PY install
running install
running build
running build_py
file SEL.py (for module SEL) not found
file SE.py (for module SE) not found
file SEL.py (for module SEL) not found
file SE.py (for module SE) not found

------------------------------------------------------------------------------------------
Aug 25 '06 #9
Surely you write your own programs (program_name.py), and you import and run them. You can put SE.PY and SEL.PY into the same
directory. That's all.
Or, if you prefer to keep other people's stuff in a different directory, just make sure that directory is in "sys.path",
because that is where import looks. Check for that directory's presence in the sys.path list:
>>> import sys
>>> sys.path
['C:\\Python24\\Lib\\idlelib', 'C:\\', 'C:\\PYTHON24\\DLLs', 'C:\\PYTHON24\\lib', 'C:\\PYTHON24\\lib\\plat-win',
'C:\\PYTHON24\\lib\\lib-tk' (... etc) ]

Supposing it isn't there, add it:
>>> sys.path.append ('/python/code/other_peoples_stuff')
>>> import SE
That should do it. Let me know if it works. If not, just keep asking.

Frederic
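
The same check can be written as a short script instead of typing it at the prompt. A minimal sketch, assuming the directory name from the post above (it is only a placeholder) and a helper name, ensure_on_path, made up for this example:

```python
import sys


def ensure_on_path(directory):
    """Append directory to sys.path unless it is already listed."""
    if directory not in sys.path:
        sys.path.append(directory)
    return directory in sys.path


# Placeholder location of the unpacked SE files:
ensure_on_path("/python/code/other_peoples_stuff")

# import SE  # would now be found if a file named SE.py lives in that directory
```

Changes to sys.path made this way last only for the running interpreter; they have to be repeated in every program that needs the directory.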
Aug 25 '06 #10
DH
Yes, I know how to import modules... I think I found the problem: Linux
treats upper and lower case as different, so for some reason you can't
import SE. If you rename it to se, you get an error that it
can't find SEL, and if you rename that too, it complains that SEL isn't
defined... Are you running Linux? Have you tested it on Linux?
Surely you write your own programs. (program_name.py). You import and run them. You may put SE.PY and SEL.PY into the same
directory. That's all.
Or if you prefer to keep other people's stuff in a different directory, just make sure that directory is in "sys.path",
because that is where import looks. Check for that directory's presence in the sys.path list:
>sys.path
['C:\\Python24\\Lib\\idlelib', 'C:\\', 'C:\\PYTHON24\\DLLs', 'C:\\PYTHON24\\lib', 'C:\\PYTHON24\\lib\\plat-win',
'C:\\PYTHON24\\lib\\lib-tk' (... etc) ]

Supposing it isn't there, add it:
>sys.path.append ('/python/code/other_peoples_stuff')
import SE

That should do it. Let me know if it works. Else just keep asking.

Frederic
----- Original Message -----
From: "DH" <dy*********@gmail.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Friday, August 25, 2006 4:40 AM
Subject: Re: Taking data from a text file to parse html page

SE looks very helpful... I'm having a hell of a time installing it
though:

-----------------------------------------------------------------------------------------

foo@foo:~/Desktop/SE-2.2$ sudo python SETUP.PY install
running install
running build
running build_py
file SEL.py (for module SEL) not found
file SE.py (for module SE) not found
file SEL.py (for module SEL) not found
file SE.py (for module SE) not found

------------------------------------------------------------------------------------------
Anthra Norell wrote:
You may also want to look at this stream editor:
>
http://cheeseshop.python.org/pypi/SE/2.2%20beta
>
It allows multiple replacements in a definition format of utmost simplicity:
>
>your_example = '''
<div><p><em>&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;</em></p>
<p>-- Peter Norvig, <a class="reference"
'''
>import SE
>Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
''')
>print Tag_Stripper (your_example)
>
&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;
-- Peter Norvig, <a class="reference"
>
Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):
>
Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes commentsentirely even if they nest tags
"<a class\="reference"=" # *** This deletes the fragment
# "-- Peter Norvig, <a class\="reference"=" # Or like this if Peter Norvig has to go too
''')
>print Tag_Stripper (your_example)
>
&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;
-- Peter Norvig,
>
&quot; you can either translate or delete:
>
Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes commentsentirely even if they nest tags
"<a class\="reference"=" # This deletes the fragment
# "-- Peter Norvig, <a class=\\"reference\\"=" # Or like this if Peter Norvig has to go too
htm2iso.se # This is a file (contained in the SE package that translates all ampersand codes.
# Naming the file is all you need to do to include the replacements which it defines.
''')
>
>print Tag_Stripper (your_example)
>
'Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
'
-- Peter Norvig,
>
If instead of "htm2iso.se" you write "&quot;=" you delete it and your output will be:
>
Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
>
-- Peter Norvig,
>
Your Tag_Stripper also does files:
>
>print Tag_Stripper ('my_file.htm', 'my_file_without_tags')
'my_file_without_tags'
>
>
A stream editor is not a substitute for a parser. It does handle more economically simple translation jobs like this one where a
parser does a lot of work which you don't need.
>
Regards
>
Frederic
>
>
----- Original Message -----
From: "DH" <dy*********@gmail.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Thursday, August 24, 2006 7:41 PM
Subject: Re: Taking data from a text file to parse html page
>
>
I found this

>
http://groups.google.com/group/comp....ce+text+file&r
num=8#ad0ac6b1ac8cff51

Credit Jeremy Moles
-----------------------------------------------

finds = ("{", "}", "(", ")")
lines = file("foo.txt", "r").readlines()

for line in lines:
for find in finds:
if find in line:
line.replace(find, "")

print lines

-----------------------------------------------

I want something like
-----------------------------------------------

finds = file("replace.txt")
lines = file("foo.txt", "r").readlines()

for line in lines:
for find in finds:
if find in line:
line.replace(find, "")

print lines

-----------------------------------------------



Fredrik Lundh wrote:
DH wrote:
>
I have a plain text file containing the html and words that I want
removed(keywords) from the html file, after processing the html file it
would save it as a plain text file.

So the program would import the keywords, remove them from the html
file and save the html file as something.txt.

I would post the data but it's secret. I can post an example:

index.html (html page)

"
<div><p><em>&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;</em></p>
<p>-- Peter Norvig, <a class="reference"
"

replace.txt (keywords)
"
<div id="quote" class="homepage-box">

<div><p><em>&quot;

&quot;</em></p>

<p>-- Peter Norvig, <a class="reference"

"

something.txt(file after editing)

"

Python has been an important part of Google since the beginning, and
remains so as the system grows and evolves.
"

reading and writing files is described in the tutorial; see

http://pytut.infogami.com/node9.html

(scroll down to "Reading and Writing Files")

to do the replacement, you can use repeated calls to the "replace" method

http://pyref.infogami.com/str.replace

but that may cause problems if the replacement text contains things that
should be replaced. for an efficient way to do a "parallel" replace, see:

http://effbot.org/zone/python-replace.htm#multiple

</F>
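
A "parallel" replace of the kind mentioned above can be sketched with a single regular expression that matches any of the keywords in one pass. This is only an illustration under assumptions, not the code from the linked page: the function name remove_keywords and the sample strings (shortened from the index.html example) are made up here. Keywords are tried longest first so that a short keyword cannot pre-empt a longer one that starts with the same characters.

```python
import re

def remove_keywords(text, keywords):
    """Remove every keyword from text in a single pass (hypothetical helper)."""
    # Drop empty entries and try longer keywords first.
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    if not ordered:
        return text
    # re.escape makes characters such as "(" or "." match literally.
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return pattern.sub("", text)

html = '<div><p><em>&quot;Python has been an important part of Google&quot;</em></p>'
keywords = ['<div><p><em>&quot;', '&quot;</em></p>']
print(remove_keywords(html, keywords))
```

Because the text is scanned only once, nothing that has already been processed is looked at again, which is the problem repeated calls to "replace" can run into.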

--
http://mail.python.org/mailman/listinfo/python-list
Aug 26 '06 #11
No, I am not running Linux to any extent. But I am very strict about case. There is not a single instance of "se.py" or "sel.py"
anywhere on my system. You'll have to find out where lower case sneaks in on yours. The zip file preserves case, and in the zip file
the names are upper case. I am baffled. But I believe that an import tripping up on the wrong case can't be a hard nut to crack.

Frederic

----- Original Message -----
From: "DH" <dy*********@gmail.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Saturday, August 26, 2006 5:47 AM
Subject: Re: Taking data from a text file to parse html page

Yes, I know how to import modules... I think I found the problem: Linux
treats upper and lower case in file names as different, so for some reason
you can't import SE, but if you rename it to se it gives you the error that
it can't find SEL, which, if you rename it too, will complain that SEL isn't
defined... Are you running Linux? Have you tested it with Linux?

Aug 26 '06 #12
Anthra Norell wrote:
No, I am not running Linux to any extent. But I am very strict about case. There is not a single instance of "se.py" or "sel.py"
anywhere on my system. You'll have to find out where lower case sneaks in on yours. The zip file preserves case, and in the zip file
the names are upper case. I am baffled. But I believe that an import tripping up on the wrong case can't be a hard nut to crack.

The problem is the extension:

SE.py is acceptable, while SE.PY is not.

Georg
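
Following that diagnosis, a small script can normalise the extensions after unpacking. This is a rough sketch under assumptions: the helper name fix_module_extensions is invented here, and the directory you pass in is wherever the SE files were unzipped. It renames files such as SE.PY to SE.py and leaves alone any file whose correctly named twin already exists.

```python
import os

def fix_module_extensions(directory):
    """Rename files such as SE.PY to SE.py and return the new names."""
    names = set(os.listdir(directory))
    renamed = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        new_name = stem + ".py"
        # Only touch files whose extension is ".py" in some other casing,
        # and never overwrite a correctly named file that already exists.
        if ext.lower() == ".py" and ext != ".py" and new_name not in names:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            names.add(new_name)
            renamed.append(new_name)
    return renamed

# Example (hypothetical path): fix_module_extensions("/path/to/unzipped/SE")
```

The stem of the file name is left untouched, so SE and SEL keep their upper-case names and "import SE" should then find SE.py.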
Aug 26 '06 #13
Yes! It just occurred to me that this could be the problem. I have to change that. Thanks for the hint.

Frederic
----- Original Message -----
From: "Georg Brandl" <g.*************@gmx.net>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Saturday, August 26, 2006 1:59 PM
Subject: Re: Taking data from a text file to parse html page

Anthra Norell wrote:
No, I am not running Linux to any extent. But I am very strict about case. There is not a single instance of "se.py" or "sel.py"
anywhere on my system. You'll have to find out where lower case sneaks in on yours. The zip file preserves case, and in the zip file
the names are upper case. I am baffled. But I believe that an import tripping up on the wrong case can't be a hard nut to crack.

The problem is the extension:

SE.py is acceptable, while SE.PY is not.

Georg
Aug 26 '06 #14
