Removing comments... tokenize error

While analysing a very big application (pysol) made of almost
100 source files, I needed to remove the comments.

Removing the comments that take up a whole line is straightforward...

For the embedded (end-of-line) comments I used the tokenize module instead.

To my surprise the analysed output is different from the input
(the last tuple element should exactly replicate the input line).
The error shows up at a triple-quoted string.
I don't know if this has already been corrected (I use Python 2.3)
or perhaps it is a mistake on my part...

Below is the script I use to reproduce the strange behaviour:

import tokenize

Input = "pippo1"
Output = "pippo2"

f = open(Input)
fOut = open(Output, "w")

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    if nLastLine != (i[2])[0]:    # the 3rd element of the tuple is
        nLastLine = (i[2])[0]     # (startingRow, startingCol)
        fOut.write(i[4])

f.close()
fOut.close()

The file to be used (pippo1) contains an extract:

class SelectDialogTreeData:
    img = None
    def __init__(self):
        self.tree_xview = (0.0, 1.0)
        self.tree_yview = (0.0, 1.0)
        if self.img is None:
            SelectDialogTreeData.img = (makeImage(dither=0, data="""
R0lGODlhEAAOAPIFAAAAAICAgMDAwP//AP///4AAAAAAAAAAACH5BAEAAAUALAAAAAAQAA4AAAOL
WLrcGxA6FoYYYoRZwhCDMAhDFCkBoa6sGgBFQAzCIAzCIAzCEACFAEEwEAwEA8FAMBAEAIUAYSAY
CAaCgWAgGAQAhQBBMBAMBAPBQDAQBACFAGEgGAgGgoFgIBgEAAUBBAIDAgMCAwIDAgMCAQAFAQQD
AgMCAwIDAgMCAwEABSaiogAKAKeoqakFCQA7"""), makeImage(dither=0, data="""
R0lGODlhEAAOAPIFAAAAAICAgMDAwP//AP///4AAAAAAAAAAACH5BAEAAAUALAAAAAAQAA4AAAN3
WLrcHBA6Foi1YZZAxBCDQESREhCDMAiDcFkBUASEMAiDMAiDMAgBAGlIGgQAgZeSEAAIAoAAQTAQ
DAQDwUAwAEAAhQBBMBAMBAPBQBAABACFAGEgGAgGgoFgIAAEAAoBBAMCAwIDAgMCAwEAAApERI4L
jpWWlgkAOw=="""), makeImage(dither=0, data="""
R0lGODdhEAAOAPIAAAAAAAAAgICAgMDAwP///wAAAAAAAAAAACwAAAAAEAAOAAADTii63DowyiiA
GCHrnQUQAxcQAAEQgAAIg+MCwkDMdD0LgDDUQG8LAMGg1gPYBADBgFbs1QQAwYDWBNQEAMHABrAR
BADBwOsVAFzoqlqdAAA7"""), makeImage(dither=0, data="""
R0lGODdhEAAOAPIAAAAAAAAAgICAgMDAwP8AAP///wAAAAAAACwAAAAAEAAOAAADVCi63DowyiiA
GCHrnQUQAxcUQAEUgAAIg+MCwlDMdD0LgDDQBE3UAoBgUCMUCDYBQDCwEWwFAUAwqBEKBJsAIBjQ
CDRCTQAQDKBQAcDFBrjf8Lg7AQA7"""))

The output of tokenize (pippo2) gives instead:

class SelectDialogTreeData:
    img = None
    def __init__(self):
        self.tree_xview = (0.0, 1.0)
        self.tree_yview = (0.0, 1.0)
        if self.img is None:
            SelectDialogTreeData.img = (makeImage(dither=0, data="""
AgMCAwIDAgMCAwEABSaiogAKAKeoqakFCQA7"""), makeImage(dither=0, data="""
jpWWlgkAOw=="""), makeImage(dither=0, data="""
BADBwOsVAFzoqlqdAAA7"""), makeImage(dither=0, data="""
CDRCTQAQDKBQAcDFBrjf8Lg7AQA7"""))

.... with a big difference! Why?
Jul 18 '05 #1
"qwweeeit" <qw******@yahoo.it> wrote:
I don't know if this has already been corrected (I use Python 2.3)
or perhaps is a mistake on my part...
it's a mistake on your part. adding a print statement to the
for-loop might help you figure it out:
nLastLine=0
for i in tokenize.generate_tokens(f.readline):
    print i
    if nLastLine != (i[2])[0]: # the 3rd element of the tuple is
        nLastLine = (i[2])[0]  # (startingRow, startingCol)
        fOut.write(i[4])


(hints: what happens if a token spans multiple lines? and how does
the tokenize module deal with comments?)
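
A quick way to see both hints in action (a minimal sketch, using the same
Python 2 tokenize API as above): print each token's type, start/end rows
and text for a tiny source that contains a comment-only line and a
triple-quoted string.

import StringIO
import tokenize

src = '# a comment\nx = 1\ns = """two\nlines"""\n'
for tok in tokenize.generate_tokens(StringIO.StringIO(src).readline):
    toktype, toktext, start, end, line = tok
    print tokenize.tok_name[toktype], start, end, repr(toktext)
# the triple-quoted string arrives as ONE token whose start and end rows
# differ and whose 5th element carries both physical lines; the comment
# line arrives as a COMMENT token followed by an NL token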

</F>

Jul 18 '05 #2
Thanks! If you answer my posts one more time I could consider you my
tutor...

It would have been strange to have found a real bug...! In any case I will
not go deeper into the matter, because your explanation is enough for me.
I corrected the problem by hand, removing the tokens spanning multiple
lines (there were only 8 cases...).

I haven't understood your hint about comments, though...
I succeeded in writing a Python script which removes comments.

Here it is (in all its cumbersome and cryptic appearance!...):

# removeCommentsTok.py
import tokenize
Input = "pippo1"
Output = "pippo2"
f = open(Input)
fOut = open(Output, "w")

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    if i[0] == 52 and nLastLine != (i[2])[0]:
        fOut.write((i[4].replace(i[1], '')).rstrip() + '\n')
        nLastLine = (i[2])[0]
    elif i[0] == 4 and nLastLine != (i[2])[0]:
        fOut.write((i[4]))
        nLastLine = (i[2])[0]
f.close()
fOut.close()

Some explanations for guys like me...:
- 52 and 4 are the numeric codes for COMMENT and NEWLINE tokens respectively
- the comment removal is done by clearing the comment text (i[1]) out of
the input line (i[4])
- I also right-trimmed the line to get rid of the remaining blanks.
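
The same loop can be written with the symbolic names from the tokenize and
token modules instead of the magic numbers (a sketch; 52 and 4 are the
values behind tokenize.COMMENT and token.NEWLINE in the poster's Python,
and the names stay valid even when the numbers change between versions):

import token, tokenize

f = open("pippo1")
fOut = open("pippo2", "w")

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    if i[0] == tokenize.COMMENT and nLastLine != i[2][0]:
        # blank out the comment text inside the physical line
        fOut.write(i[4].replace(i[1], '').rstrip() + '\n')
        nLastLine = i[2][0]
    elif i[0] == token.NEWLINE and nLastLine != i[2][0]:
        fOut.write(i[4])
        nLastLine = i[2][0]
f.close()
fOut.close()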
Jul 18 '05 #3
qwweeeit wrote:
I haven't understood your hint about comments, though...
I succeeded in writing a Python script which removes comments.
[snip]

Tokenizer sends multiline strings and comments as a single token.

######################################################################
# python comment and whitespace stripper :)
######################################################################

import keyword, os, sys, traceback
import StringIO
import token, tokenize
__credits__ = 'just another tool that I needed'
__version__ = '.7'
__author__ = 'M.E.Farmer'
__date__ = 'Jan 15 2005, Oct 24 2004'

######################################################################

class Stripper:
"""python comment and whitespace stripper :)
"""
def __init__(self, raw):
self.raw = raw

def format(self, out=sys.stdout, comments=0, spaces=1,
untabify=1, eol='unix'):
''' strip comments, strip extra whitespace,
convert EOL's from Python code.
'''
# Store line offsets in self.lines
self.lines = [0, 0]
pos = 0
# Strips the first blank line if 1
self.lasttoken = 1
self.temp = StringIO.StringIO()
self.spaces = spaces
self.comments = comments

if untabify:
self.raw = self.raw.expandtabs()
self.raw = self.raw.rstrip()+' '
self.out = out

self.raw = self.raw.replace('\r\n', '\n')
self.raw = self.raw.replace('\r', '\n')
self.lineend = '\n'

# Gather lines
while 1:
pos = self.raw.find(self.lineend, pos) + 1
if not pos: break
self.lines.append(pos)

self.lines.append(len(self.raw))
# Wrap text in a filelike object
self.pos = 0

text = StringIO.StringIO(self.raw)

# Parse the source.
## Tokenize calls the __call__
## function for each token till done.
try:
tokenize.tokenize(text.readline, self)
except tokenize.TokenError, ex:
traceback.print_exc()

# Ok now we write it to a file
# but we also need to clean the whitespace
# between the lines and at the ends.
self.temp.seek(0)

# Mac CR
if eol == 'mac':
self.lineend = '\r'
# Windows CR LF
elif eol == 'win':
self.lineend = '\r\n'
# Unix LF
else:
self.lineend = '\n'

for line in self.temp.readlines():
if spaces == -1:
self.out.write(line.rstrip()+self.lineend)
else:
if not line.isspace():
self.lasttoken=0
self.out.write(line.rstrip()+self.lineend)
else:
self.lasttoken+=1
if self.lasttoken<=self.spaces and self.spaces:
self.out.write(self.lineend)
def __call__(self, toktype, toktext,
(srow,scol), (erow,ecol), line):
''' Token handler.
'''
# calculate new positions
oldpos = self.pos
newpos = self.lines[srow] + scol
self.pos = newpos + len(toktext)

#kill the comments
if not self.comments:
# Kill the comments ?
if toktype == tokenize.COMMENT:
return

# handle newlines
if toktype in [token.NEWLINE, tokenize.NL]:
self.temp.write(self.lineend)
return

# send the original whitespace, if needed
if newpos > oldpos:
self.temp.write(self.raw[oldpos:newpos])

# skip indenting tokens
if toktype in [token.INDENT, token.DEDENT]:
self.pos = newpos
return

# send text to the temp file
self.temp.write(toktext)
return
######################################################################

def Main():
import sys
if sys.argv[1]:
filein = open(sys.argv[1]).read()
Stripper(filein).format(out=sys.stdout, comments=1, untabify=1,
eol='win')

######################################################################

if __name__ == '__main__':
Main()

M.E.Farmer

Jul 18 '05 #4
My code, besides being cumbersome and cryptic, has another quality:
it is buggy!
I apologize for that; obviously I discovered it after posting (in the
best tradition of Murphy's law!).
When I find the solution I will let you know, even though the problem is
made harder by the fact that the for loop is indexed by a 5-element
tuple, which is not very easy (at least for me!...).
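
For what it is worth, that 5-element tuple can be unpacked directly in the
for statement, which makes the indexing much easier to follow (a sketch of
the same kind of loop, assuming the pippo1 input file from the first post):

import tokenize

f = open("pippo1")
nLastLine = 0
for toktype, toktext, (srow, scol), (erow, ecol), line in \
        tokenize.generate_tokens(f.readline):
    if nLastLine != srow:
        nLastLine = srow
        print repr(line)   # the physical line(s) the token came from
f.close()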
Jul 18 '05 #5
Hi,
I no longer need to correct my code's bugs and send a working application
to the c.l.p. group (I don't think there was an eager expectation...).
Your code works perfectly (as you would expect from a guru...).
Thank you and bye.
Jul 19 '05 #6
Hi,

At last I succeeded in implementing a cross-reference tool!
(with your help and that of other gurus...).
Now I can face the problem (for me...) of understanding your
code (I have not yet grasped classes and objects...).

Here is a brief example of the xref output (taken from your code, even
if the line numbers don't match, because I modified your code, not
being interested in EOLs other than Linux's).

and         076 if self.lasttoken<=self.spaces and self.spaces:
append      046 self.lines.append(pos)
append      048 self.lines.append(len(self.raw))
argv        116 if sys.argv[1]:
argv        117 filein = open(sys.argv[1]).read()
__author__  010 __author__ = s_
break       045 if not pos: break
__call__    080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
class       015 class Stripper:
COMMENT     092 if toktype == tokenize.COMMENT:
comments    021 def format(self, out=sys.stdout, comments=0, spaces=1, untabify=1):
comments    033 self.comments = comments
comments    090 if not self.comments:
comments    118 Stripper(filein).format(out=sys.stdout, comments=0, untabify=1)
__credits__ 008 __credits__ = s_
__date__    011 __date__ = s_
DEDENT      105 if toktype in [token.INDENT, token.DEDENT]:
def         018 def __init__(self, raw):
def         021 def format(self, out=sys.stdout, comments=0, spaces=1, untabify=1):
def         080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
def         114 def Main():
ecol        080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
erow        080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
ex          059 except tokenize.TokenError, ex:
except      059 except tokenize.TokenError, ex:
expandtabs  036 self.raw = self.raw.expandtabs()
filein      117 filein = open(sys.argv[1]).read()
filein      118 Stripper(filein).format(out=sys.stdout, comments=0, untabify=1)
find        044 pos = self.raw.find(self.lineend, pos) + 1
format      021 def format(self, out=sys.stdout, comments=0, spaces=1, untabify=1):
format      118 Stripper(filein).format(out=sys.stdout, comments=0, untabify=1)
import      005 import keyword, os, sys, traceback
import      006 import StringIO
import      007 import token, tokenize
import      115 import sys
INDENT      105 if toktype in [token.INDENT, token.DEDENT]:
__init__    018 def __init__(self, raw):
isspace     071 if not line.isspace():
keyword     005 import keyword, os, sys, traceback
lasttoken   030 self.lasttoken = 1
lasttoken   072 self.lasttoken=0
lasttoken   075 self.lasttoken+=1
lasttoken   076 if self.lasttoken<=self.spaces and self.spaces:
....

To obtain this output, you must remove comments and empty lines, and move
the strings into a db file, leaving s_ as a placeholder for normal strings
and m_ for triple-quoted strings.
See an example:

m_ """python comment and whitespace stripper :)""" #016
m_ ''' strip comments, strip extra whitespace, convert EOL's from
Python
code.'''#023
m_ ''' Token handler.''' #082

s_ 'just another tool that I needed' |008 __credits__ = 'just another
tool
that I needed'
s_ '.7' |009 __version__ = '.7'
s_ 'M.E.Farmer' |010 __author__ = 'M.E.Farmer'
s_ 'Jan 15 2005, Oct 24 2004' |011 __date__ = 'Jan 15 2005, Oct 24
2004'
s_ ' ' |037 self.raw = self.raw.rstrip()+'
'
s_ '\n' |040 self.lineend = '\n'
s_ '__main__' |122 if __name__ == '__main__':

I think that this tool is very useful.

Bye
Jul 19 '05 #7
Glad you are making progress ;)
> Here is a brief example of the xref output (taken from your code, even
> if the line numbers don't match, because I modified your code, not
> being interested in EOLs other than Linux's).


What happens when you try to analyze a script from a different OS? It
usually looks like a skewed mess; that is why I have added EOL
conversion, so it is painless for you to convert to your EOL of choice.
The code I posted consists of a class and a Main function.
The class has three methods.
__init__ is called by Python when you create an instance of the class
Stripper. All __init__ does here is set an instance attribute, self.raw.
format is called explicitly with a few arguments to start the
tokenizer.
__call__ is special; it is not easy to grasp how this even works... at
first.
In Python, when you treat an instance like a function, Python invokes
the __call__ method of that instance, if present and if it is callable().
example:
    try:
        tokenize.tokenize(text.readline, self)
    except tokenize.TokenError, ex:
        traceback.print_exc()
The snippet above is from the Stripper class.
Notice that tokenize.tokenize is being fed a reference to self (if
this code is running, self is an instance of Stripper).
tokenize.tokenize is really a hidden loop.
Each token generated is sent to self as five parts: toktype, toktext,
(startrow,startcol), (endrow,endcol), and line. Self is callable and
has a __call__ method, so tokenize really sends the five-part
info to __call__ for every token.
If this was obvious then ignore it ;)
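
A stripped-down version of that mechanism, reduced to a few lines (a
sketch, not taken from the posted script): an instance whose __call__
method is handed to tokenize.tokenize, which then calls it once per token.

import StringIO
import tokenize

class Printer:
    def __call__(self, toktype, toktext, (srow, scol), (erow, ecol), line):
        # tokenize.tokenize invokes this once for every token it produces
        print tokenize.tok_name[toktype], (srow, scol), (erow, ecol), repr(toktext)

src = StringIO.StringIO("x = 1  # hello\n")
tokenize.tokenize(src.readline, Printer())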

M.E.Farmer

Jul 19 '05 #8

Great tool, indeed! But doc strings stay in the source text.

If you do need to remove doc strings as well, add the following into
the __call__ method.

....    # kill doc strings
....    if not self.docstrings:
....        if toktype == tokenize.STRING and len(toktext) >= 6:
....            t = toktext.lstrip('rRuU')
....            if ((t.startswith("'''") and t.endswith("'''")) or
....                (t.startswith('"""') and t.endswith('"""'))):
....                return

as shown in the original post below. Also, set self.docstrings in the
format method, similar to self.comments as shown below in lines
starting with '...'.
/Jean Brouwers

M.E.Farmer wrote:
[snip]
Tokenizer sends multiline strings and comments as a single token.

######################################################################
# python comment and whitespace stripper :)
######################################################################
import keyword, os, sys, traceback
import StringIO
import token, tokenize
__credits__ = 'just another tool that I needed'
__version__ = '.7'
__author__ = 'M.E.Farmer'
__date__ = 'Jan 15 2005, Oct 24 2004'

######################################################################
class Stripper:
"""python comment and whitespace stripper :)
"""
def __init__(self, raw):
self.raw = raw
.... def format(self, out=sys.stdout, comments=0, docstrings=0,
spaces=1, untabify=1, eol='unix'):
''' strip comments, strip extra whitespace,
convert EOL's from Python code.
'''
# Store line offsets in self.lines
self.lines = [0, 0]
pos = 0
# Strips the first blank line if 1
self.lasttoken = 1
self.temp = StringIO.StringIO()
self.spaces = spaces
self.comments = comments
....    self.docstrings = docstrings
if untabify:
self.raw = self.raw.expandtabs()
self.raw = self.raw.rstrip()+' '
self.out = out

self.raw = self.raw.replace('\r\n', '\n')
self.raw = self.raw.replace('\r', '\n')
self.lineend = '\n'

# Gather lines
while 1:
pos = self.raw.find(self.lineend, pos) + 1
if not pos: break
self.lines.append(pos)

self.lines.append(len(self.raw))
# Wrap text in a filelike object
self.pos = 0

text = StringIO.StringIO(self.raw)

# Parse the source.
## Tokenize calls the __call__
## function for each token till done.
try:
tokenize.tokenize(text.readline, self)
except tokenize.TokenError, ex:
traceback.print_exc()

# Ok now we write it to a file
# but we also need to clean the whitespace
# between the lines and at the ends.
self.temp.seek(0)

# Mac CR
if eol == 'mac':
self.lineend = '\r'
# Windows CR LF
elif eol == 'win':
self.lineend = '\r\n'
# Unix LF
else:
self.lineend = '\n'

for line in self.temp.readlines():
if spaces == -1:
self.out.write(line.rstrip()+self.lineend)
else:
if not line.isspace():
self.lasttoken=0
self.out.write(line.rstrip()+self.lineend)
else:
self.lasttoken+=1
if self.lasttoken<=self.spaces and self.spaces:
self.out.write(self.lineend)
def __call__(self, toktype, toktext,
(srow,scol), (erow,ecol), line):
''' Token handler.
'''
# calculate new positions
oldpos = self.pos
newpos = self.lines[srow] + scol
self.pos = newpos + len(toktext)

#kill the comments
if not self.comments:
# Kill the comments ?
if toktype == tokenize.COMMENT:
return
.... # kill doc strings
.... if not self.docstrings:
.... if toktype == tokenize.STRING and len(toktext) >= 6:
.... t = toktext.lstrip('rRuU')
.... if ((t.startswith("'''") and t.endswith("'''")) or
.... (t.startswith('"""') and t.endswith('"""'))):
.... return
# handle newlines
if toktype in [token.NEWLINE, tokenize.NL]:
self.temp.write(self.lineend)
return

# send the original whitespace, if needed
if newpos > oldpos:
self.temp.write(self.raw[oldpos:newpos])

# skip indenting tokens
if toktype in [token.INDENT, token.DEDENT]:
self.pos = newpos
return

# send text to the temp file
self.temp.write(toktext)
return
######################################################################
def Main():
import sys
if sys.argv[1]:
filein = open(sys.argv[1]).read()
Stripper(filein).format(out=sys.stdout, comments=1, untabify=1, eol='win')

######################################################################
if __name__ == '__main__':
Main()

M.E.Farmer


Jul 19 '05 #9
Hi,

Importing a text file from another OS is not a problem: I convert
it immediately using the powerful shell functions of Linux (and Unix).

I thank you for the explanation about classes, but I am rather dumb,
and so far I have resolved all my problems without them...
Speaking of problems..., I still have an error in parsing literal
strings, when there is more than one literal string per source line.

Perhaps it's time to use classes...
Jul 19 '05 #10
Thanks Jean,
I have thought about adding docstrings several times, but I was stumped
at how to tell a docstring from a regular triple-quoted string ;)
I have been thinking hard about the problem and I think I have an idea:
if the line has nothing before the start of the string, it must be a
docstring.
Sounds simple enough, but in Python there are 12 or so 'types' of
strings.
Here is my crack at it; feel free to improve it ;)
I reversed the logic on the comments and docstrings so I could add a
special mode to docstring stripping: pep8 mode.
Pep8 mode only strips double-quoted triple quotes from your source code,
leaving the offending single-quoted triple quotes behind. Probably just
stupid, but someone might find it useful.
######################################################################
# Python source stripper
######################################################################

import os
import sys
import token
import keyword
import StringIO
import tokenize
import traceback
__credits__ = '''
Jürgen Hermann
M.E.Farmer
Jean Brouwers
'''
__version__ = '.8'
__author__ = 'M.E.Farmer'
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004' \
######################################################################

class Stripper:
"""Python source stripper
"""
def __init__(self, raw):
self.raw = raw

def format(self, out=sys.stdout, comments=0, docstrings=0,
spaces=1, untabify=1, eol='unix'):
""" strip comments,
strip docstrings,
strip extra whitespace and lines,
convert tabs to spaces,
convert EOL's in Python code.
"""
# Store line offsets in self.lines
self.lines = [0, 0]
pos = 0
# Strips the first blank line if 1
self.lasttoken = 1
self.temp = StringIO.StringIO()
self.spaces = spaces
self.comments = comments
self.docstrings = docstrings

if untabify:
self.raw = self.raw.expandtabs()
self.raw = self.raw.rstrip()+' '
self.out = out

# Have you ever had a multiple line ending script?
# They can be nasty so lets get them all the same.
self.raw = self.raw.replace('\r\n', '\n')
self.raw = self.raw.replace('\r', '\n')
self.lineend = '\n'

# Gather lines
while 1:
pos = self.raw.find(self.lineend, pos) + 1
if not pos: break
self.lines.append(pos)

self.lines.append(len(self.raw))
self.pos = 0

# Wrap text in a filelike object
text = StringIO.StringIO(self.raw)

# Parse the source.
## Tokenize calls the __call__
## method for each token till done.
try:
tokenize.tokenize(text.readline, self)
except tokenize.TokenError, ex:
traceback.print_exc()

# Ok now we write it to a file
# but we also need to clean the whitespace
# between the lines and at the ends.
self.temp.seek(0)

# All this should be written into the
# __call__ method just haven't yet...

# Mac CR
if eol == 'mac':
self.lineend = '\r'
# Windows CR LF
elif eol == 'win':
self.lineend = '\r\n'
# Unix LF
else:
self.lineend = '\n'

for line in self.temp.readlines():
if spaces == -1:
self.out.write(line.rstrip()+self.lineend)
else:
if not line.isspace():
self.lasttoken=0
self.out.write(line.rstrip()+self.lineend)
else:
self.lasttoken+=1
if self.lasttoken<=self.spaces and self.spaces:
self.out.write(self.lineend)

def __call__(self, toktype, toktext,
(srow,scol), (erow,ecol), line):
""" Token handler.
"""
# calculate new positions
oldpos = self.pos
newpos = self.lines[srow] + scol
self.pos = newpos + len(toktext)

# kill comments
if self.comments:
if toktype == tokenize.COMMENT:
return

# kill doc strings
if self.docstrings:
# Assume if there is nothing on the
# left side it must be a docstring
if toktype == tokenize.STRING and \
line.lstrip(' rRuU')[0] in ["'",'"']:
t = toktext.lstrip('rRuU')
if (t.startswith('"""') and
(self.docstrings == 'pep8' or
self.docstrings =='8')):
return
elif t.startswith('"""') or t.startswith("'''"):
return

# handle newlines
if toktype in [token.NEWLINE, tokenize.NL]:
self.temp.write(self.lineend)
return

# send the original whitespace
if newpos > oldpos:
self.temp.write(self.raw[oldpos:newpos])

# skip indenting tokens
if toktype in [token.INDENT, token.DEDENT]:
self.pos = newpos
return

# send text to the temp file
self.temp.write(toktext)
return
######################################################################

def Main():
import sys
if sys.argv[1]:
filein = open(sys.argv[1]).read()
Stripper(filein).format(out=sys.stdout,
comments=0, docstrings=1, untabify=1, eol='win')
######################################################################

if __name__ == '__main__':
Main()

Jul 19 '05 #11
There is an issue with both my and your code: it only works if doc
strings are triple quoted and if there are no other triple quoted
strings in the Python code.

A triple quoted string used in an assignment will be removed, for
example this case

s = '''this string should not be removed'''
It is still unclear how to distinguish doc strings from other strings.
Also, I have not checked the precise Python syntax, but doc strings do
not need to be enclosed in triple quotes; a single-quoted string may be
allowed too.

Maybe this rule will work: a doc string is any string preceded by a
COLON token followed by zero, one or more INDENT or NEWLINE tokens.
Untested!
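
A quick way to eyeball that rule (a sketch, using the same Python 2
tokenize API as the posted code): print the token stream for a def with a
docstring and for a plain string assignment.

import StringIO
import tokenize

src = "def f():\n    '''a docstring'''\n\ns = '''not a docstring'''\n"
for tok in tokenize.generate_tokens(StringIO.StringIO(src).readline):
    print tokenize.tok_name[tok[0]], repr(tok[1])
# the docstring is preceded by OP ':', NEWLINE and INDENT tokens;
# the assignment's string is preceded by NAME and OP '=' instead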

/Jean Brouwers



Jul 19 '05 #12
MrJean1 wrote:
> There is an issue with both my and your code: it only works if doc
> strings are triple quoted and if there are no other triple quoted
> strings in the Python code.

I had not considered single quoted strings ;)

> A triple quoted string used in an assignment will be removed, for
> example this case
>
> s = '''this string should not be removed'''
>
> It is still unclear how to distinguish doc strings from other strings.
> Also, I have not checked the precise Python syntax, but doc strings do
> not need to be enclosed in triple quotes; a single-quoted string may be
> allowed too.
>
> Maybe this rule will work: a doc string is any string preceded by a
> COLON token followed by zero, one or more INDENT or NEWLINE tokens.
> Untested!

Not needed; if you reread my post, I explain that I had solved that
issue.
If you use the line argument that tokenizer supplies, we can strip
whitespace and 'rRuU' from the start of the line and look for a single
quote or a double quote.
I have tested it and it works.
Reworked the 'pep8' thing and fixed the bug you mentioned; here are the
changes.
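
Before the full listing, the test described above in isolation (a sketch):
strip blanks and any r/R/u/U prefix characters from the physical line the
token came from, and see whether the first remaining character is a quote.

line1 = '    u"""a docstring"""\n'
line2 = "s = '''not a docstring'''\n"
print line1.lstrip(' rRuU')[0] in ["'", '"']   # True  -> treated as a docstring
print line2.lstrip(' rRuU')[0] in ["'", '"']   # False -> the string is left alone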

######################################################################
# Python source stripper

######################################################################

import os
import sys
import token
import keyword
import StringIO
import tokenize
import traceback
__credits__ = '''
Jürgen Hermann
M.E.Farmer
Jean Brouwers
'''
__version__ = '.8'
__author__ = 'M.E.Farmer'
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004' \

######################################################################

class Stripper:
"""Python source stripper
"""
def __init__(self, raw):
self.raw = raw

def format(self, out=sys.stdout, comments=0, docstrings=0,
spaces=1, untabify=1, eol='unix'):
""" strip comments,
strip docstrings,
strip extra whitespace and lines,
convert tabs to spaces,
convert EOL's in Python code.
"""
# Store line offsets in self.lines
self.lines = [0, 0]
pos = 0
# Strips the first blank line if 1
self.lasttoken = 1
self.temp = StringIO.StringIO()
self.spaces = spaces
self.comments = comments
self.docstrings = docstrings

if untabify:
self.raw = self.raw.expandtabs()
self.raw = self.raw.rstrip()+' '
self.out = out

# Have you ever had a multiple line ending script?
# They can be nasty so lets get them all the same.
self.raw = self.raw.replace('\r\n', '\n')
self.raw = self.raw.replace('\r', '\n')
self.lineend = '\n'

# Gather lines
while 1:
pos = self.raw.find(self.lineend, pos) + 1
if not pos: break
self.lines.append(pos)

self.lines.append(len(self.raw))
self.pos = 0

# Wrap text in a filelike object
text = StringIO.StringIO(self.raw)

# Parse the source.
## Tokenize calls the __call__
## method for each token till done.
try:
tokenize.tokenize(text.readline, self)
except tokenize.TokenError, ex:
traceback.print_exc()

# Ok now we write it to a file
# but we also need to clean the whitespace
# between the lines and at the ends.
self.temp.seek(0)

# All this should be written into the
# __call__ method just haven't yet...

# Mac CR
if eol == 'mac':
self.lineend = '\r'
# Windows CR LF
elif eol == 'win':
self.lineend = '\r\n'
# Unix LF
else:
self.lineend = '\n'

for line in self.temp.readlines():
if spaces == -1:
self.out.write(line.rstrip()+self.lineend)
else:
if not line.isspace():
self.lasttoken=0
self.out.write(line.rstrip()+self.lineend)
else:
self.lasttoken+=1
if self.lasttoken<=self.spaces and self.spaces:
self.out.write(self.lineend)

def __call__(self, toktype, toktext,
(srow,scol), (erow,ecol), line):
""" Token handler.
"""
# calculate new positions
oldpos = self.pos
newpos = self.lines[srow] + scol
self.pos = newpos + len(toktext)

# kill comments
if self.comments:
if toktype == tokenize.COMMENT:
return

# kill doc strings
if self.docstrings:
# Assume if there is nothing on the
# left side it must be a docstring
if toktype == tokenize.STRING and \
line.lstrip(' rRuU')[0] in ["'",'"']:
t = toktext.lstrip('rRuU')
# pep8 frowns on triple single quotes
if ( self.docstrings == 'pep8' or
self.docstrings == 8):
# pep8 frowns on single triples
if not t.startswith('"""'):
return
else:
return

# handle newlines
if toktype in [token.NEWLINE, tokenize.NL]:
self.temp.write(self.lineend)
return

# send the original whitespace
if newpos > oldpos:
self.temp.write(self.raw[oldpos:newpos])

# skip indenting tokens
if toktype in [token.INDENT, token.DEDENT]:
self.pos = newpos
return

# send text to the temp file
self.temp.write(toktext)
return
######################################################################

def Main():
import sys
if sys.argv[1]:
filein = open(sys.argv[1]).read()
Stripper(filein).format(out=sys.stdout,
comments=0, docstrings='pep8', untabify=1, eol='win')
######################################################################

if __name__ == '__main__':
Main()
That should work like a charm for all types of docstrings without
disturbing other strings.

M.E.Farmer

Jul 19 '05 #13
Google has now 'fixed' their whitespace issue and now has an auto-quote
issue, argggh!

The script is located at:
http://bellsouthpwp.net/m/e/mefjr75/python/stripper.py

M.E.Farmer

Jul 19 '05 #14
I found the bug and hope I have squashed it.
Single- and double-quoted strings that were assignments and spanned
multiple lines using \, were chopped after the first line.
example:
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004'
became:
__date__ = 'Apr 16, 2005,' \

Not good :(

tokenizer sends this as:
name
operator
string
string
string
newline
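
That sequence can be reproduced directly (a sketch, same Python 2
tokenize API as the script):

import StringIO
import tokenize

src = ("__date__ = 'Apr 16, 2005,' \\\n"
       "           'Jan 15 2005,' \\\n"
       "           'Oct 24 2004'\n")
for tok in tokenize.generate_tokens(StringIO.StringIO(src).readline):
    print tokenize.tok_name[tok[0]], repr(tok[1])
# prints NAME, OP, STRING, STRING, STRING, NEWLINE (plus a final
# ENDMARKER): the backslash continuations join the physical lines, so no
# NL tokens appear between the strings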

I added a test for string assignments that end in \.
A flag is set and then all strings till a newline are ignored.
Also rearranged the script a little.
Maybe that will do it ...
The updated script is located at:
http://bellsouthpwp.net/m/e/mefjr75/python/stripper.py

M.E.Farmer


Jul 19 '05 #15
Attached is another version of the stripper.py file. It contains my
changes, which seem to handle docstrings correctly (at least on itself).
/Jean Brouwers


######################################################################
# Python source stripper / cleaner ;)
######################################################################

import os
import sys
import token
import keyword
import StringIO
import tokenize
import traceback
__credits__ = \
'''
Jürgen Hermann
M.E.Farmer
Jean Brouwers
'''
__version__ = '.8'
__author__ = 'M.E.Farmer'
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004' \

'''this docstring should be removed
'''

######################################################################

class Stripper:
"""Python source stripper / cleaner
"""
def __init__(self, raw):
self.raw = raw

def format(self, out=sys.stdout, comments=0, docstrings=0,
spaces=1, untabify=1, eol='unix'):
""" strip comments,
strip docstrings,
strip extra whitespace and lines,
convert tabs to spaces,
convert EOL's in Python code.
"""
# Store line offsets in self.lines
self.lines = [0, 0]
pos = 0
self.temp = StringIO.StringIO()
# Strips the first blank line if 1
self.lasttoken = 1
self.spaces = spaces
# 0 = no change, 1 = strip 'em
self.comments = comments # yep even these
# 0 = no change, 1 = strip 'em, 8 or 'pep8' = strip all but """'s
self.docstrings = docstrings

if untabify:
self.raw = self.raw.expandtabs()
self.raw = self.raw.rstrip()+' '
self.out = out

# Have you ever had a multiple line ending script?
# They can be nasty so lets get them all the same.
self.raw = self.raw.replace('\r\n', '\n')
self.raw = self.raw.replace('\r', '\n')
self.lineend = '\n'

# Gather lines
while 1:
pos = self.raw.find(self.lineend, pos) + 1
if not pos: break
self.lines.append(pos)

self.lines.append(len(self.raw))
self.pos = 0
self.lastOP = ''

# Wrap text in a filelike object
text = StringIO.StringIO(self.raw)

# Parse the source.
## Tokenize calls the __call__
## method for each token till done.
try:
tokenize.tokenize(text.readline, self)
except tokenize.TokenError, ex:
traceback.print_exc()

# Ok now we write it to a file
# but we also need to clean the whitespace
# between the lines and at the ends.
self.temp.seek(0)

# All this should be written into the
# __call__ method just haven't yet...

# Mac CR
if eol == 'mac':
self.lineend = '\r'
# Windows CR LF
elif eol == 'win':
self.lineend = '\r\n'
# Unix LF
else:
self.lineend = '\n'

for line in self.temp.readlines():
if spaces == -1:
self.out.write(line.rstrip()+self.lineend)
else:
if not line.isspace():
self.lasttoken=0
self.out.write(line.rstrip()+self.lineend)
else:
self.lasttoken+=1
if self.lasttoken<=self.spaces and self.spaces:
self.out.write(self.lineend)

def __call__(self, toktype, toktext, (srow,scol), (erow,ecol),
line):
""" Token handler.
"""
# calculate new positions
oldpos = self.pos
newpos = self.lines[srow] + scol
self.pos = newpos + len(toktext)

##print "*token: %s text: %r line: %r" % \
##    (token.tok_name[toktype], toktext, line)

# kill comments
if self.comments:
if toktype == tokenize.COMMENT:
return

# kill doc strings
if self.docstrings:
# a STRING must be a docstring
# if the most recent OP was ':'
if toktype == tokenize.STRING and self.lastOP == ':':
# pep8 frowns on triple single quotes
if (self.docstrings == 'pep8' or
self.docstrings == 8):
if not toktext.endswith('"""'):
return
else:
return
elif toktype == token.OP:
# remember most recent OP
self.lastOP = toktext
elif self.lastOP == ':':
# newline and indent are OK inside docstring
if toktype not in [token.NEWLINE, token.INDENT]:
# otherwise the docstring ends
self.lastOP = ''
elif toktype == token.NEWLINE:
# consider any string starting
# on a new line as a docstring
self.lastOP = ':'

# handle newlines
if toktype in [token.NEWLINE, tokenize.NL]:
self.temp.write(self.lineend)
return

# send the original whitespace
if newpos > oldpos:
self.temp.write(self.raw[oldpos:newpos])

# skip indenting tokens
if toktype in [token.INDENT, token.DEDENT]:
self.pos = newpos
return

# send text to the temp file
self.temp.write(toktext)
return
######################################################################

def Main():
import sys
if sys.argv[1]:
filein = open(sys.argv[1]).read()
Stripper(filein).format(out=sys.stdout,
comments=1, docstrings=1, untabify=1, eol='win')
######################################################################

if __name__ == '__main__':
Main()



Jul 19 '05 #16
Hello Jean,
Glad to see you're still playing along.
I have tested your script and it is broken too :(
Good idea about checking for the ':', it just doesn't cover every
case.
This is the very reason I had not included docstring support before!
The problem is more difficult than it first appears,
I am sure you have noticed ;)
Python is fairly flexible in its layout and very dynamic in its
execution.
This can lead to some hard-to-spot and hard-to-remove docstrings.

After staring at the problem for a day or so (for the second time),
*I am still stumped*

######################################################################
# this is a test I have put together for docstrings
######################################################################
"""This is a module doc it should be removed""" \
"This is really nasty but legal" \
'''Dang this is even worse''' + \
'this should be removed'#This is legal too
######################################################################
assignment = \
"""
this should stay
so should this
"""
more_assignment = 'keep me,' \
                  'keep me too,' \
                  'keep me.'

######################################################################
def func():
    'This should be removed' \
    """This should be removed"""
    pass
######################################################################
def funq(d = {'MyPass':
              """This belongs to a dict and should stay"""
              ,'MyOtherPass':
              'Kepp this string %s'\
              "Keep this too" % 42 + """dfgffdgfdgdfg"""}):
    """This docstring is ignored""" + ''' by Python introspection
why?'''
    pass
######################################################################
def Usage():
    """This should be removed but how, removal will break the function.
This should be removed %s """# what do we do here
    return Usage.__doc__% '42'
######################################################################
class Klass:
    u"This should be removed" \
    ''' this too '''
    def __init__(self, num):
        """ This is should be removed but how ? %d """ % num
        return None
    'People do this sometime for a block comment type of thing' \
    "This type of string should be removed also"
    def func2(self):
        r'erase/this\line\sdfdsf\sdf\dfsdf'
        def inner():
            u'''should be removed'''
            return 42
        return inner
######################################################################
u'People do this sometime for a block comment type of thing' \
r"This type of string should be removed also" \
""" and this one too! """
# did I forget anything obvious ?
######################################################################

When the docstring is removed it should also consume the blank line
that is left behind.
Got to go to work, I'll think about it over the weekend.
If anyone else wants to play you are welcome to join.
Can pyparsing do this easily? (Paul probably has a 'six-line' solution
tucked away somewhere ;)
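
One way to cross-check the test cases is to let Python itself report what
it records as docstrings (a sketch; docstring_test.py is a hypothetical
name for a file holding the test module above):

import docstring_test   # hypothetical module containing the test cases

# show which strings Python actually attaches as __doc__
print repr(docstring_test.__doc__)
print repr(docstring_test.func.__doc__)
print repr(docstring_test.funq.__doc__)
print repr(docstring_test.Klass.__doc__)
print repr(docstring_test.Usage.__doc__)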
M.E.Farmer

Jul 19 '05 #17
