Bytes IT Community

remove strings from source

For a Python program I am writing, I need to remove all string
definitions from the source and substitute them with a place-holder.

To make it clearer:
line 45 sVar="this is the string assigned to sVar"
must be converted to:
line 45 sVar=s00001

Each substitution is recorded in a file as:
s00001[line 45]="this is the string assigned to sVar"

For the curious:
I am trying to implement a cross variable reference tool, and the
variability (in length) of the string definitions (especially if
multi-line) can cause display problems.

I need your help in correctly identifying the strings (also treating
the r'xx..' or u'yy...' prefixes as part of the string definition). The
problem is mainly with multi-line definitions or with strings that
embed chr() calls or escape sequences.
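
For concreteness, these are the kinds of literals that make naive
pattern matching hard (illustrative examples only, not taken from any
real source):

a = "a plain string"
b = r'a raw string where \n is not an escape'
c = u'a unicode string'
d = 'embedded "quotes" and an \' escaped quote'
e = """a multi-line
    (triple-quoted) string"""
f = "built at run time from " + chr(65) + " pieces"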
Jul 18 '05 #1
6 Replies


qwweeeit wrote:
For a Python program I am writing, I need to remove all string
definitions from the source and substitute them with a place-holder.
[...]
I need your help in correctly identifying the strings (also treating
the r'xx..' or u'yy...' prefixes as part of the string definition).

Approach this in a test-driven development way. Create sample input and
output files. Write a unit test something like this (below) and run
it. You'll either solve the problem yourself or ask more specific
questions. ;-)

Cheers,

// m

#!/usr/bin/env python
import unittest

def substitute(data):
    # As a first pass, just return the data itself--obviously, this
    # should fail.
    return data

class Test(unittest.TestCase):
    def test(self):
        data = open("input.txt").read()
        expected = open("expected.txt").read()
        actual = substitute(data)
        self.assertEquals(expected, actual)

if __name__ == '__main__':
    unittest.main()
Jul 18 '05 #2

qwweeeit wrote:
I need your help in correctly identifying the strings (also treating
the r'xx..' or u'yy...' prefixes as part of the string definition). The
problem is mainly with multi-line definitions or with strings that
embed chr() calls or escape sequences.


Have a look at tokenize.generate_tokens() in the standard library. That
ought to give you enough information to identify the strings reliably and
output modified source.
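
A minimal sketch of that approach (Python 2, to match the rest of the
thread; the file name 'example.py' is just a placeholder): it prints
every STRING token together with the line it starts on.

import token, tokenize

def list_strings(filename):
    f = open(filename)
    for toktype, toktext, (srow, scol), (erow, ecol), line in \
            tokenize.generate_tokens(f.readline):
        if toktype == token.STRING:
            print 'line %d: %s' % (srow, toktext)
    f.close()

list_strings('example.py')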
Jul 18 '05 #3

qwweeeit wrote:
For a Python program I am writing, I need to remove all string
definitions from the source and substitute them with a place-holder.
[...]
The problem is mainly with multi-line definitions or with strings that
embed chr() calls or escape sequences.


Hello,
I have written a few python parsers before.
Here is my attempt :)
# string_mapper.py
from __future__ import generators  # python 2.2
import keyword, os, sys, traceback
import cStringIO, token, tokenize

def StringNamer(num=0):
    '''This is a name creating generator'''
    while 1:
        num += 1
        stringname = 's' + str(num).zfill(6)
        yield stringname

class ReplaceParser(object):
    """
    filein = open('yourfilehere.py').read()
    replacer = ReplaceParser(filein, out=sys.stdout)
    replacer.format()
    replacer.StringMap
    """

    def __init__(self, raw, out=sys.stdout):
        ''' Store the source text.
        '''
        self.raw = raw.expandtabs().strip()
        self.out = out
        self.StringName = StringNamer()
        self.StringMap = {}

    def format(self):
        ''' Parse and send the source.
        '''
        self.lines = [0, 0]
        pos = 0
        self.temp = cStringIO.StringIO()
        while 1:
            pos = self.raw.find('\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            traceback.print_exc()

    def __call__(self, toktype, toktext, (srow, scol),
                 (erow, ecol), line):
        ''' Token handler.
        '''
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.out.write('\n')
            return
        if newpos > oldpos:
            self.out.write(self.raw[oldpos:newpos])
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return
        if (toktype == token.STRING):
            sname = self.StringName.next()
            self.StringMap[sname] = toktext
            toktext = sname
        self.out.write(toktext)
        self.out.flush()
        return

hth,
M.E.Farmer

Jul 18 '05 #4

Thank you for your suggestion, but it is too complicated for me...
I decided to proceed in steps:
1. Take away all commented lines
2. Rebuild multi-line statements as single lines

I have already written the code (a rough sketch of these two steps is
shown below) and can now face the problem of moving the string
definitions into a database file...
Hopefully I will then build cross-reference tables of the variables.
My project also includes code for building a tree of the functions...
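
A rough line-based sketch of those two steps (not the poster's actual
code, which was never shared): it drops whole-line comments and joins
backslash-continued lines, and it will misfire on '#' inside string
literals, which is exactly why the tokenize-based approach shown above
is more robust.

def preprocess(lines):
    joined = []
    for line in lines:
        line = line.rstrip('\n')
        if joined and joined[-1].endswith('\\'):
            # step 2: join a backslash-continued line onto the previous one
            joined[-1] = joined[-1][:-1] + line.lstrip()
            continue
        if line.lstrip().startswith('#'):
            # step 1: drop whole-line comments
            continue
        joined.append(line)
    return joined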
Jul 18 '05 #5

qwweeeit wrote:
Thank you for your suggestion, but it is too complicated for me...
I decided to proceed in steps:
1. Take away all commented lines
2. Rebuild multi-line statements as single lines

Ummm,
OK, all I can say is: did you try this?
If not, save it as a module, then import it into the interpreter and
try it.
This is a dead simple module that does *exactly* what you asked for :)
Like I said, I have done this before, so I will restate: *I HAVE FAILED
AT THIS BEFORE, MANY TIMES*. Now I have a solution.
It writes to sys.stdout by default but can write to any file-like
object you give it.
It already handles continued lines, so there is no need to futz around
with a separate solution.
Here is an example:
Py> filein = """
.... class Stripper:
....     '''python comment and whitespace stripper :)
....     '''
....     def __init__(self, raw):
....         ''' Store the source text & set some flags.
....         '''
....         self.raw = raw
....
....     def format(self, out=sys.stdout, comments=0,
....                spaces=1, untabify=1,eol='unix'):
....         '''Parse and send the colored source.'''
....         # Store line offsets in self.lines
....         self.lines = [0, 0]
....         pos = 0
....         # Strips the first blank line if 1
....         self.lasttoken = 1
....         self.temp = StringIO.StringIO()
....         self.spaces = spaces
....         self.comments = comments
....
....         if untabify:
....             self.raw = self.raw.expandtabs()
....         self.raw = self.raw.rstrip()+' '
....         self.out = out
.... """
Py> replacer = ReplaceParser(filein, out=sys.stdout)
Py> replacer.format()
class Stripper:
    s000001
    def __init__(self, raw):
        s000002
        self.raw = raw

    def format(self, out=sys.stdout, comments=0,
               spaces=1, untabify=1,eol=s000003):
        s000004
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        # Strips the first blank line if 1
        self.lasttoken = 1
        self.temp = StringIO.StringIO()
        self.spaces = spaces
        self.comments = comments

        if untabify:
            self.raw = self.raw.expandtabs()
        self.raw = self.raw.rstrip()+s000005
        self.out = out
Py> replacer.StringMap
{'s000004': "'''Parse and send the colored source.'''",
's000005': "' '",
's000001': "'''python comment and whitespace stripper :)\n '''",
's000002': "''' Store the source text & set some flags.\n '''",
's000003': "'unix'"}
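
To get from StringMap to something like the file format described in
the original post, a dump along these lines would do (a sketch; the
file name 'stringmap.txt' is just a placeholder, and note that
StringMap as built above does not record line numbers, so the
"[line 45]" part would also require saving srow in the STRING branch
of __call__):

out = open('stringmap.txt', 'w')
names = replacer.StringMap.keys()
names.sort()
for name in names:
    out.write('%s=%s\n' % (name, replacer.StringMap[name]))
out.close()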

You can also strip out comments with a few lines.
It can easily get single comments or doubles.
Add this to your __call__ method:
[snip]
            self.pos = newpos
            return
        # kills comments
        if (toktype == tokenize.COMMENT):
            return
        if (toktype == token.STRING):
            sname = self.StringName.next()
[snip]

If you insist on writing something yourself, go ahead.
Let me know what your solution is; I am curious.
M.E.Farmer

Jul 18 '05 #6

I owe you an answer about "my" solution for removing literal
strings...
I apologize for not following your suggestions, but I am just learning
Python, and your approach was too difficult for me!
I have already developed the cross reference tool, and for it I
identified two types of literals (the standard ones, which I named s~
in general, and the multi-line or triple-quoted strings, which I
called m~).

You can see a step in my approach to the solution in an answer to
Fredrik Lundh:
http://groups.google.it/groups?q=qww...gle.com&rnum=2

Since then I have almost completed the application; rather than
explain it, I'll show you the result (a small extract).

052 PROGNAME: PROGNAME = sys.argv[0]
053 AUTHOR: AUTHOR = us~.encode(s~)
054 VERSION: VERSION = s~
056 URL_BASE: URL_BASE = s~
057 OUTPUT_HTML: OUTPUT_HTML = s~ etc...

The cross references are mainly useful for variables, but I also use
them for Python reserved words (to learn the language) and for classes
and functions.

For small applications there is no need for my tool, but with a source
of almost 1 MB... (like PySol).
Excuse me if I don't go deeper into my solution for removing strings,
but it is so standard that there is nothing to learn...

Bye.
Jul 18 '05 #7
