I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some
example data to clarify my problem:
py> text = """\
.... aaa bb ccc
.... dd eee. fff gggg
.... hh i.
.... jjj kk.
.... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
Note that the original "text" has a variety of whitespace between words,
but the corresponding "chunks" have only single space characters between
"words". I'm looking for the indices of each chunk, so for this
example, I'd like:
py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
Note that the indices correspond to the *original* text so that the
substrings in the given spans include the irregular whitespace:
py> for s, e in result:
....     print repr(text[s:e])
....
'aaa bb'
'ccc\ndd eee.'
'fff gggg\nhh i.'
'jjj'
'kk.'
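The requirement can be checked mechanically. The following is a minimal Python 3 sketch, not code from the original post. The exact spacing inside text is an assumption: it is one layout that is consistent with the spans given above. normalize is a hypothetical helper name.

```python
# ASSUMPTION: this spacing is one layout consistent with the spans in `result`.
text = (
    "   aaa  bb ccc\n"
    "dd eee.  fff gggg\n"
    "hh   i.\n"
    "   jjj kk.\n"
)
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

def normalize(s):
    """Collapse every run of whitespace to a single space (hypothetical helper)."""
    return ' '.join(s.split())

for (start, end), chunk in zip(result, chunks):
    # Each span of the original text, once normalized, should equal its chunk.
    assert normalize(text[start:end]) == chunk
    print(repr(text[start:end]))
```

Each printed span keeps its irregular whitespace, while its normalized form matches the corresponding chunk.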
I'm trying to write code to produce the indices. Here's what I have:
py> def get_indices(text, chunks):
....     chunks = iter(chunks)
....     chunk = None
....     for text_index, c in enumerate(text):
....         if c.isspace():
....             continue
....         if chunk is None:
....             chunk = chunks.next().replace(' ', '')
....             chunk_start = text_index
....             chunk_index = 0
....         if c != chunk[chunk_index]:
....             raise Exception('unmatched: %r %r' %
....                             (c, chunk[chunk_index]))
....         else:
....             chunk_index += 1
....             if chunk_index == len(chunk):
....                 yield chunk_start, text_index + 1
....                 chunk = None
....
And it appears to work:
py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True
But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?
Thanks in advance,
STeVe
[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)
Steven Bethard wrote:
[snip] And it appears to work:
[snip] But it seems somewhat inelegant. Can anyone see an easier/cleaner/more Pythonic way[1] of writing this code?
Thanks in advance,
STeVe
[1] Yes, I'm aware that these are subjective terms. I'm looking for subjectively "better" solutions. ;)
Perhaps you should define "work" before you worry about """subjectively
"better" solutions""".
If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.
For example, text = 'foo bar', chunks = ['foobar']
John Machin wrote: If "work" is meant to detect *all* possibilities of 'chunks' not having been derived from 'text' in the described manner, then it doesn't work -- all information about the positions of the whitespace is thrown away by your code.
For example, text = 'foo bar', chunks = ['foobar']
This doesn't match the (admittedly vague) spec which said that chunks
are created "as if by ' '.join(chunk.split())". For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.
STeVe
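Reading chunks as lists of words suggests a simple validity check. This is a minimal Python 3 sketch, assuming the chunks are meant to cover all of the text's words in order. is_valid_chunking is a hypothetical name and does not appear in the thread.

```python
def is_valid_chunking(text, chunks):
    """True if the chunks, read as lists of words, match the text's words in order."""
    chunk_words = [word for chunk in chunks for word in chunk.split()]
    return chunk_words == text.split()

print(is_valid_chunking('foo bar', ['foo bar']))     # True
print(is_valid_chunking('foo bar', ['foo', 'bar']))  # True
print(is_valid_chunking('foo bar', ['foobar']))      # False
```

Under this reading, ['foobar'] is rejected for the text 'foo bar' because the word boundaries differ.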
Steven Bethard wrote: I have a string with a bunch of whitespace in it, and a series of chunks of that string whose indices I need to find. However, the chunks have been whitespace-normalized, so that multiple spaces and newlines have been converted to single spaces as if by ' '.join(chunk.split()). [snip]
If you are willing to get your hands dirty with regexps:
import re

_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

def main():
    text = """\
aaa bb ccc
dd eee. fff gggg
hh i.
jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [(3, 10), (11, 22), (24, 40),
                                           (44, 47), (48, 51)]

if __name__ == "__main__":
    main()
Not tested beyond what you see.
Peter
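For readers on Python 3, the same word-counting idea can be sketched as follows. This is an adaptation, not Peter's code: the iterator method .next() no longer exists there, and a StopIteration escaping inside a generator becomes a RuntimeError, so this version uses itertools.islice and raises ValueError when the text runs out of words. The sample text and the name _WORD are assumptions made for illustration.

```python
import re
from itertools import islice

_WORD = re.compile(r"\S+")  # one match per run of non-whitespace characters

def indices(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    words = _WORD.finditer(text)
    for chunk in chunks:
        n = len(chunk.split())
        # Take as many words from the text as the chunk contains.
        matched = list(islice(words, n))
        if not matched or len(matched) < n:
            raise ValueError('cannot locate chunk %r' % chunk)
        yield matched[0].start(), matched[-1].end()

# ASSUMPTION: a small sample text with irregular whitespace.
sample = "  foo   bar\nbaz "
print(list(indices(sample, ['foo bar', 'baz'])))  # [(2, 11), (12, 15)]
```

Like the version above, this only counts words. It does not compare the words in each chunk with the words found in the text.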
Steven Bethard wrote: John Machin wrote:
If "work" is meant to detect *all* possibilities of 'chunks' not having been derived from 'text' in the described manner, then it doesn't work -- all information about the positions of the whitespace is thrown away by your code.
For example, text = 'foo bar', chunks = ['foobar']
This doesn't match the (admittedly vague) spec
That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.
which said that chunks are created "as if by ' '.join(chunk.split())". For the text: 'foo bar' the possible chunk lists should be something like: ['foo bar'] ['foo', 'bar'] If it helps, you can think of chunks as lists of words, where the words have been ' '.join()ed. STeVe
If it helps, you can re-read my message.
John Machin wrote: Steven Bethard wrote:
John Machin wrote:
For example, text = 'foo bar', chunks = ['foobar']
This doesn't match the (admittedly vague) spec
That is *exactly* my point -- it is not valid input, and you are not reporting all cases of invalid input; you have an exception where the non-spaces are impossible, but no exception where whitespaces are impossible.
Well, the input should never look like the above. But if for some
reason it did, I wouldn't want the error; I'd want the indices. So:
text = 'foo bar'
chunks = ['foobar']
should produce:
[(0, 7)]
not an exception.
STeVe
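That lenient behaviour can be sketched in Python 3 by comparing only non-whitespace characters, in the spirit of the character-walking generator from the first post. This is an illustrative sketch, not code from the thread, and lenient_indices is a hypothetical name.

```python
def lenient_indices(text, chunks):
    """Yield (start, end) spans, comparing non-whitespace characters only."""
    pos = 0
    for chunk in chunks:
        wanted = ''.join(chunk.split())  # the chunk with all whitespace removed
        if not wanted:
            raise ValueError('empty chunk')
        start = None
        for expected in wanted:
            # Skip whitespace in the text before the next expected character.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos == len(text) or text[pos] != expected:
                raise ValueError('unmatched chunk %r at offset %d' % (chunk, pos))
            if start is None:
                start = pos
            pos += 1
        yield start, pos

print(list(lenient_indices('foo bar', ['foobar'])))      # [(0, 7)]
print(list(lenient_indices('foo bar', ['foo', 'bar'])))  # [(0, 3), (4, 7)]
```

A word-counting approach would instead treat 'foobar' as a single word and report (0, 3) for the same input, so the two strategies differ on input of this kind.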
Peter Otten wrote:
import re
_reLump = re.compile(r"\S+")
def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()
Thanks, that's a really nice, clean solution!
STeVe