aligning text with space-normalized text

I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some
example data to clarify my problem:

py> text = """\
.... aaa bb ccc
.... dd eee. fff gggg
.... hh i.
.... jjj kk.
.... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words,
but the corresponding "chunks" have only single space characters between
"words". I'm looking for the indices of each chunk, so for this
example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the
substrings in the given spans include the irregular whitespace:

py> for s, e in result:
....     print repr(text[s:e])
....
'aaa bb'
'ccc\ndd eee.'
'fff gggg\nhh i.'
'jjj'
'kk.'

I'm trying to write code to produce the indices. Here's what I have:

py> def get_indices(text, chunks):
....     chunks = iter(chunks)
....     chunk = None
....     for text_index, c in enumerate(text):
....         if c.isspace():
....             continue
....         if chunk is None:
....             chunk = chunks.next().replace(' ', '')
....             chunk_start = text_index
....             chunk_index = 0
....         if c != chunk[chunk_index]:
....             raise Exception('unmatched: %r %r' %
....                             (c, chunk[chunk_index]))
....         else:
....             chunk_index += 1
....             if chunk_index == len(chunk):
....                 yield chunk_start, text_index + 1
....                 chunk = None
....

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True

But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)
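
For readers on Python 3, here is a minimal sketch of the same character-matching approach. It is a port under assumptions, not the posted code: `next(chunks, None)` stands in for `chunks.next()`, and running out of chunks raises `ValueError` explicitly, because a `StopIteration` escaping inside a generator becomes a `RuntimeError` on Python 3.7+. The sample `text` is also an assumption. The irregular whitespace is collapsed in this copy of the post, so the string below is just one layout that is consistent with the posted indices.

```python
def get_indices(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    chunks = iter(chunks)
    chunk = None
    for text_index, c in enumerate(text):
        if c.isspace():
            continue  # whitespace in the text never counts toward a chunk
        if chunk is None:
            nxt = next(chunks, None)
            if nxt is None:
                raise ValueError('text continues past the last chunk')
            chunk = nxt.replace(' ', '')  # compare non-space characters only
            chunk_start = text_index
            chunk_index = 0
        if c != chunk[chunk_index]:
            raise ValueError('unmatched: %r %r' % (c, chunk[chunk_index]))
        chunk_index += 1
        if chunk_index == len(chunk):
            yield chunk_start, text_index + 1
            chunk = None


# Assumed reconstruction of the sample text (exact spacing not recoverable).
text = "   aaa  bb ccc\ndd eee.  fff  gggg\nhh  i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

print(list(get_indices(text, chunks)))
# [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
```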
Jul 19 '05 #1
Steven Bethard wrote:
> [snip]
> And it appears to work: [snip] But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
> Pythonic way[1] of writing this code?
>
> Thanks in advance,
>
> STeVe
>
> [1] Yes, I'm aware that these are subjective terms. I'm looking for
> subjectively "better" solutions. ;)


Perhaps you should define "work" before you worry about """subjectively
"better" solutions""".

If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']
Jul 19 '05 #2
John Machin wrote:
> If "work" is meant to detect *all* possibilities of 'chunks' not having
> been derived from 'text' in the described manner, then it doesn't work
> -- all information about the positions of the whitespace is thrown away
> by your code.
>
> For example, text = 'foo bar', chunks = ['foobar']


This doesn't match the (admittedly vague) spec which said that chunks
are created "as if by ' '.join(chunk.split())". For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.

STeVe
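
To make the "lists of words" reading concrete, here is a small Python 3 sketch that enumerates every chunk list the spec allows for a given text. The helper name `possible_chunk_lists` is made up for illustration. It assumes chunks cover the text's words in order, with cuts only at word boundaries.

```python
from itertools import combinations


def possible_chunk_lists(text):
    """Yield every way to group the words of text into consecutive chunks."""
    words = text.split()
    n = len(words)
    for k in range(n):  # k = number of cut points between words
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            # Each chunk is a run of words joined by single spaces.
            yield [' '.join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


for chunk_list in possible_chunk_lists('foo   bar'):
    print(chunk_list)
```

Under this reading, `['foobar']` is never produced, because joining words always leaves a space between them.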
Jul 19 '05 #3
Steven Bethard wrote:
> I have a string with a bunch of whitespace in it, and a series of chunks
> of that string whose indices I need to find.  However, the chunks have
> been whitespace-normalized, so that multiple spaces and newlines have
> been converted to single spaces as if by ' '.join(chunk.split()).  Some


If you are willing to get your hands dirty with regexps:

import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

def main():
    text = """\
aaa bb ccc
dd eee. fff gggg
hh i.
jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [
        (3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

if __name__ == "__main__":
    main()

Not tested beyond what you see.

Peter
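
The idea is that each chunk of N words consumes the next N runs of non-whitespace in the text, so the span runs from the start of the first run to the end of the last. A rough Python 3 sketch of that idea follows. It is an adaptation, not the code as posted: `itertools.islice` replaces the `lumps.next()` calls, and the `ValueError` for a short or empty match list is an added assumption. The sample `text` is one reconstruction consistent with the posted indices, since the original spacing is collapsed in this copy.

```python
import re
from itertools import islice

_re_lump = re.compile(r"\S+")


def indices(text, chunks):
    """Yield (start, end) spans by counting words, never comparing characters."""
    lumps = _re_lump.finditer(text)
    for chunk in chunks:
        n_words = len(chunk.split())
        # Take one regex match per word in the chunk.
        lump = list(islice(lumps, n_words))
        if not lump or len(lump) < n_words:
            raise ValueError('could not align chunk %r' % chunk)
        yield lump[0].start(), lump[-1].end()


# Assumed reconstruction of the sample text (exact spacing not recoverable).
text = "   aaa  bb ccc\ndd eee.  fff  gggg\nhh  i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

print(list(indices(text, chunks)))
# [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
```

Because only word counts are used, this version does not notice if a chunk's words differ from the words in the text.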

Jul 19 '05 #4
Steven Bethard wrote:
> John Machin wrote:
>> If "work" is meant to detect *all* possibilities of 'chunks' not
>> having been derived from 'text' in the described manner, then it
>> doesn't work -- all information about the positions of the whitespace
>> is thrown away by your code.
>>
>> For example, text = 'foo bar', chunks = ['foobar']
>
> This doesn't match the (admittedly vague) spec

That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.

> which said that chunks are created "as if by ' '.join(chunk.split())". For the text:
> 'foo bar'
> the possible chunk lists should be something like:
> ['foo bar']
> ['foo', 'bar']
> If it helps, you can think of chunks as lists of words, where the words
> have been ' '.join()ed.

If it helps, you can re-read my message.

> STeVe
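
One way to report the whitespace mismatches described here is a separate check after the spans are computed: re-normalize each span of the text and compare it with its chunk. This is only a sketch in Python 3, and `check_spans` is a hypothetical helper, not something posted in this thread.

```python
def check_spans(text, chunks, spans):
    """Return the (chunk, span) pairs whose text does not normalize to the chunk."""
    bad = []
    for chunk, (start, end) in zip(chunks, spans):
        # Apply the same normalization the chunks are said to have undergone.
        if ' '.join(text[start:end].split()) != chunk:
            bad.append((chunk, (start, end)))
    return bad


# The off-spec example: the span covers 'foo bar', which is not 'foobar'.
print(check_spans('foo bar', ['foobar'], [(0, 7)]))
# A well-formed case reports nothing.
print(check_spans('foo  bar', ['foo bar'], [(0, 8)]))
```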

Jul 19 '05 #5
John Machin wrote:
> Steven Bethard wrote:
>> John Machin wrote:
>>> For example, text = 'foo bar', chunks = ['foobar']
>>
>> This doesn't match the (admittedly vague) spec
>
> That is *exactly* my point -- it is not valid input, and you are not
> reporting all cases of invalid input; you have an exception where the
> non-spaces are impossible, but no exception where whitespaces are
> impossible.


Well, the input should never look like the above. But if for some
reason it did, I wouldn't want the error; I'd want the indices. So:
text = 'foo bar'
chunks = ['foobar']
should produce:
[(0, 7)]
not an exception.

STeVe
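
The `[(0, 7)]` result follows from matching non-space characters one at a time. A word-counting approach appears to behave differently on this off-spec input, since `'foobar'` counts as a single word. The Python 3 sketch below compares the two. The names `spans_by_chars` and `spans_by_words` are invented for illustration, and both functions are simplified rewrites rather than code from this thread.

```python
import re
from itertools import islice


def _skip_ws(text, pos):
    """Advance pos past any whitespace in text."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def spans_by_chars(text, chunks):
    """Match each chunk's non-space characters against the text in order."""
    pos = 0
    for chunk in chunks:
        start = pos = _skip_ws(text, pos)
        for ch in chunk.replace(' ', ''):
            pos = _skip_ws(text, pos)
            if pos >= len(text) or text[pos] != ch:
                raise ValueError('unmatched: %r' % ch)
            pos += 1
        yield start, pos


def spans_by_words(text, chunks):
    """Give each chunk as many whitespace-delimited words as it contains."""
    lumps = re.finditer(r"\S+", text)
    for chunk in chunks:
        lump = list(islice(lumps, len(chunk.split())))
        if not lump:
            raise ValueError('could not align chunk %r' % chunk)
        yield lump[0].start(), lump[-1].end()


print(list(spans_by_chars('foo bar', ['foobar'])))   # [(0, 7)]
print(list(spans_by_words('foo bar', ['foobar'])))   # [(0, 3)]
```

On input that follows the spec, such as `['foo', 'bar']`, both sketches give the same spans.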
Jul 19 '05 #6
Peter Otten wrote:
> import re
> _reLump = re.compile(r"\S+")
>
> def indices(text, chunks):
>     lumps = _reLump.finditer(text)
>     for chunk in chunks:
>         lump = [lumps.next() for _ in chunk.split()]
>         yield lump[0].start(), lump[-1].end()


Thanks, that's a really nice, clean solution!

STeVe
Jul 19 '05 #7
