aligning SGML to text

Steven Bethard

I have some plain text data and some SGML markup for that text that I
need to align. (The SGML doesn't maintain the original whitespace, so I
have to do some alignment; I can't just calculate the indices directly.)
For example, some of my text looks like:

TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
cytoplasmic translocation and concomitant formation of an intracellular
signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.

And the corresponding SGML looks like:

<PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
</PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
</PROTEIN> , resulting in cytoplasmic translocation and concomitant
formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
<PROTEIN> TRAF2 </PROTEIN> , and AIPl .

Note that the SGML inserts spaces not only within the SGML elements, but
also around punctuation.
I need to determine the indices in the original text that each SGML
element corresponds to. Here's some working code to do this, based on a
suggestion for a related problem by Fredrik Lundh[1]::

    import re
    import elementtree.ElementTree as etree

    def align(text, sgml):
        # escape bare ampersands so the parser accepts the SGML
        sgml = sgml.replace('&', '&amp;')
        tree = etree.fromstring('<xml>%s</xml>' % sgml)
        words = []
        if tree.text is not None:
            words.extend(tree.text.split())
        word_indices = []
        for elem in tree:
            elem_words = elem.text.split()
            start = len(words)
            end = start + len(elem_words)
            word_indices.append((start, end, elem.tag))
            words.extend(elem_words)
            if elem.tail is not None:
                words.extend(elem.tail.split())
        # one capturing group per word, separated by optional whitespace
        expr = '\s*'.join('(%s)' % re.escape(word) for word in words)
        match = re.match(expr, text)
        assert match is not None
        for word_start, word_end, label in word_indices:
            start = match.start(word_start + 1)
            end = match.end(word_end)
            yield label, start, end

    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from
    ... TNFR1, resulting in cytoplasmic translocation and concomitant
    ... formation of an intracellular signaling complex comprised of TRADD,
    ... RIP1, TRAF2, and AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
    ... <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
    ... and concomitant formation of an <PROTEIN> intracellular signaling
    ... complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
    ... <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
    ... '''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
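
To see why the word indices line up with group numbers, here is a minimal sketch of the pattern that the `'\s*'.join(...)` line builds, run on a short fragment of the example. The fragment and the variable names are chosen for illustration only.

```python
import re

# Words as they appear in the SGML, where punctuation is split off
words = 'AIP1 ( DAB2IP ) from'.split()

# One capturing group per word; \s* between groups absorbs whatever
# whitespace (possibly none) the original text has at that point
expr = r'\s*'.join('(%s)' % re.escape(word) for word in words)
match = re.match(expr, 'AIP1 (DAB2IP) from')

# Group i + 1 holds the character span of words[i] in the original text
spans = [match.span(i) for i in range(1, len(words) + 1)]
print(spans)
```

An element covering words `[start, end)` therefore begins at `match.start(start + 1)` and ends at `match.end(end)`, which is what the last loop in `align` computes.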

The problem is, this doesn't work when my text is long (which it is)
because regular expressions are limited to 100 groups. I get an error
like::

    Traceback (most recent call last):
      ...
    AssertionError: sorry, but this version only supports 100 named groups

I also played around with difflib.SequenceMatcher for a while, but
couldn't get a solution based on that working. Any suggestions?

[1] http://mail.python.org/pipermail/python-list/2005-December/313388.html

Thanks,

STeVe

Jun 18 '06 #1
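
One possible way around the group limit, sketched here as an assumption rather than something proposed in the thread, is to match the words in small batches so that no single pattern needs more than a handful of groups. The names `word_spans` and `chunk_size` are hypothetical. The sketch relies on compiled patterns accepting a start position in `match(text, pos)` and on the reported spans being offsets into the full string.

```python
import re

def word_spans(text, words, chunk_size=50):
    """Return the (start, end) span in text of each word, in order."""
    spans = []
    pos = 0
    for i in range(0, len(words), chunk_size):
        chunk = words[i:i + chunk_size]
        # Same pattern shape as before, but only len(chunk) groups
        expr = r'\s*' + r'\s*'.join('(%s)' % re.escape(w) for w in chunk)
        match = re.compile(expr).match(text, pos)
        if match is None:
            raise ValueError('cannot align words starting at offset %d' % pos)
        spans.extend(match.span(g) for g in range(1, len(chunk) + 1))
        pos = match.end()  # resume where this batch stopped
    return spans

text = 'release of AIP1 (DAB2IP) from TNFR1, resulting'
words = 'release of AIP1 ( DAB2IP ) from TNFR1 , resulting'.split()
print(word_spans(text, words, chunk_size=3))
```

The word-level `(start, end, tag)` triples collected from the tree can then be converted to character offsets by indexing into the returned list.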

Gerard Flanagan

This is probably an abuse of itertools...

---8<---
text = '''TNF binding induces release of AIP1 (DAB2IP) from
TNFR1, resulting in cytoplasmic translocation and concomitant
formation of an intracellular signaling complex comprised of TRADD,
RIP1, TRAF2, and AIPl.'''

sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
<PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
and concomitant formation of an <PROTEIN> intracellular signaling
complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
<PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
'''

import itertools as it

def scan(line):
    # grouping key: the tag name at the start of each '<'-split chunk
    if not line: return
    line = line.strip()
    parts = line.split('>', 1)
    return parts[0]

def align(txt, sml):
    i = 0
    for k, g in it.groupby(sml.split('<'), scan):
        g = list(g)
        if not g[0]: continue
        text = g[0].split('>')[1]#.replace('\n','')
        if k.startswith('/'):
            # text after a closing tag: advance past it
            i += len(text)
        else:
            # text inside an element: report its span
            offset = len(text.strip())
            yield k, i, i+offset
            i += offset

print(list(align(text, sgml)))

------------

[('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]

It's off, possibly because of the punctuation; I can't figure it out.
Maybe you can tweak it?

hth

Gerard

Jun 18 '06 #2
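
The drift in the output above comes from counting characters in the SGML, which contains spaces that the original text does not (for example `" ( "` against `" ("`). One possible tweak, offered only as a sketch, is to count non-whitespace characters instead, since those are the same in both strings, and then look up where each one sits in the original text. The name `align_by_nonspace` is hypothetical, and the sketch assumes that elements are never empty and that the SGML contains no entity escapes.

```python
import re

def align_by_nonspace(text, sgml):
    """Yield (tag, start, end) character offsets into text."""
    # offsets[n] is the index in text of its n-th non-whitespace character
    offsets = [i for i, ch in enumerate(text) if not ch.isspace()]
    count = 0        # non-whitespace characters of SGML content seen so far
    open_tags = []   # stack of (tag name, count when the tag opened)
    # The capturing group makes re.split keep the tags as separate tokens
    for token in re.split(r'(<[^>]+>)', sgml):
        if token.startswith('<'):
            name = token[1:-1].strip()
            if name.startswith('/'):
                tag, first = open_tags.pop()
                yield tag, offsets[first], offsets[count - 1] + 1
            else:
                open_tags.append((name, count))
        else:
            count += sum(1 for ch in token if not ch.isspace())

text = 'release of AIP1 (DAB2IP) from TNFR1, resulting'
sgml = ('release of <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> )'
        ' from <PROTEIN> TNFR1 </PROTEIN> , resulting')
print(list(align_by_nonspace(text, sgml)))
```

Because only non-whitespace characters are counted, it does not matter where the SGML producer added or dropped spaces.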

Steven Bethard

Gerard Flanagan wrote:

It's off, possibly because of the punctuation; I can't figure it out.

Thanks for taking a look. Yeah, the alignment's a big part of the
problem. It'd be really nice if the thing that gives me SGML didn't add
whitespace haphazardly. ;-)

STeVe

Jun 19 '06 #3

Steven Bethard

Steven Bethard wrote:

I have some plain text data and some SGML markup for that text that I
need to align. (The SGML doesn't maintain the original whitespace, so I
have to do some alignment; I can't just calculate the indices directly.)
[snip]
Note that the SGML inserts spaces not only within the SGML elements, but
also around punctuation.
[snip]
I need to determine the indices in the original text that each SGML
element corresponds to.

Ok, below is a working version that doesn't use regular expressions.
It's far from concise, but at least it doesn't fail like re does when I
have more than 100 words. =)

    >>> import elementtree.ElementTree as etree
    >>> def align(text, sgml):
    ...     # convert SGML tree to words, and assemble a list of the
    ...     # start word index and end word index for each SGML element
    ...     sgml = sgml.replace('&', '&amp;')
    ...     tree = etree.fromstring('<xml>%s</xml>' % sgml)
    ...     words = []
    ...     if tree.text is not None:
    ...         words.extend(tree.text.split())
    ...     word_spans = []
    ...     for elem in tree:
    ...         elem_words = elem.text.split()
    ...         start = len(words)
    ...         end = start + len(elem_words)
    ...         word_spans.append((start, end, elem.tag))
    ...         words.extend(elem_words)
    ...         if elem.tail is not None:
    ...             words.extend(elem.tail.split())
    ...     # determine the start character index and end character index
    ...     # for each word from the SGML
    ...     char_spans = []
    ...     start = 0
    ...     for word in words:
    ...         while text[start:start + 1].isspace():
    ...             start += 1
    ...         end = start + len(word)
    ...         assert text[start:end] == word, (text[start:end], word)
    ...         char_spans.append((start, end))
    ...         start = end
    ...     # convert the word indices for each SGML element to
    ...     # character indices
    ...     for word_start, word_end, label in word_spans:
    ...         start, _ = char_spans[word_start]
    ...         _, end = char_spans[word_end - 1]
    ...         yield label, start, end
    ...
    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from
    ... TNFR1, resulting in cytoplasmic translocation and concomitant
    ... formation of an intracellular signaling complex comprised of TRADD,
    ... RIP1, TRAF2, and AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
    ... <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
    ... and concomitant formation of an <PROTEIN> intracellular signaling
    ... complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
    ... <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
    ... '''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]

STeVe

Jun 19 '06 #4
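
For readers on Python 3, the same word-walking approach can be sketched with the standard library's `xml.etree.ElementTree` in place of the separate `elementtree` package. This is an adaptation for illustration, not STeVe's code verbatim. It assumes flat (non-nested) elements with at least one word each, as in the thread's data, and it returns a list instead of yielding.

```python
import xml.etree.ElementTree as etree

def align(text, sgml):
    """Return (tag, start, end) character offsets into text."""
    # Escape bare ampersands so the XML parser accepts the SGML
    tree = etree.fromstring('<xml>%s</xml>' % sgml.replace('&', '&amp;'))
    words = (tree.text or '').split()
    word_spans = []
    for elem in tree:
        elem_words = (elem.text or '').split()
        word_spans.append((len(words), len(words) + len(elem_words), elem.tag))
        words.extend(elem_words)
        words.extend((elem.tail or '').split())
    # Walk the original text, skipping whitespace before each word
    char_spans = []
    pos = 0
    for word in words:
        while text[pos:pos + 1].isspace():
            pos += 1
        end = pos + len(word)
        if text[pos:end] != word:
            raise ValueError('expected %r at offset %d' % (word, pos))
        char_spans.append((pos, end))
        pos = end
    # Convert each element's word range to a character range
    return [(tag, char_spans[first][0], char_spans[last - 1][1])
            for first, last, tag in word_spans]

text = 'release of AIP1 (DAB2IP) from TNFR1, resulting'
sgml = ('release of <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> )'
        ' from <PROTEIN> TNFR1 </PROTEIN> , resulting')
print(align(text, sgml))
```

Slicing the text with the returned offsets, as in `text[start:end]`, is a quick way to confirm that each span covers the words of its element.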

Gerard Flanagan

Steven Bethard wrote:

Gerard Flanagan wrote:
Steven Bethard wrote:
I have some plain text data and some SGML markup for that text that I
need to align. (The SGML doesn't maintain the original whitespace, so I
have to do some alignment; I can't just calculate the indices directly.)
For example, some of my text looks like:
[...]
Steve

This is probably an abuse of itertools...
[snip hammering]
Thanks for taking a look. Yeah, the alignment's a big part of the
problem. It'd be really nice if the thing that gives me SGML didn't add
whitespace haphazardly. ;-)

STeVe

I see, the problem was different than I thought. When all you have is a
hammer... :-)

Gerard

Jun 19 '06 #5
