Bytes IT Community

Display context snippet for search phrase match optimisation request

Hi,

I am working on a function to return extracts from a text document
with a specific phrase highlighted (i.e. display the context of the
matched phrase).

The requirements are:

* Match should be case-insensitive, but extract should have case
preserved.

* The extract should show N words or characters of context on both
sides of the match.

* The phrase should be highlighted, i.e. bracketed by arbitrary text,
e.g. "<b>" and "</b>".

* The phrase is simple. e.g. "double" or "Another option"

* Only the first N matches should be returned.

* There will always be at least one match. (The extracts are only
requested if another process has determined a match exists in the
document.)

* The size of the text document varies from a few hundred
kilobytes to a little under 20 megabytes.
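To make the requirements concrete, here is a minimal sketch of the intended behaviour (this is my own illustration, not the code under test below; the function name and sample text are invented):

```python
import re

def highlight(text, phrase, n=15, limit=2, pre="<b>", post="</b>"):
    """Return up to `limit` extracts, with `n` characters of context on
    each side of a case-insensitive (but case-preserving) match."""
    # re.escape() treats the phrase as literal text, not a pattern.
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    extracts = []
    for m in pattern.finditer(text):
        if len(extracts) == limit:
            break
        s, e = m.span()
        extracts.append("...%s%s%s%s%s..." % (
            text[max(s - n, 0):s], pre, text[s:e], post, text[e:e + n]))
    return "".join(extracts)

sample = "Another option is to use regex. Yet another option exists."
print(highlight(sample, "another option"))
```

Both occurrences are bracketed, and the original capitalisation ("Another" vs. "another") survives in the output.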

I've found two alternative methods (included below) and was wondering
if anyone had suggestions for improvements. One method uses the string
"index()" method and the other uses a regular expression.

The results seem inconsistent: in "real world" testing the regular
expression seems to be faster most of the time, but in "timeit" tests
it is almost always considerably slower, which surprised me. I'm
beginning to think the regular expression method is only faster when
the matches are near the beginning of the document.
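One factor that may explain this (my own guess, not something the timings below prove): `finditer` is lazy, so combined with `itertools.islice` the regex version stops scanning as soon as it has its N matches, whereas the `index()` version lower-cases the entire document on every call before it can search at all. A small sketch of that laziness:

```python
import itertools
import re

# All ten matches sit at the front of a long string.
text = ("needle " * 10) + ("haystack " * 100000)
matcher = re.compile("needle", re.IGNORECASE)

# islice stops pulling from the lazy finditer iterator after seven
# matches, so the long "haystack" tail is never scanned by this loop.
matches = list(itertools.islice(matcher.finditer(text), 7))
```

When the matches cluster near the start, most of the document is never examined; when they are sparse or near the end, the scan cost comes back.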

I'm using the Python Swish-E binding
<http://jibe.freeshell.org/bits/SwishE/> on Windows with Python 2.3.
The purpose of the code is to display context for each found document
(which is actually a PDF file with the content converted to text).

In "real world" practice for a set of fifteen results there's only
around two to five seconds difference between the two methods, so I
should probably stop worrying about it. I really just wanted to know
if any other approach is likely to be significantly better. And if
not, then anyone else can feel free to use this code. :-)
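For what it's worth, one further variant worth trying (my own sketch, not one of the two methods timed below): lower-case the document once, outside the search function, and use `str.find`, which returns -1 instead of raising `ValueError`. That moves the whole-document lowering cost out of every query and drops the try/except:

```python
def getNext_C(content, lowContent, phrase, limit, span=40):
    # lowContent is content.lower(), computed once by the caller and
    # reused across queries instead of being rebuilt on every call.
    lowPhrase = phrase.lower()
    phraseLen = len(phrase)
    idx = -1
    for _ in range(limit):
        idx = lowContent.find(lowPhrase, idx + 1)
        if idx == -1:
            break
        yield (content[max(idx - span, 0):idx].lstrip(),
               content[idx:idx + phraseLen],
               content[idx + phraseLen:idx + phraseLen + span].rstrip())

doc = "Simply put, simply the best. SIMPLY everywhere."
extracts = list(getNext_C(doc, doc.lower(), "Simply", 2))
```

Whether this beats the regex version presumably still depends on where the matches fall in the document.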

--Phil.

# ========== Test Output ============

For the following test output:

"getNext_A" uses the "index()" string method.
"getNext_B" uses regular expressions.
# 17MB file with phrase "simply"
------ getNext_B ------
1.2671 sec/pass
126.72300005

------ getNext_A ------
0.5441 sec/pass
54.4189999104

# 17MB file with phrase "auckland"
------ getNext_B ------
0.0054 sec/pass
0.530999898911

------ getNext_A ------
0.4429 sec/pass
44.2940001488

# 132KB file with phrase "simply"
------ getNext_B ------
0.0111 sec/pass
1.12199997902

------ getNext_A ------
0.0041 sec/pass
0.411000013351

# 132KB file with phrase "auckland"
------ getNext_B ------
0.0109 sec/pass
1.10099983215

------ getNext_A ------
0.0041 sec/pass
0.411000013351
# ========== Script file "test_context.py" ============

#!/usr/bin/python

import itertools
import re
import time
import timeit

FILENAME = r"17MB_document.txt"

#PHRASE = "auckland"
#PHRASE = "simply"
PHRASE = "proceedings"

MAX_MATCHES = 7

RANGE = 40


def getNext_A(content, phrase, limit):
    """Yield (before, match, after) context tuples using str.index()."""
    lowContent = content.lower()
    lowPhrase = phrase.lower()
    phraseLen = len(phrase)

    idx = -1
    for matchCount in range(limit):
        try:
            idx = lowContent.index(lowPhrase, idx + 1)
        except ValueError:
            break

        yield (content[max(idx - RANGE, 0): idx].lstrip(),
               content[idx: idx + phraseLen],
               content[idx + phraseLen: idx + phraseLen + RANGE].rstrip())


def getNext_B(content, phrase, limit):
    """Yield (before, match, after) context tuples using a regular
    expression."""
    # re.escape() treats the phrase as literal text, not a pattern.
    matcher = re.compile(re.escape(phrase), re.IGNORECASE)

    for match in itertools.islice(matcher.finditer(content), limit):
        start, end = match.span()
        yield (content[max(start - RANGE, 0): start].lstrip(),
               content[start: end],
               content[end: end + RANGE].rstrip())


def getContext(content, phrase, func):
    """Join the highlighted extracts into a single display string."""
    results = []
    for match in func(content, phrase, MAX_MATCHES):
        results.append("...%s<b>%s</b>%s..." % match)
    return "".join(results)


if __name__ == "__main__":
    print
    content = open(FILENAME).read()

    for (f, n) in [(getNext_B, "getNext_B"), (getNext_A, "getNext_A")]:
        print "------ %s ------" % n
        ta = time.time()

        t = timeit.Timer(stmt="getContext(content, PHRASE, %s)" % n,
                         setup="from __main__ import getContext, content, "
                               "PHRASE, getNext_A, getNext_B")
        print "%.4f sec/pass" % (t.timeit(number=100) / 100)

        print time.time() - ta

        print
        #print getContext(content, PHRASE, f)
Jul 18 '05 #1