"Tom Thackrey" <to***@creative-light.com> wrote in message
news:w%*****************@newssvr23.news.prodigy.com...
On 28-Jul-2003, ke***@iqtv.com (Keith Morris) wrote:
Hi all! I'm creating a mini CMS that will store content in a MySQL
database. What I am trying to do is parse the content and replace
certain keywords with a link. The keywords and associated links are
kept in a MySQL table.
Here is an example.
$keyword = "Widgets Technology Co.";
$location = "http://www.widgets.com/about";
$keyword2 = "Widgets";
$location2 = "http://www.widgets.com";
$content = "We have the best Widgets at Widgets Technology Co.";
I want to parse through $content looking for $keyword and replacing it
with:
<a href="$location">$keyword</a> (this I can do with no problem)
but I am going to be looping through a series of keywords (phrases)
sorted by length (longest to shortest) that may or may not contain
other keywords such as the values above for $keyword2 which would
cause nested links and other nonsense.
so what I'm needing is a regular expression that will find the
$keyword (phrase) that is not already between "<a href =" and "</a>"
so that it will not try to relink it.
so far, this is the regular expression that I have, but it does not work
properly:
[^(^\<a href=)][^(\>)]($keyword)[^(a\>)$]
If there is a better way of doing this, I would appreciate any
insights.
You probably need a two step process. First replace the key strings
(longest to shortest) with a unique tag like "!!!$recordid!!!" then replace the
tags with the "<a href...</a>". If you have a large database and only want to
make one pass through it, you could use a hashed string as the tag, then
search for the tags and un-hash the strings.
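A rough sketch of that two-step idea in PHP (the $references array, the
!!!id!!! placeholder format, and the sorting step are my own assumptions
about the setup, not code from Tom's post):

```php
<?php
// Hypothetical keyword table: record id => [keyword, URL].
$references = [
    1 => ['Widgets Technology Co.', 'http://www.widgets.com/about'],
    2 => ['Widgets', 'http://www.widgets.com'],
];

$content = 'We have the best Widgets at Widgets Technology Co.';

// Sort longest keyword first so whole phrases win over the
// shorter keywords contained inside them.
uasort($references, function ($a, $b) {
    return strlen($b[0]) - strlen($a[0]);
});

// Pass 1: swap each keyword for a unique placeholder like !!!1!!!.
// Placeholders can't accidentally contain other keywords.
foreach ($references as $id => $ref) {
    $content = str_replace($ref[0], "!!!$id!!!", $content);
}

// Pass 2: swap the placeholders for the real links.
foreach ($references as $id => $ref) {
    $link = '<a href="' . $ref[1] . '">' . $ref[0] . '</a>';
    $content = str_replace("!!!$id!!!", $link, $content);
}

echo $content;
```

Because the phrase "Widgets Technology Co." is replaced before the bare
word "Widgets", the shorter keyword never sees the text inside the longer
one, and neither pass can produce nested links.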
I've used Tom's method myself in several situations and it seems to work
really well. However, your regexp should work if you just check for word
boundaries on either side of your keyword:
$content_string = preg_replace('/\b' . preg_quote($keyword, '/') . '\b/i',
    '<a href="' . $url . '">' . $keyword . '</a>', $content_string);
(Note the second argument to preg_quote: it escapes the '/' pattern
delimiter in case a keyword contains one.)
Now, if a word character (a letter, digit, or underscore) sits immediately
on either side of your keyword, the match will fail, so "Something" won't
match inside "SomethingElse". One caveat, though: \b alone won't stop
re-linking inside an existing anchor. In <a href="url">Something
Company</a>, the ">" before the "S" is not a word character, so
"Something" would still match there; to skip text that's already linked
you need something extra, like a negative lookahead or Tom's placeholder
trick.
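A single-pass variation that avoids re-linking text already inside an
anchor is to add a negative lookahead to the keyword pattern. This is my
own sketch, not code from the thread, and the sample strings are invented;
the lookahead (?![^<]*<\/a>) simply skips any occurrence whose next tag is
a closing </a>:

```php
<?php
$keyword = 'Widgets';
$url = 'http://www.widgets.com';
$content_string = 'Buy Widgets from <a href="http://www.widgets.com">Widgets</a> today.';

// Match the keyword only when the next tag ahead of it is NOT </a>,
// i.e. when we are not sitting inside an anchor's link text or href.
$pattern = '/\b' . preg_quote($keyword, '/') . '\b(?![^<]*<\/a>)/i';
$content_string = preg_replace($pattern,
    '<a href="' . $url . '">' . $keyword . '</a>', $content_string);

echo $content_string;
```

Here only the first, unlinked "Widgets" gets wrapped; the one in the href
and the one in the existing link text are both skipped because "</a>" is
the next closing tag after them. It's a heuristic (it assumes reasonably
well-formed markup), but it handles the common case in one pass.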
Going back to Tom's suggestion, I prefer this method for a couple reasons.
On some CMSs, I do a lot of content filtering (checking for web links and
email addresses, allowing/denying HTML, replacing BB Code style markup, XML
data conversion, etc). For this reason, I pull some tricks using content
"highlighting" to get things like your example to work effectively.
However, I'm sure you'll agree that making multiple passes on your content
can be a waste of time and slow things down for your visitors. How do you
solve it? Well, I used to think that storing the filtered content was the
way to go. So, you filter your content coming from an "administrative"
user, store it in the DB, and serve the pages with little overhead. It
doesn't take long until you eventually have to edit this filtered content.
Then, that scheme goes out the window (unless all of your authors are HTML
gurus--which none of mine are). Here's my solution that has worked like a
charm:
Create two tables for your data: a "front-end" table and a "back-end" table.
In the front-end table, store the content that has already been run through
your filters; in the back-end table, store the original content as entered
by the author. Assuming your filters
never change, the front-end content will always be updated the same way, so
the back-end is essentially just an "editable" version of what's being
served to your visitors. The biggest drawback is that as your references
database grows, the front-end content stays the same. To combat this, I've
also written an update utility for the CMSs that employ this technique. It
allows an administrator to just run a complete "update" of the front-end
database (run all the back-end info through the filters again) whenever
he/she deems necessary. (If your content DB is kind of small (<1000
records), this utility can be piggy-backed on any updates to the references
DB.) I know this disobeys some of the great rules of RDB theory, but I think
a faster page load is sometimes more important than following all the rules.
Now we have both problems solved: you can have extensive filters without
making page generation run longer _and_ you can still edit the stuff when it
comes time to make changes.
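Sketched in PHP with plain arrays standing in for the MySQL tables (the
function and table names are made up for illustration, and apply_filters
is a stand-in for whatever keyword-linking and markup filters you run):

```php
<?php
// Stand-in for the real filter chain (keyword links, BB Code, etc.).
function apply_filters($raw) {
    return str_replace('Widgets',
        '<a href="http://www.widgets.com">Widgets</a>', $raw);
}

// On save: keep the author's original in the back-end table and the
// filtered version in the front-end table.
function save_article($db, $id, $raw) {
    $db['backend'][$id]  = $raw;                 // editable original
    $db['frontend'][$id] = apply_filters($raw);  // what visitors see
    return $db;
}

// The "update utility": re-run every back-end record through the
// filters, e.g. after the references table has grown.
function rebuild_frontend($db) {
    foreach ($db['backend'] as $id => $raw) {
        $db['frontend'][$id] = apply_filters($raw);
    }
    return $db;
}

$db = ['backend' => [], 'frontend' => []];
$db = save_article($db, 7, 'We sell Widgets.');
echo $db['frontend'][7];
```

Page views read only from the front-end table (no filtering at request
time), edits read and write the back-end table, and rebuild_frontend is
what the administrator runs when the filters or the references DB change.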
HTH,
Zac