"Tom Thackrey" <to***@creative-light.com> wrote in message
news:w%*****************@newssvr23.news.prodigy.com...
On 28-Jul-2003, ke***@iqtv.com (Keith Morris) wrote:
Hi all! I'm creating a mini CMS that will store content in a MySQL
database. What I am trying to do is parse the content and replace
certain keywords with a link. The keywords and associated links are
kept in a MySQL table.
Here is an example.
$keyword = "Widgets Technology Co.";
$location = "http://www.widgets.com/about";
$keyword2 = "Widgets";
$location2 = "http://www.widgets.com";
$content = "We have the best Widgets at Widgets Technology Co.";
I want to parse through $content looking for $keyword and replacing it
with:
<a href="$location">$keyword</a> (this I can do with no problem)
but I am going to be looping through a series of keywords (phrases)
sorted by length (longest to shortest) that may or may not contain
other keywords such as the values above for $keyword2 which would
cause nested links and other nonsense.
so what I'm needing is a regular expression that will find the
$keyword (phrase) that is not already between "<a href =" and "</a>"
so that it will not try to relink it.
so far, this is the regular expression that I have, but it does not work
properly:
[^(^\<a href=)][^(\>)]($keyword)[^(a\>)$]
If there is a better way of doing this, I would appreciate any
insights.
You probably need a two step process. First replace the key strings
(longest to shortest) with a unique tag like "!!!$recordid!!!" then replace the
tags with the "<a href...</a>". If you have a large database and only want to
make one pass through it, you could use a hashed string as the tag, then
search for the tags and un-hash the strings.
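A rough sketch of that two-step idea in PHP (the $references array, the
!!!id!!! placeholder format, and the sorting step are my own assumptions
about the setup, not code from Tom's post):

```php
<?php
// Hypothetical keyword table: record id => [keyword, URL].
$references = [
    1 => ['Widgets Technology Co.', 'http://www.widgets.com/about'],
    2 => ['Widgets', 'http://www.widgets.com'],
];

$content = 'We have the best Widgets at Widgets Technology Co.';

// Sort longest keyword first so whole phrases win over the
// shorter keywords contained inside them.
uasort($references, function ($a, $b) {
    return strlen($b[0]) - strlen($a[0]);
});

// Pass 1: swap each keyword for a unique placeholder like !!!1!!!.
// Placeholders can't accidentally contain other keywords.
foreach ($references as $id => $ref) {
    $content = str_replace($ref[0], "!!!$id!!!", $content);
}

// Pass 2: swap the placeholders for the real links.
foreach ($references as $id => $ref) {
    $link = '<a href="' . $ref[1] . '">' . $ref[0] . '</a>';
    $content = str_replace("!!!$id!!!", $link, $content);
}

echo $content;
```

Because the phrase "Widgets Technology Co." is replaced before the bare
word "Widgets", the shorter keyword never sees the text inside the longer
one, and neither pass can produce nested links.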
I've used Tom's method myself in several situations and it seems to work
really well. However, your regexp should work if you just check for word
boundaries on either side of your keyword:
$content_string = preg_replace('/\b' . preg_quote($keyword, '/') . '\b/i',
    '<a href="' . $url . '">' . $keyword . '</a>', $content_string);
(Note the second argument to preg_quote: it escapes the '/' pattern
delimiter in case a keyword contains one.)
Now, if a word character (a letter, digit, or underscore) sits immediately
on either side of your keyword, the match will fail, so "Something" won't
match inside "SomethingElse". One caveat, though: \b alone won't stop
re-linking inside an existing anchor. In <a href="url">Something
Company</a>, the ">" before the "S" is not a word character, so
"Something" would still match there; to skip text that's already linked
you need something extra, like a negative lookahead or Tom's placeholder
trick.
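A single-pass variation that avoids re-linking text already inside an
anchor is to add a negative lookahead to the keyword pattern. This is my
own sketch, not code from the thread, and the sample strings are invented;
the lookahead (?![^<]*<\/a>) simply skips any occurrence whose next tag is
a closing </a>:

```php
<?php
$keyword = 'Widgets';
$url = 'http://www.widgets.com';
$content_string = 'Buy Widgets from <a href="http://www.widgets.com">Widgets</a> today.';

// Match the keyword only when the next tag ahead of it is NOT </a>,
// i.e. when we are not sitting inside an anchor's link text or href.
$pattern = '/\b' . preg_quote($keyword, '/') . '\b(?![^<]*<\/a>)/i';
$content_string = preg_replace($pattern,
    '<a href="' . $url . '">' . $keyword . '</a>', $content_string);

echo $content_string;
```

Here only the first, unlinked "Widgets" gets wrapped; the one in the href
and the one in the existing link text are both skipped because "</a>" is
the next closing tag after them. It's a heuristic (it assumes reasonably
well-formed markup), but it handles the common case in one pass.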
Going back to Tom's suggestion, I prefer this method for a couple reasons.
On some CMSs, I do a lot of content filtering (checking for web links and
email addresses, allowing/denying HTML, replacing BB Code style markup, XML
data conversion, etc). For this reason, I pull some tricks using content
"highlighting" to get things like your example to work effectively.
However, I'm sure you'll agree that making multiple passes on your content
can be a waste of time and slow things down for your visitors. How do you
solve it? Well, I used to think that storing the filtered content was the
way to go. So, you filter your content coming from an "administrative"
user, store it in the DB, and serve the pages with little overhead. It
doesn't take long until you eventually have to edit this filtered content.
Then, that scheme goes out the window (unless all of your authors are HTML
gurus--which none of mine are). Here's my solution that has worked like a
charm:
Create two tables for your data: a "front-end" table and a "back-end" table.
In the front-end table, store the content that has already been run through
your filters; in the back-end table, store the original content as entered
by the author. Assuming your filters
never change, the front-end content will always be updated the same way, so
the back-end is essentially just an "editable" version of what's being
served to your visitors. The biggest drawback is that as your references
database grows, the front-end content stays the same. To combat this, I've
also written an update utility for the CMSs that employ this technique. It
allows an administrator to just run a complete "update" of the front-end
database (run all the back-end info through the filters again) whenever
he/she deems necessary. (If your content DB is kind of small (<1000
records), this utility can be piggy-backed on any updates to the references
DB.) I know this disobeys some of the great rules of RDB theory, but I think
a faster page load is sometimes more important than following all the rules.
Now we have both problems solved: you can have extensive filters without
making page generation run longer _and_ you can still edit the stuff when it
comes time to make changes.
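Sketched in PHP with plain arrays standing in for the MySQL tables (the
function and table names are made up for illustration, and apply_filters
is a stand-in for whatever keyword-linking and markup filters you run):

```php
<?php
// Stand-in for the real filter chain (keyword links, BB Code, etc.).
function apply_filters($raw) {
    return str_replace('Widgets',
        '<a href="http://www.widgets.com">Widgets</a>', $raw);
}

// On save: keep the author's original in the back-end table and the
// filtered version in the front-end table.
function save_article($db, $id, $raw) {
    $db['backend'][$id]  = $raw;                 // editable original
    $db['frontend'][$id] = apply_filters($raw);  // what visitors see
    return $db;
}

// The "update utility": re-run every back-end record through the
// filters, e.g. after the references table has grown.
function rebuild_frontend($db) {
    foreach ($db['backend'] as $id => $raw) {
        $db['frontend'][$id] = apply_filters($raw);
    }
    return $db;
}

$db = ['backend' => [], 'frontend' => []];
$db = save_article($db, 7, 'We sell Widgets.');
echo $db['frontend'][7];
```

Page views read only from the front-end table (no filtering at request
time), edits read and write the back-end table, and rebuild_frontend is
what the administrator runs when the filters or the references DB change.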
HTH,
Zac