Help needed with a regular expression...

Hi all! I'm creating a mini CMS that will store content in a MySQL
database. What I am trying to do is parse the content and replace
certain keywords with a link. The keywords and associated links are
kept in a MySQL table.

Here is an example.

$keyword = "Widgets Technology Co.";
$location = "http://www.widgets.com/about";

$keyword2 = "Widgets";
$location2 = "http://www.widgets.com";

$content = "We have the best Widgets at Widgets Technology Co.";

I want to parse through $content looking for $keyword and replacing it
with:
<a href="$location">$keyword</a> (this I can do with no problem)

However, I am going to be looping through a series of keywords (phrases)
sorted by length (longest to shortest), and some of them contain other
keywords, such as the value above for $keyword2. Relinking those would
cause nested links and other nonsense.

So what I need is a regular expression that will find the $keyword
(phrase) only where it is not already between "<a href=" and "</a>",
so that it will not try to relink it.

So far, this is the regular expression that I have, but it does not work
properly:

[^(^\<a href=)][^(\>)]($keyword)[^(a\>)$]

If there is a better way of doing this, I would appreciate any
insights.

Thanks.

Keith
Jul 16 '05 #1

On 28-Jul-2003, ke***@iqtv.com (Keith Morris) wrote:
so what I'm needing is a regular expression that will find the
$keyword (phrase) that is not already between "<a href =" and "</a>"
so that it will not try to relink it.


You probably need a two-step process. First replace the key strings (longest
to shortest) with a unique tag like "!!!$recordid!!!", then replace the tags
with the "<a href...</a>". If you have a large database and only want to
make one pass through it, you could use a hash of the string as the tag, then
search for the tags and map each hash back to its string.
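
The two-step idea above could be sketched roughly as follows. This is only an illustration: the $links array stands in for the MySQL keyword table, the function name linkKeywords is made up for the example, and it assumes the string "!!!" never occurs in real content.

```php
<?php
// Hypothetical sample data: record id => [keyword, URL].
// In the CMS these rows would come from the MySQL keyword table.
$links = [
    1 => ['Widgets Technology Co.', 'http://www.widgets.com/about'],
    2 => ['Widgets', 'http://www.widgets.com'],
];

function linkKeywords(string $content, array $links): string
{
    // Longest keywords first, so "Widgets Technology Co." is tagged
    // before the shorter "Widgets" gets a chance to match inside it.
    uasort($links, function (array $a, array $b): int {
        return strlen($b[0]) <=> strlen($a[0]);
    });

    // Step 1: swap each keyword for a placeholder tag built from its id.
    foreach ($links as $id => [$keyword, $url]) {
        $content = str_replace($keyword, "!!!{$id}!!!", $content);
    }

    // Step 2: swap each placeholder for the finished link.
    foreach ($links as $id => [$keyword, $url]) {
        $anchor = '<a href="' . htmlspecialchars($url) . '">'
                . htmlspecialchars($keyword) . '</a>';
        $content = str_replace("!!!{$id}!!!", $anchor, $content);
    }

    return $content;
}

echo linkKeywords('We have the best Widgets at Widgets Technology Co.', $links), "\n";
```

Because the link markup is only written in step 2, the shorter keyword never sees the text of a longer one, so nothing gets nested. Note that str_replace is case-sensitive and will also match inside longer words, so a real filter may want something stricter.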

--
Tom Thackrey
www.creative-light.com
Jul 16 '05 #2
"Tom Thackrey" <to***@creative-light.com> wrote in message
news:w%*****************@newssvr23.news.prodigy.com...

You probably need a two-step process. First replace the key strings (longest
to shortest) with a unique tag like "!!!$recordid!!!", then replace the tags
with the "<a href...</a>". If you have a large database and only want to
make one pass through it, you could use a hash of the string as the tag, then
search for the tags and map each hash back to its string.


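I've used Tom's method myself in several situations and it seems to work
really well. However, a regexp can also work if you check for word
boundaries on either side of your keyword and reject any match that is
followed by "</a>" with no other tag in between:

$pattern = '/(?<!\w)'.preg_quote($keyword, '/').'(?!\w)(?![^<]*<\/a>)/i';
$content_string = preg_replace($pattern, '<a href="'.$url.'">$0</a>', $content_string);

The lookarounds (?<!\w) and (?!\w) do the job of \b, but they also work for
keywords that end in punctuation, such as "Widgets Technology Co.". Word
boundaries alone are not enough, though: there is a boundary between ">" and
"S", so the keyword "Something" would still match inside <a
href="url">Something Company</a>. The (?![^<]*<\/a>) lookahead is what makes
that match fail, since the text after "Something" runs straight into "</a>".
This assumes the link text itself contains no other tags.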
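
For what it's worth, here is a rough sketch of how that kind of pattern might be used in the longest-to-shortest loop from the original post. The $links array and the function name linkOutsideAnchors are made up for the example, and it assumes existing link text never contains other tags. It uses preg_replace_callback so that nothing in the URL is interpreted as a backreference.

```php
<?php
// Hypothetical sample data: keyword => URL (would come from the keyword table).
$links = [
    'Widgets' => 'http://www.widgets.com',
    'Widgets Technology Co.' => 'http://www.widgets.com/about',
];

function linkOutsideAnchors(string $content, array $links): string
{
    // Longest phrases first, as described in the original post.
    uksort($links, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    foreach ($links as $keyword => $url) {
        // (?<!\w) and (?!\w): the keyword must not touch a word character.
        // (?![^<]*<\/a>): skip matches that run into a closing </a>,
        // i.e. text (or an href value) that already sits inside a link.
        $pattern = '/(?<!\w)' . preg_quote($keyword, '/')
                 . '(?!\w)(?![^<]*<\/a>)/i';

        $content = preg_replace_callback(
            $pattern,
            function (array $m) use ($url): string {
                // $m[0] keeps the capitalisation found in the content.
                return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
            },
            $content
        );
    }

    return $content;
}

echo linkOutsideAnchors('We have the best Widgets at Widgets Technology Co.', $links), "\n";
```

Running the function a second time over its own output should leave it unchanged, since every keyword is by then followed by "</a>".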

Going back to Tom's suggestion, I prefer this method for a couple reasons.
On some CMSs, I do a lot of content filtering (checking for web links and
email addresses, allowing/denying HTML, replacing BB Code style markup, XML
data conversion, etc). For this reason, I pull some tricks using content
"highlighting" to get things like your example to work effectively.
However, I'm sure you'll agree that making multiple passes on your content
can be a waste of time and slow things down for your visitors. How do you
solve it? Well, I used to think that storing the filtered content was the
way to go. So, you filter your content coming from an "administrative"
user, store it in the DB, and serve the pages with little overhead. It
doesn't take long until you eventually have to edit this filtered content.
Then, that scheme goes out the window (unless all of your authors are HTML
gurus--which none of mine are). Here's my solution that has worked like a
charm:

Create two tables for your data: a "front-end" table and a "back-end" table.
In the front-end table, store all the prefiltered content. In the back-end
table, store the original content entered by the author. Assuming your
filters never change, the front-end content will always be generated the
same way, so the back-end is essentially just an "editable" version of what's
being served to your visitors. The biggest drawback is that as your
references database grows, the front-end content stays the same, so new
keywords never get linked in pages that were already filtered. To combat
this, I've also written an update utility for the CMSs that employ this
technique. It allows an administrator to run a complete "update" of the
front-end database (run all the back-end info through the filters again)
whenever he/she deems necessary. If your content DB is fairly small (under
1000 records), this utility can be piggy-backed on any updates to the
references DB. I know this disobeys some of the great rules of RDB theory,
but I think a faster page load is sometimes more important than following
all the rules.
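
To make the two-table idea a bit more concrete, here is a very small sketch. It is not the actual utility: plain arrays stand in for the MySQL tables, the filter is a toy keyword linker, and every name in it (ContentStore, save, forEditing, forDisplay, rebuild) is invented for the example.

```php
<?php
// In-memory stand-ins for the two tables; a real CMS would use MySQL.
// All class and method names here are hypothetical.
final class ContentStore
{
    private $backEnd = [];   // id => raw content as typed by the author
    private $frontEnd = [];  // id => filtered content served to visitors
    private $filter;         // callable applied to raw content

    public function __construct(callable $filter)
    {
        $this->filter = $filter;
    }

    // Saving stores the raw text and refreshes the filtered copy.
    public function save(int $id, string $raw): void
    {
        $this->backEnd[$id] = $raw;
        $this->frontEnd[$id] = ($this->filter)($raw);
    }

    // Editors load the raw text, never the filtered HTML.
    public function forEditing(int $id): string
    {
        return $this->backEnd[$id];
    }

    // Visitors get the prefiltered copy; nothing is filtered per page view.
    public function forDisplay(int $id): string
    {
        return $this->frontEnd[$id];
    }

    // The "update" run: push every raw record through the new filter.
    public function rebuild(callable $filter): void
    {
        $this->filter = $filter;
        foreach ($this->backEnd as $id => $raw) {
            $this->frontEnd[$id] = $filter($raw);
        }
    }
}

// Toy filter factory: naive keyword linking, just enough for the demo.
$makeFilter = function (array $links): callable {
    return function (string $raw) use ($links): string {
        foreach ($links as $keyword => $url) {
            $raw = str_replace($keyword, '<a href="' . $url . '">' . $keyword . '</a>', $raw);
        }
        return $raw;
    };
};

$store = new ContentStore($makeFilter(['Widgets' => 'http://www.widgets.com']));
$store->save(1, 'Widgets and Gadgets');

// The references table grows, so the front end is rebuilt from the raw copies.
$store->rebuild($makeFilter([
    'Widgets' => 'http://www.widgets.com',
    'Gadgets' => 'http://www.example.com/gadgets',
]));

echo $store->forDisplay(1), "\n";
```

The point of the split is that rebuild always starts from the raw back-end text, so filters can be re-run as often as needed without ever stacking links on top of already-filtered HTML.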

Now we have both problems solved: you can have extensive filters without
making page generation run longer _and_ you can still edit the stuff when it
comes time to make changes.

HTH,
Zac
Jul 16 '05 #3
