472,143 Members | 1,346 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,143 software developers and data experts.

Searching/replacing multiple words in input.

I need to search and replace multiple words in one pass of an input
stream or string. For example, given the input:

"The quick brown fox jumped over the lazy dog's back"

and given the replacements

quick -> slow
jump -> walk

I'd like to get the string

"The slow brown fox walked over the lazy dog's back."

I've done some Google searching and learned that the Aho-Corasick and
Commentz-Walter algorithms do this, but I can't find any Java
implementations on the net. These algorithms look time-consuming to
implement and test, and I don't want to go ahead and implement them if
I could just download an already working and debugged implementation.

Thank you.

---Chris

=========================
Good Fig - News for Christians
http://www.goodfig.org
Jul 17 '05 #1
3 13497
Using JUnit to test (and why wouldn't you?):

import junit.framework.TestCase;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTest extends TestCase
{
public RegexTest(String s)
{
super(s);
}

public void testRegex()
{
String s = "The quick brown fox jumped over the lazy dog's back";
String check = "The slow brown fox walked over the lazy dog's back";

Pattern p = Pattern.compile("quick");
Matcher m = p.matcher(s);
s = m.replaceFirst("slow");

p = Pattern.compile("jump");
m = p.matcher(s);
s = m.replaceFirst("walk");

assertEquals(check, s);
}
}

/qb

Christopher R. Barry wrote:
I need to search and replace multiple words in one pass of an input
stream or string. For example, given the input:

"The quick brown fox jumped over the lazy dog's back"

and given the replacements

quick -> slow
jump -> walk

I'd like to get the string

"The slow brown fox walked over the lazy dog's back."

I've done some Google searching and learned that the Aho-Corasick and
Commentz-Walter algorithms do this, but I can't find any Java
implementations on the net. These algorithms look time-consuming to
implement and test, and I don't want to go ahead and implement them if
I could just download an already working and debugged implementation.

Thank you.

---Chris

=========================
Good Fig - News for Christians
http://www.goodfig.org


Jul 17 '05 #2
John Kordyback <jk********@hotmail.com> wrote in message news:<KHJhb.65806$pl3.6728@pd7tw3no>...
[...]

Your example scans the input once for each word to replace though.
However, once I realized that java.util.regex.Pattern supports
patterns like "foo|bar" to mean match "foo" or match "bar", I realized
a somewhat inefficient solution to my problem would be to build a
string containing all the words I want to replace delimited by "|" and
use that as the match pattern.

Here's some example code that's rather wasteful with memory but works.
This will be adequate for my purposes in the short term but eventually
I'll want an efficient version that works on streams. You construct a
MultipleSearch object by passing it a hashtable containing as keys the
words to replace and as values the replacements, and then you call the
resulting MultipleSearch object's replaceAll method on the input
string. I've included a main method with an example. The
MultipleSearch class doesn't bother to copy the Hashtable, so the
hashtable shouldn't be modified after a MultipleSearch is constructed
from it unless you pass a copy to the MultipleSearch
constructor. Since Hashtable is synchronized MultipleSearch should be
thread-safe. This code is placed in the public domain.

---Chris

import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.Hashtable;
import java.util.Enumeration;
import java.lang.StringBuffer;

public class MultipleSearch
{
private Pattern wordsToReplacePattern;
private Hashtable replacementsTable;

public MultipleSearch(Hashtable ht) {

StringBuffer regexbuf = new StringBuffer(ht.size()*7);
Enumeration wordsToReplace = ht.keys();
if (wordsToReplace.hasMoreElements()) {
regexbuf.append(wordsToReplace.nextElement());
}
while (wordsToReplace.hasMoreElements()) {
regexbuf.append("|" + wordsToReplace.nextElement());
}
wordsToReplacePattern = Pattern.compile(regexbuf.toString());
replacementsTable = ht;
}

public String replaceAll(String input) {
StringBuffer sb = new StringBuffer(input.length());
Matcher m = wordsToReplacePattern.matcher(input);

int end = 0;
while (m.find()) {
String wordToReplace = m.group();
String replacement =
(String)replacementsTable.get(wordToReplace);
m.appendReplacement(sb, replacement);
end = m.end();
}
sb.append(input.substring(end));

return sb.toString();
}

public static void main(String args[]) {
String s =
"The quick brown fox jumped over the lazy dog's back.";

Hashtable ht = new Hashtable(10);
ht.put("quick", "slow");
ht.put("jump", "walk");
ht.put("lazy", "hard working");
ht.put("brown", "red");

MultipleSearch ms = new MultipleSearch(ht);
System.out.println(ms.replaceAll(s));
}
}
Jul 17 '05 #3
Oh absolutely. My example wasn't meant to be really useful or efficient.
IntelliJ almost screamed "Extract the method you moron!" at me as I wrote it.

Good luck...qb

Christopher R. Barry wrote:
John Kordyback <jk********@hotmail.com> wrote in message news:<KHJhb.65806$pl3.6728@pd7tw3no>...
[...]

Your example scans the input once for each word to replace though.
However, once I realized that java.util.regex.Pattern supports
patterns like "foo|bar" to mean match "foo" or match "bar", I realized
a somewhat inefficient solution to my problem would be to build a
string containing all the words I want to replace delimited by "|" and
use that as the match pattern.

Here's some example code that's rather wasteful with memory but works.
This will be adequate for my purposes in the short term but eventually
I'll want an efficient version that works on streams. You construct a
MultipleSearch object by passing it a hashtable containing as keys the
words to replace and as values the replacements, and then you call the
resulting MultipleSearch object's replaceAll method on the input
string. I've included a main method with an example. The
MultipleSearch class doesn't bother to copy the Hashtable, so the
hashtable shouldn't be modified after a MultipleSearch is constructed
from it unless you pass a copy to the MultipleSearch
constructor. Since Hashtable is synchronized MultipleSearch should be
thread-safe. This code is placed in the public domain.

---Chris

import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.Hashtable;
import java.util.Enumeration;
import java.lang.StringBuffer;

public class MultipleSearch
{
private Pattern wordsToReplacePattern;
private Hashtable replacementsTable;

public MultipleSearch(Hashtable ht) {

StringBuffer regexbuf = new StringBuffer(ht.size()*7);
Enumeration wordsToReplace = ht.keys();
if (wordsToReplace.hasMoreElements()) {
regexbuf.append(wordsToReplace.nextElement());
}
while (wordsToReplace.hasMoreElements()) {
regexbuf.append("|" + wordsToReplace.nextElement());
}
wordsToReplacePattern = Pattern.compile(regexbuf.toString());
replacementsTable = ht;
}

public String replaceAll(String input) {
StringBuffer sb = new StringBuffer(input.length());
Matcher m = wordsToReplacePattern.matcher(input);

int end = 0;
while (m.find()) {
String wordToReplace = m.group();
String replacement =
(String)replacementsTable.get(wordToReplace);
m.appendReplacement(sb, replacement);
end = m.end();
}
sb.append(input.substring(end));

return sb.toString();
}

public static void main(String args[]) {
String s =
"The quick brown fox jumped over the lazy dog's back.";

Hashtable ht = new Hashtable(10);
ht.put("quick", "slow");
ht.put("jump", "walk");
ht.put("lazy", "hard working");
ht.put("brown", "red");

MultipleSearch ms = new MultipleSearch(ht);
System.out.println(ms.replaceAll(s));
}
}


Jul 17 '05 #4

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

1 post views Thread by dpg | last post: by
1 post views Thread by Robert Oschler | last post: by
12 posts views Thread by Adam J. Schaff | last post: by
reply views Thread by leo001 | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.