How to Build a Long Regular Expression

When you write a regular expression to extract text, you usually start with a simple expression and extend it as you learn more about the target text. The result is a long string of special symbols that is very hard to read and almost impossible to improve.

We need a "smart" way to write regular expressions. Instead of writing a one-line expression, we prepare a multi-line text of named patterns from which the long expression is generated. Here is a simple example:

    space                    [\s/-]+
    word                     \w+
    words                    (?:{word}{space})*?{word}
    birthday                 (?<birthday>\d+\.\d+\.\d+)
    title                    {word}\.
    name                     {words}
    person                   {title}{space}{name}{space}{birthday}

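This text consists of two columns separated by spaces: the first column is the pattern name and the second is an easy-to-read regular expression, in which {name} refers to another pattern. The resulting regular expression for the pattern 'person' (leaving out the (?#name) comments that the class below inserts) will be: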
    \w+\.[\s/-]+(?:\w+[\s/-]+)*?\w+[\s/-]+(?<birthday>\d+\.\d+\.\d+)

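As a quick check of the expanded expression, here is a minimal C# sketch that runs it against a sample line and reads back the birthday group. The class name PersonPatternDemo, the helper ExtractBirthday and the sample text are invented for illustration, and the dots in the date are assumed to be literal, as in the birthday definition.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PersonPatternDemo
{
    // The expanded 'person' expression, with the dots in the date escaped.
    public const string PersonPattern =
        @"\w+\.[\s/-]+(?:\w+[\s/-]+)*?\w+[\s/-]+(?<birthday>\d+\.\d+\.\d+)";

    // Returns the captured birthday, or null when the input does not match.
    public static string ExtractBirthday(string input)
    {
        Match m = Regex.Match(input, PersonPattern);
        return m.Success ? m.Groups["birthday"].Value : null;
    }

    public static void Main()
    {
        // Sample inputs are made up for this sketch.
        Console.WriteLine(ExtractBirthday("Mr. John Smith 12.05.1980"));  // 12.05.1980
        Console.WriteLine(ExtractBirthday("no date here") == null);       // True
    }
}
```

The lazy group (?:...)*? lets the name consist of any number of words while still leaving the last number sequence for the birthday group.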
You can generate it with the following class:
    using System;
    using System.Collections.Specialized;
    using System.IO;
    using System.Text.RegularExpressions;

    public class Lexer
    {
        // Matches {name} references but skips Unicode categories such as \p{L}.
        private static readonly Regex TokenRegex =
            new Regex(@"(?<!\\[pP])\{([a-zA-Z]\w*)\}");

        // Pattern name -> pattern body, as read from the definition text.
        private readonly NameValueCollection col = new NameValueCollection();

        // Parses the definition text: one "name  expression" pair per line.
        public static Lexer Create(string resource)
        {
            Lexer lex = new Lexer();
            using (StringReader sr = new StringReader(resource))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    Match m = Regex.Match(line, @"(\w+)\s+(.*)");
                    if (m.Success)
                    {
                        // Assign through the indexer so that a repeated name replaces
                        // the earlier body; Add would join both with a comma.
                        lex.col[m.Groups[1].Value] = m.Groups[2].Value.Trim();
                    }
                }
            }
            return lex;
        }

        // Expands the named pattern, recursively substituting every {token}.
        // Circular definitions are not detected and would recurse forever.
        public string GetExpression(string name)
        {
            if (string.IsNullOrEmpty(name)) return string.Empty;
            string res = col[name];
            if (res == null)
                throw new ArgumentException("Template not found: " + name, "name");

            // A body containing an alternation is wrapped in a non-capturing
            // group so that it stays a single unit when embedded elsewhere.
            bool needGroup = res.IndexOf('|') >= 0;

            res = TokenRegex.Replace(res, m => GetExpression(m.Groups[1].Value));

            string result = needGroup ? "(?:" + res + ")" : res;

            // Prefix an inline comment so each part of the final expression
            // can be traced back to its pattern name.
            return "(?#" + name + ")" + result;
        }
    }
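
The needGroup check in GetExpression matters because an alternation that is pasted into a larger expression without a group splits the whole expression in two. The following is a small sketch of that effect. The Embed helper and the 'title' and greeting patterns are hypothetical and are not part of the Lexer class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Hypothetical helper: substitutes one {name} reference in a template,
    // optionally wrapping the body in a non-capturing group first.
    public static string Embed(string template, string name, string body, bool group)
    {
        string expansion = group ? "(?:" + body + ")" : body;
        return template.Replace("{" + name + "}", expansion);
    }

    public static void Main()
    {
        string template = @"Dear {title}\.";   // made-up pattern that uses {title}
        string title = "Mr|Mrs|Dr";            // made-up body with an alternation

        string naive = Embed(template, "title", title, false);   // Dear Mr|Mrs|Dr\.
        string grouped = Embed(template, "title", title, true);  // Dear (?:Mr|Mrs|Dr)\.

        // Without the group, "Mrs" on its own is a complete alternative.
        Console.WriteLine(Regex.IsMatch("Mrs", naive));          // True
        // With the group, the surrounding "Dear " and "." are required.
        Console.WriteLine(Regex.IsMatch("Mrs", grouped));        // False
        Console.WriteLine(Regex.IsMatch("Dear Mrs.", grouped));  // True
    }
}
```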

Then we can create an instance of the class and get the regular expression:
    Lexer lex = Lexer.Create(txtLexerText.Text);
    string expr = lex.GetExpression("person");
    Regex reg = new Regex(expr);

Oct 29 '07 #1