Regex needed for splitting on commas (not inside quotes)

conspireagainst
7 New Member
I'm having quite a time with this particular problem:

I have users that enter tag words as form input, let's say for a photo or a topic of discussion. They are allowed to delimit tags with spaces and commas, and can use quotes to encapsulate multiple words.

An example:

tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"

So, as we can see here anything is allowed, but the problem is that splitting on commas obviously destroys tag4 (the tag inside quotes but with a comma).

I've tried tons of regex, using preg_split, preg_match, and more but cannot figure out a solution. If this cannot be done, then it's also completely okay to tell the user they are not allowed to use commas inside of quoted strings via error message - which I tried doing using preg_match but my regex fails.

My regex for preg_match to alert the user of the problem:

/".*,.*"/

This works fine, but if you put any other quotes after the first quoted string (referring to "tag6 tag6") then it fails when it should not, since there are no commas inside tag6.
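One likely reason the pattern misfires: `.*` is greedy, so `/".*,.*"/` can stretch from the first quote in the input to the last one and pick up a comma that sits between two quoted strings. A minimal sketch of an alternative check follows. It assumes quotes are balanced and never escaped, and `has_quoted_comma` is a hypothetical helper name, not something from this thread.

```php
<?php
// Hypothetical helper: true if any double-quoted run contains a comma.
// Assumes balanced, unescaped double quotes.
function has_quoted_comma(string $input): bool
{
    // Matches are non-overlapping and found left to right, so quotes pair up
    // as (1st, 2nd), (3rd, 4th), and so on.
    preg_match_all('/"([^"]*)"/', $input, $matches);

    foreach ($matches[1] as $quoted) {
        if (strpos($quoted, ',') !== false) {
            return true;
        }
    }
    return false;
}

var_dump(has_quoted_comma('tag1, "tag4 tag4, tag4" tag5, "tag6 tag6"')); // bool(true)
var_dump(has_quoted_comma('tag1, "tag4 tag4" tag5, "tag6 tag6"'));       // bool(false)
```

Pairing the quotes first and only then looking for commas avoids treating the text between a closing quote and the next opening quote as if it were quoted.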
May 28 '07 #1
bergy
89 New Member
I'm no regex guru by any means, but here is something to try in case you haven't thought of this already. Change your example to this:

tag1, tag2 tag3, \"tag4 tag4, tag4\" tag5, \"tag6 tag6\"

And see if that works in your expression. If so, just run addslashes() on the tag string before you pass it to your regex.

May 28 '07 #2
conspireagainst
7 New Member
That doesn't seem to make a difference in my particular case. Thanks for the reply, though! Any other thoughts?

May 28 '07 #3
bergy
89 New Member
Well, this is really dirty, but what if you do a simple str_replace on the quotes in the string before the regex, replacing each " with ~ or even ___QUOTE___? Run your regex, then swap the ~ or ___QUOTE___ back to " when you're done. Definitely not the best or fastest solution, but it should work.
May 28 '07 #4
pbmods
5,821 Recognized Expert Expert
Here's what I came up with:

[PHP]
$str = 'tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"';
print("$str<br /><br />\n\n");

$matches = preg_split('/(?<!\\\\)"/', $str);

$tags = array();

foreach($matches as $match) {
    $innerTags = preg_split('/(?<!\\\\),/', $match);
    foreach($innerTags as $tag)
        if(preg_match('/\w/', $tag))
            $tags[] = trim($tag);
}

var_dump($tags);
[/PHP]

First, we split the string by '"', unless it's escaped (preceded by '\'). Now we know where the '"'s are.

Next, we run through each set of tags in between '"'s and split those by unescaped ',' (similarly to '\"', '\,' won't get split). Now we know where the individual tags are; all we have to do now is clean up.

We run through each potential tag and check to see if it has at least one alphanumeric character. If it does, we add it to $tags, sans leading/trailing whitespace.

This might not be the best way to do it; I whipped it up in about 10 minutes. But it works.

[EDIT: If you have magic_quotes turned on, you'll need to run stripslashes on your input first.]
May 28 '07 #5
conspireagainst
7 New Member
I have finally found a way to do what is needed, based on what everyone has said so far. It is complicated, that much is true, but it works. The last solution posted still did not work, so I am setting regex aside for now and simply using PHP's string functions.

Test data: (string)$tag_list = tag1, tag2, "tag3 tag3, tag3" tag4, "tag5, tag5", tag6 tag7

[PHP]
// ensure there is at least one beginning and one ending quote
if (strpos($tag_list, '"') !== false && strrpos($tag_list, '"', 1) !== false) {
    $quotes_found = true; // initialize the while loop
    // find the position of the last quote
    $last_quote_pos = strrpos($tag_list, '"', 1);
    // start at -1 so the first search begins at position 0 and a leading quote is not skipped
    $end_quote_pos = -1;
    // we go through the whole string until we reach the last quote
    while ($quotes_found) {
        // find the position of the beginning quote for this tag string
        $begin_quote_pos = strpos($tag_list, '"', ($end_quote_pos + 1));
        // find the position of the end quote for this tag string
        $end_quote_pos = strpos($tag_list, '"', ($begin_quote_pos + 1));
        // an opening quote with no partner: nothing left to fix
        if ($end_quote_pos === false) {
            break;
        }
        // have we reached the last quote?
        if ($begin_quote_pos == $last_quote_pos ||
            $end_quote_pos == $last_quote_pos) {
            $quotes_found = false; // set the while loop to stop after this
        }
        // split out the quoted tag string
        $original_tag = substr($tag_list, $begin_quote_pos, ($end_quote_pos - $begin_quote_pos + 1));
        // remove commas from the inside of the quoted string (this is the whole point of this entire if/while combo)
        $fixed_tag = strtr($original_tag, ',', ' ');
        // replace the quoted string tags with the fixed ones (sans commas)
        $tag_list = strtr($tag_list, array($original_tag => $fixed_tag));
    } // end while going through the string with quotes
} // end making sure there is at least one beginning and end quote
[/PHP]

End result: (string)$tag_list = tag1, tag2, "tag3 tag3 tag3" tag4, "tag5 tag5", tag6 tag7

I can now do what was originally sought out, which is to use explode() on $tag_list commas and/or spaces, giving me what I originally needed - which is a list of the tags in an array!
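The comma-stripping step can also be sketched more compactly with preg_replace_callback. This is an alternative to the loop above, not the code from the post; it assumes balanced, unescaped double quotes, and `strip_quoted_commas` is a hypothetical name.

```php
<?php
// Hypothetical helper: collapses each comma inside a double-quoted run
// (plus any whitespace around it) to a single space.
// Assumes balanced, unescaped double quotes.
function strip_quoted_commas(string $tag_list): string
{
    return preg_replace_callback(
        '/"[^"]*"/',
        function (array $m): string {
            // $m[0] is one whole quoted run, quotes included.
            return preg_replace('/\s*,\s*/', ' ', $m[0]);
        },
        $tag_list
    );
}

$tag_list = 'tag1, tag2, "tag3 tag3, tag3" tag4, "tag5, tag5", tag6 tag7';
echo strip_quoted_commas($tag_list), "\n";
// tag1, tag2, "tag3 tag3 tag3" tag4, "tag5 tag5", tag6 tag7
```

Text outside the quotes is never passed to the callback, so commas that separate tags are left alone.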
May 28 '07 #6
pbmods
5,821 Recognized Expert Expert
Heh. Oops. My code doesn't check to see if the comma is inside of quotes (which was the whole point).

Well at any rate, I'm glad you found a solution for your problem.
May 28 '07 #7
pbmods
5,821 Recognized Expert Expert
I couldn't deal with the fact that I couldn't figure this out, so I gave it one more go, and I think I've got it this time:

[PHP]
<?php
    $str = 'tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"';
    print("$str<br /><br />\n\n");

    $matches = preg_split('/(?<!\\\\)"/', $str);

    $tags = array();

    foreach($matches as $idx => $match) {
        if($idx % 2)
            $tags[] = stripslashes(trim($match));
        else {
            $innerTags = preg_split('/(?<!\\\\),/', $match);
            foreach($innerTags as $tag)
                if(preg_match('/\w/', $tag))
                    $tags[] = stripslashes(trim($tag));
        }
    }

    var_dump($tags);
?>
[/PHP]

The only difference is that we directly copy odd-index matches instead of parsing them for commas.

Think about it. When you split by quote, matches 0, 2, 4, etc. are outside of the quotes, but matches 1, 3, 5 are inside. So matches 1, 3 and 5 should not be parsed, but matches 0, 2 and 4 should. If you don't believe me, var_dump $matches.
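A small sketch that makes the even/odd pattern visible, using the same sample string and quote-splitting regex. The labels and the printf layout are only for illustration.

```php
<?php
$str = 'tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"';

// Split on every double quote that is not preceded by a backslash.
$matches = preg_split('/(?<!\\\\)"/', $str);

foreach ($matches as $idx => $match) {
    // Odd indexes were between a pair of quotes; even indexes were outside.
    $where = ($idx % 2) ? 'inside ' : 'outside';
    printf("%d %s [%s]\n", $idx, $where, $match);
}
// 0 outside [tag1, tag2 tag3, ]
// 1 inside  [tag4 tag4, tag4]
// 2 outside [ tag5, ]
// 3 inside  [tag6 tag6]
// 4 outside []
```

The empty last piece appears because the string ends with a quote. This only holds when the quotes are balanced; a stray quote shifts every later piece to the wrong parity.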

Also, I added stripslashes so that if you escape a quote or comma, it doesn't print the slash.

Here's what I get for output:
tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"

array(5) {
  [0]=>
  string(4) "tag1"
  [1]=>
  string(9) "tag2 tag3"
  [2]=>
  string(15) "tag4 tag4, tag4"
  [3]=>
  string(4) "tag5"
  [4]=>
  string(9) "tag6 tag6"
}

[EDIT: If you want to split tags by space in addition to comma, use this regex instead:

[PHP]
$innerTags = preg_split('/(?<!\\\\)((,\s*)|((?<!,)\s+))/', $match);
[/PHP]

This splits by either comma (with optional following whitespace character[s]) or else just a whitespace character[s] (with a negative lookbehind so that we don't end up catching '\,' properly only to then split at the following whitespace character).
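To see the comma-or-space pattern on its own, here is a short sketch that runs it over text as it would appear outside quotes. The sample strings and the PREG_SPLIT_NO_EMPTY flag are additions for illustration; the posted code filters empty pieces with preg_match('/\w/') instead.

```php
<?php
// Same pattern as in the post, applied to text that sits outside quotes.
$pattern = '/(?<!\\\\)((,\s*)|((?<!,)\s+))/';

// PREG_SPLIT_NO_EMPTY (an addition here) drops the empty pieces left by
// leading or trailing delimiters.
$plain   = preg_split($pattern, 'tag1, tag2 tag3, ', -1, PREG_SPLIT_NO_EMPTY);

// In a single-quoted PHP string, "\ " and "\," keep their backslash.
$escaped = preg_split($pattern, 'tag2\ tag3, tag5\, x', -1, PREG_SPLIT_NO_EMPTY);

print_r($plain);   // three pieces: tag1, tag2, tag3
print_r($escaped); // two pieces: "tag2\ tag3" and "tag5\, x"
```

In the second call the space after `\,` is not a split point because the inner `(?<!,)` lookbehind rejects whitespace that directly follows a comma.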

This will result in the following output:

tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"

array(6) {
  [0]=>
  string(4) "tag1"
  [1]=>
  string(4) "tag2"
  [2]=>
  string(4) "tag3"
  [3]=>
  string(15) "tag4 tag4, tag4"
  [4]=>
  string(4) "tag5"
  [5]=>
  string(9) "tag6 tag6"
}

Or try this on for size:

tag1, tag2\ tag3, "tag4 tag4, tag4" tag5\, \"tag6 tag6\"

array(5) {
  [0]=>
  string(4) "tag1"
  [1]=>
  string(9) "tag2 tag3"
  [2]=>
  string(15) "tag4 tag4, tag4"
  [3]=>
  string(11) "tag5, "tag6"
  [4]=>
  string(5) "tag6""
}

]

[EDIT EDIT: The one big flaw in these regexes is that they don't check for escaped slashes (e.g., \\"). But you know what? My brain hurts. You go figure it out. o_O]

[EDIT EDIT EDIT: Nevermind. I got that working, too:

[PHP]
<?php
    $str = 'tag1, tag2\ tag3, "tag4 tag4, tag4" tag5\\\\, "tag6 tag6\"';
    print("$str<br /><br />\n\n");

    $matches = preg_split('/((?<!\\\\)|(?<=\\\\\\\\))"/', $str);

    $tags = array();

    foreach($matches as $idx => $match) {
        if($idx % 2)
            $tags[] = stripslashes(trim($match));
        else {
            $innerTags = preg_split('/((?<!\\\\)|(?<=\\\\\\\\))((,\s*)|((?<!,)\s))/', $match);
            foreach($innerTags as $tag)
                if(preg_match('/\w/', $tag))
                    $tags[] = stripslashes(trim($tag));
        }
    }

    var_dump($tags);
?>
[/PHP]

The major difference is that we will split if the quote or comma and/or/xor space is either not preceded by a slash or is preceded by two slashes.
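A minimal sketch of that rule in isolation, using only the quote-splitting pattern. The short sample strings are made up for illustration.

```php
<?php
// Quote-splitting pattern from the post: split on a quote that is either not
// preceded by a backslash, or is preceded by two backslashes.
$pattern = '/((?<!\\\\)|(?<=\\\\\\\\))"/';

// In single-quoted PHP strings, \" stays backslash + quote and \\ is one backslash.
$one_slash  = preg_split($pattern, 'x\"y');     // quote is escaped: no split
$two_slash  = preg_split($pattern, 'x\\\\"y');  // backslash is escaped: split at the quote
$three_slash = preg_split($pattern, 'x\\\\\\"y'); // also splits, see the note below

print_r($one_slash);
print_r($two_slash);
print_r($three_slash);
```

Because the lookbehind only inspects the two characters before the quote, three backslashes also produce a split even though the quote is arguably escaped there. Counting backslashes properly would need a different approach, such as walking the string character by character.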

Sample output (note that the comma after tag5 gets parsed properly):
tag1, tag2\ tag3, "tag4 tag4, tag4" tag5\\, "tag6 tag6\"

array(5) {
  [0]=>
  string(4) "tag1"
  [1]=>
  string(9) "tag2 tag3"
  [2]=>
  string(15) "tag4 tag4, tag4"
  [3]=>
  string(5) "tag5\"
  [4]=>
  string(10) "tag6 tag6""
}

This is why pbmods should not be stuck in his apartment on a national holiday.]
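For comparison, here is a sketch of a different technique from the splits above: matching the tags themselves with preg_match_all instead of splitting on the delimiters. It is not the code from this thread, `parse_tags` is a hypothetical name, and it deliberately ignores backslash escapes, so it only covers the original unescaped input format.

```php
<?php
// Hypothetical helper: returns tags from input where commas and whitespace
// separate tags and double quotes group several words into one tag.
// Backslash escapes are NOT handled in this sketch.
function parse_tags(string $input): array
{
    // Alternative 1: a double-quoted run (contents in group 1).
    // Alternative 2: a run of characters that are not commas, whitespace,
    // or quotes (group 2).
    preg_match_all('/"([^"]*)"|([^,\s"]+)/', $input, $matches, PREG_SET_ORDER);

    $tags = array();
    foreach ($matches as $m) {
        // Group 2 is only non-empty for a bare (unquoted) tag.
        $tag = (isset($m[2]) && $m[2] !== '') ? $m[2] : trim($m[1]);
        if ($tag !== '') {
            $tags[] = $tag;
        }
    }
    return $tags;
}

print_r(parse_tags('tag1, tag2 tag3, "tag4 tag4, tag4" tag5, "tag6 tag6"'));
// tag1, tag2, tag3, "tag4 tag4, tag4", tag5, "tag6 tag6" as six separate tags
```

Matching tokens sidesteps the inside/outside bookkeeping, at the cost of silently skipping a stray unmatched quote rather than reporting it.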
May 28 '07 #8
conspireagainst
7 New Member
Wow, that's a lot of work you did! That code is probably better than what I have and accounts for things that I hope users won't do but I'm sure they will.

Thanks!
May 28 '07 #9
pbmods
5,821 Recognized Expert Expert
You are welcome.

And don't worry about me; my templating engine needed an overhaul anyway ~_^
May 28 '07 #10
