Bytes IT Community

parsing a comma-separated list

AK
Once upon a time there was a table:

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE CHAR(2))@

After a while the developers realized that a vendor may be present in
several states, so the structure was changed to

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE_LIST
VARCHAR(150))@

STATE_LIST could contain a list of comma-separated values like this:
CA,WA,OR
Not the best approach, but out here in the field things like this
happen all the time. Everybody was absolutely happy until somebody
required a query like this:
SELECT * FROM VENDOR WHERE STATE IN ('IL', 'WI','MI')
SELECT COUNT(*), STATE FROM VENDOR GROUP BY STATE
and so on
Given a database structure developed by a skilled professional, that
would be a piece of cake, but not this time.
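Against the comma-separated column, about the best you can do is pad the
list with commas and use LIKE, which cannot use an index and gets uglier
with every state you add; the GROUP BY question is even worse. A rough
sketch of the LIKE workaround, just for illustration:

SELECT * FROM VENDOR
WHERE ',' || STATE_LIST || ',' LIKE '%,IL,%'
   OR ',' || STATE_LIST || ',' LIKE '%,WI,%'
   OR ',' || STATE_LIST || ',' LIKE '%,MI,%'@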

Moving to a better structure

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50))@
CREATE TABLE VENDOR_IN_STATE(VENDOR_ID INT, STATE CHAR(2))@

nobody wanted to change the front end application.
As a workaround, I created a view and INSTEAD OF triggers that worked
like this:

INSERT INTO VENDOR_IN_STATE_VIEW VALUES(1, 'ABC INC.', 'AK,AL,IL')@

SELECT * FROM VENDOR
VENDOR_ID NAME
----------- --------------------------------------------------
1 ABC INC.
1 record(s) selected.

SELECT * FROM VENDOR_IN_STATE
VENDOR_ID STATE
----------- -----
1 AK
1 AL
1 IL
3 record(s) selected.

I used recursion to define both the view and the triggers:

CREATE FUNCTION PARSE_LIST(C_LIST VARCHAR(100))
RETURNS TABLE(TOKEN VARCHAR(100))
SPECIFIC PARSE_LIST
RETURN
WITH PARSED_LIST(STEP, TOKEN, REMAINDER, LEFTMOST_COMMA)
AS(
   -- seed row: nothing parsed yet, the whole list is the remainder
   VALUES(0, '', C_LIST, 3)
   UNION ALL
   SELECT STEP + 1 AS STEP,
          -- everything up to the leftmost comma is the next token;
          -- if no comma is left, the whole remainder is the last token
          CASE WHEN LEFTMOST_COMMA > 0
               THEN CAST(LEFT(REMAINDER, LEFTMOST_COMMA - 1) AS CHAR(2))
               ELSE REMAINDER
          END AS TOKEN,
          -- whatever follows the leftmost comma becomes the new remainder
          CASE WHEN LEFTMOST_COMMA > 0
               THEN CAST(SUBSTR(REMAINDER, LEFTMOST_COMMA + 1) AS VARCHAR(100))
               ELSE NULL
          END AS REMAINDER,
          -- position of the first comma in the new remainder
          LOCATE(',', SUBSTR(REMAINDER, LOCATE(',', REMAINDER) + 1)) AS LEFTMOST_COMMA
   FROM PARSED_LIST
   WHERE REMAINDER IS NOT NULL
)
SELECT TOKEN FROM PARSED_LIST WHERE STEP > 0
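To see how the recursion unfolds, here is a rough trace for the list
'AK,AL,IL' (note that the seed row hard-codes LEFTMOST_COMMA = 3 and the
token is cast to CHAR(2), so the function assumes two-character tokens):

STEP TOKEN REMAINDER LEFTMOST_COMMA
---- ----- --------- --------------
   0       AK,AL,IL               3
   1 AK    AL,IL                  3
   2 AL    IL                     0
   3 IL    -                      0

The recursion stops once REMAINDER becomes NULL, and the final SELECT
keeps only the rows with STEP > 0, i.e. AK, AL, IL.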

I wrapped the recursive query in a UDF so that it could be reused.

It is very easy to invoke this table UDF:

SELECT * FROM TABLE(PARSE_LIST('AK,AR,IL,OH')) AS PARSE_LIST
TOKEN
------
AK
AR
IL
OH
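The same UDF also answers the original questions wherever a
comma-separated column still exists; for example, against the old
single-table layout (a sketch, assuming VENDOR still carried the
STATE_LIST column):

SELECT COUNT(*), S.TOKEN AS STATE
FROM VENDOR V,
     TABLE(PARSE_LIST(V.STATE_LIST)) AS S
GROUP BY S.TOKEN@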

I used the UDF in an INSTEAD OF trigger:

CREATE TRIGGER VENDOR_IN_STATE_I
INSTEAD OF INSERT
ON VENDOR_IN_STATE_VIEW
REFERENCING NEW AS N
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  DECLARE VENDOR_FOUND INT;
  SET VENDOR_FOUND = (SELECT COUNT(*) FROM VENDOR WHERE NAME = N.NAME);
  -- add the vendor itself only if it is not there yet
  IF NOT(VENDOR_FOUND > 0) THEN
    INSERT INTO VENDOR(VENDOR_ID, NAME) VALUES (N.VENDOR_ID, N.NAME);
  END IF;
  -- split the incoming STATE_LIST and insert one row per state
  INSERT INTO VENDOR_IN_STATE(VENDOR_ID, STATE)
    SELECT N.VENDOR_ID, PARSE_LIST.TOKEN
    FROM TABLE(PARSE_LIST(N.STATE_LIST)) AS PARSE_LIST;
END @
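Only the INSERT trigger is shown above; the view would also need UPDATE
and DELETE counterparts. A minimal sketch of a DELETE trigger (the name
and the exact semantics, removing the vendor together with all of its
states, are just one possible choice):

CREATE TRIGGER VENDOR_IN_STATE_D
INSTEAD OF DELETE
ON VENDOR_IN_STATE_VIEW
REFERENCING OLD AS O
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  DELETE FROM VENDOR_IN_STATE WHERE VENDOR_ID = O.VENDOR_ID;
  DELETE FROM VENDOR WHERE VENDOR_ID = O.VENDOR_ID;
END @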

I would really appreciate any feedback.
Are there any simpler approaches?
Nov 12 '05 #1
15 Replies


AK <ak************@yahoo.com> wrote:
Once upon a time there was a table:

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE CHAR(2))@

After a while the developers realized that a vendor may be present in
several states, so the structure was changed to [...] I would really appreciate any feedback.
Are there any simpler approaches?


That looks pretty much straightforward to me and similar to what I wrote
once:
http://www-106.ibm.com/developerwork...03stolze1.html

The only alternative that might be a bit faster would be to use an external
table function, written in C or Java. That way, you could avoid the
recursion in SQL. I really don't know if performance would improve or not,
however.
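For reference, registering such an external table function would look
roughly like this (the library and entry point names are made up, and
the C implementation itself is not shown):

CREATE FUNCTION PARSE_LIST_C(C_LIST VARCHAR(100))
RETURNS TABLE(TOKEN VARCHAR(100))
LANGUAGE C
PARAMETER STYLE DB2SQL
EXTERNAL NAME 'parselib!parse_list'
DETERMINISTIC
NO SQL
NO EXTERNAL ACTION
FENCED
SCRATCHPAD 100
FINAL CALL
DISALLOW PARALLEL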

--
Knut Stolze
Information Integration
IBM Germany / University of Jena
Nov 12 '05 #2

Why not keep the table as is (i.e., with comma-separated states) and
create a view that presents the names of the states in different
rows? That should be easier, as you won't need an INSTEAD OF trigger.

The view can be created like this:

select vendor_id,substr(state,locate(',',state,a * 3)+1,2) from
vendor,table(values(1),(2),(3),(4),(5)) as vendor_ids(a) where a <=
length(state)/3

Knut Stolze <st****@de.ibm.com> wrote in message news:<bs**********@fsuj29.rz.uni-jena.de>...
AK <ak************@yahoo.com> wrote:
Once upon a time there was a table:

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE CHAR(2))@

After a while the developers realized that a vendor may be present in
several states, so the structure was changed to

[...]
I would really appreciate any feedback.
Are there any simpler approaches?


That looks pretty much straightforward to me and similar to what I wrote
once:
http://www-106.ibm.com/developerwork...03stolze1.html

The only alternative that might be a bit faster would be to use an external
table function, written in C or Java. That way, you could avoid the
recursion in SQL. I really don't know if performance would improve or not,
however.

Nov 12 '05 #3

subaga <su******@yahoo.com> wrote:
Why not keep the table as is (i.e., with comma-separated states) and
create a view that presents the names of the states in different
rows? That should be easier, as you won't need an INSTEAD OF trigger.

The view can be created like this:

select vendor_id,substr(state,locate(',',state,a * 3)+1,2) from
vendor,table(values(1),(2),(3),(4),(5)) as vendor_ids(a) where a <=
length(state)/3


That only works if:
(1) the states are always exactly 2 characters long - if they are not, then
your calculation doesn't work.
(2) you have at most 5 states - ok, you could open this up by adding more
rows to "vendor_ids"; depending on the actual data, you might need several
1000 rows (VARCHARs can be up to 32K long!), and that might not be simpler
then.

--
Knut Stolze
Information Integration
IBM Germany / University of Jena
Nov 12 '05 #4

AK
Knut,

thanks for answering! I loved your article!
Nov 12 '05 #5

AK
su******@yahoo.com (subaga) wrote in message news:<f6************************@posting.google.com>...
Why not keep the table as is (i.e., with comma-separated states) and
create a view that presents the names of the states in different
rows? That should be easier, as you won't need an INSTEAD OF trigger.

The view can be created like this:

select vendor_id,substr(state,locate(',',state,a * 3)+1,2) from
vendor,table(values(1),(2),(3),(4),(5)) as vendor_ids(a) where a <=
length(state)/3


In addition to what Knut has already written:

if there is no table for states, there are no indexes or statistics
on it, and all the queries involving states run much slower. Also,
there are quite a few books on the advantages of normalization, such
as those written by Joe Celko.
Nov 12 '05 #6

>> After a while the developers realized that a vendor may be present in
several states, so the structure was changed to:

CREATE TABLE Vendors
(vendor_id INTEGER NOT NULL PRIMARY KEY,
vendor_name VARCHAR(35) NOT NULL,
state_list VARCHAR(150) NOT NULL);

STATE_LIST could contain a list of comma-separated values <<

Hopefully, someone took the idiot out to a wall, read him the part in
the most basic RDBMS book they could find about First Normal Form and
shot him. I like stories with happy endings.

Passing a list of parameters to a stored procedure can be done by
putting them into a string with a separator. I like to use the
traditional comma. Let's assume that you have a whole table full of
such parameter lists:

CREATE TABLE InputStrings
(keycol CHAR(10) NOT NULL PRIMARY KEY,
input_string VARCHAR(255) NOT NULL);

INSERT INTO InputStrings VALUES ('first', '12,34,567,896');
INSERT INTO InputStrings VALUES ('second', '312,534,997,896');
...

It makes life easier if the lists in the input strings start and end
with a comma. You will also need a table called Sequence, which is a
set of integers from 1 to (n).
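A minimal way to build such a table, just as a sketch (in practice you
would generate 1..(n) however is convenient on your platform):

CREATE TABLE Sequence (seq INTEGER NOT NULL PRIMARY KEY);

INSERT INTO Sequence (seq)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
-- ...and so on, up to the length of the longest padded input string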

CREATE VIEW Breakdown (keycol, parm_nbr, parameter)
AS
SELECT keycol, COUNT(S2.seq),
CAST (SUBSTRING (',' + I1.input_string + ',', MAX(S1.seq + 1),
(S2.seq - MAX(S1.seq + 1)))
AS INTEGER)
FROM InputStrings AS I1, Sequence AS S1, Sequence AS S2
WHERE SUBSTRING (',' + I1.input_string + ',' , S1.seq, 1) = ','
AND SUBSTRING (',' + I1.input_string + ',' , S2.seq, 1) = ','
AND S1.seq < S2.seq
AND S2.seq <= LEN (I1.input_string) + 2
GROUP BY I1.keycol, I1.input_string, S2.seq;

The S1 and S2 copies of Sequence are used to locate bracketing pairs
of commas, and the entire set of substrings located between them is
extracted and cast as integers in one non-procedural step. The trick
is to be sure that the left-hand comma of the bracketing pair is the
closest one to the second comma. The parm_nbr column tells you the
relative position of the value in the input string.

You can put this into a VIEW until you can fix the schema and perform
the conversion.
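For the sample rows above, a query like this (just to illustrate) should
return one parameter per row:

SELECT parm_nbr, parameter
FROM Breakdown
WHERE keycol = 'first'
ORDER BY parm_nbr;

-- expected: (1, 12), (2, 34), (3, 567), (4, 896)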
Nov 12 '05 #7

--CELKO-- wrote:
Passing a list of parameters to a stored procedure can be done by
putting them into a string with a separator. I like to use the
traditional comma. [...]

With the exception that I prefer vertical bars, or other characters that
couldn't legitimately be in the data stream, I agree. I think commas have
too many common uses to be of value as a separator.
--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Nov 12 '05 #8

AK
Joe,

I loved the idea of using a sequence table, many thanks!
I would say that there is no need for aggregation either.
Given this table expression:

(SELECT
ROWNUMBER() OVER(ORDER BY S1.SEQ) AS SEPARATOR_NUM,
S1.SEQ AS SEPARATOR_POS
FROM SEQUENCE_TAB S1
WHERE
SUBSTRING (',' + I1.input_string + ',' , S1.seq, 1) = ','
) AS SEPARATORS

then we could self-join it like this:

SELECT
(S1.SEPARATOR_POS + 1) AS TOKEN_FROM,
(S2.SEPARATOR_POS - S1.SEPARATOR_POS - 1) AS TOKEN_LENGTH
FROM SEPARATORS S1
JOIN SEPARATORS S2
ON (S1.SEPARATOR_NUM + 1) = S2.SEPARATOR_NUM

What do you think?
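One way to finish the sketch (hypothetical, and assuming I1 and
SEPARATORS come from the surrounding query over the comma-padded input
string, as above) is to feed those positions into SUBSTR:

SELECT SUBSTR(',' || I1.input_string || ',',
              S1.SEPARATOR_POS + 1,
              S2.SEPARATOR_POS - S1.SEPARATOR_POS - 1) AS TOKEN
FROM SEPARATORS S1
JOIN SEPARATORS S2
ON (S1.SEPARATOR_NUM + 1) = S2.SEPARATOR_NUM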
Nov 12 '05 #9

>> What do you think? <<

Neat but too proprietary for my taste. Remember, I am the guy who
pushes pure SQL-92 code.
Nov 12 '05 #10

AK
jo*******@northface.edu (--CELKO--) wrote in message news:<a2**************************@posting.google.com>...
>> What do you think? <<


Neat but too proprietary for my taste. Remember, I am the guy who
pushes pure SQL-92 code.


I would agree. However, entry-level folks weren't able to understand
either my original solution or yours. This is why an
easier-to-understand one was necessary.

I'd say OLAP functions are very flexible and powerful. Maybe they
deserve to get into the standard?
Nov 12 '05 #11

To the best of my knowledge, ROW_NUMBER() is in the standard. AFAIK it
is also supported by Oracle. So proprietary: nope.
Is the '+' in the standard for string types? Just curious...

Cheers
Serge

--
Serge Rielau
DB2 SQL Compiler Development
IBM Toronto Lab

Nov 12 '05 #12

> To the best of my knowledge, ROW_NUMBER() is in the standard. AFAIK it
is also supported by Oracle. So proprietary: nope.
Is the '+' in the standard for string types? Just curious...


Check out:

http://www.jtc1sc32.org/sc32/jtc1sc3...f?OpenDocument

The section on Window functions has all the relational OLAP stuff. I
noticed recursion and common table expressions were also there, all of
which work in Oracle and SQL Server (from Yukon on).

Christian.
Nov 12 '05 #13

AK
I would say that right now my top priority is to leave behind code
that is easy to understand and maintain. This is where OLAP functions
are very handy. I love them.
I am less, if at all, concerned about portability and ANSI standards.

I think:
'abc'+'cde' will work on MS SQL Server, but not on DB2 and not on
ORACLE
ROW_NUMBER() will work on DB2 and ORACLE but not on MS SQL Server
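For what it's worth, the standard concatenation operator does work on
DB2; a quick check against the usual dummy table:

SELECT 'abc' || 'cde' FROM SYSIBM.SYSDUMMY1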
Nov 12 '05 #14

>> is the '+' in the standard for string types? <<

No, concatenation is an infixed || token. We copied it from PL/I.
What was funny is that Phil Shaw, the ANSI X3H2 representative from
IBM at the time, did not like this symbol because he had worked with
PL/I and hated it.
Nov 12 '05 #15

> I would say that right now my top priority is to leave behind code
that is easy to understand and maintain. This is where OLAP functions
are very handy. I love them.
I agree. The bonus with OLAP functions is they perform well over very
large result sets. The ANSI 92 equivalent won't.
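As an illustration, here is a sketch reusing the VENDOR_IN_STATE table
from the start of the thread (and assuming STATE is unique per vendor,
so the two forms agree):

-- OLAP form: a single pass over the data
SELECT VENDOR_ID, STATE,
       ROW_NUMBER() OVER (PARTITION BY VENDOR_ID ORDER BY STATE) AS RN
FROM VENDOR_IN_STATE;

-- SQL-92 style equivalent: a correlated subquery evaluated for each row
SELECT V1.VENDOR_ID, V1.STATE,
       (SELECT COUNT(*)
          FROM VENDOR_IN_STATE V2
         WHERE V2.VENDOR_ID = V1.VENDOR_ID
           AND V2.STATE <= V1.STATE) AS RN
FROM VENDOR_IN_STATE V1;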
I am less, if at all, concerned about portability and ANSI standards.


I'm more concerned with portability - but only between the major
vendors. ANSI appears to be a good way to get consistency in the
syntax across the vendors (there's nothing worse than seeing everyone
support a feature differently). To me, the consistent nature of the
OLAP syntax is the best thing since vendors decided to support the
same outer join syntax.
Nov 12 '05 #16
