473,399 Members | 3,038 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,399 software developers and data experts.

parsing a comma-separatinf list

AK
Once upon a time there was a table:

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE CHAR(2))@

in a while the developers realized that a vendor may be present in
xseveral states, so the structure was changed to

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE_LIST
VARCHAR(150))@

STATE)LIST could contain a list of comma-separated values like this:
CA,WA,OR
Not the best approach, but out here in the field things like this
happen all the time. Everybody were absolutely happy until somebody
required a query like this:
SELECT * FROM VENDOR WHERE STATE IN ('IL', 'WI','MI')
SELECT COUNT(*), STATE FROM VENDOR GROUP BY STATE
and so on
Given a dayabase structure developed by a skilled professional, that
would be a piece of cake, but not this time.

Moving to a better structure

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50))@
CREATE TABLE VENDOR_IN_STATE(VENDOR_ID INT, STATE CHAR(2))@

nobody wanted to change the front end application.
As a workaround, I created a view and INSTEAD OF triggers that worked
like this:

INSERT INTO VENDOR_IN_STATE_VIEW VALUES(1, 'ABC INC.', 'AK,AL,IL')@

SELECT * FROM VENDOR
VENDOR_ID NAME
----------- --------------------------------------------------
1 ABC INC.
1 record(s) selected.

SELECT * FROM VENDOR_IN_STATE
VENDOR_ID STATE
----------- -----
1 AK
1 AL
1 IL
3 record(s) selected.

I used recursion to define both the view and the triggers:

CREATE FUNCTION PARSE_LIST(C_LIST VARCHAR(100))
RETURNS TABLE(TOKEN VARCHAR(100))
SPECIFIC PARSE_LIST
RETURN
WITH PARSED_LIST(STEP, TOKEN, REMAINDER,LEFTMOST_COMMA)
AS(
VALUES(0, '',C_LIST,3)
UNION ALL
SELECT STEP+1 AS STEP,
CASE WHEN LEFTMOST_COMMA>0 THEN
CAST(LEFT(REMAINDER,LEFTMOST_COMMA-1) AS CHAR(2))
ELSE
REMAINDER
END AS TOKEN,
CASE WHEN LEFTMOST_COMMA>0 THEN
CAST(SUBSTR(REMAINDER,LEFTMOST_COMMA+1) AS VARCHAR(100))
ELSE
NULL
END AS REMAINDER,
LOCATE(',',SUBSTR(REMAINDER,LOCATE(',',REMAINDER)+ 1)) AS
LEFTMOST_COMMA
FROM PARSED_LIST
WHERE REMAINDER IS NOT NULL
)
SELECT TOKEN FROM PARSED_LIST WHERE STEP>0

I wrapped the recursive query in UDF so that it could be reused:

It is very easy to invoke this table UDF:

SELECT * FROM TABLE(PARSE_LIST('AK,AR,IL,OH')) AS PARSE_LIST
TOKEN
------
AK
AR
IL

OH

I user the UDF in an INSTEAD OF trigger

CREATE TRIGGER VENDOR_IN_STATE_I
INSTEAD OF INSERT
ON VENDOR_IN_STATE_VIEW
REFERENCING NEW AS N
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
DECLARE VENDOR_FOUND INT;
SET VENDOR_FOUND=(SELECT COUNT(*) FROM VENDOR WHERE NAME=N.NAME);
IF NOT(VENDOR_FOUND>0)
THEN
INSERT INTO VENDOR(VENDOR_ID, NAME) VALUES (N.VENDOR_ID, N.NAME);
END IF;
INSERT INTO VENDOR_IN_STATE(VENDOR_ID, STATE)
SELECT N.VENDOR_ID, PARSE_LIST.TOKEN FROM
TABLE(PARSE_LIST(N.STATE_LIST)) AS PARSE_LIST;
END @

I would really appreciate any feedback.
Are there any simpler approaches?
Nov 12 '05 #1
15 5708
AK <ak************@yahoo.com> wrote:
Once upon a time there was a table:

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE CHAR(2))@

in a while the developers realized that a vendor may be present in
xseveral states, so the structure was changed to [...] I would really appreciate any feedback.
Are there any simpler approaches?


That looks pretty much straight forward to me and similar to what I wrote
once:
http://www-106.ibm.com/developerwork...03stolze1.html

The only alternative that might be a bit faster would be to use an external
table function, written in C or Java. That way, you could avoid the
recursion in SQL. I really don't know if performance would improve or not,
however.

--
Knut Stolze
Information Integration
IBM Germany / University of Jena
Nov 12 '05 #2
why not keep the table as is (i.e. with comma separated states) and
create a view which will have the names of the states in different
rows. should be easier as u wont need a instead of trigger.

view can be created like

select vendor_id,substr(state,locate(',',state,a * 3)+1,2) from
vendor,table(values(1),(2),(3),(4),(5)) as vendor_ids(a) where a <=
length(state)/3

Knut Stolze <st****@de.ibm.com> wrote in message news:<bs**********@fsuj29.rz.uni-jena.de>...
AK <ak************@yahoo.com> wrote:
Once upon a time there was a table:

CREATE TABLE VENDOR(VENDOR_ID INT, NAME VARCHAR(50), STATE CHAR(2))@

in a while the developers realized that a vendor may be present in
xseveral states, so the structure was changed to

[...]
I would really appreciate any feedback.
Are there any simpler approaches?


That looks pretty much straight forward to me and similar to what I wrote
once:
http://www-106.ibm.com/developerwork...03stolze1.html

The only alternative that might be a bit faster would be to use an external
table function, written in C or Java. That way, you could avoid the
recursion in SQL. I really don't know if performance would improve or not,
however.

Nov 12 '05 #3
subaga <su******@yahoo.com> wrote:
why not keep the table as is (i.e. with comma separated states) and
create a view which will have the names of the states in different
rows. should be easier as u wont need a instead of trigger.

view can be created like

select vendor_id,substr(state,locate(',',state,a * 3)+1,2) from
vendor,table(values(1),(2),(3),(4),(5)) as vendor_ids(a) where a <=
length(state)/3


That only works if:
(1) the states are always exactly 2 characters long - if they are not, then
your calculation doesn't work.
(2) you have at most 5 states - ok, you could open this up by adding more
rows to "vendor_ids"; depending on the actual data, you might need several
1000 rows (VARCHARs can be up to 32K long!), and that might not be simpler
then.

--
Knut Stolze
Information Integration
IBM Germany / University of Jena
Nov 12 '05 #4
AK
Knut,

thanks for answering! I loved your article!
Nov 12 '05 #5
AK
su******@yahoo.com (subaga) wrote in message news:<f6************************@posting.google.co m>...
why not keep the table as is (i.e. with comma separated states) and
create a view which will have the names of the states in different
rows. should be easier as u wont need a instead of trigger.

view can be created like

select vendor_id,substr(state,locate(',',state,a * 3)+1,2) from
vendor,table(values(1),(2),(3),(4),(5)) as vendor_ids(a) where a <=
length(state)/3


In addition to what Knut has already writtenm,

if there is no table for states, htere are no indexes and statistics
on it, all the queries involving states run much slower. Also there
are quite a few books on the advantages of normalization, such as
written by Joe Celko
Nov 12 '05 #6
>> in a while the developers realized that a vendor may be present in
several states, so the structure was changed to :

CREATE TABLE Vendors
(vendor_id INTEGER NOT NULL PRIMARY KEY,
vendor_name VARCHAR(35) NOT NULL,
state_list VARCHAR(150)NOT NULL);

STATE_LIST could contain a list of comma-separated values <<

Hopefully, someone took the idiot out to a wall, read him the part in
the most basic RDBMS book they could find about First Normal Form and
shot him. I like stories with happy endings.

Passing a list of parameters to a stored procedure can be done by
putting them into a string with a separator. I like to use the
traditional comma. Let's assume that you have a whole table full of
such parameter lists:

CREATE TABLE InputStrings
(keycol CHAR(10) NOT NULL PRIMARY KEY,
input_string VARCHAR(255) NOT NULL);

INSERT INTO InputStrings VALUES ('first', '12,34,567,896');
INSERT INTO InputStrings VALUES ('second', '312,534,997,896');
...

It makes life easier if the lists in the input strings start and end
with a comma. You will also need a table called Sequence, which is a
set of integers from 1 to (n).

CREATE VIEW Breakdown (keycol, parm_nbr, parameter)
AS
SELECT keycol, COUNT(S2.seq),
CAST (SUBSTRING (',' + I1.input_string + ',', MAX(S1.seq + 1),
(S2.seq - MAX(S1.seq + 1)))
AS INTEGER)
FROM InputStrings AS I1, Sequence AS S1, Sequence AS S2
WHERE SUBSTRING (',' + I1.input_string + ',' , S1.seq, 1) = ','
AND SUBSTRING (',' + I1.input_string + ',' , S2.seq, 1) = ','
AND S1.seq < S2.seq
AND S2.seq <= LEN (I1.input_string) + 2
GROUP BY I1.keycol, I1.input_string, S2.seq;

The S1 and S2 copies of Sequence are used to locate bracketing pairs
of commas, and the entire set of substrings located between them is
extracts and cast as integers in one non-procedural step. The trick
is to be sure that the left hand comma of the bracketing pair is the
closest one to the second comma. The place column tells you the
relative position of the value in the input string.

YOu can put this into a VIEW until you can fix the schema and perform
the execution.
Nov 12 '05 #7
--CELKO-- wrote:
in a while the developers realized that a vendor may be present in


several states, so the structure was changed to :

CREATE TABLE Vendors
(vendor_id INTEGER NOT NULL PRIMARY KEY,
vendor_name VARCHAR(35) NOT NULL,
state_list VARCHAR(150)NOT NULL);

STATE_LIST could contain a list of comma-separated values <<

Hopefully, someone took the idiot out to a wall, read him the part in
the most basic RDBMS book they could find about First Normal Form and
shot him. I like stories with happy endings.

Passing a list of parameters to a stored procedure can be done by
putting them into a string with a separator. I like to use the
traditional comma. Let's assume that you have a whole table full of
such parameter lists:

CREATE TABLE InputStrings
(keycol CHAR(10) NOT NULL PRIMARY KEY,
input_string VARCHAR(255) NOT NULL);

INSERT INTO InputStrings VALUES ('first', '12,34,567,896');
INSERT INTO InputStrings VALUES ('second', '312,534,997,896');
...

It makes life easier if the lists in the input strings start and end
with a comma. You will also need a table called Sequence, which is a
set of integers from 1 to (n).

CREATE VIEW Breakdown (keycol, parm_nbr, parameter)
AS
SELECT keycol, COUNT(S2.seq),
CAST (SUBSTRING (',' + I1.input_string + ',', MAX(S1.seq + 1),
(S2.seq - MAX(S1.seq + 1)))
AS INTEGER)
FROM InputStrings AS I1, Sequence AS S1, Sequence AS S2
WHERE SUBSTRING (',' + I1.input_string + ',' , S1.seq, 1) = ','
AND SUBSTRING (',' + I1.input_string + ',' , S2.seq, 1) = ','
AND S1.seq < S2.seq
AND S2.seq <= LEN (I1.input_string) + 2
GROUP BY I1.keycol, I1.input_string, S2.seq;

The S1 and S2 copies of Sequence are used to locate bracketing pairs
of commas, and the entire set of substrings located between them is
extracts and cast as integers in one non-procedural step. The trick
is to be sure that the left hand comma of the bracketing pair is the
closest one to the second comma. The place column tells you the
relative position of the value in the input string.

YOu can put this into a VIEW until you can fix the schema and perform
the execution.


With the exception that I prefer vertical bars, or other characters that
couldn't legitimately be in the data stream I agree. I think commas have
too many common uses to be of value as a separator.
--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Nov 12 '05 #8
AK
Joe,

I loved the idea to use a sequence table, many thanks!
I would say that there is no need for aggregation either.
Given this table expression:

(SELECT
ROWNUMBER() OVER(ORDER BY S1.SEQ) AS SEPARATOR_NUM,
S1.SEQ AS SEPARATOR_POS
FROM SEQUENCE_TAB S1
WHERE
SUBSTRING (',' + I1.input_string + ',' , S1.seq, 1) = ','
) AS SEPARATORS

then we could self-join it like this:

SELECT
(S1.SEPARATOR_POS + 1) AS TOKEN_FROM
(S2.SEPARATOR_POS - S1.SEPARATOR_POS - 1) AS TOKEN.LENGTH
FROM SEPATATORS S1
JOIN SEPARATORS S2
ON (S1.SEPARATOR_NUM + 1)=S2.SEPARATOR_NUM

What do you think?
Nov 12 '05 #9
>> What do you think? <<

Neat but too proprietary for my taste. Remember, I am the guy who
pushes pure SQL-92 code.
Nov 12 '05 #10
AK
jo*******@northface.edu (--CELKO--) wrote in message news:<a2**************************@posting.google. com>...
What do you think? <<


Neat but too proprietary for my taste. Remember, I am the guy who
pushes pure SQL-92 code.


I would agree. However entry-level folks weren't able to understand
neither my original solution nor yours. This is why an
easy-to-understand one was necessary.

I'd say OLAP functions are very flexible and powerful. Maybe they
deserve getting into the standard?
Nov 12 '05 #11
To the best of my knowledge, ROW_NUMBER() is in the standard. AFAIK it
is alse supported by Oracle. So proprietary: Nope.
is the '+' in the standard for string types? Just curious....

Cheers
Serge

--
Serge Rielau
DB2 SQL Compiler Development
IBM Toronto Lab

Nov 12 '05 #12
> To the best of my knowledge, ROW_NUMBER() is in the standard. AFAIK it
is alse supported by Oracle. So proprietary: Nope.
is the '+' in the standard for string types? Just curious....


Check out:

http://www.jtc1sc32.org/sc32/jtc1sc3...f?OpenDocument

The section on Window functions has all the relational OLAP stuff. I
noticed recursion and common table expressions were also there, all of
which work in Oracle and SQL Server (from Yukon on).

Christian.
Nov 12 '05 #13
AK
I would say that right now my top priority is to leave behind me code
that is easy to understand and maintain. This is where OLAP functions
are very handy. I love them.
I am less, if at all, concerned about portability and ANSI standards.

I think:
'abc'+'cde' will work on MS SQL Server , bot not on DB2 and not on
ORACLE
ROW_NUMBER() will work on DB@ and ORACLE but not on MS SQL Server
Nov 12 '05 #14
>> is the '+' in the standard for string types? <<

No, concatenation is an infixed || token. We copied it from PL/I.
What was funny is that Phil Shaw, the ANSI X3H2 representative from
IBM at the time did not like this symbol because he had worked with
PL/I and hated it.
Nov 12 '05 #15
> I would say that right now my top priority is to leave behind me code
that is easy to understand and maintain. This is where OLAP functions
are very handy. I love them.
I agree. The bonus with OLAP functions is they perform well over very
large result sets. The ANSI 92 equivalent won't.
I am less, if at all, concerned about portability and ANSI standards.


I'm more concerned with portability - but only between the major
vendors. ANSI appears to be a good way to get consistency in the
syntax across the vendors (there's nothing worse than seeing everyone
support a feature differently) - To me, the consistent nature of the
OLAP syntax is the best thing since vendors decided to support the
same outer join syntax.
Nov 12 '05 #16

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: felciano | last post by:
Hi -- I am trying to use the csv module to parse a column of values containing comma-delimited values with unusual escaping: AAA, BBB, CCC (some text, right here), DDD I want this to come...
2
by: GIMME | last post by:
I can't figure an expression needed to parse a string. This problem arrises from parsing Excel csv files ... The expression must parse a string based upon comma delimiters, but if a comma...
8
by: Darius Fatakia | last post by:
Hello, I have a file that I have opened for reading and this file contains lines with several different types of constraint information. For example, here are a few lines: length(0) = 10...
26
by: SL33PY | last post by:
Hi, I'm having a problem parsing strings (comming from a flat text input file) to doubles. the code: currentImportDetail.Result = CType(line.Substring(7, 8).Trim(" "), System.Double) What...
2
by: Trip | last post by:
Hello all, Is there anyway to store an array to MS Access 2002 without parsing the entire array row-by-row. For example, Oracle allows you to store BLOBs (binary large objects). I would like...
15
by: VMI | last post by:
I'm parsing a comma-delimited record but I want it to do something if some of the string is between "". How can I do this? With the Excel import it does it correct. I'm using String.Split()....
4
by: igotyourdotnet | last post by:
I have a question. I'm reading a CSV file that is uploading to my SQL db, I'm parsing out the file line by line. I'm getting the values and putting them into an arrayList seperate by commas. The...
4
by: william | last post by:
Hello, I've imported an excel spreadsheet with a Name column which is formatted as Last, First, MI. Some examples I have in the Name column: Smith, Ellen P. Jones, Mary Jane...
9
by: Jasper | last post by:
Hi, I have multiple data files which need parsing in realtime so high performance is *crucial*. I dont have a format definition, but from what I can see there is a hierarchy of data. Each...
4
mickey0
by: mickey0 | last post by:
hello, I have to parsing a string like this: char * = "10 20 30 40"; and put its number into a vector<double> I thought to use strchr() and atof() and it's seems ok but to say the truth the line...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.