By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
444,225 Members | 2,172 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Sorting Data with the Schwartzian Transform

KevinADC
Expert 2.5K+
P: 4,059
If you are entirely unfamiliar with using Perl to sort data, read the "Sorting Data with Perl - Part One and Two" articles before reading this article. Beginning Perl coders may find this article uses unfamiliar terms and syntax. Intermediate and advanced Perl coders should find this article useful. The object of the article is to inform the reader, it is not about how to code Perl or how to write good Perl code, but to teach the Schwartzian Transform technique in a manner that is hopefully easily understood.

The Schwartzian Transform

The Schwartzian Transform (referred to as ST from now on) is a good balance between implementation and efficiency making it an attractive option for sorting data. For the curious, the name of this technique is in honor of its originator, Randal Schwartz. This article is a detailed look at how the ST works.

ACME INC.

First we need some data to sort. At the end of this article a data file is attached (employees.txt). It's a tab separated values list of employees of ACME INC. It's a list of ten employees with some fields that describe their employment status with ACME INC. The fields are:

EmployeeID, Name, HireDate, Position, Department, Salary

Expand|Select|Wrap|Line Numbers
  1. EmployeeID    Name    HireDate    Position    Department    Salary
  2. M011    Smith,John,T    01-12-1981    Welder    Maintenance    20.00 p/h
  3. M102    Hart,Thomas,J    06-30-1982    Supervisor    Maintenance    28.50 p/h
  4. J309    Jones,Steve,W    02-23-1990    Janitor    Janitorial    15.50 p/h
  5. A124    White,Mary,H    03-15-1990    Assembler    Assembly    8.75 p/h
  6. Q365    Miles,Frank,R    12-01-1999    Inspector    Quality    16.25 p/h
  7. A316    Roberts,Andy,P    07-30-2006    Assembler    Assembly    8.00 p/h
  8. S554    William,Terry,K    03-3-2005    Expeditor    Stock Room    9.00 p/h
  9. R078    Norris,Chris,K    04-17-2003    Clerk    Shipping/Recieving    9.00 p/h
  10. X832    Anderson,Jane,M    03-23-1992    VP Operations    Mangement    33.00 p/h
  11. X111    Kelly,Mark,D    05-29-1989    Engineer    Engineering    26.75 p/h
  12.  
The data file will be read into a hash of hashes (see the Online Resources section at the end of the article if you are unfamiliar with what a hash of hashes is).

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4. use Data::Dumper; # for printing a "picture" of the data
  5. my %employees;
  6. open (my $in, 'path/to/employees.txt') or die "$!";
  7. readline <$in>; #skip file header
  8. while (<$in>) {
  9.     chomp;
  10.     my ($id,$name,$hdate,$pos,$dept,$sal) = split/\t/;
  11.     $employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
  12. }
  13. close $in;
  14. print Dumper \%employees;
  15.  
The above code creates a hash of hashes from the employees file with the EmplyeeID being used as the top-level hash keys. It produces a series of records like this (partial output of Data::Dumper):

Expand|Select|Wrap|Line Numbers
  1. 'R078' => {
  2.            'DEPT' => 'Shipping/Recieving',
  3.            'POSITION' => 'Clerk',
  4.            'NAME' => 'Norris,Chris,K',
  5.            'HIREDATE' => '04-17-2003',
  6.            'SALARY' => '9.00 p/h'
  7.           },
  8.  
'R078' (the employee ID) is the key to an anonymous hash (see the Resources section at the end of the article if you are not familiar with what an anonymous hash is) that has the employee data stored in key/value pairs. The HIREDATE is in American format of MM-DD-YYYY. We are going to order the records by hire date in ascending order. To do that we will need to munge the date into sortable keys.

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4. use Data::Dumper;
  5. my %employees;
  6. open (my $in, 'path/to/employees.txt') or die "$!";
  7. readline <$in>; #skip header
  8. while (<$in>) {
  9.     chomp;
  10.     my ($id,$name,$hdate,$pos,$dept,$sal) = split/\t/;
  11.     $employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
  12. }
  13. close $in;
  14. my @sorted = map  {$employees{$_->[0]}{NAME} was hired on $employees{$_->[0]}{HIREDATE}}
  15.              sort {$a->[1] cmp $b->[1]}
  16.              map  {my ($m,$d,$y) = split/-/,$employees{$_}{HIREDATE};
  17.                   [$_, "$y$m$d"] } keys %employees;
  18.  
  19. print Dumper \@sorted;
  20.  
What we end up with is a list of employees sorted by hire date:

'Smith,John,T was hired on 01-12-1981',
'Hart,Thomas,J was hired on 06-30-1982',
'Kelly,Mark,D was hired on 05-29-1989',
'Jones,Steve,W was hired on 02-23-1990',
'White,Mary,H was hired on 03-15-1990',
'Anderson,Jane,M was hired on 03-23-1992',
'Miles,Frank,R was hired on 12-01-1999',
'Norris,Chris,K was hired on 04-17-2003',
'William,Terry,K was hired on 03-3-2005',
'Roberts,Andy,P was hired on 07-30-2006'

Lets take a closer look at the ST with most of the code stripped away to reveal what's going on.

Expand|Select|Wrap|Line Numbers
  1. my @sorted = map  {}
  2.              sort {}
  3.              map  {} keys %employees;
  4.  
  5.  
That is the bare essence of the ST: a sort block wrapped in two map{} blocks.

What Makes the Schwartzian Transform Fly

What makes the ST fly with so little code and no temporary variables is the ability to chain together list processing functions and Perls support of anonymous data storage.

It's maybe easier to visualize the ST technique linearly. As a flow chart it would look like this with the arrows indicating input/output direction:

Expand|Select|Wrap|Line Numbers
  1. END output  <-- map {} <-- sort {} <-- map {} <-- input START;
  2.  
The first map block is really the trickiest part when using this technique. In that block is where you decide what you need to do to create the list of sort keys that the sort block will sort. Lets take a closer look.

The Map Function

A map block is similar to a subroutine:

Expand|Select|Wrap|Line Numbers
  1. map  {
  2.     my ($m,$d,$y) = split/-/,$employees{$_}{HIREDATE};
  3.     [$_, "$y$m$d"]
  4.  } keys %employees;
  5.  
The data is read into the map block from the list appended to the end instead of imported with @_ like a subroutine. The list can be any list: an array, a hash, a (list of things), another function that creates a list, like the grep() and split() functions, a filehandle, or even a subroutine that returns a list.

The map block is also very much like a loop, but not the same as for/foreach/while/until loops. Map will process all the elements of a list, like grep() does. You can't use loop controls like "last" or "next" or "redo" with map.

In our example we used only one comparison to sort the data:

Expand|Select|Wrap|Line Numbers
  1. sort {$a->[1] cmp $b->[1]}
But you can use as many comparisons as you need to sort your data using the || (or) operator:

Expand|Select|Wrap|Line Numbers
  1. sort {$a->[1] cmp $b->[1] || 
  2.             $b->[2] <=> $a->[2] ||
  3.             $a->[3] cmp $b->[3]}
Note the mixed use of "cmp" and "<=>" and ascending and descending ordering all in the same sort block. This is perfectly valid and very useful depending on your particular needs.

The last map block can also munge the final output however you need it to for your particular application. In our example we just added a little text to make the output more readable. But you can do whatever you want in that map block: omit some of the data, add some markup code (like html code), change the order of the fields, whatever you want.

Why Use a Schwartzian Transform?

The ST is a good option when the sort keys do not actually exist within the data and the sort keys will only be used to sort the data and not for any other purpose. Like in my example we sorted the list by the hire date, but the sort key had to be made on-the-fly by splitting the date and reordering the American style date of MM-DD-YYYY into a sortable ASCII string, YYYYMMDD. The date could have been sorted numerically but sorting it as a string is really easier.

Another good use of the ST is when you have to sort some data just to test some sort keys that you will use in a larger more robust program.

Why Not Use a Schwartzian Transform?

If you have data with keys that are already is a sortable format there is no need to use the ST technique but you still could if you wanted to. You may also consider using a database that can sort data much more efficiently than a Perl script can.

Conclusion

The Schwartzian Transform is a powerful yet convenient method of sorting data. If after reading the article you are still a bit confused, try experimenting with some code to get a feel for how it works. Once you master this technique you will find yourself using it all the time.

Kevin (aka KevinADC)

Online Resources

Perldoc Website All the Perl documentation online.
perlreftut - Mark's very short tutorial about references
perldsc - data structure complex data structure struct
perllol - Manipulating Arrays of Arrays in Perl
map - apply a change to a list to get back a new list with the changes


This article is protected under the Creative Commons License.
Attached Files
File Type: txt employees.txt (667 Bytes, 500 views)
Feb 26 '08 #1
Share this Article
Share on Google+
3 Comments


KevinADC
Expert 2.5K+
P: 4,059
Comments, corrections, discussions are welcome.
Feb 26 '08 #2

P: 1
The Schwartzian Transform is a type of cached-key sort. The basic form is a map-sort-map using an anonymous array to carry the original datum and the sortable form:

Expand|Select|Wrap|Line Numbers
  1. @sorted_items =
  2. map { $_->[0] }
  3. sort { ... }
  4. map { [$_, ...] }
  5. @items;
  6.  
Fill in the bits to created the sortable form and the way to sort, and that's it. You get out a sorted list of the input elements which you can then use for the next bit (instead of creating completely new strings as you do):

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4.  
  5. my %employees;
  6.  
  7. chomp( my @headers = split /\t/, <DATA> );
  8.  
  9. while (<DATA>) {
  10.     chomp;
  11.     my %hash;
  12.     @hash{@headers} = split/\t/;
  13.     $employees{ $hash{EmployeeID} } = \%hash;
  14.     }
  15.  
  16. my @sorted_keys = 
  17.     map { $_->[0] }
  18.     sort { $a->[1] cmp $b->[1] }
  19.     map {
  20.         my ($m,$d,$y) = split/-/, $employees{$_}{HireDate};
  21.         [$_, "$y$m$d"];
  22.         } 
  23.     keys %employees;
  24.  
  25. foreach my $key ( @sorted_keys )
  26.     {
  27.     print "$employees{$key}{Name} was hired on $employees{$key}{HireDate}\n";
  28.     }
  29.  
  30. __DATA__
  31. EmployeeID    Name    HireDate    Position    Department    Salary
  32. M011    Smith,John,T    01-12-1981    Welder    Maintenance    20.00 p/h
  33. M102    Hart,Thomas,J    06-30-1982    Supervisor    Maintenance    28.50 p/h
  34. J309    Jones,Steve,W    02-23-1990    Janitor    Janitorial    15.50 p/h
  35. A124    White,Mary,H    03-15-1990    Assembler    Assembly    8.75 p/h
  36. Q365    Miles,Frank,R    12-01-1999    Inspector    Quality    16.25 p/h
  37. A316    Roberts,Andy,P    07-30-2006    Assembler    Assembly    8.00 p/h
  38. S554    William,Terry,K    03-3-2005    Expeditor    Stock Room    9.00 p/h
  39. R078    Norris,Chris,K    04-17-2003    Clerk    Shipping/Recieving    9.00 p/h
  40. X832    Anderson,Jane,M    03-23-1992    VP Operations    Mangement    33.00 p/h
  41. X111    Kelly,Mark,D    05-29-1989    Engineer    Engineering    26.75 p/h
  42.  
Also note that you can say &lt;$in&gt; or <tt>readline $in</tt> because those are the same thing. Don't combine them :)

For more information, reading the "Practical Reference Tricks" chapter in <i>Intermediate Perl</i>. Good luck :)
Jun 27 '08 #3

KevinADC
Expert 2.5K+
P: 4,059
The Schwartzian Transform is a type of cached-key sort. The basic form is a map-sort-map using an anonymous array to carry the original datum and the sortable form:

Expand|Select|Wrap|Line Numbers
  1. @sorted_items =
  2. map { $_->[0] }
  3. sort { ... }
  4. map { [$_, ...] }
  5. @items;
  6.  
Fill in the bits to created the sortable form and the way to sort, and that's it. You get out a sorted list of the input elements which you can then use for the next bit (instead of creating completely new strings as you do):

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4.  
  5. my %employees;
  6.  
  7. chomp( my @headers = split /\t/, <DATA> );
  8.  
  9. while (<DATA>) {
  10.     chomp;
  11.     my %hash;
  12.     @hash{@headers} = split/\t/;
  13.     $employees{ $hash{EmployeeID} } = \%hash;
  14.     }
  15.  
  16. my @sorted_keys = 
  17.     map { $_->[0] }
  18.     sort { $a->[1] cmp $b->[1] }
  19.     map {
  20.         my ($m,$d,$y) = split/-/, $employees{$_}{HireDate};
  21.         [$_, "$y$m$d"];
  22.         } 
  23.     keys %employees;
  24.  
  25. foreach my $key ( @sorted_keys )
  26.     {
  27.     print "$employees{$key}{Name} was hired on $employees{$key}{HireDate}\n";
  28.     }
  29.  
  30. __DATA__
  31. EmployeeID    Name    HireDate    Position    Department    Salary
  32. M011    Smith,John,T    01-12-1981    Welder    Maintenance    20.00 p/h
  33. M102    Hart,Thomas,J    06-30-1982    Supervisor    Maintenance    28.50 p/h
  34. J309    Jones,Steve,W    02-23-1990    Janitor    Janitorial    15.50 p/h
  35. A124    White,Mary,H    03-15-1990    Assembler    Assembly    8.75 p/h
  36. Q365    Miles,Frank,R    12-01-1999    Inspector    Quality    16.25 p/h
  37. A316    Roberts,Andy,P    07-30-2006    Assembler    Assembly    8.00 p/h
  38. S554    William,Terry,K    03-3-2005    Expeditor    Stock Room    9.00 p/h
  39. R078    Norris,Chris,K    04-17-2003    Clerk    Shipping/Recieving    9.00 p/h
  40. X832    Anderson,Jane,M    03-23-1992    VP Operations    Mangement    33.00 p/h
  41. X111    Kelly,Mark,D    05-29-1989    Engineer    Engineering    26.75 p/h
  42.  
Also note that you can say &lt;$in&gt; or <tt>readline $in</tt> because those are the same thing. Don't combine them :)

For more information, reading the "Practical Reference Tricks" chapter in <i>Intermediate Perl</i>. Good luck :)

I am honored you took the time to post your comments Brian. It is very much appreciated.

Regards,
Kevin
Jul 2 '08 #4