472,802 Members | 1,357 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 472,802 developers and data experts.

Sorting Data with the Schwartzian Transform

KevinADC
4,059 Expert 2GB
If you are entirely unfamiliar with using Perl to sort data, read the "Sorting Data with Perl - Part One and Two" articles before reading this article. Beginning Perl coders may find this article uses unfamiliar terms and syntax. Intermediate and advanced Perl coders should find this article useful. The object of the article is to inform the reader, it is not about how to code Perl or how to write good Perl code, but to teach the Schwartzian Transform technique in a manner that is hopefully easily understood.

The Schwartzian Transform

The Schwartzian Transform (referred to as ST from now on) is a good balance between implementation and efficiency making it an attractive option for sorting data. For the curious, the name of this technique is in honor of its originator, Randal Schwartz. This article is a detailed look at how the ST works.

ACME INC.

First we need some data to sort. At the end of this article a data file is attached (employees.txt). It's a tab separated values list of employees of ACME INC. It's a list of ten employees with some fields that describe their employment status with ACME INC. The fields are:

EmployeeID, Name, HireDate, Position, Department, Salary

Expand|Select|Wrap|Line Numbers
  1. EmployeeID    Name    HireDate    Position    Department    Salary
  2. M011    Smith,John,T    01-12-1981    Welder    Maintenance    20.00 p/h
  3. M102    Hart,Thomas,J    06-30-1982    Supervisor    Maintenance    28.50 p/h
  4. J309    Jones,Steve,W    02-23-1990    Janitor    Janitorial    15.50 p/h
  5. A124    White,Mary,H    03-15-1990    Assembler    Assembly    8.75 p/h
  6. Q365    Miles,Frank,R    12-01-1999    Inspector    Quality    16.25 p/h
  7. A316    Roberts,Andy,P    07-30-2006    Assembler    Assembly    8.00 p/h
  8. S554    William,Terry,K    03-3-2005    Expeditor    Stock Room    9.00 p/h
  9. R078    Norris,Chris,K    04-17-2003    Clerk    Shipping/Recieving    9.00 p/h
  10. X832    Anderson,Jane,M    03-23-1992    VP Operations    Mangement    33.00 p/h
  11. X111    Kelly,Mark,D    05-29-1989    Engineer    Engineering    26.75 p/h
  12.  
The data file will be read into a hash of hashes (see the Online Resources section at the end of the article if you are unfamiliar with what a hash of hashes is).

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4. use Data::Dumper; # for printing a "picture" of the data
  5. my %employees;
  6. open (my $in, 'path/to/employees.txt') or die "$!";
  7. readline <$in>; #skip file header
  8. while (<$in>) {
  9.     chomp;
  10.     my ($id,$name,$hdate,$pos,$dept,$sal) = split/\t/;
  11.     $employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
  12. }
  13. close $in;
  14. print Dumper \%employees;
  15.  
The above code creates a hash of hashes from the employees file with the EmplyeeID being used as the top-level hash keys. It produces a series of records like this (partial output of Data::Dumper):

Expand|Select|Wrap|Line Numbers
  1. 'R078' => {
  2.            'DEPT' => 'Shipping/Recieving',
  3.            'POSITION' => 'Clerk',
  4.            'NAME' => 'Norris,Chris,K',
  5.            'HIREDATE' => '04-17-2003',
  6.            'SALARY' => '9.00 p/h'
  7.           },
  8.  
'R078' (the employee ID) is the key to an anonymous hash (see the Resources section at the end of the article if you are not familiar with what an anonymous hash is) that has the employee data stored in key/value pairs. The HIREDATE is in American format of MM-DD-YYYY. We are going to order the records by hire date in ascending order. To do that we will need to munge the date into sortable keys.

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4. use Data::Dumper;
  5. my %employees;
  6. open (my $in, 'path/to/employees.txt') or die "$!";
  7. readline <$in>; #skip header
  8. while (<$in>) {
  9.     chomp;
  10.     my ($id,$name,$hdate,$pos,$dept,$sal) = split/\t/;
  11.     $employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
  12. }
  13. close $in;
  14. my @sorted = map  {$employees{$_->[0]}{NAME} was hired on $employees{$_->[0]}{HIREDATE}}
  15.              sort {$a->[1] cmp $b->[1]}
  16.              map  {my ($m,$d,$y) = split/-/,$employees{$_}{HIREDATE};
  17.                   [$_, "$y$m$d"] } keys %employees;
  18.  
  19. print Dumper \@sorted;
  20.  
What we end up with is a list of employees sorted by hire date:

'Smith,John,T was hired on 01-12-1981',
'Hart,Thomas,J was hired on 06-30-1982',
'Kelly,Mark,D was hired on 05-29-1989',
'Jones,Steve,W was hired on 02-23-1990',
'White,Mary,H was hired on 03-15-1990',
'Anderson,Jane,M was hired on 03-23-1992',
'Miles,Frank,R was hired on 12-01-1999',
'Norris,Chris,K was hired on 04-17-2003',
'William,Terry,K was hired on 03-3-2005',
'Roberts,Andy,P was hired on 07-30-2006'

Lets take a closer look at the ST with most of the code stripped away to reveal what's going on.

Expand|Select|Wrap|Line Numbers
  1. my @sorted = map  {}
  2.              sort {}
  3.              map  {} keys %employees;
  4.  
  5.  
That is the bare essence of the ST: a sort block wrapped in two map{} blocks.

What Makes the Schwartzian Transform Fly

What makes the ST fly with so little code and no temporary variables is the ability to chain together list processing functions and Perls support of anonymous data storage.

It's maybe easier to visualize the ST technique linearly. As a flow chart it would look like this with the arrows indicating input/output direction:

Expand|Select|Wrap|Line Numbers
  1. END output  <-- map {} <-- sort {} <-- map {} <-- input START;
  2.  
The first map block is really the trickiest part when using this technique. In that block is where you decide what you need to do to create the list of sort keys that the sort block will sort. Lets take a closer look.

The Map Function

A map block is similar to a subroutine:

Expand|Select|Wrap|Line Numbers
  1. map  {
  2.     my ($m,$d,$y) = split/-/,$employees{$_}{HIREDATE};
  3.     [$_, "$y$m$d"]
  4.  } keys %employees;
  5.  
The data is read into the map block from the list appended to the end instead of imported with @_ like a subroutine. The list can be any list: an array, a hash, a (list of things), another function that creates a list, like the grep() and split() functions, a filehandle, or even a subroutine that returns a list.

The map block is also very much like a loop, but not the same as for/foreach/while/until loops. Map will process all the elements of a list, like grep() does. You can't use loop controls like "last" or "next" or "redo" with map.

In our example we used only one comparison to sort the data:

Expand|Select|Wrap|Line Numbers
  1. sort {$a->[1] cmp $b->[1]}
But you can use as many comparisons as you need to sort your data using the || (or) operator:

Expand|Select|Wrap|Line Numbers
  1. sort {$a->[1] cmp $b->[1] || 
  2.             $b->[2] <=> $a->[2] ||
  3.             $a->[3] cmp $b->[3]}
Note the mixed use of "cmp" and "<=>" and ascending and descending ordering all in the same sort block. This is perfectly valid and very useful depending on your particular needs.

The last map block can also munge the final output however you need it to for your particular application. In our example we just added a little text to make the output more readable. But you can do whatever you want in that map block: omit some of the data, add some markup code (like html code), change the order of the fields, whatever you want.

Why Use a Schwartzian Transform?

The ST is a good option when the sort keys do not actually exist within the data and the sort keys will only be used to sort the data and not for any other purpose. Like in my example we sorted the list by the hire date, but the sort key had to be made on-the-fly by splitting the date and reordering the American style date of MM-DD-YYYY into a sortable ASCII string, YYYYMMDD. The date could have been sorted numerically but sorting it as a string is really easier.

Another good use of the ST is when you have to sort some data just to test some sort keys that you will use in a larger more robust program.

Why Not Use a Schwartzian Transform?

If you have data with keys that are already is a sortable format there is no need to use the ST technique but you still could if you wanted to. You may also consider using a database that can sort data much more efficiently than a Perl script can.

Conclusion

The Schwartzian Transform is a powerful yet convenient method of sorting data. If after reading the article you are still a bit confused, try experimenting with some code to get a feel for how it works. Once you master this technique you will find yourself using it all the time.

Kevin (aka KevinADC)

Online Resources

Perldoc Website All the Perl documentation online.
perlreftut - Mark's very short tutorial about references
perldsc - data structure complex data structure struct
perllol - Manipulating Arrays of Arrays in Perl
map - apply a change to a list to get back a new list with the changes


This article is protected under the Creative Commons License.
Attached Files
File Type: txt employees.txt (667 Bytes, 638 views)
Feb 26 '08 #1
3 7128
KevinADC
4,059 Expert 2GB
Comments, corrections, discussions are welcome.
Feb 26 '08 #2
The Schwartzian Transform is a type of cached-key sort. The basic form is a map-sort-map using an anonymous array to carry the original datum and the sortable form:

Expand|Select|Wrap|Line Numbers
  1. @sorted_items =
  2. map { $_->[0] }
  3. sort { ... }
  4. map { [$_, ...] }
  5. @items;
  6.  
Fill in the bits to created the sortable form and the way to sort, and that's it. You get out a sorted list of the input elements which you can then use for the next bit (instead of creating completely new strings as you do):

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4.  
  5. my %employees;
  6.  
  7. chomp( my @headers = split /\t/, <DATA> );
  8.  
  9. while (<DATA>) {
  10.     chomp;
  11.     my %hash;
  12.     @hash{@headers} = split/\t/;
  13.     $employees{ $hash{EmployeeID} } = \%hash;
  14.     }
  15.  
  16. my @sorted_keys = 
  17.     map { $_->[0] }
  18.     sort { $a->[1] cmp $b->[1] }
  19.     map {
  20.         my ($m,$d,$y) = split/-/, $employees{$_}{HireDate};
  21.         [$_, "$y$m$d"];
  22.         } 
  23.     keys %employees;
  24.  
  25. foreach my $key ( @sorted_keys )
  26.     {
  27.     print "$employees{$key}{Name} was hired on $employees{$key}{HireDate}\n";
  28.     }
  29.  
  30. __DATA__
  31. EmployeeID    Name    HireDate    Position    Department    Salary
  32. M011    Smith,John,T    01-12-1981    Welder    Maintenance    20.00 p/h
  33. M102    Hart,Thomas,J    06-30-1982    Supervisor    Maintenance    28.50 p/h
  34. J309    Jones,Steve,W    02-23-1990    Janitor    Janitorial    15.50 p/h
  35. A124    White,Mary,H    03-15-1990    Assembler    Assembly    8.75 p/h
  36. Q365    Miles,Frank,R    12-01-1999    Inspector    Quality    16.25 p/h
  37. A316    Roberts,Andy,P    07-30-2006    Assembler    Assembly    8.00 p/h
  38. S554    William,Terry,K    03-3-2005    Expeditor    Stock Room    9.00 p/h
  39. R078    Norris,Chris,K    04-17-2003    Clerk    Shipping/Recieving    9.00 p/h
  40. X832    Anderson,Jane,M    03-23-1992    VP Operations    Mangement    33.00 p/h
  41. X111    Kelly,Mark,D    05-29-1989    Engineer    Engineering    26.75 p/h
  42.  
Also note that you can say &lt;$in&gt; or <tt>readline $in</tt> because those are the same thing. Don't combine them :)

For more information, reading the "Practical Reference Tricks" chapter in <i>Intermediate Perl</i>. Good luck :)
Jun 27 '08 #3
KevinADC
4,059 Expert 2GB
The Schwartzian Transform is a type of cached-key sort. The basic form is a map-sort-map using an anonymous array to carry the original datum and the sortable form:

Expand|Select|Wrap|Line Numbers
  1. @sorted_items =
  2. map { $_->[0] }
  3. sort { ... }
  4. map { [$_, ...] }
  5. @items;
  6.  
Fill in the bits to created the sortable form and the way to sort, and that's it. You get out a sorted list of the input elements which you can then use for the next bit (instead of creating completely new strings as you do):

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/perl
  2. use strict;
  3. use warnings;
  4.  
  5. my %employees;
  6.  
  7. chomp( my @headers = split /\t/, <DATA> );
  8.  
  9. while (<DATA>) {
  10.     chomp;
  11.     my %hash;
  12.     @hash{@headers} = split/\t/;
  13.     $employees{ $hash{EmployeeID} } = \%hash;
  14.     }
  15.  
  16. my @sorted_keys = 
  17.     map { $_->[0] }
  18.     sort { $a->[1] cmp $b->[1] }
  19.     map {
  20.         my ($m,$d,$y) = split/-/, $employees{$_}{HireDate};
  21.         [$_, "$y$m$d"];
  22.         } 
  23.     keys %employees;
  24.  
  25. foreach my $key ( @sorted_keys )
  26.     {
  27.     print "$employees{$key}{Name} was hired on $employees{$key}{HireDate}\n";
  28.     }
  29.  
  30. __DATA__
  31. EmployeeID    Name    HireDate    Position    Department    Salary
  32. M011    Smith,John,T    01-12-1981    Welder    Maintenance    20.00 p/h
  33. M102    Hart,Thomas,J    06-30-1982    Supervisor    Maintenance    28.50 p/h
  34. J309    Jones,Steve,W    02-23-1990    Janitor    Janitorial    15.50 p/h
  35. A124    White,Mary,H    03-15-1990    Assembler    Assembly    8.75 p/h
  36. Q365    Miles,Frank,R    12-01-1999    Inspector    Quality    16.25 p/h
  37. A316    Roberts,Andy,P    07-30-2006    Assembler    Assembly    8.00 p/h
  38. S554    William,Terry,K    03-3-2005    Expeditor    Stock Room    9.00 p/h
  39. R078    Norris,Chris,K    04-17-2003    Clerk    Shipping/Recieving    9.00 p/h
  40. X832    Anderson,Jane,M    03-23-1992    VP Operations    Mangement    33.00 p/h
  41. X111    Kelly,Mark,D    05-29-1989    Engineer    Engineering    26.75 p/h
  42.  
Also note that you can say &lt;$in&gt; or <tt>readline $in</tt> because those are the same thing. Don't combine them :)

For more information, reading the "Practical Reference Tricks" chapter in <i>Intermediate Perl</i>. Good luck :)

I am honored you took the time to post your comments Brian. It is very much appreciated.

Regards,
Kevin
Jul 2 '08 #4

Sign in to post your reply or Sign up for a free account.

Similar topics

20
by: Xah Lee | last post by:
Sort a List Xah Lee, 200510 In this page, we show how to sort a list in Python & Perl and also discuss some math of sort. To sort a list in Python, use the “sort” method. For example: ...
7
by: Cerebrus99 | last post by:
Hi all, I am confused about how to sort an XML file. I mean how to *actually* sort the data in the physical file, not how to display sorted data. I am using a large XML file as a back-end...
4
by: Jon Paal | last post by:
I am passing an xml data file to a server control and need to sort the data. The data has three levels(tables). What is the recommended approach to do the sorting ? (using vb.net and asp.net...
6
by: Christoph | last post by:
I'm trying to come up with a stylesheet where, when the rows are displayed, duplicate game names are not shown on subsequent rows. It works but doesn't work properly. If I sort the data using...
1
by: jearnshaw | last post by:
Newbie Moment!! I hope you guys can help. I admit it I know NOTHING about xml and css. But I need to get a large amount of data out of xml and onto an intranet site which uses restricted html.I...
3
by: gagonm | last post by:
Hi All I have an urgent issue where I m transforming an xml structure into html page using xslt . since data is more , I need to do Pagination .Also there is an option to do sorting of ...
3
by: fredo66 | last post by:
hello, Can someone help me with this: I have a array like this list where rows are the records and colom the fields. If I use the .sort() method on 'list' the data is sorted on the items of...
16
by: skip | last post by:
The thread on sorting in Python 3 got me to thinking. How could I sort a list of complex numbers using key? As expected: Traceback (most recent call last): File "<stdin>", line 1, in...
5
by: asc | last post by:
Hi all, I have a problem and I'm not sure whether sort() can help me. I understand that if I have a list; say L = I can use L.sort() and I will then have; L = But my problem is this. I have a...
3
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 2 August 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
0
by: erikbower65 | last post by:
Using CodiumAI's pr-agent is simple and powerful. Follow these steps: 1. Install CodiumAI CLI: Ensure Node.js is installed, then run 'npm install -g codiumai' in the terminal. 2. Connect to...
0
by: erikbower65 | last post by:
Here's a concise step-by-step guide for manually installing IntelliJ IDEA: 1. Download: Visit the official JetBrains website and download the IntelliJ IDEA Community or Ultimate edition based on...
0
by: kcodez | last post by:
As a H5 game development enthusiast, I recently wrote a very interesting little game - Toy Claw ((http://claw.kjeek.com/))。Here I will summarize and share the development experience here, and hope it...
0
by: Taofi | last post by:
I try to insert a new record but the error message says the number of query names and destination fields are not the same This are my field names ID, Budgeted, Actual, Status and Differences ...
14
DJRhino1175
by: DJRhino1175 | last post by:
When I run this code I get an error, its Run-time error# 424 Object required...This is my first attempt at doing something like this. I test the entire code and it worked until I added this - If...
0
by: lllomh | last post by:
Define the method first this.state = { buttonBackgroundColor: 'green', isBlinking: false, // A new status is added to identify whether the button is blinking or not } autoStart=()=>{
0
by: Mushico | last post by:
How to calculate date of retirement from date of birth
2
by: DJRhino | last post by:
Was curious if anyone else was having this same issue or not.... I was just Up/Down graded to windows 11 and now my access combo boxes are not acting right. With win 10 I could start typing...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.