Saturday, April 2, 2011

matching combinations in SQL

This SQL recipe is about searching for combinations. Let's say you have a table of winning lottery numbers. An example of such a table is given below.

root@localhost:test>
SELECT *
FROM `winners`
ORDER BY `drawingDate` LIMIT 10;
+-------------+--------+--------+--------+--------+--------+-----+
| drawingDate | white1 | white2 | white3 | white4 | white5 | red |
+-------------+--------+--------+--------+--------+--------+-----+
| 1997-11-05 | 2 | 19 | 35 | 24 | 28 | 26 |
| 1997-11-08 | 40 | 17 | 21 | 7 | 49 | 37 |
| 1997-11-12 | 41 | 14 | 13 | 29 | 16 | 1 |
| 1997-11-15 | 32 | 7 | 45 | 21 | 26 | 8 |
| 1997-11-19 | 19 | 10 | 48 | 3 | 21 | 22 |
| 1997-11-22 | 22 | 31 | 3 | 7 | 14 | 2 |
| 1997-11-26 | 15 | 46 | 34 | 23 | 40 | 35 |
| 1997-11-29 | 11 | 27 | 13 | 2 | 31 | 23 |
| 1997-12-03 | 18 | 9 | 14 | 47 | 42 | 32 |
| 1997-12-06 | 15 | 26 | 28 | 8 | 43 | 36 |
+-------------+--------+--------+--------+--------+--------+-----+
10 rows in set (0.00 sec)

I want to find how many matches a given combination of numbers has in this table. But let's start by searching for a single number, using the following query:

root@localhost:test>
SELECT *
FROM `winners`
WHERE `white1`=42 OR
`white2`=42 OR
`white3`=42 OR
`white4`=42 OR
`white5`=42 OR
`red`=42
ORDER BY `drawingDate` LIMIT 10;
+-------------+--------+--------+--------+--------+--------+-----+
| drawingDate | white1 | white2 | white3 | white4 | white5 | red |
+-------------+--------+--------+--------+--------+--------+-----+
| 1997-12-03 | 18 | 9 | 14 | 47 | 42 | 32 |
| 1998-03-07 | 32 | 36 | 42 | 30 | 21 | 33 |
| 1998-03-14 | 15 | 47 | 35 | 34 | 42 | 15 |
| 1998-03-25 | 42 | 44 | 35 | 23 | 5 | 29 |
| 1998-06-06 | 24 | 42 | 26 | 16 | 28 | 24 |
| 1998-07-04 | 39 | 25 | 38 | 15 | 30 | 42 |
| 1998-08-05 | 42 | 35 | 19 | 12 | 3 | 21 |
| 1998-08-08 | 45 | 42 | 5 | 9 | 33 | 28 |
| 1998-08-12 | 10 | 42 | 33 | 2 | 45 | 2 |
| 1998-08-26 | 6 | 32 | 46 | 23 | 42 | 8 |
+-------------+--------+--------+--------+--------+--------+-----+
10 rows in set (0.00 sec)

If we want to search for a combination of numbers, let's say (2,3,5,7,11) and a "powerball" of 42, then something like the following would work.

root@localhost:test>
SELECT *
FROM `winners`
WHERE
`white1` IN (2,3,5,7,11) OR
`white2` IN (2,3,5,7,11) OR
`white3` IN (2,3,5,7,11) OR
`white4` IN (2,3,5,7,11) OR
`white5` IN (2,3,5,7,11) OR
`red`=42
ORDER BY `drawingDate` LIMIT 10;

+-------------+--------+--------+--------+--------+--------+-----+
| drawingDate | white1 | white2 | white3 | white4 | white5 | red |
+-------------+--------+--------+--------+--------+--------+-----+
| 1997-11-05 | 2 | 19 | 35 | 24 | 28 | 26 |
| 1997-11-08 | 40 | 17 | 21 | 7 | 49 | 37 |
| 1997-11-15 | 32 | 7 | 45 | 21 | 26 | 8 |
| 1997-11-19 | 19 | 10 | 48 | 3 | 21 | 22 |
| 1997-11-22 | 22 | 31 | 3 | 7 | 14 | 2 |
| 1997-11-29 | 11 | 27 | 13 | 2 | 31 | 23 |
| 1997-12-13 | 44 | 41 | 46 | 35 | 5 | 27 |
| 1997-12-17 | 3 | 5 | 45 | 18 | 13 | 20 |
| 1997-12-20 | 23 | 11 | 12 | 17 | 10 | 7 |
| 1997-12-24 | 28 | 16 | 35 | 34 | 3 | 14 |
+-------------+--------+--------+--------+--------+--------+-----+
10 rows in set (0.00 sec)

To find the number of matches for each drawing, we will modify the query to use the 'UNION' operator. The above query is equivalent to the following:

root@localhost:test>
SELECT * FROM `winners` WHERE `white1` IN (2,3,5,7,11)
UNION
SELECT * FROM `winners` WHERE `white2` IN (2,3,5,7,11)
UNION
SELECT * FROM `winners` WHERE `white3` IN (2,3,5,7,11)
UNION
SELECT * FROM `winners` WHERE `white4` IN (2,3,5,7,11)
UNION
SELECT * FROM `winners` WHERE `white5` IN (2,3,5,7,11)
UNION
SELECT * FROM `winners` WHERE `red`=42
ORDER BY `drawingDate` LIMIT 10;

+-------------+--------+--------+--------+--------+--------+-----+
| drawingDate | white1 | white2 | white3 | white4 | white5 | red |
+-------------+--------+--------+--------+--------+--------+-----+
| 1997-11-05 | 2 | 19 | 35 | 24 | 28 | 26 |
| 1997-11-08 | 40 | 17 | 21 | 7 | 49 | 37 |
| 1997-11-15 | 32 | 7 | 45 | 21 | 26 | 8 |
| 1997-11-19 | 19 | 10 | 48 | 3 | 21 | 22 |
| 1997-11-22 | 22 | 31 | 3 | 7 | 14 | 2 |
| 1997-11-29 | 11 | 27 | 13 | 2 | 31 | 23 |
| 1997-12-13 | 44 | 41 | 46 | 35 | 5 | 27 |
| 1997-12-17 | 3 | 5 | 45 | 18 | 13 | 20 |
| 1997-12-20 | 23 | 11 | 12 | 17 | 10 | 7 |
| 1997-12-24 | 28 | 16 | 35 | 34 | 3 | 14 |
+-------------+--------+--------+--------+--------+--------+-----+
10 rows in set (0.00 sec)

Next comes the magic, where we will use 'UNION ALL' instead of 'UNION' in order to display duplicates, and then wrap the entire query with an outer SELECT statement that counts groups by drawingDate.

root@localhost:test>
SELECT COUNT(*), `a`.*
FROM
(SELECT * FROM `winners` WHERE `white1` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white2` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white3` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white4` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white5` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `red`=42) AS `a`
GROUP BY `a`.`drawingDate` ORDER BY `a`.`drawingDate` LIMIT 10;

+----------+-------------+--------+--------+--------+--------+--------+-----+
| COUNT(*) | drawingDate | white1 | white2 | white3 | white4 | white5 | red |
+----------+-------------+--------+--------+--------+--------+--------+-----+
| 1 | 1997-11-05 | 2 | 19 | 35 | 24 | 28 | 26 |
| 1 | 1997-11-08 | 40 | 17 | 21 | 7 | 49 | 37 |
| 1 | 1997-11-15 | 32 | 7 | 45 | 21 | 26 | 8 |
| 1 | 1997-11-19 | 19 | 10 | 48 | 3 | 21 | 22 |
| 2 | 1997-11-22 | 22 | 31 | 3 | 7 | 14 | 2 |
| 2 | 1997-11-29 | 11 | 27 | 13 | 2 | 31 | 23 |
| 1 | 1997-12-13 | 44 | 41 | 46 | 35 | 5 | 27 |
| 2 | 1997-12-17 | 3 | 5 | 45 | 18 | 13 | 20 |
| 1 | 1997-12-20 | 23 | 11 | 12 | 17 | 10 | 7 |
| 1 | 1997-12-24 | 28 | 16 | 35 | 34 | 3 | 14 |
+----------+-------------+--------+--------+--------+--------+--------+-----+
10 rows in set (0.06 sec)

Finally, we can add the count to the order-by clause.

root@localhost:test>
SELECT COUNT(*) AS `count`, `a`.*
FROM
(SELECT * FROM `winners` WHERE `white1` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white2` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white3` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white4` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `white5` IN (2,3,5,7,11)
UNION ALL
SELECT * FROM `winners` WHERE `red`=42) AS `a`
GROUP BY `a`.`drawingDate` ORDER BY `count` DESC, `a`.`drawingDate` LIMIT 10;

+-------+-------------+--------+--------+--------+--------+--------+-----+
| count | drawingDate | white1 | white2 | white3 | white4 | white5 | red |
+-------+-------------+--------+--------+--------+--------+--------+-----+
| 3 | 1998-09-26 | 25 | 24 | 5 | 3 | 2 | 8 |
| 3 | 2002-06-19 | 4 | 7 | 3 | 11 | 1 | 23 |
| 3 | 2003-02-01 | 11 | 2 | 39 | 21 | 16 | 42 |
| 3 | 2003-02-05 | 2 | 7 | 48 | 9 | 5 | 32 |
| 3 | 2004-12-22 | 7 | 11 | 31 | 24 | 2 | 18 |
| 3 | 2005-05-21 | 7 | 30 | 5 | 10 | 2 | 23 |
| 2 | 1997-11-22 | 22 | 31 | 3 | 7 | 14 | 2 |
| 2 | 1997-11-29 | 11 | 27 | 13 | 2 | 31 | 23 |
| 2 | 1997-12-17 | 3 | 5 | 45 | 18 | 13 | 20 |
| 2 | 1998-01-31 | 13 | 36 | 48 | 7 | 11 | 18 |
+-------+-------------+--------+--------+--------+--------+--------+-----+
10 rows in set (0.01 sec)

Thursday, March 31, 2011

finding group leaders in SQL

Here is a SQL recipe for finding group leaders in a relational database table.

Let's say you are tracking ARP data in a table `arp_data` with the following columns: `id`, `mac_address`, `ip_address`, and `timestamp`, where `id` is a primary key. A record in the table represents a particular event of interest.

root@localhost:test>
SELECT * FROM `arp_data`;
+----+---------------+------------+---------------------+
| id | mac_address | ip_address | timestamp |
+----+---------------+------------+---------------------+
| 1 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:11 |
| 2 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:18 |
| 3 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:26:37 |
| 4 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:48 |
| 5 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:51 |
| 6 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:54 |
| 7 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:58 |
| 8 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:01 |
| 9 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:03 |
+----+---------------+------------+---------------------+
9 rows in set (0.00 sec)


Now, what we want to do is find the latest record for each MAC, IP pairing. That is to say, we want to do something like this in SQL:

root@localhost:test>
SELECT `id`, `mac_address`, `ip_address`, MAX(`timestamp`)
FROM `arp_data` GROUP BY `mac_address`, `ip_address`;
+----+---------------+------------+---------------------+
| id | mac_address | ip_address | MAX(`timestamp`) |
+----+---------------+------------+---------------------+
| 1 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:54 |
| 2 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:51 |
| 3 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:03 |
| 4 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:58 |
+----+---------------+------------+---------------------+
4 rows in set (0.00 sec)


This query returned a distinct list of MAC/IP pairs, with the most recent timestamp of each pair. Unfortunately, it did not get the correct `id` for the most recent MAC, IP pair. That is because `id` was not included in the GROUP BY clause. Doing so would have not yielded the desired results. So, how do we get the correct `id` for each pair? Here is one way:

root@localhost:test>
SELECT `a`.*
FROM `arp_data` AS `a`
JOIN
(SELECT `mac_address`, `ip_address`, MAX(`timestamp`) as `timestamp`
FROM `arp_data` GROUP BY `mac_address`, `ip_address`) AS `b`
USING (`mac_address`,`ip_address`, `timestamp`);
+----+---------------+------------+---------------------+
| id | mac_address | ip_address | timestamp |
+----+---------------+------------+---------------------+
| 5 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:51 |
| 6 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:54 |
| 7 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:58 |
| 9 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:03 |
+----+---------------+------------+---------------------+
4 rows in set (0.00 sec)


We simply join the table with itself, using the the group columns and timestamp column to find the primary key. Note, this query will return duplicates if they exist.