PRNG: March 2011

Here is a SQL recipe for finding group leaders in a relational database table.

Let's say you are tracking ARP data in a table `arp_data` with the following columns: `id`, `mac_address`, `ip_address`, and `timestamp`, where `id` is a primary key. A record in the table represents a particular event of interest.

root@localhost:test>

SELECT * FROM `arp_data`;

+----+---------------+------------+---------------------+

+----+---------------+------------+---------------------+

| 1 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:11 |

| 2 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:18 |

| 3 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:26:37 |

| 4 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:48 |

| 5 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:51 |

| 6 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:54 |

| 7 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:58 |

| 8 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:01 |

| 9 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:03 |

+----+---------------+------------+---------------------+

9 rows in set (0.00 sec)

Now, what we want to do is find the latest record for each MAC, IP pairing. That is to say, we want to do something like this in SQL:

root@localhost:test>

SELECT `id`, `mac_address`, `ip_address`, MAX(`timestamp`)

FROM `arp_data` GROUP BY `mac_address`, `ip_address`;

+----+---------------+------------+---------------------+

+----+---------------+------------+---------------------+

| 1 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:54 |

| 2 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:51 |

| 3 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:03 |

| 4 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:58 |

+----+---------------+------------+---------------------+

4 rows in set (0.00 sec)

This query returned a distinct list of MAC/IP pairs, with the most recent timestamp of each pair. Unfortunately, it did not get the correct `id` for the most recent MAC, IP pair. That is because `id` was not included in the GROUP BY clause. Doing so would have not yielded the desired results. So, how do we get the correct `id` for each pair? Here is one way:

root@localhost:test>

SELECT `a`.*

FROM `arp_data` AS `a`

JOIN

(SELECT `mac_address`, `ip_address`, MAX(`timestamp`) as `timestamp`

FROM `arp_data` GROUP BY `mac_address`, `ip_address`) AS `b`

USING (`mac_address`,`ip_address`, `timestamp`);

+----+---------------+------------+---------------------+

+----+---------------+------------+---------------------+

| 5 | abcd.ef01.234 | 10.0.0.2 | 2011-04-02 00:26:51 |

| 6 | abcd.ef01.234 | 10.0.0.1 | 2011-04-02 00:26:54 |

| 7 | abcd.ff01.234 | 10.0.0.2 | 2011-04-02 00:26:58 |

| 9 | abcd.ff01.234 | 10.0.0.1 | 2011-04-02 00:27:03 |

+----+---------------+------------+---------------------+

4 rows in set (0.00 sec)

We simply join the table with itself, using the the group columns and timestamp column to find the primary key. Note, this query will return duplicates if they exist.

PRNG

Thursday, March 31, 2011

finding group leaders in SQL

About Me

Blog Archive