569. Median Employee Salary


The Employee table holds all employees. The employee table has three columns: Employee Id, Company Name, and Salary.

+-----+------------+--------+
|Id   | Company    | Salary |
+-----+------------+--------+
|1    | A          | 2341   |
|2    | A          | 341    |
|3    | A          | 15     |
|4    | A          | 15314  |
|5    | A          | 451    |
|6    | A          | 513    |
|7    | B          | 15     |
|8    | B          | 13     |
|9    | B          | 1154   |
|10   | B          | 1345   |
|11   | B          | 1221   |
|12   | B          | 234    |
|13   | C          | 2345   |
|14   | C          | 2645   |
|15   | C          | 2645   |
|16   | C          | 2652   |
|17   | C          | 65     |
+-----+------------+--------+

Write a SQL query to find the median salary of each company. Bonus points if you can solve it without using any built-in SQL functions.

+-----+------------+--------+
|Id   | Company    | Salary |
+-----+------------+--------+
|5    | A          | 451    |
|6    | A          | 513    |
|12   | B          | 234    |
|9    | B          | 1154   |
|14   | C          | 2645   |
+-----+------------+--------+

b'
\n\n

Solution

\n
\n

Approach #1: Using the definition of median [Accepted]

\n

Intuition

\n

By the definition of median, the count of the bigger numbers than itself should be equal to the count of the smaller ones in an odd array.

\n

Algorithm

\n

Take array [1,3,2] for example, is the first number 1 the median? No, because this array only have 3 elements but there are 2 of them (3, 2) are greater than 1. To continue, we know 3 is not the median as well since there are 2 elements smaller. But for the last element 2, there are equal amount of bigger and smaller numbers. So it is the median in this array!

\n

What if an array has even amount of distinct values, the median is the average of the middle two elements next to each other after sorting this array. It is not hard to understand that for either of these two elements, the difference (absolute value) of its bigger and smaller number than itself in this array is 1, which is the exactly frequency of a element in the distinct array.

\n

So in general, the median\'s frequency should be equal or grater than the absolute difference of its bigger elements and small ones in an array no matter whether it has odd or even amount of numbers and whether they are distinct. This rule is the key, and it is represented as the following code.

\n
SUM(CASE\n    WHEN Employee.Salary = alias.Salary THEN 1\n    ELSE 0\nEND) >= ABS(SUM(SIGN(Employee.Salary - alias.Salary)))\n
\n

Thus, this approach is as following in MySQL code.

\n

MySQL

\n
SELECT\n    Employee.Id, Employee.Company, Employee.Salary\nFROM\n    Employee,\n    Employee alias\nWHERE\n    Employee.Company = alias.Company\nGROUP BY Employee.Company , Employee.Salary\nHAVING SUM(CASE\n    WHEN Employee.Salary = alias.Salary THEN 1\n    ELSE 0\nEND) >= ABS(SUM(SIGN(Employee.Salary - alias.Salary)))\nORDER BY Employee.Id\n;\n
\n
\n

Note: In MySQL 5.6, this code runs fine, but if you are using MySQL 5.7+, please use ANY_VALUE(Employee.Id) instead of Employee.Id in the SELECT statement. Otherwise, you may encouter the following error message:\nError Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column \'Employee.Id\' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by.\nFor more details on how to user ANY_VALUE(arg), please refer to this link.

\n
\n
\n

Approach #2: Sort and then select the median [Accepted]

\n

Intuition

\n

In general, we can just pick the middle one(s) to get the median if the records are ranked by salary. But how can we get them sorted particularly MySQL does not have the build-in rank function, and these are several companies in this case.

\n

Algorithm

\n

By adding a virtual column to simulate the ranking, we can sort these records by salary and pick up the middle one(s). Here we need to use the session variable to achieve this goal.

\n

This approach is more efficient than the first one since it does not calculate all the salary one by one in the table.

\n
SELECT \n    Id, Company, Salary\nFROM\n    (SELECT \n        e.Id,\n            e.Salary,\n            e.Company,\n            IF(@prev = e.Company, @Rank:=@Rank + 1, @Rank:=1) AS rank,\n            @prev:=e.Company\n    FROM\n        Employee e, (SELECT @Rank:=0, @prev:=0) AS temp\n    ORDER BY e.Company , e.Salary , e.Id) Ranking\n        INNER JOIN\n    (SELECT \n        COUNT(*) AS totalcount, Company AS name\n    FROM\n        Employee e2\n    GROUP BY e2.Company) companycount ON companycount.name = Ranking.Company\nWHERE\n    Rank = FLOOR((totalcount + 1) / 2)\n        OR Rank = FLOOR((totalcount + 2) / 2)\n;\n
\n
'