
现在有以下场景:
列出每个人最高的分数(用一条SQL来实现)。其表结构及部分模拟数据如下:
insert into
tb
values
('1', 'Tom', 80, '2017-04-12'),
('2', 'Li', 69, '2017-04-12'),
('3', 'Queue', 99, '2017-04-12'),
('4', 'Tom', 78, '2017-04-20'),
('5', 'Li', 60, '2017-04-20'),
('6', 'Queue', 91, '2017-04-20'),
('7', 'Tom', 70, '2017-05-01'),
('8', 'Li', 73, '2017-05-01'),
('9', 'Queue', 88, '2017-05-01');
| Id | Name | Score | Time |
| 1 | Tom | 80 | 2017-04-12 |
| 2 | Li | 69 | 2017-04-12 |
| 3 | Queue | 99 | 2017-04-12 |
| 4 | Tom | 78 | 2017-04-20 |
| 5 | Li | 60 | 2017-04-20 |
| 6 | Queue | 91 | 2017-04-20 |
| 7 | Tom | 70 | 2017-05-01 |
| 8 | Li | 73 | 2017-05-01 |
| 9 | Queue | 88 | 2017-05-01 |
其正确结果应该是:
| Id | Name | Score | Time |
| 8 | Li | 73 | 2017-05-01 |
| 3 | Queue | 80 | 2017-04-12 |
| 1 | Tom | 99 | 2017-04-12 |
我们脑中首先想到的是先按照Name值进行分组(group by),然后根据每组的分数进行降序排序(order by),取出每组第一个就OK了(翻译成SQL:select * from tb group by Name order by Score)。
上面的思路看似没毛病,但是!
Group by 虽然会根据条件对数据进行分组,每组可能有一条或者多条数据,但是group by 只会返回一个结果。所以一般group by 都是配合聚合函(max、sum、count等等)数一起用的,返回一个该组的汇总记录。如果是以下语句:
select * from tb group by Name;
数据库则会按照默认排序,返回每组的第一条数据,即以下三条:
| Id | Name | Score | Time |
| 2 | Li | 69 | 2017-04-12 |
| 1 | Tom | 80 | 2017-04-12 |
| 3 | Queue | 99 | 2017-04-12 |
那以下语句呢:
select *, max(Score) from tb group by Name;
结果会是:
| Id | Name | Score | Time | max(Score) |
| 2 | Li | 69 | 2017-04-12 | 73 |
| 3 | Queue | 99 | 2017-04-12 | 99 |
| 1 | Tom | 80 | 2017-04-12 | 80 |
为什么是这样的结果呢?因为select * 还是会根据默认排序输出,而max(Score)是取了每组最大的Score值。对于这种Sql一定要注意使用。
现在我们再来看之前的所想的语句:group by 会在order by 之前先执行,group by 会返回每组的第一条数据,而order by 会在group by 返回的结果集上进行排序,显然并没什么卵用。
那我们为什么会这么去想呢?其实这就是我们误以为order by 会在group by 分组后但还未返回的数据集上进行的排序操作,然后返回每组的第一条数据。
理想总是丰满的,现实总是残酷的。。。。
那我们该如何实现这个需求的?
这就是本文的标题:组内排序。
我可以想到的SQL大概有以下几个,结果集的顺序可能会不一样(这里默认一个人只有一个最高分,只取一个人的一个最高分。如果最高分有多个或者其他需求则根据下面的基本SQL进行改写):
1. select * from ( select * from tb order by Score desc) as tmp group by tmp.Name; 2. select l.* from tb l inner join ( select max(Score) Score, Name from tb group by Name) r on l.Score = r.Score and l.Name = r.Name; 3. select l.* from tb l,( select Name, max(Score) Score from tb group by Name) r where l.Name = r.Name and l.Score = r.Score; 4. select * from tb as tmp group by Name, Score having Score = ( select max(Score) from tb where Name = tmp.Name); 5. select * from tb outTb where Score = ( select max(Score) from tb where Name = outTb.Name); 6. select * from tb outTb where exists ( select count(*) from tb where Name = outTb.Name and Score > outTb.Score having count(*) < 1); 7. select * from tb outTb where 1 > ( select count(*) from tb where Name = outTb.Name and Score > outTb.Score); 8. select * from tb outTb where not exists ( select 1 from tb where Name = outTb.Name and Score > outTb.Score); 9. ( select * from tb where Name = 'Tom' order by Score desc limit 1) union all ( select * from tb where Name = 'Queue' order by Score desc limit 1) union all ( select * from tb where Name = 'Li' order by Score desc limit 1)
下面将对上面的SQL进行详解:
第一条SQL: select * from ( select * from tb order by Score desc) as tmp group by tmp.Name;
可能是我们想到的最直观的SQL,但是这个SQL是建立在group by以后返回的是每个分组的第一条数据的基础上的。可是,结果真的是这样的吗?
Percona 5.5,Percona 5.1,MySQL 5.6关闭sql_mode= ONLY_FULL_GROUP_BY,MySQL5.1等版本下,返回值确实是我们想要的值:按照order by Score desc的顺序,相同Name返回Score值最大的数据;
但是在MySQL5.7,关闭sql_mode= ONLY_FULL_GROUP_BY和mariadb 10.*版本中,相同的Name值,返回则是取了最早写入的数据行,忽略了order by Score desc,按照数据的逻辑存储顺序来返回。其实这是的SQL就等价于select * from tb group by Name;
通过Mysql5.7的官方文档中可以找到答案:
在from 后的子查询中的order by会被忽略
group by cloumn返回的行是无序的
所以第一条SQL请慎重选择。
如果需求是列出每个人分数最高的两条记录呢?
其实以上的789这三条SQL改变以下参数就可以实现:
select * from tb outTb where exists ( select count(*) from tb where Name = outTb.Name and Score > outTb.Score having count(*) < 2); select * from tb outTb where 2 > ( select count(*) from tb where Name = outTb.Name and Score > outTb.Score); ( select * from tb where Name = 'Tom' order by Score desc limit 2) union all ( select * from tb where Name = 'Queue' order by Score desc limit 2) union all ( select * from tb where Name = 'Li' order by Score desc limit 2)