Mysql selecting top N per group
It’s been a while since my last post, I’ve been busy lately. You may have noticed that I made a nice WEB2.0 game “How fast can you type the alphabet”. The game is simply about typing a-z as fast as you can and it keeps track of your best score. Currently three alphabets are supported: English, Russian, Hebrew. To make it more competitive, I made “all times best 5″ scoreboard.
For the scoreboard I needed to query the database (Mysql) for the top 5 scores, per language. I didn’t think it’s gonna be an issue, definitely not big enough to dedicate a post for it but as I soon found out, I was very wrong. Here is my scores table structure:
+--------+---------------------+------+-----+-------------------+-------+ | Field | Type | Null | Key | Default | Extra | +--------+---------------------+------+-----+-------------------+-------+ | id | bigint(20) unsigned | NO | PRI | NULL | | | input | tinyint(3) unsigned | NO | PRI | NULL | | | score | float unsigned | NO | | NULL | | | date | timestamp | NO | | CURRENT_TIMESTAMP | | | locale | tinyint(3) unsigned | NO | PRI | NULL | | +--------+---------------------+------+-----+-------------------+-------+
All fields are pretty self-descriptive except “input” maybe. This field represents the input device that was used to type the alphabet for example typing using computer keyboard is probably much faster than typing on your iphone or android. iPad is probably somewhere in between. This feature is for future use so for now I only collect the data but I don’t show it. The triplet <id, input, locale> is used as primary key.
I’m no DBA, maybe I should have used id as primary and “unique” constraint for the triplet but the idea is a player can have only one score per language (locale) and input device. Back to my scoreboard query- I need to select top 5 players per language. After playing a while with basic queries I figured out it can’t be done with simple “group by locale” query as I can’t limit number of results per group nor explicitly tell it to take top X scores. If I needed only best score per language I could’ve simply used:
select id, min(score), locale from scores group by locale
Unfortunately there is no similar function to get N minimal scores. At this point I had two choices: either googling for solution (having one query getting everything done) or using three queries – one per language and programmatically combine the results. Maybe I should have chose the latter. I admit, I didn’t compare overall performance. However I didn’t like the idea of my DB being queried three times each time the scoreboard is refreshed (which happens quite often). Another possible solution is to show only current language scoreboard (therefor having only one query for the current language). This solution was unacceptable because it means new AJAX request each time a player switches between languages. Since most expected players are multilingual, that’s more requests to process by the web server.
Once again, I did not compare overall performance using statistical usage information so it is possible that I was wrong regarding which tradeoff is preferable. Anyhow my decision was to use one AJAX request to get the whole scoreboard and that’s what this post is really about. So I googled for solutions and I found a couple. Some use user-defined variables for the inner counting, some use relatively unreadable (at least for me) queries with inner/outer/left joins. The most readable query I found was:
select s1.id, score, locale from scores s1 where 5 > (select count(*) from scores s2 where s1.locale = s2.locale and s1.score > s2.score) group by locale, s1.id order by locale, score
Which can be interpreted as: for each score count how many scores are lesser (in the same language); then select it only if amount of lesser scores is less than 5 (meaning the score is in top 5); finally group results by language and then by score. We must group by score because otherwise we’ll get only one result per language. Pretty readable in my opinion. As for performance it does make new sub-query for each row so maybe it takes a little more cpu or time but there is no huge Cartesian multiplication.
But wait. Remember the “input” field? it is possible that same player has more than one score per language and if both of his scores are in top 5 I don’t want to show them both. For current version of the game I want to show only best of his scores and count it as one result for the scoreboard. On first thought replacing “score” with “min(score)” should do the job but it doesn’t because it will only show best of his scores but count both scores when counting amount of lesser scores. So my next move, I figured, is replacing “scores” with sub-query that returns filtered scores table – only the best score per player and language. The sub-query would look like:
select id, min(score) as score, input, locale from scores group by locale,id
Then of course I need to exclude blacklisted users who tried to tamper with the system so:
select id, min(score) as score, input, locale from scores where id not in (select id from blacklist) group by locale,id
And since I’m on shared hosting and my hosting provider doesn’t allow to create or use views, I must literally take this query and replace every instance of “scores” on the first query with it. The result is so ugly I won’t even write it. Then I asked for advice from a DBA friend of mine. She told me to try using analytic functions and gave me some examples for Oracle DB. When I looked for analytic functions on Mysql I found this great article “Emulating Analytic (AKA Ranking) Functions with MySQL” from O’REILLY. In fact they have such a good explanation of analytic functions I won’t even repeat it (if you’re unfamiliar with the subject you can read there).
Following this article, I came up with this query:
select * from (select e.locale, e.id, e.score, find_in_set(e.score, x.scoreslist) as rank from (select locale, group_concat(score order by score) as scoreslist from (select locale, min(score) as score from scores where id not in (select id from blacklist) group by locale, id) as k group by locale) as x, scores as e where e.locale = x.locale) as z where rank <= 5 and rank > 0 order by locale, rank
At first glance it might look ugly but let me assure you it’s far more readable and apparently more efficient than the first alternative. The innermost sub-query (k) would return the filtered scores table I discussed before – best score per player and language. Then we use group_concat to form table (x) that looks like (e.g.):
+--------+---------------------------------+ | locale | scoreslist | +--------+---------------------------------+ | 1 | 1.75,2.129,2.85,6.34,9,10,11,12 | | 2 | 2.185,4.12,8.32 | | 3 | 2.4 | +--------+---------------------------------+
Now we have scores list per language. We make Cartesian multiplicity with scores table (e), joining them by language. Multiplicity product isn’t huge as number of languages are fixed and small, so we’re still in the same order of magnitude. The result is original list of scores but now each row additionally has scores list for it’s language. Next, we give the score a rank according it’s location on the list and filter so only ranks in range 1-5 will return. How simple is that?
Problem solved. Note: group_concat has maximal result length (by default 1024b). Although it can be increased the method presented won’t work if you need all your records sorted and grouped but for the purpose of top N assuming N is relatively small it works very well.
All in all, for my “a-z” game it probably doesn’t matter which of the methods I’d have used but it was quite educative and it’s good knowledge for real projects to come. As said before, DB is not my field of expertise so I’d appreciate if someone experienced would comment on things I wrote or shed some light on the things I didn’t.