Statistics on getting a random row from a table

2010-01-24 15:14:31

So last night I found this interesting post talking about the age-old alternative for getting a random row from a table: using two queries, the first to learn the number of rows, and the second to fetch just one row with the reserved SQL word OFFSET, passing the row count as the upper bound for a random function. In pseudo-SQL, something like this...

SELECT COUNT(1) AS total FROM bench_myisam;
$rand = int(rand($total));
SELECT * FROM bench_myisam LIMIT 1 OFFSET $rand;

... instead of our much hated but extremely common ORDER BY RAND():

SELECT * FROM bench_myisam ORDER BY RAND() LIMIT 1;

I've always thought the two-queries alternative had to be slower, even if only for the overhead of the two network round trips, but this guy proved me wrong by going to all the hassle of testing it for real, and the two-queries approach turned out to be an order of magnitude faster in his tests than our "beloved" random ordering.

It was precisely these tests that sparked my curiosity, because they weren't quite to my taste. First, he uses MySQL and SQLite, what a strange combination. SQLite is like the MS-DOS of databases; if embedding was the reason, there are plenty of better embedded engines. Second, what's the point of using MySQL without telling us which storage engine the tests run on? (Although he hints in the comments that it could be MyISAM, and asserts correctly that the row count is stored in the table's metadata.) And finally, the benchmark consists of just one query. Whoa, what a statistic.

So I did my own tests with my server, a MySQL 5.1.37-1ubuntu5 running on Ubuntu 9.10, and they really show he's absolutely right:

In a table with a million rows, the two-reads approach to getting a random row is an order of magnitude faster than the naive random-order way! MyISAM performing better than InnoDB was expected, but, honestly, I believed the difference would be greater.

The data in the table is measured in milliseconds per query (where do I apply to get that unit named after me?), so obviously the lower the better: it's the average time a query took to complete, calculated over the thousands of requests the Perl-based clients made.
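For reference, the per-query average can be computed along these lines; this is a minimal sketch assuming the Time::HiRes module, with an illustrative query count and an empty loop body standing in for the real work:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $queries = 1000;                    # illustrative request count
my $t0      = [gettimeofday];
for (1 .. $queries) {
    # issue one query here (see the client sketch at the end of the post)
}
my $elapsed = tv_interval($t0);        # elapsed seconds, microsecond resolution
printf "%.3f ms/query\n", 1000 * $elapsed / $queries;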

UPDATE: The queries, explained:

The naive approach, ordering by rand() and then getting the first row:

EXPLAIN SELECT * FROM bench_myisam ORDER BY RAND() LIMIT 1;

id | select_type | table        | type  | possible_keys | key     | key_len | ref    | rows    | Extra
1  | SIMPLE      | bench_myisam | index | (NULL)        | PRIMARY | 8       | (NULL) | 1000000 | Using index; Using temporary; Using filesort

The "Using temporary; Using filesort" part is the killer: MySQL has to generate a random value for every one of the million rows and sort them all in a temporary table just to keep a single one.

And the output for the two-queries approach:

EXPLAIN SELECT COUNT(1) AS total FROM bench_myisam;

id | select_type | table  | type   | possible_keys | key    | key_len | ref    | rows   | Extra
1  | SIMPLE      | (NULL) | (NULL) | (NULL)        | (NULL) | (NULL)  | (NULL) | (NULL) | Select tables optimized away

"Select tables optimized away" means MySQL never touches the rows at all: as mentioned above, MyISAM keeps the exact row count in the table's metadata.

EXPLAIN SELECT * FROM bench_myisam LIMIT 1 OFFSET $rand;

id | select_type | table        | type  | possible_keys | key     | key_len | ref    | rows    | Extra
1  | SIMPLE      | bench_myisam | index | (NULL)        | PRIMARY | 8       | (NULL) | 1000000 | Using index

No temporary table and no sort this time; MySQL simply walks the index until it reaches the requested offset.

The tables are incredibly boring, just a million rows of an integer primary key and a varchar(200) of sample text; a sketch of the schema follows below. The Perl client I wrote to test the two-queries approach is sketched after it; the other client is almost identical, but with just one query.
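Given that description, the schema could look something like this; the column names are guesses, and the BIGINT key matches the key_len of 8 in the EXPLAIN output above:

CREATE TABLE bench_myisam (
    id  BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    txt VARCHAR(200) NOT NULL
) ENGINE=MyISAM;

-- The InnoDB twin used for the comparison (hypothetical name):
CREATE TABLE bench_innodb LIKE bench_myisam;
ALTER TABLE bench_innodb ENGINE=InnoDB;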
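And a minimal sketch of the two-queries client, assuming the DBI module with the mysql driver; the database name, host and credentials are placeholders:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details: adjust database, host and credentials.
my $dbh = DBI->connect('DBI:mysql:database=bench;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# Thousands of requests, as in the benchmark; the exact count is illustrative.
for (1 .. 1000) {
    # First query: how many rows does the table have?
    my ($total) = $dbh->selectrow_array(
        'SELECT COUNT(1) AS total FROM bench_myisam');

    # Pick a random zero-based offset below the row count.
    my $rand = int(rand($total));

    # Second query: fetch exactly one row at that offset.
    # $rand is a trusted integer, so interpolating it is safe here.
    my $row = $dbh->selectrow_arrayref(
        "SELECT * FROM bench_myisam LIMIT 1 OFFSET $rand");
}

$dbh->disconnect;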
