Handling large data in MySQL
In a very large DB, very small details in indexing and querying make the difference between smooth sailing and catastrophe.
[This post was inspired by conversations I had with students in the workshop I’m attending on Mining Software Repositories. It has been updated a few times.]
First off, what is “large”? Industry bloggers have come up with the catchy 3 (or 4) V’s of big data. In my case, I was dealing with two very large tables: one with 1.4 billion rows and another with 500 million rows, plus some other smaller tables with a few hundreds of thousands of rows each. On the disk, it amounted to about half a terabyte. So, small-ish end of big data, really.
Second off, what is the problem? My MySQL server is running on a modern, very powerful 64-bit machine with 128G of RAM and a fast hard drive. I thought querying would be a breeze. Well, my first naive queries took hours to complete! I was in shock.
Before illustrating how MySQL can be bipolar, the first thing to mention is that you should edit the configuration of the MySQL server and up the size of every cache. The MySQL config vars are a maze, and the names aren’t always obvious. I ended up with something like this:
key_buffer_size = 1G sort_buffer_size = 16M tmp_table_size = 4G max_heap_table_size = 8G read_buffer_size = 512K read_rnd_buffer_size = 512K myisam_sort_buffer_size = 4G
This helps, but only so much… I’m going to illustrate the anatomy of a MySQL catastrophe.
MySQL does a reasonably good job at retrieving data from individual tables when the data is properly indexed. For example, this query is a breeze on my 1B-row table:
mysql> SELECT * FROM relations WHERE relation_type='INSIDE';
We have an index for that column. It takes a while to transfer the data, because it retrieves millions of records, but the actual search and retrieval is very fast; the data starts streaming immediately.
But those queries are boring. When exploring data, we often want complex queries that involve several tables at the same time, so here is one of those:
mysql> SELECT p.project_id, p.name, ifnull(count(distinct r.relation_id), 0)
FROM relations as r
RIGHT JOIN sourcerer.projects as p
ON r.project_id=p.project_id
WHERE p.source ='Apache'
GROUP BY p.project_id;
...data...
133 rows in set (1 min 33.94 sec)
The thing I did with this query was to join the relations table (the 1B+ row table) with the projects table (about 175,000 thousand rows), select only a subset of the projects (the Apache projects), and group the results by project id, so that I have a count of the number of relations per project on that particular collection. The joined fields are indexed; the source field is not indexed. As seen, it took 1 min and a half for the query to execute. OK, that would be bad for an online query, but not so bad for an offline one. Let’s move on to a query that is just slightly different:
mysql> SELECT p.project_id, p.name, ifnull(count(distinct r.relation_id), 0) FROM relations as r RIGHT JOIN sourcerer.projects as p ON r.project_id=p.project_id WHERE p.source ='Apache' AND r.relation_type='INSIDE' GROUP BY p.project_id; ... 132 rows in set (46 min 26.00 sec)
Whoa! I added one little constraint to the relations, selecting only a subset of them, and now it takes 46 minutes for this query to complete! WTF?! This is totally counter-intuitive. Adding a constraint means that fewer records would be looked at, which would mean faster processing. Right?
Let’s look at what ‘explain’ says. (btw, ‘explain’ is your friend when facing WTFs with MySQL). Here is the ‘explain’ for the first query, the one without the constraint on the relations table:
mysql> explain SELECT p.project_id, p.name, ifnull(count(distinct r.relation_id) , 0) FROM relations as r RIGHT JOIN sourcerer.projects as p ON r.project_id=p.project_id WHERE p.source ='Apache' GROUP BY p.project_id; +----+-------------+-------+------+---------------+------------+---------+------------------------+--------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+---------------+------------+---------+------------------------+--------+-----------------------------+ | 1 | SIMPLE | p | ALL | NULL | NULL | NULL | NULL | 173663 | Using where; Using filesort | | 1 | SIMPLE | r | ref | project_id | project_id | 8 | sourcerer.p.project_id | 8735 | | +----+-------------+-------+------+---------------+------------+---------+------------------------+--------+-----------------------------+
And here is the ‘explain’ for the second query, the one with the constraint on the relations table:
mysql> EXPLAIN SELECT p.project_id, p.name, ifnull(count(distinct r.relation_id) , 0) FROM relations as r RIGHT JOIN sourcerer.projects as p ON r.project_id=p.project_id WHERE relation_type='INSIDE' AND p.source ='Apache' GROUP BY p.project_id; +----+-------------+-------+--------+--------------------------+---------------+---------+------------------------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+--------+--------------------------+---------------+---------+------------------------+-----------+----------------------------------------------+ | 1 | SIMPLE | r | ref | relation_type,project_id | relation_type | 2 | const | 508126553 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | p | eq_ref | project_id | project_id | 8 | sourcerer.r.project_id | 1 | Using where | +----+-------------+-------+--------+--------------------------+---------------+---------+------------------------+-----------+----------------------------------------------+
So what’s happening?
According to the explanation, in the first query, the selection on the projects table is done first. Because it involves only a couple of hundred of thousands of rows, the resulting table can be kept in memory; the following join between the resulting table and the very large relations table on the indexed field is fast. We’re all good. However, in the second query, the explanation tells us that, first, a selection is done on the relations table (effectively, relations WHERE relation_type=’INSIDE’); the result of that selection is huge (millions of rows), so it doesn’t fit in memory, so MySQL uses a temporary table; the creation of that table takes a very long time… catastrophe!
Here you may ask: but why didn’t the query planner choose to do the select on the projects first, just like it did on the first query? Then it should join that with the large relations table, just like it did before, which would be fast, and then select the INSIDE relations and count and group stuff. I’m not sure why the planner made the decision it made. But let’s try telling it exactly what I just said:
mysql> SELECT p.project_id, p.name, IFNULL(count(distinct r.relation_id),0)
FROM (SELECT * FROM projects WHERE source='Apache') AS p
LEFT JOIN relations AS r on (p.project_id=r.project_id)
WHERE r.relation_type='INSIDE'
GROUP BY p.project_id
...
132 rows in set (57.26 sec)
OK, we’re back in business: 57 seconds!
As you can see, the line between smooth sailing and catastrophe is very thin, and particularly so with very large tables. Very small changes in the query can have gigantic effects in performance. You always need to understand what the query planner is planning to do.
Now, in this particular example, we could also have added an index in the source field of the projects table. Let me do that:
mysql> alter table projects add index source (source); Query OK, 173663 rows affected, 4 warnings (16.78 sec) Records: 173663 Duplicates: 0 Warnings: 0
Let’s go back to the slow query and see what the query planner wants to do now:
mysql> explain SELECT p.project_id, p.name, ifnull(count(distinct r.relation_id), 0) FROM relations as r RIGHT JOIN sourcerer.projects as p ON r.project_id=p.project_id WHERE p.source ='Apache' AND r.relation_type='INSIDE' GROUP BY p.project_id; +----+-------------+-------+------+--------------------------+------------+---------+------------------------+------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+--------------------------+------------+---------+------------------------+------+-----------------------------+ | 1 | SIMPLE | p | ref | project_id,source | source | 1003 | const | 227 | Using where; Using filesort | | 1 | SIMPLE | r | ref | relation_type,project_id | project_id | 8 | sourcerer.p.project_id | 8735 | Using where | +----+-------------+-------+------+--------------------------+------------+---------+------------------------+------+-----------------------------+
Ah-ha! Now it changed its mind about which table to process first: it wants to process projects first. That looks good, no temporary tables anywhere, so let’s try it:
mysql> SELECT p.project_id, p.name, ifnull(count(distinct r.relation_id), 0)
FROM relations as r
RIGHT JOIN sourcerer.projects as p
ON r.project_id=p.project_id
WHERE p.source ='Apache' AND r.relation_type='INSIDE'
GROUP BY p.project_id;
...
132 rows in set (1 min 15.71 sec)
OK, good. The index on the source field doesn’t necessarily make a huge performance improvement on the lookup of the projects (after all, they seem to fit in memory), but the dominant factor here is that, because of that index, the planner decided to process the projects table first. In this case, that makes the difference between smooth sailing and catastrophe.
Some of my students have been following a different approach. Rather than relying on the MySQL query processor for joining and constraining the data, they retrieve the records in bulk and then do the filtering/processing themselves in Java or Python programs. If you’re not willing to dive into the subtle details of MySQL query processing, that is an alternative too.
Comments
This reads like a limitation on MySQL in particular, but isn’t this a problem with large relational databases in general?
Yes, I would think the other relational DBs would suffer from the same problem, but I haven ‘t used them nearly as much as I’ve used MySQL. It may be that commercial DB engines do something better.
You seem to have missed the important variables for your workload. These variables depend on the storage engine. Hopefully you’re using innodb. If you are then increase innodb_buffer_pool_size to as large as you can without the machine swapping. This should make queries after the first one significantly faster. If you aren’t using the innodb storage engine then you should be.
With no prior training, if you were to sit down at the controls of a commercial airplane and try to fly it, you will probably run into a lot of problems. You might conclude that airplanes are an unsafe way to move people around. But if you look around you’ll see that lots of people are using them successfully.
It’s the same for MySQL and RDBMSes: if you look around you’ll see lots of people are using them for big data. 500GB doesn’t even really count as big data these days.
So, it’s true that the MySQL optimizer isn’t perfect, but you missed a pretty big change that you made, and the explain plan told you. When you added r.relation_type=’INSIDE’ to the query, you turned your explicit outer join to an implicit inner join. That is to say even though you wrote RIGHT JOIN, your second query no longer was one. MySQL happily tried to use the index you had, which resulted in changing the table order, which meant you couldn’t use an index to cover the GROUP BY clause (which is important in your case!)
A trivial way to return your query to the previous execution time would be to add SELECT STRAIGHT_JOIN … to the query which forces the table order.
A more complex solution lies in analyzing your data and figuring out the best way to index it. This has always been true of any relational database at any size.
What performance numbers do you get with other databases, such as PostgreSQL?
Comments are closed.