Take Out The Papers And The Trash

  

How to clean and clear large tables in MySQL

From time to time our data team gets requests for advice on how to perform cleanup operations against large database tables. Typically, these originate in a ticket requesting information about how the disk is being used or why a specific table is performing poorly.

Somewhat less often, we are asked us to explain why a cleanup attempt has failed or why it has caused downtime for an application. Managing these types of operations with minimal or no downtime can be a challenge given the way a database like MySQL performs these tasks.

The most common form of table cleanup operation we are asked about is for the sessions table. Even though these are not really recommended practice(https://guides.rubyonrails.org/action_controller_overview.html#session) they are still quite common to see. Depending on your application workload and use cases these tables can grow very quickly in size; often including older records that are never going to be used again. Even though the data payload in each row is relatively small, it's not uncommon to find sessions tables that are 10 or 20 GB in size—often larger than the rest of the database combined.

The "Standard" Solution

While there is no automatic session cleanup built into a rails app it happens to be really easy to write a rake task that handles this cleanup for you. In fact, it's so easy that it's often overlooked until the table has reached the size where it is difficult to manage.

A typical solution for clearing sessions might look like this:

task :clear_expired_sessions => :environment do
sql = 'DELETE FROM sessions WHERE updated_at < DATE_SUB(NOW(), INTERVAL 1 WEEK);
'ActiveRecord::Base.connection.execute(
sql)
end

source

This seems straightforward enough. Schedule this through cron and go on with creating the rest of your app. This will delete everything that hasn't been updated in the last week from your sessions table. Simple, right?

Not quite; the problem is, if your table has far more old records than it has new, it's going to perform a full table scan for this delete operation; the full scan is going to result in a write lock that prevents other changes to the table for the duration of the delete. If your table is quite large this operation could be measured in hours or even days.

The "Toss Everything" Fix

The quickest and simplest resolution to this issue is to clear the sessions table completely. This may mean that some customers have to start new sessions, but that is often acceptable for a web application. With this approach we replace the DELETE command with an operation called TRUNCATE.

Internally a TRUNCATE operation is just a DROP and CREATE table as a single statement so the table empties in the time it takes to delete a file of that size from disk. For MySQL, a TRUNCATE operation is not transaction controlled, so the best practice is to run this operation during a planned maintenance outage. The recommended steps for this are:

  • Apply a maintenance page (ey web disable)
  • For MySQL run the command: `TRUNCATE TABLE sessions; OPTIMIZE TABLE sessions;` *Note* the optimize is truly only required on MySQL 5.0 versions but doesn't hurt to use it universally.
  • Take down your maintenance page (ey web enable)

On a 20 to 30 GB table plan on a 10 minute maintenance window, the actual time will usually be well under 5 minutes.

The "Keep Some" Fix

In some cases maintaining the most recent session data is important to the user experience. With a bit more effort and a slightly longer maintenance window this can be accomplished. To make this possible we create a new sessions table with the same structure in it and copy the rows we want to keep from the original table. Once the copy is complete, we start a maintenance window, copy any additional records that have recently changed, and then rename the tables; the new takes the place of the old.

Note: The updated_at column is just one example of a column that can be used for identifying the rows to be kept. It is important that whatever column is used for this cleanup has been indexed beforehand.

  • Create a new table with identical structure to the existing table: `CREATE TABLE sessions_new LIKE sessions;`
  • Copy the most recently updated data to the new table: `INSERT INTO sessions_new SELECT * FROM sessions WHERE updated_at > DATE_SUB(now(), interval 7 day);`
  • Determine the most recent updated_at value in the new table and store it in a database variable: `SELECT MAX(updated_at) INTO @max_date FROM sessions_new;`
  • Apply a maintenance page (ey web disable)
  • Update the records in your new table with any data that has changed: `REPLACE INTO sessions_new SELECT * FROM sessions WHERE updated_at >= @max_date;`
  • Swap the tables out: `RENAME TABLE sessions TO sessions_old, sessions_new TO sessions;`
  • Take down your maintenance page (ey web enable)
  • Archive or drop the old table as appropriate to your needs.

The running time of this version will vary depending on how much data changes between your initial insert into the new table and the start of your maintenance window. The best practice here would be to walk through the procedure in Staging with recent Production data to get an estimated run time. A rename table operation is almost instantaneous so the majority of the time associated with this approach will be the time from the INSERT...SELECT and REPLACE...INTO statements. Once this maintenance operation is complete you can remove the sessions_old table at your discretion.

Wrapping Up

Now that "you've taken out the papers and the trash" and the sessions table is back to a reasonable size be sure to rewind to the top of this article and implement a sessions clearing task that fits the needs of your application so "you can get your spending cash".

Free Ebook: PaaS Is Dead

Platform as a Service (PaaS) is experiencing a digital transformation, and despite what some may argue, it’s far from dead. Learn why PaaS continues to prove it has a promising future for DevOps.

PaaS Is Dead

Related posts

PaaS Providers Must Evolve or Die

May 23, 2018

Technology changes almost as frequently as the latest fashion trend. What’s in vogue today will

Read More

Ruby on Rails vs PHP

May 16, 2018

There’s more than one way to build a web application. No matter what type of application you are

Read More

What to Look for When Considering Application Hosting

May 8, 2018

Cloud computing has made hosting business-critical applications easier and less expensive.

Read More

Tyler Poland

Comments

Subscribe Here!