Carefully converting your MySQL database to utf8

Converting all the data in your database can be a nail-biting experience. As you can see from the code below we are doing our best to be super careful. We convert each table separately and before each conversion we store the table’s column types and an MD5 hash of every row in the table (we were lucky enough to not have enormous tables). After converting the table we check that no column types or rows were changed. It goes without saying that we do a trial run on a database dump first.

require 'set'
require 'digest/md5'

CHARACTER_SET = 'utf8'
COLLATION = 'utf8_unicode_ci'

class ConvertAllTablesToUtf8 < ActiveRecord::Migration
  def up
    ActiveRecord::Base.connection.tables.each do |table|
      ActiveRecord::Base.transaction do
        ActiveRecord::Base.connection.execute("LOCK TABLES #{table} WRITE")
          say "starting work on table: #{table}"

          model = table.classify.constantize
          say "associated model: #{model}"

          say 'storing column types information before converting table to unicode'
          column_types_before = model.columns_hash.each_with_object({}) do |(column_name, column_info), column_types_before|
            column_types_before[column_name] = [column_info.sql_type, column_info.type]
          end

          say 'storing set of table data hashes before converting table to unicode'
          table_data_before = Set.new
          model.find_each do |datum|
            table_data_before << Digest::MD5.hexdigest(datum.inspect)
          end

          say 'converting table to unicode'
          execute("ALTER TABLE #{table} CONVERT TO CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}")
          execute("ALTER TABLE #{table} DEFAULT CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}")

          say 'getting column types information after conversion to unicode'
          column_types_after = model.columns_hash.each_with_object({}) do |(column_name, column_info), column_types_after|
            column_types_after[column_name] = [column_info.sql_type, column_info.type]
          end

          say 'getting set of table data hashes after conversion to unicode'
          table_data_after = Set.new
          model.find_each do |datum|
            table_data_after << Digest::MD5.hexdigest(datum.inspect)
          end

          say "checking that column types haven't changed"
          if column_types_before != column_types_after
            raise "Column types of the #{table} table have changed"
          end

          say "checking that data hasn't changed"
          if table_data_before != table_data_after
            raise "Data in the #{table} table has changed"
          end
        ActiveRecord::Base.connection.execute('UNLOCK TABLES')
      end
    end

    execute("ALTER DATABASE #{ActiveRecord::Base.connection.current_database} DEFAULT CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}")
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end

Note how we lock each table before converting it. If we didn’t lock it then new data could be written to the table while we are busy storing MD5 hashes of the rows in preparation for the actual conversion. This, in turn, would cause our migration to complain that new data was present after the conversion had taken place.

We also wrap each table conversion inside a transaction. I’ve talked before about how converting a table will cause an implicit commit, meaning that a rollback won’t undo any of the changes made by the conversion. So why have a transaction here then? Imagine that an exception were to be raised during our migration. In that case we want to ensure our table lock gets dropped as soon as possible. The transaction guarantees this behavior.

Also, if we weren’t so paranoid about checking the before and after data as part of our migration, we could simplify this code quite a bit.

CHARACTER_SET = 'utf8'
COLLATION = 'utf8_unicode_ci'

class ConvertAllTablesToUtf8 < ActiveRecord::Migration
  def up
    ActiveRecord::Base.connection.tables.each do |table|
      say 'converting table to unicode'
      execute("ALTER TABLE #{table} CONVERT TO CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}")
      execute("ALTER TABLE #{table} DEFAULT CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}")
    end

    execute("ALTER DATABASE #{ActiveRecord::Base.connection.current_database} DEFAULT CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}")
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end

Notice how we can drop the lock as the ALTER TABLE command will prevent all writes to the table while simultaneously allowing all reads.

In most cases, ALTER TABLE makes a temporary copy of the original table. MySQL waits for other operations that are modifying the table, then proceeds. It incorporates the alteration into the copy, deletes the original table, and renames the new one. While ALTER TABLE is executing, the original table is readable by other sessions. Updates and writes to the table that begin after the ALTER TABLE operation begins are stalled until the new table is ready, then are automatically redirected to the new table without any failed updates. The temporary copy of the original table is created in the database directory of the new table. This can differ from the database directory of the original table for ALTER TABLE operations that rename the table to a different database.

Furthermore, since we now no longer have a lock on our table we can also drop the transaction. This gives us the much-simplified code shown above.