Tech Corner

changing DB encoding feature image

Changing DB Encoding to Support Emojis in a Production Rails Environment

Greg Wahl By Greg Wahl • March 23, 2020

The Problem

If you’ve created a Rails application that uses MySQL 5.6 or earlier, you may have run into this problem:

Way back when you first created your application’s initial codebase, you ran rails new my_application -d mysql or some similar command to get a default Rails project using MySQL. Then you created a new MySQL 5.6 database in Amazon RDS or some similar hosting services to hold your production data, set your DATABASE_URL environment variable and/or config/database.yml file to reference it, and released your application into the wild. Everything was running smoothly until one day a user went to save a form in your application and their API request failed 😮!

Checking your Rails error logs revealed an error like this:

"\xF0" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

After Googling the error, you found that the request failed because the user tried to enter a character outside of the standard UTF-8 character set, like an emoji (😀) or an accented character (é).

To understand why that error occurred, let’s take a closer look at the config/database.yml that was generated when you created the rails application:

  default: &default
  adapter: mysql2
  encoding: utf8
  pool: 5
  username: root
  password:
  socket: /tmp/mysql.sock

The culprit behind the error you found is that encoding: utf8 setting. It turns out that by default, Rails versions < 6.0.0 configure MySQL databases to use the UTF-8 character set, and by extension default to the UTF-8 collation as well (“character sets” are referred to as “encoding” in Rails configuration). Characters in UTF-8 are encoded using just 3 bytes, instead of the 4 bytes needed to support those emojis and other extended characters your users tried to enter. You can verify the character set and collation your database is using with this code snippet:

ActiveRecord::Base.connection.exec_query(
  'SELECT table_name, column_name, character_set_name, collation_name, column_type'\
  ' FROM information_schema.`COLUMNS`'\
  " WHERE table_schema = \"#{ActiveRecord::Base.connection.current_database}\""
).to_a

Which might give you output like the following:

[
  {"table_name"=>"users", "column_name"=>"name", "character_set_name"=>"utf8", "collation_name"=>"utf8_general_ci", "column_type"=>"varchar(255)"},
  {"table_name"=>"users", "column_name"=>"email", "character_set_name"=>"utf8", "collation_name"=>"utf8_general_ci", "column_type"=>"varchar(255)"},
  ...
]

NOTE: For further information on “character set” and “collation”, see here.

 
The Solution

Thankfully, MySQL 5.5.3 and later supports a utf8mb4 character set that uses 4 byte unicode encoding, thus supporting all of those extended characters your users want to use!

However, to get your Rails application using the utf8mb4 character set and collation instead of utf8, there are a couple changes you’ll need to make and some considerations to keep in mind.

 
Update Rails Database Configuration

First, you’ll want to update your config/database.yml (or your DATABASE_URL environment variable if you don’t source control config/database.yml) with the new character set and collation. This won’t change your existing database schema, but will ensure that future schema changes will default to the preferred character set and collation. (NOTE: there are several different types of collation available for the utf8mb4. (See the MySQL supported character set documentation and this helpful blog post for information on available character sets to choose the one that’s best for your use case.)

encoding: utf8mb4
collation: utf8mb4_unicode_ci
 
Change Your Existing Database

Next, you’ll need to execute some SQL statements to change the character set and collation on the existing database tables and columns. This can be accomplished via a rake task to set the character set and collation on the database itself, as well as on each individual table and each VARCHAR column that has the incorrect character set and collation.

Here’s a walkthrough of a Ruby method that could accomplish this:

First, let’s define some constants, and fix the overall database character set and collation.

CHARACTER_SET = 'utf8mb4'.freeze
COLLATION = 'utf8mb4_unicode_ci'.freeze

current_database = ActiveRecord::Base.connection.current_database
# Fix database
ActiveRecord::Base.connection.execute "ALTER DATABASE `#{current_database}` CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION};"

Next, let’s capture some checksums of each table.

# Calculate checksums on the tables so we can determine if encoding change caused problems.
table_checksums = {}
tables_to_fix = ActiveRecord::Base.connection.select_rows(
  "SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = '#{current_database}'
    AND table_collation != '#{COLLATION}';"
)
tables_to_fix.each do |t|
  checksum = ActiveRecord::Base.connection.select_rows("CHECKSUM TABLE `#{current_database}`.`#{t[0]}`").first
  table_checksums[t] = checksum[1]
end

This will be a safety measure against data loss. By taking a checksum of each of our tables before and after performing the change, we can compare them and ensure none of our data changed.

Now to fix the individual columns and tables.

MAX_VARCHAR_LENGTH = 191
# Fix columns
columns_to_fix = ActiveRecord::Base.connection.select_rows(
  "SELECT table_name, column_name, column_type, is_nullable, character_set_name, collation_name, column_default, data_type, character_maximum_length
   FROM INFORMATION_SCHEMA.COLUMNS
   WHERE table_schema = '#{current_database}'
   AND collation_name is not null
   AND (collation_name != '#{COLLATION}' OR character_set_name != '#{CHARACTER_SET}');"
)
columns_to_fix.each do |tc|
  column_type = tc[2]
  if tc[7] == 'varchar' && tc[8] > MAX_VARCHAR_LENGTH
    column_type = "varchar(#{MAX_VARCHAR_LENGTH})"
  end
  puts "fixing #{tc[0]}.#{tc[1]} was #{tc[4]}/#{tc[5]} changing to #{CHARACTER_SET}/#{COLLATION} #{column_type}"
  ActiveRecord::Base.connection.execute(
    "ALTER TABLE `#{current_database}`.`#{tc[0]}`
       modify `#{tc[1]}` #{column_type}
       CHARACTER SET #{CHARACTER_SET}
       COLLATE #{COLLATION}
       #{tc[3] == 'YES' ? 'NULL' : 'NOT NULL'}
       #{tc[6] ? "DEFAULT '#{tc[6]}'" : ''};"
  )
end

# Fix tables
tables_to_fix.each do |t|
  puts "setting default collation #{t[0]}"
  ActiveRecord::Base.connection.execute "ALTER TABLE `#{current_database}`.`#{t[0]}` CONVERT TO CHARACTER SET #{CHARACTER_SET} COLLATE #{COLLATION}"
end

(Note that MAX_VARCHAR_LENGTH = 191 bit, which will be explained in the next section below)

Finally, let’s compare the checksums to see if any data has changed.

# Verify data has not changed
tables_to_fix.each do |t|
  new_checksum = ActiveRecord::Base.connection.select_rows("CHECKSUM TABLE `#{current_database}`.`#{t[0]}`").first
  puts "verifying checksum for table #{t}. was: #{table_checksums[t]} is #{new_checksum[1]}"
end

You can view the full code for this task at this Gist link.

 
Working Around The 767 Byte Index Issue

In the task above, why did we change all our varchar columns to be a maximum of 191 characters? That’s because there is an important limitation of utf8mb4 to keep in mind. In MySQL’s InnoDB storage engine, by default all index key lengths are limited to 767 bytes. Since with utfm8mb4 characters take up 4 bytes instead of 3, that means you can only fit a maximum of 191 characters in an indexed column (4 bytes * 191 = 764 bytes), instead of the 255 characters you could fit previously (3 bytes * 255 = 765 bytes). This means if you have any indexed VARCHAR columns that are greater than 191 characters in length, they will need to be truncated to 191 characters.

There is a way you can get around this limitation by changing your InnoDB file format to Barracuda (the default for MySQL 5.7) instead of Antelope (the default for MySQL 5.6), and using the COMPRESSED or DYNAMIC row format (See this reference page and article).

However, this blog will outline how to update the encoding in the case that you do not choose to change your file format.

Since all indexed columns will need to be no more than 191 characters, we will need to truncate them down to exactly 191 characters. You may choose to be selective about which columns you truncate and only truncate the VARCHAR columns that currently have indexes associated with them, but for consistency’s sake, and to ensure we can index any additional VARCHAR columns in the future, in this guide we’ll opt to truncate all of them to 191.

Obviously, if you’re dealing with production data, you should be very careful about arbitrarily truncating columns as this could lead to data loss if you have rows with any column values over 191 characters. Not only could you lose the ends of string values, but if a column is storing a serialized array or hash field, you would be unable to deserialize it after truncating it.

So to ensure we’re not truncating any important data, lets make a method that can identify any rows that contain a VARCHAR colum value that exceeds our maximum 191 character length in production:

MAX_VARCHAR_LENGTH = 191

def find_long_values
  base_query = "SELECT table_name, column_name
     FROM INFORMATION_SCHEMA.COLUMNS
     WHERE table_schema = '#{ActiveRecord::Base.connection.current_database}'
     AND collation_name is not null
     AND (collation_name != '#{COLLATION}' OR character_set_name != '#{CHARACTER_SET}')
     AND data_type = 'varchar'
     AND character_maximum_length > #{MAX_VARCHAR_LENGTH}"

   all_columns = ActiveRecord::Base.connection.select_rows(
     "SHOW COLUMNS FROM #{table}"
   ).map { |item| item[0] }

   rows_above_max = ActiveRecord::Base.connection.select_rows(
     "SELECT #{all_columns.join(',')}
      FROM #{table}
      WHERE CHAR_LENGTH(#{column}) > #{MAX_VARCHAR_LENGTH}"
   )
   if rows_above_max.present?
     puts "Table: '#{table}', Column: '#{column}'"
     rows_above_max.each do |row|
       found_rows = []
       # Assuming `id` is the first column in your table
       puts "Row id: #{row[0]}"
     end
   else
     puts "No long values found"
    end
  end
end

We will need to run this before changing the encoding on the database itself, but after beginning our downtime and taking a database backup to ensure we are not truncating any values. If we encounter any rows in our database that have columns above the max, we will need to manually correct any columns that exceed the length.

 
Performing the Fix in Production

With all of the pieces in place, here’s an outline of the steps involved in performing the encoding change in production.

  1. Disable all flow of data into our database.
    • Temporarily disable our customer portals so users cannot enter new data
    • Wait for any background workers to complete their processes, and disable them
    • Disable any periodic processes that modify data
  2. Take a full database backup for safety in case something goes wrong with our encoding change, or we accidentally truncate valuable data.
  3. Run rake task to find long values.
    • If there are any long values, correct them via the Rails console or a direct MySQL connection
  4. Run rake task to fix encoding.
    • Ensure checksums before and after the change match
    • Sanity check our database to ensure collation changes were made
  SELECT table_name, column_name, character_set_name, collation_name, column_type
  FROM information_schema.`COLUMNS`
  WHERE table_schema = 'your-database-schema-name';
  1. Re-enable data flow to our database
    • Bring up application
    • Bring up background workers and re-enable periodic processes
  2. Perform a production test of entering special characters.

And that’s it! With the fix in place your users should be able to enter emojis to their heart’s content without breaking your application. 🎉🎉🎉

Greg Wahl

About the Author

Greg Wahl

Senior Software Engineer

Greg is a Senior Software Engineer at Snapsheet, currently working on the Platform team to develop a number of high-impact tools and applications for their claims software suite. He also is involved with the Snapsheet Pride Circle, and enjoys playing the saxophone.

Get the latest from Snapsheet delivered to your inbox.