In most Rails applications, we use an ActiveRecord Model backed by a Database Table whenever we need to query data. While this is sufficient for most scenarios, a few situations may benefit from using a View, or even raw SQL Statements to update data in a Database table. In this blog post, we’ll go through why our need for complex derived data in one of our Rails applications at Snapsheet pushed us to explore all of these options and describe our final solution. We’ll also explain the pros and cons of each approach so that you can decide what would work best for your project.
Let’s first start with an example scenario and database schema to understand the problem better. Let’s say you are developing a new TV show review website. This site will allow users to view and post reviews for each episode of a TV show and will display interesting metrics about these reviews based on season, genre, review time, and more. You have laid out the database schema as follows:
Using this collected data, you would like to create live analytics around what users say about each show and season. You also have a constraint that these analytics must have a minimal delay and should stay accurate as new reviews are being submitted.
Some examples of these metrics may be:
Counts of top reviews by reviewee_type (Critic, Consumer)
Average rating of TV shows mid-season vs. post-season
Average rating of TV shows by Genre or Network
Average rating of Seasons by Episode Count
Average rating of Season difference
Let’s use “Average rating of TV shows mid-season vs post-season” as an example of how one might implement this with the traditional ActiveRecord Model and Database Table paradigm.
First of all, your Database Table for this analysis may look something like this:
To keep this table up to date with every new review that comes in, we would need to implement a callback on the Review model to update the appropriate rows on this table:
In this simple example, doing these calculations in a callback is relatively innocuous and can serve as a good solution. We only have a single after_commit hook, which will adequately update our ReviewAnalytics table after every submitted review. The code is readable and easy to follow.
However, if this were a more extensive system with many pre-existing after_commit callbacks, large data quantities, as well as being a multithreaded Rails application, we would have some additional considerations that may steer us away from this solution:
We would want to implement some read locks to prevent race conditions between multiple threads updating the same ReviewAnalytics rows at the same time as other after_commit hooks are firing, which may be triggering updates on other tv show data.
Some other metrics, like “Analytic reviews of TV shows by Genre,” would also require some joins and group_bys involving the genre attribute on the tv_show table, which would introduce some significant code complexity and more data rows to lock.
Adding these calculation callbacks to the Review model also detracts from the system's primary purpose, which is to accurately log user reviews. Introducing this complexity can make the system more challenging to maintain in the long term.
On a fundamental level, these metrics are all derived datapoints, which means storing them as new data can introduce a “multiple sources of truth” problem where the analytics data does not match the real review data it is calculated on due to race conditions or other bugs.
Doing these calculations in Rails means loading all of this data into the application memory, which can become a bottleneck as data size grows.
For these reasons, a better approach in a complicated live system may be to utilize a Database View.
What is a Database View?
A Database View is a virtual representation of a Database Table formed using a query to generate rows and columns. This can be made of multiple tables joined together within the resulting query or a subset of data from a single table. The View occupies no space within the database and is calculated in place whenever a query is made to it. A view is created using the following syntax:
The main advantage of views is that it can simplify complex queries that draw from multiple tables into a single virtual table.
Getting the View into Rails
A large part of using a View with complex calculations is the ability to integrate it into a Rails platform. Therefore, we needed to research what Ruby gems were available to allow migrations to have Database Views. We landed on the Scenic gem, which allows SQL Views within the Rails application without switching from the classic schema.rb file to use a structure.sql for tracking the View. This gem treats Views like Tables. You can generate a migration that creates the View. Then, to back an ActiveRecord model, you set a method for readonly? that returns true. If you need to change the View, the migration handler allows that, as well as versioning the View. A caveat to this gem is that the primary database support is for Postgres, so we needed a third-party ScenicMysqlAdapter for our MySQL database. Once we had the tooling to add a View to our Rails application, we needed to design the View. A few considerations went into this design, such as index support, efficiency, calculation complexity, available methods, and scalability.
Running the command to generate the View and model files is like generating a migration file.
Then, we can edit the db/views/review_analytics_v01.sql to show what our View select statement will look like for the above scenario
Finally, running rake db: migrate will add the view to the schema.rb as:
Using the view clarifies what the table calculations contain and keeps it easily maintainable.
Once we had the View, we noticed that selecting from the View with a large data set would not work. Because the View has to recalculate the entire table even with a where clause added to it, the run-time was very slow. In this example, to look for data for a single tv show, the View took ~3 seconds to run. This test dataset we are using only has about 125,000 episode reviews and 11,000 reviews. This run time would exponentially grow with more columns or calculations needed to add to the review_analytics table. For a larger application, this is too slow for a production environment. Our real data set has ~1 million records in each table we would have to join. In our environment, this meant fetching a limited subset still took ~30 seconds.
Another constraint that forced us to move away from using a View was the impact on downstream data processes, such as our analysis schema data pipeline, which delivers other aggregated analytics to our customer data warehouses (DWHs). The image below shows an overview of our data pipeline, with DB #1 being where our data lives.
We were forced to implement a solution that would not only solve for our use case but also be compatible with our entire ETL operation. Our data pipeline ingests data from multiple disparate databases and combines them into a single warehouse to enable querying across previously logically separated databases. From here, transformation logic is applied to the data, and the newly transformed data is persisted in our Analysis Schema which is then used for analysis and replicated to customers' data warehouses. This system relies on database triggers and thus does not support the use of Views, so we needed to find an alternative solution that would still allow updates in our source DWH to propagate through to our Analysis Schema and further downstream DWHs.
Shift To A Table
Eventually, we landed on using an INSERT INTO statement on an after_commit rails callback that would insert into our analytics table to recalculate the values. This improves upon our initial approach by running calculations on the database level instead of within Rails and ensures the required data integrity. It also improves upon our second solution using a View, allowing us to avoid recalculating the entire table when one value changes.
Using a table, we needed to ensure all of the appropriate tables had the proper indexes to make the calculations quick, the same way we needed for the View. This table reduced the time to select data for a single tv show from ~3 seconds to ~0.04 seconds. A greater than 300% reduction in the speed efficiency of the query. This reduction can be attributed to how most of the work is upon the INSERT into the table rather than the SELECT fetch. When querying from a view for a single tv show, the full view must be calculated on the spot and then reduced for the rows we care about. The fetch is very quick for the table; all the data already exists and with an index on the tv show.
Deadlocks have become a concern when doing concurrent inserts into the table. When we started our Rails implementation, we didn’t realize that locking the dependent tables was a bad idea. Since the INSERT will do the appropriate locking for us, there is no need to add locks on the rails objects manually. Even so, MySQL documentation even states:
Always be prepared to re-issue a transaction if it fails due to deadlock. Deadlocks are not dangerous. Just try again.
Comparison of Table vs. View
When deciding to use a table or a view for these calculations, we encountered a few pros and cons.
When deciding between the best data retrieval method in your rails application, whether callbacks on ActiveRecord models, using Views, or calculated tables, you must consider a few things. Start first by assessing the complexity of the data and the transformations it requires. Using callbacks on an ActiveRecord model may be a tried and true method of getting your data in the right place, but beware of locks on your data. Views work great for gathering sets of calculated data on an infrequent basis, but when you need to deliver a small portion of your data set consistently, calculating a View contributes to unnecessary overhead. Creating and persisting calculated tables may be the most flexible and performant option; however, it will likely require effective planning upfront and maintenance. Another possible alternative to these options would be the using Materialized Views, essentially database-persisted View results stored on disk. Unfortunately, using materialized Views is not natively supported in MySQL, our application database. In truth, there is no perfect solution to presenting highly transformed data in your Rails application. Each of the solutions mentioned in this post has its benefits and drawbacks. We hope that in writing this, we have provided you with a guideline and tools to make that decision more strategically.
Authors: Carrie Utter, Kazuhiro Uchikata, Alex Khalil