Multi-Database joins in ActiveRecord

03/21/09

So, what I learned in the last two weeks is that going into an office everyday involves two thing:
1. Getting interrupted regularly, and therefore being much less efficient
2. Commuting many extra hours on the BART train, which is sometimes prone to spontaneous smoking (and subsequent delays).

So, my daily stand-up has not been so daily. Some of that has also been because I have been doing a lot of front end fixes. Unhappily, I only learned one new thing about CSS. Hopefully, I will be able to get that up sometime when I am not being held hostage on a smoking BART.

Today’s stand-up however involves using ActiveRecord with multiple databases. I have been doing this for a while because when I look at my data and see that some of it is user related and some of it is not, I imagine that the data will grow in very different ways.

Right now in my spare time, and in particular when I am experiencing a mass-transit kidnapping, I am working on an app billtrack.org. I was inspired to get off my ass about this because of the Sunlight Foundation’s App for America contest. I have wanted to do an app for a while that made sense of federal government. I have a friend that worked as a lobbyist and has done lots of other political work in Washington. When he talks to me about what bills are important, it makes a lot of sense. When I look at the news, I have no idea what is going on in Congress. So, although lots of sites like OpenCongress.org are doing a stand-up job putting together more information than I have tackled, I wanted to make a more usable interface for the three data models that seem most important: bills, people and issues.

The issues are a lot like tags, user generated ways of labeling bills. From a database perspective, the important part is that there is a lot of governmental data coming from remote sources and API mashery. Then there is member generated data. To me that says not just two databases that grow very differently. That says two different applications. So after I got some rigorous import functions in place, I split the app in two: an api app that imports governmental data, and the front facing app, that does all the unique stuff.

This front facing app has to access both databases while the api app, only cares about its own data. So, now we are into the realm of multi-database ActiveRecord. Lots of people have written about how to connect a secondary database to Rails, but I haven’t seen much wirtten about how to connect the cross database queries. Let’s take a look at some models:

class Bill < ApiData
  # ApiData is an abstract class inherited from ActiveRecord::Base. 
  # It makes a separate connection its database. 
end

# this join model has a table in the member database and 
# has some extra information that mandate it has its own model. 
class BillIssue < ActiveRecord::Base
  belongs_to :issue
  belongs_to :bill # yup, this works without any extra work. 
  # AR just knows which connection to use!
end

class Issue < ActiveRecord::Base
  has_many :bill_issues # not an issue 
    # since both tables are in the same db
  
  # sadly this won't work, since has_many starts 
  # its query by looking in the api database, 
  # then freaks out when it can't find a bill_issues table there
  has_many :bills, :through => :bill_issues # oh, no! failure!
end

There are a couple options for how to handle this connection through ActiveRecord. My first attempt at this resorted to raw sql with an AR wrapper for caching and the other AR goodies.


# in Issue model
has_many :bills, :finder_sql => '
  SELECT bills.* 
  FROM billtrack_data#{ self.class.table_environment }.bills, 
    billtrack_member#{ self.class.table_environment }.bill_issues
  WHERE bills.id = bill_issues.bill_id
    AND bill_issues.issue_id = #{id}
'

Couple things to note about this code:
1. The SQL string is single quoted but contains ruby insertions that usually require a double quoted string for execution. What is this about? Well, it turns out that on a per instance basis the code gets executed within AR as double quoted text. This happens when the relationship is executed on the instance. If you were to use double quotes for this string, the code would be executed at the time that the class is loaded. This means that the #{id} would embed the id for the Issue class rather than the issue instance. Clearly not the desired result. Worse though is the other bit of embedded code, which shall be described in point 2 below.

2. There is a strange ruby call in the string to a class level method called table_environment. I included this method in all AR models so that tests would work. I am using Merb, so here is the code:


def table_environment
“_#{Merb.env}”
end

Pretty basic stuff. It spits out an underscore and the environment. This attached to your table will bring up the right table for your environment. Because the #table_environment method lives in the class, and because this string is executed on an instance level, the self.class object must be sent this message. If the string were double quoted, it would be executed at the time that the model is loaded. So, Ruby would look for a ghost class or anonymous class for the model containing the #table_environment method. Since this method is not defined there, it would crash and burn!

My second round at this relationship make for a lot less SQL and therefore greater database portability.

# in the Issue model
has_and_belongs_to_many :bills, 
  :join_table => 
    "billtrack_member#{ self.table_environment }.bill_issues"

You can’t do this with a has_many :through declaration, because you cannot declare the join table. The through parameter anticipates a model and therefore, makes assumptions about the table that just don’t work out for our purposes. In the end though, you don’t loose anything by declaring the relationship in this way. You can still use and abuse the BillIssue model through its own declared relationship, and you get to have your SQL clothed in AR.

The join table receives a double quoted string which means the ruby inside is executed when the class is loaded. Therefore self rather than self.class is the appropriate object to call.

There is a problem with this, however. If you look at the SQL generated by the query you will see that it is selecting all from a inner join, which results in two columns containing id.

   SELECT * FROM `bills` 
   INNER JOIN `billtrack_member`.`bill_issues` 
   ON `bills`.id = `billtrack_member`.`bill_issues`.bill_id 
   WHERE (`billtrack_member`.`bill_issues`.issue_id = #{id} ) 

If you run this in your database directly, you will see that it related datadata from both tables. The query itself will namespace these different fields by database and table, but when AR goes to parse the resulting data, it apparently does not discriminate about which id it is grabbing. This leads to some really unexpected data.

The fix here is to include a select option:


has_and_belongs_to_many :bills, :select => “bills.*”
:join_table =>
“billtrack_member#{ self.table_environment }.bill_issues”