We have a Rails app hosted on Heroku which periodically develops a memory leak, pushing it well over Heroku’s per-dyno memory quota and slowing everything down as it hits swap. The issue is intermittent and unpredictable, occurring only every few days, and it’s easy enough to deal with: just restart the dynos. However, it has a habit of happening at night or on weekends (the site is used entirely in the US), which makes it difficult to deal with out of hours.
While we are making efforts to find the cause of the leak, our primary concern is to make sure the site remains usable. To that end, I’ve put together a little something to restart the web dynos automatically, even when it’s the middle of the night for us.
We use the LogEntries service, available as a free plugin for Heroku apps, to monitor our applications. LogEntries tails the logs and triggers alerts based on configurable conditions. It can detect all the Heroku platform errors, such as the one we are interested in, “R14 Memory quota exceeded”, and then send an email, post a Slack notification, or poke a webhook. It seemed logical to use LogEntries to restart the dynos when they got into trouble.
Restarting the Dynos
To restart our web dynos we create an ActiveJob task, which uses the Heroku Platform API (Ruby gem) to fetch the list of dynos, filter them down to just the running web instances (we’ve never had a problem with the workers), and restart each one in turn.
First, install the Heroku CLI OAuth plugin:
heroku plugins:install https://github.com/heroku/heroku-oauth
Then create an OAuth token with write privileges (I suggest creating the token from a Heroku account that can only access this app) and set it as an environment variable:
heroku authorizations:create -s write
heroku config:add RESTART_API_KEY=<API KEY>
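Now create an ActiveJob task, which we’ve called RestartAppJob. It reads the app name from the APP_NAME environment variable, so set that alongside RESTART_API_KEY.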
require 'platform-api'

class RestartAppJob < ActiveJob::Base
  queue_as :restarts

  class Dyno
    attr_accessor :type
    attr_accessor :name
    attr_accessor :state

    def self.connection
      if ENV['RESTART_API_KEY']
        @@connection ||= PlatformAPI.connect_oauth(ENV['RESTART_API_KEY'])
      end
    end

    def self.dynos
      connection.dyno.list(ENV['APP_NAME']).map do |dyno_info|
        Dyno.new(dyno_info)
      end
    end

    def self.running_web_dynos
      dynos.select { |dyno| dyno.web? && dyno.up? }
    end

    def web?
      type == 'web'
    end

    def up?
      state == 'up'
    end

    def connection
      self.class.connection
    end

    def restart!
      connection.dyno.restart(ENV['APP_NAME'], name)
    end

    def initialize(info)
      self.type = info['type']
      self.name = info['name']
      self.state = info['state']
    end
  end

  def perform(*args)
    if Dyno.connection
      Dyno.running_web_dynos.each do |dyno|
        dyno.restart!
      end
    end
  end
end
As you can see, most of the work is done in the Dyno class.
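To see the filtering step in isolation, here is a minimal sketch that applies the same type and state check to a hand-written sample of dyno data. The sample hashes and the `running_web_dyno_names` helper are made up for illustration; only the `type`, `name`, and `state` keys are taken from the job above, and no call to Heroku is made.

```ruby
# Hypothetical sample of the hashes the dyno list might contain; only the
# 'type', 'name', and 'state' keys used by the job are included here.
SAMPLE_DYNOS = [
  { 'type' => 'web',    'name' => 'web.1',    'state' => 'up' },
  { 'type' => 'web',    'name' => 'web.2',    'state' => 'starting' },
  { 'type' => 'worker', 'name' => 'worker.1', 'state' => 'up' }
].freeze

# Illustrative helper (not part of the job): names of web dynos that are up.
def running_web_dyno_names(dyno_infos)
  dyno_infos
    .select { |info| info['type'] == 'web' && info['state'] == 'up' }
    .map { |info| info['name'] }
end

puts running_web_dyno_names(SAMPLE_DYNOS).inspect # prints ["web.1"]
```

Only web.1 is selected: web.2 is still starting and worker.1 is not a web dyno, so neither would be restarted.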
Calling…
RestartAppJob.perform_later
…will queue up a job to restart your web dynos.
Triggering the Job
To trigger the job we have a controller action that looks like this…
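def restart_web_dynos
  expected_key = ENV['RESTART_WEBHOOK_KEY']

  # Require the key to be configured, otherwise a missing env var would
  # match a missing 'key' parameter (nil == nil) and allow the restart.
  if expected_key.present? && params[:key] == expected_key
    RestartAppJob.perform_later
    # render plain: replaces render text:, which was removed in Rails 5.1
    render plain: 'Restart triggered'
  else
    render plain: 'You are not allowed to restart the dynos'
  end
end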
You can put this in any controller you think is appropriate, and set up the routes however you like. It expects a ‘key’ parameter that matches whatever you set the RESTART_WEBHOOK_KEY environment variable to (I suggest generating a GUID using the SecureRandom library).
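As a rough sketch of the key handling, the snippet below generates a GUID with SecureRandom and checks a supplied key against the expected one. The helper names `generate_webhook_key` and `webhook_key_valid?` are hypothetical, and the comparison that avoids stopping at the first differing byte is an optional hardening I've assumed here, not something the controller above requires. In a Rails app, `ActiveSupport::SecurityUtils.secure_compare` would be the more natural choice for that comparison.

```ruby
require 'securerandom'
require 'digest'

# Generate a random key once, then store it in RESTART_WEBHOOK_KEY.
def generate_webhook_key
  SecureRandom.uuid
end

# Hypothetical helper: true only when both keys are present and equal.
# Both sides are hashed first so the byte-by-byte comparison always runs
# over inputs of the same length and never exits early.
def webhook_key_valid?(given, expected)
  return false if given.nil? || expected.nil? || expected.empty?

  given_bytes = Digest::SHA256.digest(given.to_s).bytes
  expected_bytes = Digest::SHA256.digest(expected.to_s).bytes
  given_bytes.zip(expected_bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
end

key = generate_webhook_key
puts webhook_key_valid?(key, key)      # prints true
puts webhook_key_valid?('wrong', key)  # prints false
puts webhook_key_valid?(nil, nil)      # prints false
```

The last case matters most: when the key is unset on both sides, the check should fail rather than count as a match.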
With the controller action in place you can set the webhook action in LogEntries to point to a URL such as http://example.com/foo/restart_web_dynos?key=somejibberish, substituting your own host, path, and key.
Now, whenever LogEntries detects the memory quota issue it will call the webhook, which will schedule the job, which restarts the dynos. You could extend this to other events or monitoring services easily enough.
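If you wanted to do the detection yourself instead of relying on LogEntries, the matching could look something like the sketch below. The regular expression, the `memory_quota_dyno` helper, and the sample log lines are all assumptions for illustration, based on the usual `heroku[web.1]: Error R14 (Memory quota exceeded)` shape of Heroku platform error lines; check them against your own logs before relying on them.

```ruby
# Assumed shape of a Heroku platform error line:
#   <timestamp> heroku[<dyno>]: Error <code> (<message>)
PLATFORM_ERROR = /heroku\[(?<dyno>[\w-]+\.\d+)\]: Error (?<code>[A-Z]\d+) \((?<message>[^)]*)\)/

# Hypothetical helper: returns the dyno name if the line reports an R14
# memory quota error, or nil for any other line.
def memory_quota_dyno(line)
  match = PLATFORM_ERROR.match(line)
  return nil unless match && match[:code] == 'R14'

  match[:dyno]
end

# Made-up sample lines, not taken from a real log.
r14_line = '2016-01-01T03:12:45+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)'
other_line = '2016-01-01T03:12:46+00:00 heroku[web.2]: Error R12 (Exit timeout)'

puts memory_quota_dyno(r14_line).inspect   # prints "web.1"
puts memory_quota_dyno(other_line).inspect # prints nil
```

Capturing the error code separately means the same pattern could be reused for other platform errors by changing the code being checked.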
Caveat
Obviously this relies on at least one dyno still being functional. We tend to find that while the app slows down when it hits the quota, it doesn’t actually stop, so this approach is OK. However, if you have dynos that stop responding entirely, you will need to host this code separately.