My deploy failed with HTTP health checks failed

Cause

When your App has one or more HTTP(S) Endpoints, Enclave automatically performs Health Checks during your deploy to make sure your Containers are properly responding to HTTP traffic.

If your containers are not responding to HTTP traffic, the health check fails.

These health checks are called Release Health Checks.
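To make the idea concrete, here is a minimal sketch of what a check of this kind amounts to: poll the app over HTTP until it answers or a deadline passes. This is not Enclave's implementation. The function name `wait_for_http`, the 5-second per-request timeout, the polling interval, and the choice to treat any HTTP response as a pass are all assumptions made for illustration.

```python
import http.client
import time


def wait_for_http(host, port, path="/", timeout_seconds=180.0, interval_seconds=1.0):
    """Return True once host:port answers an HTTP request, False at the deadline."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        try:
            conn.request("GET", path)
            conn.getresponse()  # any status code means the app speaks HTTP
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timed out, or the reply was not HTTP: retry.
            time.sleep(interval_seconds)
        finally:
            conn.close()
    return False
```

Running something like `wait_for_http("127.0.0.1", 8000, timeout_seconds=10)` against your app locally is a quick way to see whether it answers HTTP at all before you deploy.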

Resolution

There are several reasons why the health check might fail, each with its own fix:

App is exiting immediately

If your app crashes immediately on startup, it’s not healthy. In this case, Enclave will indicate that your Containers exited, and report their Container Command and exit code.

You’ll need to identify why your Containers are exiting immediately. There are usually two possible causes:

  • There’s a bug and your container is crashing. If this is the case, it should be obvious from the logs. To proceed, fix the issue, and try again.
  • Your container is starting a program that immediately daemonizes. In this case, your container will appear to have exited from Enclave’s perspective. To proceed, make sure the program you’re starting stays in the foreground and does not daemonize, then try again.

App listens on incorrect host

If your app is listening on localhost (a.k.a. 127.0.0.1), then Enclave cannot connect to it, so the health check won’t pass.

Indeed, your app is running in Containers, so if the app is listening on 127.0.0.1, then it’s only routable from within those Containers, and notably it’s not routable from the Endpoint.

To solve this issue, you need to make sure your app is listening on all interfaces. Most application servers let you do so by binding to 0.0.0.0.
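As an illustration, the sketch below uses Python's standard-library HTTP server and binds it to 0.0.0.0. The handler, the `make_server` helper, and port 8000 are hypothetical choices for this example; your own application server will have its own bind setting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with a small plain-text body."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="0.0.0.0", port=8000):
    # "0.0.0.0" listens on every interface. With "127.0.0.1" the server
    # would only accept connections that originate inside the container.
    return HTTPServer((host, port), HelloHandler)
```

Calling `make_server().serve_forever()` starts the server in the foreground, which also keeps the container from exiting.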

App listens on incorrect port

If your Containers are listening on a given port, but the Endpoint is trying to connect to a different port, the health check can’t pass.

There are two possible scenarios here:

  • Your Image does not expose the port your app is listening on.
  • Your Image exposes multiple ports, but your Endpoint and your app are using different ports.

In either case, to solve this problem, you should make sure that:

  • The port your app is listening on is exposed by your image. For example, if your app listens on port 8000, your Dockerfile must include the following directive: EXPOSE 8000.
  • Your Endpoint is using the same port as your app. By default, Enclave HTTP(S) Endpoints automatically select the lexicographically lowest port exposed by your image. Ports are compared as strings, not as numbers, so if your image exposes ports 443 and 80, the default is 443. You can select the port Enclave should use when creating the Endpoint, and modify it at any time.
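The lexicographic ordering described above can be surprising, so here is a small sketch of it. The function name `default_endpoint_port` is hypothetical and this is only an illustration of string ordering, not Enclave's actual code.

```python
def default_endpoint_port(exposed_ports):
    """Return the lowest port when ports are compared as strings."""
    return int(min(str(port) for port in exposed_ports))


# "443" sorts before "80" because "4" comes before "8".
print(default_endpoint_port([80, 443]))     # 443
print(default_endpoint_port([3000, 8080]))  # 3000
```

If the port chosen this way is not the one your app listens on, set the Endpoint's port explicitly.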

App takes too long to come up

It’s possible that your app Containers are simply taking longer to finish booting up and start accepting traffic than Enclave is willing to wait.

Indeed, by default, Enclave waits for up to 3 minutes for your app to respond. However, you can increase that timeout by setting the RELEASE_HEALTHCHECK_TIMEOUT Configuration variable on your app.

There is one particular error case worth mentioning here:

Gunicorn and [CRITICAL] WORKER TIMEOUT

When starting a Python app using Gunicorn as your application server, the health check might fail with a repeated set of [CRITICAL] WORKER TIMEOUT errors.

These errors are generated by Gunicorn when your worker processes fail to boot within Gunicorn’s timeout. When that happens, Gunicorn terminates the worker processes, then starts over.

By default, Gunicorn’s timeout is 30 seconds. This means that if your app needs e.g. 35 seconds to boot, Gunicorn will repeatedly time out, then restart it from scratch.

As a result, even though Enclave gives you 3 minutes to boot up (configurable with RELEASE_HEALTHCHECK_TIMEOUT), an app that needs 35 seconds to boot will time out on the Release Health Check, because Gunicorn is repeatedly killing then restarting it.
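The arithmetic can be sketched with a deliberately simplified model: a worker that needs longer than the worker timeout is killed and restarted from scratch until the release deadline passes. The function name `boots_before_deadline` is hypothetical, and the model ignores restart overhead; the 30-second and 180-second defaults come from the figures above.

```python
def boots_before_deadline(boot_seconds, worker_timeout=30, release_timeout=180):
    """Model a worker that is killed whenever booting exceeds worker_timeout."""
    elapsed = 0
    while elapsed < release_timeout:
        if boot_seconds <= worker_timeout:
            # The worker finishes booting on this attempt.
            return elapsed + boot_seconds <= release_timeout
        # Killed at the worker timeout, then restarted from scratch.
        elapsed += worker_timeout
    return False


print(boots_before_deadline(25))                     # True
print(boots_before_deadline(35))                     # False: killed every 30 s
print(boots_before_deadline(35, worker_timeout=60))  # True
```

In this model, raising the release timeout alone never helps a 35-second boot, because no single attempt is ever allowed to finish.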

30 seconds might seem like a long time for your app to boot up, but with a large app and a small Container on a Stack enforcing CPU Limits, hitting this timeout is fairly common. Besides, you might have configured the timeout with a lower value (via the --timeout option).

There are two recommended strategies to address this problem:

  • If you are using a synchronous worker in Gunicorn (the default), use Gunicorn’s --preload flag. This option will cause Gunicorn to load your app before starting worker processes. As a result, when the worker processes are started, they don’t need to load your app, and they can immediately start listening for requests instead (which won’t time out).
  • If you are using an asynchronous worker in Gunicorn, increase your timeout using Gunicorn’s --timeout flag.
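Both strategies can also be expressed in a Gunicorn configuration file, which is plain Python. The sketch below is a hypothetical gunicorn.conf.py; the bind address, worker count, and 60-second timeout are illustrative values rather than recommendations, and in practice you would pick whichever of the last two settings matches your worker type.

```python
# gunicorn.conf.py (hypothetical example; values are illustrative)

# Listen on all interfaces, on the port exposed by the image.
bind = "0.0.0.0:8000"

# Number of worker processes.
workers = 2

# Synchronous workers: load the app once in the master before forking,
# so workers do not each pay the boot cost. Same as the --preload flag.
preload_app = True

# Asynchronous workers: allow more than the default 30 seconds.
# Same as passing --timeout 60.
timeout = 60
```

Gunicorn reads a file like this when started with its `-c` option, for example `gunicorn -c gunicorn.conf.py myapp:app`, where `myapp:app` stands in for your own application module.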

Note

If neither of the options listed above satisfies you, you can also reduce your worker count using Gunicorn’s --workers flag, or scale up your Container to make more resources available to them.

We don’t recommend these options to address boot-up timeouts because they affect your app beyond the boot-up stage, respectively by reducing the number of available workers and increasing your bill.

That said, you should definitely consider making changes to your worker count or Container size if your app is performing poorly or Metrics are reporting you’re undersized: just don’t do it only for the sake of making the Release Health Check pass.

App is not expecting HTTP traffic

HTTP(S) Endpoints expect your app to be listening for HTTP traffic. If you need to expose an app that’s not expecting HTTP traffic, you shouldn’t be using an HTTP(S) Endpoint.

Instead, you should consider TLS Endpoints and TCP Endpoints.