A microservice retry strategy with failover

Summary

The microservice architecture powering perkbound.com relies heavily on a service registry, service self-registration, and interservice RESTful communication. That's a lot of back and forth, and a lot of opportunities for failure.

To reduce the impact of these server faults, and to minimize the effects of timeouts and possible registry issues, I've implemented a retry interface with a local cache failover.

Strategies

The retry interface offers three different strategies for communicating with a service. All three strategies kick in when one of the following has occurred:

  1. A communication fault occurred
  2. A server error occurred

The three strategies offered are (each is sketched in code after the list):

  • Consecutive

    Retry consecutively for the iterations specified.

  • Delay

    Retry for the iterations specified with a delay (milliseconds, seconds, minutes, hours) between each attempt.

  • BackOff

    Retry for a maximum amount of time with an exponential delay between retries.
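
For a rough idea of what each strategy does under the hood, here's a minimal sketch. The names and the simple synchronous loops are illustrative, not perkbound's actual implementation:

using System;
using System.Threading;

public static class RetryStrategies
{
    // Consecutive: retry back-to-back for the given number of iterations.
    public static T Consecutive<T>(Func<T> call, int iterations)
    {
        for (var attempt = 1; ; attempt++)
        {
            try { return call(); }
            catch (Exception) when (attempt < iterations) { /* retry immediately */ }
        }
    }

    // Delay: retry for the given iterations with a fixed pause between attempts.
    public static T Delay<T>(Func<T> call, int iterations, TimeSpan delay)
    {
        for (var attempt = 1; ; attempt++)
        {
            try { return call(); }
            catch (Exception) when (attempt < iterations) { Thread.Sleep(delay); }
        }
    }

    // BackOff: retry until a deadline, doubling the pause after each failure.
    public static T BackOff<T>(Func<T> call, TimeSpan maxDuration)
    {
        var deadline = DateTime.UtcNow + maxDuration;
        var pause = TimeSpan.FromMilliseconds(100); // arbitrary starting delay
        while (true)
        {
            try { return call(); }
            catch (Exception) when (DateTime.UtcNow + pause < deadline)
            {
                Thread.Sleep(pause);
                pause += pause; // exponential: 100ms, 200ms, 400ms, ...
            }
        }
    }
}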

Fluent interfaces are still cool, right?

Fluent interfaces have their place, especially when you have a set of commands to build on and you'd like that set to be easy to read and follow. To call one of perkbound's microservices on a user request, the chain looks like:

new ServiceManager.Call(() => _userService.GetUser(1))
    .RetryOnFailFor(1).Minutes()
    .BackOff()
    .Execute();

I love how straightforward the command set is. This chain calls GetUser on an injected IUserService instance and, on a fault such as a timeout, retries for up to one minute with an increasing delay between attempts.
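
For the curious, a fluent surface like this can be wired up with a small builder whose unit methods (Minutes(), Seconds()) resolve the duration started by RetryOnFailFor(). The shape below is an assumption, simplified to the BackOff path only:

using System;
using System.Threading;

public class Call<T>
{
    private readonly Func<T> _call;
    private TimeSpan _maxDuration;

    public Call(Func<T> call) => _call = call;

    // RetryOnFailFor(1) captures the amount; the unit method resolves it.
    public UnitBuilder RetryOnFailFor(int amount) => new UnitBuilder(this, amount);

    public Call<T> BackOff() => this; // strategy selection omitted for brevity

    public T Execute()
    {
        var deadline = DateTime.UtcNow + _maxDuration;
        var pause = TimeSpan.FromMilliseconds(100);
        while (true)
        {
            try { return _call(); }
            catch (Exception) when (DateTime.UtcNow + pause < deadline)
            {
                Thread.Sleep(pause);
                pause += pause; // exponential back-off
            }
        }
    }

    public class UnitBuilder
    {
        private readonly Call<T> _parent;
        private readonly int _amount;

        public UnitBuilder(Call<T> parent, int amount)
        {
            _parent = parent;
            _amount = amount;
        }

        public Call<T> Minutes()
        {
            _parent._maxDuration = TimeSpan.FromMinutes(_amount);
            return _parent;
        }

        public Call<T> Seconds()
        {
            _parent._maxDuration = TimeSpan.FromSeconds(_amount);
            return _parent;
        }
    }
}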

The transient faults I'm currently looking for are:

  • Service Unavailable (503)
  • Gateway Timeout (504)
  • Bad Gateway (502)
  • Request Timeout (408)
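
Detecting these can be as simple as matching the response status against a whitelist. A minimal sketch, assuming the client surfaces an HttpStatusCode:

using System;
using System.Net;

public static class TransientFaults
{
    private static readonly HttpStatusCode[] Retryable =
    {
        HttpStatusCode.ServiceUnavailable, // 503
        HttpStatusCode.GatewayTimeout,     // 504
        HttpStatusCode.BadGateway,         // 502
        HttpStatusCode.RequestTimeout      // 408
    };

    // True when the failure is worth retrying; anything else bubbles up.
    public static bool IsTransient(HttpStatusCode status) =>
        Array.IndexOf(Retryable, status) >= 0;
}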

Retry with failover

I decided to use a service discovery pattern with my service registry, which means I make a request to the registry with a service name and the registry responds, if it finds a match, with a network location. With each successful request the caller records the registry's response in a local cache for use as a failover.

The service registry, while extremely important, is not mission critical to every request. In a setup where services typically stay put, the registry is repeatedly providing the same information on each request. In this scenario, where we are 99.9% sure we can infer the response, we do not want to retry the registry for a full minute; we want to try once and, on failure, immediately fail over to our cache (as long as it isn't too stale).

With that said, the fluent chain for contacting the service registry looks like:

new ServiceManager.Call(() => _serviceRegistryClient.Find("service-name"))
    .OnServerFault(() => _cacheManager.Find("service-name"))
    .Execute();
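
Semantically, OnServerFault boils down to a single attempt with an immediate fallback, something along these lines (illustrative, not the actual implementation):

using System;

public static class Failover
{
    // Try the primary call once; on a server fault, run the fallback instead.
    public static T ExecuteWithFallback<T>(Func<T> primary, Func<T> fallback)
    {
        try
        {
            return primary();   // ask the service registry
        }
        catch (Exception)       // in practice, filtered to the transient faults above
        {
            return fallback();  // immediately fail over to the local cache
        }
    }
}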

Surviving a restart/recycle

It is inevitable that the service registry or the calling site/service will go down at some point, under seen or unseen circumstances. The retry failover has two caching mechanisms: an in-memory cache and a persistent SQLite cache.

When a response is received from the service registry, the memory cache and the SQLite database are both updated. When falling back to cache on a registry fault, though, we only talk to the memory cache, for performance reasons.

Where does SQLite come into play? When the web application restarts, the memory cache is lost, meaning our cache queries have no results. The solution: on application restart the SQLite database rehydrates the memory cache, and we pretend no one noticed the service died momentarily.
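
Put together, the two-tier cache might look something like the sketch below. It uses Microsoft.Data.Sqlite, and the table name and schema are made up for illustration:

using System.Collections.Concurrent;
using Microsoft.Data.Sqlite;

public class RegistryCache
{
    private readonly ConcurrentDictionary<string, string> _memory =
        new ConcurrentDictionary<string, string>();
    private readonly string _connectionString;

    public RegistryCache(string connectionString) => _connectionString = connectionString;

    // Reads only ever touch the memory cache, for speed.
    public string Find(string serviceName) =>
        _memory.TryGetValue(serviceName, out var location) ? location : null;

    // Writes go to both tiers so the SQLite copy survives a restart.
    public void Record(string serviceName, string location)
    {
        _memory[serviceName] = location;
        using (var conn = new SqliteConnection(_connectionString))
        {
            conn.Open();
            var cmd = conn.CreateCommand();
            cmd.CommandText =
                "INSERT OR REPLACE INTO registry (name, location) VALUES ($name, $location)";
            cmd.Parameters.AddWithValue("$name", serviceName);
            cmd.Parameters.AddWithValue("$location", location);
            cmd.ExecuteNonQuery();
        }
    }

    // Call on application start to refill the memory cache from SQLite.
    public void Rehydrate()
    {
        using (var conn = new SqliteConnection(_connectionString))
        {
            conn.Open();
            var cmd = conn.CreateCommand();
            cmd.CommandText = "SELECT name, location FROM registry";
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    _memory[reader.GetString(0)] = reader.GetString(1);
            }
        }
    }
}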

Conclusion

I'm fairly happy with the result of the retry interface. The SQLite cache queries are fast enough that I don't think the memory cache is even necessary, but if you're already over-optimizing, why not scope creep a bit more :)

Keep those buzzwords buzzing, y'all.