High Availability
This page describes how you can configure a Halyard deployment to increase the availability of specific services beyond simply scaling them horizontally. Halyard does this by splitting a service's functionality into separate logical roles (also known as sharding). The benefits of doing this are specific to the service being sharded. These deployment strategies are inspired by Netflix's large-scale experience.
When sharded, each new logical service is given its own name, which means the logical services can be configured and scaled independently of one another.
Currently, this feature is available only for Clouddriver and Echo.
Important: Halyard only supports this functionality for a distributed Spinnaker deployment configured with the Kubernetes provider.
HA Clouddriver
Clouddriver benefits greatly from isolating its operations into separate services. To split Clouddriver for increased availability, run:
hal config deploy ha clouddriver enable
When Spinnaker is deployed with this flag enabled, Clouddriver will be deployed as four different services, each only performing a subset of the base Clouddriver’s operations.
By default, the four Clouddriver services communicate with the global Redis (the one all Spinnaker services speak to) provided by Halyard, but it is recommended that the logical Clouddriver services be configured to use an external Redis service. To be most effective, clouddriver-ro should be configured to speak to a Redis read replica, clouddriver-ro-deck should be configured to speak to a different Redis read replica, and the other two services should speak to the master. Halyard handles this automatically if you provide the endpoints using this command:
hal config deploy ha clouddriver edit --redis-master-endpoint $REDIS_MASTER_ENDPOINT --redis-slave-endpoint $REDIS_SLAVE_ENDPOINT --redis-slave-deck-endpoint $REDIS_SLAVE_DECK_ENDPOINT
The values for REDIS_MASTER_ENDPOINT, REDIS_SLAVE_ENDPOINT, and REDIS_SLAVE_DECK_ENDPOINT must be valid Redis URIs. More information on Redis replication can be found here.
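As a sketch, the hostnames below are hypothetical, and the field names are illustrative of how Halyard records these endpoints in the halconfig after the command above runs; valid Redis URIs use the redis:// scheme:

```yaml
# Illustrative fragment of ~/.hal/config; substitute your own Redis hosts.
deploymentEnvironment:
  haServices:
    clouddriver:
      enabled: true
      redisMasterEndpoint: redis://redis-master.example.com:6379
      redisSlaveEndpoint: redis://redis-replica-0.example.com:6379
      redisSlaveDeckEndpoint: redis://redis-replica-1.example.com:6379
```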
clouddriver-caching
The first of the four logical Clouddriver services is the clouddriver-caching service. This service caches and retrieves cloud infrastructure data. Since this is all that clouddriver-caching does, there is no communication between this service and any other Spinnaker service.
This service's name when configuring its sizing is spin-clouddriver-caching.
To add a custom profile or custom service settings for this service, use the name clouddriver-caching.
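As a sketch, sizing for this service can be set under the halconfig's customSizing block; the replica count and resource values below are hypothetical, not recommendations:

```yaml
# Illustrative fragment of ~/.hal/config.
deploymentEnvironment:
  customSizing:
    spin-clouddriver-caching:
      replicas: 2
      requests:
        cpu: 2
        memory: 4Gi
      limits:
        cpu: 2
        memory: 4Gi
```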
clouddriver-rw
The second logical Clouddriver service is the clouddriver-rw service. This service handles all mutating operations aside from what the clouddriver-caching service does. This service can be scaled to handle an increased number of writes.
This service's name when configuring its sizing is spin-clouddriver-rw.
To add a custom profile or custom service settings for this service, use the name clouddriver-rw.
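For example, a custom service settings file can raise the JVM heap available to the write path; the file path follows the usual Halyard convention, and the JAVA_OPTS value is illustrative:

```yaml
# ~/.hal/default/service-settings/clouddriver-rw.yml (illustrative)
env:
  JAVA_OPTS: "-Xms2g -Xmx4g"
```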
clouddriver-ro
The clouddriver-ro service handles all read requests to Clouddriver. This service can be scaled to handle an increased number of reads.
This service's name when configuring its sizing is spin-clouddriver-ro.
To add a custom profile or custom service settings for this service, use the name clouddriver-ro.
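A custom profile for this service is a standard Spring Boot properties file; as a sketch, the following raises log verbosity to help debug read traffic (the logger name and level are illustrative):

```yaml
# ~/.hal/default/profiles/clouddriver-ro-local.yml (illustrative)
logging:
  level:
    com.netflix.spinnaker.clouddriver: DEBUG
```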
clouddriver-ro-deck
The clouddriver-ro-deck service handles all read requests to Clouddriver from Deck (through Gate). This service can be scaled to handle an increased number of reads.
This service's name when configuring its sizing is spin-clouddriver-ro-deck.
To add a custom profile or custom service settings for this service, use the name clouddriver-ro-deck.
HA Echo
Echo can be split into two separate services that handle different operations. To split Echo for increased availability, run:
hal config deploy ha echo enable
When Spinnaker is deployed with this enabled, Echo will be deployed as two different services: echo-scheduler and echo-worker. Although only the echo-worker service can be horizontally scaled, splitting the services reduces the load on both.
echo-scheduler
The echo-scheduler service handles scheduled tasks (cron jobs). Since it performs its tasks periodically (no triggers), there is no need for communication with other Spinnaker services.
This service's name when configuring its sizing is spin-echo-scheduler. To avoid duplicate triggering, this service must be deployed with exactly one pod.
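The single-pod requirement can be enforced explicitly in the halconfig's customSizing block; the resource values below are hypothetical:

```yaml
# Illustrative fragment of ~/.hal/config.
deploymentEnvironment:
  customSizing:
    spin-echo-scheduler:
      replicas: 1   # must stay at 1 to avoid duplicate triggering
      requests:
        cpu: 1
        memory: 1Gi
```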
To add a custom profile or custom service settings for this service, use the name echo-scheduler.
echo-worker
The echo-worker service handles all operations of Echo besides the cron jobs.
This service's name when configuring its sizing is spin-echo-worker. This service can be scaled to more than one pod, unlike the echo-scheduler.
To add a custom profile or custom service settings for this service, use the name echo-worker.
Deleting Orphaned Services
When enabling or disabling HA for a service on a running Spinnaker, Halyard will not clean up the old service(s) by default. This means that if a non-HA Clouddriver is running (for example) and Spinnaker is then deployed with HA Clouddriver enabled, the non-HA Clouddriver will still be running, even though it is no longer used. To clean up these orphaned services, add the --delete-orphaned-services flag to hal deploy apply:
hal deploy apply --delete-orphaned-services
HA Topology
With all services enabled for high availability, the new architecture looks like this:
Recoverability
If you've configured an external Redis, Spinnaker can recover from failure events. Igor is responsible for polling the state store (Redis) and recreating state. During a recovery, many old pipelines may be re-triggered. To protect against this scenario, Igor has a setting called pollingSafeguard.itemUpperThreshold, which is the maximum number of pipeline triggers to accept before Igor recognizes a mass re-trigger scenario and stops the state update. Read the inline documentation for this setting.
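The threshold can be tuned through a custom Igor profile; as a sketch, the value below is illustrative, not a recommendation:

```yaml
# ~/.hal/default/profiles/igor-local.yml (illustrative)
pollingSafeguard:
  # Maximum number of items a poll may produce before Igor assumes
  # a mass re-trigger scenario and skips the state update.
  itemUpperThreshold: 1000
```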
Although it is possible, using AWS spot instances or preemptible nodes to lower cost in production is not recommended, as outages in your continuous-deployment tool will likely cost more than any savings. Also consider hosting Spinnaker on isolated infrastructure to reduce the possibility that other applications or teams affect your Spinnaker instance.
It's also recommended to spread services over multiple availability zones, as described in the Netflix implementation.