Orca: Quality of Service
EXPERIMENTAL: This feature is still in an early adoption / experimental phase. While you can use it today (Orca v6.71.0), Netflix is currently running it in learning mode and enabling it judiciously in response to on-call events.
Spinnaker ships with an optional Quality of Service (QoS) module that can be used to manage the number of active executions running at any given time. By default, this QoS module is disabled, but it can be enabled and tuned with a handful of knobs and different strategies. Before we dive into the configuration settings, we’ll go over how QoS works.
How QoS Works
The QoS system is self-contained within Orca’s orca-qos module and operates only on newly created executions (both pipelines and orchestrations). That is to say, the QoS system will not impact running executions: once an execution has started, it is completely outside of the domain of the QoS module. The rest of this section assumes that QoS is enabled.
When an execution is submitted to Orca (either manually via the API or UI, or through an automated trigger), Orca will first emit a synchronous BeforeExecutionPersist event which the QoS ExecutionBufferActuator is listening on. The behavior of the ExecutionBufferActuator depends firstly on the result of a BufferStateSupplier. The BufferStateSupplier can perform whatever heuristics necessary to determine whether or not any new execution should go through the QoS process. If the BufferStateSupplier returns false, no other QoS actions occur and the execution is started as normal.
In the event the BufferStateSupplier returns true, the execution is passed through a chain of ordered BufferPolicy functions. Each BufferPolicy function returns a result defining whether to BUFFER or ENQUEUE the execution. All BufferPolicy functions must return ENQUEUE; otherwise the execution will be assigned a status of BUFFERED, delaying the initialization of the execution. When an execution is BUFFERED, it will effectively stay in a waiting state until it is unbuffered, which we’ll go over later.
BufferPolicy functions are pluggable and can contain arbitrary logic. For example, one BufferPolicy that is always enabled is EnqueueDeckOrchestrationBufferPolicy, which will always ENQUEUE an execution if it is an orchestration submitted from the UI. This specific policy forces the ENQUEUE status, even if other policies call for the execution to be buffered; this is done through a force flag that policies can return.
An example of other pluggable behavior is determining buffering action based on criticality of the execution: At Netflix we have a custom concept of application criticality, so we can buffer low criticality executions to allow capacity for higher criticality executions.
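The buffering decision described above can be sketched as follows. This is a minimal illustration in Python, not Orca's actual code: the Execution fields, the origin value "deck", the result returned for non-UI executions, and the rule that the first forced result wins are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class BufferAction(Enum):
    BUFFER = "BUFFER"
    ENQUEUE = "ENQUEUE"


@dataclass
class BufferResult:
    action: BufferAction
    force: bool   # when True, this result overrides the other policies
    reason: str   # human-readable explanation, logged per execution


@dataclass
class Execution:
    # Hypothetical, simplified stand-in for an Orca execution.
    id: str
    type: str     # "PIPELINE" or "ORCHESTRATION"
    origin: str   # assumed values: "deck" for the UI, "api" otherwise


BufferPolicy = Callable[[Execution], BufferResult]


def naive_buffer_policy(execution: Execution) -> BufferResult:
    # Mirrors the documented behavior: always asks to buffer.
    return BufferResult(BufferAction.BUFFER, False, "Naive policy always buffers")


def enqueue_deck_orchestration_policy(execution: Execution) -> BufferResult:
    # Forces ENQUEUE for orchestrations submitted from the UI.
    if execution.type == "ORCHESTRATION" and execution.origin == "deck":
        return BufferResult(BufferAction.ENQUEUE, True, "Orchestration from the UI")
    # Assumption: otherwise the policy does not object to enqueueing.
    return BufferResult(BufferAction.ENQUEUE, False, "Not a UI orchestration")


def decide(execution: Execution, buffering_enabled: bool,
           policies: List[BufferPolicy]) -> BufferAction:
    # buffering_enabled stands in for the BufferStateSupplier result.
    if not buffering_enabled:
        return BufferAction.ENQUEUE
    results = [policy(execution) for policy in policies]
    # A forced result wins over everything else (first one, by assumption).
    for result in results:
        if result.force:
            return result.action
    # Otherwise every policy must agree on ENQUEUE.
    if all(r.action is BufferAction.ENQUEUE for r in results):
        return BufferAction.ENQUEUE
    return BufferAction.BUFFER
```

With both policies in the chain, a UI orchestration is enqueued despite the naive policy, while a pipeline triggered through the API is buffered.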
Once an execution is put into a BUFFERED state, it will remain in that state until the ExecutionPromoter decides to change it to NOT_STARTED and enqueue it for processing. Similar to the ExecutionBufferActuator, the ExecutionPromoter uses an ordered chain of PromotionPolicy functions to determine what buffered executions to promote.
Every promotion cycle, the list of buffered executions (called candidates) is passed through the policies in order, each of which may reduce the list, yielding the final list of executions that will be promoted.
Again, these policies can have arbitrary logic, but by default a naive promotion policy is included that will promote the N oldest executions.
This promotion process happens (by default) on a 5-second interval on every Orca instance.
With both BufferPolicy and PromotionPolicy, each function returns a result with a human-readable “reason”, which is logged for each execution that is evaluated so that decisions are easy to trace.
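The promotion cycle described above can be sketched along these lines. This is an illustrative Python sketch rather than Orca's implementation: the BufferedExecution type and its buffered_at field are hypothetical, and "oldest" is assumed to mean the earliest buffering time.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BufferedExecution:
    # Hypothetical stand-in for a buffered Orca execution.
    id: str
    buffered_at: float  # assumed timestamp of when it was buffered


PromotionPolicy = Callable[[List[BufferedExecution]], List[BufferedExecution]]


def naive_promotion_policy(size: int) -> PromotionPolicy:
    """Build a policy that keeps only the `size` oldest candidates."""
    def apply(candidates: List[BufferedExecution]) -> List[BufferedExecution]:
        oldest_first = sorted(candidates, key=lambda e: e.buffered_at)
        return oldest_first[:size]
    return apply


def promote(candidates: List[BufferedExecution],
            policies: List[PromotionPolicy]) -> List[BufferedExecution]:
    # Each policy receives the output of the previous one, so the
    # candidate list can only stay the same or shrink along the chain.
    for policy in policies:
        candidates = policy(candidates)
    return candidates
```

In this sketch, whatever `promote` returns is the set of executions that would be moved to NOT_STARTED and enqueued in that cycle.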
Note: This is the first implementation of the QoS system; we plan to iterate on this concept and make it more advanced over time. You can read the original proposal to get an idea of a potential roadmap.
Configuration
These configurations are not guaranteed to be fully inclusive of all knobs. A definitive list is available via the codebase.
qos.enabled: Boolean (default false). Global flag controlling whether or not the system is enabled. This flag will not disable the ExecutionPromoter.
qos.learningMode.enabled: Boolean (default true). If enabled, executions will always be ENQUEUED, but log messages & metrics will be emitted saying what the system would have done. This flag has no effect on the ExecutionPromoter.
pollers.qos.promoteIntervalMs: Integer (default 5000). The interval (in milliseconds) at which the promotion process runs.
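How the two global flags interact with a buffering decision can be sketched as below. This is a simplified Python illustration, not Orca's code: the function name, the string actions, and the log message are assumptions made for the example.

```python
import logging

logger = logging.getLogger("qos.sketch")


def effective_action(decided_action: str, qos_enabled: bool,
                     learning_mode: bool) -> str:
    """Return the action actually applied to a new execution.

    decided_action is what the policy chain asked for: "BUFFER" or "ENQUEUE".
    """
    if not qos_enabled:
        # qos.enabled=false: the system does not act on new executions.
        return "ENQUEUE"
    if learning_mode and decided_action == "BUFFER":
        # qos.learningMode.enabled=true: report the decision, do not apply it.
        logger.info("Learning mode: execution would have been buffered")
        return "ENQUEUE"
    return decided_action
```

With the defaults (qos.enabled false, learning mode true), no execution is ever buffered; both flags have to be flipped before a BUFFER decision takes effect.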
BufferPolicy: Naive
The NaiveBufferPolicy will always buffer executions when enabled.
qos.bufferPolicy.naive.enabled: Boolean (default true).
BufferStateSupplier: ActiveExecutions
The ActiveExecutionsBufferStateSupplier will enable/disable the buffering state based on the number of active executions in the system.
qos.bufferingState.supplier must be set to activeExecutions.
qos.bufferingState.activeExecutions.threshold: Integer (default 100). The high threshold of active executions before QoS will start actuating on executions.
pollers.qos.updateStateIntervalMs: Integer (default 5000). The interval (in milliseconds) at which the supplier updates its internal record of how many executions are running in the system.
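The threshold behavior can be sketched as below. This Python sketch is an illustration only: the class name, the count_active callback, and the strict "greater than" comparison are assumptions, since the exact comparison Orca uses is not stated here.

```python
from typing import Callable


class ActiveExecutionsSupplierSketch:
    """Illustrative stand-in for a threshold-based buffer state supplier."""

    def __init__(self, count_active: Callable[[], int], threshold: int = 100):
        self._count_active = count_active  # hypothetical source of the count
        self._threshold = threshold
        self._last_count = 0               # cached between polls

    def update_state(self) -> None:
        # Would be invoked every pollers.qos.updateStateIntervalMs milliseconds.
        self._last_count = self._count_active()

    def is_buffering(self) -> bool:
        # Assumption: buffering starts once the cached count exceeds the threshold.
        return self._last_count > self._threshold
```

Because the count is cached between polls, the buffering state in this sketch can lag the real number of active executions by up to one update interval.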
BufferStateSupplier: KillSwitch
The KillSwitchBufferStateSupplier will enable/disable the buffering state based on configuration only.
This is handy if you’re evaluating the fundamentals of the QoS system, or you want a break-the-glass operator knob to control QoS.
qos.bufferingState.supplier must be set to killSwitch.
qos.bufferingState.killSwitch.enabled: Boolean (default false). If true, QoS will be enabled.
PromotionPolicy: Naive
The NaivePromotionPolicy will promote N executions every promotion cycle.
qos.promotionPolicy.naive.enabled: Boolean (default true). Whether or not this policy is enabled.
qos.promotionPolicy.naive.size: Integer (default 1). The max number of executions to promote.
Monitoring
qos.executionsBuffered: Counter. The number of executions that have been buffered.
qos.executionsEnqueued: Counter. The number of executions that have been enqueued (i.e. passed through the system and were judged not to be buffered).
qos.actuator.elapsedTime: Timer. The amount of time that is spent passing an execution through all enabled BufferPolicy functions.
qos.promoter.elapsedTime: Timer. The amount of time that is spent passing an execution through all enabled PromotionPolicy functions. Since the promoter is run on a static interval, this should usually be a relatively high, yet constant, number.
qos.promoter.executionsPromoted: Counter. The number of executions that have been promoted.
Additional Notes
The QoS system currently uses shared-nothing state. Each Orca instance maintains its own state (aside from configuration) about whether or not it should be buffering executions and when it should be running pollers.