Orca: Quality of Service
EXPERIMENTAL: This feature is still in an early adoption / experimental phase. While you can use it today (Orca v6.71.0), Netflix is currently running this in learning mode / judiciously enabling in response to on-call events.
Spinnaker ships with an optional Quality of Service (QoS) module that can be used to manage the amount of active executions running at any given time. By default, this QoS module is disabled, but can be enabled and tuned with a handful of knobs and different strategies. Before we dive into the configuration settings, we’ll go over how QoS works.
How QoS Works
The QoS system is self-contained within Orca’s orca-qos module and operates only on newly created executions (both pipeline and orchestrations). That is to say, the QoS system will not impact running executions: Once an execution has started, it is completely outside of the domain of the QoS module. The rest of this section assumes that QoS is enabled.
When an execution is submitted to Orca (either manually via the API or UI, or through an automated trigger), Orca will first emit a synchronous
BeforeExecutionPersist event which the QoS
is listening on.
The behavior of the
ExecutionBufferActuator depends firstly on the result of a
BufferStateSupplier can perform whatever heuristics necessary to determine whether or not any new execution should go through the QoS process.
false, no other QoS actions occur and the execution is started as normal.
In the event
true, the execution is passed through a chain of ordered
BufferPolicy functions return a result defining whether or not to
ENQUEUE the execution.
BufferPolicy functions must return
ENQUEUE, otherwise the execution will be assigned a status of
BUFFERED, delaying the initialization of the execution.
When an execution is
BUFFERED, it will effectively stay in a waiting state until it is unbuffered, which we’ll go over later.
BufferPolicy functions are pluggable and can contain arbitrary logic.
For example, one
BufferPolicy that is always enabled is
, which will always
ENQUEUE an execution it is an Orchestration and from the UI.
This specific policy forces the
ENQUEUE status, even if other policies call for the execution to be buffered; this is done through a
force flag that policies can return.
An example of other pluggable behavior is determining buffering action based on criticality of the execution: At Netflix we have a custom concept of application criticality, so we can buffer low criticality executions to allow capacity for higher criticality executions.
Once an execution is put into a
BUFFERED state, it will remain in that state until the
decides to change it to
NOT_STARTED and enqueue it for processing.
Similar to the
ExecutionPromoter uses an ordered chain of
functions to determine what buffered executions to promote.
Every promotion cycle, the list of buffered executions (called candidates) are passed through the policies, each reducing the list of candidates to a final list of executions that will be promoted.
Again, these policies can have arbitrary logic, but by default a naive promotion policy is included that will promote the N oldest executions.
This promotion process happens (by default) on a 5-second interval on every Orca instance.
PromotionPolicy, the results of each function returns a result with a human readable “reason”, which is logged out for each execution that is evaluated so it is easy to trace.
Note: This is the first implementation of the QoS system, we plan to iterate on this concept and make it more advanced over time. You can read the original proposal to get an idea of a potential roadmap.
These configurations are not guaranteed to be fully inclusive of all knobs. A definitive list is available via the codebase.
qos.enabled: Boolean (default
false). Global flag controlling whether or not the system is enabled. This flag will not disable the
qos.learningMode.enabled: Boolean (default
true). If enabled, executions will always be
ENQUEUED, but log messages & metrics will be emitted saying what the system would have done. This flag has no effect on
pollers.qos.promoteIntervalMs: Integer (default
5000). The time (in milliseconds) that the promotion process will be run.
NaiveBufferPolicy will always buffer executions when enabled.
qos.bufferPolicy.naive.enabled: Boolean (default
ActiveExecutionsBufferStateSupplier will enable/disable the buffering state based on the number of active executions in the system.
qos.bufferingState.suppliermust be set to
qos.bufferingState.activeExecutions.threshold: Integer (default
100). The high threshold of active executions before QoS will start actuating on executions.
pollers.qos.updateStateIntervalMs: Integer (default
5000). The time (in milliseconds) that the function will update its internal record for how many executions are running in the system.
KillSwitchBufferStateSupplier will enable/disable the buffering state based on configuration only.
This is handy if you’re evaluating the fundamentals of the QoS system, or you want a break-the-glass operator knob to control QoS.
qos.bufferingState.suppliermust be set to
qos.bufferingState.killSwitch.enabled: Boolean (default
true, QoS will be enabled.
NaivePromotionPolicy will promote N executions every promotion cycle.
qos.promotionPolicy.naive.enabled: Boolean (default
true). Whether or not this policy is enabled.
qos.promotionPolicy.naive.size: Integer (default
1). The max number of executions to promote.
qos.executionsBuffered: Counter. The number of executions that have been buffered.
qos.executionsEnqueued: Counter. The number of executions that have been enqueued (e.g. passed through the system and were judged not to be buffered).
qos.actuator.elapsedTime: Timer. The amount of time that is spent passing an execution through all enabled
qos.promoter.elapsedTime: Timer. The amount of time that is spent passing an execution through all enabled
PromotionPolicys. Since the promoter is run on a static interval, this should usually be a relatively high, yet constant, number.
qos.promoter.executionsPromoted: Counter. The number of executions that have been promoted.
The QoS system is currently shared-nothing state. Each Orca instance will maintain its own state (aside from configuration) about whether or not it should be buffering executions, or when it should be running pollers.