UBSAN reported undefined behavior when accessing the `SAV` inside a
promise after an actor had been cancelled. If the actor is cancelled,
don't try to return an error through the promise, just throw.
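
A minimal sketch of that pattern in plain C++ rather than Flow. `ActorCancelled`, `ReplySlot`, and `runAndReport` are hypothetical stand-ins invented for illustration; they are not the types touched by this commit.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the cancellation error delivered to an actor.
struct ActorCancelled : std::runtime_error {
    ActorCancelled() : std::runtime_error("actor_cancelled") {}
};

// Hypothetical stand-in for a promise. In the real code, the promise's
// shared state may already be gone once the actor has been cancelled.
struct ReplySlot {
    std::optional<std::string> error;
    void sendError(const std::string& what) { error = what; }
};

// Runs `work`. Ordinary failures are reported through `reply`, but
// cancellation is rethrown without touching `reply` at all.
template <class Work>
void runAndReport(Work work, ReplySlot* reply) {
    try {
        work();
    } catch (const ActorCancelled&) {
        throw; // cancelled: do not dereference `reply`
    } catch (const std::exception& e) {
        assert(reply != nullptr);
        reply->sendError(e.what());
    }
}

// Demonstration helper: true if cancellation escaped and the slot was untouched.
inline bool cancellationPropagates() {
    ReplySlot slot;
    try {
        runAndReport([] { throw ActorCancelled(); }, &slot);
    } catch (const ActorCancelled&) {
        return !slot.error.has_value();
    }
    return false;
}
```

The more specific `catch` comes first so that cancellation never reaches the branch that writes to the reply.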
See the comment contained in this commit. This bug could only manifest
under a specific set of circumstances:
1. A coordinator change is started
2. The coordinator change succeeds, but its action of clearing
`previousCoordinatorsKey` is delayed.
3. A minority of `ConfigNode`s have an old state of the configuration
database, compared to the majority.
4. A `ConfigNode` in the majority dies and permanently loses data.
5. A long delay occurs on the `PaxosConfigConsumer` when it tries to
read the latest changes from the `ConfigNode`s.
In the above circumstances, the `ConfigBroadcaster` could incorrectly
send a snapshot of an old state of the configuration database to a
majority of `ConfigNode`s. This would cause new, durable, and
acknowledged commit data to be overwritten.
Note that this bug only affects the configuration database (used for
knob storage). It does not affect the normal keyspace.
Contains the following fixes:
* When handling the special case rollforward where nodes can be rolled
forward even if a majority are at version 0, we don't want to reset
the live version of the node being rolled forward. This is because a
quorum of nodes at version 0 can continue handing out and incrementing
their live version, and if they are rolled forward there is the
potential for them to go back in time in regard to their live version.
So in this one special case, they should maintain their existing live
version.
* Fixes some unseed issues due to fields not being initialized properly.
* Temporarily disables a coordinator restart in the recovery path (in
the coordinated state) due to it causing a timeout. This needs more
investigation in the future.
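
The special-case rollforward in the first fix above can be sketched roughly as follows. This is an illustration only: `NodeVersions` and `rollforward` are hypothetical names, and the sketch assumes that a normal rollforward resets the live version to the rollforward target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the two versions a node tracks.
struct NodeVersions {
    std::int64_t committed;
    std::int64_t live;
};

// Rolls `node` forward to `target`.
// Assumption: in the normal case the live version is reset to `target`.
// In the special case (a majority at version 0) the existing live version
// is kept, so a node that has already handed out higher live versions
// cannot go back in time.
inline NodeVersions rollforward(NodeVersions node, std::int64_t target, bool majorityAtVersionZero) {
    assert(target >= node.committed);
    node.committed = target;
    if (!majorityAtVersionZero) {
        node.live = target;
    }
    return node;
}
```

For example, a node at committed version 0 with live version 7 that is rolled forward to 5 under the special case ends at committed version 5 and keeps live version 7.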
Configuration database data lives on the coordinators. When a change
coordinators command is issued, the data must be sent to the new
coordinators to keep the database consistent.
* proof of concept
* use code-probe instead of test
* code probe working on gcc
* code probe implemented
* renamed TestProbe to CodeProbe
* fixed refactoring typo
* support filtered output
* print probes at end of simulation
* fix missed probes print
* fix deduplication
* Fix refactoring issues
* revert bad refactor
* make sure file paths are relative
* fix more wrong refactor changes
* Fix a heap-use-after-free in PaxosConfigConsumer.actor.cpp
* Two more defensive local promises
* Two more defensive promise copies
* Fix latent logic error
In rare circumstances where the cluster controller dies or moves to a
new machine, only a minority of `ConfigNode`s may have received messages
telling them they were registered. When the `ConfigNode`s then attempted
to register with the new broadcaster (on the new cluster controller), the
knob system would get stuck because only a minority would be registered.
Part of this change allows registration of unregistered `ConfigNode`s if
there is no path to a majority of registered nodes.
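
One way to picture the registration rule, as a sketch under stated assumptions: `majority` and `mayRegisterUnregistered` are hypothetical helpers, and "no path to a majority" is simplified here to "fewer registered nodes are reachable than a majority". The actual check in the broadcaster may take more state into account.

```cpp
#include <cassert>
#include <cstddef>

// Smallest number of nodes that forms a majority of `totalNodes`.
inline std::size_t majority(std::size_t totalNodes) {
    return totalNodes / 2 + 1;
}

// Hypothetical decision helper: unregistered nodes may be registered only
// when the registered nodes that can still be reached cannot form a
// majority on their own.
inline bool mayRegisterUnregistered(std::size_t totalNodes, std::size_t registeredReachable) {
    assert(registeredReachable <= totalNodes);
    return registeredReachable < majority(totalNodes);
}
```

With five `ConfigNode`s, two reachable registered nodes are not a majority, so unregistered nodes would be admitted; with three, they would not.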
This commit also removes an attempt to read the latest configuration
snapshot when a rollforward timeout occurs. The normal retry loop will
eventually fetch an up-to-date snapshot and the rollforward will be
retried.
The calculation to determine how many non-timeout replies had been
received was incorrect. As a result, rollback/rollforward requests were
not sent and the dynamic knob subsystem got stuck.
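
A sketch of the kind of count involved. The names `ReplyOutcome`, `countNonTimeoutReplies`, and `shouldSendRollRequests` are hypothetical, and the rule that requests go out once a majority of nodes have given a non-timeout reply is an assumption made for illustration, not a statement of the actual condition in the code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical classification of a reply from one node.
enum class ReplyOutcome { Success, Error, Timeout };

// Counts every reply that is not a timeout, whether it succeeded or failed.
inline std::size_t countNonTimeoutReplies(const std::vector<ReplyOutcome>& replies) {
    std::size_t n = 0;
    for (ReplyOutcome r : replies) {
        if (r != ReplyOutcome::Timeout) {
            ++n;
        }
    }
    return n;
}

// Assumed rule: send rollback/rollforward requests once a majority of
// the `totalNodes` nodes have answered with something other than a timeout.
inline bool shouldSendRollRequests(const std::vector<ReplyOutcome>& replies, std::size_t totalNodes) {
    assert(replies.size() <= totalNodes);
    return countNonTimeoutReplies(replies) >= totalNodes / 2 + 1;
}
```

If the count is too low, for instance because failed replies are wrongly treated as timeouts, the threshold is never reached and no requests are sent, which matches the stuck behavior described above.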
Fixes an issue where commit versions from previous requests sent to
ConfigNodes were being reused when a new quorum of commit versions was
requested. This was occurring due to a failure to reset the state of
GetCommittedVersionQuorum after a full snapshot request.
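
The shape of the fix can be sketched as below. `CommittedVersionTracker` and `onFullSnapshotApplied` are hypothetical names used only for illustration and do not mirror the real `GetCommittedVersionQuorum` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical tracker for the committed versions reported by each node
// during one quorum round.
class CommittedVersionTracker {
public:
    void record(const std::string& node, std::int64_t version) { replies_[node] = version; }
    std::size_t replyCount() const { return replies_.size(); }

    // Forget everything from the previous round so that stale versions
    // cannot be counted toward the next quorum.
    void reset() { replies_.clear(); }

private:
    std::map<std::string, std::int64_t> replies_;
};

// Assumed hook: called after a full snapshot request completes and before
// a new quorum of committed versions is requested.
inline void onFullSnapshotApplied(CommittedVersionTracker& tracker) {
    tracker.reset();
}
```

Without the reset, replies recorded before the snapshot would still be present when the next round starts.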