Commit Graph

36 Commits

Author SHA1 Message Date
Syed Paymaan Raza
c3e7542cda Update end year in copyright header 2024-08-02 09:40:11 -07:00
Lukas Joswiak
611849cd5c Throw when we get cancelled instead of sending an error
UBSAN was complaining about undefined behavior when trying to access the
`SAV` inside a promise after an actor had been cancelled. If we are
cancelled, don't try to return an error, just throw.
2023-06-07 15:26:31 -07:00
Lukas Joswiak
795b666e23 Fix a rare configuration database data loss bug
See the comment contained in this commit. This bug could only manifest
under a specific set of circumstances:

1. A coordinator change is started
2. The coordinator change succeeds, but its action of clearing
   `previousCoordinatorsKey` is delayed.
3. A minority of `ConfigNode`s have an old state of the configuration
   database, compared to the majority.
4. A `ConfigNode` in the majority dies and permanently loses data.
5. A long delay occurs on the `PaxosConfigConsumer` when it tries to
   read the latest changes from the `ConfigNode`s.

In the above circumstances, the `ConfigBroadcaster` could incorrectly
send a snapshot of an old state of the configuration database to a
majority of `ConfigNode`s. This would cause new, durable, and
acknowledged commit data to be overwritten.

Note that this bug only affects the configuration database (used for
knob storage). It does not affect the normal keyspace.
2022-11-22 11:20:04 -08:00
Lukas Joswiak
8d237ba493 Fix various correctness and timeout issues
Contains the following fixes:

* When handling the special case rollforward where nodes can be rolled
  forward even if a majority are at version 0, we don't want to reset
  the live version of the node being rolled forward. This is because a
  quorum of nodes at version 0 can continue handing out and incrementing
  their live version, and if they are rolled forward there is the
  potential for them to go back in time in regard to their live version.
  So in this one special case, they should maintain their existing live
  version.
* Fixes some unseed issues due to fields not being initialized properly.
* Temporarily disables a coordinator restart in the recovery path (in
  the coordinated state) due to it causing a timeout. This needs more
  investigation in the future.
2022-09-13 16:53:54 -07:00
Lukas Joswiak
74ac617a34 Add support for changing coordinators to the configuration database
Configuration database data lives on the coordinators. When a change
coordinators command is issued, the data must be sent to the new
coordinators to keep the database consistent.
2022-09-13 16:53:54 -07:00
Markus Pilman
1de37afd52 Make TEST macros C++ only (#7558)
* proof of concept

* use code-probe instead of test

* code probe working on gcc

* code probe implemented

* renamed TestProbe to CodeProbe

* fixed refactoring typo

* support filtered output

* print probes at end of simulation

* fix missed probes print

* fix deduplication

* Fix refactoring issues

* revert bad refactor

* make sure file paths are relative

* fix more wrong refactor changes
2022-07-19 13:15:51 -07:00
Lukas Joswiak
9ca8a3c683 Reenable status json for dynamic knobs, add unit test 2022-06-21 11:43:05 -07:00
Andrew Noyes
6f500b59c0 Fix a heap-use-after-free in PaxosConfigConsumer.actor.cpp (#7244)
* Fix a heap-use-after-free in PaxosConfigConsumer.actor.cpp

* Two more defensive local promises

* Two more defensive promise copies

* Fix latent logic error
2022-05-25 12:08:30 -07:00
Renxuan Wang
154de018ff One place in PaxosConfigConsumer was missed in #6926. (#7006)
* One place in PaxosConfigConsumer was missed.

* Minor improvements.
2022-04-28 18:32:55 -07:00
Renxuan Wang
c69a07a858 Check in the new Hostname logic. (#6926)
* Revert #6655.

20220407-031010-renxuan-c101052c21da8346           compressed=True data_size=31004844 duration=4310801 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:04:15 sanity=False started=100047 stopped=20220407-041425 submitted=20220407-031010 timeout=5400 username=renxuan

* Revert #6271.

20220407-051532-renxuan-470f0fe6aac1c217           compressed=True data_size=30982370 duration=3491067 ended=100002 fail_fast=10 max_runs=100000 pass=100002 priority=100 remaining=0 runtime=0:59:57 sanity=False started=100141 stopped=20220407-061529 submitted=20220407-051532 timeout=5400 username=renxuan

* Revert #6266.

Remove resolution-related functionality from the connection string. The connection string will be used for storage only and will be immutable.

20220407-175119-renxuan-55d30ee1a4b42c2f           compressed=True data_size=30970443 duration=5437659 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:31 sanity=False started=100154 stopped=20220407-185050 submitted=20220407-175119 timeout=5400 username=renxuan

* Add hostname to coordinator interfaces.

* Turn on the new hostname logic.

* Add the corresponding change in config txns.

The most notable change is that, before calling basicLoadBalance(), we need to call tryInitializeRequestStream() to initialize the request streams.

Passed correctness tests.

* Return error when hostnames cannot be resolved in coordinators command.

* Minor fixes.
2022-04-27 21:54:13 -07:00
Chaoguang Lin
af9deeabc2 Move the Promise<QuorumVersion> before the Future vector so that it is destroyed after the vector 2022-03-22 16:12:41 -07:00
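The reordering in af9deeabc2 relies on a C++ rule: non-static data members are destroyed in the reverse of their declaration order. The sketch below only illustrates that rule. `Tracer`, `QuorumState`, and `destructionOrder` are hypothetical names, and the members are stand-ins rather than Flow's actual `Promise` and `Future` types.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: appends its name to a log when it is destroyed.
struct Tracer {
    std::string name;
    std::vector<std::string>* log;
    Tracer(std::string n, std::vector<std::string>* l) : name(std::move(n)), log(l) {}
    ~Tracer() { log->push_back(name); }
};

// Stand-in for a class that owns a promise and the futures that depend on it.
struct QuorumState {
    Tracer promise; // declared first, so it is destroyed last
    Tracer futures; // declared second, so it is destroyed first
    explicit QuorumState(std::vector<std::string>* log)
      : promise("promise", log), futures("futures", log) {}
};

// Returns the order in which QuorumState's members were destroyed.
inline std::vector<std::string> destructionOrder() {
    std::vector<std::string> log;
    {
        QuorumState state(&log);
    } // state goes out of scope here; members are destroyed in reverse order
    assert(log.size() == 2);
    return log;
}
```

Declaring the promise ahead of the vector means anything still held by the vector is torn down while the promise is alive.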
sfc-gh-tclinkenbeard
a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
sfc-gh-tclinkenbeard
0e7dc83f25 Fix compilation issues with ModelInterface construction in configuration database code 2022-03-16 14:25:32 -07:00
Lukas Joswiak
c3e48fff9f Update fdbserver/PaxosConfigConsumer.actor.cpp
Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
2022-03-16 08:59:12 -07:00
Lukas Joswiak
582ba5d519 Fix issue with stuck config nodes
In rare circumstances where the cluster controller dies / moves to a new
machine, sometimes only a minority of `ConfigNode`s received messages
telling them they were registered. When the `ConfigNode`s attempt to
register with the new broadcaster (on the new cluster controller), the
knob system would get stuck because only a minority would be registered.
Part of this change allows registration of unregistered `ConfigNode`s if
there is no path to a majority of registered nodes.
2022-03-15 11:42:58 -07:00
Lukas Joswiak
d0da6c63c1 Rollforward out of date nodes, compaction fixes 2022-03-14 11:20:56 -07:00
Lukas Joswiak
a8828db58e Load balance dynamic knob requests
This commit also removes an attempt to read the latest configuration
snapshot when a rollforward timeout occurs. The normal retry loop will
eventually fetch an up-to-date snapshot, and the rollforward will then
be retried.
2022-02-22 10:53:48 -08:00
Lukas Joswiak
e8354d82bd Fix timeout issue when using >3 coordinators
The calculation to determine how many non-timeout replies had been
received was incorrect. As a result, rollback/rollforward requests were
not sent and the dynamic knob subsystem got stuck.
2022-02-09 13:43:33 -08:00
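The commit message for e8354d82bd does not show the corrected calculation. As a rough illustration of the kind of counting involved, the sketch below tallies non-timeout replies and compares them against a strict majority. `ReplyStatus`, `majoritySize`, `nonTimeoutReplies`, and `haveQuorum` are hypothetical names, and the majority threshold is an assumption, not something taken from the FoundationDB source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reply outcome; not a type from the FoundationDB code base.
enum class ReplyStatus { Ok, TimedOut, Pending };

// Smallest number of nodes that forms a strict majority of `total`.
inline std::size_t majoritySize(std::size_t total) {
    return total / 2 + 1;
}

// Number of replies that arrived and did not time out.
inline std::size_t nonTimeoutReplies(const std::vector<ReplyStatus>& replies) {
    std::size_t count = 0;
    for (ReplyStatus r : replies) {
        if (r == ReplyStatus::Ok) {
            ++count;
        }
    }
    return count;
}

// Assumed rule: act once a majority of nodes has sent a non-timeout reply.
inline bool haveQuorum(const std::vector<ReplyStatus>& replies) {
    assert(!replies.empty());
    return nonTimeoutReplies(replies) >= majoritySize(replies.size());
}
```

With five coordinators the threshold is three, so a check that hard-codes the three-coordinator threshold of two would be wrong as soon as the cluster grows.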
Lukas Joswiak
7fc4f0d649 Reuse existing quorum timeout error code 2022-02-09 13:43:33 -08:00
Lukas Joswiak
d5a562e6b8 Fix dynamic knobs correctness issues 2022-02-09 13:43:32 -08:00
Lukas Joswiak
30b525a607 Add assertions to check rollback 2021-10-25 12:03:22 -07:00
Lukas Joswiak
c96f560cbe Verify rollback of a single version in simulation, other small fixes 2021-10-25 12:03:22 -07:00
Lukas Joswiak
6078664792 clang-format 2021-10-25 12:03:22 -07:00
Lukas Joswiak
57c2cf4a24 Retry messages to well known endpoints, add notes for future work 2021-10-25 12:03:22 -07:00
Lukas Joswiak
92998fd20b Merge rollback message into rollforward message 2021-10-25 12:03:22 -07:00
Lukas Joswiak
7357d7714c Retry with well known endpoints, move last committed check to consumer 2021-10-25 12:03:22 -07:00
Lukas Joswiak
1631a1b352 Update fdbserver/PaxosConfigConsumer.actor.cpp
Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
2021-10-25 12:03:22 -07:00
Lukas Joswiak
e79c6c7456 Fix issue where previous commit messages were reused
Fixes an issue where commit versions from previous requests sent to
ConfigNodes were being reused when a new quorum of commit versions was
requested. This was occurring due to a failure to reset the state of
GetCommittedVersionQuorum after a full snapshot request.
2021-10-25 12:03:22 -07:00
Lukas Joswiak
9d78604c5b Add rollback and rollforward logic to ConfigBroadcaster 2021-10-25 12:03:22 -07:00
Lukas Joswiak
9a39da85b1 Fix issue where previous commit messages were reused
Fixes an issue where commit versions from previous requests sent to
ConfigNodes were being reused when a new quorum of commit versions was
requested. This was occurring due to a failure to reset the state of
GetCommittedVersionQuorum after a full snapshot request.
2021-10-25 12:03:22 -07:00
Lukas Joswiak
48dc91dd7f Add rollback and rollforward logic to ConfigBroadcaster 2021-10-25 12:03:22 -07:00
sfc-gh-tclinkenbeard
b15daf1886 Added PImpl class
This class propagates the constness of methods to their pimpl
implementations
2021-08-09 10:04:34 -07:00
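For readers unfamiliar with the idea behind b15daf1886, the sketch below shows one way a pimpl wrapper can propagate constness: a `const` outer method can only reach `const` methods of the implementation. This is a minimal illustration, not the class added in that commit. `ConstPropagatingPImpl`, `CounterImpl`, and `Counter` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical wrapper: owns a T and exposes it as const T when the wrapper is const.
template <class T>
class ConstPropagatingPImpl {
    std::unique_ptr<T> impl;

public:
    explicit ConstPropagatingPImpl(std::unique_ptr<T> p) : impl(std::move(p)) {
        assert(impl != nullptr);
    }

    template <class... Args>
    static ConstPropagatingPImpl create(Args&&... args) {
        return ConstPropagatingPImpl(std::make_unique<T>(std::forward<Args>(args)...));
    }

    T* operator->() { return impl.get(); }
    const T* operator->() const { return impl.get(); }
    T& operator*() { return *impl; }
    const T& operator*() const { return *impl; }
};

// Hypothetical implementation class, used only to demonstrate the wrapper.
struct CounterImpl {
    int value = 0;
    void increment() { ++value; }
    int get() const { return value; }
    // Overload pair that reveals which constness the wrapper forwarded.
    bool calledAsConst() { return false; }
    bool calledAsConst() const { return true; }
};

class Counter {
    ConstPropagatingPImpl<CounterImpl> impl;

public:
    Counter() : impl(ConstPropagatingPImpl<CounterImpl>::create()) {}
    void increment() { impl->increment(); }
    int get() const { return impl->get(); }
    bool viaConst() const { return impl->calledAsConst(); }
    bool viaMutable() { return impl->calledAsConst(); }
};
```

With a bare `std::unique_ptr<CounterImpl>` member, a `const` method of `Counter` could still call `increment()` on the implementation; the `const` overload of `operator->` here turns that into a compile error.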
sfc-gh-tclinkenbeard
9cfd6ed955 Add simple implementation to PaxosConfigConsumer 2021-07-18 17:07:10 -07:00
sfc-gh-tclinkenbeard
748a3ebfbe Add GetSnapshotAndChangesRequest type 2021-05-18 15:28:44 -07:00
sfc-gh-tclinkenbeard
ea8396c9be Improve decoupling of configuration database interfaces and implementations 2021-05-17 15:31:03 -07:00
sfc-gh-tclinkenbeard
32f38394b1 Added dummy PaxosConfigConsumer implementation 2021-05-17 13:41:50 -07:00