Commit Graph

424 Commits

Author SHA1 Message Date
Zhe Wang
42e17d8bd1 BulkLoading Use RangeLock (#11741)
* use range lock in bulk load

* refactor BulkLoading workload and nits

* add background traffic

* nits

* address comments
2024-10-31 12:58:13 -07:00
Zhe Wang
43446204ed Database Per-Range Lock (#11693)
* range lock framework

* improve the framework

* persist to txnStateStore

* fix bugs

* code clean

* code clean

* bug fix

* address comments

* add complex test workload and fix bugs found by the workload

* add workload correctness check and fix bugs

* code clean up

* add random range lock injection

* fix bugs in RandomRangeLock.actor.cpp

* enable random range lock injection in general workloads

* add rangelockcycle test

* disable random range lock in backup workloads

* nits

* add range lock ownership concept

* enable lock ownership to rangeLock

* api deal with tenant

* fix CI

* add test for multiple rangeLock owners

* nits

* address comments and renaming

* address comments
2024-10-23 16:25:56 -07:00
Jingyu Zhou
2313fdaa0e Add rocksdb, sharded rocksdb to configure workload (#11654)
* Add rocksdb, sharded rocksdb to configure workload

Also remove mentions of ssd-redwood-1-experimental.

* Fix test failure when SHARD_ENCODE_LOCATION_METADATA is off
2024-09-12 21:03:06 -07:00
Syed Paymaan Raza
c3e7542cda Update end year in copyright header 2024-08-02 09:40:11 -07:00
Zhe Wang
74990e44bd Bulk Loading Framework (#11369) 2024-07-23 14:57:28 -07:00
Jingyu Zhou
d9e4c49503 Fix more -Wunused-variable warnings 2024-07-17 15:35:49 -07:00
Johannes M. Scheuermann
64b45088ae Make sure server list is validated against the excluded localities 2023-10-24 10:28:23 +02:00
Jingyu Zhou
ad259d48cf Merge pull request #11001 from jzhou77/main
Fix RemoveServersSafely workload timeout error
2023-10-23 09:26:33 -07:00
Johannes M. Scheuermann
fa1fd913a6 Issue get worker request concurrently 2023-10-20 10:26:32 +02:00
Jingyu Zhou
f99a8151a0 Fix RemoveServersSafely workload timeout error
This workload can hit a timeout error when using locality-based exclusion. The
sequence is:

1. The RemoveServersSafely workload excludes a locality by processid
2. Attrition reboots the target process, thus changing the processid, because
processid is generated for each worker process at fdbd()
3. RemoveServersSafely waits for the process exclusion, which never succeeds
4. Timeout

The fix monitors processid locality changes and reissues the exclusion with the
correct locality.

To reproduce:
seed: -f ./tests/fast/SwizzledRollbackSideband.toml -s 879108103 -b on
commit: a3dbd4baf release-7.1
2023-10-19 21:46:44 -07:00
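The failure sequence and fix described in commit f99a8151a0 can be illustrated with a small simulation. This is only a sketch under stated assumptions: `Worker`, `exclude_until_done`, and the `reboot_at` parameter are hypothetical names invented here, the `locality_processid:` string format is assumed, and none of this is the actual workload code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a worker process; not an FDB type."""
    address: str
    processid: str = field(default_factory=lambda: uuid.uuid4().hex)

    def reboot(self) -> None:
        # A reboot generates a fresh processid, as in step 2 of the sequence.
        self.processid = uuid.uuid4().hex


def exclude_until_done(worker, excluded, max_rounds=5, reboot_at=None):
    """Exclude `worker` by processid locality, reissuing if the id changes.

    `excluded` is a set of locality strings standing in for the exclusion
    list. Returns how many times the exclusion had to be issued.
    """
    issued = 0
    target = None
    for round_no in range(max_rounds):
        if reboot_at is not None and round_no == reboot_at:
            worker.reboot()  # simulated Attrition reboot
        current = f"locality_processid:{worker.processid}"
        if current != target:
            # First round, or the processid changed: drop the stale
            # exclusion and reissue it with the current locality.
            excluded.discard(target)
            excluded.add(current)
            target = current
            issued += 1
        elif current in excluded:
            return issued  # the exclusion matches the live process
    raise TimeoutError("exclusion never matched the live process")
```

Without the `current != target` check, a reboot between issuing the exclusion and waiting on it would leave the loop waiting on a processid that no longer exists, which is the timeout the commit message describes.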
Johannes M. Scheuermann
05383066b4 Make sure the get all excluded servers method returns servers that are excluded based on locality 2023-10-19 10:06:29 +02:00
William Dowling
0f752473be Merge branch 'main' into radixtree-production 2023-09-25 09:52:20 +02:00
Zhe Wang
7e8f326277 Audit storage for specific engine (#10781)
* audit storage for specific engine

* fix getStorageType

* fix budget of skipAuditOnRange

* fix budget in scheduleAuditOnRange

* fix CI error

* improve trace events

* address comments
2023-08-23 10:51:24 -07:00
Zhe Wu
ab4ae712e8 Add PerpetualWiggleStorageMigrationWorkload documentation. 2023-08-10 09:35:57 -07:00
Zhe Wu
863038a44c Add improvement for initializing storage server using new perpetual_wiggle_storage_engine config 2023-08-10 09:35:57 -07:00
Zhe Wang
3426fc3c1a Add DD Security Mode (#10646)
* dd-security-mode

* address comments

* cleanup

* revise tr option set in loadAndUpdateAuditMetadataWithNewDDId

* address comments

* reset auditStorageInitStarted before DD init

* decouple audit resume and audit launch

* audit launch new request should wait for resuming existing requests

* address comment/clean up/fix

* fix

* fix initAuditMetadata retry

* fix initAuditMetadata retry should reset tr
2023-07-21 17:06:25 -07:00
William Dowling
8625eb68b0 Allow both memory-radixtree and memory-radixtree-beta modes 2023-07-06 14:51:23 +02:00
William Dowling
3ea1ba1648 Remove beta status from RadixTree storage engine 2023-07-05 17:54:54 +02:00
Hui Liu
630013cfd9 Fix MoveKeysClean.toml failure (#10470) 2023-06-13 08:45:03 -07:00
Zhe Wang
f8f8f72c4e Add audit storage cancellation (#10386)
* list audits

* cancel audits and corresponding tests

* make audit storage dblock aware

* increase audit retry since we are able to cancel

* fix updateAuditState and fdb github ci

* fmt

* fix fdbcli audit_storage and fix CI issue

* fix fdb cli

* address comments

* fmt
2023-06-06 14:29:53 -07:00
A.J. Beamon
ccf61ac2e5 Do not allow changing the coordinators to a set that is unreliable, because otherwise we could delete our coordinated state 2023-05-03 15:03:03 -07:00
Zhe Wang
d6e7b5f736 Audit storage: validate consistency of replica and shard location metadata (#9628)
* Implemented AuditUtils.actor.cpp

Moved AuditUtils to fdbserver/

* Persist AuditStorageState.

* Passed persisted AuditStorageState test.

* Added audit_storage_error to indicate a corruption is caught.

Throw/Send audit_storage_error when there is a data corruption.

Added doAuditStorage() for resuming Audit.

* Load and resume AuditStorage when DD restarts.

* Generate audit id monotonically.

* Fixed minor issue AuditId/Type was not set.

* Adding getLatestAuditStates.

* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.

* Added `audit_storage` fdbcli command.

* fmt.

* Fixed null shared_ptr issue.

* Improve audit data.

* Change DDAuditFailed to SevWarn.

* Sev.

* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.

* Moved AuditUtils* to fdbclient/.

* Added getAuditStatus fdbcli command.

* Refactor audit storage fdb cli commands.

* Added auditStorage in sim.

* Cleanup.

* Resolved comments.

* Resolved comments.

* Added SystemData for metadata audit.

Refactored audit workflow to make sure all sub-tasks are executed w/o
early exit.

* Improvements.

* Persisted Failed state after too many retries.

* Added retryCount for resumeAuditStorage().

* resolving conflict.

* Resolved conflicts.

* allow-merged-to-run

* add timeout to audit client

* fmt

* validate replica

* add audit serverKey

* address comments and fmt

* fix audit_storage_exceeded_request_limit

* fix segfault in getLatestAuditStatesImpl

* fix bugs

* remove timeout from workload

* fix bugs

* audit local view of shard assignment

* fmt

* fix-stuck-issue-and-make-dd-audit-storage-self-retry

* fix timeout

* fix timeout

* fix bugs and cleanup

* fix nit

* change name state to coreState for audit metadata

* address comments

* code clean

* fmt

* setup debug

* cleanup

* clean up

* code cleanup

* code clean

* remove tmp file

* fmt

* trace portion of shards that of anonymous physical shard

* remove unnecessary actor cleanup

* do not give up when tr is too old

* address commits

* refactor

* clean

* fmt

* fix-command-help-text

* fix-auditstate-restore-and-enable-restore-to-metadata-audit

* address comments

* fmt

* debug and improve efficiency of resume audit

* small change

* fix audit cli

* bypass completed audit when dd restart

* fix auditStorageCommandActor

* make mismatch key range more visible

* address comments

* make local shard metadata check able to make progress by retries

* address comments

* address comments

* partition location metadata validation by range and server

* unset MIN_TRACE_SEVERITY

* address comments and SS auto proceed until failed then notify dd

* persistNewAuditState should checkMoveKeysLock

* audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure

* code cleanup

* fix error message in metadata validation

* fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation

* add comments to code and add guard to make sure the SS audit does not proceed automatically too many times without being notified by DD --- to support audit cancellation later

* fix coalesceRangeList

* replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo

* simplify shard assignment history

* shardAssignmentRecordRequests should be unordered_map

* address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS

* only run validate shard info once at a time, other audit type does not have this limitation

---------

Co-authored-by: He Liu <heliu05023@gmail.com>
Co-authored-by: He Liu <heliu@apple.com>
Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>
2023-05-01 10:35:52 -07:00
Josh Slocum
aef5130da2 adding system priority option to getDatabaseConfiguration, and several debugging improvements (#9864) 2023-04-06 15:08:40 -05:00
Zhe Wu
50a20946d1 Implement check if locality is already excluded in exclude locality command 2023-04-01 19:04:58 -07:00
Zhe Wu
6e1bb08677 Update documentation 2023-03-24 15:29:47 -07:00
Zhe Wu
8211b5d097 Add a check in the excludeServer function so that, if the exclusion list already exists, no new writes are issued. 2023-03-24 14:57:31 -07:00
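The idea in commit 8211b5d097 (skip the write when the exclusion is already recorded) can be sketched as follows. The dictionary-backed `store`, the `key` name, and the subset test are all assumptions made for illustration; the real check in excludeServer may compare the lists differently.

```python
def exclude_servers(store: dict, key: str, addresses: set) -> bool:
    """Record `addresses` as excluded; return True only if a write happened.

    `store` is a plain dict standing in for the database (hypothetical).
    """
    existing = store.get(key, frozenset())
    if addresses <= existing:
        # Every requested address is already excluded: issue no new write.
        return False
    store[key] = frozenset(existing | addresses)
    return True
```

A repeated call with the same addresses then becomes a no-op, which avoids redundant writes when the exclude command is retried.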
Steve Atherton
50d567b5a5 Refactored some parts of database configuration to support log_engine=<name> and storage_engine=<name> and generate these when converting a DatabaseConfig JSON object to a configure command. Refactored fileconfigure and simulation setup to use the same JSON -> configure function as the same code was copy/pasted to both places but only one has been kept up to date with new features. Renamed Redwood to ssd-redwood-1 canonically but the experimental name is still supported for backward compatibility. 2023-03-04 20:52:31 -08:00
Jingyu Zhou
9a257a60a4 Address review comments 2023-02-24 10:47:32 -08:00
Jingyu Zhou
1f1dc5e768 Allow a comma separated list of excluded addresses 2023-02-23 14:29:08 -08:00
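As a rough illustration of what accepting a comma separated list of excluded addresses involves, the sketch below splits and validates such a value. `parse_excluded` is a hypothetical helper, the `ip:port` input shape is an assumption, and only IPv4-style `host:port` items are handled here; it does not reproduce fdbcli's actual parsing.

```python
import ipaddress


def parse_excluded(value: str) -> list:
    """Split 'ip:port,ip:port' into validated (ip, port) pairs."""
    result = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing or doubled comma
        host, sep, port = item.rpartition(":")
        if not sep:
            raise ValueError(f"missing port in {item!r}")
        ip = ipaddress.ip_address(host)  # raises ValueError on a bad address
        port_no = int(port)              # raises ValueError on a bad port
        if not 0 < port_no < 65536:
            raise ValueError(f"port out of range in {item!r}")
        result.append((str(ip), port_no))
    return result
```

For example, `parse_excluded("10.0.0.1:4500, 10.0.0.2:4501")` yields two address pairs, while an item without a port is rejected before anything is excluded.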
Jingyu Zhou
6ac8720364 Add exclude to fdbcli's configure command
Right now this only allows one server address to be excluded. This is useful
when the database is unavailable but we want the recruitment to skip some
particular processes.

Manually tested that the concept works with a loopback cluster.
2023-02-23 14:28:20 -08:00
A.J. Beamon
72c5abc0f5 Refactor storage quotas to store them in a key backed map in the tenant metadata space 2023-01-25 20:48:17 -08:00
Xiaoge Su
0a60142160 Extract ProcessInfo, MachineInfo, KillType out from ISimulator 2023-01-24 14:48:42 -08:00
He Liu
00203c8732 Validate Storage part II (#8471)
* Implemented AuditUtils.actor.cpp

Moved AuditUtils to fdbserver/

* Persist AuditStorageState.

* Passed persisted AuditStorageState test.

* Added audit_storage_error to indicate a corruption is caught.

Throw/Send audit_storage_error when there is a data corruption.

Added doAuditStorage() for resuming Audit.

* Load and resume AuditStorage when DD restarts.

* Generate audit id monotonically.

* Fixed minor issue AuditId/Type was not set.

* Adding getLatestAuditStates.

* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.

* Added `audit_storage` fdbcli command.

* fmt.

* Fixed null shared_ptr issue.

* Improve audit data.

* Change DDAuditFailed to SevWarn.

* Sev.

* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.

* Moved AuditUtils* to fdbclient/.

* Added getAuditStatus fdbcli command.

* Refactor audit storage fdb cli commands.

* Added auditStorage in sim.

* Cleanup.

* Resolved comments.

* Resolved comments.

* Test disabling audit for sims.

* Cleanup.

Co-authored-by: He Liu <heliu@apple.com>
2023-01-15 21:46:14 -08:00
FoundationDB CI
86d6106dc1 format source code after switch to clang 15 2022-12-08 17:26:45 +00:00
Nim Wijetunga
97713cadff Update Encryption Mode DB Config Values (#8839)
* add encryption db config

* address pr comments

* address pr comments

* add comments

* add comment

* modify check

* change condition

* address pr comments

* simplify check

* address pr comments
2022-11-22 16:33:59 -08:00
Lukas Joswiak
795b666e23 Fix a rare configuration database data loss bug
See the comment contained in this commit. This bug could only manifest
under a specific set of circumstances:

1. A coordinator change is started
2. The coordinator change succeeds, but its action of clearing
   `previousCoordinatorsKey` is delayed.
3. A minority of `ConfigNode`s have an old state of the configuration
   database, compared to the majority.
4. A `ConfigNode` in the majority dies and permanently loses data.
5. A long delay occurs on the `PaxosConfigConsumer` when it tries to
   read the latest changes from the `ConfigNode`s.

In the above circumstances, the `ConfigBroadcaster` could incorrectly
send a snapshot of an old state of the configuration database to a
majority of `ConfigNode`s. This would cause new, durable, and
acknowledged commit data to be overwritten.

Note that this bug only affects the configuration database (used for
knob storage). It does not affect the normal keyspace.
2022-11-22 11:20:04 -08:00
Markus Pilman
f105cb1809 Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging 2022-11-14 10:11:52 -07:00
Ankita Kejriwal
8bc9a2bffd Add a command to clear storage quota 2022-11-11 17:59:54 -08:00
Ankita Kejriwal
abc4b45af1 Set the storage quota on tenant groups instead of tenants
Update all the relevant data structures and monitors accordingly.
2022-11-10 18:56:43 -08:00
Markus Pilman
2e7385891a fix formatting 2022-11-08 15:26:28 -07:00
Markus Pilman
f1fea14255 Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging 2022-11-01 13:51:35 -06:00
Lukas Joswiak
5ca2b89bdf Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing the cluster to fully
recover all clusters.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-27 13:56:13 -07:00
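The startup check described in commit 5ca2b89bdf can be modeled in a few lines. This is a toy model under stated assumptions: `SimProcess`, `Simulator`, the `switched` flag, and the method names are hypothetical and do not correspond to the real simulator's types; only the `RebootProcessAndSwitch` name comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class SimProcess:
    """Hypothetical process record; field names are illustrative only."""
    name: str
    cluster_file: str
    original_cluster_file: str


class Simulator:
    def __init__(self):
        self.active = {}       # only running processes are tracked
        self.switched = False  # True while the cluster runs on switched files
        self.signals = []      # record of (process name, kill type) sent

    def reboot_and_switch(self, p: SimProcess) -> None:
        self.signals.append((p.name, "RebootProcessAndSwitch"))
        p.cluster_file = p.original_cluster_file

    def start_process(self, p: SimProcess) -> None:
        self.active[p.name] = p
        # The check added on startup: if the cluster has already reverted
        # but this process still carries a switched cluster file, it missed
        # the signal while it was down, so send it again now.
        if not self.switched and p.cluster_file != p.original_cluster_file:
            self.reboot_and_switch(p)
```

A process that was down during the revert gets one extra reboot on startup and ends up on its original cluster file; a process whose file was never changed starts normally with no signal.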
Markus Pilman
2eaf674faa Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging 2022-10-26 09:33:54 -06:00
Ankita Kejriwal
c34a23152c Change the storage quota type from uint64_t to int64_t
With this change, the storage quota will be of the same type
as the storage bytes used returned by `getEstimatedRangeSizeBytes`.
2022-10-21 16:18:52 -07:00
Jingyu Zhou
a8391caf23 Revert "Data loss protection v2" 2022-10-20 18:09:58 -05:00
Markus Pilman
d1c80659b5 Remember disk corruptions and downgrade trace severity if a corruption was injected 2022-10-19 16:18:00 -06:00
Lukas Joswiak
7f889c87e3 Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing the cluster to fully
recover all clusters.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-18 21:37:42 -07:00
He Liu
b52edd8658 Merge branch 'main' of https://github.com/apple/foundationdb into validate-data-consistency 2022-10-10 11:00:05 -07:00
Markus Pilman
23edfd0d59 Fix formatting 2022-10-04 18:33:30 -06:00
Markus Pilman
550488b020 Merge remote-tracking branch 'origin/main' into bugfixes/open-for-ide
# Conflicts:
#	bindings/c/CMakeLists.txt
#	fdbclient/include/fdbclient/GetEncryptCipherKeys.actor.h
#	fdbserver/BackupWorker.actor.cpp
#	fdbserver/BlobWorker.actor.cpp
#	fdbserver/CommitProxyServer.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/StorageCache.actor.cpp
#	fdbserver/include/fdbserver/GetEncryptCipherKeys.actor.h
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/PhysicalShardMove.actor.cpp
#	flow/CMakeLists.txt
2022-10-04 18:27:48 -06:00