apple-foundationdb

mirror of https://github.com/apple/foundationdb.git synced 2026-01-25 12:28:19 +00:00

Author	SHA1	Message	Date
Zhe Wang	42e17d8bd1	BulkLoading Use RangeLock (#11741 ) * use range lock in bulk load * refactor BulkLoading workload and nits * add background traffic * nits * address comments	2024-10-31 12:58:13 -07:00
Zhe Wang	43446204ed	Database Per-Range Lock (#11693 ) * range lock framework * improve the framework * persist to txnStateStore * fix bugs * code clean * code clean * bug fix * address comments * add complex test workload and fix bugs found by the workload * add workload correctness check and fix bugs * code clean up * add random range lock injection * fix bugs in RandomRangeLock.actor.cpp * enable random range lock injection in general workloads * add rangelockcycle test * disable random range lock in backup workloads * nits * add range lock ownership concept * enable lock ownership to rangeLock * api deal with tenant * fix CI * add test for multiple rangeLock owners * nits * address comments and renaming * address comments	2024-10-23 16:25:56 -07:00
Jingyu Zhou	2313fdaa0e	Add rocksdb, sharded rocksdb to configure workload (#11654 ) * Add rocksdb, sharded rocksdb to configure workload Also remove mentioning of ssd-redwood-1-experimental. * Fix test failure when SHARD_ENCODE_LOCATION_METADATA is off	2024-09-12 21:03:06 -07:00
Syed Paymaan Raza	c3e7542cda	Update end year in copyright header	2024-08-02 09:40:11 -07:00
Zhe Wang	74990e44bd	Bulk Loading Framework (#11369 )	2024-07-23 14:57:28 -07:00
Jingyu Zhou	d9e4c49503	Fix more -Wunused-variable warnings	2024-07-17 15:35:49 -07:00
Johannes M. Scheuermann	64b45088ae	Make sure server list is validated against the excluded localites	2023-10-24 10:28:23 +02:00
Jingyu Zhou	ad259d48cf	Merge pull request #11001 from jzhou77/main Fix RemoveServersSafely workload timeout error	2023-10-23 09:26:33 -07:00
Johannes M. Scheuermann	fa1fd913a6	Issue get worker request concurrently	2023-10-20 10:26:32 +02:00
Jingyu Zhou	f99a8151a0	Fix RemoveServersSafely workload timeout error This workload can have timeout error when using locality-based exclusion. The sequence is: 1. RemoveServerSafely workload exclude locality by processid 2. Attrition reboots the target process, thus changing the processid, because processid is generated for each worker process at fdbd() 3. RemoveServerSafely waits for the process exclusion, which never succeed 4. Timeout The fix monitors processid locality changes and reissue the exclusion with the correct locality. To reproduce: seed: -f ./tests/fast/SwizzledRollbackSideband.toml -s 879108103 -b on commit: `a3dbd4baf` release-7.1	2023-10-19 21:46:44 -07:00
Johannes M. Scheuermann	05383066b4	Make sure the get all excluded servers method returns servers that are excluded based on locality	2023-10-19 10:06:29 +02:00
William Dowling	0f752473be	Merge branch 'main' into radixtree-production	2023-09-25 09:52:20 +02:00
Zhe Wang	7e8f326277	Audit storage for specific engine (#10781 ) * audit storage for specific engine * fix getStorageType * fix budget of skipAuditOnRange * fix budget in scheduleAuditOnRange * fix CI error * improve trace events * address comments	2023-08-23 10:51:24 -07:00
Zhe Wu	ab4ae712e8	Add PerpetualWiggleStorageMigrationWorkload documentation.	2023-08-10 09:35:57 -07:00
Zhe Wu	863038a44c	Add improvement for initializing storage server using new perpetual_wiggle_storage_engine config	2023-08-10 09:35:57 -07:00
Zhe Wang	3426fc3c1a	Add DD Security Mode (#10646 ) * dd-security-mode * address comments * cleanup * revise tr option set in loadAndUpdateAuditMetadataWithNewDDId * address comments * reset auditStorageInitStarted before DD init * decouple audit resume and audit launch * audit launch new request should wait for resuming existing requests * address comment/clean up/fix * fix * fix initAuditMetadata retry * fix initAuditMetadata retry should reset tr	2023-07-21 17:06:25 -07:00
William Dowling	8625eb68b0	Allow both memory-radixtree and memory-radixtree-beta modes	2023-07-06 14:51:23 +02:00
William Dowling	3ea1ba1648	Remove beta status from RadixTree storage engine	2023-07-05 17:54:54 +02:00
Hui Liu	630013cfd9	Fix MoveKeysClean.toml failure (#10470 )	2023-06-13 08:45:03 -07:00
Zhe Wang	f8f8f72c4e	Add audit storage cancellation (#10386 ) * list audits * cancel audits and corresponding tests * make audit storage dblock aware * increase audit retry since we are able to cancel * fix updateAuditState and fdb github ci * fmt * fix fdbcli audit_storage and fix CI issue * fix fdb cli * address comments * fmt	2023-06-06 14:29:53 -07:00
A.J. Beamon	ccf61ac2e5	Do not allow changing the coordinators to a set that is unreliable, because otherwise we could delete our coordinated state	2023-05-03 15:03:03 -07:00
Zhe Wang	d6e7b5f736	Audit storage: validate consistency of replica and shard location metadata (#9628 ) * Implemented AuditUtils.actor.cpp Moved AuditUtils to fdbserver/ * Persist AuditStorageState. * Passed persisted AuditStorageState test. * Added audit_storage_error to indicate a corruption is caught. Throw/Send audit_storage_error when there is a data corruption. Added doAuditStorage() for resuming Audit. * Load and resume AuditStorage when DD restarts. * Generate audit id monotonically. * Fixed minor issue AuditId/Type was not set. * Adding getLatestAuditStates. * Improved persisted errors and added AuditStorageCommand.actor.cpp for fdbcli. * Added `audit_storage` fdbcli command. * fmt. * Fixed null shared_ptr issue. * Improve audit data. * Change DDAuditFailed to SevWarn. * Sev. * set SERVE_AUDIT_STORAGE_PARALLELISM to 1. * Moved AuditUtils* to fdbclient/. * Added getAuditStatus fdbcli command. * Refactor audit storage fdb cli commands. * Added auditStorage in sim. * Cleanup. * Resolved comments. * Resolved comments. * Added SystemData for metadata audit. Refactored audit workflow to make sure all sub-tasks are executed w/o early exit. * Improvements. * Persisted Failed state after too many retries. * Added retryCount for resumeAuditStorage(). * resolving conflict. * Resolved conflicts. * allow-merged-to-run * add timeout to audit client * fmt * validate replica * add audit serverKey * address comments and fmt * fix audit_storage_exceeded_request_limit * fix segfault in getLatestAuditStatesImpl * fix bugs * remove timeout from workload * fix bugs * audit local view of shard assignment * fmt * fix-stuck-issue-and-make-dd-audit-storage-self-retry * fix timeout * fix timeout * fix bugs and cleanup * fix nit * change name state to coreState for audit metadata * address comments * code clean * fmt * setup debug * cleanup * clean up * code cleanup * code clean * remove tmp file * fmt * trace portion of shards that of anonymous physical shard * remove unnecessary actor cleanup * do not give up when tr is too old * address commits * refactor * clean * fmt * fix-command-help-text * fix-auditstate-restore-and-enable-restore-to-metadata-audit * address comments * fmrt * debug and improve efficient of resume audit * small change * fix audit cli * bypass completed audit when dd restart * fix auditStorageCommandActor * make mismatch key range more visable * address comments * make local shard metadata check can make progress by retries * address comments * address comments * partition location metadata validation by range and server * unset MIN_TRACE_SEVERITY * address comments and SS auto proceed until failed then notify dd * persistNewAuditState should checkMoveKeysLock * audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure * code cleanup * fix error message in metadata validation * fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation * add comments to code and add guard to make sure the SS audit does not proceeds automatically for many times without being notified by DD --- to support audit cancellation later * fix coalesceRangeList * replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo * simplify shard assignment history * shardAssignmentRecordRequests should be unorder_map * address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS * only run validate shard info once at a time, other audit type does not have this limitation --------- Co-authored-by: He Liu <heliu05023@gmail.com> Co-authored-by: He Liu <heliu@apple.com> Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>	2023-05-01 10:35:52 -07:00
Josh Slocum	aef5130da2	adding system priority option to getDatabaseConfiguration, and several debugging improvements (#9864 )	2023-04-06 15:08:40 -05:00
Zhe Wu	50a20946d1	Implement check if locality is already excluded in exclude locality command	2023-04-01 19:04:58 -07:00
Zhe Wu	6e1bb08677	Update documentation	2023-03-24 15:29:47 -07:00
Zhe Wu	8211b5d097	Add a check in excludeServer function that if the exclusion list already exists, don't need to issue new writes.	2023-03-24 14:57:31 -07:00
Steve Atherton	50d567b5a5	Refactored some parts of database configuration to support log_engine=<name> and storage_engine=<name> and generate these when converting a DatabaseConfig JSON object to a `configure` command. Refactored `fileconfigure` and simulation setup to use the same JSON -> configure function as the same code was copy/pasted to both places but only one has been kept up to date with new features. Renamed Redwood to `ssd-redwood-1` canonically but the experimental name is still supported for backward compatibility.	2023-03-04 20:52:31 -08:00
Jingyu Zhou	9a257a60a4	Address review comments	2023-02-24 10:47:32 -08:00
Jingyu Zhou	1f1dc5e768	Allow a comma separated list of excluded addresses	2023-02-23 14:29:08 -08:00
Jingyu Zhou	6ac8720364	Add exclude to fdbcli's configure command Right now this only allows one server address being excluded. This is useful when the database is unavailable but we want the recruitment to skip some particular processes. Manually tested the concept works with a loopback cluster.	2023-02-23 14:28:20 -08:00
A.J. Beamon	72c5abc0f5	Refactor storage quotas to store them in a key backed map in the tenant metadata space	2023-01-25 20:48:17 -08:00
Xiaoge Su	0a60142160	Extract ProcessInfo, MachineInfo, KillType out from ISimulator	2023-01-24 14:48:42 -08:00
He Liu	00203c8732	Validate Storage part II (#8471 ) * Implemented AuditUtils.actor.cpp Moved AuditUtils to fdbserver/ * Persist AuditStorageState. * Passed persisted AuditStorageState test. * Added audit_storage_error to indicate a corruption is caught. Throw/Send audit_storage_error when there is a data corruption. Added doAuditStorage() for resuming Audit. * Load and resume AuditStorage when DD restarts. * Generate audit id monotonically. * Fixed minor issue AuditId/Type was not set. * Adding getLatestAuditStates. * Improved persisted errors and added AuditStorageCommand.actor.cpp for fdbcli. * Added `audit_storage` fdbcli command. * fmt. * Fixed null shared_ptr issue. * Improve audit data. * Change DDAuditFailed to SevWarn. * Sev. * set SERVE_AUDIT_STORAGE_PARALLELISM to 1. * Moved AuditUtils* to fdbclient/. * Added getAuditStatus fdbcli command. * Refactor audit storage fdb cli commands. * Added auditStorage in sim. * Cleanup. * Resolved comments. * Resolved comments. * Test disabling audit for sims. * Cleanup. Co-authored-by: He Liu <heliu@apple.com>	2023-01-15 21:46:14 -08:00
FoundationDB CI	86d6106dc1	format source code after switch to clang 15	2022-12-08 17:26:45 +00:00
Nim Wijetunga	97713cadff	Update Encryption Mode DB Config Values (#8839 ) * add encryption db config * address pr comments * address pr comments * add comments * add comment * modify check * change condition * address pr comments * simplify check * address pr comments	2022-11-22 16:33:59 -08:00
Lukas Joswiak	795b666e23	Fix a rare configuration database data loss bug See the comment contained in this commit. This bug could only manifest under a specific set of circumstances: 1. A coordinator change is started 2. The coordinator change succeeds, but its action of clearing `previousCoordinatorsKey` is delayed. 3. A minority of `ConfigNode`s have an old state of the configuration database, compared to the majority. 4. A `ConfigNode` in the majority dies and permanently loses data. 5. A long delay occurs on the `PaxosConfigConsumer` when it tries to read the latest changes from the `ConfigNode`s. In the above circumstances, the `ConfigBroadcaster` could incorrectly send a snapshot of an old state of the configuration database to a majority of `ConfigNode`s. This would cause new, durable, and acknowledged commit data to be overwritten. Note that this bug only affects the configuration database (used for knob storage). It does not affect the normal keyspace.	2022-11-22 11:20:04 -08:00
Markus Pilman	f105cb1809	Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging	2022-11-14 10:11:52 -07:00
Ankita Kejriwal	8bc9a2bffd	Add a command to clear storage quota	2022-11-11 17:59:54 -08:00
Ankita Kejriwal	abc4b45af1	Set the storage quota on tenant groups instead of tenants Update all the relevant data structures and monitors accordingly.	2022-11-10 18:56:43 -08:00
Markus Pilman	2e7385891a	fix formatting	2022-11-08 15:26:28 -07:00
Markus Pilman	f1fea14255	Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging	2022-11-01 13:51:35 -06:00
Lukas Joswiak	5ca2b89bdf	Fix simulation issue where process switch was ignored The simulator tracks only active processes. Rebooted or killed processes are removed from the list of processes, and only get added back when the process is rebooted and starts up again. This causes a problem for the `RebootProcessAndSwitch` kill type, which wants to simultaneously reboot all machines in a cluster and change their cluster file. If a machine is currently being rebooted, it will miss the reboot process and switch command. The fix is to add a check when a process is being started in simulation. If the process has had its cluster file changed and the cluster is in a state where all processes should have had their cluster files reverted to the original value, the simulator will now send a `RebootProcessAndSwitch` signal right when the process is started. This will cause an extra reboot, but should correctly switch the process back to its original, correct cluster file, allowing the cluster to fully recover all clusters. Note that the above issue should only affect simulation, due to how the simulator tracks processes and handles kill signals. This commit also adds a field to each process struct to determine whether the process is being run in a DR cluster in the simulation run. This is needed because simulation does not differentiate between processes in different clusters (other than by the IP), and some processes needed to switch clusters and some simply needed to be rebooted.	2022-10-27 13:56:13 -07:00
Markus Pilman	2eaf674faa	Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging	2022-10-26 09:33:54 -06:00
Ankita Kejriwal	c34a23152c	Change the storage quota type from unit64_t to int64_t With this change, the storage quota will be of the same type as the storage bytes used returned by `getEstimatedRangeSizeBytes`.	2022-10-21 16:18:52 -07:00
Jingyu Zhou	a8391caf23	Revert "Data loss protection v2"	2022-10-20 18:09:58 -05:00
Markus Pilman	d1c80659b5	Remember disk corruptions and downgrade trace severity if a corruption was injected	2022-10-19 16:18:00 -06:00
Lukas Joswiak	7f889c87e3	Fix simulation issue where process switch was ignored The simulator tracks only active processes. Rebooted or killed processes are removed from the list of processes, and only get added back when the process is rebooted and starts up again. This causes a problem for the `RebootProcessAndSwitch` kill type, which wants to simultaneously reboot all machines in a cluster and change their cluster file. If a machine is currently being rebooted, it will miss the reboot process and switch command. The fix is to add a check when a process is being started in simulation. If the process has had its cluster file changed and the cluster is in a state where all processes should have had their cluster files reverted to the original value, the simulator will now send a `RebootProcessAndSwitch` signal right when the process is started. This will cause an extra reboot, but should correctly switch the process back to its original, correct cluster file, allowing the cluster to fully recover all clusters. Note that the above issue should only affect simulation, due to how the simulator tracks processes and handles kill signals. This commit also adds a field to each process struct to determine whether the process is being run in a DR cluster in the simulation run. This is needed because simulation does not differentiate between processes in different clusters (other than by the IP), and some processes needed to switch clusters and some simply needed to be rebooted.	2022-10-18 21:37:42 -07:00
He Liu	b52edd8658	Merge branch 'main' of https://github.com/apple/foundationdb into validate-data-consistency	2022-10-10 11:00:05 -07:00
Markus Pilman	23edfd0d59	Fix formatting	2022-10-04 18:33:30 -06:00
Markus Pilman	550488b020	Merge remote-tracking branch 'origin/main' into bugfixes/open-for-ide # Conflicts: # bindings/c/CMakeLists.txt # fdbclient/include/fdbclient/GetEncryptCipherKeys.actor.h # fdbserver/BackupWorker.actor.cpp # fdbserver/BlobWorker.actor.cpp # fdbserver/CommitProxyServer.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/StorageCache.actor.cpp # fdbserver/include/fdbserver/GetEncryptCipherKeys.actor.h # fdbserver/storageserver.actor.cpp # fdbserver/workloads/PhysicalShardMove.actor.cpp # flow/CMakeLists.txt	2022-10-04 18:27:48 -06:00

1 2 3 4 5 ...

424 Commits