* range lock framework
* improve the framework
* persist to txnStateStore
* fix bugs
* code clean
* code clean
* bug fix
* address comments
* add complex test workload and fix bugs found by the workload
* add workload correctness check and fix bugs
* code clean up
* add random range lock injection
* fix bugs in RandomRangeLock.actor.cpp
* enable random range lock injection in general workloads
* add rangelockcycle test
* disable random range lock in backup workloads
* nits
* add range lock ownership concept
* enable lock ownership to rangeLock
* api deal with tenant
* fix CI
* add test for multiple rangeLock owners
* nits
* address comments and renaming
* address comments
* acs framework
* code refactor and fix bugs
* add ss crash loop protector
* use sharedptr instead of raw pointer
* fixed critical bugs and add provate mutation acs to the framework
* enable ACS for all mutations except for clear serverTag mutation and fix bugs
* fix restarting tests
* refactor code and fix bugs
* fix AccumulativeChecksumState toString
* fix bugs
* allow all mutations in acs and fixed bugs
* fix bugs and code cleanup
* code clean up for adding recovery support
* simplify code and support recovery
* clear acs state at ss
* fix bug
* terminate validator if ss will be removed in the current batch
* simplify code
* add trace
* address comments
* optimize code
* deep copy when adding mutation to acs validator
* warp encode and decode persist acs key
* make acstable private
* remove unless func
* remove unless func
* remove epoch in ACS validator
* add acs mutation counter in SS metrics
* code cleanup and make knob check better
* make mutation buffer global
* simplify code
* add comments
* make knob randomly set
* address comments
* ss reboot after acs mismatch found
To retrieve storage metadata for every status json request is very expensive
for clusters with a large number of storage servers. So I change the logic so
that ClusterController actively monitors changes to storage metadata, and only
retrieves them when there is a change.
* Remove duplicate getRange() for DB handles and update existing GetRange to accept DB handles.
* Initial progress checkpoint on new ConsistencyScan role.
* Updated TODOs, finished most if not all state updates.
* placeholder
* Add more TODOs, documentation and comment improvements.
* Checkpoint round state to avoid advancing progress if commit fails.
* Bug fix, check is supposed to be for overlap, not lack of overlap.
* Added more TODO's and added faked read results / exceptions and faked DB size retrieval to prove the consistencyScanCore logic works.
* Update JSON schemas and command help.
* Add comment about lifetime stats reset.
* More TODO comments and some renames for clarity, some bug fixes.
* properly stopping consistency scan in simulation so that it doesn't run forever and cause quiet database to fail
* removing trailing comma from consistency_scan json schema
* Making CC inconsistency not an error if it's intentional tss corruption
* consistency scan actually reads storage locations
* added check that consistency scan actually completes a round in simulation, fixed bug and added debugging around consistency scan getting stuck
* made consistency scan properly fetch database size
* refactoring data check to be used in both consistency scan and consistency check
* checking that consistency scan always completes at least one round and doesn't get stuck
* cleanup
* fixing ide build
* consistencyscan fdbcli command wasn't actually changing db state
* consistencyscan fdbcli command always said enabled even when it wasn't
---------
Co-authored-by: Steve Atherton <steve.atherton@snowflake.com>
* clean up old audit metadata
* change comments
* fix audit cleanup rule as PR description claim and reduce timeout of auditStorageCorrectness in tester
* address comment
* clear audit metadata should not throw error
* cleanup progress metadata by type
* control number of AuditStatistic events
* carefully persist new audit state
* add unit tests and fix issues
* cleanup
* allow audit concurrent run for different types and fix some bug in auditutl
* fix ci issue and nits
* cleanup audit progress metadata and tester directly issue audit requests to DD instead of CC
* address comments and fix test dd issue request but dd not present
* Implemented AuditUtils.actor.cpp
Moved AuditUtils to fdbserver/
* Persist AuditStorageState.
* Passed persisted AuditStorageState test.
* Added audit_storage_error to indicate a corruption is caught.
Throw/Send audit_storage_error when there is a data corruption.
Added doAuditStorage() for resuming Audit.
* Load and resume AuditStorage when DD restarts.
* Generate audit id monotonically.
* Fixed minor issue AuditId/Type was not set.
* Adding getLatestAuditStates.
* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.
* Added `audit_storage` fdbcli command.
* fmt.
* Fixed null shared_ptr issue.
* Improve audit data.
* Change DDAuditFailed to SevWarn.
* Sev.
* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.
* Moved AuditUtils* to fdbclient/.
* Added getAuditStatus fdbcli command.
* Refactor audit storage fdb cli commands.
* Added auditStorage in sim.
* Cleanup.
* Resolved comments.
* Resolved comments.
* Added SystemData for metadata audit.
Refactored audit workflow to make sure all sub-tasks are executed w/o
early exit.
* Improvements.
* Persisted Failed state after too many retries.
* Added retryCount for resumeAuditStorage().
* resolving conflict.
* Resolved conflicts.
* allow-merged-to-run
* add timeout to audit client
* fmt
* validate replica
* add audit serverKey
* address comments and fmt
* fix audit_storage_exceeded_request_limit
* fix segfault in getLatestAuditStatesImpl
* fix bugs
* remove timeout from workload
* fix bugs
* audit local view of shard assignment
* fmt
* fix-stuck-issue-and-make-dd-audit-storage-self-retry
* fix timeout
* fix timeout
* fix bugs and cleanup
* fix nit
* change name state to coreState for audit metadata
* address comments
* code clean
* fmt
* setup debug
* cleanup
* clean up
* code cleanup
* code clean
* remove tmp file
* fmt
* trace portion of shards that of anonymous physical shard
* remove unnecessary actor cleanup
* do not give up when tr is too old
* address commits
* refactor
* clean
* fmt
* fix-command-help-text
* fix-auditstate-restore-and-enable-restore-to-metadata-audit
* address comments
* fmrt
* debug and improve efficient of resume audit
* small change
* fix audit cli
* bypass completed audit when dd restart
* fix auditStorageCommandActor
* make mismatch key range more visable
* address comments
* make local shard metadata check can make progress by retries
* address comments
* address comments
* partition location metadata validation by range and server
* unset MIN_TRACE_SEVERITY
* address comments and SS auto proceed until failed then notify dd
* persistNewAuditState should checkMoveKeysLock
* audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure
* code cleanup
* fix error message in metadata validation
* fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation
* add comments to code and add guard to make sure the SS audit does not proceeds automatically for many times without being notified by DD --- to support audit cancellation later
* fix coalesceRangeList
* replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo
* simplify shard assignment history
* shardAssignmentRecordRequests should be unorder_map
* address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS
* only run validate shard info once at a time, other audit type does not have this limitation
---------
Co-authored-by: He Liu <heliu05023@gmail.com>
Co-authored-by: He Liu <heliu@apple.com>
Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>