* Add restore validation feature with simplified backup gap fix
Implements restore validation using audit_storage to verify backup/restore
correctness. Includes a minimal fix for the backup gap bug.
Key components:
- ValidateRestore audit type: compares source keys against restored keys
at \xff\x02/rlog/ prefix in storage server
- DD audit fixes: propagate validation errors, handle DD failover correctly
- RestoreValidation and BackupAndRestoreValidation workloads for testing
- Simplified backup gap fix: prevent snapshot from finishing in the same
iteration it dispatches the last tasks (single flag + one check)
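The ValidateRestore comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this change: `KeyValueMap`, `kRestorePrefix`, and `findRestoreMismatches` are hypothetical names, and plain in-memory maps stand in for the source and restored key spaces.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a key-value store (not an FDB type).
using KeyValueMap = std::map<std::string, std::string>;

// Prefix under which restored keys are assumed to be written, per the
// description above.
const std::string kRestorePrefix = "\xff\x02/rlog/";

// Returns every source key whose restored copy (prefix + key) is missing
// or holds a different value. An empty result means the restore matches.
std::vector<std::string> findRestoreMismatches(const KeyValueMap& source,
                                               const KeyValueMap& restored) {
    std::vector<std::string> mismatches;
    for (const auto& kv : source) {
        auto it = restored.find(kRestorePrefix + kv.first);
        if (it == restored.end() || it->second != kv.second) {
            mismatches.push_back(kv.first);
        }
    }
    return mismatches;
}
```

This sketch only walks the source side; a fuller check would presumably also flag restored keys that have no source counterpart.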
* remainingBudgetForAuditTasks should be managed within audit
* fix CI
* add audit storage test for various ranges
* clean DD
* new auditStorageUserDataQ
* fix assert fail in startTrackShardAssignment
* fix assert fail in ssaudit
* address comments
* replace assert with audit_cancel in ss audits
* add audit check progress tool
* add observability to audit progress and fix audit bugs
* fix audit progress issues; add a sim test, a trace event, and an fdbcli command to track audit progress
* remove old audit storage on SS
* check audit progress when auditCore completes
* list audits
* cancel audits and corresponding tests
* make audit storage dblock aware
* increase audit retry since we are able to cancel
* fix updateAuditState and fdb github ci
* fmt
* fix fdbcli audit_storage and fix CI issue
* fix fdb cli
* address comments
* fmt
* Added location_metadata fdbcli to query shard locations, assignments, numbers etc.
* Added `listshards` to get some random physical/non-physical shards.
* Resolved comments.
* clean up old audit metadata
* change comments
* fix audit cleanup rule as PR description claims and reduce timeout of auditStorageCorrectness in tester
* address comment
* clear audit metadata should not throw error
* cleanup progress metadata by type
* control number of AuditStatistic events
* carefully persist new audit state
* add unit tests and fix issues
* cleanup
* allow audits of different types to run concurrently and fix some bugs in AuditUtils
* fix ci issue and nits
* Added `get_audit_status checkmigration` to print out the number of data shards and `physical shards`, so that we know the progress of migration to `shard_encode_location_metadata`
* Fixed print format.
* Addressed comments.
* Implemented AuditUtils.actor.cpp
Moved AuditUtils to fdbserver/
* Persist AuditStorageState.
* Passed persisted AuditStorageState test.
* Added audit_storage_error to indicate a corruption is caught.
Throw/Send audit_storage_error when there is a data corruption.
Added doAuditStorage() for resuming Audit.
* Load and resume AuditStorage when DD restarts.
* Generate audit id monotonically.
* Fixed minor issue AuditId/Type was not set.
* Adding getLatestAuditStates.
* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.
* Added `audit_storage` fdbcli command.
* fmt.
* Fixed null shared_ptr issue.
* Improve audit data.
* Change DDAuditFailed to SevWarn.
* Sev.
* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.
* Moved AuditUtils* to fdbclient/.
* Added getAuditStatus fdbcli command.
* Refactor audit storage fdb cli commands.
* Added auditStorage in sim.
* Cleanup.
* Resolved comments.
* Resolved comments.
* Added SystemData for metadata audit.
Refactored audit workflow to make sure all sub-tasks are executed w/o
early exit.
* Improvements.
* Persisted Failed state after too many retries.
* Added retryCount for resumeAuditStorage().
* resolving conflict.
* Resolved conflicts.
* allow-merged-to-run
* add timeout to audit client
* fmt
* validate replica
* add audit serverKey
* address comments and fmt
* fix audit_storage_exceeded_request_limit
* fix segfault in getLatestAuditStatesImpl
* fix bugs
* remove timeout from workload
* fix bugs
* audit local view of shard assignment
* fmt
* fix-stuck-issue-and-make-dd-audit-storage-self-retry
* fix timeout
* fix timeout
* fix bugs and cleanup
* fix nit
* change name state to coreState for audit metadata
* address comments
* code clean
* fmt
* setup debug
* cleanup
* clean up
* code cleanup
* code clean
* remove tmp file
* fmt
* trace the portion of shards that belong to anonymous physical shards
* remove unnecessary actor cleanup
* do not give up when tr is too old
* address commits
* refactor
* clean
* fmt
* fix-command-help-text
* fix-auditstate-restore-and-enable-restore-to-metadata-audit
* address comments
* fmt
* debug and improve efficiency of resume audit
* small change
* fix audit cli
* bypass completed audit when dd restart
* fix auditStorageCommandActor
* make mismatch key range more visible
* address comments
* allow the local shard metadata check to make progress via retries
* address comments
* address comments
* partition location metadata validation by range and server
* unset MIN_TRACE_SEVERITY
* address comments and SS auto proceed until failed then notify dd
* persistNewAuditState should checkMoveKeysLock
* audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure
* code cleanup
* fix error message in metadata validation
* fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation
* add comments to code and add a guard so that the SS audit does not proceed automatically many times without being notified by DD (to support audit cancellation later)
* fix coalesceRangeList
* replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo
* simplify shard assignment history
* shardAssignmentRecordRequests should be unordered_map
* address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS
* only run validate shard info once at a time; other audit types do not have this limitation
---------
Co-authored-by: He Liu <heliu05023@gmail.com>
Co-authored-by: He Liu <heliu@apple.com>
Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>
* Test disabling audit for sims.
* Cleanup.
Co-authored-by: He Liu <heliu@apple.com>