13 Commits

Author SHA1 Message Date
Zhe Wang
ecf0746709 [Release-7.3] Cherry-pick multiple fixes to AuditStorage (#10895)
* fix audit state serializer (#10889)

* Fix SSShard Audit (#10896)

* fix ssshard

* address comments

* fmt

* fix multiple requests from DDDoAudit (#10902)

* address conflict
2023-09-15 15:55:37 -07:00
Zhe Wang
57e2f08c50 [Release-7.3] Cherry-pick audit storage optimizations (#10840)
* Multiple improvements to AuditStorages (#10685)

* remove dangerous DDAudit assert, add AuditRate knob, add progress check when ssshard complete, add progress check for ssshard in fdbcli

* throttle progress check for ssshard

* fix getAuditProgressByServer

* fix trace event for ss audit

* using name -- checkMoveKeysLockForAudit

* new scheduleAuditLocationMetadata

* address comments

* shorten progress summary for ssshard

* simplify getAuditProgressByServer in fdbcli

* Audit storage for specific engine (#10781)

* audit storage for specific engine

* fix getStorageType

* fix budget of skipAuditOnRange

* fix budget in scheduleAuditOnRange

* fix CI error

* improve trace events

* address comments

* Audit location metadata in DD (#10820)

* Audit location metadata in DD

* nits

* Fix auditStorage: Audit task should not retry if the task is issued by an outdated DD (copy from main PR 10844)
2023-08-29 11:09:28 -07:00
Zhe Wang
69a4b15907 [Release-7.3] Cherry-pick new implementation of auditing replica and complete check of location metadata (#10631) 2023-07-19 16:09:31 -07:00
Zhe Wang
aa47c0c722 [Release 7.3] Cherry-pick Add audit storage cancellation (#10386) (#10430)
* Add audit storage cancellation (#10386)

* list audits

* cancel audits and corresponding tests

* make audit storage dblock aware

* increase audit retry since we are able to cancel

* fix updateAuditState and fdb github ci

* fmt

* fix fdbcli audit_storage and fix CI issue

* fix fdb cli

* address comments

* fmt

* Fix audit storage actor cancel issue (#10443)

* init

* add testAuditStorageConcurrentRunForSameType test

* init (#10458)
2023-06-09 10:43:54 -07:00
He Liu
32e8ba7474 Added location_metadata fdbcli to query shard locations, assignments… (#10395) (#10428)
* Added location_metadata fdbcli to query shard locations, assignments, numbers etc.

* Added `listshards` to get some random physical/non-physical shards.

* Resolved comments.
2023-06-07 15:08:19 -07:00
Zhe Wang
92d7f5ce47 SS Audit Storage Throttling (#10322)
* ss audit storage throttling

* add audit manager to ss

* reduce CONCURRENT_AUDIT_TASK_COUNT_MAX

* revises comments

* fix audit cli

* fix getAuditStates

* remove toStringForCLI
2023-05-29 14:46:51 -07:00
Zhe Wang
0d8406dde3 Adding cleanup of old audit metadata (#10137)
* clean up old audit metadata

* change comments

* fix audit cleanup rule as PR description claims and reduce timeout of auditStorageCorrectness in tester

* address comment

* clear audit metadata should not throw error

* cleanup progress metadata by type

* control number of AuditStatistic events

* carefully persist new audit state

* add unit tests and fix issues

* cleanup

* allow audit concurrent run for different types and fix some bugs in AuditUtils

* fix ci issue and nits
2023-05-10 21:18:21 -07:00
Zhe Wang
05f416d1b6 cleanup merge notion 2023-05-10 21:11:31 -07:00
Zhe Wang
298ab3f662 Audit storage: validate consistency of replica and shard location metadata (#9628)
* Implemented AuditUtils.actor.cpp

Moved AuditUtils to fdbserver/

* Persist AuditStorageState.

* Passed persisted AuditStorageState test.

* Added audit_storage_error to indicate a corruption is caught.

Throw/Send audit_storage_error when there is a data corruption.

Added doAuditStorage() for resuming Audit.

* Load and resume AuditStorage when DD restarts.

* Generate audit id monotonically.

* Fixed minor issue AuditId/Type was not set.

* Adding getLatestAuditStates.

* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.

* Added `audit_storage` fdbcli command.

* fmt.

* Fixed null shared_ptr issue.

* Improve audit data.

* Change DDAuditFailed to SevWarn.

* Sev.

* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.

* Moved AuditUtils* to fdbclient/.

* Added getAuditStatus fdbcli command.

* Refactor audit storage fdb cli commands.

* Added auditStorage in sim.

* Cleanup.

* Resolved comments.

* Resolved comments.

* Added SystemData for metadata audit.

Refactored audit workflow to make sure all sub-tasks are executed w/o
early exit.

* Improvements.

* Persisted Failed state after too many retries.

* Added retryCount for resumeAuditStorage().

* resolving conflict.

* Resolved conflicts.

* allow-merged-to-run

* add timeout to audit client

* fmt

* validate replica

* add audit serverKey

* address comments and fmt

* fix audit_storage_exceeded_request_limit

* fix segfault in getLatestAuditStatesImpl

* fix bugs

* remove timeout from workload

* fix bugs

* audit local view of shard assignment

* fmt

* fix-stuck-issue-and-make-dd-audit-storage-self-retry

* fix timeout

* fix timeout

* fix bugs and cleanup

* fix nit

* change name state to coreState for audit metadata

* address comments

* code clean

* fmt

* setup debug

* cleanup

* clean up

* code cleanup

* code clean

* remove tmp file

* fmt

* trace portion of shards that of anonymous physical shard

* remove unnecessary actor cleanup

* do not give up when tr is too old

* address commits

* refactor

* clean

* fmt

* fix-command-help-text

* fix-auditstate-restore-and-enable-restore-to-metadata-audit

* address comments

* fmt

* debug and improve efficiency of resume audit

* small change

* fix audit cli

* bypass completed audit when dd restart

* fix auditStorageCommandActor

* make mismatch key range more visible

* address comments

* make local shard metadata check able to make progress via retries

* address comments

* address comments

* partition location metadata validation by range and server

* unset MIN_TRACE_SEVERITY

* address comments and SS auto proceed until failed then notify dd

* persistNewAuditState should checkMoveKeysLock

* audit storage location metadata partitioned by range and move shard assignment history def to the end of SS structure

* code cleanup

* fix error message in metadata validation

* fix registerAuditsForShardAssignmentHistoryCollection input for local shard validation

* add comments to code and add guard to make sure the SS audit does not proceed automatically many times without being notified by DD --- to support audit cancellation later

* fix coalesceRangeList

* replace rangeOverlapping func with operator and use struct instead of complicated type for return value of getKeyServer/serverKey/shardInfo

* simplify shard assignment history

* shardAssignmentRecordRequests should be unordered_map

* address comments, make trackShardAssignment simple, make anyChildAuditFailed cover all audit children, keep only one audit actor run at a time on each SS

* only run validate shard info once at a time, other audit types do not have this limitation

---------

Co-authored-by: He Liu <heliu05023@gmail.com>
Co-authored-by: He Liu <heliu@apple.com>
Co-authored-by: Zhe Wang <zhewang@Zhes-Laptop.local>
2023-05-10 20:59:04 -07:00
He Liu
de592cb983 Added get_audit_status checkmigration to print out the number of da… (#10188)
* Added `get_audit_status checkmigration` to print out the number of data shards and `physical shards`, so that we know the progress of migration to `shard_encode_location_metadata`

* Fixed print format.

* Addressed comments.
2023-05-10 13:19:46 -07:00
He Liu
eb4839d046 Fix audit_storage issues (#9265)
* Erase audit from DD after completion.

* Use CLIENT_KNOBS->TOO_MANY as the default number of rows for
get_audit_status.
2023-01-31 15:39:01 -08:00
stack
2770767255 Edit of audit_storage usage and help message.
Add explanation of command return. Amend return format (slightly).
2023-01-27 15:56:28 -08:00
He Liu
00203c8732 Validate Storage part II (#8471)
* Implemented AuditUtils.actor.cpp

Moved AuditUtils to fdbserver/

* Persist AuditStorageState.

* Passed persisted AuditStorageState test.

* Added audit_storage_error to indicate a corruption is caught.

Throw/Send audit_storage_error when there is a data corruption.

Added doAuditStorage() for resuming Audit.

* Load and resume AuditStorage when DD restarts.

* Generate audit id monotonically.

* Fixed minor issue AuditId/Type was not set.

* Adding getLatestAuditStates.

* Improved persisted errors and added AuditStorageCommand.actor.cpp for
fdbcli.

* Added `audit_storage` fdbcli command.

* fmt.

* Fixed null shared_ptr issue.

* Improve audit data.

* Change DDAuditFailed to SevWarn.

* Sev.

* set SERVE_AUDIT_STORAGE_PARALLELISM to 1.

* Moved AuditUtils* to fdbclient/.

* Added getAuditStatus fdbcli command.

* Refactor audit storage fdb cli commands.

* Added auditStorage in sim.

* Cleanup.

* Resolved comments.

* Resolved comments.

* Test disabling audit for sims.

* Cleanup.

Co-authored-by: He Liu <heliu@apple.com>
2023-01-15 21:46:14 -08:00