Commit Graph

7476 Commits

Author SHA1 Message Date
Zhe Wang
ca4ab1eca9 Fix traceTooManyEvents and externalTimeouts in BulkLoad test (#11769) 2024-11-11 11:05:43 -08:00
Syed Paymaan Raza
7a5f61cc65 Address feedback in PR #11753 (#11757) 2024-11-05 20:30:40 -08:00
Zhe Wang
ab9ce0df15 Cherrypick recent DD changes from release-7.3 (#11754)
* [Release-7.3] TeamRedundant and TeamUnhealthy data moves choose best destination with probability (#11668)

* team redundant and unhealthy data moves can choose best dest with probability

* nits

* nits

* enable wantTrueBestIfMoveout

* fix getteam stuck

* [Release-7.3] Delay team remover when space pivot is low (#11665)

* [Release-7.3] Validate ServerTeam count per server in simulation (#11678)

* validate server team count in simulation

* change naming (not relevant to the PR title)

* address comments and add a new trace event BuildTeamsLastBuildTeamsFailed triggered when buildTeam fails
2024-11-05 18:52:40 -08:00
Syed Paymaan Raza
84fb8f843c Gray failure allows storage servers to complain (#11753) 2024-11-05 16:53:02 -08:00
Syed Paymaan Raza
7d529ec724 Invalidate gray failure complaints from excluded processes (#11749) 2024-11-04 13:49:25 -08:00
Zhe Wang
42e17d8bd1 BulkLoading Use RangeLock (#11741)
* use range lock in bulk load

* refactor BulkLoading workload and nits

* add background traffic

* nits

* address comments
2024-10-31 12:58:13 -07:00
neethuhaneesha
64030de741 Max range deletions knob update to prevent OOMs. (#11738) 2024-10-28 11:45:44 -07:00
Yao Xiao
afbcf5ef5f Enable backward read in consistency checker. (#11721)
* Do backward reads in consistency checker.

* Add knob for read options in consistency checker.
2024-10-26 09:57:36 -07:00
Syed Paymaan Raza
36b113993c [gray_failure] Update CC_ONLY_CONSIDER_INTRA_DC_LATENCY knob documentation (#11728) 2024-10-24 07:48:37 -07:00
Zhe Wang
43446204ed Database Per-Range Lock (#11693)
* range lock framework

* improve the framework

* persist to txnStateStore

* fix bugs

* code clean

* code clean

* bug fix

* address comments

* add complex test workload and fix bugs found by the workload

* add workload correctness check and fix bugs

* code clean up

* add random range lock injection

* fix bugs in RandomRangeLock.actor.cpp

* enable random range lock injection in general workloads

* add rangelockcycle test

* disable random range lock in backup workloads

* nits

* add range lock ownership concept

* enable lock ownership to rangeLock

* make the API handle tenants

* fix CI

* add test for multiple rangeLock owners

* nits

* address comments and renaming

* address comments
2024-10-23 16:25:56 -07:00
Syed Paymaan Raza
5f480947ad [fdbserver] Gray failure and simulator improvements related to remote processes (#11717)
* [fdbserver][simulator] Add remoteDesiredTLogCount option

* [fdbserver][simulator] Allow explicitly specifying number of stateless classes in each DC

* [fdbserver][gray_failure] RemoteTLog lagging SS simulation test

* [fdbserver][gray_failure] Consider remote processes + CC inter/intra latency awareness

* [fdbserver][cc] Make processInSameDC O(1)
2024-10-23 13:15:29 -07:00
Dan Lambright
a87e940e05 fix bug TxnStateStoreCycleTest for version vector (#11723)
* fix bug TxnStateStoreCycleTest for version vector

* Respond to review comment

---------

Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-10-23 15:53:15 -04:00
Yao Xiao
7290369aac Use a single iterator pool for all physical shards. (#11699)
* Rewrite iterator pool.

* simulation fix
2024-10-15 17:28:54 -07:00
Jingyu Zhou
7c6c8ae095 Merge pull request #11709 from yao-xiao-github/knob-main
Update sharded rocksdb knobs.
2024-10-10 15:11:43 -07:00
Yao Xiao
6a87e6042f update knobs 2024-10-10 13:51:50 -07:00
Zhe Wang
fcb0030883 add probability that Memory gets selected 2024-10-10 11:24:40 -07:00
Zhe Wang
7d95b87483 improve the probability that sharded rocksdb is selected in simulation tests 2024-10-10 09:48:23 -07:00
Syed Paymaan Raza
0946f49579 [gray-failure] Remove CC_PAUSE_HEALTH_MONITOR (#11675) 2024-10-07 11:59:19 -07:00
Jingyu Zhou
f86058fba6 Remove the usage of txsTag (#11688)
* Add assertions to code paths with txsTag

txsTag should be obsolete by now, since it was used in 6.1, which is no longer
supported for upgrade.

* Actually remove txsTag usage

20240926-225930-jzhou-7ed3304c415ae65e

* Remove more code

20240926-235242-jzhou-7ed3304c415ae65e

* Disable two verbose trace events

They can cause TraceTooManyLines errors.
2024-09-30 07:53:37 -04:00
neethuhaneesha
5637f23231 Increasing minimum age to wiggle to avoid re-wiggling migrated rocksdb storage servers (#11683) 2024-09-26 12:59:37 -07:00
Yao Xiao
83dd1f202e Fix block cache size error and improve logging. (#11681) 2024-09-24 13:08:31 -07:00
Jingyu Zhou
b872a8ea57 Merge pull request #11637 from neethuhaneesha/direct_io_enabling
Enabling rocksdb direct_io and wiggle knobs
2024-09-19 10:12:14 -07:00
Jingyu Zhou
712f88a1ff More protocol version related code removal
Removed code handling old protocol versions, i.e., those before 7.1
2024-09-18 13:28:06 -07:00
Jingyu Zhou
fc30fc269e Remove dead code after removing tagLocalityUpgraded usage
20240918-170752-jzhou-33111b2c3e6776aa
2024-09-18 11:23:09 -07:00
Jingyu Zhou
7b76561bb9 Remove tagLocalityUpgraded usage at various places
Since we have removed the old TLog implementation, the code paths using this tag
can be deleted to simplify the code.
2024-09-18 11:23:09 -07:00
Jingyu Zhou
80ca71833b Make xxhash checksum the default for TLog
Update downgrade tests to use xxhash.
2024-09-17 12:46:42 -07:00
Keijo Kapp
b9926aefe5 Fix the key range affected by setting version stamped key (#11424)
When doing version stamped key operation, the affected key range should
start from the next read version, not the current one.
2024-09-14 14:52:01 -07:00
Syed Paymaan Raza
e1c7cdd3e3 [CC+Worker] Enable WORKER_HEALTH_MONITOR related knobs in simulation tests (#11657) 2024-09-12 23:27:47 -07:00
Jingyu Zhou
2313fdaa0e Add rocksdb, sharded rocksdb to configure workload (#11654)
* Add rocksdb, sharded rocksdb to configure workload

Also remove mentions of ssd-redwood-1-experimental.

* Fix test failure when SHARD_ENCODE_LOCATION_METADATA is off
2024-09-12 21:03:06 -07:00
Yao Xiao
289d02899f Add knobs for caching index blocks. (#11650) 2024-09-11 22:58:26 -07:00
neethuhaneesha
8ff623e523 Enabling rocksdb direct_io and wiggle knobs 2024-09-11 09:28:44 -07:00
Sepeth
3854dbfe4d Upgrade fmt from 8.1.1 to 11.0.2 (#11601)
fmt is now added via CMake FetchContent, and contrib/fmt-8.1.1 is removed.
2024-09-10 14:42:43 -07:00
Jingyu Zhou
d730db521a Fix a Valgrind error (#11645)
buggifyShortReadWindow used the uninitialized variable ENABLE_VERSION_VECTOR.
2024-09-10 15:29:36 -04:00
Dan Lambright
5eafd46351 Disable version vector on batches with backed up mutations (#11634)
Co-authored-by: Dan Lambright <hlambright@apple.com>
2024-09-09 21:18:59 -04:00
Zhe Wang
5ee0db13e6 Fix external timeout with ShardedRocksDB and re-enable ShardedRocksDB in simulation tests (#11638)
* speedup sharded rocksdb in simulation

* re-enable shardedrocksdb and disable physical shard move
2024-09-08 10:57:55 -07:00
hao fu
f092e19026 address comments 2024-09-05 16:07:01 -07:00
hao fu
5295920ded Check whether the bucket exists rather than calling listBucket 2024-09-05 16:06:18 -07:00
hao fu
640b0fe7f3 Finish testing, set default to false 2024-09-05 15:17:34 -07:00
hao fu
04e02c2908 Retry with dryrun in the presence of s3 token error
The s3 token is read from local disk and might be expired or invalid.
Before this change, backup retried uploading data to s3 indefinitely,
which wasted network bandwidth.

Now, on an s3 token error, backup retries with a GET request that lists all
buckets, and only retries the upload once the token error disappears.
2024-09-05 15:17:27 -07:00
Syed Paymaan Raza
48064f6cf1 Make some codeprobes rare (#11607)
* Make BlobGranule code probes rare

* Make encryption related code probes rare

* fixup! Fix formatting
2024-08-26 22:33:38 -07:00
Jingyu Zhou
cf188a99d4 Convert most actors in Watches workload into coroutines
One actor is left because there doesn't seem to be a good way to convert it. To make
sure the converted code behaves correctly, I added a few CodeProbes to
ensure code coverage.
2024-08-23 12:11:57 -07:00
Zhe Wang
3305d2e3ee fix storage engine selection (#11586) 2024-08-20 09:24:30 -07:00
Jingyu Zhou
5d5f1a2dc7 Merge pull request #11575 from brownleej/backup-transaction-options
Capture default database options in fdbbackup in a local variable.
2024-08-16 14:16:38 -07:00
Zhe Wang
6c502e9707 Solve RocksDB external timeout error and re-enable RocksDB simulation tests (#11577)
* init knob tune

* include rocksdb in tests

* probabilistically reuse rocksdb iterator in simulation

* clear unnecessary knob change
2024-08-16 12:37:18 -07:00
John Brownlee
cd2962f10c Rename fields in new trace events to match formatting standards. 2024-08-16 11:44:25 -07:00
John Brownlee
860963ba05 Reformat changes. 2024-08-15 12:31:17 -07:00
John Brownlee
cd4eb794b5 Add debug logging to help validate the transaction options set in fdbbackup. 2024-08-15 12:01:56 -07:00
Jingyu Zhou
bd2e108531 Merge pull request #11555 from jzhou77/fix
Reduce chance of running rare tests
2024-08-06 10:13:03 -07:00
Syed Paymaan Raza
392bad2bd3 More copyright end year updates (#11556) 2024-08-05 14:00:32 -07:00
Jingyu Zhou
5d2deddb7d Reduce the chance to run some rare tests
E.g., StatusBuilderPerf and TLogVersionMessagesOverheadFactor are more like
performance tests, which shouldn't run so many times.

Without the change, a 100k-run includes these tests this many times:

   1318 tests/rare/CycleWithKills.toml
   1591 tests/rare/TLogVersionMessagesOverheadFactor.toml
   1647 tests/rare/ConfigDBUnitTest.toml
   1839 tests/rare/StatusBuilderPerf.toml

After the change, a 100k-run has:

    129 tests/rare/TLogVersionMessagesOverheadFactor.toml
    151 tests/rare/CycleWithKills.toml
    160 tests/rare/StatusBuilderPerf.toml
    375 tests/rare/ConfigDBUnitTest.toml
2024-08-02 17:24:30 -07:00