28256 Commits

Author SHA1 Message Date
Jingyu Zhou
6fe4d74522 Remove an unused parameter from cleanupRecoveryActorCollection (#12654)
No functional change
2026-01-22 14:29:04 -08:00
Michael Stack
9d0e169496 Add error_code_audit_storage_task_outdated to bypass list rather than do special-case handling (#12652) 2026-01-22 13:49:11 -08:00
Jingyu Zhou
2d2a2144f4 Update copyright years to 2013-2026 (#12653)
No functional changes.
2026-01-22 10:49:41 -08:00
Michael Stack
7802b83882 Fix s3client_test ctest 'ERROR: Missing s3client/ls_test/sub1/file2_2 in ls output' (#12650)
The correct number of files are listed:

2026-01-16T04:59:53+00:00 ERROR: Missing s3client/ls_test/sub1/file2_2 in ls output
2026-01-16T04:59:53+00:00 === DEBUG: Recursive ls output ===
Contents of blobstore://127.0.0.1:8081/s3client/ls_test?bucket=test-bucket&region=us-east-1&secure_connection=0:
 s3client/ls_test/file1_1 26.00 B
 s3client/ls_test/file1_2 26.00 B
 s3client/ls_test/sub1/file2_1 26.00 B
 s3client/ls_test/sub1/file2_2 26.00 B
 s3client/ls_test/sub1/sub2/file3_1 26.00 B
 s3client/ls_test/sub1/sub2/file3_2 26.00 B
2026-01-16T04:59:53+00:00 === END DEBUG ===

... but we overcount because of the HTTP logging. Lines like this...

[4a60d5628a4137592fe32d5a5b949bb8] HTTP starting GET /test-bucket/s3client/ls_test/sub1/file2_2?tagging=

... match the pattern. Just look at stdout. Don't mix in stderr (HTTP
logs).

Co-authored-by: michael stack <stack@duboce.com>
2026-01-22 08:46:36 -08:00
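
The fix described in the entry above (count matches in stdout only, so HTTP log lines on stderr cannot inflate the count) can be sketched roughly as follows. This is a minimal Python illustration, not the actual test script; `count_matches` and the `CHILD` command are hypothetical names made up for the example.

```python
import subprocess
import sys


def count_matches(cmd, needle):
    """Run cmd and count the stdout lines that contain needle.

    stderr is captured into a separate pipe, so anything logged there
    (for example HTTP request logs) is never counted.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    return sum(1 for line in result.stdout.splitlines() if needle in line)


# Hypothetical child process: one listing line on stdout and one
# HTTP log line on stderr that mentions the same path.
CHILD = [
    sys.executable,
    "-c",
    "import sys;"
    "print(' s3client/ls_test/sub1/file2_2 26.00 B');"
    "print('HTTP starting GET /test-bucket/s3client/ls_test/sub1/file2_2?tagging=',"
    " file=sys.stderr)",
]
```

If stderr were merged into stdout, the same path would match twice, which is the kind of overcount the commit message describes.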
Michael Stack
564e95b681 Integrate BulkDump/BulkLoad with backup/restore system (#12608)
* Integrate BulkDump/BulkLoad with backup/restore system

This commit adds the ability to use BulkDump for creating backup snapshots
and BulkLoad for restoring them, providing faster backup/restore operations
for large databases.

Key changes:
- Add BulkDumpTaskFunc to create SST file snapshots during backup
- Add BulkLoadRestoreTaskFunc to restore from BulkDump snapshots
- Store bulkDumpJobId in snapshot metadata for restore coordination
- Add snapshotMode parameter (0=RANGEFILE, 1=BULKDUMP) to control backup type
- Add useRangeFileRestore parameter to control restore method
- Add CLIENT_KNOBS for configurable job timeouts
- Add test assertions to verify BulkDump/BulkLoad execution
- Check for existing running jobs to avoid conflicts when multiple agents run
- Properly scope state variables for error handling in Flow actors

New test: tests/slow/BackupS3BlobBulkLoadRestore.toml

* Update design/bulkload-restore-integration.md
2026-01-21 21:29:23 -08:00
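
The `snapshotMode` values listed in the entry above (0=RANGEFILE, 1=BULKDUMP) could be modeled as a small enum. This is only a sketch in Python; the real parameter lives in the FoundationDB C++ code, and `SnapshotMode` and `parse_snapshot_mode` are hypothetical names, not part of the commit.

```python
from enum import IntEnum


class SnapshotMode(IntEnum):
    """Numeric values taken from the commit message; the class name is assumed."""

    RANGEFILE = 0
    BULKDUMP = 1


def parse_snapshot_mode(value):
    """Convert a raw parameter (int or numeric string) to a SnapshotMode.

    Raises ValueError for anything other than 0 or 1.
    """
    return SnapshotMode(int(value))
```

Rejecting unknown values early keeps an unsupported mode from silently falling back to one of the two backup types.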
walter
e2baa88a84 Add rocksdb options index_block_restart_interval and index_type (#12639) 2026-01-21 12:42:38 -08:00
Syed Paymaan Raza
0c67384ac5 Remove unused fdbservice directory (#12623)
The fdbservice directory contained Windows-specific service code that is
no longer maintained and does not look like it is used elsewhere.

This removes the directory and its corresponding CMake configuration.
2026-01-20 11:12:20 -08:00
Jingyu Zhou
8c1a69ba60 Refactor log router monitoring and re-recruitment (#12642) 2026-01-19 18:24:05 -08:00
Jingyu Zhou
3d41289ffb Merge pull request #12644 from jzhou77/mailmap 2026-01-19 18:22:16 -08:00
Jingyu Zhou
7f06b8e334 Fix IDE build compiling errors 2026-01-19 14:29:12 -08:00
Jingyu Zhou
75fae7434f Remove leftover storage cache role after #12486
They are unused declarations now.
2026-01-19 13:52:38 -08:00
Jingyu Zhou
5789c55556 Fix a few file read cancellation bugs (#12643) 2026-01-16 17:57:37 -08:00
Jingyu Zhou
9d7431dbba Remove unused code probes at master role (#12637)
Coverage tool found no match for these events, because the refactoring has moved
monitoring of txn system failures from master role to CC now.
2026-01-14 15:12:27 -08:00
gxglass
b1848af90a HealthMetricsApi workload: only check when we've received metrics (#12636)
rdar://166184432

Sometimes we don't get enough data to compute all of the stats that this workload wants to see as non-zero.
Maintain a flag gotMetrics and do checks on metrics only if this flag is set.

Debugged by the obvious method of adding TraceEvents to see what was happening with this workload.

Also some minor TraceEvent updates and simplify a variable name.

20260113-232530-gglass-05eb3d48db99757b compressed=True data_size=35600940 duration=3977581 ended=100000 fail_fast=1000 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=8:02:14 sanity=False started=100000 stopped=20260114-072744 submitted=20260113-232530 timeout=5400 username=gglass

* HealthMetricsApi workload: only check() aggressively when we have received full metrics

* HealthMetricsApi.actor.cpp: address review comments, and add an overall comment saying that this seems testable outside simulation
2026-01-14 14:55:13 -08:00
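
The `gotMetrics` idea from the entry above (run the strict checks only once metrics have actually arrived) can be illustrated with a small sketch. It is written in Python for brevity and is not the workload's C++ code; the `MetricsCheck` class and its single non-zero check are assumptions made for the example.

```python
class MetricsCheck:
    """Hypothetical stand-in for the workload's bookkeeping."""

    def __init__(self):
        self.got_metrics = False  # mirrors the gotMetrics flag
        self.samples = []

    def on_sample(self, value):
        # Any received sample means there is data to validate.
        self.samples.append(value)
        self.got_metrics = True

    def check(self):
        # Without data there is nothing to validate, so do not fail.
        if not self.got_metrics:
            return True
        # With data, require at least one non-zero statistic.
        return max(self.samples) > 0
```

The flag separates "no data yet" from "data arrived and looks wrong", which a plain non-zero check cannot tell apart.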
dependabot[bot]
20a5722935 Bump authlib (#12628)
Bumps the pip group with 1 update in the /tests/TestRunner directory: [authlib](https://github.com/authlib/authlib).


Updates `authlib` from 1.6.5 to 1.6.6
- [Release notes](https://github.com/authlib/authlib/releases)
- [Changelog](https://github.com/authlib/authlib/blob/main/docs/changelog.rst)
- [Commits](https://github.com/authlib/authlib/compare/v1.6.5...v1.6.6)

---
updated-dependencies:
- dependency-name: authlib
  dependency-version: 1.6.6
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-14 11:18:38 -08:00
Jingyu Zhou
3b574c4e9b Fix code coverage data by removing hits from old binaries (#12635)
* Fix code coverage data by removing hits from old binaries

Otherwise, the data count hits from old binaries that may have different lines,
thus inflating the total number of probes.

* Address a review comment
2026-01-13 16:44:36 -08:00
walter
5614ae64aa Add rocksdb option max_bytes_for_level_multiplier (#12634) 2026-01-13 09:46:24 -08:00
Syed Paymaan Raza
ba1d659587 Delete sharded rocks extraneous code (logWriteSize function and its callsites) (#12633) 2026-01-12 22:39:45 -08:00
Vishesh Yadav
af732673f1 Implement Exclude commands in gRPC (#12603) 2026-01-12 11:38:36 -08:00
Akanksha Mahajan
2c06c99f1d EncryptionBackup: Log encryption key file access failures at SevError severity (#12629)
* Change the error to Sev40

* Fix formatting error
2026-01-09 10:30:17 -08:00
Michael Stack
a5ab18f449 Fix CycleBadRead when BackupS3BlobCorrectness.toml -s 1157546047 -b off (#12627)
Take a lock while backup and restore are running so we can leave cycle running while backup and restore are in operation
2026-01-08 12:23:45 -08:00
Michael Stack
dde3488bcf s3client_test ctest failed in nightly. Add retry on non-recursive listing (to match retry on recursive listing) (#12617)
* Retry because mocks3 update is not immediate

* Undo disown watchdog on shutdown. Do a couple of kill strategies instead.
(Fixes s3_backup_test hang on cleanup after PASSED seen in PR build).

* Remove useless comment

* Add -a on grep .. could be binary in fdbserver output
2026-01-08 12:04:59 -08:00
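
The retry described in the entry above (the listing may lag behind a write because the mock S3 update is not immediate) might look roughly like this. It is a generic Python sketch, not the shell logic used by the test; `list_with_retry` and its parameters are hypothetical.

```python
import time


def list_with_retry(list_fn, expected, attempts=5, delay_seconds=0.01):
    """Call list_fn until every expected name appears, or give up.

    list_fn is any zero-argument callable returning a list of names.
    """
    listing = []
    for attempt in range(attempts):
        listing = list_fn()
        if all(name in listing for name in expected):
            return listing
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    missing = sorted(set(expected) - set(listing))
    raise RuntimeError(f"missing entries after {attempts} attempts: {missing}")
```

Bounding the number of attempts keeps a genuinely missing file from turning into a hang.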
Jingyu Zhou
771dec1278 Update mailmap for some authors (#12622) 2026-01-07 16:54:35 -08:00
Michael Stack
aa35d6cc29 Add restore validation feature: restores to special keyspace allowing validating backup/restore in single cluster (space willing) (#12573)
* Add restore validation feature with simplified backup gap fix

Implements restore validation using audit_storage to verify backup/restore
correctness. Includes a minimal fix for the backup gap bug.

Key components:
- ValidateRestore audit type: compares source keys against restored keys
  at \xff\x02/rlog/ prefix in storage server
- DD audit fixes: propagate validation errors, handle DD failover correctly
- RestoreValidation and BackupAndRestoreValidation workloads for testing
- Simplified backup gap fix: prevent snapshot from finishing in the same
  iteration it dispatches the last tasks (single flag + one check)
2026-01-07 15:23:02 -08:00
VXTLS
31d7eadd52 Make FDB_USE_CSHARP_TOOLS authoritative and consistently honored across the build (#12615)
* Make FDB_USE_CSHARP_TOOLS authoritative across the build

Historically, FDB_USE_CSHARP_TOOLS acted as a preference hint, and parts of the build could still probe for or assume the presence of C# tooling even when it was disabled.

This change makes the option authoritative and consistently honored across the build system. C# tooling is now used only when explicitly enabled and available, and all downstream assumptions are gated accordingly.

The default configuration and tool preference order remain unchanged.

* cmake files changes

* WIP: tmp test

* Honor reviewer feedback on C# toolchain detection and actor comparison

- Stop assuming C# tooling availability on Windows; explicitly probe for
  .NET using find_program.
- Prefer .NET over mono on all platforms, with mono used only as a fallback.
- Fail explicitly when FDB_USE_CSHARP_TOOLS=ON but no C# toolchain is found.
- Preserve Python/C# actor output comparison when C# tooling is available,
  skipping it only when C# is explicitly disabled or unavailable.
- Simplify Python argument parsing and remove unnecessary textwrap usage.
2026-01-06 20:55:33 -08:00
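
The selection rules listed in the entry above (use C# tooling only when enabled, prefer .NET over mono, fail when enabled but nothing is found) can be summarized in a short sketch. The real logic is in the CMake files; this Python function, its name, and the executable names `dotnet` and `mono` are assumptions used for illustration.

```python
def choose_csharp_tool(use_csharp_tools, available):
    """Pick a C# toolchain name, or None when C# tooling is disabled.

    use_csharp_tools: value of the FDB_USE_CSHARP_TOOLS option.
    available: set of executable names found on the machine (assumed input).
    """
    if not use_csharp_tools:
        # Option is authoritative: never probe or fall back when it is off.
        return None
    # Prefer .NET, with mono only as a fallback.
    for candidate in ("dotnet", "mono"):
        if candidate in available:
            return candidate
    raise RuntimeError("FDB_USE_CSHARP_TOOLS=ON but no C# toolchain was found")
```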
Syed Paymaan Raza
2f91f6338c Remove unused DummyWorkload (#12624)
DummyWorkload is not referenced by any test files or other code.
2026-01-06 20:20:59 -08:00
Jingyu Zhou
dfbde65a14 Remove blob failure injections (#12620)
* Remove blob failure injections

Follow-up for the cleanup done at #12435. These functions are unused now.

* Fix an assertion failure in simulation

sim2 has "ASSERT(seconds >= -0.0001);" in delay() function, which was
triggering from the tlog code.

Reproduction:

-f ./tests/fast/SidebandSingle.toml -s 3567205446 -b on
2026-01-06 16:09:11 -08:00
Copilot
c797e35cd0 Add .mailmap for contributor identity mapping (#12602)
* Initial plan

* Add .mailmap with Mohamad Gebai entry

Co-authored-by: saintstack <48398+saintstack@users.noreply.github.com>

* Clarify .mailmap format with improved comments

Co-authored-by: saintstack <48398+saintstack@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: saintstack <48398+saintstack@users.noreply.github.com>
2026-01-06 13:06:30 -08:00
Akanksha Mahajan
e8a1cfb8b8 Add FileLevelEncryption field to BackupDescription JSON output (#12619) 2026-01-06 12:48:02 -08:00
Alexey Pavlenko
7bb558ddb0 Increase mode_bytes for the StreamingMode.WANT_ALL (#12616) 2026-01-05 20:55:46 -08:00
Michael Stack
a2bb2e39c1 Run BulkDumpingS3 tests with 3 replicas instead of 1 (#12609)
Co-authored-by: stack <stack@duboce.com>
2026-01-05 14:56:11 -08:00
Xiaoge Su
e8732a8ccd Do not force using pypi.python.org when installing sphinx
In certain situations, e.g., an internal build of FoundationDB that does not
allow access to external websites, forcing
https://pypi.python.org/simple prevents the use of other PyPI indexes.

This patch will remove this hard requirement.
2026-01-05 13:56:07 -08:00
daleiz
401032d042 fix: add virtual dtor for RequestBase (#12614) 2026-01-05 13:25:36 -08:00
daleiz
13f784b554 Fix misaligned access UB in Endpoint class (#12613) 2026-01-02 20:38:16 -08:00
Xiaoge Su
f22ed30b97 Use Findbenchmark instead of hardcoding the google-benchmark path in … (#12612)
* Use Findbenchmark instead of hardcoding the google-benchmark path in flowbench

* fixup! Add missing include directories

* fixup! Remove extra PRIVATE token

* target_link_libraries -> target_include_directories

* Support downloading google-benchmark if not found locally

This may need to be reconsidered, as it seems more reasonable that the
developer prepares the library rather than letting CMake take the
responsibility of monitoring the availability of an external project.

* Remove extra creation of flowbench target
2026-01-02 13:55:51 -08:00
Michael Stack
6e8be999cc XDB-432-7.4 flexible keys parsing for backup cli (#12605)
Author: Mark Shabanov <mshabanov@openintegration.inc>
Signed-off-by: dlambrig
Signed-off-by: sbodagala
2025-12-22 14:55:05 -08:00
neethuhaneesha
0d21f35e70 Tx mutations, LogProtocolMessage, SpanContextMessage need to be skipped before converting them to mutations (#12575) 2025-12-21 12:39:47 -08:00
neethuhaneesha
afca911361 Backup worker using proxy from command line to upload to S3. (#12565) 2025-12-19 10:37:28 -08:00
Jingyu Zhou
549d324f13 Remove verbose actorcompiler output (#12600)
To keep the compiling output clean after PR #12559.
2025-12-18 21:04:30 -08:00
Michael Stack
4772e6f1e6 Exclude single_process_fdbcli_tests (as we do single_process_external_client_fdbcli_tests); it fails with ASAN enabled (#12601)
Signed-off-by: jzhou77
2025-12-18 11:43:39 -08:00
Jingyu Zhou
271851906e Re-recruit log routers after failures to avoid recoveries (#12558)
* Re-recruit log routers after failures

Log routers are stateless roles that can reconstruct their state after a crash.
This is an attempt to avoid triggering recovery if one of the log routers crashes.
To simplify the work, only the current generations of log routers are monitored
and re-recruited after crashes. Previous generations of log routers are not
handled in this change, as they are short lived and purged after recovery
reaches the fully_recovered state.

* Monitor log routers after full recovery

I.e., monitorAndRecruitLogRouters() waits for full recovery.

* Some cleanup

* Add WorkerCache for log routers

To avoid duplicated log routers running, though only one will be used (but it's
confusing when debugging).

20251114-230514-jzhou-59d6afe1e475c495

* Fix monitoring to happen after full recovery

20251115-041144-jzhou-4278fe608ed051b7

* Keep monitoring log routers before recovery completion

20251115-050228-jzhou-7c37cfb1d6e36ced

* monitorAndRecruitLogRouters detects recoveries

20251115-205132-jzhou-4d50f7c5914e883a

* Make monitorAndRecruitLogRouters long running

20251115-210426-jzhou-e63a8fcf26a76c81

* Recruit failed log routers in parallel

20251115-222239-jzhou-9eb2287e12c93f7a

* Fix replaced log router's begin version

Use the TLog's reply.popped version as its start version.

20251116-025329-jzhou-51a4def306038241

* clang-format fix

* Address review comments

20251120-221940-jzhou-28d51f8a0400377e             compressed=True data_size=37463028 duration=8246235 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:51:39 sanity=False started=100000 stopped=20251120-231119 submitted=20251120-221940 timeout=5400 username=jzhou

* Add exponential backoff for log router re-recruitment

20251121-041700-jzhou-7126a109c1e39c76

* Fix crashes

20251121-043947-jzhou-6f373c1c64faa1b2

* Disable a verbose event

* Add CC_RERECRUIT_LOG_ROUTER_ENABLED to control this feature

20251215-220843-jzhou-70dad477b39640e9
2025-12-17 15:42:49 -08:00
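
The entry above mentions exponential backoff for log router re-recruitment. The general technique can be sketched as below; this is a generic Python illustration, not the FoundationDB implementation, and the function name, parameters, and values are all hypothetical.

```python
def backoff_delays(initial, factor, maximum, attempts):
    """Return the wait time before each retry attempt.

    Each delay is the previous one multiplied by factor, capped at maximum.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        delay *= factor
    return delays
```

For instance, `backoff_delays(1.0, 2.0, 10.0, 5)` yields `[1.0, 2.0, 4.0, 8.0, 10.0]`. The cap keeps repeated failures from pushing the wait time up without limit.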
Trevor Clinkenbeard
170013d69b Support tracking read latency metrics per read type (#12586) 2025-12-17 15:02:28 -08:00
Dan Lambright
2d52509099 Reserve vector capacity for tempTagMessages in TLog commit path (#12571)
* Reserve Vector Capacity for tempTagMessages

* Add knob ENABLE_TLOG_TEMP_TAG_MESSAGES_RESERVE

---------

Co-authored-by: Dan Lambright <hlambright@apple.com>
2025-12-17 13:50:30 -08:00
Hendrik Hofstadt
1c00763a55 Port actorcompiler to python (#12559)
* Port actorcompiler to python

* Address review and restore C# compiler
2025-12-17 13:33:07 -08:00
Jingyu Zhou
1c2a1dd653 Force simulator to have a cap on satellite logs (#12597)
* Force simulator to have a cap on satellite logs

If not, ChangeConfig workload or simulation may choose a number higher than
available machines. As a result, the recruitment will fail, blocking recovery
from finishing, thus making the database unavailable.

To reproduce:

Seed: -f ./tests/fast/LocalRatekeeper.toml -s 1185956409 -b on
Branch: main
Commit ID: 280b10fa49

500k 20251216-233658-jzhou-66053213858cc41d

* Address review comments.
2025-12-17 12:22:07 -08:00
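
The cap described in the entry above (never choose more satellite logs than there are machines to host them) amounts to clamping a random choice. A minimal Python sketch, assuming a hypothetical `pick_satellite_log_count` helper that is not taken from the simulator code:

```python
import random


def pick_satellite_log_count(rng, max_requested, available_machines):
    """Choose a satellite log count that recruitment can actually satisfy.

    The upper bound is the smaller of the configured maximum and the
    number of available machines, and never less than 1.
    """
    upper = max(1, min(max_requested, available_machines))
    return rng.randint(1, upper)
```

Passing the random generator in as `rng` keeps the choice reproducible for a given seed.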
Michael Stack
280b10fa49 Add the watchdog cleanup added to other scripts to bulkload test too (#12595)
* Add the watchdog cleanup added to other scripts to bulkload test too

* Add delay before listing files

* Wait until the listing is complete rather than hard-coded time (Reviewer suggestion)

Signed-off-by: gxglass
2025-12-15 15:46:21 -08:00
Martynas Jurkus
224e0daa8f Go binding: add GetMainThreadBusyness method to Database (#12594) 2025-12-15 08:36:30 -08:00
gxglass
df96a141f5 Add back C library implementations for fdb_database_open_tenant and 3 other methods (#12593)
This allows 7.x python libraries to load these methods. It does this on startup to set up python/C API stuff, regardless
of whether the API user actually invokes this functionality (which was experimental and is now removed).

More details: rdar://166307379

Testing:
ctest -R c_api
ctest -R python

20251211-224723-gglass-17ed16f020aa18de compressed=True data_size=35311862 duration=5560013 ended=100000 fail_fast=1000 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:01:01 sanity=False started=100000 stopped=20251211-234824 submitted=20251211-224723 timeout=5400 username=gglass
2025-12-12 09:34:27 -08:00
daleiz
a34f5a46a7 Improve compiler flag for ARM64 (#12589)
Replace -march=armv8.2-a+crc+simd with -march=armv8.2-a+lse+crc since
SIMD (NEON) is already mandatory in ARMv8, and LSE (Large System
Extensions) is more important, which is supported on Graviton2 and later.
2025-12-10 12:36:41 -08:00
Michael Stack
f8f90c96f0 Add logging around cleanup and add a watchdog to kill regardless (#12587)
* Add logging around cleanup and add a watchdog to kill regardless after 30 seconds

* Address review comments

Signed-off-by: gxglass
2025-12-09 20:09:47 -08:00