The correct number of files is listed:
2026-01-16T04:59:53+00:00 ERROR: Missing s3client/ls_test/sub1/file2_2 in ls output
2026-01-16T04:59:53+00:00 === DEBUG: Recursive ls output ===
Contents of blobstore://127.0.0.1:8081/s3client/ls_test?bucket=test-bucket&region=us-east-1&secure_connection=0:
s3client/ls_test/file1_1 26.00 B
s3client/ls_test/file1_2 26.00 B
s3client/ls_test/sub1/file2_1 26.00 B
s3client/ls_test/sub1/file2_2 26.00 B
s3client/ls_test/sub1/sub2/file3_1 26.00 B
s3client/ls_test/sub1/sub2/file3_2 26.00 B
2026-01-16T04:59:53+00:00 === END DEBUG ===
... but we overcount because of the HTTP logging. Lines like this...
[4a60d5628a4137592fe32d5a5b949bb8] HTTP starting GET /test-bucket/s3client/ls_test/sub1/file2_2?tagging=
... match the pattern. Just look at stdout; don't mix in stderr (HTTP
logs).
Co-authored-by: michael stack <stack@duboce.com>
* Integrate BulkDump/BulkLoad with backup/restore system
This commit adds the ability to use BulkDump for creating backup snapshots
and BulkLoad for restoring them, providing faster backup/restore operations
for large databases.
Key changes:
- Add BulkDumpTaskFunc to create SST file snapshots during backup
- Add BulkLoadRestoreTaskFunc to restore from BulkDump snapshots
- Store bulkDumpJobId in snapshot metadata for restore coordination
- Add snapshotMode parameter (0=RANGEFILE, 1=BULKDUMP) to control backup type
- Add useRangeFileRestore parameter to control restore method
- Add CLIENT_KNOBS for configurable job timeouts
- Add test assertions to verify BulkDump/BulkLoad execution
- Check for existing running jobs to avoid conflicts when multiple agents run
- Properly scope state variables for error handling in Flow actors
New test: tests/slow/BackupS3BlobBulkLoadRestore.toml
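A minimal sketch of the control flow these parameters imply; every name below is a hypothetical stand-in, not an actual fdbbackup symbol:

    #include <string>

    enum class SnapshotMode { RANGEFILE = 0, BULKDUMP = 1 };

    struct BackupConfig {
        SnapshotMode snapshotMode = SnapshotMode::RANGEFILE;
        bool useRangeFileRestore = true; // restore-method toggle
        std::string bulkDumpJobId;       // kept in snapshot metadata for restore
    };

    // Stubs standing in for the real task functions.
    std::string startBulkDumpJob() { return "job-1"; }
    void startRangeFileSnapshot() {}

    void startSnapshot(BackupConfig& cfg) {
        if (cfg.snapshotMode == SnapshotMode::BULKDUMP) {
            // BulkDump path: snapshot as SST files and record the job id
            // so the BulkLoad restore side can locate the snapshot.
            cfg.bulkDumpJobId = startBulkDumpJob();
        } else {
            startRangeFileSnapshot(); // classic key-range file snapshot
        }
    }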
* Update design/bulkload-restore-integration.md
The fdbservice directory contained Windows-specific service code that is
no longer maintained and does not appear to be used elsewhere.
This removes the directory and its corresponding CMake configuration.
rdar://166184432
Sometimes we don't get enough data to compute all of the stats that this workload wants to see as non-zero.
Maintain a gotMetrics flag and run the metrics checks only when it is set.
Debugged by the obvious method of adding TraceEvents to see what was happening with this workload.
Also some minor TraceEvent updates and a simplified variable name.
20260113-232530-gglass-05eb3d48db99757b compressed=True data_size=35600940 duration=3977581 ended=100000 fail_fast=1000 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=8:02:14 sanity=False started=100000 stopped=20260114-072744 submitted=20260113-232530 timeout=5400 username=gglass
* HealthMetricsApi workload: only check() aggressively when we have received full metrics
* HealthMetricsApi.actor.cpp: address review comments, and add an overall comment saying that this seems testable outside simulation
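A minimal sketch of the gotMetrics guard described above, with hypothetical names (the real workload is a Flow actor):

    struct HealthMetricsCheck {
        bool gotMetrics = false; // set once a full metrics sample has arrived

        void onSample(bool sampleIsComplete) {
            if (sampleIsComplete)
                gotMetrics = true;
        }

        // check() passes vacuously until full metrics have been seen, so the
        // workload never fails on stats that are zero only for lack of data.
        bool check(long tlogQueueBytes) const {
            if (!gotMetrics)
                return true;
            return tlogQueueBytes > 0; // stand-in for the real non-zero checks
        }
    };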
* Fix code coverage data by removing hits from old binaries
Otherwise, the data include hits from old binaries that may have different
lines, inflating the total number of probes.
* Address a review comment
* Retry because mocks3 update is not immediate
* Undo disowning the watchdog on shutdown; use a couple of kill strategies instead.
(Fixes the s3_backup_test hang on cleanup after PASSED, seen in PR builds.)
* Remove useless comment
* Add -a to grep, since fdbserver output could contain binary data
* Add restore validation feature with simplified backup gap fix
Implements restore validation using audit_storage to verify backup/restore
correctness. Includes a minimal fix for the backup gap bug.
Key components:
- ValidateRestore audit type: compares source keys against restored keys
at the \xff\x02/rlog/ prefix in the storage server
- DD audit fixes: propagate validation errors, handle DD failover correctly
- RestoreValidation and BackupAndRestoreValidation workloads for testing
- Simplified backup gap fix: prevent snapshot from finishing in the same
iteration it dispatches the last tasks (single flag + one check)
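A minimal sketch of that single-flag guard, under assumed names (not the actual backup task loop):

    #include <functional>

    struct SnapshotDispatcher {
        bool dispatchedThisIteration = false; // the single flag

        void runIteration(const std::function<int()>& dispatchPending,
                          const std::function<bool()>& allTasksDone,
                          const std::function<void()>& finishSnapshot) {
            dispatchedThisIteration = dispatchPending() > 0;
            // The one check: never finish the snapshot in the same iteration
            // that dispatched its last tasks; wait for a later pass instead.
            if (allTasksDone() && !dispatchedThisIteration)
                finishSnapshot();
        }
    };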
* Make FDB_USE_CSHARP_TOOLS authoritative across the build
Historically, FDB_USE_CSHARP_TOOLS acted as a preference hint, and parts of the build could still probe for or assume the presence of C# tooling even when it was disabled.
This change makes the option authoritative and consistently honored across the build system. C# tooling is now used only when explicitly enabled and available, and all downstream assumptions are gated accordingly.
The default configuration and tool preference order remain unchanged.
* CMake file changes
* WIP: tmp test
* Honor reviewer feedback on C# toolchain detection and actor comparison
- Stop assuming C# tooling availability on Windows; explicitly probe for
.NET using find_program.
- Prefer .NET over mono on all platforms, with mono used only as a fallback.
- Fail explicitly when FDB_USE_CSHARP_TOOLS=ON but no C# toolchain is found.
- Preserve Python/C# actor output comparison when C# tooling is available,
skipping it only when C# is explicitly disabled or unavailable.
- Simplify Python argument parsing and remove unnecessary textwrap usage.
* Remove blob failure injections
Follow-up for the cleanup done at #12435. These functions are unused now.
* Fix an assertion failure in simulation
sim2 has "ASSERT(seconds >= -0.0001);" in delay() function, which was
triggering from the tlog code.
Reproduction:
-f ./tests/fast/SidebandSingle.toml -s 3567205446 -b on
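A caller-side guard sketch, assuming the negative argument came from floating-point deadline arithmetic (the actual tlog fix may differ):

    #include <algorithm>

    // sim2's delay() asserts seconds >= -0.0001, so computed waits should
    // be clamped before being handed to it.
    double safeWait(double deadline, double now) {
        return std::max(0.0, deadline - now);
    }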
In certain situations, e.g., an internal build of FoundationDB that does not
allow access to external websites, forcing
https://pypi.python.org/simple will prevent using other PyPI indexes.
This patch removes this hard requirement.
* Use Findbenchmark instead of hardcoding the google-benchmark path in flowbench
* fixup! Add missing include directories
* fixup! Remove extra PRIVATE token
* target_link_libraries -> target_include_directories
* Support downloading google-benchmark if not found locally
This may need to be reconsidered, as it seems more reasonable for the
developer to prepare the library rather than letting CMake take
responsibility for monitoring the availability of an external project.
* Remove extra creation of flowbench target
* Re-recruit log routers after failures
Log routers are stateless roles that can reconstruct their state after a crash.
This is an attempt to avoid triggering recovery when one of the log routers crashes.
To simplify the work, only the current generation of log routers is monitored
and re-recruited after crashes. Previous generations of log routers are not
handled in this change, as they are short-lived and purged once recovery
reaches the fully_recovered state.
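A minimal sketch of that design, with hypothetical names (the real change is a long-running actor on the cluster controller):

    #include <functional>
    #include <map>
    #include <string>

    struct LogRouterSlot {
        std::string worker;  // where this router currently runs
        bool failed = false;
    };

    // Watch only the current generation; replace failed routers in place
    // instead of triggering a full recovery.
    void monitorAndRecruitLogRouters(std::map<int, LogRouterSlot>& currentGen,
                                     const std::function<std::string()>& recruitWorker) {
        for (auto& [tag, slot] : currentGen) {
            if (slot.failed) {
                // Log routers are stateless, so the replacement can rebuild
                // its own state; previous generations are not handled here.
                slot.worker = recruitWorker();
                slot.failed = false;
            }
        }
    }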
* Monitor log routers after full recovery
I.e., monitorAndRecruitLogRouters() waits for full recovery.
* Some cleanup
* Add WorkerCache for log routers
To avoid duplicate log routers running: only one would be used, but the
extras are confusing when debugging.
20251114-230514-jzhou-59d6afe1e475c495
* Fix monitoring to happen after full recovery
20251115-041144-jzhou-4278fe608ed051b7
* Keep monitoring log routers before recovery completion
20251115-050228-jzhou-7c37cfb1d6e36ced
* monitorAndRecruitLogRouters detects recoveries
20251115-205132-jzhou-4d50f7c5914e883a
* Make monitorAndRecruitLogRouters long running
20251115-210426-jzhou-e63a8fcf26a76c81
* Recruit failed log routers in parallel
20251115-222239-jzhou-9eb2287e12c93f7a
* Fix replaced log router's begin version
Use the TLog's reply.popped version as its start version.
20251116-025329-jzhou-51a4def306038241
* clang-format fix
* Address review comments
20251120-221940-jzhou-28d51f8a0400377e compressed=True data_size=37463028 duration=8246235 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:51:39 sanity=False started=100000 stopped=20251120-231119 submitted=20251120-221940 timeout=5400 username=jzhou
* Add exponential backoff for log router re-recruitment
20251121-041700-jzhou-7126a109c1e39c76
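A sketch of the backoff step with illustrative constants (the real bounds would come from knobs):

    #include <algorithm>

    // Double the wait after each failed re-recruitment, capped at a maximum.
    double nextBackoff(double current, double initial = 0.5, double cap = 30.0) {
        return current <= 0 ? initial : std::min(current * 2.0, cap);
    }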
* Fix crashes
20251121-043947-jzhou-6f373c1c64faa1b2
* Disable a verbose event
* Add CC_RERECRUIT_LOG_ROUTER_ENABLED to control this feature
20251215-220843-jzhou-70dad477b39640e9
* Force simulator to have a cap on satellite logs
If not, the ChangeConfig workload or the simulator may choose a number higher
than the available machines. As a result, recruitment will fail, blocking
recovery from finishing and making the database unavailable.
To reproduce:
Seed: -f ./tests/fast/LocalRatekeeper.toml -s 1185956409 -b on
Branch: main
Commit ID: 280b10fa49
500k 20251216-233658-jzhou-66053213858cc41d
* Address review comments.
* Add the watchdog cleanup used in other scripts to the bulkload test too
* Add delay before listing files
* Wait until the listing is complete rather than waiting a hard-coded time (reviewer suggestion)
Signed-off-by: gxglass
This allows 7.x Python libraries to load these methods. It does this on startup to set up the Python/C API, regardless
of whether the API user actually invokes this functionality (which was experimental and is now removed).
More details: rdar://166307379
Testing:
ctest -R c_api
ctest -R python
20251211-224723-gglass-17ed16f020aa18de compressed=True data_size=35311862 duration=5560013 ended=100000 fail_fast=1000 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:01:01 sanity=False started=100000 stopped=20251211-234824 submitted=20251211-224723 timeout=5400 username=gglass
Replace -march=armv8.2-a+crc+simd with -march=armv8.2-a+lse+crc, since
SIMD (NEON) is already mandatory in ARMv8, and LSE (Large System
Extensions), supported on Graviton2 and later, is the more important extension to enable.