.. default-domain:: cpp
.. highlight:: cpp

###############
Client Testing
###############

###################################
Testing Error Handling with Buggify
###################################

FoundationDB clients need to handle errors correctly. Incorrect error handling can lead to many bugs; in the worst case it can
lead to a corrupted database. Because of this, it is important that an application or layer author properly tests their
application during failure scenarios. But this is non-trivial. In a development environment, cluster failures are very
unlikely, so certain types of exceptions may never be exercised in a controlled environment.

The simplest way of testing for these kinds of errors is a mechanism called ``Buggify``. If this option is enabled
in the client, the client will randomly throw errors that an application might see in a production environment. Enabling this
option in testing will greatly improve the probability that error handling is tested properly.

Options to Control Buggify
==========================

There are four network options to control the buggify behavior. By default, buggify is disabled (as it will behave in a way
that is not desirable in a production environment). The options to control buggify are:

- ``buggify_enable``
  This option takes no argument and will enable buggify.
- ``buggify_disable``
  This can be used to disable buggify again.
- ``client_buggify_section_activated_probability`` (default ``25``)
  A number between 0 and 100: the probability, in percent, that a section is activated when it is first reached.
- ``client_buggify_section_fired_probability`` (default ``25``)
  A number between 0 and 100: the probability, in percent, that an activated section is executed when it is reached.

Buggify works by first enabling sections in the code, which are then only executed with a certain probability. Generally
these code sections simply introduce a synthetic error.

When a section is reached for the first time, the client library decides randomly whether that code section will be enabled
or not. It will be enabled with a probability of ``client_buggify_section_activated_probability``.

Whenever the client reaches a buggify-enabled code block, it randomly decides whether to execute it. This makes sure that a certain
exception doesn't always fire. The probability of executing such a section is ``client_buggify_section_fired_probability``.

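The two probabilities act as a two-stage decision. The following is a minimal, self-contained sketch of that logic and not the
client library's implementation: ``BuggifySection``, ``shouldFire`` and the use of the standard library random facilities are
assumptions made for illustration. Only the two percentages correspond to the options described above.

.. code-block:: cpp

   #include <random>

   // Hypothetical model of one buggify code section (not part of the FDB API).
   class BuggifySection {
   public:
       BuggifySection(double activatedPercent, double firedPercent)
         : activatedPercent_(activatedPercent), firedPercent_(firedPercent) {}

       // Returns true if the synthetic error should be injected on this pass.
       template <class Rng>
       bool shouldFire(Rng& rng) {
           if (!decided_) {
               // Stage 1: decided once, the first time the section is reached.
               activated_ = roll(rng, activatedPercent_);
               decided_ = true;
           }
           // Stage 2: rolled again on every pass through an activated section.
           return activated_ && roll(rng, firedPercent_);
       }

       bool activated() const { return decided_ && activated_; }

   private:
       // Returns true with a probability of `percent` percent.
       template <class Rng>
       static bool roll(Rng& rng, double percent) {
           std::uniform_real_distribution<double> dist(0.0, 100.0);
           return dist(rng) < percent;
       }

       double activatedPercent_;
       double firedPercent_;
       bool decided_ = false;
       bool activated_ = false;
   };

Under this model, the defaults of 25 and 25 mean that roughly a quarter of all sections are activated, and an activated section
fires on roughly a quarter of the passes through it.
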
################################
Simulation and Cluster Workloads
################################

FoundationDB comes with its own testing framework. Tests are implemented as workloads. A workload is nothing more than a class
that gets called by server processes running the ``tester`` role. Additionally, an ``fdbserver`` process can run a simulator that
simulates a full FoundationDB cluster with several machines and different configurations in one process. This simulator can run the same
workloads you can run on a real cluster. It will also inject random failures like network partitions and disk failures.

This tutorial explains how one can implement a workload, how one can orchestrate a workload on a cluster with multiple clients, and
how one can run a workload within a simulator. Running in a simulator is also useful as it does not require any setup: you can simply
run one command that will provide you with a fully functional FoundationDB cluster.

General Overview
================

Workloads in FoundationDB are generally compiled into the binary. However, FoundationDB also provides the ability to load workloads
dynamically. This is done through ``dlopen`` (on Unix-like operating systems) or ``LoadLibrary`` (on Windows).

Parallelism and Determinism
===========================

A workload can run either in a simulation or on a real cluster. In simulation, ``fdbserver`` will simulate a whole cluster and will
use a deterministic random number generator to simulate random behavior and random failures. This random number generator is initialized
with a random seed. In case of a test failure, the user can reuse the given seed and rerun the same test in order to further observe
and debug the behavior.

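The value of a fixed seed is easy to see outside of FoundationDB. The sketch below uses ``std::mt19937`` from the C++ standard
library as a stand-in for the simulator's generator (an assumption made for illustration; the simulator has its own generator),
and ``drawSequence`` is a hypothetical helper.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <random>
   #include <vector>

   // Draws `count` pseudo-random values from a generator seeded with `seed`.
   // The same seed always produces the same sequence, which is what makes a
   // failed run reproducible.
   std::vector<std::uint32_t> drawSequence(std::uint32_t seed, std::size_t count) {
       std::mt19937 rng(seed);
       std::vector<std::uint32_t> values;
       values.reserve(count);
       for (std::size_t i = 0; i < count; ++i) {
           values.push_back(static_cast<std::uint32_t>(rng()));
       }
       return values;
   }

Calling ``drawSequence`` twice with the same seed yields identical vectors. Every "random" decision derived from such a generator
therefore repeats exactly on a rerun, as long as nothing else in the test is non-deterministic.
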
However, this will only work as long as the workload doesn't introduce any non-deterministic behavior. One example of non-deterministic
behavior is running multiple threads.

The workload is created in the main network thread and it will run in the main network thread. Because of this, using any blocking
function (for example ``blockUntilReady`` on a future object) will result in a deadlock. Using the callback API is therefore required
if one wants to keep the simulator's deterministic behavior.

For existing applications and layers, however, not using the blocking API might not be an option. For these use cases, a user can choose
to start new threads and use the blocking API from within these threads. This will mean that test failures will be non-deterministic and
might be hard to reproduce.

To start a new thread, one has to "bind" operating system threads to their simulated processes. This can be done by setting the
``ProcessId`` in the child threads when they get created. In Java this is done by only starting new threads through the provided
``Executor``. In the C++ API one can use the ``FDBWorkloadContext`` to do that. For example:

.. code-block:: C++

   #include <thread>

   // Starts an OS thread that is bound to the calling workload's simulated process.
   template<class Fun>
   std::thread startThread(FDBWorkloadContext* context, Fun fun) {
       auto processId = context->getProcessID();
       return std::thread([context, processId, fun]() {
           // Bind this thread to the simulated process before running any user code.
           context->setProcessID(processId);
           fun();
       });
   }

Finding the Shared Object
=========================

When the test starts, ``fdbserver`` needs to find the shared object to load. The name of this shared object has to be provided.

For Java, we provide an implementation in ``libjava_workloads.so`` which can be built from the sources. The tester will look
for the key ``libraryName`` in the test file, which should be the name of the library without extension and without the ``lib``
prefix (so ``java_workloads`` if you want to write a Java workload).

By default, the process will look for the library in the directory ``../shared/foundationdb/`` relative to the location of the
``fdbserver`` binary. If the library is somewhere else on the system, one can provide the absolute path to the library (only
the folder, not the file name) in the test file with the ``libraryPath`` option.

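The lookup rules above can be sketched as a small path computation. This is an illustration only and not the code ``fdbserver``
uses: ``resolveWorkloadLibrary`` is a hypothetical helper, the ``.so`` extension is assumed (as on Linux), and "relative to the
location of the binary" is read as relative to the directory that contains the binary. The sketch requires C++17 for
``std::filesystem``.

.. code-block:: cpp

   #include <filesystem>
   #include <string>

   namespace fs = std::filesystem;

   // Hypothetical helper: builds the file path implied by the rules above.
   // `libraryPath` is the optional directory from the test file (empty if unset).
   fs::path resolveWorkloadLibrary(const fs::path& fdbserverBinary,
                                   const std::string& libraryName,
                                   const std::string& libraryPath = "") {
       // Default directory: ../shared/foundationdb/ next to the binary's directory.
       fs::path dir = libraryPath.empty()
           ? fdbserverBinary.parent_path() / ".." / "shared" / "foundationdb"
           : fs::path(libraryPath);
       // The test file omits the "lib" prefix and the extension (".so" assumed here).
       return (dir / ("lib" + libraryName + ".so")).lexically_normal();
   }

For a binary at ``/usr/sbin/fdbserver`` and ``libraryName=java_workloads``, this sketch yields
``/usr/shared/foundationdb/libjava_workloads.so``. With ``libraryPath`` set, only the directory part changes.
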
Implementing a C++ Workload
===========================

In order to implement a workload, one has to build a shared library that links against the fdb client library. This library has to
expose a function (with C linkage) called ``workloadFactory`` which needs to return a pointer to an object of type ``FDBWorkloadFactory``.
This mechanism allows the author to implement as many workloads within one library as desired. To do this, the pure virtual classes
``FDBWorkloadFactory`` and ``FDBWorkload`` have to be implemented.

.. function:: FDBWorkloadFactory* workloadFactory(FDBLogger*)

   This function has to be defined within the shared library and will be called by ``fdbserver`` to look up a specific workload.
   ``FDBLogger`` will be passed and is guaranteed to survive for the lifetime of the process. This class can be used to write to the
   FoundationDB traces. Logging anything with severity ``FDBSeverity::Error`` will result in a hard test failure. This function needs
   to have C linkage, so define it in an ``extern "C"`` block.

.. function:: std::shared_ptr<FDBWorkload> FDBWorkloadFactory::create(const std::string& name)

   This is the only method to be implemented in ``FDBWorkloadFactory``. If the test file contains a key-value pair ``workloadName``,
   the value will be passed to this method (empty string otherwise). This way, a library author can implement many workloads in one
   library and use the test file to choose which one to run (or run multiple workloads either concurrently or serially).

.. function:: std::string FDBWorkload::description() const

   This method has to return the name of the workload. This can be a static name and is primarily used for tracing.

.. function:: bool FDBWorkload::init(FDBWorkloadContext* context)

   This method is called right after the workload object has been created. It receives the ``FDBWorkloadContext`` of the workload.

.. function:: void FDBWorkload::setup(FDBDatabase* db, GenericPromise<bool> done)

   This method will be called by the tester during the setup phase. It should be used to populate the database.

.. function:: void FDBWorkload::start(FDBDatabase* db, GenericPromise<bool> done)

   This method should run the actual test.

.. function:: void FDBWorkload::check(FDBDatabase* db, GenericPromise<bool> done)

   This method will be called when the tester completes. A workload should run any consistency/correctness tests
   during this phase.

.. function:: void FDBWorkload::getMetrics(std::vector<FDBPerfMetric>& out) const

   If a workload collects metrics (like latencies or throughput numbers), these should be reported back here.
   The multitester (or test orchestrator) will collect all metrics from all test clients and aggregate them.

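The factory mechanism can be sketched as follows. To keep the example self-contained, it declares reduced stand-ins for
``FDBLogger``, ``FDBWorkload`` and ``FDBWorkloadFactory``; in a real workload library these come from the FoundationDB headers,
and ``FDBWorkload`` has the additional pure virtual methods listed above. ``ReadHeavyWorkload``, ``WriteHeavyWorkload`` and
``MyFactory`` are hypothetical names.

.. code-block:: cpp

   #include <memory>
   #include <string>

   // Stand-in declarations so that the sketch compiles on its own (assumption:
   // the real types are provided by the FoundationDB workload header).
   struct FDBLogger;
   struct FDBWorkload {
       virtual ~FDBWorkload() = default;
       virtual std::string description() const = 0;
   };
   struct FDBWorkloadFactory {
       virtual ~FDBWorkloadFactory() = default;
       virtual std::shared_ptr<FDBWorkload> create(const std::string& name) = 0;
   };

   // Two hypothetical workloads living in the same shared library.
   struct ReadHeavyWorkload : FDBWorkload {
       std::string description() const override { return "ReadHeavy"; }
   };
   struct WriteHeavyWorkload : FDBWorkload {
       std::string description() const override { return "WriteHeavy"; }
   };

   // The factory picks a workload based on the workloadName from the test file.
   struct MyFactory : FDBWorkloadFactory {
       std::shared_ptr<FDBWorkload> create(const std::string& name) override {
           if (name == "ReadHeavy") return std::make_shared<ReadHeavyWorkload>();
           if (name == "WriteHeavy") return std::make_shared<WriteHeavyWorkload>();
           return nullptr; // unknown name; how fdbserver reacts is not covered here
       }
   };

   // Entry point looked up by fdbserver; C linkage keeps the symbol name unmangled.
   extern "C" FDBWorkloadFactory* workloadFactory(FDBLogger*) {
       static MyFactory factory; // lives for the lifetime of the process
       return &factory;
   }

A test file would then select one of the two workloads with ``workloadName=ReadHeavy`` or ``workloadName=WriteHeavy``.
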
Implementing a Java Workload
============================

In order to implement your own workload in Java you can simply create an implementation of the abstract class ``AbstractWorkload``.
A minimal implementation will look like this:

.. code-block:: java

   package my.package;

   import com.apple.foundationdb.Database;
   import com.apple.foundationdb.testing.AbstractWorkload;
   import com.apple.foundationdb.testing.Promise;
   import com.apple.foundationdb.testing.WorkloadContext;

   class MinimalWorkload extends AbstractWorkload {
       public MinimalWorkload(WorkloadContext ctx) {
           super(ctx);
       }

       @Override
       public void setup(Database db, Promise promise) {
           log(20, "WorkloadSetup", null);
           promise.send(true);
       }

       @Override
       public void start(Database db, Promise promise) {
           log(20, "WorkloadStarted", null);
           promise.send(true);
       }

       @Override
       public void check(Database db, Promise promise) {
           log(20, "WorkloadFailureCheck", null);
           // Report the result of the check through the promise.
           promise.send(true);
       }
   }

The lifecycle of a test will look like this:

1. All testers will create an instance of the ``AbstractWorkload`` implementation.
2. All testers will (in parallel but not guaranteed exactly at the same time) call
   ``setup`` and they will wait for all of them to finish. This phase can be used to
   pre-populate data.
3. All testers will then call ``start`` (again, in parallel) and wait for all of them to
   finish.
4. All testers will then call ``check`` and use the resulting boolean
   to determine whether the test succeeded.

All these methods take a ``Database`` object as an argument. This object can be used
to create and execute transactions against the cluster.

When implementing workloads, an author has to follow these rules:

- To write to the trace files one should use ``AbstractWorkload.log``. This
  method takes three arguments: an integer for severity (5 means debug, 10 means log,
  20 means warning, 30 means warn always, and 40 is a severe error), a message string,
  and a third argument for additional details (``null`` in the example above). If any tester
  logs something of severity 40, the test run is considered to have failed.
- In order to increase throughput on the cluster, an author might want to spawn several
  threads. However, threads *MUST* only be spawned through the ``Executor`` instance one
  can get from ``AbstractWorkload.getExecutor()``. Otherwise, a simulation test will
  probably segfault. The reason for this is that we need to keep track of which simulated
  machine a thread corresponds to internally.

Within a workload you have access to the ``WorkloadContext`` which provides additional
information about the current execution environment. The context can be accessed through
``this.context`` and provides the following methods:

- ``String getOption(String name, String defaultValue)``. A user can provide parameters to workloads
  through a configuration file (explained further down). These parameters are provided to
  all clients through the context and can be accessed with this method.
- ``int getClientId()`` and ``int getClientCount()``. An author can determine how many
  clients are running in the cluster, and each of those will get a globally unique ID (a number
  between 0 and clientCount - 1). This is useful, for example, if you want to generate transactions
  that are guaranteed to not conflict with transactions from other clients.
- ``int getSharedRandomNumber()``. At startup a random number will be generated. This will allow for
  generating the same random numbers across several machines if this number is used as a seed.

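A common use of the client ID and client count is to partition the key space so that clients never touch each other's keys.
The sketch below shows two ways to do that. It is written in C++ to match the other examples in this document and takes the two
values as plain integers, so it does not depend on any FoundationDB API; ``keyForClient``, ``ownsIndex`` and the key layout are
hypothetical.

.. code-block:: cpp

   #include <cstdio>
   #include <string>

   // Hypothetical helper: returns a key that only the client with `clientId` writes.
   // Zero-padding keeps the keys of one client contiguous and, for non-negative
   // values that fit the padded width, sorted by index.
   std::string keyForClient(int clientId, int index) {
       char buffer[64];
       std::snprintf(buffer, sizeof(buffer), "workload/client%04d/key%08d", clientId, index);
       return std::string(buffer);
   }

   // Alternative: split a shared index space so that each index has exactly one
   // owner among the clients numbered 0 .. clientCount - 1.
   bool ownsIndex(int index, int clientId, int clientCount) {
       return index % clientCount == clientId;
   }

Because the client ID is part of every key in the first variant, two clients can never write the same key, so their transactions
do not conflict on these keys.
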
Running a Workload in the Simulator
===================================

We'll first walk through how one can run a workload in a simulator. FoundationDB already comes with a large number
of workloads. Some of them can't be run in simulation, while others don't work on a real cluster. Most
will work on both, though. For examples of how these can be run, you can find configuration files in
the ``tests`` directory in the FoundationDB source tree.

We will now go through an example of how you can write a relatively complex test and run it in the simulator.
Writing and running tests in the simulator is a simple two-step process.

1. Write the test.
2. Run ``fdbserver`` in simulation mode and provide it with the test file.

Write the Test
--------------

A workload is not a test. A test is a simple test file that tells the test orchestrator which workloads it
should run and in which order. Additionally, one can provide parameters to workloads through this file.

A test file might look like this:

.. code-block:: none

   testTitle=MyTest
   testName=External
   libraryName=java_workloads
   workloadName=my.package.MinimalWorkload
   classPath=PATH_TO_JAR_OR_DIR_CONTAINING_WORKLOAD,OTHER_DEPENDENCIES

   testName=Attrition
   testDuration=5.0
   reboot=true
   machinesToKill=3

   testTitle=AnotherTest
   testName=External
   libraryName=java_workloads
   workloadName=my.package.MinimalWorkload
   classPath=PATH_TO_JAR_OR_DIR_CONTAINING_WORKLOAD,OTHER_DEPENDENCIES
   someOption=foo

   testName=External
   libraryName=java_workloads
   workloadName=my.package.AnotherWorkload
   classPath=PATH_TO_JAR_OR_DIR_CONTAINING_WORKLOAD,OTHER_DEPENDENCIES
   anotherOption=foo

This test will do the following:

1. First it will run ``MinimalWorkload`` without any parameter.
2. After 5.0 seconds the simulator will reboot 3 random machines. This is what ``Attrition`` does;
   this workload is provided by FoundationDB and is one of the few workloads that only
   work in the simulator.
3. When all workloads are finished, it will run ``MinimalWorkload``
   again. This time it will have the option ``someOption`` set to
   ``foo``. Additionally it will run ``AnotherWorkload`` in parallel.

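The structure of the example file can be made explicit with a small parser. This is a sketch of the layout as it appears above,
not the parser used by ``fdbserver``: it assumes that ``testTitle`` starts a new test, that ``testName`` starts a new workload
within the current test, and that every other ``key=value`` line is an option of the most recent workload. ``TestSpec``,
``WorkloadSpec`` and ``parseTestFile`` are hypothetical names.

.. code-block:: cpp

   #include <map>
   #include <sstream>
   #include <string>
   #include <vector>

   struct WorkloadSpec {
       std::string testName;                       // value of testName
       std::map<std::string, std::string> options; // remaining key=value pairs
   };
   struct TestSpec {
       std::string title;                   // value of testTitle
       std::vector<WorkloadSpec> workloads; // workloads that belong to this test
   };

   // Hypothetical parser for the layout shown above. It does not trim whitespace.
   std::vector<TestSpec> parseTestFile(const std::string& text) {
       std::vector<TestSpec> tests;
       std::istringstream in(text);
       std::string line;
       while (std::getline(in, line)) {
           auto eq = line.find('=');
           if (eq == std::string::npos) continue; // blank or malformed line
           std::string key = line.substr(0, eq);
           std::string value = line.substr(eq + 1);
           if (key == "testTitle") {
               tests.push_back(TestSpec{ value, {} });
           } else if (tests.empty()) {
               continue; // option outside of any test; ignored in this sketch
           } else if (key == "testName") {
               tests.back().workloads.push_back(WorkloadSpec{ value, {} });
           } else if (!tests.back().workloads.empty()) {
               tests.back().workloads.back().options[key] = value;
           }
       }
       return tests;
   }

Applied to the example file, this yields two tests with two workloads each, which matches the description above: workloads
that share a ``testTitle`` belong to the same test, and the tests themselves run one after the other.
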
How to set the Class Path correctly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As you can see from the example above, we can set the classpath through two different mechanisms. However, one has
to be careful as they can't be used interchangeably.

- You can set a class path through the JVM argument ``-Djava.class.path=...``. This is how you have to pass the
  path to the FoundationDB client library (as the client library is needed during the initialization phase). However,
  only the first specified section will have any effect, as the other workloads will run in the same VM (and arguments,
  by nature, can only be passed once).
- The ``classPath`` option. This option will add all paths (directories or JAR files) to the classpath of the JVM
  while it is running. Not being able to add a path will result in a test failure. This is useful to add different
  dependencies to different workloads. A path can appear more than once across sections. However, the paths must not
  conflict with each other, as we never remove anything from the classpath.

Run the simulator
-----------------

This step is very simple. You can simply run ``fdbserver`` with the ``simulation`` role
and pass the test with ``-f``:

.. code-block:: sh

   fdbserver -r simulation -f testfile.txt

Running a Workload on an actual Cluster
=======================================

Running a workload on a cluster works basically the same way. However, one must
actually set up a cluster first. This cluster must run one or more server
processes with the class ``test``. So the two-step process above becomes a bit more complex:

1. Write the test (same as above).
2. Set up a cluster with as many test clients as you want.
3. Run the orchestrator to actually execute the test.

Step 1 is explained above. For step 2, please refer to the general FoundationDB
configuration. The main difference from a normal FoundationDB cluster is that some processes
must have a test class assigned to them. This can be done in the ``foundationdb.conf``. For
example, this file would create a server with 8 processes, of which 4 would act as test clients.

.. code-block:: ini

   [fdbmonitor]
   user = foundationdb
   group = foundationdb

   [general]
   restart-delay = 60
   cluster-file = /etc/foundationdb/fdb.cluster

   ## Default parameters for individual fdbserver processes
   [fdbserver]
   command = /usr/sbin/fdbserver
   public-address = auto:$ID
   listen-address = public
   datadir = /var/lib/foundationdb/data/$ID
   logdir = /var/log/foundationdb

   [fdbserver.4500]
   [fdbserver.4501]
   [fdbserver.4502]
   [fdbserver.4503]
   [fdbserver.4510]
   class = test
   [fdbserver.4511]
   class = test
   [fdbserver.4512]
   class = test
   [fdbserver.4513]
   class = test

Running the actual test can be done with ``fdbserver`` as well. For this you can call the process
with the ``multitest`` role:

.. code-block:: sh

   fdbserver -r multitest -f testfile.txt

This command will block until all tests are completed.

##########
API Tester
##########

Introduction
============

API tester is a framework for implementing end-to-end tests of the FDB C API, i.e. testing the API on a real
FDB cluster through all layers of the FDB client. Its executable is ``fdb_c_api_tester``, and the source
code is located in ``bindings/c/test/apitester``. The structure of API tests is similar to that of the
simulation tests. The tests are implemented as workloads using the FDB API, which are all built into
``fdb_c_api_tester``. A concrete test configuration is defined as a TOML file, which specifies the
combination of workloads to be executed by the test together with their parameters. The test can then be
executed by passing the TOML file as a parameter to ``fdb_c_api_tester``.

Since simulation tests rely on the actor model to execute the tests deterministically in single-threaded
mode, they are not suitable for testing various multi-threaded aspects of the FDB client. End-to-end API
tests complement the simulation tests by testing the FDB client layers above the single-threaded Native
Client.

The specific testing goals of the end-to-end tests are:

- Checking functional correctness of the Multi-Version Client (MVC) and Thread-Safe Client.
- Detecting race conditions. They can be caused by accessing the state of the Native Client from wrong
  threads or introducing other shared state without proper synchronization.
- Detecting memory management errors. Thread-safe reference counting must be used where necessary. MVC
  works with multiple client libraries. Memory allocated by one client library must also be deallocated
  by the same library.
- Maintaining interoperability with other client versions. The client functionality is made available
  depending on the selected API version. The API changes are correctly adapted.
- Verifying that the client API behaves correctly in case of cluster upgrades. Database and transaction state is correctly
  migrated to the upgraded connections. Pending operations are canceled and successfully retried on the
  upgraded connections.

Implementing a Workload
|
|
=======================
|
|
|
|
Each workload is declared as a direct or indirect subclass of ``WorkloadBase`` implementing a constructor
|
|
with ``WorkloadConfig`` as a parameter and the method ``start()``, which defines the entry point of the
|
|
workload.
|
|
|
|
``WorkloadBase`` provides a set of methods that serve as building blocks for implementation of a workload:
|
|
|
|
.. function:: execTransaction(start, cont, failOnError = true)
|
|
|
|
creates and executes an FDB transaction. Here ``start`` is a function that takes a transaction context
|
|
as parameter and implements the starting point of the transaction, and ``cont`` is a function implementing
|
|
a continuation to be executed after finishing the transaction execution. Transactions are automatically
|
|
retried on retryable errors. Transactions are retried by calling the ``start`` function again. In case
|
|
of a fatal error, the entire workload is considered as failed unless ``failOnError`` is set to ``false``.
|
|
|
|
.. function:: schedule(task)
|
|
|
|
schedules a task for asynchronous execution. It is usually used in the continuations to schedule
|
|
the next step of the workload.
|
|
|
|
.. function:: info(msg)
|
|
error(msg)
|
|
|
|
are used for logging a message with a tag identifying the workload. Issuing an error message marks
|
|
the workload as failed.
|
|
|
|
The transaction context provides methods for implementation of the transaction logics:
|
|
|
|
.. function:: tx()
|
|
|
|
the reference to the FDB transaction object
|
|
|
|
.. function:: continueAfter(future, cont, retryOnError = true)

    sets a continuation to be executed when the future is ready. The ``retryOnError`` flag controls whether
    the transaction should be automatically retried in case the future results in a retryable error.

.. function:: continueAfterAll(futures, cont)

    takes a vector of futures and sets a continuation to be executed when all of the futures are ready.
    The transaction is retried if at least one of the futures results in an error. This method is useful
    for handling multiple concurrent reads.

.. function:: commit()

    commits and finishes the transaction. If the commit is successful, the execution proceeds to the
    continuation of ``execTransaction()``. In case of a retryable error the transaction is
    automatically retried. A fatal error results in a failure of the workload.

.. function:: done()

    finishes the transaction without committing. This method should be used to finish read transactions.
    The transaction gets destroyed and execution proceeds to the continuation of ``execTransaction()``.
    Each transaction must be finished either by ``commit()`` or ``done()``, because otherwise
    the framework considers that the transaction is still being executed, so it won't destroy it and
    won't call the continuation.

.. function:: onError(err)

    handles an error: restarts the transaction in case of a retryable error, otherwise fails the workload.
    This method is typically used in the continuation of ``continueAfter`` called with
    ``retryOnError=false`` as a fallback to the default error handling.

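The "wait for all futures" behaviour of ``continueAfterAll`` can be pictured as a shared countdown that
fires the continuation when the last future becomes ready. The sketch below is a simplified model under
stated assumptions: ``ToyFuture`` and ``afterAll`` are hypothetical names, and error handling and retries
are left out entirely.

.. code-block:: C++

    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <utility>
    #include <vector>

    // Hypothetical minimal future: stores one callback and fires it on completion.
    struct ToyFuture {
        std::function<void()> callback;
        void onReady(std::function<void()> cb) { callback = std::move(cb); }
        void complete() {
            if (callback) {
                callback();
            }
        }
    };

    // Runs cont exactly once, after every future in the vector has completed.
    inline void afterAll(std::vector<ToyFuture>& futures, std::function<void()> cont) {
        auto remaining = std::make_shared<std::size_t>(futures.size());
        if (*remaining == 0) {
            cont(); // nothing to wait for
            return;
        }
        for (auto& f : futures) {
            // Every callback shares the same counter; the last one to fire runs cont.
            f.onReady([remaining, cont] {
                if (--*remaining == 0) {
                    cont();
                }
            });
        }
    }

The completion order of the futures does not matter here; only the last completion triggers the
continuation.
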
A workload execution ends automatically when it is marked as failed or its last continuation does not
schedule any new task or transaction.

The workload class should be defined in the namespace ``FdbApiTester``. The file name convention is
``Tester{Name}Workload.cpp`` so that we distinguish them from the source files of simulation workloads.

Basic Workload Example
======================

The code below implements a workload that consists of only two transactions. The first one sets a
randomly generated key to a randomly generated value, and the second one reads the key and checks if
the returned value matches the written one.

.. literalinclude:: ../../../bindings/c/test/apitester/TesterExampleWorkload.cpp
    :language: C++
    :lines: 21-

The workload is implemented in the method ``setAndGet``. It generates a random key and a random value
and executes a transaction that writes that key-value pair and commits. In the continuation of the
first ``execTransaction`` call, we execute the second transaction that reads the same key. The read
operation returns a future, so we call ``continueAfter`` to set a continuation for that future. In the
continuation we check if the returned value matches the written one and finish the transaction by
calling ``ctx->done()``. After completing the second transaction we execute the continuation passed
as parameter to the ``setAndGet`` method by the ``start`` method. In this case it is ``NO_OP_TASK``, which
does nothing and so finishes the workload.

Finally, we declare an instance of ``WorkloadFactory`` to register this workload with the name ``SetAndGet``.

Note that we use ``workloadId`` as a key prefix. This is necessary for isolating the key space of this
workload, because the framework may be instructed to create multiple instances of the ``SetAndGet``
workload. If we do not isolate the key space, another workload can write a different value for the
same key and so break the assumption of the test.

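To see why the prefix matters, the toy model below lets two workload instances write the same logical key
into one shared map. ``Store``, ``prefixedKey`` and ``writeAs`` are hypothetical helpers for this
illustration only; the real workload builds its prefix from ``workloadId`` as shown in the example source.

.. code-block:: C++

    #include <map>
    #include <string>

    // Stand-in for the shared database key space (assumption for illustration).
    using Store = std::map<std::string, std::string>;

    // Builds "<workloadId>/<key>" so that each instance owns a disjoint key range.
    inline std::string prefixedKey(const std::string& workloadId, const std::string& key) {
        return workloadId + "/" + key;
    }

    // Writes a value on behalf of one workload instance.
    inline void writeAs(Store& db, const std::string& workloadId,
                        const std::string& key, const std::string& value) {
        db[prefixedKey(workloadId, key)] = value;
    }

With the prefix in place both writes survive side by side, so each instance still reads back the value
it wrote.
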
The workload is implemented using the internal C++ API, implemented in ``fdb_api.hpp``. It introduces
a set of classes representing the FDB objects (transactions, futures, etc.). These classes provide C++-style
methods wrapping FDB C API calls and automate memory management by means of reference counting.

Implementing Control Structures
===============================

Our basic workload executes just 2 transactions, but in practice we want to have workloads that generate
multiple transactions. The following code demonstrates how we can modify our basic workload to generate
multiple transactions in a loop.

.. code-block:: C++

    class SetAndGetWorkload : public WorkloadBase {
    public:
        ...
        int numIterations;
        int iterationsLeft;

        SetAndGetWorkload(const WorkloadConfig& config) : WorkloadBase(config) {
            keyPrefix = fdb::toBytesRef(fmt::format("{}/", workloadId));
            numIterations = config.getIntOption("numIterations", 1000);
        }

        void start() override {
            iterationsLeft = numIterations;
            setAndGetLoop();
        }

        void setAndGetLoop() {
            if (iterationsLeft == 0) {
                return;
            }
            iterationsLeft--;
            setAndGet([this]() { setAndGetLoop(); });
        }
        ...
    };

We introduce a workload parameter ``numIterations`` to specify the number of iterations. If not specified
in the test configuration it defaults to 1000.

The method ``setAndGetLoop`` implements the loop that decrements the ``iterationsLeft`` counter until it
reaches 0. Each iteration calls ``setAndGet`` with a continuation that returns the execution to the loop.
As you can see, we don't need any change in ``setAndGet``; we just call it with another continuation.

The pattern of passing a continuation as a parameter can also be used to decompose the workload into a
sequence of steps. For example, we can introduce ``setup`` and ``cleanup`` steps to our workload and modify
``setAndGetLoop`` to make it composable with an arbitrary continuation:

.. code-block:: C++

    void start() override {
        setup([this](){
            iterationsLeft = numIterations;
            setAndGetLoop([this](){
                cleanup(NO_OP_TASK);
            });
        });
    }

    void setAndGetLoop(TTaskFct cont) {
        if (iterationsLeft == 0) {
            // All iterations are done: hand over to the next step and leave the loop
            schedule(cont);
            return;
        }
        iterationsLeft--;
        setAndGet([this, cont]() { setAndGetLoop(cont); });
    }

    void setup(TTaskFct cont) { ... }

    void cleanup(TTaskFct cont) { ... }

Note that we call ``schedule(cont)`` in ``setAndGetLoop`` instead of calling the continuation directly.
In this way we avoid keeping ``setAndGetLoop`` in the call stack when executing the next step.

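The effect of scheduling a continuation instead of calling it directly can be demonstrated with a small
standalone model. ``ToyScheduler``, ``loopStep`` and ``maxDepthFor`` are hypothetical names invented for
this sketch; the real framework scheduler is more involved, and the sketch only tracks how deeply the loop
steps nest.

.. code-block:: C++

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>

    using Task = std::function<void()>;

    // Toy single-threaded scheduler (assumption: a plain FIFO queue of tasks).
    struct ToyScheduler {
        std::queue<Task> tasks;
        int depth = 0;    // current nesting of loop steps
        int maxDepth = 0; // deepest nesting observed so far

        void schedule(Task t) { tasks.push(std::move(t)); }

        void run() {
            while (!tasks.empty()) {
                Task t = std::move(tasks.front());
                tasks.pop();
                t();
            }
        }
    };

    // Performs one loop step with `left` steps remaining. With useSchedule the next
    // step is queued and this frame returns first; otherwise the next step is
    // called directly and the frames nest.
    inline void loopStep(ToyScheduler& s, int left, bool useSchedule) {
        ++s.depth;
        s.maxDepth = std::max(s.maxDepth, s.depth);
        if (left > 0) {
            if (useSchedule) {
                s.schedule([&s, left, useSchedule] { loopStep(s, left - 1, useSchedule); });
            } else {
                loopStep(s, left - 1, useSchedule);
            }
        }
        --s.depth;
    }

    // Runs the whole loop and reports the deepest nesting that occurred.
    inline int maxDepthFor(int iterations, bool useSchedule) {
        ToyScheduler s;
        s.schedule([&s, iterations, useSchedule] { loopStep(s, iterations, useSchedule); });
        s.run();
        return s.maxDepth;
    }

In this model the scheduled variant never nests more than one step deep, while the direct-call variant
nests one frame per iteration.
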
Subclassing ApiWorkload
=======================

``ApiWorkload`` is an abstract subclass of ``WorkloadBase`` that provides a framework for a typical
implementation of API test workloads. It implements a workflow consisting of cleaning up the key space
of the workload, populating it with newly generated data and then running a loop consisting of random
database operations. The concrete subclasses of ``ApiWorkload`` are expected to override the method
``randomOperation`` with an implementation of concrete random operations.

The ``ApiWorkload`` maintains a local key-value store that mirrors the part of the database state
relevant to the workload. A successful database write operation should be followed by a continuation
that performs equivalent changes in the local store, and the results of a database read operation should
be validated against the values from the local store.

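The mirroring idea can be sketched independently of the framework. The ``LocalStore`` class and the
``matchesLocal`` helper below are hypothetical and do not reproduce the store used by ``ApiWorkload``;
they only illustrate applying writes locally and comparing a read result against the local expectation.

.. code-block:: C++

    #include <map>
    #include <optional>
    #include <string>

    // Toy mirror of the key-value pairs a workload has written (requires C++17).
    class LocalStore {
    public:
        void set(const std::string& key, const std::string& value) { data_[key] = value; }
        void clear(const std::string& key) { data_.erase(key); }

        // Returns the mirrored value, or std::nullopt if the key is absent.
        std::optional<std::string> get(const std::string& key) const {
            auto it = data_.find(key);
            if (it == data_.end()) {
                return std::nullopt;
            }
            return it->second;
        }

    private:
        std::map<std::string, std::string> data_;
    };

    // Compares a value read from the database with the local expectation.
    // An absent key is represented by std::nullopt on both sides.
    inline bool matchesLocal(const LocalStore& store, const std::string& key,
                             const std::optional<std::string>& dbValue) {
        return store.get(key) == dbValue;
    }

A workload following this pattern would call ``set`` or ``clear`` in the continuation of a successful
write transaction and ``matchesLocal`` in the continuation of a read.
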
Test Configuration
==================

A concrete test configuration is specified by a TOML file. The file must contain one ``[[test]]`` section
specifying the general settings for test execution followed by one or more ``[[test.workload]]``
configuration sections, specifying the workloads to be executed and their parameters. The specified
workloads are started all at once and executed concurrently.

The ``[[test]]`` section can contain the following options:

- ``title``: descriptive title of the test
- ``multiThreaded``: enable multi-threading (default: false)
- ``minFdbThreads`` and ``maxFdbThreads``: the number of FDB (network) threads to be randomly selected
  from the given range (default: 1-1). Used only if ``multiThreaded=true``. It is also important to use
  multiple database instances to make use of the multithreading.
- ``minDatabases`` and ``maxDatabases``: the number of database instances to be randomly selected from
  the given range (default: 1-1). The transactions of all workloads are randomly load-balanced over the
  pool of database instances.
- ``minClients`` and ``maxClients``: the number of clients, i.e. instances of each workload, to be
  randomly selected from the given range (default: 1-8).
- ``minClientThreads`` and ``maxClientThreads``: the number of client threads, i.e. the threads used
  for execution of the workload, to be randomly selected from the given range (default: 1-1).
- ``blockOnFutures``: use blocking waits on futures instead of scheduling future callbacks asynchronously
  (default: false)
- ``buggify``: enable client-side failure injection (default: false)
- ``databasePerTransaction``: create a separate database instance for each transaction (default: false).
  It is a special mode useful for testing bugs related to creation and destruction of database instances.
- ``fdbCallbacksOnExternalThreads``: enables the option ``FDB_NET_OPTION_CALLBACKS_ON_EXTERNAL_THREADS``,
  causing the callbacks of futures to be executed directly on the threads of the external FDB clients
  rather than on the thread of the local FDB client.

The workload section ``[[test.workload]]`` must contain the attribute ``name`` matching the registered name
of the workload to be executed. Other options are workload-specific.

The subclasses of the ``ApiWorkload`` inherit the following configuration options:

- ``minKeyLength`` and ``maxKeyLength``: the size range of randomly generated keys (default: 1-64)
- ``minValueLength`` and ``maxValueLength``: the size range of randomly generated values
  (default: 1-1000)
- ``maxKeysPerTransaction``: the maximum number of keys per transaction (default: 50)
- ``initialSize``: the number of key-value pairs in the initially populated database (default: 1000)
- ``readExistingKeysRatio``: the probability of choosing an existing key for read operations
  (default: 0.9)
- ``numRandomOperations``: the number of random operations to be executed per workload (default: 1000)
- ``runUntilStop``: run the workload indefinitely until the stop command is received (default: false).
  This execution mode is used in upgrade tests and other scripted tests, where the workload needs to
  be generated continuously until completion of the scripted test.
- ``numOperationsForProgressCheck``: the number of operations to be performed to confirm a progress
  check (default: 10). This option is used in combination with ``runUntilStop``. Progress checks are
  initiated by a test script to check if the client workload is successfully progressing after a
  cluster change.

The FDB server configuration can be specialized in the section ``[[server]]``:

- ``tls_enabled``: enable TLS (default: false)
- ``tls_client_chain_len``: the length of the client-side TLS chain (default: 2)
- ``tls_server_chain_len``: the length of the server-side TLS chain (default: 3)
- ``min_num_processes`` and ``max_num_processes``: the number of FDB server processes to be
  randomly selected from the given range (default: 1-3)

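Putting the sections together, a test file for the looping ``SetAndGet`` workload might look roughly like
the sketch below. It uses only option names listed above; the title and all values are arbitrary, and
``numIterations`` assumes the workload reads that option as in the loop example.

.. code-block:: toml

    # General test settings (illustrative values)
    [[test]]
    title = 'SetAndGet example'
    multiThreaded = true
    minFdbThreads = 2
    maxFdbThreads = 4
    minDatabases = 2
    maxDatabases = 4
    minClients = 1
    maxClients = 4

    # One workload; the name must match the registered workload name
    [[test.workload]]
    name = 'SetAndGet'
    numIterations = 100

    # Local cluster settings
    [[server]]
    tls_enabled = false
    min_num_processes = 1
    max_num_processes = 3

Since multi-threading is enabled in this sketch, more than one database instance is configured as well,
in line with the note on ``minFdbThreads`` and ``maxFdbThreads``.
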
Executing the Tests
===================

The ``fdb_c_api_tester`` executable takes a single TOML file as a parameter and executes the test
according to its specification. Before that we must create an FDB cluster and pass its cluster file as
a parameter to ``fdb_c_api_tester``. Note that multithreaded tests also need to be provided with an
external client library.

The ``run_c_api_tests.py`` script automates execution of the API tests on a local cluster. The cluster
is created according to the options specified in the ``[[server]]`` section of the given test file.

.. code-block:: bash

    ${srcDir}/bindings/c/test/apitester/run_c_api_tests.py \
        --build-dir ${buildDir} \
        --api-tester-bin ${buildDir}/bin/fdb_c_api_tester \
        --external-client-library ${buildDir}/bindings/c/libfdb_c_external.so \
        --test-file ${srcDir}/bindings/c/test/apitester/tests/CApiCorrectnessMultiThr.toml

The test specifications added to the ``bindings/c/test/apitester/tests/`` directory are executed as a part
of the regression test suite as ``ctest`` targets with names ``fdb_c_api_test_{file_name}``.

The ``ctest`` targets provide a more convenient way for executing the API tests. We can execute
a single test:

.. code-block:: bash

    ctest -R fdb_c_api_test_CApiCorrectnessMultiThr -VV

or execute all of them in parallel (here ``-j20`` specifies the parallelization level):

.. code-block:: bash

    ctest -R fdb_c_api_test_ -j20 --output-on-failure

More sophisticated filters can be applied to execute a selected set of tests, e.g. the tests using TLS:

.. code-block:: bash

    ctest -R 'fdb_c_api_test_.*TLS' -j20 --output-on-failure