From dfb20ba87fa3fe794bc0cfbc6690e5ee59b52085 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 15 Apr 2026 11:55:57 +0200
Subject: [PATCH] doc: write details of known issues

---
 doc/book/reference-manual/known-issues.md | 156 +++++++++++++++++++++-
 1 file changed, 152 insertions(+), 4 deletions(-)

diff --git a/doc/book/reference-manual/known-issues.md b/doc/book/reference-manual/known-issues.md
index bd2c3319..3a825db3 100644
--- a/doc/book/reference-manual/known-issues.md
+++ b/doc/book/reference-manual/known-issues.md
@@ -3,20 +3,131 @@ title = "Known issues"
 weight = 80
 +++
 
+Issues in each section are roughly sorted by order of decreasing impact, based on actual reports from users.
 
 ## Architectural limitations
 
 Issues that are caused by design decisions of Garage internals, and that can't
 be fixed without major architectural changes in the codebase.
 
-### Buckets are not sharded
+### Metadata performance issues with many objects
+
+**Related issues:**
+
+- [#851 - Performances collapse with 10 millions pictures in a bucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/851)
+- [#1222 - Cluster Setup Write Performance Degraded After Writing 10 Million Object (200-300Kb per object)](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1222)
 
 ### Very big objects cause performance degradation
 
-### No locking (`if-none-match`, ...)
+For each object, there is a single metadata entry called a `Version` that
+contains a list of all of the data blocks in the object. For very big objects,
+this entry can contain thousands of block references. During the upload of
+an object, this metadata entry needs to be read, deserialized, reserialized and
+written for each individual data block uploaded. This means that the
+complexity of an upload is `O(n²)` in the number of blocks needed.
+
+This manifests as excessive metadata I/O and CPU usage, with uploads eventually stalling.
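The quadratic growth described above can be made concrete with a small back-of-the-envelope model. This is only a sketch under a simplifying assumption: each uploaded block triggers exactly one rewrite of the whole `Version` entry, and the cost of a rewrite is proportional to the number of block references it holds at that point. The function name `version_rewrite_cost` and the object and block sizes are hypothetical and chosen for illustration; they are not taken from Garage's code.

```python
def version_rewrite_cost(object_size: int, block_size: int) -> tuple[int, int]:
    """Return (number of blocks, total block references rewritten).

    Simplified model (an assumption, not Garage's actual code): after the
    k-th block is uploaded, the whole `Version` entry, which then holds k
    block references, is read and written back once.
    """
    n_blocks = -(-object_size // block_size)  # ceiling division
    total_refs = n_blocks * (n_blocks + 1) // 2  # 1 + 2 + ... + n
    return n_blocks, total_refs


MiB = 1024 ** 2
GiB = 1024 ** 3

# Compare two illustrative block sizes for the same 10 GiB object.
for block_size in (1 * MiB, 10 * MiB):
    n, refs = version_rewrite_cost(10 * GiB, block_size)
    print(f"block_size={block_size // MiB} MiB: {n} blocks, {refs} references rewritten")
```

In this model, a tenfold larger block size gives ten times fewer blocks but roughly a hundred times fewer rewritten references, which is why the number of blocks is the quantity to reduce.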
+
+**Mitigation:** Increase the `block_size` configuration parameter to reduce the
+number of blocks. Make sure multipart uploads use chunks that are at least
+`block_size` in size, and that are an exact multiple of `block_size` to avoid
+the creation of smaller blocks.
+
+**Long-term solution:** An architectural change in the metadata system would be
+required to store block lists in many independent metadata entries instead of
+one single big entry per object.
+
+**Related issues:**
+
+- [#662 - Large Files fail to upload](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/662)
+- [#1366 - High CPU usage and performance degradation during long multipart uploads](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1366)
+
+### No conditional writes / locking / WORM support (`if-none-match`, ...)
+
+This is structurally impossible to implement in Garage due to the lack of a consensus algorithm,
+a core design choice of Garage that we cannot reconsider.
+
+A semi-working, *unsafe* implementation of WORM and object locking could be
+implemented, with the following constraint: only after the completion of the
+first write (in case of WORM) or the setting of a lock (for object lock) can we
+guarantee that the object cannot be overwritten. In the case where an overwrite
+request arrives at the same time as the initial request to write or to lock
+the object, we cannot implement a safe and consistent way to reject it. This
+means that many practical use-cases for `if-none-match` cannot be supported
+(e.g. using it to implement mutual exclusion between concurrent writers).
+
+**Related issues:**
+
+- [#1052 - Support conditional writes](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1052)
+- [#1127 - Feature Request: WORM (Write Once Read Many) / Object Lock Support](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1127)
 
 ### `CreateBucket` race condition
 
+Also due to the lack of a consensus algorithm, there is no mutual exclusion
+between concurrent `CreateBucket` requests using the same bucket name.
+
+**Related issues:**
+
+- [#649 - Race condition in CreateBucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/649)
+
+### Metadata and data have the same replication factor
+
+There is a single `replication_factor` in the configuration file that applies both to data blocks and metadata entries.
+This makes clusters with `replication_factor = 1` particularly vulnerable in cases of metadata corruption (see below), as there
+is a single copy of the metadata for each object even in multi-node clusters.
+
+**Mitigation:** Do not use `replication_factor = 1`.
+
+**Long-term solution:** We want to allow scenarios such as replicating the
+metadata on 2, 3 or more nodes and the data on only 1 or 2 nodes (for example),
+so that the metadata can benefit from better redundancy without increasing the
+storage costs for the entire dataset. This will require significant changes
+in the codebase.
+
+**Related issues:**
+
+- [#720 - Separate replication modes for metadata/data](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/720)
+
+### Node count limitation
+
+Garage will have issues in clusters with too many nodes: it will not be able to
+spread data uniformly among nodes, and some nodes will fill up faster than
+others. This starts to manifest when the number of nodes is bigger than
+`10 × replication_factor`. This is due to the fact that Garage uses only 256
+partitions internally.
+
+**Mitigation:** Build clusters with fewer, bigger nodes.
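To see where the `10 × replication_factor` threshold above might come from, the following sketch computes how many partition copies an average node holds. It assumes, for illustration only, that every one of the 256 partitions is stored on `replication_factor` nodes and that all nodes have equal capacity; the helper name `partition_copies_per_node` is hypothetical and does not exist in Garage.

```python
PARTITIONS = 256  # number of partitions stated in the text above


def partition_copies_per_node(n_nodes: int, replication_factor: int) -> float:
    """Average number of partition copies held by one node, assuming every
    partition is stored on `replication_factor` nodes of equal capacity."""
    return PARTITIONS * replication_factor / n_nodes


for n_nodes in (6, 30, 90):
    avg = partition_copies_per_node(n_nodes, replication_factor=3)
    # Moving a single partition copy changes a node's load by about 1/avg.
    print(f"{n_nodes} nodes: {avg:.1f} partition copies per node, "
          f"granularity {100 / avg:.1f}% of a node's share")
```

Under these assumptions, a cluster with exactly `10 × replication_factor` nodes has about 25.6 partition copies per node, so a single partition already represents a few percent of a node's share. With more nodes, each partition weighs even more and the load becomes harder to balance.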
+
+**Potential solution:** This can be fixed by increasing the number of
+partitions in Garage. The code paths exist: there is [a `const`
+somewhere](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/6fd9bba0cb55062cb1725ab961b7fa8acb9dcc61/src/rpc/layout/mod.rs#L35)
+that theoretically allows increasing the number of partitions up to `2^16`,
+but this has not been tested so there might be bugs.
+
+### Buckets are not sharded
+
+For each bucket, the first metadata layer that contains an index of all objects
+is not sharded. This index, which includes the names and all metadata (size,
+headers, ...) for each object, is stored on `$replication_factor` nodes.
+
+For instance with `replication_factor = 3`, a given bucket will use only 3
+specific nodes (chosen at random when the bucket is created) to store this
+index. In multi-zone deployments, these nodes will be spread across
+different zones. Each bucket uses a different set of 3 random nodes for its
+index.
+
+As a consequence, very large buckets might cause uneven load distribution
+within a cluster. If all of the requests on a cluster are for objects in a
+single bucket, then the `$replication_factor` nodes that store the index will
+become a hotspot in the cluster, with more intensive metadata access patterns.
+There is no way of choosing which nodes will have this role.
+
+Currently, we have no report of this being an issue in practice.
+
+**Mitigation:** This impacts in particular clusters that are used for a single
+purpose with a single bucket. This can be solved by dividing your dataset among
+many buckets, using a client-side sharding strategy that you will have to
+design. Use at least as many buckets as you have nodes in your cluster.
 
 ## Bugs
 
@@ -26,11 +137,37 @@ fixed yet.
 
 ### LMDB metadata corruption
 
+Many users have reported situations where the LMDB metadata db becomes
+corrupted, sometimes after a forced shutdown of Garage or in case of power
+loss. A corrupted database file is generally not recoverable.
+
+**Mitigation:** Use a `replication_factor` of at least 2. Configure automatic
+snapshotting using `metadata_auto_snapshot_interval` so that in case of
+corruption you can roll back to a working database.
+
+Note that taking filesystem-level snapshots of your `metadata_dir`, although it
+is much faster and less I/O intensive than Garage's built-in snapshotting, does
+not ensure that the snapshot will be consistent. If the snapshot is taken
+during a metadata write, the snapshot itself might be corrupted and thus not
+usable as a rollback point. Therefore, prefer using
+`metadata_auto_snapshot_interval` in all cases.
+
 ### Layout updates might require manual intervention
 
-### Tag assignement
+In case of disconnected nodes, when changing the cluster layout to remove these
+nodes and add other nodes instead, Garage might not be able to properly evict
+the old nodes from the system. This is a built-in safety measure to avoid any
+inconsistent cluster states.
 
-You need to repeat mutiple time -t to have multiple tags, pushing tags with commas will result in a single string.
+This manifests as several cluster layout versions staying active even after a
+full resync. You can diagnose this situation with `garage layout history`,
+which will give you instructions to fix it.
+
+### Tag assignment
+
+In the `garage layout assign` command, the `-t` argument has to be repeated
+multiple times to set multiple tags on a node. Writing multiple tags separated
+by commas will result in a single string.
 
 ## General footguns
 
@@ -38,3 +175,14 @@ Choices made by the developers that users must be aware of if they don't want
 to run into potential issues.
 
 ### Resync tranquility is conservative by default
+
+By default, the worker parameters `resync-tranquility` and `resync-worker-count` are set to very conservative values, to avoid overloading nodes with I/O when data needs to be resynchronized between nodes.
+This can cause issues where the resync queue grows faster than it can be cleared, which in turn causes performance issues in the rest of Garage.
+
+This situation is indicated by a big resync queue with few resync errors (that is, the backlog is not caused by a disconnected or malfunctioning node).
+To fix it, increase the number of resync workers and reduce the resync tranquility. For instance, if you want to resync as fast as possible:
+
+```
+garage worker set -a resync-worker-count 8
+garage worker set -a resync-tranquility 0
+```