From dfb20ba87fa3fe794bc0cfbc6690e5ee59b52085 Mon Sep 17 00:00:00 2001
From: Alex Auvolat <lx@deuxfleurs.fr>
Date: Wed, 15 Apr 2026 11:55:57 +0200
Subject: [PATCH] doc: write details of known issues

---
 doc/book/reference-manual/known-issues.md | 156 +++++++++++++++++++++-
 1 file changed, 152 insertions(+), 4 deletions(-)

diff --git a/doc/book/reference-manual/known-issues.md b/doc/book/reference-manual/known-issues.md
index bd2c3319..3a825db3 100644
--- a/doc/book/reference-manual/known-issues.md
+++ b/doc/book/reference-manual/known-issues.md
@@ -3,20 +3,131 @@ title = "Known issues"
 weight = 80
 +++
 
+Issues in each section are roughly sorted by decreasing impact, based on actual reports from users.
 
 ## Architectural limitations
 
 Issues that are caused by design decisions of Garage internals, and that can't
 be fixed without major architectural changes in the codebase.
 
-### Buckets are not sharded
+### Metadata performance issues with many objects
+
+**Related issues:**
+
+- [#851 - Performances collapse with 10 millions pictures in a bucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/851)
+- [#1222 - Cluster Setup Write Performance Degraded After Writing 10 Million Object (200-300Kb per object)](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1222)
 
 ### Very big objects cause performance degradation
 
-### No locking (`if-none-match`, ...)
+For each object, there is a single metadata entry called a `Version` that
+contains a list of all of the data blocks in the object.  For very big objects,
+this entry can contain thousands of block references.  During the upload of
+an object, this metadata entry needs to be read, deserialized, reserialized and
+written for each individual data block uploaded.  This means that the
+complexity of an upload is `O(n²)` in the number of blocks needed.
+
+This manifests as excessive metadata I/O and CPU usage, with uploads eventually stalling.
+
+**Mitigation:** Increase the `block_size` configuration parameter to reduce the
+number of blocks. Make sure multipart uploads use chunks that are at least
+`block_size` in size, and that are an exact multiple of `block_size` to avoid
+the creation of smaller blocks.
+
+**Long-term solution:** An architectural change in the metadata system would be
+required to store block lists in many independent metadata entries instead of
+one single big entry per object.
+
+**Related issues:**
+
+- [#662 - Large Files fail to upload](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/662)
+- [#1366 - High CPU usage and performance degradation during long multipart uploads](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1366)
+
+### No conditional writes / locking / WORM support (`if-none-match`, ...)
+
+This is structurally impossible to implement in Garage due to the lack of a consensus algorithm,
+which is a core design choice of Garage that we cannot reconsider.
+
+A semi-working, *unsafe* implementation of WORM and object locking could be
+implemented, with the following constraint: only after the completion of the
+first write (in case of WORM) or the setting of a lock (for object lock) can we
+guarantee that the object cannot be overwritten. If an overwrite request
+arrives at the same time as the initial request to write or to lock the
+object, we cannot implement a safe and consistent way to reject it.  This
+means that many practical use-cases for `if-none-match` cannot be supported
+(e.g. using it to implement mutual exclusion between concurrent writers).
+
+**Related issues:**
+
+- [#1052 - Support conditional writes](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1052)
+- [#1127 - Feature Request: WORM (Write Once Read Many) / Object Lock Support](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1127)
 
 ### `CreateBucket` race condition
 
+Also due to the lack of a consensus algorithm, there is no mutual exclusion
+between concurrent `CreateBucket` requests using the same bucket name.
+
+**Related issues:**
+
+- [#649 - Race condition in CreateBucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/649)
+
+### Metadata and data have the same replication factor
+
+There is a single `replication_factor` in the configuration file that applies both to data blocks and metadata entries.
+This makes clusters with `replication_factor = 1` particularly vulnerable in cases of metadata corruption (see below), as there
+is a single copy of the metadata for each object even in multi-node clusters.
+
+**Mitigation:** Do not use `replication_factor = 1`.
+
+**Long-term solution:** We want to allow scenarios such as replicating the
+metadata on 2, 3 or more nodes and the data on only 1 or 2 nodes (for example),
+so that the metadata can benefit from better redundancy without increasing the
+storage costs for the entire dataset. This will require some important changes
+in the codebase.
+
+**Related issues:**
+
+- [#720 - Separate replication modes for metadata/data](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/720)
+
+### Node count limitation
+
+Garage will have issues in clusters with too many nodes: it will not be able to
+spread data uniformly among nodes, and some nodes will fill up faster than
+others. This starts to manifest when the number of nodes is bigger than `10 ×
+replication_factor`.  This is due to the fact that Garage uses only 256
+partitions internally.
+
+**Mitigation:** Build clusters with fewer, bigger nodes.
+
+**Potential solution:** This can be fixed by increasing the number of
+partitions in Garage. The code paths exist: there is [a `const`
+somewhere](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/6fd9bba0cb55062cb1725ab961b7fa8acb9dcc61/src/rpc/layout/mod.rs#L35)
+that theoretically allows increasing the number of partitions up to `2^16`,
+but this has not been tested, so there might be bugs.
+
+### Buckets are not sharded
+
+For each bucket, the first metadata layer that contains an index of all objects
+is not sharded.  This index, which includes the names and all metadata (size,
+headers, ...) for each object, is stored on `$replication_factor` nodes.
+
+For instance with `replication_factor = 3`, a given bucket will use only 3
+specific nodes (chosen at random when the bucket is created) to store this
+index.  In a multi-zone deployment, these nodes will be spread across
+different zones.  Each bucket uses a different set of 3 random nodes for its
+index.
+
+As a consequence, very large buckets might cause uneven load distribution
+within a cluster.  If all of the requests on a cluster are for objects in a
+single bucket, then the `$replication_factor` nodes that store the index will
+become a hotspot in the cluster, with more intensive metadata access patterns.
+There is no way of choosing which nodes will have this role.
+
+Currently, we have no report of this being an issue in practice.
+
+**Mitigation:** This particularly affects clusters that are used for a single
+purpose with a single bucket. This can be solved by dividing your dataset among
+many buckets, using a client-side sharding strategy that you will have to
+design. Use at least as many buckets as you have nodes in your cluster.
 
 
 ## Bugs
@@ -26,11 +137,37 @@ fixed yet.
 
 ### LMDB metadata corruption
 
+Many users have reported situations where the LMDB metadata db becomes
+corrupted, sometimes after a forced shutdown of Garage or in case of power
+loss.  A corrupted database file is generally not recoverable.
+
+**Mitigation:** Use a `replication_factor` of at least 2. Configure automatic
+snapshotting using `metadata_auto_snapshot_interval` so that in case of
+corruption you can roll back to a working database.
+
+Note that taking filesystem-level snapshots of your `metadata_dir`, although it
+is much faster and less I/O intensive than Garage's built-in snapshotting, does
+not ensure that the snapshot will be consistent. If the snapshot is taken
+during a metadata write, the snapshot itself might be corrupted and thus not
+usable as a rollback point. Therefore, prefer using
+`metadata_auto_snapshot_interval` in all cases.
+
 ### Layout updates might require manual intervention
 
-### Tag assignement
+In case of disconnected nodes, when changing the cluster layout to remove these
+nodes and add other nodes instead, Garage might not be able to properly evict
+the old nodes from the system. This is a built-in security measure to avoid any
+inconsistent cluster states.
 
-You need to repeat mutiple time -t to have multiple tags, pushing tags with commas will result in a single string.
+This manifests as several cluster layout versions staying active even after a
+full resync. You can diagnose this situation with `garage layout history`,
+which will give you instructions to fix it.
+
+### Tag assignment
+
+In the `garage layout assign` command, the `-t` argument has to be repeated
+multiple times to set multiple tags on a node. Writing multiple tags separated
+by commas will result in a single string.
 
 ## General footguns
 
@@ -38,3 +175,14 @@ Choices made by the developers that users must be aware of if they don't want
 to run into potential issues.
 
 ### Resync tranquility is conservative by default
+
+By default, the worker parameters `resync-tranquility` and `resync-worker-count` are set to very conservative values, to avoid overloading nodes with I/O when data needs to be resynchronized between nodes.
+This can cause issues where the resync queue grows faster than it can be cleared, which in turn causes performance issues in the rest of Garage.
+
+This situation is indicated by a big resync queue with few resync errors (the queue is not caused by a disconnected/malfunctioning node).
+To fix it, increase the number of resync workers and reduce the resync tranquility. For instance, if you want to resync as fast as possible:
+
+```
+garage worker set -a resync-worker-count 8
+garage worker set -a resync-tranquility 0
+```