From 4d30a11ab61e1f61f8eceb0b7104536db22ac90a Mon Sep 17 00:00:00 2001
From: Tobias Schmidt
Date: Thu, 26 Oct 2017 22:33:45 +0200
Subject: [PATCH] Import storage and federation documentation from docs

---
 docs/federation.md |  81 ++++++++++
 docs/index.md      |   2 +
 docs/storage.md    | 357 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 440 insertions(+)
 create mode 100644 docs/federation.md
 create mode 100644 docs/storage.md

diff --git a/docs/federation.md b/docs/federation.md
new file mode 100644
index 0000000000..283f044abe
--- /dev/null
+++ b/docs/federation.md
@@ -0,0 +1,81 @@
+---
+title: Federation
+sort_rank: 6
+---
+
+# Federation
+
+Federation allows a Prometheus server to scrape selected time series from
+another Prometheus server.
+
+## Use cases
+
+There are different use cases for federation. Commonly, it is used to either
+achieve scalable Prometheus monitoring setups or to pull related metrics from
+one service's Prometheus into another.
+
+### Hierarchical federation
+
+Hierarchical federation allows Prometheus to scale to environments with tens of
+data centers and millions of nodes. In this use case, the federation topology
+resembles a tree, with higher-level Prometheus servers collecting aggregated
+time series data from a larger number of subordinated servers.
+
+For example, a setup might consist of many per-datacenter Prometheus servers
+that collect data in high detail (instance-level drill-down), and a set of
+global Prometheus servers which collect and store only aggregated data
+(job-level drill-down) from those local servers. This provides an aggregate
+global view and detailed local views.
+
+### Cross-service federation
+
+In cross-service federation, a Prometheus server of one service is configured
+to scrape selected data from another service's Prometheus server to enable
+alerting and queries against both datasets within a single server.
+
+For example, a cluster scheduler running multiple services might expose
+resource usage information (like memory and CPU usage) about service instances
+running on the cluster. On the other hand, a service running on that cluster
+will only expose application-specific service metrics. Often, these two sets of
+metrics are scraped by separate Prometheus servers. Using federation, the
+Prometheus server containing service-level metrics may pull in the cluster
+resource usage metrics about its specific service from the cluster Prometheus,
+so that both sets of metrics can be used within that server.
+
+## Configuring federation
+
+On any given Prometheus server, the `/federate` endpoint allows retrieving the
+current value for a selected set of time series in that server. At least one
+`match[]` URL parameter must be specified to select the series to expose. Each
+`match[]` argument needs to specify an
+[instant vector selector](querying/basics.md#instant-vector-selectors) like
+`up` or `{job="api-server"}`. If multiple `match[]` parameters are provided,
+the union of all matched series is selected.
+
+To federate metrics from one server to another, configure your destination
+Prometheus server to scrape from the `/federate` endpoint of a source server,
+while also enabling the `honor_labels` scrape option (to not overwrite any
+labels exposed by the source server) and passing in the desired `match[]`
+parameters.
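+
+To check what a given set of `match[]` selectors returns before wiring up a
+scrape job, the `/federate` endpoint can also be queried directly, for
+instance with `curl` (the host name below is just an example, matching the
+scrape configuration that follows):
+
+```
+curl -G 'http://source-prometheus-1:9090/federate' \
+  --data-urlencode 'match[]={job="prometheus"}' \
+  --data-urlencode 'match[]={__name__=~"job:.*"}'
+```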
+
+For example, the following `scrape_config` federates any series
+with the label `job="prometheus"` or a metric name starting with `job:` from
+the Prometheus servers at `source-prometheus-{1,2,3}:9090` into the scraping
+Prometheus:
+
+```yaml
+- job_name: 'federate'
+  scrape_interval: 15s
+
+  honor_labels: true
+  metrics_path: '/federate'
+
+  params:
+    'match[]':
+      - '{job="prometheus"}'
+      - '{__name__=~"job:.*"}'
+
+  static_configs:
+    - targets:
+      - 'source-prometheus-1:9090'
+      - 'source-prometheus-2:9090'
+      - 'source-prometheus-3:9090'
+```
diff --git a/docs/index.md b/docs/index.md
index 8641cd1b07..abab28508a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,3 +15,5 @@ The documentation is available alongside all the project documentation at
 - [Getting started](getting_started.md)
 - [Configuration](configuration.md)
 - [Querying](querying/basics.md)
+- [Storage](storage.md)
+- [Federation](federation.md)
diff --git a/docs/storage.md b/docs/storage.md
new file mode 100644
index 0000000000..392e0c5a7a
--- /dev/null
+++ b/docs/storage.md
@@ -0,0 +1,357 @@
+---
+title: Storage
+sort_rank: 5
+---
+
+# Storage
+
+Prometheus has a sophisticated local storage subsystem. For indexes,
+it uses [LevelDB](https://github.com/google/leveldb). For the bulk
+sample data, it has its own custom storage layer, which organizes
+sample data in chunks of constant size (1024 bytes payload). These
+chunks are then stored on disk in one file per time series.
+
+This section deals with the various configuration settings and issues you
+might run into. To dive deeper into the topic, check out the following talks:
+
+* [The Prometheus Time Series Database](https://www.youtube.com/watch?v=HbnGSNEjhUc).
+* [Configuring Prometheus for High Performance](https://www.youtube.com/watch?v=hPC60ldCGm8).
+
+## Memory usage
+
+Prometheus keeps all the currently used chunks in memory. In addition, it keeps
+as many of the most recently used chunks in memory as possible. You have to
+tell Prometheus how much memory it may use for this caching. The flag
+`storage.local.target-heap-size` allows you to set the heap size (in bytes)
+Prometheus aims not to exceed. Note that the amount of physical memory the
+Prometheus server will use is the result of complex interactions of the Go
+runtime and the operating system and is very hard to predict precisely. As a
+rule of thumb, you should have at least 50% headroom in physical memory over
+the configured heap size. (Or, in other words, set
+`storage.local.target-heap-size` to a value of two thirds of the physical
+memory limit Prometheus should not exceed.)
+
+The default value of `storage.local.target-heap-size` is 2GiB and thus tailored
+to 3GiB of physical memory usage. If you have less physical memory available,
+you have to lower the flag value. If you have more memory available, you should
+raise the value accordingly. Otherwise, Prometheus will not make use of the
+memory and thus will perform much worse than it could.
+
+Because Prometheus uses most of its heap for long-lived allocations of memory
+chunks, the
+[garbage collection target percentage](https://golang.org/pkg/runtime/debug/#SetGCPercent)
+is set to 40 by default. You can still override this setting via the `GOGC`
+environment variable as usual. If you need to conserve CPU capacity and can
+accept running with fewer memory chunks, try higher values.
+
+For high-performance set-ups, you might need to adjust more flags. Please read
+through the sections below for details.
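+
+As a concrete sketch of the sizing rule of thumb above (the numbers are only
+an example): on a host where Prometheus may use up to 12GiB of physical
+memory, the heap target would be set to roughly two thirds of that, i.e. 8GiB:
+
+```
+# Example only: a 12GiB physical memory budget suggests a heap target of
+# about 8GiB = 8 * 1024^3 bytes = 8589934592 bytes.
+prometheus -storage.local.target-heap-size=8589934592
+```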
+
+NOTE: Prior to v1.6, there was no flag `storage.local.target-heap-size`.
+Instead, the number of chunks kept in memory had to be configured using the
+flags `storage.local.memory-chunks` and `storage.local.max-chunks-to-persist`.
+These flags still exist for compatibility reasons. However,
+`storage.local.max-chunks-to-persist` has no effect anymore, and if
+`storage.local.memory-chunks` is set to a non-zero value _x_, it is used to
+override the value for `storage.local.target-heap-size` to 3072*_x_.
+
+## Disk usage
+
+Prometheus stores its on-disk time series data under the directory specified by
+the flag `storage.local.path`. The default path is `./data` (relative to the
+working directory), which is good for trying something out quickly but most
+likely not what you want for actual operations. The flag
+`storage.local.retention` allows you to configure the retention time for
+samples. Adjust it to your needs and your available disk space.
+
+## Chunk encoding
+
+Prometheus currently offers three different types of chunk encodings. The chunk
+encoding for newly created chunks is determined by the
+`-storage.local.chunk-encoding-version` flag. The valid values are 0, 1,
+or 2.
+
+Type 0 is the simple delta encoding implemented for Prometheus's first chunked
+storage layer. Type 1 is the current default encoding, a double-delta encoding
+with much better compression behavior than type 0. Both encodings feature a
+fixed byte width per sample over the whole chunk, which allows fast random
+access. While type 0 is the fastest encoding, the difference in encoding cost
+compared to type 1 is tiny. Due to the better compression behavior of type 1,
+there is really no reason to select type 0 except compatibility with very old
+Prometheus versions.
+
+Type 2 is a variable bit-width encoding, i.e. each sample in the chunk can use
+a different number of bits. Timestamps are double-delta encoded, too, but with
+a slightly different algorithm. A number of different encoding schemes are
+available for sample values. The choice is made per chunk based on the nature
+of the sample values (constant, integer, regularly increasing, random…). Major
+parts of the type 2 encoding are inspired by a paper published by Facebook
+engineers:
+[_Gorilla: A Fast, Scalable, In-Memory Time Series Database_](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
+
+With type 2, access within a chunk has to happen sequentially, and the encoding
+and decoding cost is a bit higher. Overall, type 2 will cause more CPU usage
+and increased query latency compared to type 1 but offers a much improved
+compression ratio. The exact numbers depend heavily on the data set and the
+kind of queries. Below are results from a typical production server with a
+fairly expensive set of recording rules.
+
+Chunk type | bytes per sample | cores | rule evaluation duration
+:------:|:-----:|:----:|:----:
+1 | 3.3 | 1.6 | 2.9s
+2 | 1.3 | 2.4 | 4.9s
+
+You can change the chunk encoding each time you start the server, so
+experimenting with your own use case is encouraged. Take into account, however,
+that only newly created chunks will use the newly selected chunk encoding, so
+it will take a while until you see the effects.
+
+For more details about the trade-off between the chunk encodings, see
+[this blog post](https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chunks/).
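+
+If the reduced disk usage of type 2 chunks is worth the extra CPU and query
+latency in your environment, switching is a single flag. A sketch (combine it
+with whatever other flags you already pass):
+
+```
+# Newly created chunks will use the type 2 (varbit) encoding; existing
+# chunks keep their encoding until they are rewritten.
+prometheus -storage.local.chunk-encoding-version=2
+```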
+
+## Settings for high numbers of time series
+
+Prometheus can handle millions of time series. However, with the
+above-mentioned default setting for `storage.local.target-heap-size`, you will
+be limited to about 200,000 time series simultaneously present in memory. For
+more series, you need more memory, and you need to configure Prometheus to make
+use of it as described above.
+
+Each of the aforementioned chunks contains samples of a single time series. A
+time series is thus represented as a series of chunks, which ultimately end up
+in a time series file (one file per time series) on disk.
+
+A series that has recently received new samples will have an open incomplete
+_head chunk_. Once that chunk is completely filled, or the series hasn't
+received samples in a while, the head chunk is closed and becomes a chunk
+waiting to be appended to its corresponding series file, i.e. it is _waiting
+for persistence_. After the chunk has been persisted to disk, it becomes
+_evictable_, provided it is not currently used by a query. Prometheus will
+evict evictable chunks from memory to satisfy the configured target heap
+size. A series with an open head chunk is called an _active series_. This is
+different from a _memory series_, which also includes series without an open
+head chunk but with other chunks still in memory (whether waiting for
+persistence, used in a query, or evictable). A series without any chunks in
+memory may be _archived_, upon which it ceases to have any mandatory memory
+footprint.
+
+The number of chunks Prometheus can keep in memory depends on the flag value
+for `storage.local.target-heap-size` and on the amount of memory used by
+everything else. If there are not enough evictable chunks to satisfy the target
+heap size, Prometheus will throttle ingestion of more samples (by skipping
+scrapes and rule evaluations) until the heap has shrunk enough. _Throttled
+ingestion is really bad for various reasons. You really do not want to be in
+that situation._
+
+Open head chunks, chunks still waiting for persistence, and chunks being used
+in a query are not evictable. Thus, the reasons for the inability to evict
+enough chunks include the following:
+
+1. Queries that use too many chunks.
+2. Chunks are piling up waiting for persistence because the storage layer
+   cannot keep up writing chunks.
+3. There are too many active time series, which results in too many open head
+   chunks.
+
+Currently, Prometheus has no defence against case (1). Abusive queries will
+essentially OOM the server.
+
+To defend against case (2), there is a concept of persistence urgency,
+explained in the next section.
+
+Case (3) depends on the targets you monitor. To mitigate an unplanned explosion
+of the number of series, you can limit the number of samples per individual
+scrape (see `sample_limit` in the [scrape config](configuration.md#scrape_config)).
+If the number of active time series exceeds the number of memory chunks the
+Prometheus server can afford, the server will quickly throttle ingestion as
+described above. The only way out of this is to give Prometheus more RAM or
+reduce the number of time series to ingest.
+
+In fact, you want many more memory chunks than you have series in
+memory. Prometheus tries to batch up disk writes as much as possible, as this
+helps both HDDs (write as much as possible after each seek) and SSDs (tiny
+writes create write amplification, which limits the effective throughput and
+burns much more quickly through the lifetime of the device). The more
+Prometheus can batch up writes, the more efficient the process of persisting
+chunks to disk becomes, which helps with case (2) above.
+
+In conclusion, to keep the Prometheus server healthy, make sure it has plenty
+of headroom of memory chunks available for the number of memory series. A
+factor of three is a good starting point. Refer to the
+[section about helpful metrics](#helpful-metrics) to find out what to look
+for. A very broad rule of thumb for an upper limit of memory series is the
+total available physical memory divided by 10,000, e.g. about 6M memory series
+on a 64GiB server.
+
+If you combine a high number of time series with very fast and/or large
+scrapes, the number of pre-allocated mutexes for series locking might not be
+sufficient. If you see scrape hiccups while Prometheus is writing a checkpoint
+or processing expensive queries, try increasing the value of the
+`storage.local.num-fingerprint-mutexes` flag. Sometimes tens of thousands or
+even more are required.
+
+PromQL queries that involve a high number of time series will make heavy use of
+the LevelDB-backed indexes. If you need to run queries of that kind, tweaking
+the index cache sizes might be required. The following flags are relevant:
+
+* `-storage.local.index-cache-size.label-name-to-label-values`: For regular
+  expression matching.
+* `-storage.local.index-cache-size.label-pair-to-fingerprints`: Increase the
+  size if a large number of time series share the same label pair or name.
+* `-storage.local.index-cache-size.fingerprint-to-metric` and
+  `-storage.local.index-cache-size.fingerprint-to-timerange`: Increase the size
+  if you have a large number of archived time series, i.e. series that have not
+  received samples in a while but are still not old enough to be purged
+  completely.
+
+You have to experiment with the flag values to find out what helps. If a query
+touches 100,000+ time series, hundreds of MiB might be reasonable. If you have
+plenty of memory available, using more of it for LevelDB cannot harm. Keep in
+mind, however, that more memory for LevelDB will effectively reduce the number
+of memory chunks Prometheus can afford.
+
+## Persistence urgency and “rushed mode”
+
+Naively, Prometheus would try to persist every completed chunk to disk as soon
+as possible. Such a strategy would lead to many tiny write operations, using up
+most of the I/O bandwidth and keeping the server quite busy. Spinning disks
+will appear to be very slow because of the many slow seeks required, and SSDs
+will suffer from write amplification. Prometheus tries instead to batch up
+write operations as much as possible, which works better if it is allowed to
+use more memory.
+
+Prometheus will also sync series files after each write (with
+`storage.local.series-sync-strategy=adaptive`, which is the default) and use
+the disk bandwidth for more frequent checkpoints (based on the count of “dirty
+series”, see [below](#crash-recovery)), both attempting to minimize data loss
+in case of a crash.
+
+But what to do if the number of chunks waiting for persistence grows too much?
+Prometheus calculates a score for the urgency to persist chunks. The score is
+between 0 and 1, where 1 corresponds to the highest urgency. Depending on the
+score, Prometheus will write to disk more frequently. Should the score ever
+pass the threshold of 0.8, Prometheus enters “rushed mode” (which you can see
+in the logs).
+In rushed mode, the following strategies are applied to speed up persisting
+chunks:
+
+* Series files are not synced after write operations anymore (making better use
+  of the OS's page cache at the price of an increased risk of losing data in
+  case of a server crash – this behavior can be overridden with the flag
+  `storage.local.series-sync-strategy`).
+* Checkpoints are only created as often as configured via the
+  `storage.local.checkpoint-interval` flag (freeing more disk bandwidth for
+  persisting chunks at the price of more data loss in case of a crash and an
+  increased time to run the subsequent crash recovery).
+* Write operations to persist chunks are not throttled anymore and performed as
+  fast as possible.
+
+Prometheus leaves rushed mode once the score has dropped below 0.7.
+
+Throttling of ingestion happens if the urgency score reaches 1. Thus, rushed
+mode is not _per se_ something to be avoided. It is, on the contrary, a
+measure the Prometheus server takes to avoid the really bad situation of
+throttled ingestion. Occasionally entering rushed mode is OK if it helps and
+ultimately leads to leaving rushed mode again. _If rushed mode is entered but
+the urgency score still goes up, the server has a real problem._
+
+## Settings for very long retention time
+
+If you have set a very long retention time via the `storage.local.retention`
+flag (more than a month), you might want to increase the value of the
+`storage.local.series-file-shrink-ratio` flag.
+
+Whenever Prometheus needs to cut off some chunks from the beginning of a series
+file, it will simply rewrite the whole file. (Some file systems support “head
+truncation”, which Prometheus currently does not use for several reasons.) To
+avoid rewriting a very large series file just to get rid of very few chunks,
+the rewrite only happens if at least 10% of the chunks in the series file are
+removed. This value can be changed via the mentioned
+`storage.local.series-file-shrink-ratio` flag. If you have a lot of disk space
+but want to minimize rewrites (at the cost of wasted disk space), increase the
+flag to a higher value, e.g. 0.3 to require that at least 30% of the chunks in
+a series file are removed before it is rewritten.
+
+## Crash recovery
+
+Prometheus saves chunks to disk as soon as possible after they are
+complete. Incomplete chunks are saved to disk during regular
+checkpoints. You can configure the checkpoint interval with the flag
+`storage.local.checkpoint-interval`. Prometheus creates checkpoints
+more frequently than that if too many time series are in a “dirty”
+state, i.e. their current incomplete head chunk is not the one that is
+contained in the most recent checkpoint. This limit is configurable
+via the `storage.local.checkpoint-dirty-series-limit` flag.
+
+In general, more active time series lead to more chunks waiting for
+persistence, which in turn leads to larger checkpoints and ultimately more
+time needed for checkpointing. There is a clear trade-off between limiting the
+loss of data in case of a crash and the ability to scale to a high number of
+active time series. To avoid spending the majority of the disk throughput on
+checkpointing, you have to increase the checkpoint interval. Prometheus itself
+limits the time spent in checkpointing to 50% by waiting after each
+checkpoint's completion for at least as long as the previous checkpoint took.
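+
+As an illustrative sketch only (the right values depend entirely on your
+setup), a server with many active series might checkpoint less often and
+tolerate more dirty series, accepting a larger possible data loss on crash in
+exchange for less disk I/O spent on checkpointing:
+
+```
+# Example values only: create checkpoints every 10 minutes, or earlier if
+# more than 500,000 series are dirty.
+prometheus -storage.local.checkpoint-interval=10m \
+           -storage.local.checkpoint-dirty-series-limit=500000
+```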
+
+Nevertheless, should your server crash, you might still lose data, and your
+storage might be left in an inconsistent state. Therefore, Prometheus performs
+a crash recovery after an unclean shutdown, similar to an `fsck` run for a file
+system. Details about the crash recovery are logged, so you can use them for
+forensics if required. Data that cannot be recovered is moved to a directory
+called `orphaned` (located under `storage.local.path`). Remember to delete that
+data if you do not need it anymore.
+
+The crash recovery usually takes less than a minute. Should it take much
+longer, consult the log to find out what is going on. With an increasing number
+of time series in the storage (archived or not), the re-indexing tends to
+dominate the recovery time and can take tens of minutes in extreme cases.
+
+## Data corruption
+
+If you suspect problems caused by corruption in the database, you can
+enforce a crash recovery by starting the server with the flag
+`storage.local.dirty`.
+
+If that does not help, or if you simply want to erase the existing
+database, you can easily start fresh by deleting the contents of the
+storage directory:
+
+ 1. Stop Prometheus.
+ 1. `rm -r <storage path>/*`
+ 1. Start Prometheus.
+
+## Helpful metrics
+
+Out of the metrics that Prometheus exposes about itself, the following are
+particularly useful for tweaking flags and finding out about the required
+resources. They also help to create alerts that tell you in time if a
+Prometheus server has problems or is out of capacity.
+
+* `prometheus_local_storage_memory_series`: The current number of series held
+  in memory.
+* `prometheus_local_storage_open_head_chunks`: The number of open head chunks.
+* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks
+  that still need to be persisted to disk.
+* `prometheus_local_storage_memory_chunks`: The current number of chunks held
+  in memory. If you subtract the previous two, you get the number of persisted
+  chunks (which are evictable if not currently in use by a query).
+* `prometheus_local_storage_series_chunks_persisted`: A histogram of the number
+  of chunks persisted per batch.
+* `prometheus_local_storage_persistence_urgency_score`: The urgency score as
+  discussed [above](#persistence-urgency-and-rushed-mode).
+* `prometheus_local_storage_rushed_mode` is 1 if Prometheus is in “rushed
+  mode”, 0 otherwise. Can be used to calculate the percentage of time
+  Prometheus is in rushed mode.
+* `prometheus_local_storage_checkpoint_last_duration_seconds`: How long the
+  last checkpoint took.
+* `prometheus_local_storage_checkpoint_last_size_bytes`: Size of the last
+  checkpoint in bytes.
+* `prometheus_local_storage_checkpointing` is 1 while Prometheus is
+  checkpointing, 0 otherwise. Can be used to calculate the percentage of time
+  Prometheus is checkpointing.
+* `prometheus_local_storage_inconsistencies_total`: Counter for storage
+  inconsistencies found. If this is greater than 0, restart the server for
+  recovery.
+* `prometheus_local_storage_persist_errors_total`: Counter for persist errors.
+* `prometheus_local_storage_memory_dirty_series`: Current number of dirty series.
+* `process_resident_memory_bytes`: Broadly speaking, the physical memory
+  occupied by the Prometheus process.
+* `go_memstats_alloc_bytes`: Go heap size (allocated objects in use plus
+  allocated objects not in use anymore but not yet garbage-collected).
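+
+For a quick look at any of these metrics without setting up dashboards, the
+server's HTTP API can be queried directly, for example (assuming the default
+`localhost:9090` listen address):
+
+```
+# Current persistence urgency score of the local server.
+curl -G 'http://localhost:9090/api/v1/query' \
+  --data-urlencode 'query=prometheus_local_storage_persistence_urgency_score'
+```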