From 4d30a11ab61e1f61f8eceb0b7104536db22ac90a Mon Sep 17 00:00:00 2001
From: Tobias Schmidt
Date: Thu, 26 Oct 2017 22:33:45 +0200
Subject: [PATCH] Import storage and federation documentation from docs

---
 docs/federation.md |  81 ++++++++++
 docs/index.md      |   2 +
 docs/storage.md    | 357 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 440 insertions(+)
 create mode 100644 docs/federation.md
 create mode 100644 docs/storage.md

diff --git a/docs/federation.md b/docs/federation.md
new file mode 100644
index 0000000000..283f044abe
--- /dev/null
+++ b/docs/federation.md
@@ -0,0 +1,81 @@
+---
+title: Federation
+sort_rank: 6
+---
+
+# Federation
+
+Federation allows a Prometheus server to scrape selected time series from
+another Prometheus server.
+
+## Use cases
+
+There are different use cases for federation. Commonly, it is used to either
+achieve scalable Prometheus monitoring setups or to pull related metrics from
+one service's Prometheus into another.
+
+### Hierarchical federation
+
+Hierarchical federation allows Prometheus to scale to environments with tens of
+data centers and millions of nodes. In this use case, the federation topology
+resembles a tree, with higher-level Prometheus servers collecting aggregated
+time series data from a larger number of subordinated servers.
+
+For example, a setup might consist of many per-datacenter Prometheus servers
+that collect data in high detail (instance-level drill-down), and a set of
+global Prometheus servers which collect and store only aggregated data
+(job-level drill-down) from those local servers. This provides an aggregate
+global view and detailed local views.
+
+### Cross-service federation
+
+In cross-service federation, a Prometheus server of one service is configured
+to scrape selected data from another service's Prometheus server to enable
+alerting and queries against both datasets within a single server.
+
+For example, a cluster scheduler running multiple services might expose
+resource usage information (like memory and CPU usage) about service instances
+running on the cluster. On the other hand, a service running on that cluster
+will only expose application-specific service metrics. Often, these two sets of
+metrics are scraped by separate Prometheus servers. Using federation, the
+Prometheus server containing service-level metrics may pull in the cluster
+resource usage metrics about its specific service from the cluster Prometheus,
+so that both sets of metrics can be used within that server.
+
+## Configuring federation
+
+On any given Prometheus server, the `/federate` endpoint allows retrieving the
+current value for a selected set of time series in that server. At least one
+`match[]` URL parameter must be specified to select the series to expose. Each
+`match[]` argument needs to specify an
+[instant vector selector](querying/basics.md#instant-vector-selectors) like
+`up` or `{job="api-server"}`. If multiple `match[]` parameters are provided,
+the union of all matched series is selected.
+
+To federate metrics from one server to another, configure your destination
+Prometheus server to scrape from the `/federate` endpoint of a source server,
+while also enabling the `honor_labels` scrape option (to not overwrite any
+labels exposed by the source server) and passing in the desired `match[]`
+parameters.
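+
+To check what a given set of `match[]` selectors returns before wiring up a
+scrape job, the `/federate` endpoint can also be queried directly, for
+instance with `curl` (the host name below is just an example, matching the
+scrape configuration that follows):
+
+```
+curl -G 'http://source-prometheus-1:9090/federate' \
+  --data-urlencode 'match[]={job="prometheus"}' \
+  --data-urlencode 'match[]={__name__=~"job:.*"}'
+```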
+
+For example, the following `scrape_config` federates any series
+with the label `job="prometheus"` or a metric name starting with `job:` from
+the Prometheus servers at `source-prometheus-{1,2,3}:9090` into the scraping
+Prometheus:
+
+```yaml
+- job_name: 'federate'
+  scrape_interval: 15s
+
+  honor_labels: true
+  metrics_path: '/federate'
+
+  params:
+    'match[]':
+      - '{job="prometheus"}'
+      - '{__name__=~"job:.*"}'
+
+  static_configs:
+    - targets:
+      - 'source-prometheus-1:9090'
+      - 'source-prometheus-2:9090'
+      - 'source-prometheus-3:9090'
+```
diff --git a/docs/index.md b/docs/index.md
index 8641cd1b07..abab28508a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,3 +15,5 @@ The documentation is available alongside all the project documentation at
 - [Getting started](getting_started.md)
 - [Configuration](configuration.md)
 - [Querying](querying/basics.md)
+- [Storage](storage.md)
+- [Federation](federation.md)
diff --git a/docs/storage.md b/docs/storage.md
new file mode 100644
index 0000000000..392e0c5a7a
--- /dev/null
+++ b/docs/storage.md
@@ -0,0 +1,357 @@
+---
+title: Storage
+sort_rank: 5
+---
+
+# Storage
+
+Prometheus has a sophisticated local storage subsystem. For indexes,
+it uses [LevelDB](https://github.com/google/leveldb). For the bulk
+sample data, it has its own custom storage layer, which organizes
+sample data in chunks of constant size (1024 bytes payload). These
+chunks are then stored on disk in one file per time series.
+
+This section deals with the various configuration settings and issues you
+might run into. To dive deeper into the topic, check out the following talks:
+
+* [The Prometheus Time Series Database](https://www.youtube.com/watch?v=HbnGSNEjhUc).
+* [Configuring Prometheus for High Performance](https://www.youtube.com/watch?v=hPC60ldCGm8).
+
+## Memory usage
+
+Prometheus keeps all the currently used chunks in memory. In addition, it keeps
+as many of the most recently used chunks in memory as possible. You have to
+tell Prometheus how much memory it may use for this caching. The flag
+`storage.local.target-heap-size` allows you to set the heap size (in bytes)
+Prometheus aims not to exceed. Note that the amount of physical memory the
+Prometheus server will use is the result of complex interactions of the Go
+runtime and the operating system and is very hard to predict precisely. As a
+rule of thumb, you should have at least 50% headroom in physical memory over
+the configured heap size. (Or, in other words, set
+`storage.local.target-heap-size` to a value of two thirds of the physical
+memory limit Prometheus should not exceed.)
+
+The default value of `storage.local.target-heap-size` is 2GiB and thus tailored
+to 3GiB of physical memory usage. If you have less physical memory available,
+you have to lower the flag value. If you have more memory available, you should
+raise the value accordingly. Otherwise, Prometheus will not make use of the
+memory and thus will perform much worse than it could.
+
+Because Prometheus uses most of its heap for long-lived allocations of memory
+chunks, the
+[garbage collection target percentage](https://golang.org/pkg/runtime/debug/#SetGCPercent)
+is set to 40 by default. You can still override this setting via the `GOGC`
+environment variable as usual. If you need to conserve CPU capacity and can
+accept running with fewer memory chunks, try higher values.
+
+For high-performance set-ups, you might need to adjust more flags. Please read
+through the sections below for details.
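+
+As a concrete sketch of the sizing rule of thumb above (the numbers are only
+an example): on a host where Prometheus may use up to 12GiB of physical
+memory, the heap target would be set to roughly two thirds of that, i.e. 8GiB:
+
+```
+# Example only: a 12GiB physical memory budget suggests a heap target of
+# about 8GiB = 8 * 1024^3 bytes = 8589934592 bytes.
+prometheus -storage.local.target-heap-size=8589934592
+```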
+
+NOTE: Prior to v1.6, there was no flag `storage.local.target-heap-size`.
+Instead, the number of chunks kept in memory had to be configured using the
+flags `storage.local.memory-chunks` and `storage.local.max-chunks-to-persist`.
+These flags still exist for compatibility reasons. However,
+`storage.local.max-chunks-to-persist` has no effect anymore, and if
+`storage.local.memory-chunks` is set to a non-zero value _x_, it is used to
+override the value for `storage.local.target-heap-size` to 3072*_x_.
+
+## Disk usage
+
+Prometheus stores its on-disk time series data under the directory specified by
+the flag `storage.local.path`. The default path is `./data` (relative to the
+working directory), which is good for trying something out quickly but most
+likely not what you want for actual operations. The flag
+`storage.local.retention` allows you to configure the retention time for
+samples. Adjust it to your needs and your available disk space.
+
+## Chunk encoding
+
+Prometheus currently offers three different types of chunk encodings. The chunk
+encoding for newly created chunks is determined by the
+`-storage.local.chunk-encoding-version` flag. The valid values are 0, 1,
+or 2.
+
+Type 0 is the simple delta encoding implemented for Prometheus's first chunked
+storage layer. Type 1 is the current default encoding, a double-delta encoding
+with much better compression behavior than type 0. Both encodings feature a
+fixed byte width per sample over the whole chunk, which allows fast random
+access. While type 0 is the fastest encoding, the difference in encoding cost
+compared to type 1 is tiny. Due to the better compression behavior of type 1,
+there is really no reason to select type 0 except compatibility with very old
+Prometheus versions.
+
+Type 2 is a variable bit-width encoding, i.e. each sample in the chunk can use
+a different number of bits. Timestamps are double-delta encoded, too, but with
+a slightly different algorithm. A number of different encoding schemes are
+available for sample values. The choice is made per chunk based on the nature
+of the sample values (constant, integer, regularly increasing, random…). Major
+parts of the type 2 encoding are inspired by a paper published by Facebook
+engineers:
+[_Gorilla: A Fast, Scalable, In-Memory Time Series Database_](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
+
+With type 2, access within a chunk has to happen sequentially, and the encoding
+and decoding cost is a bit higher. Overall, type 2 will cause more CPU usage
+and increased query latency compared to type 1 but offers a much improved
+compression ratio. The exact numbers depend heavily on the data set and the
+kind of queries. Below are results from a typical production server with a
+fairly expensive set of recording rules.
+
+Chunk type | bytes per sample | cores | rule evaluation duration
+:------:|:-----:|:----:|:----:
+1 | 3.3 | 1.6 | 2.9s
+2 | 1.3 | 2.4 | 4.9s
+
+You can change the chunk encoding each time you start the server, so
+experimenting with your own use case is encouraged. Take into account, however,
+that only newly created chunks will use the newly selected chunk encoding, so
+it will take a while until you see the effects.
+
+For more details about the trade-off between the chunk encodings, see
+[this blog post](https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chunks/).
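+
+If the reduced disk usage of type 2 chunks is worth the extra CPU and query
+latency in your environment, switching is a single flag. A sketch (combine it
+with whatever other flags you already pass):
+
+```
+# Newly created chunks will use the type 2 (varbit) encoding; existing
+# chunks keep their encoding until they are rewritten.
+prometheus -storage.local.chunk-encoding-version=2
+```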
+
+## Settings for high numbers of time series
+
+Prometheus can handle millions of time series. However, with the
+above-mentioned default setting for `storage.local.target-heap-size`, you will
+be limited to about 200,000 time series simultaneously present in memory. For
+more series, you need more memory, and you need to configure Prometheus to make
+use of it as described above.
+
+Each of the aforementioned chunks contains samples of a single time series. A
+time series is thus represented as a series of chunks, which ultimately end up
+in a time series file (one file per time series) on disk.
+
+A series that has recently received new samples will have an open incomplete
+_head chunk_. Once that chunk is completely filled, or the series hasn't
+received samples in a while, the head chunk is closed and becomes a chunk
+waiting to be appended to its corresponding series file, i.e. it is _waiting
+for persistence_. After the chunk has been persisted to disk, it becomes
+_evictable_, provided it is not currently used by a query. Prometheus will
+evict evictable chunks from memory to satisfy the configured target heap
+size. A series with an open head chunk is called an _active series_. This is
+different from a _memory series_, which also includes series without an open
+head chunk but with other chunks still in memory (whether waiting for
+persistence, used in a query, or evictable). A series without any chunks in
+memory may be _archived_, upon which it ceases to have any mandatory memory
+footprint.
+
+The number of chunks Prometheus can keep in memory depends on the flag value
+for `storage.local.target-heap-size` and on the amount of memory used by
+everything else. If there are not enough evictable chunks to satisfy the target
+heap size, Prometheus will throttle ingestion of more samples (by skipping
+scrapes and rule evaluations) until the heap has shrunk enough. _Throttled
+ingestion is really bad for various reasons. You really do not want to be in
+that situation._
+
+Open head chunks, chunks still waiting for persistence, and chunks being used
+in a query are not evictable. Thus, the reasons for the inability to evict
+enough chunks include the following:
+
+1. Queries that use too many chunks.
+2. Chunks are piling up waiting for persistence because the storage layer
+   cannot keep up writing chunks.
+3. There are too many active time series, which results in too many open head
+   chunks.
+
+Currently, Prometheus has no defence against case (1). Abusive queries will
+essentially OOM the server.
+
+To defend against case (2), there is a concept of persistence urgency,
+explained in the next section.
+
+Case (3) depends on the targets you monitor. To mitigate an unplanned explosion
+of the number of series, you can limit the number of samples per individual
+scrape (see `sample_limit` in the [scrape config](configuration.md#scrape_config)).
+If the number of active time series exceeds the number of memory chunks the
+Prometheus server can afford, the server will quickly throttle ingestion as
+described above. The only way out of this is to give Prometheus more RAM or
+reduce the number of time series to ingest.
+
+In fact, you want many more memory chunks than you have series in
+memory. Prometheus tries to batch up disk writes as much as possible, as this
+helps both HDDs (write as much as possible after each seek) and SSDs (tiny
+writes create write amplification, which limits the effective throughput and
+burns much more quickly through the lifetime of the device). The more
+Prometheus can batch up writes, the more efficient the process of persisting
+chunks to disk becomes, which helps with case (2) above.
+
+In conclusion, to keep the Prometheus server healthy, make sure it has plenty
+of headroom of memory chunks available for the number of memory series. A
+factor of three is a good starting point. Refer to the
+[section about helpful metrics](#helpful-metrics) to find out what to look
+for. A very broad rule of thumb for an upper limit of memory series is the
+total available physical memory divided by 10,000, e.g. about 6M memory series
+on a 64GiB server.
+
+If you combine a high number of time series with very fast and/or large
+scrapes, the number of pre-allocated mutexes for series locking might not be
+sufficient. If you see scrape hiccups while Prometheus is writing a checkpoint
+or processing expensive queries, try increasing the value of the
+`storage.local.num-fingerprint-mutexes` flag. Sometimes tens of thousands or
+even more are required.
+
+PromQL queries that involve a high number of time series will make heavy use of
+the LevelDB-backed indexes. If you need to run queries of that kind, tweaking
+the index cache sizes might be required. The following flags are relevant:
+
+* `-storage.local.index-cache-size.label-name-to-label-values`: For regular
+  expression matching.
+* `-storage.local.index-cache-size.label-pair-to-fingerprints`: Increase the
+  size if a large number of time series share the same label pair or name.
+* `-storage.local.index-cache-size.fingerprint-to-metric` and
+  `-storage.local.index-cache-size.fingerprint-to-timerange`: Increase the size
+  if you have a large number of archived time series, i.e. series that have not
+  received samples in a while but are still not old enough to be purged
+  completely.
+
+You have to experiment with the flag values to find out what helps. If a query
+touches 100,000+ time series, hundreds of MiB might be reasonable. If you have
+plenty of memory available, using more of it for LevelDB cannot harm. Keep in
+mind, however, that more memory for LevelDB will effectively reduce the number
+of memory chunks Prometheus can afford.
+
+## Persistence urgency and “rushed mode”
+
+Naively, Prometheus would try to persist every completed chunk to disk as soon
+as possible. Such a strategy would lead to many tiny write operations, using up
+most of the I/O bandwidth and keeping the server quite busy. Spinning disks
+will appear to be very slow because of the many slow seeks required, and SSDs
+will suffer from write amplification. Prometheus tries instead to batch up
+write operations as much as possible, which works better if it is allowed to
+use more memory.
+
+Prometheus will also sync series files after each write (with
+`storage.local.series-sync-strategy=adaptive`, which is the default) and use
+the disk bandwidth for more frequent checkpoints (based on the count of “dirty
+series”, see [below](#crash-recovery)), both attempting to minimize data loss
+in case of a crash.
+
+But what to do if the number of chunks waiting for persistence grows too much?
+Prometheus calculates a score for the urgency to persist chunks. The score is
+between 0 and 1, where 1 corresponds to the highest urgency. Depending on the
+score, Prometheus will write to disk more frequently. Should the score ever
+pass the threshold of 0.8, Prometheus enters “rushed mode” (which you can see
+in the logs).
+In rushed mode, the following strategies are applied to speed up persisting
+chunks:
+
+* Series files are not synced after write operations anymore (making better use
+  of the OS's page cache at the price of an increased risk of losing data in
+  case of a server crash – this behavior can be overridden with the flag
+  `storage.local.series-sync-strategy`).
+* Checkpoints are only created as often as configured via the
+  `storage.local.checkpoint-interval` flag (freeing more disk bandwidth for
+  persisting chunks at the price of more data loss in case of a crash and an
+  increased time to run the subsequent crash recovery).
+* Write operations to persist chunks are not throttled anymore and performed as
+  fast as possible.
+
+Prometheus leaves rushed mode once the score has dropped below 0.7.
+
+Throttling of ingestion happens if the urgency score reaches 1. Thus, rushed
+mode is not _per se_ something to be avoided. It is, on the contrary, a
+measure the Prometheus server takes to avoid the really bad situation of
+throttled ingestion. Occasionally entering rushed mode is OK if it helps and
+ultimately leads to leaving rushed mode again. _If rushed mode is entered but
+the urgency score still goes up, the server has a real problem._
+
+## Settings for very long retention time
+
+If you have set a very long retention time via the `storage.local.retention`
+flag (more than a month), you might want to increase the value of the
+`storage.local.series-file-shrink-ratio` flag.
+
+Whenever Prometheus needs to cut off some chunks from the beginning of a series
+file, it will simply rewrite the whole file. (Some file systems support “head
+truncation”, which Prometheus currently does not use for several reasons.) To
+avoid rewriting a very large series file just to get rid of very few chunks,
+the rewrite only happens if at least 10% of the chunks in the series file are
+removed. This value can be changed via the mentioned
+`storage.local.series-file-shrink-ratio` flag. If you have a lot of disk space
+but want to minimize rewrites (at the cost of wasted disk space), increase the
+flag to a higher value, e.g. 0.3 to require that at least 30% of the chunks in
+a series file are removed before it is rewritten.
+
+## Crash recovery
+
+Prometheus saves chunks to disk as soon as possible after they are
+complete. Incomplete chunks are saved to disk during regular
+checkpoints. You can configure the checkpoint interval with the flag
+`storage.local.checkpoint-interval`. Prometheus creates checkpoints
+more frequently than that if too many time series are in a “dirty”
+state, i.e. their current incomplete head chunk is not the one that is
+contained in the most recent checkpoint. This limit is configurable
+via the `storage.local.checkpoint-dirty-series-limit` flag.
+
+In general, more active time series lead to more chunks waiting for
+persistence, which in turn leads to larger checkpoints and ultimately more
+time needed for checkpointing. There is a clear trade-off between limiting the
+loss of data in case of a crash and the ability to scale to a high number of
+active time series. To avoid spending the majority of the disk throughput on
+checkpointing, you have to increase the checkpoint interval. Prometheus itself
+limits the time spent in checkpointing to 50% by waiting after each
+checkpoint's completion for at least as long as the previous checkpoint took.
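+
+As an illustrative sketch only (the right values depend entirely on your
+setup), a server with many active series might checkpoint less often and
+tolerate more dirty series, accepting a larger possible data loss on crash in
+exchange for less disk I/O spent on checkpointing:
+
+```
+# Example values only: create checkpoints every 10 minutes, or earlier if
+# more than 500,000 series are dirty.
+prometheus -storage.local.checkpoint-interval=10m \
+           -storage.local.checkpoint-dirty-series-limit=500000
+```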
+
+Nevertheless, should your server crash, you might still lose data, and your
+storage might be left in an inconsistent state. Therefore, Prometheus performs
+a crash recovery after an unclean shutdown, similar to an `fsck` run for a file
+system. Details about the crash recovery are logged, so you can use them for
+forensics if required. Data that cannot be recovered is moved to a directory
+called `orphaned` (located under `storage.local.path`). Remember to delete that
+data if you do not need it anymore.
+
+The crash recovery usually takes less than a minute. Should it take much
+longer, consult the log to find out what is going on. With an increasing number
+of time series in the storage (archived or not), the re-indexing tends to
+dominate the recovery time and can take tens of minutes in extreme cases.
+
+## Data corruption
+
+If you suspect problems caused by corruption in the database, you can
+enforce a crash recovery by starting the server with the flag
+`storage.local.dirty`.
+
+If that does not help, or if you simply want to erase the existing
+database, you can easily start fresh by deleting the contents of the
+storage directory:
+
+ 1. Stop Prometheus.
+ 1. `rm -r <storage path>/*`
+ 1. Start Prometheus.
+
+## Helpful metrics
+
+Out of the metrics that Prometheus exposes about itself, the following are
+particularly useful for tweaking flags and finding out about the required
+resources. They also help to create alerts that tell you in time if a
+Prometheus server has problems or is out of capacity.
+
+* `prometheus_local_storage_memory_series`: The current number of series held
+  in memory.
+* `prometheus_local_storage_open_head_chunks`: The number of open head chunks.
+* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks
+  that still need to be persisted to disk.
+* `prometheus_local_storage_memory_chunks`: The current number of chunks held
+  in memory. If you subtract the previous two, you get the number of persisted
+  chunks (which are evictable if not currently in use by a query).
+* `prometheus_local_storage_series_chunks_persisted`: A histogram of the number
+  of chunks persisted per batch.
+* `prometheus_local_storage_persistence_urgency_score`: The urgency score as
+  discussed [above](#persistence-urgency-and-rushed-mode).
+* `prometheus_local_storage_rushed_mode` is 1 if Prometheus is in “rushed
+  mode”, 0 otherwise. Can be used to calculate the percentage of time
+  Prometheus is in rushed mode.
+* `prometheus_local_storage_checkpoint_last_duration_seconds`: How long the
+  last checkpoint took.
+* `prometheus_local_storage_checkpoint_last_size_bytes`: Size of the last
+  checkpoint in bytes.
+* `prometheus_local_storage_checkpointing` is 1 while Prometheus is
+  checkpointing, 0 otherwise. Can be used to calculate the percentage of time
+  Prometheus is checkpointing.
+* `prometheus_local_storage_inconsistencies_total`: Counter for storage
+  inconsistencies found. If this is greater than 0, restart the server for
+  recovery.
+* `prometheus_local_storage_persist_errors_total`: Counter for persist errors.
+* `prometheus_local_storage_memory_dirty_series`: Current number of dirty series.
+* `process_resident_memory_bytes`: Broadly speaking, the physical memory
+  occupied by the Prometheus process.
+* `go_memstats_alloc_bytes`: Go heap size (allocated objects in use plus
+  allocated objects not in use anymore but not yet garbage-collected).
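+
+For a quick look at any of these metrics without setting up dashboards, the
+server's HTTP API can be queried directly, for example (assuming the default
+`localhost:9090` listen address):
+
+```
+# Current persistence urgency score of the local server.
+curl -G 'http://localhost:9090/api/v1/query' \
+  --data-urlencode 'query=prometheus_local_storage_persistence_urgency_score'
+```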