# WAL Disk Format

The write ahead log operates in segments that are numbered and sequential,
e.g. `000000`, `000001`, `000002`, etc., and are limited to 128MB by default.
A segment is written to in pages of 32KB. Only the last page of the most recent segment
may be partial. A WAL record is an opaque byte slice that gets split up into sub-records
should it exceed the remaining space of the current page. Records are never split across
segment boundaries. If a single record exceeds the default segment size, a segment with
a larger size will be created.
The encoding of pages is largely borrowed from [LevelDB's/RocksDB's write ahead log](https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format).

The notable deviation is the encoding of the record fragment:

```
┌───────────┬──────────┬────────────┬──────────────┐
│ type <1b> │ len <2b> │ CRC32 <4b> │ data <bytes> │
└───────────┴──────────┴────────────┴──────────────┘
```

The type flag has the following states:

* `0`: rest of page will be empty
* `1`: a full record encoded in a single fragment
* `2`: first fragment of a record
* `3`: middle fragment of a record
* `4`: final fragment of a record

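As a rough illustration of the fragment layout and type flags above, the following Go sketch splits one record into fragments across 32KB pages. It is not the Prometheus implementation: the function name `appendRecord` is hypothetical, and the big-endian length field and the Castagnoli CRC32 polynomial are assumptions, since the text above only gives the field widths.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const (
	pageSize   = 32 * 1024 // page size given in the text
	headerSize = 7         // type <1b> + len <2b> + CRC32 <4b>
)

// Fragment type flags from the list above.
const (
	typePageTerm byte = 0 // rest of page will be empty
	typeFull     byte = 1 // full record in a single fragment
	typeFirst    byte = 2 // first fragment of a record
	typeMiddle   byte = 3 // middle fragment of a record
	typeLast     byte = 4 // final fragment of a record
)

// Assumption: Castagnoli polynomial; the text only says "CRC32".
var crcTable = crc32.MakeTable(crc32.Castagnoli)

// appendRecord (hypothetical) appends rec to the page-structured buffer buf,
// splitting it into fragments whenever it does not fit the current page.
func appendRecord(buf, rec []byte) []byte {
	for first := true; first || len(rec) > 0; first = false {
		left := pageSize - len(buf)%pageSize
		if left <= headerSize {
			// No room for a header plus data: zero-fill the page tail.
			// A zero type byte (typePageTerm) marks the rest as empty.
			buf = append(buf, make([]byte, left)...)
			left = pageSize
		}
		n := len(rec)
		if room := left - headerSize; n > room {
			n = room
		}
		part := rec[:n]
		rec = rec[n:]

		typ := typeMiddle
		switch {
		case first && len(rec) == 0:
			typ = typeFull
		case first:
			typ = typeFirst
		case len(rec) == 0:
			typ = typeLast
		}

		var hdr [headerSize]byte
		hdr[0] = typ
		binary.BigEndian.PutUint16(hdr[1:3], uint16(n)) // assumption: big-endian
		binary.BigEndian.PutUint32(hdr[3:7], crc32.Checksum(part, crcTable))
		buf = append(buf, hdr[:]...)
		buf = append(buf, part...)
	}
	return buf
}

func main() {
	out := appendRecord(nil, make([]byte, 40000))
	// The record spans two pages: a first fragment, then a final fragment.
	fmt.Println(len(out), out[0], out[pageSize])
}
```

A 40000-byte record does not fit into one page, so the sketch emits a type `2` fragment that fills the first page and a type `4` fragment at the start of the second. A record that fits in the remaining page space would be written as a single type `1` fragment.
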
## Record encoding

The records written to the write ahead log are encoded as follows:

### Series records

Series records encode the labels that identify a series and its unique ID.

```
┌────────────────────────────────────────────┐
│ type = 1 <1b>                              │
├────────────────────────────────────────────┤
│ ┌─────────┬──────────────────────────────┐ │
│ │ id <8b> │ n = len(labels) <uvarint>    │ │
│ ├─────────┴────────────┬─────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>   │ │
│ ├──────────────────────┴─────────────────┤ │
│ │  ...                                   │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes> │ │
│ └───────────────────────┴────────────────┘ │
│                  . . .                     │
└────────────────────────────────────────────┘
```

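The series layout above could be produced by something like the Go sketch below. The `series` and `label` types and the `encodeSeries` function are hypothetical names introduced for illustration, and big-endian byte order for the 8-byte ID is an assumption. Each label contributes two length-prefixed strings (name, then value), which is why the diagram runs to `str_2n`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// label and series are hypothetical types used only in this sketch.
type label struct{ name, value string }

type series struct {
	id     uint64
	labels []label
}

const recordTypeSeries byte = 1

// appendString writes a uvarint length prefix followed by the string bytes.
func appendString(buf []byte, s string) []byte {
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// encodeSeries writes the type byte, then one entry per series:
// id <8b>, n = len(labels) <uvarint>, and 2n length-prefixed strings.
func encodeSeries(list []series) []byte {
	buf := []byte{recordTypeSeries}
	for _, s := range list {
		buf = binary.BigEndian.AppendUint64(buf, s.id) // assumption: big-endian
		buf = binary.AppendUvarint(buf, uint64(len(s.labels)))
		for _, l := range s.labels {
			buf = appendString(buf, l.name)
			buf = appendString(buf, l.value)
		}
	}
	return buf
}

func main() {
	rec := encodeSeries([]series{
		{id: 1, labels: []label{{"job", "api"}}},
	})
	fmt.Printf("%d bytes: % x\n", len(rec), rec)
}
```
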
### Sample records

Sample records encode samples as a list of triples `(series_id, timestamp, value)`.
Series reference and timestamp are encoded as deltas relative to the first sample.
The first row stores the starting id and the starting timestamp.
The first sample record begins at the second row.

```
┌──────────────────────────────────────────────────────────────────┐
│ type = 2 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ └────────────────────┴───────────────────────────┴─────────────┘ │
│                              . . .                               │
└──────────────────────────────────────────────────────────────────┘
```

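To make the delta scheme concrete, here is a minimal Go sketch that encodes and decodes a sample record following the diagram literally. The `sample` type and both function names are hypothetical. The sketch assumes big-endian fixed-width fields, a value stored as IEEE 754 float64 bits, and deltas that are never negative (samples ordered by series reference and timestamp); a writer that allows negative deltas would need a signed varint instead.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// sample is a hypothetical in-memory form of one (series_id, timestamp, value) triple.
type sample struct {
	ref uint64
	t   int64
	v   float64
}

const recordTypeSamples byte = 2

// encodeSamples writes the base id and timestamp in the first row, then one
// row per sample (including the first) with deltas against that base.
func encodeSamples(samples []sample) []byte {
	buf := []byte{recordTypeSamples}
	if len(samples) == 0 {
		return buf
	}
	first := samples[0]
	buf = binary.BigEndian.AppendUint64(buf, first.ref)
	buf = binary.BigEndian.AppendUint64(buf, uint64(first.t))
	for _, s := range samples {
		// Assumption: deltas are non-negative, as the <uvarint> fields imply.
		buf = binary.AppendUvarint(buf, s.ref-first.ref)
		buf = binary.AppendUvarint(buf, uint64(s.t-first.t))
		buf = binary.BigEndian.AppendUint64(buf, math.Float64bits(s.v))
	}
	return buf
}

// decodeSamples reverses encodeSamples.
func decodeSamples(rec []byte) ([]sample, error) {
	if len(rec) == 0 || rec[0] != recordTypeSamples {
		return nil, errors.New("not a sample record")
	}
	rec = rec[1:]
	if len(rec) == 0 {
		return nil, nil
	}
	if len(rec) < 16 {
		return nil, errors.New("truncated base row")
	}
	baseRef := binary.BigEndian.Uint64(rec)
	baseT := int64(binary.BigEndian.Uint64(rec[8:]))
	rec = rec[16:]

	var out []sample
	for len(rec) > 0 {
		dref, n := binary.Uvarint(rec)
		if n <= 0 {
			return nil, errors.New("bad id_delta")
		}
		rec = rec[n:]
		dt, m := binary.Uvarint(rec)
		if m <= 0 {
			return nil, errors.New("bad timestamp_delta")
		}
		rec = rec[m:]
		if len(rec) < 8 {
			return nil, errors.New("truncated value")
		}
		v := math.Float64frombits(binary.BigEndian.Uint64(rec))
		rec = rec[8:]
		out = append(out, sample{baseRef + dref, baseT + int64(dt), v})
	}
	return out, nil
}

func main() {
	in := []sample{{10, 1000, 1.5}, {12, 1015, 2.5}}
	rec := encodeSamples(in)
	out, err := decodeSamples(rec)
	fmt.Println(len(rec), out, err)
}
```

Because most samples in one record are close to the first in both reference and time, the deltas usually fit in one or two bytes instead of eight.
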
### Tombstone records

Tombstone records encode tombstones as a list of triples `(series_id, min_time, max_time)`.
Each triple specifies an interval for which samples of a series were deleted.

```
┌─────────────────────────────────────────────────────┐
│ type = 3 <1b>                                       │
├─────────────────────────────────────────────────────┤
│ ┌─────────┬───────────────────┬───────────────────┐ │
│ │ id <8b> │ min_time <varint> │ max_time <varint> │ │
│ └─────────┴───────────────────┴───────────────────┘ │
│                        . . .                        │
└─────────────────────────────────────────────────────┘
```

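Unlike sample records, tombstone rows carry absolute values, with the two time bounds stored as signed varints. The Go sketch below shows one way to write and read that layout; the `tombstone` type and function names are hypothetical, and big-endian order for the 8-byte ID is an assumption.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tombstone is a hypothetical in-memory form of one deleted interval.
type tombstone struct {
	ref        uint64
	mint, maxt int64
}

const recordTypeTombstones byte = 3

// encodeTombstones writes one (id, min_time, max_time) row per interval.
func encodeTombstones(ts []tombstone) []byte {
	buf := []byte{recordTypeTombstones}
	for _, t := range ts {
		buf = binary.BigEndian.AppendUint64(buf, t.ref) // assumption: big-endian
		buf = binary.AppendVarint(buf, t.mint)          // signed, so negative times work
		buf = binary.AppendVarint(buf, t.maxt)
	}
	return buf
}

// decodeTombstones reverses encodeTombstones.
func decodeTombstones(rec []byte) ([]tombstone, error) {
	if len(rec) == 0 || rec[0] != recordTypeTombstones {
		return nil, errors.New("not a tombstone record")
	}
	rec = rec[1:]
	var out []tombstone
	for len(rec) > 0 {
		if len(rec) < 8 {
			return nil, errors.New("truncated id")
		}
		ref := binary.BigEndian.Uint64(rec)
		rec = rec[8:]
		mint, n := binary.Varint(rec)
		if n <= 0 {
			return nil, errors.New("bad min_time")
		}
		rec = rec[n:]
		maxt, m := binary.Varint(rec)
		if m <= 0 {
			return nil, errors.New("bad max_time")
		}
		rec = rec[m:]
		out = append(out, tombstone{ref, mint, maxt})
	}
	return out, nil
}

func main() {
	rec := encodeTombstones([]tombstone{{7, -5, 100}})
	out, err := decodeTombstones(rec)
	fmt.Println(len(rec), out, err)
}
```
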
### Exemplar records

Exemplar records encode exemplars as a list of triples `(series_id, timestamp, value)`,
each followed by the length of its labels list and the labels themselves.
The first row stores the starting id and the starting timestamp.
Series reference and timestamp are encoded as deltas relative to the first exemplar.
The first exemplar record begins at the second row.

See: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exemplars

```
┌──────────────────────────────────────────────────────────────────┐
│ type = 5 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ ├────────────────────┴───────────────────────────┴─────────────┤ │
│ │  n = len(labels) <uvarint>                                   │ │
│ ├──────────────────────┬───────────────────────────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>                         │ │
│ ├──────────────────────┴───────────────────────────────────────┤ │
│ │  ...                                                         │ │
│ ├───────────────────────┬──────────────────────────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes>                       │ │
│ └───────────────────────┴──────────────────────────────────────┘ │
│                              . . .                               │
└──────────────────────────────────────────────────────────────────┘
```
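
An exemplar row combines the sample row (delta-encoded id and timestamp plus a value) with the label list used in series records. The Go sketch below encodes that layout as drawn; the `exemplar` and `label` types and `encodeExemplars` are hypothetical names, and the byte order, float64 value bits, and non-negative deltas are assumptions carried over from the field annotations in the diagram.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// label and exemplar are hypothetical types used only in this sketch.
type label struct{ name, value string }

type exemplar struct {
	ref    uint64
	t      int64
	v      float64
	labels []label
}

const recordTypeExemplars byte = 5

// appendString writes a uvarint length prefix followed by the string bytes.
func appendString(buf []byte, s string) []byte {
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// encodeExemplars writes the base id and timestamp, then one row per
// exemplar: deltas, value, label count, and 2n length-prefixed strings.
func encodeExemplars(es []exemplar) []byte {
	buf := []byte{recordTypeExemplars}
	if len(es) == 0 {
		return buf
	}
	first := es[0]
	buf = binary.BigEndian.AppendUint64(buf, first.ref)
	buf = binary.BigEndian.AppendUint64(buf, uint64(first.t))
	for _, e := range es {
		// Assumption: deltas are non-negative, as the <uvarint> fields imply.
		buf = binary.AppendUvarint(buf, e.ref-first.ref)
		buf = binary.AppendUvarint(buf, uint64(e.t-first.t))
		buf = binary.BigEndian.AppendUint64(buf, math.Float64bits(e.v))
		buf = binary.AppendUvarint(buf, uint64(len(e.labels)))
		for _, l := range e.labels {
			buf = appendString(buf, l.name)
			buf = appendString(buf, l.value)
		}
	}
	return buf
}

func main() {
	rec := encodeExemplars([]exemplar{
		{ref: 3, t: 1000, v: 0.25, labels: []label{{"trace_id", "abc"}}},
	})
	fmt.Printf("%d bytes: % x\n", len(rec), rec)
}
```
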