811 Commits

Author SHA1 Message Date
beorn7
2fad91d25a Avoid blocking in the logThrottling loop
The timer semantics is really hard. The simple pattern as given in the
godoc for the time package assumes we are not elsewhere consuming from
the timer's channel. However, exactly that can happen here with the
right sequence of events. Thus, we have to drain the channel only if
it has something to drain.
2017-10-06 14:19:30 +02:00
Alexander Morozov
7e814ada72 [storage/local] fix timer.Reset usage
According to https://golang.org/pkg/time/#Timer.Reset you must not rely
on a returned value from Reset.

Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>
2017-10-05 15:48:03 -07:00
Bryan Boreham
a03193232a Don't disable HTTP keep-alives for remote storage connections. (#3173)
Removes configurability introduced in #3160 in favour of hard-coding,
per advice from @brian-brazil.
2017-10-05 12:32:24 +01:00
Tom Wilkie
758d64ffd9 s/EncodReadResponse/EncodeReadResponse/ 2017-09-16 11:15:03 +02:00
Tom Wilkie
febed48703 Implement remote read server in Prometheus. 2017-09-16 11:13:01 +02:00
Julius Volz
8ebeed0b44 remote: Expose ClientConfig type (#3165)
The Client type is already exposed, but can't be used without the config for it
also being exposed. Using the remote.Client from other programs is useful to do
full end-to-end tests of Prometheus's remote protocol against adapter
implementations.
2017-09-14 15:25:09 +02:00
Tom Wilkie
f66f882d08 Merge pull request #3160 from bboreham/remote-keepalive
Re-enable http keepalive on remote storage
2017-09-14 08:23:43 +01:00
Ben Kochie
1ab0bbb2c2 Merge pull request #3125 from prometheus/bjk/staticcheck
Enable statitcheck at build time.
2017-09-13 14:42:29 -07:00
Bryan Boreham
9d6b945e41 Default HTTP keep-alive ON for remote read/write 2017-09-11 09:48:30 +00:00
wangguoliang
7e6c6020ff should use time.Since instead of time.Now().Sub
Signed-off-by: wgliang <liangcszzu@163.com>
2017-09-07 18:00:45 +08:00
Ben Kochie
59aca4138b Fix staticcheck issues. 2017-08-28 17:29:01 +02:00
Tom Wilkie
e1c77cdfd4 Merge pull request #2991 from tomwilkie/2990-remote-config
Make queue manager configurable.
2017-08-03 10:26:29 +01:00
Tom Wilkie
5169f990f9 Review feedback: add yaml struct tags, don't embed queue config.
Also, rename QueueManageConfig to QueueConfig, for consistency with tags.
2017-08-01 14:43:56 +01:00
Fabian Reinartz
bc2e9459d8 Merge pull request #2973 from tomwilkie/2969-negative-shards
Prevent number of remote write shards from going negative.
2017-07-28 13:02:33 +02:00
Tom Wilkie
454b661145 Make queue manager configurable. 2017-07-25 13:47:34 +01:00
Conor Broderick
4b868113bb Metric name validation (#2975) 2017-07-24 13:49:20 +01:00
beorn7
3bb0667607 Merge branch 'release-1.7' 2017-07-21 19:40:30 +02:00
beorn7
ea5e7eafde Fix #2965
We would overscan when hitting a value directly, interspersed with
samples in between timestamps. Apparently, that happens rarely enough
that it was only noticed recently.
2017-07-21 16:35:15 +02:00
beorn7
c06292af2f Add test to expose #2965 2017-07-21 16:25:24 +02:00
Tom Wilkie
1d94eb8d95 Prevent number of remote write shards from going negative.
This can happen in the situation where the system scales up the number of shards massively (to deal with some backlog), then scales it down again as the number of samples sent during the time period is less than the number received.
2017-07-19 16:27:19 +01:00
Alexey Palazhchenko
76c9a0d931 Fix typo (#2963) 2017-07-18 16:11:10 +01:00
Matt Bostock
13c6e4a4bc Remote queue manager: Fix typo
Change 'send' to 'sent'.
2017-07-04 20:48:52 +01:00
Tom Wilkie
24a113bb09 Review feedback: limit number of bytes read under error. 2017-06-01 11:21:48 +01:00
Tom Wilkie
46abe8cbf2 Remote write: read first line of response and include it in the error. 2017-05-31 13:46:08 +01:00
Alexey Palazhchenko
b0e1ea7c6c Simplify code, fix typos. (#2719) 2017-05-15 09:56:09 +01:00
Julius Volz
1c72524870 Fix HTTP error handling in remote.Client.Store() (#2708)
Regression introduced in
e5d7bbfc3c
2017-05-11 18:40:10 +02:00
Tom Wilkie
3141a6b36b Compress remote storage requests and responses with unframed/raw snappy. (#2696)
* Compress remote storage requests and responses with unframed/raw snappy, for compatibility with other languages.

* Remove backwards compatibility code from remote_storage_adapter, update example_write_adapter

* Add /documentation/examples/remote_storage/example_write_adapter/example_writer_adapter to .gitignore
2017-05-10 16:42:59 +02:00
beorn7
46226088aa Merge branch 'release-1.6' 2017-05-09 11:16:07 +02:00
beorn7
69eddc9e84 storage: Correctly increase prometheus_local_storage_open_head_chunks 2017-05-08 18:20:23 +02:00
Tom Wilkie
2195bb66f7 Ensure ewma int64s are always aligned. (#2675) 2017-05-03 14:32:50 -05:00
Tom Wilkie
4d9b917d11 Instrument Prometheus with OpenTracing (#2554)
* Use request.Context() instead of a global map of contexts.

* Add some basic opentracing instrumentation on the query path.

* Remove tracehandler endpoint.
2017-05-02 18:49:29 -05:00
beorn7
1dd737d7c3 storage: Don't panic if storage has no FPs even after initial wait 2017-04-18 15:59:12 +02:00
beorn7
c53f256a09 storage: Fix use of counter (Set -> Add) 2017-04-11 12:58:24 +02:00
beorn7
f338d791d2 storage: Several optimizations of checkpointing
- checkpointSeriesMapAndHeads accepts a context now to allow
  cancelling.

- If a shutdown is initiated, cancel the ongoing checkpoint. (We will
  create a final checkpoint anyway.)

- Always wait for at least as long as the last checkpoint took before
  starting the next checkpoint (to cap the time spending checkpointing
  at 50%).

- If an error has occurred during checkpointing, don't bother to sync
  the write.

- Make sure the temporary checkpoint file is deleted, even if an error
  has occurred.

- Clean up the checkpoint loop a bit. (The concurrent Timer.Reset(0)
  call might have cause a race.)
2017-04-07 13:10:12 +02:00
Björn Rabenstein
934d86b936 Merge pull request #2593 from prometheus/beorn7/storage2
storage: Recover from corrupted indices for archived series
2017-04-07 12:55:35 +02:00
Björn Rabenstein
38bcba11fe Merge pull request #2594 from prometheus/beorn7/storage3
storage: Guard against a corner case of data corruption
2017-04-07 00:52:28 +02:00
Björn Rabenstein
f0076aca01 Merge pull request #2595 from prometheus/beorn7/storage4
storage: Guard against appending to evicted chunk
2017-04-07 00:51:53 +02:00
Tom Wilkie
e5d7bbfc3c Remote writes: retry on recoverable errors. (#2552)
* Remote writes: retry on recoverable errors.

* Add comments

* Review feedback

* Comments

* Review feedback

* Final spelling misteak (I hope).  Plus, record failed samples correctly.
2017-04-07 00:15:41 +02:00
beorn7
7199a9d9d4 storage: Guard against appending to evicted chunk
Fixes #2480. For certain definition of "fixes".

This is something that should never happen. Sadly, it does happen,
albeit extremely rarely. This could be some weird cornercase we
haven't covered yet. Or it happens as a consequesnce of data
corruption or a crash recovery gone bad.

This is not a "real" fix as we don't know the root cause of the
incident reported in #2480. However, this makes sure the server does
not crash, but deals gracefully with the problem: The series in
question is quarantined, which even makes it available for forensics.
2017-04-06 20:02:52 +02:00
beorn7
3d12906286 storage: Guard against a corner case of data corruption
Fixes #2475.
2017-04-06 19:50:32 +02:00
beorn7
4fcc73a04c storage: Recover from corrupted indices for archived series
An unopenable archived_fingerprint_to_timerange is simply deleted and
will be rebuilt during crash recovery (wich can then take quite some time).

An unopenable archived_fingerprint_to_metric is not deleted but
instructions to the user are logged. A deletion has to be done by the
user explicitly as it means losing all archived series (and a repair
with a 3rd party tool might still be possible).
2017-04-06 19:26:39 +02:00
Julius Volz
9775ad4754 Merge pull request #2588 from prometheus/read-multi
Separate out remote read responses.
2017-04-06 17:10:31 +02:00
Brian Brazil
c813c824d4 Separate out remote read responses.
Fixes #2574
2017-04-06 15:49:47 +01:00
Björn Rabenstein
516a96d9a3 Merge pull request #2587 from prometheus/beorn7/storage2
storage: Mark storage as dirty if indexing fails
2017-04-06 16:42:06 +02:00
beorn7
ed5f68f382 storage: Increment s.persistErrors on all persist errors
Fixes #2091
2017-04-06 15:55:15 +02:00
beorn7
f3365c4f26 storage: Mark storage as dirty if indexing fails 2017-04-06 15:29:33 +02:00
Alexey Palazhchenko
17f15d024a Small fixes. (#2578)
Fix typos. Simplify with gofmt -s
2017-04-05 14:24:22 +01:00
Björn Rabenstein
425f591fc9 Merge pull request #2576 from prometheus/beorn7/storage
storage: Check for negative values from varint decoding
2017-04-04 23:23:51 +02:00
beorn7
ae286385fd storage: Check for negative values from varint decoding
Sadly, we have a number of places where we use varint encoding for
numbers that cannot be negative. We could have saved a bit by using
uvarint encoding. On the bright side, we now have a 50% chance to
detect data corruption. :-/

Fixes #1800 and #2492.
2017-04-04 19:14:52 +02:00
beorn7
9b6a1dad05 storage: Fix go vet error 2017-04-04 19:14:09 +02:00