5022 Commits

Author SHA1 Message Date
Yecheng Fu
8ceb8f2ae8 Refactor Kubernetes Discovery Part 2: Refactoring
- Doing the initial listing and syncing to the scrape manager first, and
  only then registering event handlers, may lose events that happen
  during listing and syncing (if it lasts a long time). We should
  register event handlers at the very beginning and, before processing,
  just wait until the informers have synced (the informer sync lists all
  objects and calls the OnUpdate event handler).
- Use a queue so that we don't block event callbacks and an object is
  processed only once if it is added multiple times before being processed.
- Fix a bug in `serviceUpdate` in endpoints.go: we should build endpoints
  when `exists && err == nil`. Add `^TestEndpointsDiscoveryWithService`
  tests to cover this.
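
The queue behaviour described above can be illustrated with a minimal
sketch. This is not the client-go work queue and not the code in this
commit; the class name `DedupQueue` and the example keys are hypothetical,
and the sketch only shows the "processed once if added multiple times"
property:

```python
from collections import deque


class DedupQueue:
    """FIFO queue that holds each pending key at most once (illustrative only)."""

    def __init__(self):
        self._order = deque()   # keys in arrival order
        self._pending = set()   # keys currently waiting in the queue

    def add(self, key):
        # An event callback only enqueues a key and returns immediately;
        # a key that is already waiting is not enqueued a second time.
        if key in self._pending:
            return
        self._pending.add(key)
        self._order.append(key)

    def get(self):
        # Called by the worker; raises IndexError if the queue is empty.
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._order)
```

Once a key has been taken by the worker, adding it again enqueues it again,
so an update that arrives after processing has started is not lost.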

Testing:

- Use the `k8s.io/client-go` testing framework and fake implementations,
  which are more robust and reliable for testing.
- `Test\w+DiscoveryBeforeRun` are used to test objects created before
  discoverer runs
- `Test\w+DiscoveryAdd\w+` are used to test adding objects
- `Test\w+DiscoveryDelete\w+` are used to test deleting objects
- `Test\w+DiscoveryUpdate\w+` are used to test updating objects
- `TestEndpointsDiscoveryWithService\w+` are used to test endpoints
  events triggered by services
- `cache.DeletedFinalStateUnknown` handling is removed because we don't
  care about deleted objects in the store; we only need the object's name
  to send a special `targetgroup.Group` to the scrape manager

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:34 +02:00
Yecheng Fu
9bc6ced55d Refactor Kubernetes Discovery Part 1: Add Vendor files.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:14 +02:00
Adam Shannon
809881d7f5 support reading basic_auth password_file for HTTP basic auth (#4077)
Issue: https://github.com/prometheus/prometheus/issues/4076

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-04-25 18:19:06 +01:00
Björn Rabenstein
7cc46bafcb
Merge pull request #4113 from prometheus/beorn7/juggling
Fix the merge into release-2.2
2018-04-25 17:09:10 +02:00
Ben Kochie
219433aae5
Update CircleCI build
Use CircleCI 2.0 build config.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-04-25 16:38:05 +02:00
Björn Rabenstein
91e470d733
Merge pull request #4096 from simonpasquier/fix-scrape-races-2.2
Fix scrape races (release-2.2 branch)
2018-04-25 15:36:29 +02:00
Ben Kochie
4a4e8a7d3b Fix spelling in Makefile.common. (#4105)
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-04-20 19:35:42 +03:00
Ben Kochie
76f6fe8f86
Merge pull request #4102 from krasi-georgiev/makefile
run the style target to fail if the code is not properly formatted
2018-04-20 17:30:42 +02:00
Krasi Georgiev
98c51d241b run the style target to fail if the code is not properly formatted
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-04-19 15:18:35 +03:00
Krasi Georgiev
0b0c9f4b6b
unused target didn't trigger an error for unused packages (#4101)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-04-19 15:07:55 +03:00
Krasi Georgiev
416db814e8
use package shorthand selection that excludes vendored. (#4100)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-04-19 13:38:01 +03:00
Krasi Georgiev
3f2b2c50dd
use the Makefile.common (#3978)
split common targets into a Makefile.common to reuse them across projects

Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-04-19 12:07:10 +03:00
Simon Pasquier
2cbba4e948 scrape: fix data races
This commit avoids passing the full scrape configuration down to the
scrape loop to fix data races when the scrape configuration is being
reloaded.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-18 11:17:31 +02:00
Simon Pasquier
8b89ab0173 scrape: add test detecting data races
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-18 11:17:25 +02:00
Rohit Gupta
30c3e02864 Fixes #4090. Marathon service discovery for 5XX http response (#4091)
Signed-off-by: rohit01 <hello@rohit.io>
2018-04-17 09:28:06 +01:00
Krasi Georgiev
d13db89548
Merge pull request #4073 from krasi-georgiev/remove-unused-vendored
remove unused vendored packages
2018-04-17 10:24:01 +03:00
David King
6286c10df0 Fix OOM when a large K is used in topk queries (#4087)
This attempts to close #3973.

Handles cases where the length of the input vector to an aggregate topk
/ bottomk function is less than the K parameter. The change updates
Prometheus to allocate a result vector the same length as the input
vector in these cases.

Previously Prometheus would out-of-memory panic for large K values. This
change makes that unlikely unless the size of the input vector is
equally large.
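
The idea can be sketched as follows. This is a simplified illustration in
Python, not the Go code changed by this commit; the function name `topk`
and the use of plain numbers instead of samples are assumptions made for
the example:

```python
import heapq


def topk(k, samples):
    """Return the k largest values, never sizing the result beyond the input."""
    if k < 1:
        return []
    # The result can never hold more elements than the input provides,
    # so its capacity is bounded by len(samples) rather than by k.
    capacity = min(k, len(samples))
    heap = []  # min-heap holding the current best `capacity` values
    for value in samples:
        if len(heap) < capacity:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # Replace the smallest retained value with a larger one.
            heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)
```

With this bound, a call such as `topk(10**12, [3, 1, 2])` only ever needs
room for three elements.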

Signed-off-by: David King <dave@davbo.org>
2018-04-16 09:03:04 +01:00
Björn Rabenstein
e7584ee345
Merge pull request #4072 from prometheus/beorn7/forward-merge
Merge 2.2 bugfixes into master
2018-04-11 13:17:55 +02:00
Krasi Georgiev
7951f6a0f6
Merge pull request #4075 from prometheus/issue-use-case
request a use case for proposals
2018-04-11 13:55:06 +03:00
Krasi Georgiev
1467d01147 request a use case for proposals
Signed-off-by: Krasi Georgiev <krasi-georgiev@users.noreply.github.com>
2018-04-11 13:47:48 +03:00
Krasi Georgiev
7679bc169d remove unused vendored packages
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-04-10 21:22:19 +03:00
beorn7
94ff07b81d Merge branch 'release-2.2'
Signed-off-by: beorn7 <beorn@soundcloud.com>
2018-04-10 16:50:35 +02:00
Björn Rabenstein
f8dcf9b272
Merge pull request #4066 from krasi-georgiev/race-DiscoveredLabels
add mutex for DiscoveredLabels
2018-04-10 15:36:56 +02:00
Krasi Georgiev
dc29dd1c6f add mutex for DiscoveredLabels
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-04-10 00:18:58 +03:00
Björn Rabenstein
e65fc8591a
Merge pull request #4064 from prometheus/beorn7/vendoring
Update vendoring of prometheus/common/route to include data race fix
2018-04-09 17:50:06 +02:00
beorn7
bd44e7fe98 Update vendoring of prometheus/common/route to include data race fix
See https://github.com/prometheus/common/pull/125

Signed-off-by: beorn7 <beorn@soundcloud.com>
2018-04-09 17:48:32 +02:00
Krasi Georgiev
ddd46de6f4 Races/3994 (#4005)
Fix race by properly locking access to scrape pools. Use separate mutex for information needed by UI so that UI isn't blocked when targets are being updated.
2018-04-09 15:18:25 +01:00
Mario Trangoni
464e747f1e fix some comments typos (#4059) 2018-04-08 10:51:54 +01:00
Sneha Inguva
cbfb207cca vendor: correctly update golang client (#4056) 2018-04-06 18:05:32 +01:00
Tony Lee
7cd56f56df add queue_time slice to query_duration_seconds (#4050) 2018-04-05 19:56:58 +01:00
Julius Volz
fe10b36b30 Fix curl example for deleting series (#4046) 2018-04-05 13:06:18 +01:00
sev3ryn
cc917aee7f fix of endless loop while doing Consul service discovery. (#4044)
Reloading the Prometheus configuration did not make the loop end,
which produced a goroutine leak.
2018-04-05 10:41:09 +01:00
Philippe Laflamme
2aba238f31 Use common HTTPClientConfig for marathon_sd configuration (#4009)
This adds support for basic authentication which closes #3090

The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.

DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.

Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
2018-04-05 09:08:18 +01:00
Manos Fokas
25f929b772 Yaml UnmarshalStrict implementation. (#4033)
* Updated yaml vendor package.

* remove checkOverflow duplicate in rulefmt

* remove duplicated HTTPClientConfig.Validate()

* Added yaml static check.
2018-04-04 09:07:39 +01:00
Krasi Georgiev
406233e937
Merge pull request #4034 from si74/main_comments
main: actor functionality comments
2018-04-03 12:52:15 +03:00
Sneha Inguva
7be846754a main: actor functionality comments 2018-04-01 11:19:30 -07:00
albatross0
0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov
be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Bryan Boreham
93494d8b7e Add an OpenTracing span for each rule (#4027)
* Add an OpenTracing span for each rule

So that tags and child spans can be traced back to the rule that they
refer to.
2018-03-30 21:29:19 +01:00
Björn Rabenstein
6cf725c56d
Merge pull request #4031 from codesome/fix-bug-from-4025
Fix bug from 4025
2018-03-30 16:41:30 +02:00
Ganesh Vernekar
b44ce11d1b Added test to check pathPrefix 2018-03-30 11:55:54 +05:30
Ganesh Vernekar
cd2820e165 Fix pathPrefix bug from PR-4025 2018-03-30 11:04:15 +05:30
Solomon Van
68e394a56e notifier: update use testutil for testing (#3695) 2018-03-29 16:07:26 +01:00
Elif T. Kuş
daebf68ea2 Rewrote tests for relabel and template (#3754)
* relabel: use testutil for testing

* template: use testutil for testing
2018-03-29 16:02:28 +01:00
Björn Rabenstein
61accb51ac
Merge pull request #4025 from codesome/route-prefix
Fixed pathPrefix for web pages
2018-03-29 16:22:54 +02:00
Ganesh Vernekar
f30b37e00b Fixed pathPrefix for web pages 2018-03-29 18:02:25 +05:30
Fabian Reinartz
184b6e3767
Merge pull request #3968 from zjwzte/fix-magic-number
Fix magic number.
2018-03-28 14:09:43 +02:00
Krasi Georgiev
dfd6709a44 update common package (#4015) 2018-03-27 10:21:56 +05:30
Krasi Georgiev
5fec98d0a7 simplify server error handling (#4006) 2018-03-25 10:05:59 +01:00
Corentin Chary
60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allows filtering by node-meta, and returns a `map[string]string` of `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call, by subtracting the number of elapsed seconds.
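
The arithmetic is simple enough to show as a small sketch. The function
name `remaining_wait` is hypothetical and the code is an illustration in
Python, not the Go code from this commit:

```python
def remaining_wait(refresh_interval, elapsed):
    """Seconds still to wait so that updates are at least refresh_interval apart.

    `elapsed` is the time already spent in the blocking call; if it
    exceeds the interval, no further wait is needed.
    """
    return max(0.0, refresh_interval - elapsed)
```

For a 30s interval, a blocking call that already took 29s leaves 1s to
wait, and one that took longer than 30s leaves nothing to wait.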

Hopefully this will do what people expect and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00