prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2025-09-27 08:41:05 +02:00

Author	SHA1	Message	Date
Yecheng Fu	8ceb8f2ae8	Refactor Kubernetes Discovery Part 2: Refactoring - Doing the initial listing and syncing to the scrape manager and only then registering event handlers may lose events that happen during listing and syncing (if it lasts a long time). We should register event handlers at the very beginning and, before processing, just wait until the informers have synced (the sync in the informer will list all objects and call the OnUpdate event handler). - Use a queue so that we don't block event callbacks and an object is processed only once if it is added multiple times before being processed. - Fix bug in `serviceUpdate` in endpoints.go: we should build endpoints when `exists && err == nil`. Add `^TestEndpointsDiscoveryWithService` tests to test this feature. Testing: - Use the `k8s.io/client-go` testing framework and fake implementations, which are more robust and reliable for testing. - `Test\w+DiscoveryBeforeRun` are used to test objects created before the discoverer runs - `Test\w+DiscoveryAdd\w+` are used to test adding objects - `Test\w+DiscoveryDelete\w+` are used to test deleting objects - `Test\w+DiscoveryUpdate\w+` are used to test updating objects - `TestEndpointsDiscoveryWithService\w+` are used to test endpoints events triggered by services - `cache.DeletedFinalStateUnknown`-related code is removed, because we don't care about deleted objects in the store; we only need the object's name to send a special `targetgroup.Group` to the scrape manager Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>	2018-04-25 19:28:34 +02:00
Yecheng Fu	9bc6ced55d	Refactor Kubernetes Discovery Part 1: Add Vendor files. Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>	2018-04-25 19:28:14 +02:00
Adam Shannon	809881d7f5	support reading basic_auth password_file for HTTP basic auth (#4077) Issue: https://github.com/prometheus/prometheus/issues/4076 Signed-off-by: Adam Shannon <adamkshannon@gmail.com>	2018-04-25 18:19:06 +01:00
Björn Rabenstein	7cc46bafcb	Merge pull request #4113 from prometheus/beorn7/juggling Fix the merge into release-2.2	2018-04-25 17:09:10 +02:00
Ben Kochie	219433aae5	Update CircleCI build Use CircleCI 2.0 build config. Signed-off-by: Ben Kochie <superq@gmail.com>	2018-04-25 16:38:05 +02:00
Björn Rabenstein	91e470d733	Merge pull request #4096 from simonpasquier/fix-scrape-races-2.2 Fix scrape races (release-2.2 branch)	2018-04-25 15:36:29 +02:00
Ben Kochie	4a4e8a7d3b	Fix spelling in Makefile.common. (#4105) Signed-off-by: Ben Kochie <superq@gmail.com>	2018-04-20 19:35:42 +03:00
Ben Kochie	76f6fe8f86	Merge pull request #4102 from krasi-georgiev/makefile run the style target to fail if the code is not properly formatted	2018-04-20 17:30:42 +02:00
Krasi Georgiev	98c51d241b	run the style target to fail if the code is not properly formatted Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-04-19 15:18:35 +03:00
Krasi Georgiev	0b0c9f4b6b	unused target didn't trigger an error for unused packages (#4101) Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-04-19 15:07:55 +03:00
Krasi Georgiev	416db814e8	use package shorthand selection that excludes vendored. (#4100) Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-04-19 13:38:01 +03:00
Krasi Georgiev	3f2b2c50dd	use the Makefile.common (#3978) split common targets in a Makefile.common to reuse it across projects Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>	2018-04-19 12:07:10 +03:00
Simon Pasquier	2cbba4e948	scrape: fix data races This commit avoids passing the full scrape configuration down to the scrape loop to fix data races when the scrape configuration is being reloaded. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-18 11:17:31 +02:00
Simon Pasquier	8b89ab0173	scrape: add test detecting data races Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-18 11:17:25 +02:00
Rohit Gupta	30c3e02864	Fixes #4090. Marathon service discovery for 5XX http response (#4091) Signed-off-by: rohit01 <hello@rohit.io>	2018-04-17 09:28:06 +01:00
Krasi Georgiev	d13db89548	Merge pull request #4073 from krasi-georgiev/remove-unused-vendored remove unused vendored packages	2018-04-17 10:24:01 +03:00
David King	6286c10df0	Fix OOM when a large K is used in topk queries (#4087) This attempts to close #3973. Handles cases where the length of the input vector to an aggregate topk / bottomk function is less than the K parameter. The change updates Prometheus to allocate a result vector the same length as the input vector in these cases. Previously Prometheus would out-of-memory panic for large K values. This change makes that unlikely unless the size of the input vector is equally large. Signed-off-by: David King <dave@davbo.org>	2018-04-16 09:03:04 +01:00
Björn Rabenstein	e7584ee345	Merge pull request #4072 from prometheus/beorn7/forward-merge Merge 2.2 bugfixes into master	2018-04-11 13:17:55 +02:00
Krasi Georgiev	7951f6a0f6	Merge pull request #4075 from prometheus/issue-use-case request a use case for proposals	2018-04-11 13:55:06 +03:00
Krasi Georgiev	1467d01147	request a use case for proposals Signed-off-by: Krasi Georgiev <krasi-georgiev@users.noreply.github.com>	2018-04-11 13:47:48 +03:00
Krasi Georgiev	7679bc169d	remove unused vendored packages Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>	2018-04-10 21:22:19 +03:00
beorn7	94ff07b81d	Merge branch 'release-2.2' Signed-off-by: beorn7 <beorn@soundcloud.com>	2018-04-10 16:50:35 +02:00
Björn Rabenstein	f8dcf9b272	Merge pull request #4066 from krasi-georgiev/race-DiscoveredLabels add mutex for DiscoveredLabels	2018-04-10 15:36:56 +02:00
Krasi Georgiev	dc29dd1c6f	add mutex for DiscoveredLabels Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>	2018-04-10 00:18:58 +03:00
Björn Rabenstein	e65fc8591a	Merge pull request #4064 from prometheus/beorn7/vendoring Update vendoring of prometheus/common/route to include data race fix	2018-04-09 17:50:06 +02:00
beorn7	bd44e7fe98	Update vendoring of prometheus/common/route to include data race fix See https://github.com/prometheus/common/pull/125 Signed-off-by: beorn7 <beorn@soundcloud.com>	2018-04-09 17:48:32 +02:00
Krasi Georgiev	ddd46de6f4	Races/3994 (#4005) Fix race by properly locking access to scrape pools. Use separate mutex for information needed by UI so that UI isn't blocked when targets are being updated.	2018-04-09 15:18:25 +01:00
Mario Trangoni	464e747f1e	fix some comment typos (#4059)	2018-04-08 10:51:54 +01:00
Sneha Inguva	cbfb207cca	vendor: correctly update golang client (#4056)	2018-04-06 18:05:32 +01:00
Tony Lee	7cd56f56df	add queue_time slice to query_duration_seconds (#4050)	2018-04-05 19:56:58 +01:00
Julius Volz	fe10b36b30	Fix curl example for deleting series (#4046)	2018-04-05 13:06:18 +01:00
sev3ryn	cc917aee7f	fix of endless loop while doing Consul service discovery. (#4044) Reloading Prometheus configs didn't make the loop end. It produced a goroutine leak	2018-04-05 10:41:09 +01:00
Philippe Laflamme	2aba238f31	Use common HTTPClientConfig for marathon_sd configuration (#4009) This adds support for basic authentication, which closes #3090. The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`. DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this. Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.	2018-04-05 09:08:18 +01:00
Manos Fokas	25f929b772	Yaml UnmarshalStrict implementation. (#4033) * Updated yaml vendor package. * remove checkOverflow duplicate in rulefmt * remove duplicated HTTPClientConfig.Validate() * Added yaml static check.	2018-04-04 09:07:39 +01:00
Krasi Georgiev	406233e937	Merge pull request #4034 from si74/main_comments main: actor functionality comments	2018-04-03 12:52:15 +03:00
Sneha Inguva	7be846754a	main: actor functionality comments	2018-04-01 11:19:30 -07:00
albatross0	0245fd55bf	Add a machine type label to GCE SD (#4032)	2018-03-31 09:20:19 +01:00
Kristiyan Nikolov	be85ba3842	discovery/ec2: Support filtering instances in discovery (#4011)	2018-03-31 07:51:11 +01:00
Bryan Boreham	93494d8b7e	Add an OpenTracing span for each rule (#4027) * Add an OpenTracing span for each rule So that tags and child spans can be traced back to the rule that they refer to.	2018-03-30 21:29:19 +01:00
Björn Rabenstein	6cf725c56d	Merge pull request #4031 from codesome/fix-bug-from-4025 Fix bug from 4025	2018-03-30 16:41:30 +02:00
Ganesh Vernekar	b44ce11d1b	Added test to check pathPrefix	2018-03-30 11:55:54 +05:30
Ganesh Vernekar	cd2820e165	Fix pathPrefix bug from PR-4025	2018-03-30 11:04:15 +05:30
Solomon Van	68e394a56e	notifier: update use testutil for testing (#3695)	2018-03-29 16:07:26 +01:00
Elif T. Kuş	daebf68ea2	Rewrote tests for relabel and template (#3754) * relabel: use testutil for testing * template: use testutil for testing	2018-03-29 16:02:28 +01:00
Björn Rabenstein	61accb51ac	Merge pull request #4025 from codesome/route-prefix Fixed pathPrefix for web pages	2018-03-29 16:22:54 +02:00
Ganesh Vernekar	f30b37e00b	Fixed pathPrefix for web pages	2018-03-29 18:02:25 +05:30
Fabian Reinartz	184b6e3767	Merge pull request #3968 from zjwzte/fix-magic-number Fix magic number.	2018-03-28 14:09:43 +02:00
Krasi Georgiev	dfd6709a44	update common package (#4015)	2018-03-27 10:21:56 +05:30
Krasi Georgiev	5fec98d0a7	simplify server error handling (#4006)	2018-03-25 10:05:59 +01:00
Corentin Chary	60dafd425c	consul: improve consul service discovery (#3814) * consul: improve consul service discovery Related to #3711 - Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allows filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and node-meta are also used in `/catalog/service` requests. - Do not require a call to the catalog if services are specified by name. This is important because on a large cluster `/catalog/services` changes all the time. - Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more so when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation). - Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of this kind of behavior from consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog on a large cluster would basically change all the time. No need to discover targets in 1sec if we scrape them every minute. - Added plenty of unit tests. 
Benchmarks
----------

```yaml
scrape_configs:
  - job_name: prometheus
    scrape_interval: 60s
    static_configs:
      - targets: ["127.0.0.1:9090"]

  - job_name: "observability-by-tag"
    scrape_interval: "60s"
    metrics_path: "/metrics"
    consul_sd_configs:
      - server: consul.service.par.consul.prod.crto.in:8500
        tag: marathon-user-observability  # Used in After
        refresh_interval: 30s  # Used in After+delay
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: ^(.*,)?marathon-user-observability(,.*)?$
        action: keep

  - job_name: "observability-by-name"
    scrape_interval: "60s"
    metrics_path: "/metrics"
    consul_sd_configs:
      - server: consul.service.par.consul.prod.crto.in:8500
        services:
          - observability-cerebro
          - observability-portal-web

  - job_name: "fake-fake-fake"
    scrape_interval: "15s"
    metrics_path: "/metrics"
    consul_sd_configs:
      - server: consul.service.par.consul.prod.crto.in:8500
        services:
          - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|1236|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|0.03|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|80|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses 100kiB and 23kiB/s per service per job and quite a lot of CPU. It also sends an additional 1Mbps of traffic to consul. 
Being a little bit smarter about this reduces the overhead quite a lot. Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery. * consul: tweak `refresh_interval` behavior `refresh_interval` now does what is advertised in the documentation: there won't be more than one update per `refresh_interval`. It now defaults to 30s (which was also the current waitTime in the consul query). This also makes sure we don't wait another 30s if we already waited 29s in the blocking call, by subtracting the number of elapsed seconds. Hopefully this will do what people expect it to do and will be safer for existing consul infrastructures.	2018-03-23 14:48:43 +00:00


5022 Commits