Portal:Cloud VPS/Admin/notes/Monitoring

Monitoring Discussion. This meeting was held on 2020-04-11.

Scope

  • Opened by discussing which metrics we need, and whether metrics alone are enough
  • jeh: prometheus node-exporter is already installed on all VMs, and it reports the same data we currently get from shinken
  • andrew: leveraging prod architecture, pros and cons
  • arturo: we may not want to follow prod too closely
  • brooke: retention. Can we decide in prometheus which data to retain for longer or shorter periods?
  • jeh: we can use a time-based or size-based storage retention policy
  • arturo: retention is directly related to storage capacity

modules/prometheus/manifests/server.pp: $storage_retention = '730h', <--- default retention in the prometheus puppet module, which is what tools-prometheus uses (about 1 month)
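
Since retention is tied to storage capacity, a rough sizing estimate can help when picking a value. The sketch below applies the rule of thumb from the Prometheus storage documentation (disk space ≈ retention in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The function names and the fleet numbers are hypothetical and were not discussed in the meeting.

```python
# Rough disk sizing for a Prometheus retention window.
# Assumption: ~2 bytes per sample (upper end of the documented 1-2 byte range).

# Hours per unit for the Prometheus duration suffixes handled here.
_UNIT_HOURS = {"h": 1, "d": 24, "w": 24 * 7, "y": 24 * 365}


def parse_retention_hours(value: str) -> float:
    """Convert a single-unit duration such as '730h' or '30d' to hours."""
    number, unit = value[:-1], value[-1]
    if unit not in _UNIT_HOURS:
        raise ValueError(f"unsupported duration unit in {value!r}")
    return float(number) * _UNIT_HOURS[unit]


def estimate_disk_bytes(retention: str, samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Estimate on-disk size: retention seconds * samples/s * bytes/sample."""
    retention_seconds = parse_retention_hours(retention) * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 500 VMs, 1000 series each, scraped every 60 seconds.
    rate = 500 * 1000 / 60
    size = estimate_disk_bytes("730h", rate)
    print(f"730h is about {parse_retention_hours('730h') / 24:.1f} days")
    print(f"estimated disk usage: {size / 1e9:.1f} GB")
```

With these made-up numbers the estimate comes out at roughly 44 GB for the default 730h window. A size-based policy would cap disk usage directly instead of deriving it from time.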

  • bd808: multitenancy, a prometheus instance per project
  • andrew: does prometheus even support multitenancy?
  • brooke: to some extent, yes, by using labels
  • andrew: security concerns with multitenancy? or only organizational concerns?
  • brooke: not today, but we need to keep security in mind
  • bd808: log aggregation is scarier
  • andrew: central prometheus server vs per project prometheus server
  • jeh: network scoping, security groups, etc
  • arturo: prometheus proxy
  • jeh: push gateway from prometheus server: not a good fit for dynamic environments where VMs are constantly created and destroyed
  • brooke:
    • scope1: immediate need to shut down shinken. We can shut it down today and not lose much
    • scope2: centralized & multi-tenant service
  • andrew: imagine a cloud project admin wanting a simple grafana dashboard with prometheus metrics.
  • jeh: a local prometheus server allows for custom, per-project alerts. And then a central grafana
  • brooke: we apparently are leaning towards prometheus
  • arturo: replacing shinken with prometheus+alertmanager could be a good experiment before introducing any cloud-wide solution
  • brooke: alertmanager outgoing alerts? smtp server?
  • jeh: yes, email + [..] How do we do it with shinken today?
  • brooke: let's make a task to replace shinken with prometheus+alertmanager. Alert: only for us for now.
  • brooke: Jason, would you be willing to handle the initial shinken replacement task?
  • jeh: sure. What about security groups?
  • andrew: probably update every security group out there. Few projects use shinken. Initial change only in the tools project.
  • andrew: share info with krenair
  • jeh: we already have a prometheus server in the tools project. Would it make sense to just extend it with alertmanager?
  • andrew: yeah, why not.
  • arturo: make this an OKR so jason gets proper credit
  • brooke: maybe find or create an objective that ties everything together. Also, the epic phab task: https://phabricator.wikimedia.org/T194333
  • jeh: what do we call it? Use prometheus openstack integration to auto discover VMs (node exporter, puppet alerts on day 1, expand from there).
  • jeh: initial notifications by email + IRC bots?
  • brooke: legit!
  • brooke: next step, replace toolschecker, because shinken couldn't generate pages.
  • andrew: prometheus in english means forethinker
  • arturo: what about monitoring-infra
  • jeh: create new openstack project, add new prometheus server, update existing server groups (and new project template), configure prometheus openstack-sd-config to scrape VMs, configure alertmanager to email wmcs-team and notify cloud-feed IRC
  • brooke: metrics-infra is shorter!
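
The auto-discovery idea from the notes (openstack service discovery, node exporter on every VM, a label per project) could look roughly like the scrape job below. This is a minimal sketch, not the configuration that was deployed: the endpoint, credentials and region passed in are placeholders, and the function name is hypothetical. The keys under `openstack_sd_configs` and the `__meta_openstack_*` labels come from the Prometheus configuration documentation; 9100 is the default node exporter port. The output is JSON, which is also valid YAML.

```python
import json

NODE_EXPORTER_PORT = 9100  # default node exporter listen port


def build_scrape_config(region: str, identity_endpoint: str, username: str,
                        password: str, project_name: str,
                        domain_name: str = "default") -> dict:
    """Build one Prometheus scrape job that discovers VMs through OpenStack."""
    return {
        "job_name": "node",
        "openstack_sd_configs": [{
            "role": "instance",
            "region": region,
            "identity_endpoint": identity_endpoint,
            "username": username,
            "password": password,
            "project_name": project_name,
            "domain_name": domain_name,
            "port": NODE_EXPORTER_PORT,
            # Discover instances in every project (needs an admin-like role).
            "all_tenants": True,
        }],
        "relabel_configs": [
            # Only scrape VMs that are currently running.
            {"source_labels": ["__meta_openstack_instance_status"],
             "regex": "ACTIVE", "action": "keep"},
            # Use the VM name rather than ip:port as the instance label.
            {"source_labels": ["__meta_openstack_instance_name"],
             "target_label": "instance"},
            # Tag every series with its project, for per-project dashboards.
            {"source_labels": ["__meta_openstack_project_id"],
             "target_label": "project"},
        ],
    }


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or credentials.
    job = build_scrape_config(
        region="example-region",
        identity_endpoint="https://keystone.example.org:5000/v3",
        username="observer",
        password="CHANGE_ME",
        project_name="metrics-infra",
    )
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```

Because targets come from the OpenStack API, VMs that are created or destroyed are picked up on the next refresh without touching the config, and the `project` label gives a simple form of label-based multitenancy in a central grafana.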