Portal:Cloud VPS/Admin/Alerts

Alerts that can be sent to the WMCS team (or, as of now, to WMCS-bots):

Nova

  • Nova-Fullstack (labnet) - launches a "full" test of instance creation
  • nova-network (labnet) - handles dynamic NAT and the network gateway
  • nova-api (labnet) - main API gateway for interacting with nova (creation, deletion, etc.)
  • nova-scheduler (labcontrol) - schedules and launches instances
  • nova-compute - handles setup and teardown of instances on the hypervisor
  • nova-conductor - DB broker for all nova components other than nova-api

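A rough first triage for any of the Nova alerts above (a sketch only; the hostnames are illustrative, the unit names assume the stock Debian/Ubuntu packaging, and the openstack CLI assumes admin credentials are already sourced):

labnet1001:~$ sudo systemctl status nova-api nova-network
labcontrol1001:~$ sudo systemctl status nova-scheduler nova-conductor
labcontrol1001:~$ sudo journalctl -u nova-conductor --since '30 min ago'
labcontrol1001:~$ sudo openstack compute service list
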
Glance

  • glance-api-http (control) - image management for instances

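If glance-api-http fires, a quick sanity check (hostname illustrative; assumes admin credentials for the openstack CLI) is to confirm the daemon is up and that the API answers an image listing:

labcontrol1001:~$ sudo systemctl status glance-api
labcontrol1001:~$ sudo openstack image list
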
Keystone

  • projects and users
    • check-novaobserver-membership - makes sure 'novaobserver' has the 'observer' role everywhere
    • check-novaadmin-membership - makes sure 'novaadmin' has the 'projectadmin' and 'user' roles everywhere
    • check-keystone-projects - verifies the service projects
  • services
    • keystone-http-${auth_port} - admin API port availability (little context)
    • keystone-http-${public_port} - public API port availability (little context)

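The keystone-http-* checks simply probe the API ports (the actual values come from hiera; 35357 and 5000 below are the upstream Keystone defaults and are an assumption here), and the membership checks can be reproduced with a role assignment listing:

labcontrol1001:~$ curl -s http://localhost:35357/v3/ | head
labcontrol1001:~$ curl -s http://localhost:5000/v3/ | head
labcontrol1001:~$ sudo openstack role assignment list --user novaobserver --names | head
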
Designate

  • check_designate_api_process - service API for DNS changes
  • designate-api-http - external monitoring of the API
  • check_designate_sink_process
  • check_designate_central_process
  • check_designate_mdns
  • check_designate_pool-manager

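The check_designate_* process checks correspond to systemd units of the same names (unit names below assume the stock Debian packaging; hostname illustrative), and the API can be poked via the DNS plugin of the openstack CLI:

labcontrol1001:~$ sudo systemctl status designate-api designate-central designate-sink designate-mdns designate-pool-manager
labcontrol1001:~$ sudo openstack zone list | head
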
Labstore

  • nfsd-exports - sets up /etc/export.d/ files for instances in the cloud
  • interfaces - network interface saturation, in/out
  • ldap - there is a scheme to use LDAP for groups without having the entire system be an LDAP client.
  • secondary - checks specific to the 'secondary' Toolforge DRBD/NFSd cluster

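Useful first commands on the labstore side (hostname illustrative; the nfs-exportd unit name is an assumption) are to confirm the export generator is running, look at what is actually exported, and glance at the interface counters:

labstore1004:~$ sudo systemctl status nfs-exportd
labstore1004:~$ sudo exportfs -v | head
labstore1004:~$ ip -s link
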
Toolforge

These checks are defined in modules/icinga/manifests/monitor/toollabs.pp in the puppet repository.

  • tools-proxy - reverse proxy for all web tools
  • tools-checker-self - reverse proxy for the actual check runner. This is currently how Toolforge is monitored from production Icinga.
  • tools-checker-ldap - without LDAP, Toolforge crumbles.
  • tools-checker-labs-dns-private - verify resolution for internal DNS from within Toolforge
  • tools-checker-nfs-home - NFS /home test (this is really a subpath of the one export covering project and home)
  • tools-checker-grid-start-trusty - starting and running a process on the grid
  • tools-checker-etcd-flannel - etcd is the backend for flannel which is our networking overlay for k8s
  • tools-checker-etcd-k8s - etcd is the persistent data store for k8s itself
  • tools-checker-k8s-node-ready - check to see if k8s thinks the worker nodes are healthy

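Most of the tools-checker-* alerts are HTTP checks against the toolschecker service and can be reproduced by hand with curl (the checker URL and endpoint path below are assumptions based on the check names); the node-ready check can also be cross-checked from a Kubernetes master:

tools-bastion-03:~$ curl -sv http://checker.tools.wmflabs.org/self
tools-k8s-master-01:~$ kubectl get nodes
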
Ceph

Ceph Cluster Health

Global Ceph cluster health state.

Icinga status check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Ceph+Cluster+Health

Grafana dashboard: https://grafana.wikimedia.org/dashboard/db/cloudvps-ceph-cluster

Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/

State:

  • 0 - Healthy
  • 1 - Unhealthy (The cluster is currently degraded, but there should be no interruption in service.)
  • 2 - Critical (The cluster is in a critical state; it is very likely that there are non-functioning services or inaccessible data.)

Next steps: Connect to one of the Ceph mon hosts and identify the cause:

cloudcephmon1001:~$ sudo ceph health detail
HEALTH_OK
cloudcephmon1001:~$ sudo ceph -s
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK

 services:
   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 6d)
   mgr: cloudcephmon1002(active, since 6w), standbys: cloudcephmon1003, cloudcephmon1001
   osd: 24 osds: 24 up (since 6d), 24 in (since 6d)

 data:
   pools:   1 pools, 256 pgs
   objects: 5.46k objects, 21 GiB
   usage:   87 GiB used, 42 TiB / 42 TiB avail
   pgs:     256 active+clean

 io:
   client:   18 KiB/s wr, 0 op/s rd, 3 op/s wr

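If the cluster is not HEALTH_OK, the usual follow-up (a sketch of common Ceph commands, run from any mon host; the state filter on osd tree assumes a recent Ceph release) is to narrow the problem down to specific OSDs or placement groups:

cloudcephmon1001:~$ sudo ceph osd tree down
cloudcephmon1001:~$ sudo ceph pg dump_stuck
cloudcephmon1001:~$ sudo ceph osd df
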
Docs

  • https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Infrastructure
  • https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin