You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Portal:Cloud VPS/Admin/Runbooks/Check for VMs leaked by the nova-fullstack test: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>David Caro
imported>MABot
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
{{Remark|The procedures in this runbook require admin permissions to complete.|reminder}}
#REDIRECT [[Portal:Cloud VPS/Admin/Runbooks/NovafullstackSustainedLeakedVMs]]
 
= Error / Incident =
Usually an email/icinga alert with the subject <code>** PROBLEM alert - <hostname>/Check for VMs leaked by the nova-fullstack test is CRITICAL **</code>
 
This happens when there's an error in the creation and/or deletion of the nova-fullstack VM.
To see the logs, you have to login to the instance that triggered the alert (if you are looking at the https://alerts.wikimedia.org, it is shown as a blue tag). For example <code>cloudcontrol1003</code>:
 
= Debugging =
== Quick check ==
  ssh cloudcontrol1003.wikimedia.org
 
There you can check the systemctl status:
  user@cloudcontrol1003:~$ sudo systemctl status nova-fullstack.service
 
It might be that the service re-triggered since then and now it's running.  Logs are in Kibana:  https://logstash.wikimedia.org/app/dashboards#/view/13149940-ff6c-11eb-85b7-9d1831ce7631
 
Quick way to check the last error (now that we have structured errors):
  user@cloudcontrol1003:~$ echo "$(sudo journalctl -n 5000 -u  nova-fullstack.service | grep '"log.level": "ERROR"' | tail -n 1 | sed -e 's/^.*@cee: //')" | jq '.'
 
View the list of active VMs
  user@cloudcontrol1003:~$ sudo wmcs-openstack server list --os-project-id admin-monitoring
 
 
 
Another possible source of information is the Grafana dasboard.<ref>[https://grafana-rw.wikimedia.org/d/ebJoA6VWz/nova-fullstack?orgId=1&from=now-7d&to=now Nova fullstack Grafana dashboard]</ref>
 
== Debugging service errors ==
 
 
=== DNS entry not being created ===
 
This is related to the designate service that is running (currently) on <code>cloudservices1003.wikimedia.org</code>, to check it's status ssh and run:
 
root@cloudservices1003:~# journalctl -u designate-sink.service  -f
 
 
One common issue is designate-sink getting stuck, the reason is not known yet (see {{PhabT|309929}}), but the symptom is that it stopped logging anything, so if there are no logs newer than 2h, it's probably this, the workaround is to restart the service:
 
root@cloudservices1003:~# systemctl restart designate-sink.service
 
 
It will take a couple minutes to start up, but after you should start seeing entries like:
 
Jun 04 20:42:02 cloudservices1003 designate-sink[3363330]: 2022-06-04 20:42:02.192 3363330 WARNING nova_fixed_multi.base [req-6caefd0f-9620-49f4-94b3-53710ce2b128 - - - - -] Deleting forward record 0ddc6e2d-10db-4246-9a43-3fe81ff1fa20 in recordset becbf645-db72-4dd3-b704-1a1cd380e85c
 
 
=== SSH timeout to the machine ===
 
This can have many reasons, one common one is that the machine failed to run cloud-init properly.
 
To check it up, connect to the console of the machine, that is, check the host that VM is running on (on horizon, or run):
 
user@cloudcontrol1003:~$ sudo wmcs-openstack server show --os-project-id admin-monitoring afe478d0-d950-467f-abd0-6f3a41e50552
user@cloudcontrol1003:~$ sudo wmcs-openstack server show --os-project-id admin-monitoring afe478d0-d950-467f-abd0-6f3a41e50552 | grep -E '(OS-EXT-SRV-ATTR:host|instance_name)'
| OS-EXT-SRV-ATTR:host                | cloudvirt1045
| OS-EXT-SRV-ATTR:instance_name      | i-000626d2 
 
Ssh to that host, and run:
 
user@cloudvirt1045:~$ sudo virsh
Welcome to virsh, the virtualization interactive terminal.
 
Type:  'help' for help with commands
      'quit' to quit
virsh # console i-000626f9
Connected to domain 'i-000626f9'
Escape character is ^] (Ctrl + ])
 
With the <code>instance_name</code> you got before, and that will connect you to the console (to exit hit <code>CTRL+]</code>), from there, check the cloud-init log:
 
root@buildvm-84424939-7022-4ea6-9e97-0ef7d3df1af0:~# cat /var/log/cloud-init.log
 
==== No active metadata service found ====
 
If you see this in the <code>cloud-init.log</code> file:
 
2022-06-03 23:46:57,664 - util.py[WARNING]: No active metadata service found
2022-06-03 23:46:57,665 - util.py[DEBUG]: No active metadata service found
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 145, in _get_data
    results = self._crawl_metadata()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 181, in _crawl_metadata
    raise sources.InvalidMetaDataException(
cloudinit.sources.InvalidMetaDataException: No active metadata service found
 
Then the VM was unable to reach the metadata service to get the <code>vendor-data</code> with the custom settings to initialize it.
 
A common problem is the <code>nova-api-metadata</code> not working properly on the <code>cloudcontrol</code> nodes, you can check it out and try restarting it (workaround):
 
root@cloudcontrol1003:~# systemctl restart nova-api-metadata.service
 
This might help for a while, the underlying issue is still being looked into (see {{PhabT|309930}}).
 
== Per-VM debugging ==
If there's nothing in the logs (sometimes the log has been rotated), you can check the servers for the <code>admin-monitoring</code> project:
  dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list
  +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
  | ID                                  | Name                      | Status | Networks                              | Image                                      | Flavor                |
  +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
  | d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929 | BUILD  |                                        | debian-10.0-buster (deprecated 2021-02-22) |                      |
  | 33766360-bbbe-4bef-8294-65fca6722e20 | fullstackd-20210415002301 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.230 | debian-10.0-buster                        | g3.cores1.ram2.disk20 |
  | 697ebb69-0394-4e29-82fc-530153a38a1b | fullstackd-20210414162903 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.251 | debian-10.0-buster                        | g3.cores1.ram2.disk20 |
  | 1812e03b-c978-43a5-a07e-6e3e240a9bf0 | fullstackd-20210414123145 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.184 | debian-10.0-buster                        | g3.cores1.ram2.disk20 |
  | d70a111d-6f23-40e1-8c01-846dedb5f2ca | fullstackd-20210414110500 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.117 | debian-10.0-buster                        | g3.cores1.ram2.disk20 |
  | 0b476a1a-a75e-4b56-bf51-8f9d43ec9201 | fullstackd-20210413182752 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.198 | debian-10.0-buster                        | g3.cores1.ram2.disk20 |
  +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
 
There we see that there's a few instances stuck from April 14th, and one from the 13th, and a building instance from January.
 
=== Very old failed instances ===
Usually the best of action there is just to delete the server, as any traces will be lost already, in this case <code>fullstackd-20210110160929</code>.
 
  sudo wmcs-openstack --os-project-id admin-monitoring server delete fullstackd-20210110160929
 
If it replies with:
  No server with a name or ID of 'fullstackd-20210110160929' exists.
 
Then you have a stuck server entry, '''follow this:''' [[Portal:Cloud VPS/Admin/Troubleshooting#Instance Troubleshooting]]
 
In this case, it was an old build request that got lost (quite uncommon): [[Portal:Cloud VPS/Admin/Troubleshooting#Deleting an orphaned build request]]
 
=== Steady, intermittent failures ===
 
Sometimes the fullstack test will exhibit a slow leak, with one failed VM every few hours or every few days. That suggests an intermittent failure. Here are some things that have caused that in the past:
 
* Race conditions in the instance startup script, usually related to getting clean puppet runs (puppet/modules/openstack/templates/nova/vendordata.txt.erb).  This usually shows up in the cloud-init logs on the failed instances, /var/log/cloud-init*.
 
* Partial outage of OpenStack services, resulting either from one service on one cloudcontrol being out, or an haproxy configuration error. First place to look for that is in /var/log/haproxy/haproxy.log
 
* Bad tokens in a service cache, causing intermittent auth failures between openstack services. To clear the caches, restart memcached on all cloudcontrols.
 
* Bad [https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Rabbitmq Rabbitmq] behavior
 
=== New instances ===
Now we can dig a bit deeper, and check for openstack logs containing those instance ids, let's pick <code>697ebb69-0394-4e29-82fc-530153a38a1b</code>
You can now go to kibana<ref>[https://logstash.wikimedia.org Wikimedia Kibana]</ref> and search for any openstack entry that has that server id in it, for example:
 
https://logstash.wikimedia.org/goto/d1dd6f5c0b090d2c903b88c7eac734ab
 
'''Remember to set the time span accordingly to the instance creation date'''
 
In this case there seems not to be anything interesting there for that instance.
 
===== Going to the new instance =====
As a last resort, we can try sshing to the instance and running puppet, see if anything is broken there:
 
  ssh fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
 
====== Check Cloud-Init / Logs ======
  $ cat /var/log/cloud-init.log | grep error
 
====== Check Puppet ======
  dcaro@fullstackd-20210415002301:~$ sudo run-puppet-agent
  Info: Using configured environment 'production'
  Info: Retrieving pluginfacts
  Info: Retrieving plugin
  Info: Retrieving locales
  Info: Loading facts
  Info: Caching catalog for fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
  Info: Applying configuration version '(dd0bf90505) Manuel Arostegui - mariadb: Productionzie db1182'
  Notice: The LDAP client stack for this host is: sssd/sudo
  Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
  Notice: Applied catalog in 5.74 seconds
 
In this case nothing failed.
 
==== Cleanup ====
As we did not find anything in the logs (some might be lost already), and the service is running ok currently, all that's left to do is to cleanup the VMs, see if deleting them fails, and freeing up the resources if not.
  useer@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list  -f value -c ID | xargs -I{} sudo wmcs-openstack --os-project-id admin-monitoring server delete {}
 
= Support contacts =
Usually anyone in the [[Help:Cloud_Services_Introduction|WMCS]] team should be able to help/debug the issue, the main subject matter expert would be [[User:Andrew_Bogott|Andrew Bogott]].
 
= Related information =
* https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack
* https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack
 
== References ==
<references />
 
[[Category:Runbooks]]

Latest revision as of 04:01, 15 July 2022