Portal:Cloud VPS/Admin/Runbooks/Check for VMs leaked by the nova-fullstack test
This alert fires when there is an error in the creation and/or deletion of the nova-fullstack test VM.
To see the logs, you have to log in to the instance that triggered the alert (if you are looking at https://alerts.wikimedia.org, it is shown as a blue tag), for example cloudcontrol1003:
Quick check
ssh cloudcontrol1003.wikimedia.org
There you can check the service status with systemctl:
dcaro@cloudcontrol1003:~$ sudo systemctl status nova-fullstack.service
It might be that the service has re-run since the failure and is now running fine; to see older logs, try journalctl:
dcaro@cloudcontrol1003:~$ sudo journalctl -u nova-fullstack.service -n 10000
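If there is a lot of output, it can help to narrow it down to warnings and errors around the time of the alert (a sketch; adjust the date to the alert):
# only show warning-and-above messages since a given date
sudo journalctl -u nova-fullstack.service --since "2021-04-13" -p warning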
Another possible source of information is the Grafana dashboard.[1]
Debugging service errors
TODO: Add here any service errors you encountered and how you fixed them.
Per-VM debugging
If there's nothing in the logs (sometimes the log has been rotated), you can check the servers for the admin-monitoring project:
dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
| ID                                   | Name                      | Status | Networks                               | Image                                      | Flavor                |
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
| d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929 | BUILD  |                                        | debian-10.0-buster (deprecated 2021-02-22) |                       |
| 33766360-bbbe-4bef-8294-65fca6722e20 | fullstackd-20210415002301 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.230 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| 697ebb69-0394-4e29-82fc-530153a38a1b | fullstackd-20210414162903 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.251 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| 1812e03b-c978-43a5-a07e-6e3e240a9bf0 | fullstackd-20210414123145 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.184 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| d70a111d-6f23-40e1-8c01-846dedb5f2ca | fullstackd-20210414110500 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.117 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| 0b476a1a-a75e-4b56-bf51-8f9d43ec9201 | fullstackd-20210413182752 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.198 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
There we see that there are a few instances stuck from April 14th, one from the 13th, and an instance from January still in BUILD.
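To see when a given instance was created and whether Nova recorded a fault for it, you can also inspect it directly (a sketch; the fault field only shows up for instances that hit an error):
# show the instance details; look at the created, status and (if present) fault fields
sudo wmcs-openstack --os-project-id admin-monitoring server show d603b2e0-7b8b-462f-b74d-c782c2d34fea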
Very old failed instances
Usually the best course of action there is just to delete the server, as any traces will be lost already; in this case fullstackd-20210110160929.
sudo wmcs-openstack --os-project-id admin-monitoring server delete fullstackd-20210110160929
If it replies with:
No server with a name or ID of 'fullstackd-20210110160929' exists.
Then you have a stuck server entry; follow Portal:Cloud VPS/Admin/Troubleshooting#Instance Troubleshooting
In this case, it was an old build request that got lost (quite uncommon): Portal:Cloud VPS/Admin/Troubleshooting#Deleting an orphaned build request
New instances
Now we can dig a bit deeper and check for OpenStack logs containing those instance IDs. Let's pick 697ebb69-0394-4e29-82fc-530153a38a1b
You can now go to Kibana[2] and search for any OpenStack entry that contains that server ID, for example:
https://logstash.wikimedia.org/goto/33a4ec64e36b40d00fc000ca37adb130
Remember to set the time span according to the instance creation date
In this case there doesn't seem to be anything interesting there for that instance.
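If Kibana does not turn anything up, you can also grep for the instance ID in the local nova logs on the cloudcontrol host (a sketch, assuming the logs are written under /var/log/nova/; adjust the path to wherever they live on that host):
# search all local nova logs for the instance id
sudo grep -r 697ebb69-0394-4e29-82fc-530153a38a1b /var/log/nova/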
Going to the new instance
As a last resort, we can try sshing to the instance and running Puppet to see if anything is broken there:
ssh fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
dcaro@fullstackd-20210415002301:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
Info: Applying configuration version '(dd0bf90505) Manuel Arostegui - mariadb: Productionzie db1182'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: Applied catalog in 5.74 seconds
In this case nothing failed.
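If Puppet runs fine but the VM still looks unhealthy, a couple of other quick checks on the instance are whether first-boot provisioning finished and whether its name resolves (a sketch; adapt to whatever the fullstack logs point at):
# did cloud-init finish the first boot without errors?
sudo cloud-init status --long
# does the instance's FQDN resolve?
getent hosts fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud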
Cleanup
As we did not find anything in the logs (some might be lost already) and the service is currently running OK, all that's left to do is to clean up the VMs: delete them, see if the deletion fails, and free up the resources if it doesn't.
dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server delete 33766360-bbbe-4bef-8294-65fca6722e20
Everything worked, so repeat with the other servers.
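If several leaked VMs are left, a small loop saves some typing (a sketch reusing the instance IDs from the listing above; double-check the list before running it):
# delete the remaining leaked fullstack test VMs one by one
for id in 697ebb69-0394-4e29-82fc-530153a38a1b \
          1812e03b-c978-43a5-a07e-6e3e240a9bf0 \
          d70a111d-6f23-40e1-8c01-846dedb5f2ca \
          0b476a1a-a75e-4b56-bf51-8f9d43ec9201; do
    sudo wmcs-openstack --os-project-id admin-monitoring server delete "$id"
done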
More info on nova-fullstack
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack
- https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack