Portal:Cloud VPS/Admin/Runbooks/Check for VMs leaked by the nova-fullstack test

This happens when there's an error in the creation and/or deletion of the nova-fullstack VM.
To see the logs, you have to log in to the instance that triggered the alert (if you are looking at https://alerts.wikimedia.org, it is shown as a blue tag). For example <code>cloudcontrol1003</code>:




= Quick check =
   ssh cloudcontrol1003.wikimedia.org


There you can check the systemctl status:

   dcaro@cloudcontrol1003:~$ sudo systemctl status nova-fullstack.service

It might be that the service has re-triggered since then and is now running; to see older logs, try journalctl:
   dcaro@cloudcontrol1003:~$ sudo journalctl -u nova-fullstack.service -n 10000
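
If the failing run has already scrolled out of those last lines, narrowing by time and filtering for error-looking output helps; a minimal sketch (adjust the time window to when the alert fired):

   # only recent nova-fullstack output, keeping lines that look like failures
   sudo journalctl -u nova-fullstack.service --since "2 days ago" | grep -iE 'error|fail|leak'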


Another possible source of information is the Grafana dashboard.<ref>[https://grafana-rw.wikimedia.org/d/ebJoA6VWz/nova-fullstack?orgId=1&from=now-7d&to=now Nova fullstack Grafana dashboard]</ref>

= Debugging service errors =


TODO: Add here any service errors you encountered and how you fixed them.


= Per-VM debugging =
If there's nothing in the logs (sometimes the log has been rotated), you can check the servers for the <code>admin-monitoring</code> project:
   dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list
   +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
   | ID                                   | Name                      | Status | Networks                               | Image                                      | Flavor                |
   +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
   | d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929 | BUILD  |                                        | debian-10.0-buster (deprecated 2021-02-22) |                       |
   | 33766360-bbbe-4bef-8294-65fca6722e20 | fullstackd-20210415002301 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.230 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
   | 697ebb69-0394-4e29-82fc-530153a38a1b | fullstackd-20210414162903 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.251 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
   | 1812e03b-c978-43a5-a07e-6e3e240a9bf0 | fullstackd-20210414123145 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.184 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
   | d70a111d-6f23-40e1-8c01-846dedb5f2ca | fullstackd-20210414110500 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.117 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
   | 0b476a1a-a75e-4b56-bf51-8f9d43ec9201 | fullstackd-20210413182752 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.198 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
   +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
There we see that there are a few instances stuck from April 14th, one from the 13th, and an instance stuck in BUILD since January.
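
If you prefer a machine-readable listing (for example to feed a cleanup loop later), the standard openstackclient output options are assumed to pass through the wrapper; a sketch:

   # plain ID/name/status columns, easier to eyeball or script against
   sudo wmcs-openstack --os-project-id admin-monitoring server list -f value -c ID -c Name -c Status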


== Very old failed instances ==
Usually the best course of action there is just to delete the server, as any traces will already be lost; in this case <code>fullstackd-20210110160929</code>.


   sudo wmcs-openstack --os-project-id admin-monitoring server delete fullstackd-20210110160929

If it replies with:

   No server with a name or ID of 'fullstackd-20210110160929' exists.


Then you have a stuck server entry; '''follow this:''' [[Portal:Cloud VPS/Admin/Troubleshooting#Instance Troubleshooting]]


In this case, it was an old build request that got lost (quite uncommon): [[Portal:Cloud VPS/Admin/Troubleshooting#Deleting an orphaned build request]]
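
If you are unsure whether an instance really is an old leftover, <code>server show</code> prints its creation time and status; a sketch, assuming the wrapper passes standard openstackclient arguments through:

   # check when the suspect instance was created and what state it is in
   sudo wmcs-openstack --os-project-id admin-monitoring server show fullstackd-20210110160929 -c id -c status -c created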
 
== New instances ==
Now we can dig a bit deeper and check the openstack logs for those instance IDs; let's pick <code>697ebb69-0394-4e29-82fc-530153a38a1b</code>.
You can now go to Kibana<ref>[https://logstash.wikimedia.org Wikimedia Kibana]</ref> and search for any openstack entry that has that server ID in it, for example:
 
https://logstash.wikimedia.org/goto/a481f274764fa31a021e8bbff319a26f


'''Remember to set the time span according to the instance creation date.'''


In this case there doesn't seem to be anything interesting there for that instance.
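
If Kibana no longer has logs for that time span, grepping the OpenStack service logs on the cloudcontrol hosts directly is another option (the path below is an assumption; adjust it to wherever the services log on your hosts):

   # hypothetical log location: list which nova logs mention the instance id
   sudo grep -l 697ebb69-0394-4e29-82fc-530153a38a1b /var/log/nova/*.log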


=== Going to the new instance ===
As a last resort, we can try sshing to the instance and running puppet, to see if anything is broken there:
   ssh fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
   dcaro@fullstackd-20210415002301:~$ sudo run-puppet-agent
    Info: Using configured environment 'production'
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Retrieving locales
    Info: Loading facts
    Info: Caching catalog for fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
    Info: Applying configuration version '(dd0bf90505) Manuel Arostegui - mariadb: Productionzie db1182'
    Notice: The LDAP client stack for this host is: sssd/sudo
    Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
    Notice: Applied catalog in 5.74 seconds
In this case nothing failed.
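
If ssh to the instance does not work at all, you can still pull its console output from a cloudcontrol host; <code>console log show</code> is standard openstackclient and is assumed to work through the wrapper:

   # dump the instance's serial console, useful when the VM never became reachable
   sudo wmcs-openstack --os-project-id admin-monitoring console log show fullstackd-20210415002301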


=== Cleanup ===
As we did not find anything in the logs (some might already be lost), and the service is currently running fine, all that's left to do is to clean up the VMs, see if deleting them fails, and free up the resources if not.
   dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server delete 33766360-bbbe-4bef-8294-65fca6722e20

Everything worked, so repeat with the other servers.
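
If several leaked VMs are left, a small loop saves some typing; a sketch, double-check the IDs first (these are the remaining ones from the listing above):

   # hypothetical helper loop: paste in the IDs of the leaked fullstackd VMs
   for id in 697ebb69-0394-4e29-82fc-530153a38a1b 1812e03b-c978-43a5-a07e-6e3e240a9bf0 \
             d70a111d-6f23-40e1-8c01-846dedb5f2ca 0b476a1a-a75e-4b56-bf51-8f9d43ec9201; do
       sudo wmcs-openstack --os-project-id admin-monitoring server delete "$id"
   done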




 
= More info on nova-fullstack =
* https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack
* https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack
= References =
<references />
