Juniper router upgrade: Difference between revisions

imported>Ayounsi
imported>Cathal Mooney
Line 46: Line 46:
#* <code>sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'</code>
#* This needs to match the Icinga "hosts", <code>cr3-ulsfo</code> will match in AlertManager as well.
#* NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
# Double check site has been fully drained of traffic before proceeding:
#* Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs

Revision as of 08:41, 12 September 2022

Known issues

Preparation

  1. Download the proper image to apt1001:/srv/junos/

All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>

  1. Make room for the image
    • request system storage cleanup
    • If multi-RE, clean up files on backup RE: request system storage cleanup re1
  2. Save rescue config (just in case)
    • request system configuration rescue save
  3. Copy image
  4. Check checksum
    • file checksum md5 /var/tmp/$filename.tgz
    • Compare with checksum on Juniper's website
  5. Validate new image against existing config
    • request vmhost software validate /var/tmp/$filename.tgz
    • On Junos >=18.4 (after the upgrade to 21+)

Upgrade

  1. Check that the console port(s) are working
  2. Depool site (optional)
  3. Drain traffic away from router
    1. Apply GRACEFUL_SHUTDOWN - https://phabricator.wikimedia.org/T211728
    2. Disable the peers
      • deactivate protocols bgp group Transit4
      • deactivate protocols bgp group Transit6
      • deactivate protocols bgp group IX4
      • deactivate protocols bgp group IX6
      • Adjust OSPF metrics
      • If eqiad/codfw, drain the pfw3 link by increasing the MED value on both sides
  4. Ensure router is not VRRP master
    • show vrrp summary
    • set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70    
    • set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
  5. Downtime host in Icinga and AlertManager
    • sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
    • This needs to match the Icinga "hosts"; cr3-ulsfo will match in AlertManager as well.
    • NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
  6. Double check site has been fully drained of traffic before proceeding:
    • Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
  7. Disable BGP sessions to LVS/PyBal load-balancers
    • deactivate protocols bgp group PyBal

If Multi RE:

  1. Remove graceful-switchover
    • deactivate chassis redundancy graceful-switchover
    • request system configuration rescue save (to ensure graceful-switchover is not in the rescue config)
  2. Install image on backup RE
    • request vmhost software add /var/tmp/$filename.tgz re1
    • This needs no-validate when upgrading to 21+
  3. Reboot RE1
    • request vmhost reboot re1
  4. Once back up (show chassis routing-engine), perform RE switchover (impactful)
    • request chassis routing-engine master switch
  5. Once done, repeat previous 3 steps for re0
  6. Roll back "Remove graceful-switchover"
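
Because step 5 asks for the previous three steps to be repeated for re0, it can help to write the whole multi-RE sequence out explicitly before starting. The sketch below expands the steps above into an ordered list of command strings. It is an illustration under assumptions: the function name is hypothetical, the re0 forms of the commands are obtained by substituting re0 for re1, the placement of no-validate at the end of the install command is assumed, and the final `activate` statement is assumed as the rollback of the initial `deactivate`.

```python
def multi_re_upgrade_steps(image_path: str, no_validate: bool = False) -> list[str]:
    """Return the multi-RE upgrade sequence as an ordered list of commands.

    The first and last chassis redundancy statements are configuration-mode
    statements; all other entries are operational-mode commands.
    """
    suffix = " no-validate" if no_validate else ""  # assumed option placement
    steps = [
        "deactivate chassis redundancy graceful-switchover",
        "request system configuration rescue save",
    ]
    # Upgrade the backup RE first (re1), switch mastership, then repeat
    # for re0, which is the backup after the first switchover.
    for re_name in ("re1", "re0"):
        steps += [
            f"request vmhost software add {image_path} {re_name}{suffix}",
            f"request vmhost reboot {re_name}",
            # Only once `show chassis routing-engine` shows the RE back up.
            "request chassis routing-engine master switch",
        ]
    steps.append("activate chassis redundancy graceful-switchover")
    return steps


if __name__ == "__main__":
    # The image filename is a placeholder, as in the steps above.
    for number, command in enumerate(
        multi_re_upgrade_steps("/var/tmp/$filename.tgz", no_validate=True), start=1
    ):
        print(f"{number}. {command}")
```

The list is meant as a checklist to review, not as something to paste in one go: each reboot and switchover needs a manual check before moving on.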

If single RE:

  1. Install image on RE
    • request vmhost software add /var/tmp/$filename.tgz
    • This needs no-validate when upgrading to 21+
  2. Reboot router
    • request vmhost reboot

Both single and dual RE:

  1. Check if router is healthy
    • show log messages | last
    • show system alarms
    • show ospf interface and show ospf3 interface
    • show bgp summary
    • All green in Icinga and LibreNMS

Cleanup

    • request system storage cleanup
      • If multi-RE, clean up files on backup RE: request system storage cleanup re1
  1. Remove Icinga and LibreNMS downtimes
  2. Roll back "Drain traffic away from router"
  3. Roll back VRRP change if any
  4. Save rescue config (just in case)
    • request system configuration rescue save
  5. On vmhost devices, save the disk snapshot to the backup partition
    • Single RE devices: request vmhost snapshot
    • Dual RE devices: request vmhost snapshot routing-engine both