You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Auto schema

From Wikitech-static
Revision as of 14:57, 10 January 2022 by imported>Ladsgroup (+ operations/software/schema-changes/)
Jump to navigation Jump to search

This is the documentation on auto_schema. A set of python scripts to reduce the toil of schema changes for DBAs. This tool is a library rather than script to run. You should utilize it to your needs and can use it for more than schema changes (for example automated complex fixes across the fleet, an example below).

You must get your schema patches submitted to operations/software/schema-changes/ and get it reviewed before running in production. You can find lots of examples of schema changes files as well.

Schema changes

An example of schema change file would look like something like this (and need to be added in root of auto_schema)

from auto_schema.schema_change import SchemaChange

section = 's1'
downtime_hours = 4
should_downtime = True
ticket = 'T297189'

# Don't add set session sql_log_bin=0;
command = """ALTER TABLE flaggedtemplates DROP PRIMARY KEY, ADD PRIMARY KEY (ft_rev_id, ft_tmp_rev_id), DROP COLUMN ft_title, DROP COLUMN ft_namespace;"""

# Set this to false if you don't want to run on all dbs
# In that case, you have to specify the db in the command.
all_dbs = True

# DO NOT FORGET to set the right port if it's not 3306
# Use None instead of [] to get all pooled replicas
replicas = None

# Should return true if schema change is applied
def check(db):
    if 'flaggedtemplates' not in db.run_sql('show tables;'):
        return True
    return 'ft_title' not in db.run_sql('desc flaggedtemplates;')


schema_change = SchemaChange(
    replicas=replicas,
    section=section,
    all_dbs=all_dbs,
    check=check,
    command=command,
    ticket=ticket,
    downtime_hours=downtime_hours
)
schema_change.run()

Explanation of each variable:

  • section is the name of the section. Make sure the section is actually the section of replicas
  • downtime_hours how many hours it should downtime the host. Keep it in mind that if there are several layers of replication, it will take multiple times for replication to catch up. As a rule of thumb put three times more than the time it would take to run the alter db.
    • Set it to 0 to disable downtiming.
  • ticket is the ticket it will be used in SAL messages and cookbook runs.
  • command is the alter table command. Don't set the database if you set the "all_dbs" to True, otherwise you must set the database to avoid the schema change from failing.
    • Note: DO NOT add set session sql_log_bin=0; to the command. The script automatically adds it when needed.
  • all_dbs is boolean value to whether should try to run it on each db of the section or one db. For example, if you need to run it only on centralauth, set it to false but if you need to run it on all wikis, set it to True. If you need to run it on some of wikis (e.g. wikis that have flaggedrevs enabled), just set it to run on all wikis and handle that part in "check" function
  • replicas is either None or a list of replicas. Some example values:
    • None: all direct replicas of master of that section in active dc. so for s1 and eqiad being the active dc, you will get all hosts that get directly replicated from master of s1 in eqiad (including master of s1 in codfw, pooled replicas of s1 in eqiad, backup sources of s1, analytics db of s1)
    • ['db1135', 'db1163'] runs the schema change on the two hosts on port 3306
    • ['db1140:3311', 'db1135'] runs the schema change on the host with that given port.
    • Note: DO NOT mix hosts from different sections. You can mix hosts from different dcs (but in the same section) though.
    • Note: Make sure you set the correct port when it's not the default 3306
  • check is the function you need to write and pass (without calling it, note the missing ()) to the class. This function takes a db object that you can call run_sql('commonad;') in it. You can call it several times. The function must return a boolean value. True if the schema change is not needed or done already. False if it's needed. In the example above, it checks if the wiki has that table (flaggedrevs is enabled on a subset of wikis) and bails out if not. Then checks if the column doesn't exists. Returns False if it does (so the schema change can go ahead)
    • Note: DO NOT add write queries or slow queries in check. Otherwise, hell would break loose.
    • Note: The check is being run twice on each host/db. First to see if it's needed and skips if not. The second time to check if the schema change has been done correctly. If it fails in the second run, the script stops in the first host and won't repool it.

Running

Once the script has been written and reviewed. You can run it (in cumin as root) with:

python3 run_schema_change.py --run --include-masters
  • Without --include-masters, it will ask you before running schema change on any db that has hanging replicas (including sanitarium master, passive dc master, active dc master, etc.) but if you run it with --include-masters, it won't ask and applies the schema change on masters too (with replication enabled if needed see logic section below)
    • Note:: "y", "yes", "si" or "ja" will be considered as permission to run the schema change on the master.
  • Without --run it will be in dry mode and won't make any actions against production but it outputs what commands it would run on where. Run every script on dry mode first and check the output.
  • Note: DO NOT run this in parallel in one section, it will depool too many replicas.
  • Note: Given that repooling each host takes around 45 minutes, schema change runs must be run in a screen, otherwise it just refuses to work and stops.

logging

The log of its actions can be seen in directory logs/ under the name of the ticket. e.g. T297189.log. Note that dry runs also log into that file.

Logic

Depooling

  • Replicas that are in the active dc and are pooled, would be depooled. This is regardless of the if they are master themselves (e.g. sanitarium masters can get traffic too).
  • Masters of each dc won't get depooled for obvious reasons and schema change will be live on them.
  • Replicas of passive dc won't get depooled (you shouldn't set them anyway, just set it on master of the passive dc).
  • After depooling, the script waits for user traffic to drain.
    • Sometimes it might take very long time (due to mediawiki maintenance scripts keeping connection open).
    • It has a downtime of an hour, if the traffic doesn't drain after an hour, it'll repool and move on to the next host.
  • Before repooling, the script waits for replication to catch up.

With or without replication

  • If the host doesn't have replicas itself. It will be ran without replication.
  • If the host is a master:
    • If it's not the master of active dc. Meaning anything from sanitarium master to master of passive dc. It will get the change with replication.
    • If it's the master of the active dc. The schema change will run without replication.

Downtiming

  • Any host that the schema change is going to happen on them will get downtimed (if downtime value is set).
  • If the host has any replicas, those and replicas of replicas (and so on) will also get downtimed before start of schema change.
    • With the exception of master of active dc which would downtime only itself (and not the replicas) given that schema change won't happen with replication on these hosts.
    • Replicas (that will get the change with replication not directly) will be downtimed twice as long as the master.

More

You can write checks that are not straightforward in bash and run them across the fleet. For example, here is a fix for heartbeat GRANTs in hosts:

from auto_schema.host import Host
hosts = [('db2103', 's1'),
('db1116:3318', 's8'),
('db1114', 's8'),
('db1111', 's8'),
('db1109', 's8'),
('db1104', 's8')]

for host_tuple in hosts:
    host = Host(host_tuple[0], host_tuple[1])
    grants = host.run_sql('show grants for `wikiuser`@`10.64.%`').replace('`', '').replace('\'', '')
    if 'GRANT SELECT ON heartbeat.* TO wikiuser@10.64.%' in grants:
        if 'GRANT SELECT ON heartbeat.heartbeat TO wikiuser@10.64.%' in grants:
            print('needs removing the grant')
            host.run_sql('set session sql_log_bin=0; REVOKE SELECT ON `heartbeat`.* FROM `wikiuser`@`10.64.%`')
        else:
            print('needs fixing the grant')
            host.run_sql('set session sql_log_bin=0; GRANT SELECT ON `heartbeat`.`heartbeat` TO `wikiuser`@`10.64.%`; REVOKE SELECT ON `heartbeat`.* FROM `wikiuser`@`10.64.%`')

Future

In future, we can slowly start using this for automation of mysql and kernel upgrades in the fleet (phab:T239814). Parts of this code also should be ported to use spicecrack and become a cookbook in itself.