User:Razzi/grand SRE IC plan
Trying to become the most capable SRE at wmf. Or at least really capable; the competition is really with myself.
I already have root ssh credentials, so there's nothing I can't do from a permission standpoint. But realistically it's worth keeping day-to-day ops separate from root; at this point I don't need root on the mediawiki servers etc. Carrying root everywhere is pretty dangerous: somebody in my position could easily dump some misinformation into the pipes and it would cause problems.
Also, I've been in software for about a decade, so I have the technical foundation to learn anything. If you're trying to replicate this, know it'll take a few years to get up to speed on unix, programming, networking, and cryptography. But really all you need is a good command line and a willingness to be patient with yourself; software stuff can be really dense. Learning a single command line tool like `ssh` or `git` can be a lifelong process, or at least take a few weeks of concerted effort.
Ok so I'm starting on the data engineering team, and I have some open tickets to get things like superset and presto (eventually trino) working. If I were an expert in all of these things this would be easier, so I'll become an expert in each. First let me check my backlog to see what else I should do.
Aside: here's the job spec for a senior SRE for Search Platform:
* Deployment, scaling, monitoring, provisioning, and support of our Search and SPARQL endpoints
* Developing and maintaining automation tools and processes
* Providing guidance and expertise to the team on productionizing and operating our applications
* Configuration management and deployment tools
* Ensuring the continuous improvement and evolution of services on our platform
* Monitoring of systems and services, optimization of performance and resource utilization
* Incident response, diagnosis and follow-up on system outages or alerts
* Assisting in software updates

Skills and Experience:

* 5+ years experience in an SRE/Operations/DevOps role as part of a team
* Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby, etc.; we use primarily Python)
* Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack)
* Good understanding of Linux systems
* Experience in automating tasks and processes, identifying process gaps, and finding automation opportunities
* Open to supporting JVM-based applications
* Strong English language skills and ability to work independently, as an effective part of a globally distributed team
* B.S. or M.S. in Computer Science or related field, or equivalent in related work experience
And here's the job I have:
The Wikimedia Foundation is hiring a Site Reliability Engineer to support and maintain the data and statistics infrastructure that powers a big part of decision making in the Foundation and in the Wiki community. This includes everything from eliminating boring things from your daily workflow by automating them, to upgrading a multi-petabyte Hadoop cluster to the next upstream version without impacting uptime and users. We're looking for an experienced candidate who's excited about working with big data systems. Ideally you will already have some experience working with software like Hadoop, Kafka, ElasticSearch, Spark and other members of the distributed computing world. Since you'll be joining an existing team of SREs you'll have plenty of space and opportunities to get familiar with our tech (Analytics, Search, WDQS), so there's no need to immediately have the answer to every question. We are a full-time distributed team with no one working out of the actual Wikimedia office, so we are all together in the same remote boat. Part of the team is in Europe and part in the United States. We see each other in person two or three times a year, either during one of our off-sites (most recently in Europe), the Wikimedia All Hands (once a year), or Wikimania, the annual international conference for the Wiki community. 
Here are some examples of projects we've been tackling lately that you might be involved with:

* Integrating an open-source GPU software platform like AMD ROCm in Hadoop and in the Tensorflow-related ecosystem
* Improving the security of our data by adding Kerberos authentication to the analytics Hadoop cluster and its satellite systems
* Scaling the Wikidata query service, a semantic query endpoint for graph databases
* Building the Foundation's new event data platform infrastructure
* Implementing alarms that alert the team of possible data loss or data corruption
* Building a new and improved Jupyter notebooks ecosystem for the Foundation and the community to use
* Building and deploying services in Kubernetes with Helm
* Upgrading the cluster to Hadoop 3
* Replacing Oozie with Airflow as a workflow scheduler

And these are our more formal requirements:

* Couple years experience in an SRE/Operations/DevOps role as part of a team
* Experience in supporting complex web applications running highly available and high traffic infrastructure based on Linux
* Comfortable with configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.), and modern observability infrastructure (monitoring, metrics and logging)
* An appetite for the automation and streamlining of tasks
* Willingness to work with JVM-based systems
* Comfortable with shell and scripting languages used in an SRE/Operations engineering context (e.g. Python, Go, Bash, Ruby, etc.)
* Good understanding of Linux/Unix fundamentals and debugging skills
* Strong English language skills and ability to work independently, as an effective part of a globally distributed team
* B.S. or M.S. in Computer Science, related field or equivalent in related work experience. Do not feel you need a degree to apply; we value hands-on experience most of all.
Ok back to the backlog. I see:
* T293083 Superset SQL Lab fails to stop query
* T288975 Cookbook to reboot cassandra nodes
* T294772 Superset Timeout Logging
* T294771 Increase Superset Timeout
* T294768 Triage Superset Dashboard Timeouts
* T292087 Setup Presto UI in production
* T273004 Presto should warn or prevent users from querying without Hive partition predicates
* T277553 varnishkafka / ATSkafka should support setting the kafka message timestamp
* T273850 Superset caching doesn't enforce data access permissions
* T269832 Add a presto query logger
* T279738 Superset annotation text overlaps illegibly
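For the Superset timeout tickets, the knobs most likely involved live in `superset_config.py`. A minimal sketch, using upstream Superset's standard config key names; the values here are illustrative placeholders, not what our production should actually use:

```python
# superset_config.py -- sketch for the Superset timeout work.
# Key names come from upstream Superset's default config; values are made up.

# How long the web server waits on a request before giving up (seconds).
SUPERSET_WEBSERVER_TIMEOUT = 120

# Timeout for synchronous SQL Lab queries (seconds).
SQLLAB_TIMEOUT = 120

# Upper limit for asynchronous SQL Lab queries running via Celery (seconds).
SQLLAB_ASYNC_TIME_LIMIT_SEC = 60 * 60 * 6
```

Note that raising `SUPERSET_WEBSERVER_TIMEOUT` alone isn't enough if the WSGI server in front of Superset (e.g. gunicorn's `--timeout`) kills the worker first, so the two would need to be raised together.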