Machine Learning/Technical Meeting Notes
- Andy: What do we want to learn? What would be a good demo for an online feature store?
- Just load a bunch of revscoring features in?
- What kind of storage makes sense for us? Swift, parquet, sql?
- Luca: How do we want to structure the +3 nodes in eqiad & codfw
- we have codfw, eqiad still needs to be racked
- score cache could be a separate Redis instance
- Tobias: having separate instances would allow us to tune each instance.
- Luca: Feast wants a single Redis endpoint for host & config; if we have multiple nodes, we may need a proxy in the middle.
- Let's figure out how we save the registry, and also how to handle the single Redis endpoint.
- Followup: how do we load data into Feast (Airflow)? How much space do we need? etc.
- Andy: was there an issue on beta?
- Luca: There was a local proxy issue, which Taavi fixed; not sure if the deployment-prep VM is fixed.
- Nothing is burning, no weird errors, things seem to work.
- Might be nice for learning if Aiko wants to work on the task w/ Luca
- Aiko: I would like to learn more about the difference between ORES & Lift Wing
Changeprop calling ORES
- Luca: we could post to eventgate in our model.py for Lift Wing
- Tobias: could we do this in Istio on our side? a bit like request logging right?
- Luca: it's just a simple POST request so we could do it in the Python code; we could maybe try Knative eventing, but we are using an older version of Knative.
- Tobias: agreed, doing it w/ a library or wrapper makes a lot of sense; the only tricky bit is you don't want to delay the call.
- Andy: does consistency matter? Can we just fire off a POST request via asyncio and then return our prediction to the user?
- Luca: that should be fine, if some are missing it's not super problematic.
Lift Wing migration
- Luca: moving models will take space on the cluster with the current cgroup configs
- was hoping the new system would not
- Tobias: Lift Wing is not homebrew, which is an advantage.
- Andy: the revscoring images are pretty big, also they include a ton of assets
- other models won't necessarily be like this
- Luca: we are halfway through migrating ORES images and cpu/memory is filling up.
- maybe fine with 8 nodes?
- Luca: we are unblocked, recent patches are now running on beta
API Gateway platform
- Chris: I think there is now a big push to get it in a good place, which is awesome for us
- Luca: We should connect and start making sure everything works as expected
- header pathing map etc.
- Luca: we may need to change Swift clusters to MOSS
- the paths should stay the same
- Luca: we can just include a parameter that will make an HTTP POST to eventgate
- Are we blocked?
- Chris: short-term, Hugh & us will unblock; unsure of the long-term status of the project
- Tobias: will let you know outcome of upcoming meeting, all our asks could be no big issue.
- Images are big (transformer + predictor)
- also transformer + predictor both need to mount/load model into pod from storage
- editquality will have 30+ isvcs running two large images
- Chris: Do we want to use transformers on future models? Yes. The ORES models are a special case.
- Kevin: the transformers seemed fine until we needed to load the models into the separate pods, now it seems really heavy.
- Andy: my one argument for keeping the heavy transformers was that we could use it in an explainer, but that does not seem to work (maybe a kserve bug?)
- Tobias: Forcing transformers architecture on revscoring models may not help us gain much, other than keeping it alive longer.
- Luca: for editquality it might make sense to go back to predictor
- the mw transformers are not async either so there is a bottleneck.
- Kevin: Also, regarding maintainability, we are currently loading models + revscoring and all its dependencies in both the transformer and predictor. Loading them in only the predictor is much more maintainable. It's more DRY.
- Chris: We should get all the models onto Lift Wing before ORES dies (hardware out of warranty, stretch is ending LTS in a few months, etc.) and then we can improve the models.
- Luca: Looking at traffic in ores1001 - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ores1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores
- Lots of hosts doing nothing
- Tobias: I wish we could easily sample say 5% of current ORES-bound traffic and see what happens to it on LW
- Luca: We could use changeprop and see how it handles on LW
- Chris: We could do an experiment, although every single model we are not able to migrate over is a conversation we need to have with the community.
- Andy: It would be pretty easy to migrate editquality over to LW. We just need to add the model binary files to Thanos Swift and then update the helm files.
- Chris: Let's load every single ORES model into LW. We got 110+ of them, let's start moving them.
- Luca: For editquality, let's get rid of transformer, move to predictor-only and then start spinning up pods.
- Andy: I will make a task and then Kevin and I can split the work from there.
- Kevin: What is our staging environment going to be?
- Chris: Should we do dev on staging?
- Luca: We have ml-sandbox
- Kevin: ml-sandbox is good
- Andy: I think we were unsure of what the testing cluster would be used for. Also the cluster-local-gateway networking issues hadn't been solved on ml-sandbox yet, so we were unsure if we should continue maintaining our dev cluster. Things are good now and I think ml-sandbox is good for dev.
ORES deploy planning
- Chris: 4 tasks
- Andy: Should we wait on the logger changes and fix security bugs before full deploy?
- Luca: I have a logger pr: https://github.com/wikimedia/ores/pull/355
- Chris: What are the chances of the nlwiki model breaking things?
- Maybe we do a model deployment first -> then logging -> then dep upgrade
- Luca: the celery update will be tricky, what about pyyaml?
- Andy: there is a wrapper for pyyaml that will need to be updated: https://github.com/halfak/yamlconf
- Luca: Let's see if we can get a new version pushed to PyPI, otherwise we can fork and install it ourselves.
- Tobias: re: upgrades, Let's do risk assessment, do the smallest first then iterate.
- Luca: We can test on canary for a few days
- Luca: it would be helpful to know who is using ORES the most
- Andy: list of ORES applications: https://www.mediawiki.org/wiki/ORES/Applications
API Gateway Integration
- Chris: Where are we on this?
- Tobias: Luca and I have been discussing our wants & needs; we still need to get info about feasibility.
- Chris: Let's figure out nice-to-haves, needs for a production system, and what we need to get to MVP.
- Luca: All things we have asked for have almost been delivered, but we need to start testing the integration
- Hugh has been very helpful in deploying changes to prod.
- Chris: it's not really a standalone ML model
- we won't need to host this (built-in logistic regression feature in elasticsearch)
- Luca: where is this hosted/ who owns this?
- Chris: Cormac on platform, I believe.
Lift Wing MVP check-in
- Luca: what is left for SRE?
- Finish load balancer endpoint - https://phabricator.wikimedia.org/T289835
- API Gateway integration - https://phabricator.wikimedia.org/T288789
- still need rate limiting, but ok to start testing
- egress gateway works
- cert-manager is deployed - https://phabricator.wikimedia.org/T298976
- load testing - https://phabricator.wikimedia.org/T296173
- Luca: Is score-caching and feature store outside MVP scope?
- Chris: yes
- Andy: we need to finish up transformers
- We know how to create an inference service (predictor, transformer, models)
- We can upload model binaries to Thanos Swift via statboxes using the model_upload script
- Dev work on testing transformers on the ML sandbox - not an MVP blocker
- Need to figure out how to run transformers on ml-sandbox, cluster-local-gateway issue?
- Also need to decide where to store dev model binaries (pvc, minio, swift or keep using old bucket?)
Wikidata ORES spikes
- no visibility
- missing logs
- Luca: I think it happens during feature extraction?
- Luca: adding more logging would help us figure out what is happening
- Chris: this will help keep ORES stable while we migrate to Lift Wing
- Andy: this could be helpful for us to see if there is a bug buried somewhere.
- Luca: maybe not fix the bug but help us know where the issue is
- Andy: we need to deploy the new nlwiki articlequality model, maybe this week?
- Luca: let's include the logging updates for the spikes
- Andy: +1, I'm happy to do the deploy; maybe we can record it and use it as a side-by-side comparison video w/ Lift Wing?
- Luca: let's catalog all clients, bots, and how people are calling ORES (APIs etc.); maybe a good starting task for Aiko?
- Andy: +1 - we need local contacts for wiki communities too
- Hal (privacy engineer) had suggested having 'service cards' that describe all downstream users for a model, might be good to have early-on for lift wing.
- Luca: a preliminary list of users, tools, etc.
Feast spike & hardware order
- Luca: We should discuss our plans for the Feature store(s), we have to review procurement tasks for dcops this week. Current Plan:
- 3 redis-like nodes in eqiad
- 3 redis-like nodes in codfw
- 2 nodes in eqiad (for the offline store, but it was in early stages, not even sure if we need those)
- Online Feature Store Task: https://phabricator.wikimedia.org/T294434
- Score Cache:
- Luca: ORES models may not need a feature store right away
- Chris: from product perspective, score cache would handle MVP use-case
- Tobias: Having a cache and not needing it is lower risk than doing full feature store
- Chris: let's try to use the same boxes for score cache and then later on the online feature-store.
- build score-cache first, then progress to online-feature store
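As a rough illustration of the score-cache-first plan, the sketch below caches scores and only computes them on a miss. Everything here is an assumption made for the example: the `ScoreCache` class, the `wiki:model:rev_id` key layout, and the TTL are hypothetical, and an in-memory dict stands in for the Redis nodes so the snippet runs on its own.

```python
import time


class ScoreCache:
    """In-memory stand-in for a Redis-backed score cache (illustrative only)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, score)

    @staticmethod
    def key(wiki, model, rev_id):
        # Hypothetical key layout; a real one might also include a model version.
        return f"{wiki}:{model}:{rev_id}"

    def get_or_compute(self, wiki, model, rev_id, compute):
        """Return a cached score, or call compute() and cache its result."""
        key = self.key(wiki, model, rev_id)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        score = compute()
        self._entries[key] = (now + self._ttl, score)
        return score
```

A later online feature store could sit behind the same `compute` callback, which is one way the same boxes might serve both purposes over time.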
nlwiki articlequality deployment
- Andy: the new version ranges for revscoring on the model class repos have some implications for the inference-service image builds. The 'easiest' solution so far has been to install model-class repo via git+ssh and pin to a specific commit in requirements.txt
- Chris: Let's deploy for now and use the git+ssh hack while we finish the MVP
- Andy: Will merge PRs this week and plan for a deployment next week
- Luca: we will need to update clients / all users of ORES with new endpoints
- New URLs will not be a simple redirect due to the API Gateway etc.
- Chris: Let's start getting in touch with down-stream users, I will start asking around.
ORES wikidata spike
- Luca: we are seeing occasional spikes: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=72&orgId=1&refresh=1m
- is this a model error or ORES-error?
- Tobias: Classic problem - We have signal but we aren't sure what is noise.
- Luca: Let's make a task (update: https://phabricator.wikimedia.org/T299137)
- ML monitoring
- Prometheus -> Grafana
- logs -> logstash(?)
- Status codes
- Tobias: 4xx means the client screwed up, 5xx means we screwed up
- Tobias: there are some great rpc tracing tools that let you explore each step in a workflow, it would be helpful to have something similar
- Andy: I've seen zipkin and jaeger recommended for distributed tracing in our stack
- Luca: Per pod performance is not great at the moment
- Maybe we need to tune CPU & Memory?
- Tobias: What is being 'starved'? Does container still see full machine?
- Luca: there is 'blocking' code, not sure if this is on IO, will ask in slack
- max asyncio workers is cpu-count + 4
- Luca: Let's bump the limit to two for now.
- Andy: articlequality transformer image is done (need to update chart & deploy to ml-serve)
- Luca: Some changes will need to be made in deployment-charts to support transformer definition
- Chris: Do we need a separate transformer for post-processing?
- Andy: Nope, transformers can have both preprocess and postprocess methods
- Andy: Next step is to create transformers for editquality, draftquality and topic
- These images might also be large. They may need the same assets as the predictor due to feature extraction process. Another reason to start work on a feature store soon!
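The point that one transformer can carry both a preprocess and a postprocess step can be sketched as follows. This is a plain-Python stand-in, not the kserve base class: the class name, the `fetch_features` callable, the `handle` helper, and the request/response shapes are all hypothetical and chosen only to show the flow.

```python
class ArticleQualityTransformerSketch:
    """Plain-Python stand-in for a transformer; does not use kserve."""

    def __init__(self, fetch_features):
        # fetch_features is a hypothetical callable: rev_id -> list of feature values.
        self._fetch_features = fetch_features

    def preprocess(self, inputs):
        """Turn the client request into the payload the predictor expects."""
        rev_id = inputs["rev_id"]
        return {"rev_id": rev_id, "instances": [self._fetch_features(rev_id)]}

    def postprocess(self, outputs):
        """Reshape the predictor's raw response for the client."""
        return {"score": outputs["predictions"][0]}


def handle(transformer, predictor, request):
    """Request flow: preprocess -> predictor -> postprocess."""
    return transformer.postprocess(predictor(transformer.preprocess(request)))
```

Because feature extraction lives in `preprocess`, the transformer image needs whatever assets extraction depends on, which is why these images may end up as large as the predictor's.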
Deployment Pipeline image issues
- What is the 'stable' tag? https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/.pipeline/config.yaml#40
- It's an alias to the most recent image
- Kevin: Why are there so many new images published? Some seem to be duplicates?
- Andy: It looks like we are publishing on each patchset, which is not good :( my bad!
- Happy to work on (or tag-team) this during the silent week. edit: https://phabricator.wikimedia.org/T297823
- Luca: We can manually delete older images from registry
- Any image pre-December 2021 should be deprecated now due to kserve migration https://phabricator.wikimedia.org/T293331