The Wikimedia media thumbnailing infrastructure is based on Thumbor.
As of June 2017, all thumbnail traffic for public and beta wikis is served and rendered by Thumbor.
As of February 2018, all thumbnail traffic for private wikis is served by Thumbor.
The Mediawiki-based image scaling cluster is expected to be retired in 2018, as it no longer serves any traffic.
- Better support Thumbor has a lively community of its own, and is a healthy open-source project. In contrast, the media-handling code in Mediawiki is supported on a best-effort basis by very few people
- Better security isolation Thumbor is stateless and connects to Swift, Poolcounter and DC-local Thumbor-specific Memcache instances (see "Throttling" below). In contrast, Mediawiki is connected to many more services, as well as user data and sessions. Considering how common security vulnerability discoveries are in media-processing software, it makes sense to isolate media thumbnailing as much as possible.
- Better separation of concerns Thumbor only concerns itself with thumbnail generation. This is desirable in a service-oriented architecture.
- Easier operations Thumbor is a simple service and should be easy to operate.
Supported file types
We have written Thumbor engines for all the file formats used on Wikimedia wikis. Follow these links for special information about the Thumbor engines for those formats:
These engines reuse the same logic as Mediawiki to render those images, often leveraging the same underlying open-source libraries or executables. Whenever possible, reference images generated with Mediawiki are used for the Thumbor integration tests.
In order to understand Thumbor's role in our software stack, one has to understand how Wikimedia production is currently serving those images.
The edge, where user requests first land, is Varnish. Most requests for a thumbnail are a hit on the Varnish frontend or backend caches.
When Varnish can't find a copy of the requested thumbnail - whether it's a thumbnail that has never been requested before, or ones that fell out of Varnish cache - Varnish hits the Swift proxies. We run a custom plugin on our Swift proxies, which is responsible for parsing the thumbnail URL, determining whether there is a copy of that thumbnail already stored in Swift, serving it if that's the case, asking Thumbor to generate it otherwise.
There is one exception to this workflow, which is when a request is made directly to thumb.php. In that case the request isn't cached by Varnish and is send to Mediawiki, which then proxies it to Thumbor. This is the same behavior used by private wikis, described below. These requests are actually undesirable because of their inefficiency (skipping Varnish caching) and are all coming from gadgets, not from Mediawiki itself. It would be interesting to perform a cleanup campaign, encouraging gadget owners to migrate their code to the proper way of crafting well-cached thumbnail URLs and subsequently blocking thumb.php use on public wikis once the cleanup is complete.
In the case of private wikis, Varnish doesn't cache thumbnails, because Mediawiki-level authentication is required to ensure that the client has access to the desired content (is logged into the private wiki). Therefore, Varnish passes the requests to Mediawiki, which verifies the user's credentials. Once authentication is validated, Mediawiki proxies the HTTP request to Thumbor. A shared secret key between Mediawiki and Thumbor is used to increase security.
Hitting Thumbor (common)
When Thumbor receives a request, it tries to fetch the original media from Swift. If it can't, it 404s. It then proceeds to generate the request thumbnail for that media. Once it's done, it serves the resulting image, which the Swift proxy then forwards to Varnish, which serves it to the client. Varnish saves a copy in its own cache, and Thumbor saves a copy in Swift.
Ways our use of Thumbor deviates from its original intent
Thumbor, in its default configuration, never touched the disk, for performance purposes. Since most image processing software isn't capable of streaming content, it keeps the originals entirely in memory in a request lifecycle. This works fine for most websites that deal with original media files that are a few megabytes at most. But the variety of files found on Wikimedia wikis mean we deal with some original files that are several gigabytes. This core logic in Thumbor of keeping originals in memory doesn't scale to the concurrency of large files we can experience.
This logic of passing the whole original around is deeply baked into Thumbor, which makes it difficult to change Thumbor itself to behave differently. Which is why we opted for a workaround, in the form of custom loaders. Loaders are a class of Thumbor plugins responsible for loading the original media from a given source.
Our custom loaders stream the original media coming from its source (eg. Swift) directly to a file on disk. The path of that file is then passed via a context variable, and the built-in variable in Thumbor that normally contains the whole original only contains the beginning of the file. Passing this extract allows us to leverage Thumbor's built-in logic for file type detection, because most file types signal what type they are at the beginning of the file.
Filters normally do something in Thumbor. We had needs, such as multipage support, that span very different engines. Which is why we repurposed filters to simply pass information to each engine, which is then responsible for applying the filter's functionality, instead of having logic for every possible engine baked into the filter. This deviates from Thumbor's intent to have filters do something, since not all engines have to do something according to a filter.
Image processing ordering
Thumbor tends to perform image operations right away (including filters), when it processes them. For performance and quality conservation purposes, we often queue those image operations and perform them all at once in a single command. This need is also increased by our reliance on subprocesses.
Thumbor's default engines do everything with Python libraries. While this has the advantage of cleaner code, and doing everything in the same process, it has the disadvantage... of doing everything in the same process. On Wikimedia sites, we deal with a very wide variety of media. Some of the files would require too much memory to resize and can't be processed, some take too long. In the default Thumbor way of doing things, we could only set resource limits on the Thumbor process and no time limits because Thumbor is single-threaded (so a call to an operation on a python library can't be aborted). By doing all our image processing using subprocess commands, we have better control over resource and time limits for image processing. Which means that a given original being problematic is much less likely to take down the Thumbor process with it, or hog it, and other content can be processed.
Thumbor doesn't have infrastructure for multiple engines. It only expects a single engine as configuration and has a hardcoded special case for GIF. Due to this lack of generic multi-engine support, we developed our own using a proxy engine, which acts as the default Thumbor engine and routes requests to the various custom engines we've written.
We've also had to monkey-path Thumbor's MIME type support to enable the new mime types supported by our various engines. Overall this is a weak area in Thumbor's extensibility that we had to work around, but changes could be made upstream to be more accommodating to our usage pattern.
In order to prevent abuse and to distribute server resources more fairly, Thumbor has a few throttling mechanisms in place. These happen as early as possible in the request handling, in order to avoid unnecessary work.
Failure throttling require having a memory of past events. For this we use Memcached. In order to share the throttling information across Thumbor instances, we use a local nutcracker instance running on each Thumbor server, pointing to all the Thumbor servers in a given datacenter. This is configured in Puppet, with the list of servers in hiera under the
thumbor_memcached_servers_nutcracker config variables.
In Thumbor's configuration, the memcached settings used for this are defined in
FAILURE_THROTTLING_PREFIX, found in Puppet.
The failure throttling logic itself is governed by the
FAILURE_THROTTLING_DURATION Thumbor config variables. This throttling limits retries on failing thumbnails. Some originals are broken or can't be rendered by our thumbnailing software and there would be no point retrying them every time we encounter them. This limit allows us to avoid rendering problematic originals for a while. We don't want to blacklist them permanently, however, as upgrading media-handling software might suddenly make originals that previously couldn't be rendered start working. This limit having an expiry guarantees that the benefits of upgrades apply naturally to problematic files, without requiring to clear a permanent blacklist whenever software is upgraded on the Thumbor hosts.
For other forms of throttling, we use Poolcounter. Both to combat malicious of unintentional DDoS, as well as regulate resource consumption. The Poolcounter server configuration shared by the different throttling types is defined in the
POOLCOUNTER_RELEASE_TIMEOUT Thumbor config variables, found in Puppet.
We limit the amount of concurrent thumbnail generation requests per client IP address. The configuration for that throttle is governed by the and
POOLCOUNTER_CONFIG_PER_IP Thumbor config variable, found in Puppet.
We limit the amount of concurrent thumbnail generation requests per original media. The configuration for that throttle is governed by the and
POOLCOUNTER_CONFIG_PER_ORIGINAL Thumbor config variable, found in Puppet.
Some thumbnail types are disproportionately expensive to render thumbnails for (in terms of CPU time, mostly). Those expensive types are subject to an extra throttle, defined by the
POOLCOUNTER_CONFIG_EXPENSIVE Thumbor config variable, found in Puppet.
Unlike Mediawiki, Thumbor doesn't implement a per-user Poolcounter throttle. First because Thumbor has greater isolation (on purpose) and doesn't have access to any user data, including sessions. Secondly, the per-IP throttle should covers the same ground, as logged-in users should have little IP address variance during a session.
Thumbor logs go to
/srv/log/thumbor on the Thumbor servers. All the Thumbor instances on a given server write to the same files. Logs are rotated daily. The logging configuration is defined in Puppet, under the
THUMBOR_LOG_CONFIG Thumbor config variable.
Thumbor logs also go to Logstash/Kibana.
Thumbor consumes its configuration from the
/etc/thumbor.d/ folder. The .conf files found in that folder are parsed in alphabetical order by Thumbor. The
thumbor Debian package as well as our custom
python-thumbor-wikimedia Debian package contain default configuration files. On top of which we add some defined in Puppet.
The rule of thumb here is that configuration that might depend on the instance or datacenter at hand should be defined in Puppet, while configuration that won't vary per machine can be defined in the
python-thumbor-wikimedia Debian package.
Updating the custom Thumbor plugins
Our custom Thumbor plugins have their reference repo at https://phabricator.wikimedia.org/diffusion/THMBREXT/
Testing the changes
Before putting anything up for review, you can test your changes locally. In fact some tests that require connecting to the internet only run locally and will not be run as part of the Debian package build process (they are blacklisted there because network is turned off during packaging). The tests are simply run by calling
nosetests at the root of the thumbor plugins directory, after running the python setup (which installs dependencies like stock thumbor) at least once.
Once the tests pass locally, if you're using an OS different than Debian Jessie for local development, you should run the tests again on a WMCS machine. This is because variations in the exact versions of the underlying software, like imagemagick, can result in different visual comparison scores for example from one platform to the next. Usually when writing tests that do visual comparison to a reference thumbnail, you want to start with the highest SSIM threshold possible locally, and adjust if necessary once running on the same platform as production Thumbor servers (Debian Jessie currently).
If you want to run only the subset of tests that the Debian package will be built against, you can find the nosetests parameters in the Thumbor plugins Debian package repo: https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-thumbor-wikimedia in the
Copying changes from the plugins repo to the Debian package repo
Once the changes to the plugins repo have been tested on WMCS, we can bump the version in the plugins repo with a commit. After which the changes are simply copied over from the plugins repo to the Debian package repo. When copying changes over, make sure that deleted files are deleted as well, and the the setup.py from the root, which contains the version number, is also copied over. Remember to update the Debian package changelog, update dependencies if needed, and put the Debian package changes up for review.
Thumbor is deployed via Debian packages, specifically
python-thumbor-wikimedia contains WMF extensions to process additional file types, talk to Swift and so on. The repository with
debian/ directory lives at
operations/debs/python-thumbor-wikimedia while "upstream" repository is at
debian/changelog has been updated, it is possible to build a new package by first tagging
upstream/VERSION the relevant commit and then
git-buildpackage -us -uc -S to create a
dpkg-source -b . to create the source package. Once the source package is available it can be built with
BACKPORTS=yes DIST=jessie-wikimedia sudo -E cowbuilder --debbuildopts -sa --build DSC_FILE and uploaded to
After deploying the new Debian packages, or making changes to the Thumbor configuration, once needs to restart Thumbor instances. This needs to be a rolling restart, to allow for the Thumbor cluster to keep serving requests while instances are being restarted. We're depooling machine one by one, even though a single instance being stopped mid-request is ok, since the reverse proxy layer in front of the Thumbor instances takes care of retrying the request.
cumin -b1 -s10 'thumbor1*' 'depool && sleep 7 && systemctl restart thumbor-instances && sleep 2 && pool' cumin -b1 -s10 'thumbor2*' 'depool && sleep 7 && systemctl restart thumbor-instances && sleep 2 && pool'
Dashboards and logs
Thumbor runs with python manhole for debugging/inspection purposes. See also T146143: Figure out a way to live-debug running production thumbor processes
To invoke manhole, e.g. on thumbor on port 8827:
sudo -u thumbor socat - unix-connect:/srv/thumbor/tmp/thumbor@8827/manhole-8827