HTTP proxy
To allow HTTP requests to reach the outside world, we maintain a caching HTTP proxy in each datacenter. The proxies run on the install* servers and are exposed through service entries of the form webproxy.<datacenter>.wmnet.
How-to?
You can set the http_proxy and https_proxy environment variables to make many command-line tools use the site-specific proxy automatically.
The no_proxy and NO_PROXY variables are configured automatically across the infrastructure by the profile::environment Puppet module and Hiera settings.
Helper commands
In your terminal, run set_proxy. This sets the needed environment variables for the active session. unset_proxy does the opposite.
Manual config
export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
export no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
export NO_PROXY=$no_proxy
- "no_proxy" MUST be explicitly set
  - Prevents unnecessary load on the proxies (to fetch internal resources)
  - Prevents stale data cached on the proxies
  - Prevents unnecessary dependencies
- HTTP proxies SHOULD NOT be configured by default, but on a case-by-case (need) basis
  - It's preferred to set these variables for your current session only by using the helper commands at the terminal prompt
  - Services should leverage Puppet to configure proxies
- These proxies MUST NOT be used from Cloud VPS instances (enforced by ACLs)
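The manual configuration and the no_proxy rule above can be sketched in Python. This is an illustration only, not how set_proxy is implemented: the function names are hypothetical, the no_proxy list is shortened, and the suffix matching is a simplification (tools differ in how they interpret no_proxy).

```python
import os
import subprocess

# Assumption: shortened version of the no_proxy list shown above.
NO_PROXY = "127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org"


def proxy_env(datacenter, no_proxy=NO_PROXY):
    """Return the proxy variables for one datacenter, in both letter cases."""
    proxy = f"http://webproxy.{datacenter}.wmnet:8080"
    env = {"http_proxy": proxy, "https_proxy": proxy, "no_proxy": no_proxy}
    # Some tools only read the upper-case names, so export both.
    env.update({name.upper(): value for name, value in list(env.items())})
    return env


def bypasses_proxy(host, no_proxy=NO_PROXY):
    """Simplified check: exact match, or suffix match for entries starting with a dot."""
    host = host.lower()
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        if host == entry or (entry.startswith(".") and host.endswith(entry)):
            return True
    return False


def run_with_proxy(command, datacenter):
    """Run one command with the proxy variables set for that child process only."""
    env = {**os.environ, **proxy_env(datacenter)}
    return subprocess.run(command, env=env, check=True)
```

With this sketch, bypasses_proxy("mw-api-int-ro.discovery.wmnet") is true because of the .wmnet entry, so internal traffic would never be sent to the proxy, while an external host such as www.google.com would be.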
Internal endpoints
It is better to use internal endpoints instead of public ones; a list of reasons is given in this comment.
API
Use e.g.
https://mw-api-int-ro.discovery.wmnet:4446
and set the HTTP Host header to the domain of the site you want to access, e.g.
curl -H "Host: www.wikidata.org" https://mw-api-int-ro.discovery.wmnet:4446
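The same request can be prepared in Python with the standard library. This is a minimal sketch: the function names are hypothetical, the /w/api.php query is only an illustrative path, and certificate handling for the internal endpoint is not covered. The request is built but not sent.

```python
import urllib.request

# Read-only internal endpoint taken from the curl example above.
INTERNAL_API = "https://mw-api-int-ro.discovery.wmnet:4446"


def internal_api_request(site, path="/w/api.php?action=query&meta=siteinfo&format=json"):
    """Build a request to the internal endpoint with the Host header set to the target site."""
    return urllib.request.Request(INTERNAL_API + path, headers={"Host": site})


def no_proxy_opener():
    """Build an opener that ignores any http_proxy/https_proxy environment variables."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


request = internal_api_request("www.wikidata.org")
opener = no_proxy_opener()
# opener.open(request) would send the request; it is left out of this sketch.
```

Passing an empty dictionary to ProxyHandler disables proxy autodetection, which keeps internal traffic off the web proxies even if no_proxy is not set in the calling environment.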
MediaWiki On Kubernetes internal API endpoints:
- Direct usage
  - Read-only: https://mw-api-int-ro.discovery.wmnet:4446
  - Read-write: https://mw-api-int.discovery.wmnet:4446
- Listeners to use through the Envoy Services Proxy:
  - Read-only: mw-api-int-async-ro
  - Read-write: mw-api-int or mw-api-int-async
For examples in Python and R, refer to these notes.
LiftWing
See Machine Learning/LiftWing/Usage#Internal endpoints
A complete list of discovery endpoints exists at: https://config-master.wikimedia.org/discovery/discovery-basic.yaml
Outbound ports
The Squid configuration only allows connections to external ports listed in the profile::installserver::proxy::ssl_ports and profile::installserver::proxy::safe_ports allow lists. These ports are typically configured in the hieradata/common/profile/installserver/proxy.yaml settings file.
As of 2025-08-04 the configured ports are:
profile::installserver::proxy::ssl_ports:
- 443
- 873 # rsync used by rpki
- 6443 # T394838: OpenStack Magnum Kubernetes API
profile::installserver::proxy::safe_ports:
- 80
- 8080 # http://wpt-graphite.wmftest.org:8080/
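To check in advance whether a URL is likely to be accepted, the two lists can be mirrored in a small helper. This is a rough sketch under a stated assumption: HTTPS URLs are checked against ssl_ports and plain HTTP URLs against safe_ports. The actual Squid ACLs may combine the lists differently, and the port values are a snapshot of the lists above.

```python
from urllib.parse import urlsplit

# Snapshot of the hiera values listed above (as of 2025-08-04).
SSL_PORTS = {443, 873, 6443}
SAFE_PORTS = {80, 8080}
DEFAULT_PORTS = {"http": 80, "https": 443}


def port_allowed(url):
    """Return True if the URL's port appears in the allow list assumed for its scheme."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if parts.scheme == "https":
        return port in SSL_PORTS
    if parts.scheme == "http":
        return port in SAFE_PORTS
    # Other schemes are outside the scope of this sketch.
    return False
```

For example, port_allowed("http://wpt-graphite.wmftest.org:8080/") is true, while a URL on port 8443 would be rejected by this check.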
Example usage
curl
If you are using curl, you can use the --proxy flag:
curl --proxy http://webproxy.eqiad.wmnet:8080 http://www.google.com
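A Python script can select the proxy explicitly in a similar way, without relying on environment variables. This is a minimal sketch using only the standard library; the function name is hypothetical and the request itself is not sent here.

```python
import urllib.request

# Same proxy as in the curl example above.
PROXY = "http://webproxy.eqiad.wmnet:8080"


def proxied_opener(proxy=PROXY):
    """Build an opener that sends both HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


opener = proxied_opener()
# opener.open("http://www.google.com") would fetch the page through the proxy.
```

Scoping the proxy to one opener follows the same idea as the --proxy flag: only the requests that need external access go through the proxy.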
wget
wget has no --proxy flag; set the appropriate environment variable instead.
https_proxy=http://webproxy:8080 wget https://www.google.com
Maven proxy configuration example
You can reference the proxy in your Maven configuration file ~/.m2/settings.xml so that packages are fetched through it at build time.
<settings>
  <proxies>
    <proxy>
      <id>http-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
    <proxy>
      <id>https-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
ant
In addition to the environment variables defined above, invoke ant with the -autoproxy argument.
Spark
If your Spark job pulls dependencies via spark.jars.packages, you can point it to a settings file that takes care of proxying automatically by mirroring through our Archiva instance:
conf={
...
"spark.jars.packages": "...", # packages to pull go here
"spark.driver.extraJavaOptions": "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home ",
"spark.jars.ivySettings": "/etc/maven/ivysettings.xml"
}
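The settings above can be wrapped in a small helper so that the package list is the only thing that changes per job. This is a sketch only: the function name is hypothetical, and the Ivy paths are copied from the example above.

```python
def spark_proxy_conf(packages):
    """Return Spark settings that resolve the given Maven coordinates via the Ivy settings file."""
    return {
        # spark.jars.packages expects a comma-separated list of Maven coordinates.
        "spark.jars.packages": ",".join(packages),
        "spark.driver.extraJavaOptions": (
            "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home"
        ),
        "spark.jars.ivySettings": "/etc/maven/ivysettings.xml",
    }
```

Each key and value can then be passed to SparkSession.builder.config(key, value) or merged into an existing conf dictionary.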
Monitoring
Access log dashboard: https://logstash.wikimedia.org/app/dashboards#/view/58c908a0-a394-11ec-bf8e-43f1807d5bc2
Requests: https://grafana.wikimedia.org/d/i5YA-BXWz/squid
Future/possible improvements
- Helper script to correctly configure the proxies for the current user session - T278315 - global http_proxy setting
- Centrally managed global no_proxy settings - T278315 - global http_proxy setting
- Maybe restrict domains accessible by webproxy
- Improve proxies redundancy - T242715
Reference
See also
- url-downloader (another set of squid proxies for slightly different use cases)
- T254011: Why do we have 2 sets of squid proxies?
- We need to talk: Can we standardize NO_PROXY? - useful blogpost about proxy settings support across tools