Nova Resource:Wikisource/Documentation
Revision as of 03:20, 9 March 2020
Wikisource
Description
Wikisource is a VPS project for Wikisource-related tools. At the moment, it hosts only the #Wikisource Export tool (see below).
Project status
currently running
Contact address
Wikisource Export
URL: https://wsexport.wmflabs.org
Staging URL: https://wsexport-test.wmflabs.org
Source: https://github.com/wsexport/tool
License: GPL-2.0
We have two VPS instances and two Toolforge tools (one each for production and staging). The latter exist because wsexport used to be hosted on Toolforge, and the VPS instances still use those tools' databases and email addresses.
Creating a new instance
Create a new m1.large instance running on Debian Buster (or m1.small for staging instances). Once the instance has been spawned, SSH in and follow these steps:
- Install PHP and Apache, along with some dependencies:
sudo apt update && sudo apt -y upgrade
sudo apt -y install php php-common
sudo apt -y install php-cli php-fpm php-json php-xml php-mysql php-sqlite3 php-intl php-zip php-mbstring php-curl
sudo apt -y install apache2 libapache2-mod-php zip unzip default-mysql-client libgl1
- Install Calibre by following these instructions (the Calibre developers recommend not using the distribution-packaged version, as it can be out of date)
- Install Composer by following these instructions, but make sure to install to the /usr/local/bin directory and with the filename composer, e.g.:
sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
- Clone the repository, first removing the html directory created by Apache:
cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wsexport/tool.git
cd /var/www/tool
- Become the root user with
sudo su root
- Run
sudo composer install --no-dev
- Edit the config.php file that was created in the previous step. The main things to change are the database credentials and the temporary directory location.
- Make sure that all the files in the repo are owned by www-data:
sudo chown -R www-data:www-data .
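The config.php step above mentions changing the database credentials and the temporary directory location. A hypothetical sketch of what that might look like — the key names and values here are illustrative placeholders, not taken from the tool; match whatever structure the generated config.php actually contains:

```
// Hypothetical sketch only — adjust to the structure of the generated config.php.
$wsexportConfig = [
    'dbUser'   => 's52561',        // the Toolforge tool's DB user (placeholder)
    'dbPass'   => '...',           // from the tool's replica.my.cnf
    'tempPath' => '/tmp/wsexport', // temporary working directory (placeholder)
];
```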
- Create the web server configuration file at /etc/apache2/sites-available/wsexport.conf with the following:
<VirtualHost *:80>
    DocumentRoot /var/www/tool/public
    ServerName wsexport.wmflabs.org
    php_value memory_limit 512M

    # Requests with these user agents are denied.
    SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel)" bad_bot=yes

    # Calibre env vars: https://manual.calibre-ebook.com/customize.html#id1
    SetEnv CALIBRE_CONFIG_DIRECTORY /tmp/calibre-config

    CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
    ErrorLog ${APACHE_LOG_DIR}/error.log

    ScriptAlias /tool "/var/www/tool/public"
    <Directory /var/www/tool/public/>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
        DirectoryIndex book.php
    </Directory>
    <Directory /var/www/tool/>
        Options Indexes FollowSymLinks
        AllowOverride None
        Require all granted
        Deny from env=bad_bot
    </Directory>

    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory /usr/lib/cgi-bin/>
        Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Require all granted
    </Directory>

    ErrorDocument 403 "Access denied"

    RewriteEngine On
    RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
    RewriteRule .* - [R=403,L]
    RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
    RewriteRule .* - [R=403,L]
    RewriteCond %{HTTP:X-Forwarded-Proto} !https
    RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
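The SetEnvIfNoCase line in the configuration above is just a case-insensitive regular-expression match against the User-Agent header; matching requests get tagged bad_bot, which the Deny from env=bad_bot and the CustomLog expr= lines then act on. A minimal sketch of the same matching logic, using grep -i and only a small subset of the full pattern:

```shell
# Illustration only: case-insensitive match of a User-Agent against a
# small subset of the bad_bot pattern (grep -i mirrors SetEnvIfNoCase).
ua='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
if printf '%s' "$ua" | grep -qiE 'uCrawler|Baiduspider|Googlebot|bingbot'; then
    echo 'bad_bot=yes'
else
    echo 'bad_bot unset'
fi
```

Before enabling the site in the next step, sudo apachectl configtest is a quick way to catch syntax errors in this file.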
- Disable the event MPM, enable the PHP and rewrite Apache modules, and enable the web server configuration:
sudo a2dismod mpm_event
sudo a2enmod php7.3
sudo a2enmod rewrite
sudo a2ensite wsexport
sudo service apache2 reload
- (Re)start Apache:
sudo service apache2 restart
- Moving forward, you should use sudo service apache2 graceful to restart the server.
- Set up annual log dump files by running the following script weekly (it's located at /usr/local/bin/wsexport-dump-logs.sh, and note that you have to put the tool's DB credentials into /etc/mysql/conf.d/wsexport.cnf):
#!/bin/bash
YEAR="$1"
if [ -z "$YEAR" ]; then
    YEAR=$( date +%Y )
fi
LOGDIR=/var/www/tool/public/logs
echo "Dumping logs of $YEAR to $LOGDIR"
mysqldump --defaults-file=/etc/mysql/conf.d/wsexport.cnf \
    --host=tools.db.svc.eqiad.wmflabs \
    s52561__wsexport_p books_generated \
    --where="YEAR(time) = $YEAR" \
    | gzip -c > $LOGDIR/$YEAR.sql.gz
ls -l $LOGDIR