Nova Resource:Wikisource/Documentation

Wikisource is a VPS project for Wikisource-related tools.

Wikisource Export

We have two VPS instances and two Toolforge tools (one each for prod and test). The Toolforge tools exist because wsexport used to be hosted there, and the VPS instances still use those tools' databases and email addresses.

Creating a new instance

Create a new m1.large instance running on Debian Buster (or m1.small for staging instances). Once the instance has been spawned, SSH in and follow these steps:
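
The SSH hostname depends on the instance name and the Cloud VPS DNS scheme in use; it will look something like the following (the name below is a placeholder):
    ssh <instance-name>.wikisource.eqiad1.wikimedia.cloud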

  1. Install PHP and Apache, along with some dependencies:
    sudo apt update && sudo apt -y upgrade
    sudo apt -y install php php-common
    sudo apt -y install php-cli php-fpm php-json php-xml php-mysql php-sqlite3 php-intl php-zip php-mbstring php-curl
    sudo apt -y install apache2 libapache2-mod-php zip unzip default-mysql-client
    
  2. Install Calibre by following the upstream installation instructions (the Calibre developers recommend not using the distribution's packaged version because it can be out of date).
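    At the time of writing, Calibre's own install page gives a one-line binary installer roughly like the following (check the current command there before running it):
    sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.sh | sudo sh /dev/stdin
    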
  3. Install Composer by following the instructions on getcomposer.org, making sure to install to the /usr/local/bin directory with the filename composer, e.g.:
    sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
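    A fuller sketch of the installer steps, following getcomposer.org/download (compare the printed hash against the value published on that page before running the installer):
    php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');"
    php -r "var_dump(hash_file('sha384', 'composer-setup.php'));"
    sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
    php -r "unlink('composer-setup.php');"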
    
  4. Clone the repository, first removing the html directory created by Apache.
    cd /var/www && sudo rm -rf html
    sudo git clone https://github.com/wsexport/tool.git
    cd /var/www/tool
    
  5. Become the root user with sudo su root
  6. Run sudo composer install --no-dev
  7. Edit the config.php file that was created in the previous step. The main things to change are the database credentials and the temporary directory location.
  8. Make sure that all the files in the repo are owned by www-data.
    sudo chown -R www-data:www-data .
    
  9. Create the web server configuration file at /etc/apache2/sites-available/wsexport.conf with the following:
    <VirtualHost *:80>
            DocumentRoot /var/www/tool/public
            ServerName wsexport.wmflabs.org
            
            php_value memory_limit 512M
    
            # Requests with these user agents are denied.
            SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel)" bad_bot=yes
    
            CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
            ErrorLog ${APACHE_LOG_DIR}/error.log
    
            ScriptAlias /tool "/var/www/tool/public"
            <Directory /var/www/tool/public/>
                 Options Indexes FollowSymLinks
                 AllowOverride All
                 Require all granted
                 DirectoryIndex book.php
            </Directory>
    
            <Directory /var/www/tool/>
                    Options Indexes FollowSymLinks
                    AllowOverride None
                    Require all granted
                    Deny from env=bad_bot
            </Directory>
    
            ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
            <Directory /usr/lib/cgi-bin/>
                    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
                    Require all granted
            </Directory>
    
            ErrorDocument 403 "Access denied"
            RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
            RewriteRule .* - [R=403,L]
            RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
            RewriteRule .* - [R=403,L]
            
            RewriteEngine On
            RewriteCond %{HTTP:X-Forwarded-Proto} !https
            RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
    </VirtualHost>
    
  10. Disable the event MPM, enable the PHP and mod-rewrite Apache modules, and enable the web server configuration.
    sudo a2dismod mpm_event
    sudo a2enmod php7.3
    sudo a2enmod rewrite
    sudo a2ensite wsexport
    sudo service apache2 reload
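    # Optional sanity check (standard Apache tooling, not part of the original steps):
    # verify the configuration syntax before reloading.
    sudo apache2ctl configtest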
    
  11. (Re)start Apache:
    sudo service apache2 restart
    
    Moving forward, you should use sudo apache2ctl graceful to restart the server.
  12. Set up annual log dump files by running the following script weekly (save it as /usr/local/bin/wsexport-dump-logs.sh, and note that you have to put the tool's DB credentials into /etc/mysql/conf.d/wsexport.cnf; a sketch of that file and a cron entry follow the script):
    #!/bin/bash
    YEAR="$1"
    if [ -z "$YEAR" ]; then
      YEAR=$( date +%Y )
    fi
    LOGDIR=/var/www/tool/public/logs
    echo "Dumping logs of $YEAR to $LOGDIR"
    mysqldump --defaults-file=/etc/mysql/conf.d/wsexport.cnf \
            --host=tools.db.svc.eqiad.wmflabs \
            s52561__wsexport_p books_generated \
            --where="YEAR(time) = $YEAR" \
            | gzip -c > $LOGDIR/$YEAR.sql.gz
    ls -l $LOGDIR
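    
    The credentials file uses the standard MySQL option-file format; the user name and password below are placeholders for the tool's actual credentials. Create /etc/mysql/conf.d/wsexport.cnf along these lines:
    [client]
    user = s52561
    password = REDACTED
    
    To run the dump weekly, make the script executable (sudo chmod +x /usr/local/bin/wsexport-dump-logs.sh) and add a root crontab entry (sudo crontab -e); the day and time below are arbitrary:
    0 3 * * 0 /usr/local/bin/wsexport-dump-logs.sh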