Difference between revisions of "Nova Resource:Wikisource/Documentation"

From Wikitech-static
(update for .env, tool name, and new var FS)
(Cache pruning.)

Revision as of 02:17, 22 February 2021



Wikisource is a VPS project for Wikisource-related tools. At the moment, it hosts only the Wikisource Export tool (see below).

Project status

currently running

Contact address


Wikisource Export

We have two VPS instances and two Toolforge tools (one each for prod and test). The latter exist because WS Export used to be hosted there, and the VPSs still use those tools' databases and email addresses.

Creating a new instance

Create a new m1.large instance running on Debian Buster (or m1.small for staging instances). Once the instance has been spawned, SSH in and follow these steps:

  1. Install PHP and Apache, along with some dependencies:
    sudo apt update && sudo apt -y upgrade
    sudo apt -y install php php-common
    sudo apt -y install php-cli php-fpm php-json php-xml php-mysql php-sqlite3 php-intl php-zip php-mbstring php-curl php-imagick
    sudo apt -y install apache2 libapache2-mod-php libapache2-mod-perl2 zip unzip default-mysql-client libgl1
  2. Install Calibre by following the upstream installation instructions (which recommend against using the packaged version because it can be out of date). Note that Calibre can fail to clean up its temp files in some situations, so we also add the following in /etc/cron.daily/calibre-cleanup:
    find /tmp -path '*calibre*tmp*' -user www-data -mtime +1 -exec rm -r {} \;
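    Before installing the cleanup job, it can help to see which paths the expression matches. The following is a minimal sketch, assuming GNU find and touch. It drops the -user www-data test and swaps -exec rm for -print so that it can run as any user against a scratch directory. The function name list_stale_calibre_tmp and the scratch layout are illustrative only.

```shell
#!/bin/bash
# Dry run of the cleanup expression: print matches instead of deleting them.
# Assumption: GNU find/touch. The -user test is omitted so any user can run this.
list_stale_calibre_tmp() {
    local root="$1"
    find "$root" -path '*calibre*tmp*' -mtime +1 -print
}

# Build a scratch tree with one stale and one fresh Calibre-style temp file.
scratch_calibre=$(mktemp -d)
mkdir -p "$scratch_calibre/calibre_5.0_tmp_old" "$scratch_calibre/calibre_5.0_tmp_new"
touch -d '3 days ago' "$scratch_calibre/calibre_5.0_tmp_old/stale.txt"
touch "$scratch_calibre/calibre_5.0_tmp_new/fresh.txt"

# Only the stale file should be listed.
list_stale_calibre_tmp "$scratch_calibre"
```

    Note that find rounds file age down to whole days, so -mtime +1 only matches files that are at least two days old.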
  3. Install some fonts. Mostly these are available in the Debian repositories, but the Mukta family must be installed manually to maintain backwards compatibility (these used to be packaged with the tool's code).
    sudo apt -y install fonts-freefont-ttf fonts-linuxlibertine fonts-dejavu-core fonts-gubbi
    wget 'https://fonts.google.com/download?family=Mukta' -O Mukta.zip
    wget 'https://fonts.google.com/download?family=Mukta%20Mahee' -O MuktaMahee.zip
    wget 'https://fonts.google.com/download?family=Mukta%20Malar' -O MuktaMalar.zip
    wget 'https://fonts.google.com/download?family=Mukta%20Vaani' -O MuktaVaani.zip
    sudo unzip Mukta.zip -d /usr/local/share/fonts/Mukta
    sudo unzip MuktaMahee.zip -d /usr/local/share/fonts/MuktaMahee
    sudo unzip MuktaMalar.zip -d /usr/local/share/fonts/MuktaMalar
    sudo unzip MuktaVaani.zip -d /usr/local/share/fonts/MuktaVaani
    sudo fc-cache -v
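    The four download and unzip pairs above follow one pattern, which could be expressed as a loop. This is only a sketch: mukta_url and mukta_dir are hypothetical helper names, and the loop prints the commands instead of running them so that it is safe to try anywhere.

```shell
#!/bin/bash
# Sketch: derive the download URL and target directory for each Mukta family.
# The helper names are hypothetical; the URL pattern is the one used above.
mukta_url() {
    # Encode spaces as %20, matching the URLs above.
    local family="$1"
    printf 'https://fonts.google.com/download?family=%s\n' "${family// /%20}"
}

mukta_dir() {
    # Strip spaces to get the directory name, e.g. "Mukta Mahee" -> MuktaMahee.
    printf '/usr/local/share/fonts/%s\n' "${1// /}"
}

for family in "Mukta" "Mukta Mahee" "Mukta Malar" "Mukta Vaani"; do
    # Print the commands rather than executing them.
    echo "wget '$(mukta_url "$family")' -O '${family// /}.zip'"
    echo "sudo unzip '${family// /}.zip' -d '$(mukta_dir "$family")'"
done
```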
  4. Install composer by following these instructions, but make sure to install to the /usr/local/bin directory and with the filename composer, e.g.:
    sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
  5. Clone the repository, first removing the html directory created by Apache.
    cd /var/www && sudo rm -rf html
    sudo git clone https://github.com/wikimedia/ws-export.git tool
    cd /var/www/tool
  6. Become the root user with sudo su root
  7. Add a block storage filesystem at /ws-export/ with a directory in it symlinked from the tool's var/ directory:
    mkdir /ws-export/var
    chown -R www-data:www-data /ws-export/var
    ln -s /ws-export/var /var/www/tool/var
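    If var/ already exists as a real directory, ln -s creates the link inside it instead of replacing it. A guarded variant might look like the sketch below. link_dir is a hypothetical helper, and the paths are scratch stand-ins for /ws-export/var and /var/www/tool/var.

```shell
#!/bin/bash
# Sketch: create the var/ symlink only when nothing real is already in the way.
# link_dir is a hypothetical helper; the paths below are scratch stand-ins.
link_dir() {
    local target="$1" link="$2"
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing: $link exists and is not a symlink" >&2
        return 1
    fi
    # -n treats an existing symlink as a plain file, so -f replaces it in place.
    ln -sfn "$target" "$link"
}

scratch_link=$(mktemp -d)
mkdir -p "$scratch_link/ws-export/var" "$scratch_link/tool"
link_dir "$scratch_link/ws-export/var" "$scratch_link/tool/var"

# Show where the new link points.
readlink "$scratch_link/tool/var"
```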
  8. Run sudo composer install --no-dev -o
  9. Copy .env to .env.local and edit the environment variables in it.
  10. Make sure that all the files in the repo are owned by www-data.
    sudo chown -R www-data:www-data .
  11. Create the web server configuration file at /etc/apache2/sites-available/wsexport.conf with the following:
    <VirtualHost *:80>
            ServerName wsexport.wmflabs.org
            Redirect / https://ws-export.wmcloud.org/
    </VirtualHost>
    <VirtualHost *:80>
            DocumentRoot /var/www/tool/public
            ServerName wsexport.wmcloud.org
            php_value memory_limit 512M
            # Requests with these user agents are denied.
            SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot)" bad_bot=yes
            # Calibre env vars: https://manual.calibre-ebook.com/customize.html#id1
            SetEnv CALIBRE_CONFIG_DIRECTORY /tmp/calibre-config
            CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
            ErrorLog ${APACHE_LOG_DIR}/error.log
            ScriptAlias /tool "/var/www/tool/public"
            Redirect /wikisource-fr-good.atom /opds/fr/Bon_pour_export.xml
            Redirect /opds/fr.xml /opds/fr/Bon_pour_export.xml
            <Directory /var/www/tool/public/>
                 Options Indexes FollowSymLinks
                 AllowOverride All
                 Require all granted
                 DirectoryIndex index.php book.php
                 # Rewrite URLs for Symfony:
                 RewriteEngine On
                 RewriteRule ^index\.php$ - [L]
                 RewriteCond %{REQUEST_FILENAME} !-f
                 RewriteCond %{REQUEST_FILENAME} !-d
                 RewriteRule . /index.php [L]
            </Directory>
            <Directory /var/www/tool/>
                    Options Indexes FollowSymLinks
                    AllowOverride None
                    Require all granted
                    Deny from env=bad_bot
            </Directory>
            Alias /awstats /var/www/awstats/wwwroot
            <Files ~ "\.(pl|cgi)$">
                SetHandler perl-script
                PerlResponseHandler ModPerl::PerlRun
                Options +ExecCGI
                PerlSendHeader On
            </Files>
            <Directory "/var/www/awstats/wwwroot">
                AddHandler cgi-script .pl
                Options -Indexes
                DirectoryIndex awstats.pl
                RedirectMatch "/awstats/$" "/awstats/cgi-bin/awstats.pl"
            </Directory>
            ErrorDocument 403 "Access denied"
            RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
            RewriteRule .* - [R=403,L]
            RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
            RewriteRule .* - [R=403,L]
            RewriteEngine On
            RewriteCond %{HTTP:X-Forwarded-Proto} !https
            RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
    </VirtualHost>
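    To check whether a given User-Agent would be caught by the deny pattern in the configuration above, something like the following sketch can be used. It copies only a few alternatives from the full pattern, and it assumes that grep -E with -i is a close enough stand-in for Apache's case-insensitive regex matching in SetEnvIfNoCase. is_bad_bot is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: test a User-Agent against a shortened subset of the deny pattern.
# Assumption: grep -E approximates Apache's regex engine for these alternatives,
# and -i mirrors the case-insensitivity of SetEnvIfNoCase.
bad_bot_pattern='uCrawler|Baiduspider|CCBot|scrapy\.org|Googlebot|bingbot|python-requests|Java/|^R \(|libcurl'

is_bad_bot() {
    # Exit status 0 means the User-Agent matches the pattern.
    printf '%s\n' "$1" | grep -Eiq -- "$bad_bot_pattern"
}

for ua in "Mozilla/5.0 (compatible; Googlebot/2.1)" \
          "python-requests/2.25.1" \
          "Mozilla/5.0 (X11; Linux x86_64) Firefox/85.0"; do
    if is_bad_bot "$ua"; then
        echo "deny:  $ua"
    else
        echo "allow: $ua"
    fi
done
```

    After editing the real configuration file, sudo apachectl configtest reports syntax errors before Apache is reloaded.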
  12. Set PHP configuration in /etc/php/7.3/mods-available/wsexport.ini:
    max_execution_time = 60;
    And enable it with sudo phpenmod wsexport
  13. Disable the mpm_event Apache module, enable the PHP and rewrite modules, enable the web server configuration, and reload Apache:
    sudo a2dismod mpm_event
    sudo a2enmod php7.3
    sudo a2enmod rewrite
    sudo a2ensite wsexport
    sudo service apache2 reload
  14. (Re)start Apache:
    sudo service apache2 restart
    Moving forward, you should use sudo service apache2 graceful to restart the server.
  15. Install AWStats:
    1. Download from https://www.awstats.org/ and extract to /var/www/awstats/ (so it contains wwwroot/, tools/, etc.)
    2. Make sure the web server user can read logs at /var/log/apache2/access.log* (set create 644 root adm in /etc/logrotate.d/apache2)
    3. Set the following in /var/www/awstats/wwwroot/cgi-bin/awstats.ws-export.wmcloud.org.conf:
      LogFile="/var/www/awstats/tools/logresolvemerge.pl /var/log/apache2/access.log* |"
      LogFormat="%host %time1 %methodurl %code %refererquot %uaquot"
    4. Set up the cronjob to compile the stats daily:
      @daily perl /var/www/awstats/wwwroot/cgi-bin/awstats.pl -config=ws-export.wmcloud.org
  16. Add a cronjob to prune the cache twice a day:
    00 1,13 * * * /usr/local/bin/wsexport-prune-cache.sh
    Where the script is the following:
    df /ws-export/
    /usr/bin/php /var/www/tool/bin/console cache:pool:prune
    df /ws-export/
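    The two df calls in the script above bracket the prune so that the cron output shows disk usage before and after. If a single figure is preferred, the difference could be computed as in this sketch. used_kb and report_freed are hypothetical helpers, and / stands in for /ws-export/ so that the example runs anywhere.

```shell
#!/bin/bash
# Sketch: report how much space a prune run freed on a filesystem.
# used_kb and report_freed are hypothetical helpers; / stands in for /ws-export/.
used_kb() {
    # -P forces one output line per filesystem; column 3 is used space in 1K blocks.
    df -Pk "$1" | awk 'NR==2 {print $3}'
}

report_freed() {
    local before="$1" after="$2"
    echo "freed $(( before - after )) KB"
}

before=$(used_kb /)
# The cache:pool:prune command from the script above would run here.
after=$(used_kb /)
report_freed "$before" "$after"
```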
  17. Set up annual log dump files by running the following weekly (it's located at /usr/local/bin/wsexport-dump-logs.sh, and note that you have to put the tool's DB credentials into /etc/mysql/conf.d/wsexport.cnf):
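    # LOGDIR must point at the dump directory; its value is not given on this page.
    : "${LOGDIR:?LOGDIR must be set}"
    # Default to the current year when YEAR is not supplied.
    if [ -z "$YEAR" ]; then
      YEAR=$( date +%Y )
    fi
    echo "Dumping logs of $YEAR to $LOGDIR"
    mysqldump --defaults-file=/etc/mysql/conf.d/wsexport.cnf \
            --host=tools.db.svc.eqiad1.wikimedia.cloud \
            s52561__wsexport_p books_generated \
            --where="YEAR(time) = $YEAR" \
            | gzip -c > "$LOGDIR/$YEAR.sql.gz"
    ls -l "$LOGDIR"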
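    The year handling in the dump script can be tried without touching the database. The sketch below isolates that logic. resolve_year and dump_path are hypothetical helper names, and /tmp/logs is a placeholder, not the real log directory.

```shell
#!/bin/bash
# Sketch: default the year to the current one and build the dump file path.
# resolve_year and dump_path are hypothetical; /tmp/logs is a placeholder directory.
resolve_year() {
    local year="$1"
    if [ -z "$year" ]; then
        year=$(date +%Y)
    fi
    echo "$year"
}

dump_path() {
    # $1 is the log directory, $2 the year.
    echo "$1/$2.sql.gz"
}

# No year given: falls back to the current year.
dump_path /tmp/logs "$(resolve_year "")"
# Explicit year: useful for regenerating an older dump.
dump_path /tmp/logs "$(resolve_year 2020)"
```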