Difference between revisions of "Nova Resource:Wikisource/Documentation"

From Wikitech-static
(update for .env, tool name, and new var FS)
(3 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Nova Project Documentation
{{Nova Project Documentation
| Project Title = Wikisource
| Project Title = Wikisource
| Description = '''Wikisource''' is a VPS project for Wikisource-related tools. At the moment, it hosts only the [[#Wikisource Export]] tool (see below).
| Description = '''Wikisource''' is a VPS project for Wikisource-related tools. At the moment, it hosts the '''Wikisource Export''' and '''Wikimedia OCR''' tools (see below). It is maintained by the WMF Community Tech team.
| Purpose =  
| Purpose =  
| Anticipated Traffic Level =  
| Anticipated Traffic Level =  
| Anticipated Time Span =  
| Anticipated Time Span =  
| Project Status = currently running
| Project Status = currently running
| Contact Address = [[phab:tag/tool-wsexport]]
| Contact Address =  
* [[phab:tag/tool-wsexport]]
* [[phab:tag/wikimedia-ocr]]
| Willing to take contributors or not =  
| Willing to take contributors or not =  
| Subject area narrow or broad =  
| Subject area narrow or broad =  
| Extra Information =  
| Extra Information =  
=== Projects ===

* [[Nova Resource:Wikisource/Wikisource Export|Wikisource Export]]
* [[Nova Resource:Wikisource/Wikimedia OCR|Wikimedia OCR]]
* [[Nova Resource:Wikisource/IA Upload|IA Upload]]
== Wikisource Export ==
{{Hatnote|'''URL''': https://ws-export.wmcloud.org<br/>'''Staging URL''': https://ws-export-test.wmcloud.org<br/>'''Source''': https://github.com/wikimedia/ws-export<br/>'''License''': GPL-2.0}}
We have two VPS instances and two Toolforge tools (one each for prod and test). The latter exist because WS Export used to be hosted there, and the VPSs still use those tools' databases and email addresses.
=== Creating a new instance ===
Create a new m1.large instance running on Debian Buster (or m1.small for staging instances). Once the instance has been spawned, SSH in and follow these steps:
# Install PHP and Apache, along with some dependencies:<syntaxhighlight lang="bash">
sudo apt update && sudo apt -y upgrade
sudo apt -y install php php-common
sudo apt -y install php-cli php-fpm php-json php-xml php-mysql php-sqlite3 php-intl php-zip php-mbstring php-curl php-imagick
sudo apt -y install apache2 libapache2-mod-php libapache2-mod-perl2 zip unzip default-mysql-client libgl1
</syntaxhighlight>
# Install Calibre by following [https://calibre-ebook.com/download_linux these instructions] (the Calibre project recommends against the distribution package because it can be out of date). Calibre can fail to clean up its temp files in some situations, so we also add the following in <code>/etc/cron.daily/calibre-cleanup</code>:<syntaxhighlight lang="bash">
find /tmp -path '*calibre*tmp*' -user www-data -mtime +1 -exec rm -r {} \;
</syntaxhighlight>
# Install some fonts. Most are available in the Debian repositories, but the Mukta family must be installed manually to maintain backwards compatibility (these fonts used to be packaged with the tool's code).<syntaxhighlight lang="bash">
sudo apt -y install fonts-freefont-ttf fonts-linuxlibertine fonts-dejavu-core fonts-gubbi
# Quote the URLs so the shell does not treat "?" as a glob character.
wget 'https://fonts.google.com/download?family=Mukta' -O Mukta.zip
wget 'https://fonts.google.com/download?family=Mukta%20Mahee' -O MuktaMahee.zip
wget 'https://fonts.google.com/download?family=Mukta%20Malar' -O MuktaMalar.zip
wget 'https://fonts.google.com/download?family=Mukta%20Vaani' -O MuktaVaani.zip
sudo unzip Mukta.zip -d /usr/local/share/fonts/Mukta
sudo unzip MuktaMahee.zip -d /usr/local/share/fonts/MuktaMahee
sudo unzip MuktaMalar.zip -d /usr/local/share/fonts/MuktaMalar
sudo unzip MuktaVaani.zip -d /usr/local/share/fonts/MuktaVaani
# Rebuild the font cache so the new fonts are picked up.
sudo fc-cache -v
</syntaxhighlight>
# Install Composer by following [https://getcomposer.org/ these instructions], but make sure to install to the <code>/usr/local/bin</code> directory and with the filename <code>composer</code>, e.g.:<syntaxhighlight lang="bash">
sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
</syntaxhighlight>
# Clone the repository, first removing the html directory created by Apache.<syntaxhighlight lang="bash">
cd /var/www && sudo rm -rf html
# Clone into "tool" so that the /var/www/tool paths used below exist.
sudo git clone https://github.com/wikimedia/ws-export.git tool
cd /var/www/tool
</syntaxhighlight>
# Become the root user with <code>sudo su root</code>
# [[Help:Adding_Disk_Space_to_Cloud_VPS_instances#Cinder:_Attachable_Block_Storage_for_cloud-vps|Add a block storage filesystem]] at <code>/ws-export/</code> with a directory in it symlinked from the tool's <code>var/</code> directory: <syntaxhighlight lang="bash">
mkdir /ws-export/var
chown -R www-data:www-data /ws-export/var
ln -s /ws-export/var /var/www/tool/var
</syntaxhighlight>
# Run <code>sudo composer install --no-dev -o</code>
# Copy <code>.env</code> to <code>.env.local</code> and edit the environment variables in it.
# Make sure that all the files in the repo are owned by www-data.<syntaxhighlight lang="bash">
sudo chown -R www-data:www-data .
</syntaxhighlight>
# Create the web server configuration file at <code>/etc/apache2/sites-available/wsexport.conf</code> with the following:<syntaxhighlight lang="apacheconf">
<VirtualHost *:80>
        ServerName wsexport.wmflabs.org
        Redirect / https://ws-export.wmcloud.org/
</VirtualHost>
<VirtualHost *:80>
        DocumentRoot /var/www/tool/public
        ServerName wsexport.wmcloud.org
        php_value memory_limit 512M
        # Requests with these user agents are denied.
        SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot)" bad_bot=yes
        # Calibre env vars: https://manual.calibre-ebook.com/customize.html#id1
        SetEnv CALIBRE_CONFIG_DIRECTORY /tmp/calibre-config
        CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
        ErrorLog ${APACHE_LOG_DIR}/error.log
        ScriptAlias /tool "/var/www/tool/public"
        Redirect /wikisource-fr-good.atom /opds/fr/Bon_pour_export.xml
        Redirect /opds/fr.xml /opds/fr/Bon_pour_export.xml
        <Directory /var/www/tool/public/>
            Options Indexes FollowSymLinks
            AllowOverride All
            Require all granted
            DirectoryIndex index.php book.php
            # Rewrite URLs for Symfony:
            RewriteEngine On
            RewriteRule ^index\.php$ - [L]
            RewriteCond %{REQUEST_FILENAME} !-f
            RewriteCond %{REQUEST_FILENAME} !-d
            RewriteRule . /index.php [L]
        </Directory>
        <Directory /var/www/tool/>
                Options Indexes FollowSymLinks
                AllowOverride None
                Require all granted
                Deny from env=bad_bot
        </Directory>
        Alias /awstats /var/www/awstats/wwwroot
        <Files ~ "\.(pl|cgi)$">
            SetHandler perl-script
            PerlResponseHandler ModPerl::PerlRun
            Options +ExecCGI
            PerlSendHeader On
        </Files>
        <Directory "/var/www/awstats/wwwroot">
            AddHandler cgi-script .pl
            Options -Indexes
            DirectoryIndex awstats.pl
            RedirectMatch "/awstats/$" "/awstats/cgi-bin/awstats.pl"
        </Directory>
        ErrorDocument 403 "Access denied"
        RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
        RewriteRule .* - [R=403,L]
        RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
        RewriteRule .* - [R=403,L]
        RewriteEngine On
        RewriteCond %{HTTP:X-Forwarded-Proto} !https
        RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
</syntaxhighlight>
#Set PHP configuration in <code>/etc/php/7.3/mods-available/wsexport.ini</code>:<syntaxhighlight lang="ini">
max_execution_time = 60;
</syntaxhighlight>And enable it with <code>sudo phpenmod wsexport</code>
#Disable the event MPM, enable the PHP and rewrite Apache modules, and enable the web server configuration.<syntaxhighlight lang="bash">
sudo a2dismod mpm_event
sudo a2enmod php7.3
sudo a2enmod rewrite
sudo a2ensite wsexport
sudo service apache2 reload
</syntaxhighlight>
#(Re)start Apache:<syntaxhighlight lang="bash">
sudo service apache2 restart
</syntaxhighlight>
#:Moving forward, you should use <code>sudo service apache2 graceful</code> to restart the server.
# Install AWStats:
## Download from https://www.awstats.org/ and extract to <code>/var/www/awstats/</code> (so it contains <code>wwwroot/</code>, <code>tools/</code>, etc.)
## Make sure the web server user can read logs at <code>/var/log/apache2/access.log*</code> (set <code>create 644 root adm</code> in <code>/etc/logrotate.d/apache2</code>)
## Set the following in <code>/var/www/awstats/wwwroot/cgi-bin/awstats.ws-export.wmcloud.org.conf</code>:<syntaxhighlight>
LogFile="/var/www/awstats/tools/logresolvemerge.pl /var/log/apache2/access.log* |"
LogFormat="%host %time1 %methodurl %code %refererquot %uaquot"
</syntaxhighlight>
## Set up the cronjob to compile the stats daily:<syntaxhighlight>
@daily perl /var/www/awstats/wwwroot/cgi-bin/awstats.pl -config=ws-export.wmcloud.org
</syntaxhighlight>
#Set up annual log dump files by running the following script weekly. It is located at <code>/usr/local/bin/wsexport-dump-logs.sh</code>, and the tool's DB credentials must be put into <code>/etc/mysql/conf.d/wsexport.cnf</code>:<syntaxhighlight lang="bash">
# Default to the current year when YEAR is not set.
if [ -z "$YEAR" ]; then
  YEAR=$( date +%Y )
fi
# LOGDIR (the dump directory) must already be set; its value is not shown here.
echo "Dumping logs of $YEAR to $LOGDIR"
# Dump only the rows for the selected year and compress the result.
mysqldump --defaults-file=/etc/mysql/conf.d/wsexport.cnf \
        --host=tools.db.svc.eqiad1.wikimedia.cloud \
        s52561__wsexport_p books_generated \
        --where="YEAR(time) = $YEAR" \
        | gzip -c > "$LOGDIR/$YEAR.sql.gz"
ls -l "$LOGDIR"
</syntaxhighlight>
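
As a rough illustration of the log dump step above, the sketch below separates the year defaulting and output-path logic into small functions that can be checked without touching the database. The function names <code>resolve_year</code> and <code>dump_target</code> and the example directory <code>/tmp/wsexport-logs</code> are assumptions made for illustration; they are not part of the tool or of the script at <code>/usr/local/bin/wsexport-dump-logs.sh</code>. The four-digit check is an added safeguard, on the assumption that the year is interpolated into the <code>--where</code> clause as in the step above.

```shell
# Hypothetical helpers for illustration only; not part of ws-export.

# Print the year to dump: the first argument if non-empty, otherwise the
# current year. Anything that is not exactly four digits is rejected,
# because the value ends up inside an SQL WHERE clause.
resolve_year() {
  year="${1:-$(date +%Y)}"
  case "$year" in
    [0-9][0-9][0-9][0-9]) printf '%s\n' "$year" ;;
    *) echo "invalid year: $year" >&2; return 1 ;;
  esac
}

# Print the path of the compressed dump for a directory ($1) and year ($2).
dump_target() {
  printf '%s/%s.sql.gz\n' "$1" "$2"
}
```

For example, <code>dump_target /tmp/wsexport-logs "$(resolve_year "$YEAR")"</code> would print the file that the <code>gzip</code> step writes to for that year.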

Latest revision as of 02:36, 15 June 2021



Wikisource is a VPS project for Wikisource-related tools. At the moment, it hosts the Wikisource Export and Wikimedia OCR tools (see below). It is maintained by the WMF Community Tech team.

Project status

currently running

Contact address