You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Nova Resource:Wikisource/Wikimedia OCR

Revision as of 01:42, 15 April 2021 by imported>Samwilson (Created page with "This page documents how to set up the Wikimedia OCR project. __TOC__ == Web server == Install and configure Apache and PHP. <syntaxhighlight lang="bash"> sudo apt -y insta...")

This page documents how to set up the Wikimedia OCR project.

Web server

Install and configure Apache and PHP.

sudo apt -y install php php-common php-cli php-fpm php-json php-xml php-intl php-curl apache2 libapache2-mod-php
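After installation it may be worth confirming that the extensions named in the packages above are actually loaded. The following is a minimal sketch and not part of the original setup: the helper name missing_php_exts is hypothetical, and it assumes php -m prints one module name per line.

```shell
# Hypothetical helper (not from the original page): reads a module list on
# stdin, one name per line as printed by `php -m`, and prints every required
# extension that is absent from it.
missing_php_exts() {
    # Lower-case the listing so the comparison is case-insensitive.
    loaded=$(tr '[:upper:]' '[:lower:]')
    for ext in "$@"; do
        printf '%s\n' "$loaded" | grep -qxF "$ext" || printf '%s\n' "$ext"
    done
}
```

A possible invocation is php -m | missing_php_exts json xml intl curl, which prints nothing when all four extensions are present.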

Create the web server configuration file at /etc/apache2/sites-available/wikimediaocr.conf with the following:

<VirtualHost *:80>
        DocumentRoot /var/www/tool/public
        ServerName ocr.wmcloud.org
        
        php_value memory_limit 512M

        # Requests with these user agents are denied.
        SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap)" bad_bot=yes

        CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
        ErrorLog ${APACHE_LOG_DIR}/error.log

        <Directory /var/www/tool/public/>
             Options Indexes FollowSymLinks
             AllowOverride All
             Require all granted
             DirectoryIndex index.php
             RewriteEngine On
             RewriteRule ^index\.php$ - [L]
             RewriteCond %{REQUEST_FILENAME} !-f
             RewriteCond %{REQUEST_FILENAME} !-d
             RewriteRule . /index.php [L]
        </Directory>

        <Directory /var/www/tool/>
                Options Indexes FollowSymLinks
                AllowOverride None
                Require all granted
                Deny from env=bad_bot
        </Directory>

        ErrorDocument 403 "Access denied"
        RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
        RewriteRule .* - [R=403,L]
        RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
        RewriteRule .* - [R=403,L]
        
        RewriteEngine On
        RewriteCond %{HTTP:X-Forwarded-Proto} !https
        RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
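Once the site has been enabled and Apache restarted (see the steps that follow), the user-agent rules can be spot-checked from the server itself. This is a sketch under stated assumptions: the helper name status_for_agent and the default URL http://localhost/ are hypothetical. It sends the Host header for the ServerName above and sets X-Forwarded-Proto to https so that the HTTPS redirect does not fire before the access rules are evaluated.

```shell
# Hypothetical helper (not from the original page): prints the HTTP status
# code Apache returns for a given User-Agent string.
#   $1  User-Agent to send
#   $2  URL to request (assumed default: http://localhost/)
status_for_agent() {
    curl -s -o /dev/null -w '%{http_code}' \
        -A "$1" \
        -H 'Host: ocr.wmcloud.org' \
        -H 'X-Forwarded-Proto: https' \
        "${2:-http://localhost/}"
}
```

With the configuration above, status_for_agent 'Googlebot' and status_for_agent 'Wget/1.20' would be expected to report 403, while an ordinary browser user agent should not.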

Set PHP configuration in /etc/php/7.3/mods-available/wikimediaocr.ini:

max_execution_time = 60

And enable it with sudo phpenmod wikimediaocr
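Checking the setting with the command-line php binary can be misleading, because the CLI always runs with max_execution_time set to 0. One alternative is to look for the links that phpenmod creates. The sketch below is an assumption-laden illustration: the helper name sapis_with_module is hypothetical, and it assumes the usual Debian layout in which phpenmod links the file as <priority>-<name>.ini inside each SAPI's conf.d directory.

```shell
# Hypothetical helper (not from the original page): lists the SAPIs
# (e.g. apache2, cli, fpm) whose conf.d directory holds a link for a module.
#   $1  PHP version directory (assumed to be /etc/php/7.3 on this host)
#   $2  module name, e.g. wikimediaocr
sapis_with_module() {
    for link in "$1"/*/conf.d/*-"$2".ini; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$link" ] || continue
        basename "$(dirname "$(dirname "$link")")"
    done
}
```

Running sapis_with_module /etc/php/7.3 wikimediaocr should include apache2 in its output if the module was enabled for the web server.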

Disable the event MPM, then enable the PHP and rewrite modules and the new site configuration:

sudo a2dismod mpm_event
sudo a2enmod php7.3
sudo a2enmod rewrite
sudo a2ensite wikimediaocr
sudo service apache2 reload

(Re)start Apache:

sudo service apache2 restart
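After the restart, the loaded modules can be checked against what the steps above were meant to enable. This is only a sketch: the helper name apache_modules_loaded is hypothetical, and the module identifiers rewrite_module and php7_module are assumptions about how apache2ctl -M names them on this system.

```shell
# Hypothetical helper (not from the original page): reads `apache2ctl -M`
# output on stdin and succeeds only if every module named in the arguments
# appears in that listing.
apache_modules_loaded() {
    listing=$(cat)
    for mod in "$@"; do
        printf '%s\n' "$listing" | grep -qwF "$mod" || return 1
    done
}
```

For example, sudo apache2ctl -M | apache_modules_loaded rewrite_module php7_module returns a zero exit status when both modules are listed.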

From then on, use sudo service apache2 graceful to restart the server, so that requests already in progress are allowed to finish.

Tool

Clone the repository into /var/www/tool (the directory that the DocumentRoot above points into), first removing the html/ directory created by Apache.

cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wikimedia/wikimedia-ocr.git tool
cd /var/www/tool

Tesseract

sudo apt install tesseract-ocr-all
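To confirm that the language data was installed, the output of tesseract --list-langs can be counted. The helper below is a hypothetical sketch, not part of the original instructions; it assumes the listing starts with a "List of available languages" header followed by one language code per line.

```shell
# Hypothetical helper (not from the original page): counts the language
# codes in `tesseract --list-langs` output read from stdin, ignoring the
# header line and any blank lines.
count_tesseract_langs() {
    grep -v '^List of available languages' | grep -c .
}
```

A possible invocation is tesseract --list-langs 2>&1 | count_tesseract_langs; the redirect is there because some Tesseract versions write the listing to standard error.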

Google OCR