*** Currently being updated for Ubuntu 14.04 ***

Preparing Wikipedia, step by step:

[TOC]

# Downloading a snapshot

 - Download English versions from [here](http://dumps.wikimedia.org/enwiki/)
 - Unpack the snapshot with bunzip2.
 - It is advisable to complete this whole guide with a **different** and smaller dump than the English Wikipedia; try the Danish one or some other obscure language (see the example after this list).
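
For example, the latest Danish dump can be fetched and unpacked like this (a sketch; the exact filename depends on the dump you pick):

    wget http://dumps.wikimedia.org/dawiki/latest/dawiki-latest-pages-articles.xml.bz2
    bunzip2 dawiki-latest-pages-articles.xml.bz2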

The June 14th 2013 snapshot was 9.9 GB to download, took up 42 GB once unpacked, and the resulting MySQL database was 38 GB. It contained **13,538,988** articles.

The August 8th 2013 snapshot (used for the 14.04 deployment) was 9.4 GB to download; it can be obtained by torrent [here](http://academictorrents.com/details/30ac2ef27829b1b5a7d0644097f55f335ca5241b). It contained **13,715,113** articles and took 70 hours to load on a laptop with 2 GB of RAM and a 2.4 GHz dual-core CPU.

# Server requirements

You need to apt-get install the following:

    apt-get install apache2 mysql-server mysql-client php5 php5-mysql php5-gd php5-intl php5-xcache

Enable the xcache module by putting the following line in `/etc/php5/apache2/conf.d/xcache.ini`:

    extension=/usr/lib/php5/20121212+lfs/xcache.so
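
Restart Apache afterwards so the change takes effect; xcache should then show up in the output of `phpinfo()`:

    service apache2 restart    # the module list in phpinfo() should now include XCache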

You should also install **phpmyadmin** for debugging etc.

# Installing Mediawiki

In order to have Wikipedia working, we need MediaWiki plus the extensions that Wikipedia itself runs; the relevant ones are added in the *Adding extensions* section below. Start by downloading the latest stable MediaWiki release, in this case 1.21.1:

    cd /usr/local/fair
    mkdir -p mediawiki
    cd mediawiki/
    wget http://dumps.wikimedia.org/mediawiki/1.21/mediawiki-1.21.1.tar.gz
    tar xvfz mediawiki-*.tar.gz
    mv mediawiki-1.21.1 mediawiki
    cd /var/www/html
    ln -s /usr/local/fair/mediawiki/mediawiki wiki

Now, navigate to [http://localhost/wiki](http://localhost/wiki) and create a fresh installation of MediaWiki.

## Configuration

Name the database **wikipedia**.

Make sure to name the administrative user **admin** and set its password to **fair** so that others can access the installation.

Use the **UTF-8** character set and the **MyISAM** storage engine. Do not use InnoDB: it is far too slow for this purpose, and we experienced broken constraints with the source Wikipedia dumps.

Use *Creative Commons Attribution Share Alike* as the footer license; that is what Wikipedia uses, too.

## After configuring

Make sure there are no articles in the wiki, otherwise the import will fail when trying to create an article that already exists.

Delete everything from the **page**, **revision**, **pagelinks**, and **text** tables (see the SQL under *Starting over* below).

Add the following to **LocalSettings.php**:

    $wgThumbLimits = array(300);               // thumbnail widths selectable in user preferences
    $wgDefaultUserOptions['imagesize'] = 0;    // default index into $wgImageLimits
    $wgImageLimits = array (array(1000,1000)); // maximum image size on image description pages
    $wgHashedUploadDirectory = true;           // store uploads in hashed subdirectories (images/a/ab/...)

You also need to remove the default `images/` directory and symlink in the real one, containing all the images, once they have been obtained (see *Media files* below).
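
A minimal sketch of that step, assuming the media files end up in `/var/fair/wikipedia_images` (a hypothetical path; cf. the `wikipedia_images/` directory used in the *Media files* section):

    cd /var/www/html/wiki
    mv images images.orig                     # keep the stock directory around, just in case
    ln -s /var/fair/wikipedia_images images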

You should also obtain [Wikipedia's CSS](https://en.wikipedia.org/w/index.php?title=MediaWiki:Common.css&action=edit) and copy-paste it into [http://localhost/wiki/MediaWiki:Common.css](http://localhost/wiki/MediaWiki:Common.css).

# Adding extensions

These are the relevant extensions from Wikimedia's list of common extensions, http://meta.wikimedia.org/wiki/Wikimedia_extensions. Clone them into the `extensions/` directory of the MediaWiki installation (e.g. `/var/www/html/wiki/extensions/`):

    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CategoryTree.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CharInsert.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Cite.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ExpandTemplates.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ImageMap.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/InputBox.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/OAI.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/OggHandler.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Oversight.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/PagedTiffHandler.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ParserFunctions.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SiteMatrix.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SyntaxHighlight_GeSHi.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/wikihiero.git
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/GeoData.git

Afterwards, something more complicated is needed:

    sudo apt-get install lua5.1

This installs Lua, which is used by the Scribunto extension. Scribunto renders large parts of Wikipedia's template machinery, such as the infoboxes found on almost all articles. Get an appropriate version of the extension [here](http://www.mediawiki.org/wiki/Extension:Scribunto#Bundled_binaries).

Then put this in LocalSettings.php:

    $wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';
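
A quick sanity check that the interpreter Scribunto will invoke actually exists:

    ls -l /usr/bin/lua5.1
    lua5.1 -v    # should print something like "Lua 5.1.5"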


The final list of extensions in LocalSettings.php is this:

    require_once( "$IP/extensions/Cite/Cite.php" );
    require_once( "$IP/extensions/Gadgets/Gadgets.php" );
    require_once( "$IP/extensions/InputBox/InputBox.php" );
    require_once( "$IP/extensions/Nuke/Nuke.php" );
    require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
    require_once( "$IP/extensions/PdfHandler/PdfHandler.php" );
    require_once( "$IP/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.php" );
    require_once( "$IP/extensions/Vector/Vector.php" );

    require_once( "$IP/extensions/CategoryTree/CategoryTree.php" );
    require_once( "$IP/extensions/CharInsert/CharInsert.php" );
    require_once( "$IP/extensions/Cite/Cite.php" );
    require_once( "$IP/extensions/ExpandTemplates/ExpandTemplates.php" );
    require_once( "$IP/extensions/ImageMap/ImageMap.php" );
    require_once( "$IP/extensions/InputBox/InputBox.php" );
    require_once( "$IP/extensions/OAI/OAI.php" );
    require_once( "$IP/extensions/OggHandler/OggHandler.php" );
    require_once( "$IP/extensions/Oversight/Oversight.php" );
    require_once( "$IP/extensions/PagedTiffHandler/PagedTiffHandler.php" );
    require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
    require_once( "$IP/extensions/SiteMatrix/SiteMatrix.php" );
    require_once( "$IP/extensions/TemplateInfo/TemplateInfo.php" );
    require_once( "$IP/extensions/wikihiero/wikihiero.php" );
    require_once( "$IP/extensions/Scribunto/Scribunto.php" );
    require_once( "$IP/extensions/GeoData/GeoData.php" );


# After adding extensions but before loading articles!

From `maintenance/`, run `php update.php`. It creates the additional tables needed by some of the extensions. It is important to run this command, with everything correctly set up, **before** loading articles; otherwise it will take days to complete once the articles are loaded.
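
With the paths used earlier in this guide, that amounts to:

    cd /var/www/html/wiki/maintenance
    php update.php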

# Loading the XML snapshot

First, you need to get [mwimporter](http://meta.wikimedia.org/wiki/Data_dumps/mwimport); a version that is known to work is attached: [attachment:108]

You may need to adjust the script to accept the generator version stated in your dump, as the script is updated at a slower pace than the dumps.

Assuming that the snapshot is placed in `/var/fair/data`, do the following:

    cd /var/www/html/wiki/maintenance/
    cat /var/fair/data/enwiki-20130604-pages-articles-multistream.xml | perl mwimporter.pl | mysql -u root -p --default-character-set=utf8 wikipedia
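
The import can run for several days (see the *Logs* section below). Progress can be monitored from another shell; a simple check, assuming the database is named **wikipedia** as above:

    mysql -u root -p wikipedia -e "SELECT COUNT(*) FROM page;"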

## Starting over

If the import fails partway through, empty the affected tables and run the import again:

    DELETE FROM revision;
    DELETE FROM page;
    DELETE FROM pagelinks;
    DELETE FROM text;

## Resolving issues with the dump

The following failure has occurred multiple times and appears to be related to an empty title:

    ERROR 1062 (23000) at line 6931308: Duplicate entry '0-' for key 'name_title'

These are the constraints enforced on the **page** table in MyISAM (output of `SHOW INDEX FROM page`):

    +-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table | Non_unique | Key_name                    | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | page  |          0 | PRIMARY                     |            1 | page_id          | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          0 | name_title                  |            1 | page_namespace   | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          0 | name_title                  |            2 | page_title       | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          1 | page_random                 |            1 | page_random      | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          1 | page_len                    |            1 | page_len         | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          1 | page_redirect_namespace_len |            1 | page_is_redirect | A         |       24566 |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          1 | page_redirect_namespace_len |            2 | page_namespace   | A         |       36655 |     NULL | NULL   |      | BTREE      |         |               |
    | page  |          1 | page_redirect_namespace_len |            3 | page_len         | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
    +-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
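
One way to inspect the conflicting rows before re-running the import; a sketch that assumes the empty title really is the culprit ('0-' in the error message corresponds to namespace 0 plus an empty page_title):

    mysql -u root -p wikipedia -e "SELECT page_id, page_namespace, page_title FROM page WHERE page_title = '';"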

# Optimizing database

**TODO**:

 - Make tables read-only


# Logs

Final output line from mwimporter, after 61 hours of importing the 40 GB XML dump ([enwiki, June 14th 2013, pages-articles multistream](http://dumps.wikimedia.org/enwiki/20130604/)):

     13538988 pages ( 61.288/s),  13538988 revisions ( 61.288/s) in 220909 seconds

Corresponding output for the [August 8th 2013](http://academictorrents.com/details/30ac2ef27829b1b5a7d0644097f55f335ca5241b) XML dump:

    13715113 pages ( 52.465/s),  13715113 revisions ( 52.465/s) in 261414 seconds


# Media files

## Approaches

1. Use the SQL dump of the **image** table. This generates a list of all images uploaded to a specific language version of Wikipedia. Problem: images uploaded to language A may be used on language B.
2. Parse the XML dump and look for `[[File:...]]` links (see the sketch after this list). Problem: less efficient than approach 1. Advantage: finds all, and only, the files that are actually used.
3. Rsync with your.org. Problem: fetches tons of files that are not even in use. Adding `--include` patterns to rsync mitigates this, but there may still be a huge redundancy in the files downloaded.
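
A rough sketch of approach 2, assuming GNU grep and the unpacked Danish dump from the example near the top of this guide; file links in other forms (e.g. `[[Image:...]]` or files pulled in via templates) are not covered:

    grep -o '\[\[File:[^]|]*' dawiki-latest-pages-articles.xml \
        | sed 's/^\[\[File://' \
        | sort -u > used_files.lst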

## Restoring image database

Simply put all the file names in the `image` table. You do not need to fill in anything but the image name (`img_name`) field.

You can create a list of all the file names, for instance like this:

    find wikipedia_images/ -type d -exec ls -1 {} \; > filenames.lst
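
An alternative that lists only regular files, one basename per line (a sketch, assuming GNU find):

    find wikipedia_images/ -type f -printf '%f\n' > filenames.lst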

Then you can quickly see how many files there are:

    wc -l filenames.lst 
    1549297 filenames.lst

Then you can load the list, for instance like this:

    mysql> TRUNCATE image;
    Query OK, 0 rows affected (0.00 sec)

    mysql> ALTER TABLE image DISABLE KEYS;
    Query OK, 0 rows affected (0.00 sec)

    mysql> LOCK TABLES image WRITE;
    Query OK, 0 rows affected (0.00 sec)

    mysql> LOAD DATA INFILE 'filenames.lst' INTO TABLE image (img_name);
    Query OK, 3153470 rows affected, 2 warnings (47 min 13.24 sec)
    Records: 3153470  Deleted: 0  Skipped: 0  Warnings: 2
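
Once the load has finished, run `UNLOCK TABLES;` in the same session to release the write lock, and re-enable the indexes so they get rebuilt. If the interactive session has already been closed (which releases the lock on its own), the keys can still be rebuilt afterwards, for example like this:

    mysql -u root -p wikipedia -e "ALTER TABLE image ENABLE KEYS;"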