Currently being updated for Ubuntu 14.04

Preparing Wikipedia, step by step:

Downloading a snapshot

  • Download an English snapshot (the pages-articles dump) from dumps.wikimedia.org.
  • Unpack the snapshot with bunzip2 (see the sketch after this list).
  • It's advisable to work through this whole guide with a different, smaller dump than the English Wikipedia first... try the Danish one or some other small language.
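
A minimal sketch of the download and unpack steps, assuming the June 2013 pages-articles-multistream dump used later in this guide and the /var/fair/data directory; the exact URL is an assumption based on the usual dumps.wikimedia.org layout and should be checked against the dump listing:

cd /var/fair/data
# URL assumed from the standard dumps.wikimedia.org layout -- verify before downloading
wget http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles-multistream.xml.bz2
bunzip2 enwiki-20130604-pages-articles-multistream.xml.bz2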

On June 14th 2013, the snapshot was 9.9 GB to download, took up 42 GB once unpacked, and the resulting MySQL database was 38 GB. The snapshot contained 13,538,988 articles.

The August 8th 2013 snapshot (used for the 14.04 deployment) was 9.4 GB and can be obtained via torrent. It contained 13,715,113 articles and took 70 hours to load on a laptop with 2 GB of RAM and a 2.4 GHz dual-core CPU.

Server requirements

You need to apt-get install the following:

apt-get install apache2 mysql-server mysql-client php5 php5-mysql php5-gd php5-intl php5-xcache

Enable the xcache module: in /etc/php5/apache2/conf.d/xcache.ini, add the following line:

extension=/usr/lib/php5/20121212+lfs/xcache.so
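
Afterwards, restart Apache so the extension is picked up:

service apache2 restart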

You should also install phpmyadmin for debugging etc.
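
For example, with the stock Ubuntu package:

apt-get install phpmyadmin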

Installing Mediawiki

In order to have Wikipedia working, we need MediaWiki plus a number of the extensions that Wikimedia itself runs (more than 600 in total; the relevant ones are added below). Start by downloading the latest stable release, for instance 1.21.1:

# The install tree lives under /usr/local/fair; the web server sees it through the wiki symlink.
mkdir -p /usr/local/fair/mediawiki
cd /usr/local/fair/mediawiki
wget http://dumps.wikimedia.org/mediawiki/1.21/mediawiki-1.21.1.tar.gz
tar xvfz mediawiki-*.tar.gz
mv mediawiki-1.21.1 mediawiki
cd /var/www/html
ln -s /usr/local/fair/mediawiki/mediawiki wiki
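
MediaWiki needs write access to the images/ directory for thumbnail generation. A sketch, assuming Apache runs as www-data (the Ubuntu default):

chown -R www-data:www-data /usr/local/fair/mediawiki/mediawiki/images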

Now, navigate to http://localhost/wiki and create a fresh installation of MediaWiki.

Configuration

Name the database wikipedia.

Make sure to name the administrative user admin with the password fair, so that others can access the installation.

Use the UTF-8 character set and the MyISAM storage engine. Do not use InnoDB: it is far too slow for this purpose, and we experienced broken constraints with the source Wikipedia dumps.
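
Once the installer has created the tables, the engine choice can be double-checked, e.g.:

mysql -u root -p -e "SHOW TABLE STATUS FROM wikipedia LIKE 'page';"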

Use Creative Commons Attribution-ShareAlike as the license in the footer; that is what Wikipedia uses, too.

After configuring

Make sure there are no articles in the wiki; otherwise the import will fail when it tries to create an article that already exists.

Delete everything from the page, revision, pagelinks and text tables (see the SQL under "Starting over" below).

LocalSettings.php

$wgThumbLimits = array(300);               // single selectable thumbnail size
$wgDefaultUserOptions['imagesize'] = 0;    // default to the first (and only) entry of $wgImageLimits
$wgImageLimits = array (array(1000,1000)); // maximum image size on file description pages
$wgHashedUploadDirectory = true;           // store files in hashed subdirectories (images/a/ab/...)

You also need to remove the default images/ directory and symlink in the real one with all the images, once they have been obtained (see "Media files" below).
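
A sketch of that step; /var/fair/data/wikipedia_images is an assumed location for the downloaded images, adjust as needed:

cd /var/www/html/wiki
mv images images.orig                         # keep the empty default out of the way
ln -s /var/fair/data/wikipedia_images images  # assumed image location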

You should also obtain Wikipedia's CSS: copy the contents of https://en.wikipedia.org/wiki/MediaWiki:Common.css into http://localhost/wiki/MediaWiki:Common.css.

Adding extensions

These are the relevant extensions from Wikimedia's list of common extensions (http://meta.wikimedia.org/wiki/Wikimedia_extensions). Clone each of them into the wiki's extensions/ directory:

git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CategoryTree.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CharInsert.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Cite.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ExpandTemplates.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ImageMap.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/InputBox.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/OAI.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/OggHandler.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Oversight.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/PagedTiffHandler.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ParserFunctions.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SiteMatrix.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SyntaxHighlight_GeSHi.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/wikihiero.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/GeoData.git

Afterwards, something more complicated is needed:

sudo apt-get install lua5.1

This installs Lua, which is used by the Scribunto extension responsible for rendering large parts of Wikipedia's templates, such as the infoboxes found on almost every article. Get an appropriate version of Scribunto and place it in the extensions/ directory.
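
One way to obtain it, assuming Scribunto is hosted on Gerrit under the same pattern as the extensions above (verify the URL and pick the branch matching your MediaWiki release):

cd /var/www/html/wiki/extensions
# URL assumed to follow the same Gerrit pattern as the clones above
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Scribunto.git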

Then put this in LocalSettings.php

$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';

The final list of extensions in LocalSettings.php is this:

require_once( "$IP/extensions/Cite/Cite.php" );
require_once( "$IP/extensions/Gadgets/Gadgets.php" );
require_once( "$IP/extensions/InputBox/InputBox.php" );
require_once( "$IP/extensions/Nuke/Nuke.php" );
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/PdfHandler/PdfHandler.php" );
require_once( "$IP/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.php" );
require_once( "$IP/extensions/Vector/Vector.php" );

require_once( "$IP/extensions/CategoryTree/CategoryTree.php" );
require_once( "$IP/extensions/CharInsert/CharInsert.php" );
require_once( "$IP/extensions/Cite/Cite.php" );
require_once( "$IP/extensions/ExpandTemplates/ExpandTemplates.php" );
require_once( "$IP/extensions/ImageMap/ImageMap.php" );
require_once( "$IP/extensions/InputBox/InputBox.php" );
require_once( "$IP/extensions/OAI/OAI.php" );
require_once( "$IP/extensions/OggHandler/OggHandler.php" );
require_once( "$IP/extensions/Oversight/Oversight.php" );
require_once( "$IP/extensions/PagedTiffHandler/PagedTiffHandler.php" );
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/SiteMatrix/SiteMatrix.php" );
require_once( "$IP/extensions/TemplateInfo/TemplateInfo.php" );
require_once( "$IP/extensions/wikihiero/wikihiero.php" );
require_once( "$IP/extensions/Scribunto/Scribunto.php" );
require_once( "$IP/extensions/GeoData/GeoData.php" );

After adding extensions but before loading articles!

From the maintenance/ directory, run php update.php. It creates additional tables needed by some of the extensions. It is important to run this with everything correctly set up before loading the articles; otherwise it will take days to complete once the articles are loaded.
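
For example:

cd /var/www/html/wiki/maintenance
php update.php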

Loading the XML snapshot

First, you need to get mwimporter -- a version that works. (The attachment that originally held it here has been deleted; mwimporter is the Perl script invoked as mwimporter.pl below, which converts the XML dump into SQL.)

You may need to adjust the script to accept the generator string of your dump, as the script is updated at a slower pace than the dumps.
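
To see which generator your dump declares and where the script checks it (a sketch, reusing the dump path from below; the exact check inside mwimporter may look different):

# The <generator> element sits in the dump's <siteinfo> header
head -n 20 /var/fair/data/enwiki-20130604-pages-articles-multistream.xml | grep -i generator
# Locate the corresponding check in the import script
grep -n -i generator mwimporter.pl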

Assuming that the snapshot is placed in /var/fair/data, do the following:

cd /var/www/html/wiki/maintenance/
cat /var/fair/data/enwiki-20130604-pages-articles-multistream.xml | perl mwimporter.pl | mysql -u root -p --default-character-set=utf8 wikipedia

Starting over

If the import fails and you need to start from scratch, empty the imported tables:

DELETE FROM revision;
DELETE FROM page;
DELETE FROM pagelinks;
DELETE FROM text;

Resolving issues with the dump

The following failure has occurred multiple times and seems to be related to an empty title:

ERROR 1062 (23000) at line 6931308: Duplicate entry '0-' for key 'name_title'
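
The name_title key covers (page_namespace, page_title), so '0-' means namespace 0 combined with an empty title. To check whether such a row is already present (a sketch, assuming the wikipedia database from above):

mysql -u root -p wikipedia -e "SELECT page_id, page_namespace, page_title FROM page WHERE page_namespace = 0 AND page_title = '';"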

These are the constraints enforced on the page table in MyISAM (output of SHOW INDEX FROM page):

+-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name                    | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| page  |          0 | PRIMARY                     |            1 | page_id          | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
| page  |          0 | name_title                  |            1 | page_namespace   | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
| page  |          0 | name_title                  |            2 | page_title       | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
| page  |          1 | page_random                 |            1 | page_random      | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
| page  |          1 | page_len                    |            1 | page_len         | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
| page  |          1 | page_redirect_namespace_len |            1 | page_is_redirect | A         |       24566 |     NULL | NULL   |      | BTREE      |         |               |
| page  |          1 | page_redirect_namespace_len |            2 | page_namespace   | A         |       36655 |     NULL | NULL   |      | BTREE      |         |               |
| page  |          1 | page_redirect_namespace_len |            3 | page_len         | A         |     2309270 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Optimizing database

TODO:

  • Make tables read-only (one possible approach is sketched below)
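
A sketch of one option, not part of the original setup (the TODO may instead intend a database-level lock): switch MediaWiki itself into read-only mode via $wgReadOnly in LocalSettings.php, for example:

# adjust the path if your LocalSettings.php lives elsewhere
echo '$wgReadOnly = "This is a static snapshot of Wikipedia.";' >> /var/www/html/wiki/LocalSettings.php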

Logs

Final output line from mwimporter, after 61 hours of importing a 40 GB XML dump (enwiki - June 14th 2013, pages-articles multistream)

 13538988 pages ( 61.288/s),  13538988 revisions ( 61.288/s) in 220909 seconds

XML dump from August 8th 2013:

13715113 pages ( 52.465/s),  13715113 revisions ( 52.465/s) in 261414 seconds

Media files

Approaches

  1. Use the SQL dump of the image table. This generates a list of all images uploaded to a specific language version of Wikipedia. Problem: images from language A may be used on language B.
  2. Parse the XML dump and look for [[File:...]] links. Problem: less efficient than approach 1. Advantage: finds all and only the files that are actually used.
  3. Rsync from the your.org mirror. Problem: fetches tons of files that are not even in use. This can be mitigated by adding --include rules to rsync, but there may still be a lot of redundancy in the files downloaded.

Restoring image database

Simply put all the file names into the image table. You only need to fill in the img_name field.

You can create a list of all the files roughly like this (one bare file name per line, no paths):

find wikipedia_images/ -type d -exec ls -1 {} \; > filenames.lst

Then you can quickly see how many files were there:

wc -l filenames.lst 
1549297 filenames.lst

Afterwards, you can load it roughly like this:

mysql> TRUNCATE image;
Query OK, 0 rows affected (0.00 sec)

mysql> ALTER TABLE image DISABLE KEYS;
Query OK, 0 rows affected (0.00 sec)

mysql> LOCK TABLES image WRITE;
Query OK, 0 rows affected (0.00 sec)

mysql> LOAD DATA INFILE 'filenames.lst' INTO TABLE image (img_name);
Query OK, 3153470 rows affected, 2 warnings (47 min 13.24 sec)
Records: 3153470  Deleted: 0  Skipped: 0  Warnings: 2
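
The session above disables the keys and takes a write lock, but the follow-up is not shown; presumably the non-unique indexes were rebuilt afterwards, roughly like this (quitting the mysql session releases the WRITE lock):

mysql -u root -p wikipedia -e "ALTER TABLE image ENABLE KEYS;"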