Currently being updated for Ubuntu 14.04
Preparing Wikipedia, step by step:
On June 14th 2013, the snapshot was 9.9 GB to download, took up 42 GB once unpacked, and the resulting MySQL database was 38 GB. The snapshot contained 13,538,988 articles.
The August 8th 2013 snapshot (used for the 14.04 deployment) was 9.4 GB. It can be obtained by torrent here. It contained 13,715,113 articles and took 70 hours to load on a 2 GB laptop with a 2.4 GHz dual core.
You need to apt-get install the following:
apt-get install apache2 mysql-server mysql-client php5 php5-mysql php5-gd php5-intl php5-xcache
Enable the xcache module: in /etc/php5/apache2/conf.d/xcache.ini, put the following line:
extension=/usr/lib/php5/20121212+lfs/xcache.so
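After adding that line, restart Apache so the extension is actually loaded. A quick sanity check (assuming the default Apache document root on 14.04, /var/www/html) is a throwaway phpinfo page:
sudo service apache2 restart
# create a temporary phpinfo page and confirm an XCache section shows up
echo '<?php phpinfo();' | sudo tee /var/www/html/info.php
wget -qO- http://localhost/info.php | grep -i xcache
sudo rm /var/www/html/info.php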
You should also install phpmyadmin for debugging etc.
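For example, straight from the Ubuntu archive (the package will ask which web server to configure; pick apache2):
sudo apt-get install phpmyadmin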
In order to have Wikipedia working, we need Wikimedia's own version of MediaWiki, which has all the extensions (>600) bundled in. This can be achieved by fetching the latest stable release. For instance, for release 1.21:
cd /usr/local/fair
mkdir -p mediawiki
cd mediawiki/
wget http://dumps.wikimedia.org/mediawiki/1.21/mediawiki-1.21.1.tar.gz
tar xvfz mediawiki-*.tar.gz
mv mediawiki-1.21.1 mediawiki
cd /var/www/html
ln -s /usr/local/fair/mediawiki/mediawiki wiki
Now, navigate to http://localhost/wiki and create a fresh installation of MediaWiki.
Name the database wikipedia.
Make sure to name the administrative user admin with password fair, so that others can access the installation.
Use the UTF-8 character set and the MyISAM engine. Do not use InnoDB: it is far too slow for this purpose, and we ran into broken constraints in the source Wikipedia dumps (a quick way to verify the engine afterwards is shown below).
Creative Commons Attribution-ShareAlike is the license footer to use; it is what Wikipedia uses, too.
Make sure there are no articles, otherwise the import will fail trying to create an already existing article.
Delete everything from the page, revision, pagelinks and text tables (the SQL for this is shown in the import section below).
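A quick way to double-check that the installer actually created MyISAM tables in the wikipedia database (a plain information_schema query, nothing MediaWiki-specific):
mysql -u root -p -e "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'wikipedia';"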
Add the following to LocalSettings.php:
$wgThumbLimits = array(300);
$wgDefaultUserOptions['imagesize'] = 0;
$wgImageLimits = array (array(1000,1000));
$wgHashedUploadDirectory = true;
You also need to remove the default images/ directory and symlink in the real one with all the images, once they have been obtained.
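A minimal sketch, assuming the image dump was unpacked to /var/fair/data/wikipedia_images (a hypothetical path; substitute wherever you actually put the files):
cd /var/www/html/wiki
sudo rm -rf images
sudo ln -s /var/fair/data/wikipedia_images images
# Apache (www-data) needs read access to the linked files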
You should also obtain Wikipedia's CSS (the contents of the English Wikipedia's MediaWiki:Common.css page) and copy-paste it into http://localhost/wiki/MediaWiki:Common.css
These are the relevant extensions from Wikimedia's list of common extensions; clone them from within the wiki's extensions/ directory (/var/www/html/wiki/extensions):
http://meta.wikimedia.org/wiki/Wikimedia_extensions
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CategoryTree.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CharInsert.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Cite.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ExpandTemplates.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ImageMap.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/InputBox.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/OAI.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/OggHandler.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Oversight.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/PagedTiffHandler.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ParserFunctions.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SiteMatrix.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SyntaxHighlight_GeSHi.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/wikihiero.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/GeoData.git
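The clones above track master; if one of them turns out to be incompatible with MediaWiki 1.21, the extension repositories also carry release branches. A sketch for switching all of them to REL1_21, run from the extensions/ directory:
cd /var/www/html/wiki/extensions
for ext in */ ; do
    ( cd "$ext" && git checkout REL1_21 ) || echo "no REL1_21 branch in $ext"
done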
Afterwards, something more complicated is needed:
sudo apt-get install lua5.1
This installs Lua, which is used by the Scribunto extension; Scribunto is responsible for rendering large parts of Wikipedia's template machinery, such as the infoboxes found on almost all articles. Get an appropriate version of the extension here.
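Scribunto itself is not in the clone list above; fetching it the same way as the other extensions (the same gerrit layout is assumed) looks like this:
cd /var/www/html/wiki/extensions
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Scribunto.git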
Then put this in LocalSettings.php
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';
The final list of extensions in LocalSettings.php is this:
require_once( "$IP/extensions/Cite/Cite.php" );
require_once( "$IP/extensions/Gadgets/Gadgets.php" );
require_once( "$IP/extensions/InputBox/InputBox.php" );
require_once( "$IP/extensions/Nuke/Nuke.php" );
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/PdfHandler/PdfHandler.php" );
require_once( "$IP/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.php" );
require_once( "$IP/extensions/Vector/Vector.php" );
require_once( "$IP/extensions/CategoryTree/CategoryTree.php" );
require_once( "$IP/extensions/CharInsert/CharInsert.php" );
require_once( "$IP/extensions/Cite/Cite.php" );
require_once( "$IP/extensions/ExpandTemplates/ExpandTemplates.php" );
require_once( "$IP/extensions/ImageMap/ImageMap.php" );
require_once( "$IP/extensions/InputBox/InputBox.php" );
require_once( "$IP/extensions/OAI/OAI.php" );
require_once( "$IP/extensions/OggHandler/OggHandler.php" );
require_once( "$IP/extensions/Oversight/Oversight.php" );
require_once( "$IP/extensions/PagedTiffHandler/PagedTiffHandler.php" );
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/SiteMatrix/SiteMatrix.php" );
require_once( "$IP/extensions/TemplateInfo/TemplateInfo.php" );
require_once( "$IP/extensions/wikihiero/wikihiero.php" );
require_once( "$IP/extensions/Scribunto/Scribunto.php" );
require_once( "$IP/extensions/GeoData/GeoData.php" );
From maintenance/, run php update.php. It creates additional tables needed by some of the extensions. It is important to run this command with everything correctly set up before loading articles; otherwise it will take days to complete after the articles are loaded.
First, you need to get mwimporter -- a version that works:
You may need to correct the script to accept the generator of your dump, as the script is updated at a slower pace than the dumps.
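Since mwimporter versions differ, a hedged way to locate that check is to grep for where the script validates the generator or export schema version, then relax the matched string so it accepts your dump:
# find the generator / schema version check in the importer
grep -n -i 'generator\|export-0' mwimporter.pl
# edit the matched line(s) so the version of your dump (e.g. export-0.8) is accepted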
Assuming that the snapshot is placed in /var/fair/data, do the following:
cd /var/www/html/wiki/maintenance/
cat /var/fair/data/enwiki-20130604-pages-articles-multistream.xml | perl mwimporter.pl | mysql -u root -p --default-character-set=utf8 wikipedia
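Since the import runs for days, it is worth running it detached and with progress output. A sketch using pv (apt-get install pv) in place of cat, with the MySQL password given on the command line so the pipeline does not block on a prompt (YOURPASSWORD is a placeholder):
nohup sh -c "pv /var/fair/data/enwiki-20130604-pages-articles-multistream.xml \
  | perl mwimporter.pl \
  | mysql -u root -pYOURPASSWORD --default-character-set=utf8 wikipedia" \
  > import.log 2>&1 &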
Should the import fail or need to be restarted, empty the tables first:
DELETE FROM revision;
DELETE FROM page;
DELETE FROM pagelinks;
DELETE FROM text;
The following failure has occurred multiple times and seems related to an empty title:
ERROR 1062 (23000) at line 6931308: Duplicate entry '0-' for key 'name_title'
These are the constraints (indexes) enforced on the page table in MyISAM:
+-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| page | 0 | PRIMARY | 1 | page_id | A | 2309270 | NULL | NULL | | BTREE | | |
| page | 0 | name_title | 1 | page_namespace | A | NULL | NULL | NULL | | BTREE | | |
| page | 0 | name_title | 2 | page_title | A | 2309270 | NULL | NULL | | BTREE | | |
| page | 1 | page_random | 1 | page_random | A | 2309270 | NULL | NULL | | BTREE | | |
| page | 1 | page_len | 1 | page_len | A | 2309270 | NULL | NULL | | BTREE | | |
| page | 1 | page_redirect_namespace_len | 1 | page_is_redirect | A | 24566 | NULL | NULL | | BTREE | | |
| page | 1 | page_redirect_namespace_len | 2 | page_namespace | A | 36655 | NULL | NULL | | BTREE | | |
| page | 1 | page_redirect_namespace_len | 3 | page_len | A | 2309270 | NULL | NULL | | BTREE | | |
+-------+------------+-----------------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
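The '0-' in the duplicate-key error is namespace 0 plus an empty page_title, so the clashing rows can be inspected before deciding whether to drop them or patch the dump; for example:
SELECT page_id, page_namespace, page_title FROM page WHERE page_title = '';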
TODO:
Final output line from mwimporter, after 61 hours of importing a 40 GB XML dump (enwiki - June 14th 2013, pages-articles multistream)
13538988 pages ( 61.288/s), 13538988 revisions ( 61.288/s) in 220909 seconds
For the XML dump of August 8th 2013:
13715113 pages ( 52.465/s), 13715113 revisions ( 52.465/s) in 261414 seconds
You can pass --include filters to rsync; however, there may still be a huge redundancy in the files downloaded. Simply put all the file names in the table image. You do not need to fill in anything but the image name field (img_name).
You can create a list of all the files sort of like this:
find wikipedia_images/ -type d -exec ls -1 {} \; > filenames.lst
Then you can quickly see how many files there were:
wc -l filenames.lst
1549297 filenames.lst
Then you can load it, sort of like this:
mysql> TRUNCATE image;
Query OK, 0 rows affected (0.00 sec)
mysql> ALTER TABLE image DISABLE KEYS;
Query OK, 0 rows affected (0.00 sec)
mysql> LOCK TABLES image WRITE;
Query OK, 0 rows affected (0.00 sec)
mysql> LOAD DATA INFILE 'filenames.lst' INTO TABLE image (img_name);
Query OK, 3153470 rows affected, 2 warnings (47 min 13.24 sec)
Records: 3153470 Deleted: 0 Skipped: 0 Warnings: 2
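Since the keys were disabled and the table locked before the load, remember to undo both once it finishes; re-enabling the keys rebuilds the indexes, which takes a while on a table this size:
mysql> ALTER TABLE image ENABLE KEYS;
mysql> UNLOCK TABLES;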