Wednesday 21 September 2016

Loading Wikipedia in to ElasticSearch

http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch

This article gives instructions for loading Wikipedia articles in to ElasticSearch. I did this on Windows, but all of these steps should work on any java friendly platform.
  1. Download ElasticSearch
  2. Download stream2es
  3. Download Wikipedia articles
  4. Start ElasticSearch
  5. Run stream2es


Download ElasticSearch

Go to Elastic.co and download ElasticSearch here: https://www.elastic.co/downloads/elasticsearch
Download and unzip the elasticsearch download in to a folder of your choice.

Download stream2es

Go here and download the stream2es java jar file: http://download.elasticsearch.org/stream2es/stream2es
Optional: See stream2es on github for options: https://github.com/elastic/stream2es

Download Wikipedia articles

Go here and download the wikipedia article archive: https://dumps.wikimedia.org/enwiki/latest/
There are many options, but the specific one I downloaded was this: enwiki-latest-pages-articles.xml.bz2
(It's over 12GB, so be sure you have plenty of disk space.)

Start ElasticSearch

I'm on Windows, so I opened a command window and ran this: 
c:\elasticsearch-1.5.2\bin\elasticsearch.bat
That starts up your local ElasticSearch instance at localhost:9200

Run stream2es

  • Move the stream2es file to your ElasticSearch bin folder. I put stream2es here c:\elasticsearch-1.5.2\bin\
  • Move the Wikipedia archive (enwiki-latest-pages-articles.xml.bz2) to your ElasticSearch bin folder too.
  • Run the stream2es java file: 
C:\elasticsearch-1.5.2\bin>java -jar stream2es wiki --target http://localhost:9200/mywiki --log debug --source /enwiki-latest-pages-articles.xml.bz2

Notes:
  1. You can change the "mywiki" to whatever you want your specific ElasticSearch index name to be.
  2. I had some trouble getting stream2es to find my wikipedia archive path on Windows, but the / in front of the file name worked.


I ran this all local on my Windows desktop, and it took 6-8 hours. It appears to be locked up near the end, but it did eventually exit. 

Now, you should have over 16 million Wikipedia articles loaded in to your local ElasticSearch index. Enjoy.

No comments:

Post a Comment