Elastic Stack Lab12
In this lab we are going to look at using client libraries to import data and interact with the Elasticsearch API.
Python - import data
Start by making sure the unzip
utility is installed so we can unzip the movies file.
sudo apt-get install -y unzip
Now let’s download the MovieLens data.
wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
After downloading we just need to unzip
unzip ml-latest-small.zip
Now let’s download the Python
script we’ll be using for manipulating and importing the movie data.
wget http://bit.ly/es-python-import -O MoviesToJson.py
The Python
script opens up the movies CSV file, loads the data into a dictionary and then writes out a new output file with the fields needed for importing.
Now let’s run the script to convert the movies.csv
to a JSON
format.
python3 MoviesToJson.py > moremovies.json
Review moremovies.json
and confirm it looks correct.
You should see a format similar to:
{ "create" : { "_index": "movies", "_type": "movie", "_id" : "1" } }
{ "id": "1", "title": "Toy Story", "year":1995, "genre":["Adventure","Animation","Children","Comedy","Fantasy"] }
{ "create" : { "_index": "movies", "_type": "movie", "_id" : "2" } }
{ "id": "2", "title": "Jumanji", "year":1995, "genre":["Adventure","Children","Fantasy"] }
While this dataset is considered “small” it still contains over 18,000 lines! This is a lot more data to play around with than we’ve had to this point.
Now we need to delete our movies
index so we can reimport using our new dataset.
curl -XDELETE 127.0.0.1:9200/movies
After deleting the index let’s go ahead and recreate it with the new data.
curl -XPUT 127.0.0.1:9200/_bulk --data-binary @moremovies.json
Now that everything was imported successfully let’s do a search and see what is returned. In this example we are going to use the “URI Search”.
curl -XGET "127.0.0.1:9200/movies/_search?q=title:ice%20&pretty"
Alright, so we got back a ton of results! Now that we have a lot more data to play around with we should start to see better relevancy results.
Now let’s go ahead and use a client side library to interact with the API instead of curl
Python - ratings
To run our Python
script we must first install pip
and some modules from it.
sudo apt install -y python3-pip
Now that pip
is installed let’s use it to install the Python
Elasticsearch module.
sudo pip3 install elasticsearch
After that’s complete we need to download the script.
wget http://bit.ly/es-py-ratings -O IndexRatings.py
Now that we’ve downloaded that it’s time to run it!
python3 IndexRatings.py
Ok, now let’s use our trusty curl
command to confirm everything was created successfully.
curl -XGET 127.0.0.1:9200/ratings/_search?pretty
And if everything went smoothly we should see a list of movies with their ratings.
We can import many other files as well with simple edits to our Python
script.
Python - tags
Copy the IndexRatings.py
script to IndexTags.py
and modify it to import all of the data from the tags.csv
file.
python3 IndexTags.py
After that completes confirm the data was imported as expected.
curl -XGET 127.0.0.1:9200/tags/_search?pretty
Confirm the tags were imported successfully.