Starting off with big data, basically, every small thing is a task and it takes time and energy to understand stuff.
What is Hadoop
Big data is so big that at one point it will be impossible to keep it on the same disk. So, you have to have a distributed system to manage this amount of data.
Hadoop takes a big data file and distributes it on different clusters with replication because it is inevitable that some clusters will crash or have problems.
Hadoop is based on a framework developed by Google but it was Yahoo that propagated and helped the open source project.
What is Hive
Apache Hive is a system that helps data scientist to perform SQL like queries on data files stored on Hadoop.
For this reason, it is super-useful as it basically converts stored data files into SQL databases. For this exercise, we are going to with the ml-100k from movie lens.
In case you are doing the Udemy course and can’t set up the local environment, you can use the following tutorial to learn how to use hive using this tutorial.
Setting up Hive with Google Cloud
- Setting up the storage bucket
We are going to be using Cloud Dataproc, this allows us to use google storage buckets in parallel with HDFS.
To setup, a storage bucket go to storage under the storage section in google console and click on browser.
Now create a new bucket by giving it a name and selecting a location for the bucket. Use the same region as your Hadoop cluster. This helps in decreasing the lag between the two systems.
2) Upload your files to the google storage bucket
Next, step is pretty straight forward, upload files to the storage bucket. The files we are interested in are u.data and u.item from the ml-100k database.
u.item contains the information about the release dates and u.data has the rating information for each movie.
We can look more into how the data in the files is structured when we work with them using Hive.
3. Check for the files in the SSH client of the Hadoop cluster
We are going to use gsutil a Python application that lets you access Cloud Storage from the command line.
This is the path for the files in the google storage bucket, gs://[BUCKET_NAME]/[OBJECT_NAME]
Using this command you can check if the files are there:-
gsutil ls gs://pavhadooptest/
Here is the output you will want to see.
4. Go to the ssh client of the master node of the Hadoop cluster
In the ssh client, we can use the beeline service to use the Hive SQL query.
beeline -u "jdbc:hive2://localhost:10000/default"
Hive runs on localhost at port 10000.
5) Create Tables in Hive
We are going to get the u.data file into Hive, for the sake of this tutorial. Here is a look at the file, it has four columns, user id of the person giving the rating, movie id of the movie for which the rating was given, rating and the timestamp when the rating was given.
Here are the main components of a create command in Hive.
CREATE TABLE – Like SQL query we define are creating the table or accessing information
IF NOT EXIST – Do not overwrite the table if it already exists
u_data – Name of the data table we are creating
userid INT, movieid INT, rating INT, unixtime TIMESTAMP – Columns of the data Table, along with the data type
ROW FORMAT DELIMITED – How the data is stored in the file
FIELDS TERMINATED BY ‘\t’ – Delimiter used to separate the fields
STORED AS TEXTFILE – Type of file from which we are reading the information
LOCATION ‘gs://pavhadooptest/’ – Directory from which we are reading the information
Some time, when you are truly working with big data, rather than creating a copy you can give hive the managerial control of the data, this will delete the data from the storage bucket once the table is created, use the following command.
If you drop the table from hive the content will be deleted as hive is now the manager for the content.
CREATE TABLE IF NOT EXISTS u_data ( userid INT, movieid INT, rating INT, unixtime TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION 'gs://pavhadooptest/'
On the other hand, if you want to keep the file in the storage bucket, use ‘Create external table’ rather than ‘Create table’.
CREATE EXTERNAL TABLE IF NOT EXISTS u_data ( userid INT, movieid INT, rating INT, unixtime TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION 'gs://pavhadooptest/'
By using this query we can get the data loaded as a table in the Hadoop cluster.
6) Run queries on Hive to find your answers
Now we are all ready to run other hive queries to learn from the data table. Let’s first try to see just one row of data in the table
Select * from u_data LIMIT 1;
Now let’s find the top 5 movies which were rated the most number of times. Hive works in the same way as SQL so we will write a select query and order it by ratings , then sort it in descending order and limit it to 5 to get the top 5 rows only.
SELECT movieid, count(rating) as ratings from u_data GROUP BY movieid ORDER BY ratings DESC LIMIT 5;
Here is the final output.
7) Create views in Hive
Another cool thing you can do with hive is to create views which is a virtual table that we can create based on a sql query.
From our u.data table that we have loaded in hive let’s create a view of movieId, count of number of times the movie was rated and the average movie rating.
CREATE VIEW IF NOT EXISTS ratedmovies (movieid, countrating, avgrating) AS SELECT movieid, count(rating) as countrating, avg(rating) as avgrating FROM u_data GROUP BY movieid;
With this query we can create a new view called ratedmovies with three columns movieid, countrating, avgrating.
To find the movies what are rated more then 10 times, we can just issue a query on the table.
SELECT * FROM ratedmovies WHERE countrating >10
This will give you filtered view of movies that have been rated more than 10 times.
Now you are ready to play around with google data storage and Hadoop clusters using hive. Have fun exploring.