Why we need to work with HBase
HDFS is good for sequential data access, but it lacks random read/write capability. HBase runs on top of the Hadoop Distributed File System and provides random, real-time read and write access.
It is also highly fault-tolerant and well suited to storing sparse data.
Data Model in HBase
The different components of the Apache HBase data model are Tables, Rows, Column Families, Columns, Cells, and Versions.
HBase tables are made up of multiple rows, stored in the table sorted by their respective row keys.
Each row has a row key, and corresponding to it you can have one or multiple column families/columns.
Design the row key in such a way that related entities are stored in adjacent rows; this increases read efficiency.
Good key design also helps avoid hotspotting on a particular node, which happens when most of the read and write operations hit a single region server. Make sure that data that needs to be fetched together is also stored together on the data nodes. Domain names are a good example of this: since row keys are sorted lexicographically, if you do not reverse the domain name, all the subdomains of a single domain may end up on different servers.
If instead you store the reversed forms “com.cnn” and “com.cnn.us”, they will be stored together.
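As a quick sketch of why this works (plain shell, no HBase needed), lexicographic sorting keeps reversed-domain keys for the same domain adjacent, just as HBase keeps sorted row keys in contiguous regions:

```shell
# Reversed-domain row keys: both cnn.com entries sort next to each other,
# so a scan over that key prefix would stay on one contiguous region.
printf '%s\n' 'com.cnn' 'org.apache' 'com.cnn.us' 'com.bbc' | sort
# → com.bbc
#   com.cnn
#   com.cnn.us
#   org.apache
```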
In HBase you can group multiple columns together into a single column family, and you can have one or more column families for each row. However, it is recommended not to have more than 10 column families, for performance reasons.
Under a column family you can have as many column qualifiers as you want. A column is identified by a column qualifier, which is the column family name concatenated with the column name using a colon, for example: columnfamily:columnname.
HBase stores data in cells: a cell is the unique combination of row key, column family, and column qualifier, and it contains a value and a timestamp.
You can have multiple versions of the same data in HBase, and you distinguish versions by their timestamps.
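For example, in the HBase shell (the table and column names here are purely illustrative, and reading back older versions assumes the column family was created with VERSIONS greater than 1), each put writes a new timestamped version of the cell:

```
put 'mytable', 'row1', 'cf:col', 'value-1'
put 'mytable', 'row1', 'cf:col', 'value-2'
# a plain get returns only the newest version; ask explicitly for older ones
get 'mytable', 'row1', {COLUMN => 'cf:col', VERSIONS => 2}
```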
How to set up HBase in a Google Dataproc Hadoop cluster
1) Download HBase, unzip it, and then move it to the /usr/local/HBase folder:

wget https://www-us.apache.org/dist/hbase/1.4.9/hbase-1.4.9-bin.tar.gz
tar xzvf hbase-1.4.9-bin.tar.gz
sudo mv hbase-1.4.9 /usr/local/HBase/
2) Check for Java and change the hbase-env.sh file

HBase needs Java to run, so we have to provide the path to the Java installation on your server in the hbase-env.sh file. First check whether Java is already available on your server; in Google Dataproc clusters it usually is. If you don't have Java installed, use the following commands to get yourself a copy:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

After this, you can go to the /usr/local/HBase/conf/ folder and change the hbase-env.sh file.
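The change itself is to point JAVA_HOME at your JDK. A minimal sketch, assuming an OpenJDK 8 install at the usual Debian path (substitute the path that readlink -f $(which java) reports on your machine):

```shell
# in /usr/local/HBase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64  # assumed path; point this at your JDK
```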
3) Change the hbase-site.xml file in the /usr/local/HBase/conf/ folder
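The exact contents depend on your cluster; as a minimal sketch for a standalone setup, you point HBase at directories for its data and for ZooKeeper state (the paths below are assumptions, change them to suit your machine):

```xml
<!-- /usr/local/HBase/conf/hbase-site.xml -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///usr/local/HBase/hbasedata</value> <!-- assumed local path -->
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/HBase/zookeeperdata</value> <!-- assumed local path -->
  </property>
</configuration>
```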
4) Start up HBase

For starting up HBase, use the following command from the /usr/local/HBase folder:

bin/start-hbase.sh
5) Run the HBase shell

For running the HBase shell, you can either first make a change to the .bash_profile file or start the shell directly from the bin folder. For the first method, add these lines to .bash_profile:

export HBASE_HOME=/usr/local/HBase
export PATH=$PATH:$HBASE_HOME/bin

Or you can cd /usr/local/HBase/bin and use the shell command directly:

./hbase shell
Important Shell Commands in HBase
Now let us look at some important commands in HBase from the shell command line.
Create a table
This is the syntax to create a table in HBase
create '<table_name>', '<column_family_name>'
Let’s first create a userrating table
Check if it was created.
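The commands for these two steps would look roughly like the following (the column family name rating is an assumption for illustration):

```
create 'userrating', 'rating'
# list the tables to confirm 'userrating' exists
list
```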
Now let us add some data to the table: user_id 1 rates the movie with id 1 a rating of 4.
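In the HBase shell this would be a put followed by a get (the rating column family and the movie_1 qualifier are assumed names for illustration):

```
# row key '1' = user_id; column 'rating:movie_1' holds the rating for movie id 1
put 'userrating', '1', 'rating:movie_1', '4'
# read the row back
get 'userrating', '1'
```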