Friday, April 15, 2016

Extracting all HBase columns

Have been fiddling with HBase? or new to HBase? and well-versed with our very favourite Hive. Here is a quick post to get all the columns in a HBase table.

Here is the usecase, Hbase is :
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Apache HBase is an open-source, distributed, versioned, non-relational database
 Since its not a relational database so no concept of tables, rows, columns. Just data, that could be in any form of key value pair. So its like a JSON where each JSONObject may have unique data regardless of any fixed structure.

So here keys are the columns and their value is tuple for that JSONObject. In realtime scenarios a HBase table has a whole lot of JSONObjects and each object has whol lots of key-values. Now the hard part is how to find how many keys are there to make it structured or to convert it to Table format or any other business use-case.

  1. Step1:
You have this HBase table hivehbase
On Hbase shell
create 'hivehbase', 'ratings'
put 'hivehbase', 'row1', 'ratings:userid', 'user1'
put 'hivehbase', 'row1', 'ratings:bookid', 'book1'
put 'hivehbase', 'row1', 'ratings:rating', '1'
 
put 'hivehbase', 'row2', 'ratings:userid', 'user2'
put 'hivehbase', 'row2', 'ratings:bookid', 'book1'
put 'hivehbase', 'row2', 'ratings:rating', '3'
 
put 'hivehbase', 'row3', 'ratings:userid', 'user2'
put 'hivehbase', 'row3', 'ratings:bookid', 'book2'
put 'hivehbase', 'row3', 'ratings:rating', '3'
 
put 'hivehbase', 'row4', 'ratings:userid', 'user2'
put 'hivehbase', 'row4', 'ratings:bookid', 'book4'
put 'hivehbase', 'row4', 'ratings:rating', '1'
 

Create column family by name 'ratings' and columns inside it

 
2.Now just get the column family names as below:

hbase(main):017:0> describe 'hivehbase'
Table hivehbase is ENABLED                                                                                                                  
hivehbase                                                                                                                                   
COLUMN FAMILIES DESCRIPTION                                                                                                                 
{NAME => 'ratings', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MI
N_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}       

Here column family name is 'ratings'
 
 
3.Follow https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration to integrate HBase to Hive and add Hbase-Hive Storage Handler to Hive libs
On the Hive shell 
hive>add jar /usr/lib/hbase/lib/hbase-common.jar;
add jar /usr/lib/hbase/lib/hbase-client.jar;
add jar /usr/lib/hbase/lib/zookeeper.jar;
add jar /usr/lib/hbase/lib/hbase-common-0.98.0.2.1.1.0-385-hadoop2-tests.jar;
add jar /usr/lib/hbase/lib/guava-12.0.1.jar;
hive> CREATE EXTERNAL TABLE hbase_table_hive(peData map, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "ratings:,:key")
TBLPROPERTIES ("hbase.table.name" = "hivehbase");
 
 
4.Now your HBase table is also visible on Hive as  a table. To view all the columns of this table simply perform a select:
hive>select * from hbase_table_hive;
OK
{"bookid":"book1","rating":"1","userid":"user1"}    NULL
{"bookid":"book1","rating":"3","userid":"user2"}    NULL
{"bookid":"book2","rating":"3","userid":"user2"}    NULL
{"bookid":"book4","rating":"1","userid":"user2"}    NULL

It will list all the keys in the system in column family 'ratings' . Similarly you can get all the columns of all column families. If there is no value for a column it will show key and value is null. 
Also you can convert this map JSON list of columns to separate column in another Hive step.

No comments:

Post a Comment