Handling Big Data with HBase Part 2: First Steps

Posted on December 11, 2013 by Scott Leberknight

This is the second in a series of blogs that introduce Apache HBase. In the first blog, we introduced HBase at a high level. In this part, we'll see how to interact with HBase via its command line shell.

Let's take a look at what working with HBase is like at the command line. HBase comes with a JRuby-based shell that lets you define and manage tables, execute CRUD operations on data, scan tables, and perform maintenance among other things. When you're in the shell, just type help to get an overall help page. You can get help on specific commands or groups of commands as well, using syntax like help <group> and help command. For example, help 'create' provides help on creating new tables. While HBase is deployed in production on clusters of servers, you can download it and get up and running with a standalone installation in literally minutes. The first thing to do is fire up the HBase shell. The following listing shows a shell session in which we create a blog table, list the available tables in HBase, add a blog entry, retrieve that entry, and scan the blog table.

$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.96.0-hadoop2, r1531434, Fri Oct 11 15:28:08 PDT 2013

hbase(main):001:0> create 'blog', 'info', 'content'
0 row(s) in 6.0670 seconds

=> Hbase::Table - blog

hbase(main):002:0> list
TABLE
blog
fakenames
my-table
3 row(s) in 0.0300 seconds

=> ["blog", "fakenames", "my-table"]

hbase(main):003:0> put 'blog', '20130320162535', 'info:title', 'Why use HBase?'
0 row(s) in 0.0650 seconds

hbase(main):004:0> put 'blog', '20130320162535', 'info:author', 'Jane Doe'
0 row(s) in 0.0230 seconds

hbase(main):005:0> put 'blog', '20130320162535', 'info:category', 'Persistence'
0 row(s) in 0.0230 seconds

hbase(main):006:0> put 'blog', '20130320162535', 'content:', 'HBase is a column-oriented...'
0 row(s) in 0.0220 seconds

hbase(main):007:0> get 'blog', '20130320162535'
COLUMN             CELL
 content:          timestamp=1386556660599, value=HBase is a column-oriented...
 info:author       timestamp=1386556649116, value=Jane Doe
 info:category     timestamp=1386556655032, value=Persistence
 info:title        timestamp=1386556643256, value=Why use HBase?
4 row(s) in 0.0380 seconds

hbase(main):008:0> scan 'blog', { STARTROW => '20130300', STOPROW => '20130400' }
ROW                COLUMN+CELL
 20130320162535    column=content:, timestamp=1386556660599, value=HBase is a column-oriented...
 20130320162535    column=info:author, timestamp=1386556649116, value=Jane Doe
 20130320162535    column=info:category, timestamp=1386556655032, value=Persistence
 20130320162535    column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0390 seconds

In the above listing we first create the blog table having column families info and content. After listing the tables and seeing our new blog table, we put some data in the table. The put commands specify the table, the unique row key, the column key composed of the column family and a qualifier, and the value. For example, info is the column family while title and author are qualifiers and so info:title specifies the column title in the info family with value "Why use HBase?". The info:title is also referred to as a column key. Next we use the get command to retrieve a single row and finally the scan command to perform a scan over rows in the blog table for a specific range of row keys. As you might have guessed, by specifying start row 20130300 (inclusive) and end row 20130400 (exclusive) we retrieve all rows whose row key falls within that range; in this blog example this equates to all blog entries in March 2013 since the row keys are the time when an entry was published.

An important characteristic of HBase is that you define column families, but then you can add any number of columns within that family, identified by the column qualifier. HBase is optimized to store columns together on disk, allowing for more efficient storage since columns that don't exist don't take up any space, unlike in a RDBMS where null values must actually be stored. Rows are defined by columns they contain; if there are no columns then the row, logically, does not exist. Continuing the above example in the following listing, we delete some specific columns from a row.

hbase(main):009:0>  delete 'blog', '20130320162535', 'info:category'
0 row(s) in 0.0490 seconds

hbase(main):010:0> get 'blog', '20130320162535'
COLUMN             CELL
 content:          timestamp=1386556660599, value=HBase is a column-oriented...
 info:author       timestamp=1386556649116, value=Jane Doe
 info:title        timestamp=1386556643256, value=Why use HBase?
3 row(s) in 0.0260 seconds

As shown just above, you can delete a specific column from a table as we deleted the info:category column. You can also delete all columns within a row and thereby delete the row using the deleteall shell command. To update column values, you simply use the put command again. By default HBase retains up to three versions of a column value, so if you put a new value into info:title, HBase will retain both the old and new version.

The commands issued in the above examples show how to create, read, update, and delete data in HBase. Data retrieval comes in only two flavors: retrieving a row using get and retrieving multiple rows via scan. When retrieving data in HBase you should take care to retrieve only the information you actually require. Since HBase retrieves data from each column family separately, if you only need data for one column family, then you can specify to retrieve only that bit of information. In the next listing we retrieve only the blog titles for a specific row key range that equate to March through April 2013.

hbase(main):011:0> scan 'blog', { STARTROW => '20130300', STOPROW => '20130500', COLUMNS => 'info:title' }
ROW                COLUMN+CELL
 20130320162535    column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0290 seconds

So by setting row key ranges, restricting the columns we need, and restricting the number of versions to retrieve, you can optimize data access patterns in HBase. Of course in the above examples, all this is done from the shell, but you can do the same things, and much more, using the HBase APIs.

Conclusion to Part 2

In this second part of the HBase introductory series, we saw how to use the shell to create tables, insert data, retrieve data by row key, and saw a basic scan of data via row key range. You also saw how you can delete a specific column from a table row.

In the next blog, we'll get an overview of HBase's high level architecture.

References



Post a Comment:
Comments are closed for this entry.