Quick install HBase in “pseudo distributed” mode and connect from Java

Platforms used: Ubuntu karmic, hadoop-0.20, hbase-0.20.3, Java client

This post will be useful for those wishing to set up HBase on a single server machine in pseudo distributed mode. The advantage of running the database in this mode is that it can then be accessed over the network, for example to allow a bunch of developers to start crafting Java code against it. It is also a nice stepping stone to fully distributed mode, either on your own servers, or somewhere like EC2.

On first reading of the HBase documentation, setting up pseudo distributed mode sounds very simple. The problem is that there are a lot of gotchas, which can make life very difficult indeed. Consequently, many people follow a twisted journey to their final destination, and when they finally get there, aren’t sure which of the measures they took were needed, and which were not. This is reflected by a degree of misinformation on the Web, so I will try to present here a reasonably minimal way of getting up and running (that is not to say that every step I take is absolutely necessary, but I’ll mention where I’m not sure).

Step 1: Check your IP setup
I believe this is one of the main causes of the weirdness that can happen. So, if you’re on Ubuntu, check your hosts file (/etc/hosts). If you see something like:
127.0.0.1 localhost
127.0.1.1 <server fqn> <server name, as in /etc/hostname>
replace the second line, so that the file reads
127.0.0.1 localhost
<server ip> <server fqn> <server name, as in /etc/hostname>
e.g.
127.0.0.1 localhost
23.201.99.100 hbase.mycompany.com hbase
If you don’t do this, the region servers will resolve their addresses to 127.0.1.1. This information will be stored inside the ZooKeeper instance that HBase runs (the directory and lock manager used by the system to configure and synchronize a running HBase cluster). When manipulating remote HBase data, client code libraries actually connect to ZooKeeper to find the address of the region server maintaining the data. In this case, they will be given 127.0.1.1 which resolves to the client machine. duh!
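A quick way to see what a JVM on the server resolves its own hostname to is a few lines of Java. This is only a sketch: the class and method names (HostCheck, resolvesToLoopback) are made up for illustration, and I am assuming that what java.net.InetAddress reports is representative of what the HBase processes see, which I have not checked against the HBase source.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper: reports whether a host name resolves to a loopback address. */
final class HostCheck {

    /** Returns true if the given host name or IP literal resolves to a loopback address (127.x.x.x or ::1). */
    static boolean resolvesToLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + host, e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // Resolve this machine's own host name, as configured in /etc/hostname and /etc/hosts.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("Hostname:    " + local.getHostName());
        System.out.println("Resolves to: " + local.getHostAddress());
        if (local.isLoopbackAddress()) {
            System.out.println("WARNING: the hostname resolves to a loopback address, check /etc/hosts");
        }
    }
}
```

Run it on the server itself. If the warning appears, remote clients are likely to be handed an address they cannot use.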

Step 2: Install Hadoop Packages
Hadoop is quite a big subject – hell, the book has over 500 pages. That’s why it is great that there is a company called Cloudera making pre-packaged distributions. So my recommendation here is to go with those packages. Perform the following steps, but check the important notes before proceeding:
a/ If you are on a Debian-based system (such as Ubuntu), you need to add the Cloudera repository to your Apt sources so you can pick up the packages. In the instructions following, if you are running a recent Ubuntu distro like karmic, then configure your cloudera.list to pick up the packages for “jaunty-testing”. Make sure you choose hadoop-0.20 or better.
Instructions http://archive.cloudera.com/docs/_apt.html
b/ Install the packages setting up hadoop in standalone mode
Instructions http://archive.cloudera.com/docs/_installing_hadoop_standalone_mode.html
c/ Install the package that sets up the pseudo distributed configuration
Instructions http://archive.cloudera.com/docs/cdh2-pseudo-distributed.html
IMPORTANT NOTES
You should begin the above process with your system in a completely hadoop-free state to be sure the steps will work correctly. For example, if you have an entry for a hadoop user in your /etc/passwd file that is different to the one the config package wants to install, installation of the config package can fail. Furthermore, old items may have the wrong permissions, which may cause later steps to fail. To find everything on your system you need to remove, do:
cd /
find . -name "*hadoop*"
and
cd /etc
grep -R hadoop .

Step 3: Prepare user “hadoop”
We are going to make it possible to log in as user hadoop (or rather, do a sudo -i -u hadoop). This makes it easy to edit files such as the configuration files while keeping their owner as hadoop. We are also going to run HBase as user hadoop.
Change the following entry in /etc/passwd
hadoop:x:104:112:Hadoop User,,,:/var/run/hadoop-0.20:/bin/false
to
hadoop:x:104:112:Hadoop User,,,:/var/run/hadoop-0.20:/bin/bash
There is a lot of talk on the Web about setting up ssh for the hadoop user, so that hadoop can ssh to different nodes without specifying a password. I’m not sure that this is necessary any more, but the weight of recommendation (including here http://hadoop.apache.org/common/docs/current/quickstart.html) persuaded me to do this anyway. So next:
# sudo -i -u hadoop
hadoop$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
hadoop$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Step 4: Configure hadoop
Open /etc/hadoop/conf/hadoop-env.sh and make sure your Java home is correctly set e.g.
export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.15

Step 5: Test hadoop
Check out the Web-based admin interfaces, e.g. the NameNode at
http://hbase.mycompany.com:50070/
and the JobTracker at
http://hbase.mycompany.com:50030/
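If you have no browser to hand, or want to confirm that the ports are reachable from a developer machine rather than just from the server, a plain TCP connect is enough. The following is a minimal sketch, not part of Hadoop: PortCheck and isPortOpen are names I made up, the two-second timeout is arbitrary, and the try-with-resources syntax assumes Java 7 or later.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Host name from the example in this post; pass your own as the first argument.
        String host = args.length > 0 ? args[0] : "hbase.mycompany.com";
        int[] ports = {50070, 50030};
        for (int port : ports) {
            String state = isPortOpen(host, port, 2000) ? "open" : "unreachable";
            System.out.println(host + ":" + port + " -> " + state);
        }
    }
}
```

An open port only tells you that something is listening there, so it is still worth loading the pages once to see that the daemons are healthy.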

Step 6: Install the HBase package
You need to download and install the latest version from http://www.apache.org/dyn/closer.cgi/hadoop/hbase/. Proceed as root as follows:
# cd ~
# wget http://apache.mirror.anlx.net/hadoop/hbase/hbase-0.20.3/hbase-0.20.3.tar.gz
# tar -xzf hbase-0.20.3.tar.gz
# mv hbase-0.20.3 /usr/lib
# cd /etc/alternatives
# ln -s /usr/lib/hbase-0.20.3 hbase-lib
# cd /usr/lib
# ln -s /etc/alternatives/hbase-lib hbase
# chown -R hadoop:hadoop hbase-0.20.3

Step 7: Configure HBase
Now, log in as hadoop and go to the HBase conf directory
# sudo -i -u hadoop
hadoop$ cd /usr/lib/hbase/conf
a/ Then set JAVA_HOME in hbase-env.sh, just like you did for hadoop-env.sh
b/ Inside hbase-site.xml, configure hbase.rootdir and hbase.master. The result should look something like below, notes following:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
    <description>The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>23.201.99.100:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
</configuration>
NOTES
1/ hbase.rootdir must specify a host and port number exactly the same as specified by fs.default.name inside /etc/hadoop/conf/core-site.xml. Basically it tells HBase where to find the distributed file system.
2/ hbase.master specifies the address and port that the HBase master, or rather the ZooKeeper instance HBase will start (more later), will listen on. It must be externally addressable for clients to connect. This is a good point to double-check the IP setup step at the beginning of this post.
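Note 1 is easy to get wrong by a single digit in the port, so here is a small sketch of the comparison using nothing but java.net.URI. RootDirCheck and sameFileSystem are hypothetical names, and the value hdfs://localhost:8020 for fs.default.name is an assumption on my part; read the real value from your own core-site.xml.

```java
import java.net.URI;

/** Hypothetical helper: compares fs.default.name with hbase.rootdir. */
final class RootDirCheck {

    /** Returns true if both URIs have the same scheme, host and port. */
    static boolean sameFileSystem(String fsDefaultName, String hbaseRootDir) {
        URI fs = URI.create(fsDefaultName);
        URI root = URI.create(hbaseRootDir);
        return equalsIgnoreCase(fs.getScheme(), root.getScheme())
                && equalsIgnoreCase(fs.getHost(), root.getHost())
                && fs.getPort() == root.getPort();
    }

    /** Null-safe, case-insensitive string comparison. */
    private static boolean equalsIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        // Assumed value of fs.default.name; copy the real one from /etc/hadoop/conf/core-site.xml.
        String fsDefaultName = "hdfs://localhost:8020";
        // Value of hbase.rootdir from the hbase-site.xml shown in this post.
        String hbaseRootDir = "hdfs://localhost:8020/hbase";
        System.out.println(sameFileSystem(fsDefaultName, hbaseRootDir)
                ? "OK: hbase.rootdir points at the configured file system"
                : "MISMATCH: hbase.rootdir does not match fs.default.name");
    }
}
```

The path part (/hbase) is deliberately ignored, since only the scheme, host and port have to agree.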

Step 8: Start up HBase
If you have not already started hadoop, then start it e.g. as described by Cloudera:
for service in /etc/init.d/hadoop-0.20-*
do
  sudo $service start
done
Next, start HBase as the hadoop user:
hadoop$ /usr/lib/hbase/bin/start-hbase.sh

Step 9: Check HBase is up and running
Open up the HBase Web UI e.g. http://hbase.mycompany.com:60010

Step 10: Open HBase shell, create a table and column family
You need to open the HBase shell and create a table and column family that will be used by the Java client example:
me$ /usr/lib/hbase/bin/hbase shell
hbase(main):001:0> create "myLittleHBaseTable", "myLittleFamily"

Step 11: Create Java client project
Create your sample Java client project with code, as described for example at http://hadoop.apache.org/hbase/docs/r0.20.3/api/index.html.
Next, you need to add a special hbase-site.xml file to its classpath. It should specify the ZooKeeper quorum (the ZooKeeper instances to contact, in this case just your hbase server). The client contacts ZooKeeper to find out where the master, region servers etc. are (ZooKeeper acts as a directory for clients, but it also acts as a definitive description of the HBase cluster and a synchronization system for its own nodes). Based upon the foregoing, the contents will look something like:
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hbase.mycompany.com</value>
    <description>Comma-separated list of servers in the ZooKeeper quorum. Here, just the HBase server.</description>
  </property>
</configuration>
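A common stumbling block at this point is that the file is not actually on the classpath, or the property name is misspelled. The sketch below checks both using only the JDK's XML parser, so it runs before you have any HBase jars wired in. SiteConfigCheck and readProperty are names I invented for this post; this is not how the HBase client itself loads its configuration.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads a property from a Hadoop-style *-site.xml file. */
final class SiteConfigCheck {

    /** Returns the value of the named property, or null if the property is absent. */
    static String readProperty(InputStream xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (name.equals(textOf(property, "name"))) {
                    return textOf(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Look the file up the same way any classpath resource is found.
        InputStream in = SiteConfigCheck.class.getClassLoader().getResourceAsStream("hbase-site.xml");
        if (in == null) {
            System.out.println("hbase-site.xml is NOT on the classpath");
            return;
        }
        System.out.println("hbase.zookeeper.quorum = " + readProperty(in, "hbase.zookeeper.quorum"));
    }
}
```

If it prints null for the quorum, check the property name in the file. If it reports that the file is missing, check where your IDE copies resources at build time.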

Step 12: Yeeeha. Build your Java client, and debug
Watch your IDE’s output window (e.g. the NetBeans Output window) for a trace of what is hopefully happening… welcome to the world of HBase

Written by dominicwilliams

January 28, 2010 at 12:21 pm

Posted in hadoop, HBase, Install, Install, Uncategorized

Tagged with configure, distributed database, hadoop, HBase, Install, Java, Ubuntu

10 Responses

are you honestly saying that the only way to make a small testing instance of hbase available via network is to reconfigure half of the server? my admin will be killing me (or rather laughing outright in my face)!

i’ve seen a lot of hair raising design and horrible documentation at hbase’s — but that has to be the worst (not your howto, of course, but the indication that one needs to jeopardize his server just to make that stuff somewhat usable because those guys at hbase apparently never thought that somebody outside their circle ever would try to use the stuff)!

sorry, but documentation is bad, confusing, incoherent, inconclusive. configuration guesswork at large.
the only conclusion i can make after a few weeks spent: never, ever try hbase! it’s the single worst project i ever met and there are better documented nosql-databases.

never touch hbase

October 11, 2010 at 3:39 pm

- we moved from HBase to Cassandra for precisely some of these reasons – please see the later posts 😉
  
  dominicwilliams
  
  October 11, 2010 at 10:58 pm
  
Cloudera has implemented simple packages with proper documentation to install HBase. See CDH3. They are dead easy to install.

JVerstry

February 2, 2011 at 8:32 pm

[…] " command its is asking for " root@localhost's password: " . Iam following " https://ria101.wordpress.com/2010/01/…t-java-client/ " to install HBASE for Hadoop single node cluster. So please help me to install Hbase for […]

SSH install problem.

March 17, 2011 at 4:58 am

Hi Dominic,

First of all, thanks for such a useful document. This is really helpful for beginners.

I am trying to run HBase in pseudo distributed mode. I was able to install Hadoop and run it successfully, but after that I am having a problem running HBase.
These are the commands I am running:
1. start-all.sh
2. start-hbase.sh
3. hbase shell
Till now, it looks fine.
But, in hbase shell, I am trying to run:
4. list
This command result is:
hbase(main):001:0> list
TABLE

ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase

Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:

hbase> list
hbase> list 'abc.*'

Do you have any idea, what could be wrong?

Thanking you.

Regards,
Sumved Shami

Sumved Shami

April 4, 2011 at 9:01 am

- Hi, my company is currently using Cassandra, so I’m not so up on HBase at the moment. However, I know Cloudera etc. are distributing it as ready-to-go RPM files, which might be a good option. By-hand configuration of HBase is complex, but the errors you have relate to ZooKeeper, which is the central coordination system used. For whatever reason, your region server or something like that can’t connect to ZooKeeper.
  
  dominicwilliams
  
  April 4, 2011 at 5:50 pm
  
  - Any specific reasons for moving to Cassandra?
    
    We guys are actually trying to research on Hadoop-HBase vs. Cassandra? One point for me is very important: Documentation. I feel HBase has really poor documentation. Just to run it on pseudo-distributed mode, I have been struggling for past 2 days. 🙂
    
    But, FB has moved from Cassandra to HBase. One most obvious reason for them is they have big experience on Hadoop. What is your idea?
    
    Thanks for your quick reply.
    
    Regards,
    Sumved Shami
    
    Sumved Shami
    
    April 4, 2011 at 7:07 pm
  - Hi, most of the reasons we moved to Cassandra are covered in the post https://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/. It is important to realize though that both systems have strengths and weaknesses. The main reasons we are using Cassandra are:
    – It is much easier to setup, administer and manage
    – You can deal with “eventual consistency” issues in various ways, such as using distributed locking libraries like Cages (we use Cages and Pelops, see github.com/s7). This is actually quite a complex area though, so I don’t want to oversimplify…
    
    dominicwilliams
    
    April 5, 2011 at 8:52 am
Also, in the hbase log file, I get this trace continuously:
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-04-04 15:10:26,625 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
2011-04-04 15:10:26,626 WARN org.apache.zookeeper.ClientCnxn: Session 0x12f1fe16b6a0001 for server null, unexpected error, closing socket connection and attempting reconnect

Sumved Shami

April 4, 2011 at 9:41 am

[…] Quick setup of an HBase pseudo-distributed environment. Original article: Quick install HBase in “pseudo distributed” mode and connect from Java […]

Quick install HBase in “pseudo distributed” mode and connect from Java | _yiihsia[互联网后端技术]

May 16, 2011 at 2:24 am

Dominic Williams