This topic provides tips on configuring Hadoop and Datameer for use in a shared cluster.
Tips on Configuring Hadoop
User identity
Hadoop assumes that the user who launches Datameer is also the user that should be used on the Hadoop cluster. To set up the user name correctly, create a user on the Hadoop cluster with the same name as the user who launches Datameer. That cluster user must have permission to read from and write to the Datameer private folder on the cluster.
If you don’t set up the name correctly, you might experience the following problems:
- HDFS permission errors when you attempt to run jobs, because Datameer tries unsuccessfully to manipulate files in the Datameer private folder or incorrectly attempts to manipulate the directory of the HDFS root user.
- Jobs that are submitted to the wrong work queue (meaning that your job won't have the appropriate priority) or that are rejected.
See the Hadoop documentation for additional information.
When using Unix-like systems:
- The user name is the equivalent of `whoami`
- The group list is the equivalent of `bash -c groups`
Caution
If an intruder makes changes to the user name and group resolution by path manipulations, access permission checking can be bypassed.
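To compare the identity on the Datameer host with the user created on the cluster, the two lookups above can be scripted. The following is a minimal sketch, not part of Datameer or Hadoop: the function name `resolve_identity` and its `run` parameter are illustrative assumptions, and the sketch simply calls the same commands listed above.

```python
import subprocess


def resolve_identity(run=subprocess.check_output):
    """Return (user_name, groups) using the same commands as described above.

    `run` is a hypothetical hook that defaults to subprocess.check_output;
    it exists only so the lookup can be replaced in tests.
    """
    # Equivalent of `whoami`
    user = run(["whoami"], text=True).strip()
    # Equivalent of `bash -c groups`; the output is a space-separated list
    groups = run(["bash", "-c", "groups"], text=True).split()
    return user, groups


if __name__ == "__main__":
    user, groups = resolve_identity()
    print(f"user={user} groups={','.join(groups)}")
```

Run the script as the user who launches Datameer, then check that a user with the same name exists on the cluster.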
Configuration of access permissions
Define this property in hdfs-site.xml:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
  <description>
    If "true", enable permission checking in HDFS.
    If "false", permission checking is turned off, but all other behavior is unchanged.
    Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
  </description>
</property>
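To confirm which value a given hdfs-site.xml actually carries, the file can be parsed with a few lines of Python. This is a rough sketch under stated assumptions: the helper names `read_property` and `permissions_enabled` are made up for illustration, and the fallback to "true" when the property is absent reflects the Hadoop default rather than anything stated in this topic.

```python
import xml.etree.ElementTree as ET


def read_property(site_xml, name):
    """Return the <value> text of the named property, or None if it is absent."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def permissions_enabled(site_xml):
    """Report whether HDFS permission checking is switched on."""
    # Assumption: an unset dfs.permissions falls back to "true".
    value = read_property(site_xml, "dfs.permissions")
    return (value or "true").lower() == "true"


if __name__ == "__main__":
    sample = """<configuration>
      <property>
        <name>dfs.permissions</name>
        <value>false</value>
      </property>
    </configuration>"""
    print("permission checking enabled:", permissions_enabled(sample))
```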
Define permissions in HDFS
hadoop fs [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
          [-chown [-R] [OWNER][:[GROUP]] PATH...]
          [-chgrp [-R] GROUP PATH...]
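As an illustration of how these options combine, the sketch below builds the command lines that would give a cluster user ownership of a folder and restrict its mode. It only assembles and prints the commands; it does not run them. The path `/user/datameer`, the user `datameer`, the group `hadoop`, and the mode `770` are placeholders chosen for the example, not values prescribed by Datameer.

```python
import shlex


def permission_commands(path, owner, group, mode, recursive=True):
    """Build `hadoop fs` argument lists that set the owner, group, and mode of a path."""
    flag = ["-R"] if recursive else []
    return [
        ["hadoop", "fs", "-chown", *flag, f"{owner}:{group}", path],
        ["hadoop", "fs", "-chmod", *flag, mode, path],
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the real private folder, user, and group.
    for cmd in permission_commands("/user/datameer", "datameer", "hadoop", "770"):
        print(shlex.join(cmd))
```

Review the printed commands before running them as an HDFS superuser, since a recursive change affects every file below the path.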
Datameer export jobs to HDFS
Datameer allows you to export data to remote file systems. In the New Export Job > Data Details step, you can enable the clear output directory option.
Caution
This option also deletes data that other applications have stored in the same location.
Load Balancing
If you have configured a job scheduler on your cluster, you can easily configure which queue or pool Datameer should use. For example, if you have set up a fair share scheduler (see Apache FairScheduler), you can set this up by doing the following:
- Check which property needs to be set to configure the pool. The name of that property is the value of the Hadoop property mapred.fairscheduler.poolnameproperty, configured in conf/mapred-site.xml of your Hadoop installation.
- Set this property in Datameer at Administration > Hadoop Cluster > Custom Property and set the pool that should be used by Datameer.
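The two steps above can be sketched in Python: look up the value of mapred.fairscheduler.poolnameproperty in mapred-site.xml, then form the entry for the Datameer custom property field. Treat this as an assumption-laden sketch: the function names are invented, the `name=value` entry format is assumed rather than confirmed by this topic, `pool.name` and the pool `datameer` are sample values, and the fallback to `user.name` reflects the Hadoop 1 fair scheduler default.

```python
import xml.etree.ElementTree as ET

POOL_NAME_KEY = "mapred.fairscheduler.poolnameproperty"
DEFAULT_POOL_PROPERTY = "user.name"  # assumed fair scheduler default when unset


def pool_name_property(mapred_site_xml):
    """Return the name of the job property that the fair scheduler uses to pick a pool."""
    root = ET.fromstring(mapred_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == POOL_NAME_KEY:
            return (prop.findtext("value") or "").strip() or DEFAULT_POOL_PROPERTY
    return DEFAULT_POOL_PROPERTY


def datameer_custom_property(mapred_site_xml, pool):
    """Format the custom property entry (assumed to be a name=value line)."""
    return f"{pool_name_property(mapred_site_xml)}={pool}"


if __name__ == "__main__":
    sample = """<configuration>
      <property>
        <name>mapred.fairscheduler.poolnameproperty</name>
        <value>pool.name</value>
      </property>
    </configuration>"""
    print(datameer_custom_property(sample, "datameer"))
```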
Using Data from Other Applications
Datameer can read files on HDFS generated by other MapReduce jobs and applications on top of Hadoop. Depending on the format of the files produced by other jobs, writing a plug-in for Datameer might or might not be required.
See the custom import plug-in definition and the Datameer Plug-in Tutorial for additional information on creating a custom plug-in.
Useful Links
- HDFS Permissions Guide: https://hadoop.apache.org/docs/stable1/hdfs_permissions_guide.html
- Security in Hadoop, Part – 1 (from the blog 'Big Data and etc.'): http://bigdata.wordpress.com/2010/03/22/security-in-hadoop-part-1
- Hadoop 0.20.S Virtual Machine Appliance: http://developer.yahoo.com/blogs/hadoop/posts/2010/06/hadoop_020s_virtualmachine/
- Using Fair Scheduler: http://hadoop.apache.org/docs/stable1/fair_scheduler.html