Using Compression with Hadoop and Datameer
Datameer FAQ About Compression
How do I configure Datameer/Hadoop to use native compression?
When working with large data volumes, native compression can drastically improve the performance of a Hadoop cluster. There are multiple options for compression algorithms. Each has its benefits; for example, GZIP is better in terms of disk space, LZO in terms of speed.
- First determine the best compression algorithm for your environment (see the "compression" topic, under Hadoop Cluster Configuration Tips)
- Install native compression libraries (platform-dependent) on both the Hadoop cluster (<HADOOP_HOME>/lib/native/Linux-[i386-32 | amd64-64]) and the Datameer machine (<das install dir>/lib/native/Linux-[i386-32 | amd64-64])
- Configure the codecs to use as custom properties in the Hadoop Cluster section in Datameer:
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
Custom Hadoop properties needed for the native library in CDH 5.3+, HDP 2.2+, and Apache Hadoop 2.6+:
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
You can find the correct absolute path in your cluster settings. Depending on your distribution, the absolute path to the native library might look like one of the following examples:
/usr/lib/hadoop/lib/native
/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
/usr/hdp/current/hadoop/lib/native/
Or you can use a value such as the following for the properties above:
LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:<absolute_path_to_native_lib>
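To verify that the native libraries are actually picked up, you can run Hadoop's built-in check on the Datameer host and on a cluster node. This is only an illustrative sketch; the checknative subcommand is available in newer Hadoop versions, and the reported paths vary by distribution:

# Entries reported as "true" (e.g. zlib, snappy, lz4) indicate that the
# corresponding native library was found and loaded.
hadoop checknative -a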
How do I configure Datameer/Hadoop to use LZO native compression?
Add corresponding Java libraries to Datameer/Hadoop and follow the step-by-step guide below to implement the LZO compression codec.
Setup steps for LZO compression in Datameer
In order to let Datameer work with files compressed by LZO, it's required to add the corresponding Java libraries to the Datameer installation folder and point the application to the list of compression codecs to use.
The following steps are only applicable for the local execution framework on respective Datameer Workgroup editions.
- Open the <datameer-install-path>/conf/das-conductor.properties file and ensure that lzo is listed in the das.import.compression.known-file-suffixes property value (this should be a default value).

conf/das-conductor.properties
## Importing files ending with one of those specified suffixes will result in an exception if
## the proper compression-codec can't be loaded. This helps to fail fast and clear instead of displaying spaghetti data.
das.import.compression.known-file-suffixes=zip,gz,bz2,lzo,lzo_deflate,Z
- Open the <datameer-install-path>/conf/das-common.properties file and add the values below under the section ## HADOOP Properties Additions.

conf/das-common.properties
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
- Add liblzo2* libraries to <datameer-install-path>/lib/native/. Usually there are three files: *.so, *.so.2, *.so.2.0.0. Sometimes the LZO libs might already be installed somewhere on the machine; if they aren't, add them by installing the LZO-related packages as described below.

Installing LZO packages
Execute the following command on all the nodes in your cluster.

For RHEL/CentOS/Oracle Linux:
yum install lzo lzo-devel hadooplzo hadooplzo-native

For SLES:
zypper install lzo lzo-devel hadooplzo hadooplzo-native

For Ubuntu/Debian (note: HDP support for Debian 6 is deprecated with HDP 2.4.2, and future versions of HDP will no longer be supported on Debian 6):
apt-get install liblzo2-2 liblzo2-dev hadooplzo
- Add libgplcompression.* libraries to <datameer-install-path>/lib/native/[Linux-amd64-64/i386-32]. Usually there are five files: *.a, *.la, *.so, *.so.0, *.so.0.0.0. The LZO libs might already be installed somewhere on the machine, but if they aren't, add them by following Hadoop gpl packaging or Install hadoop-gpl-packaging. At this point, it might be required to move the library files directly into the native folder.
- Add the corresponding JAR library files to <datameer-install-path>/etc/custom-jars/. You could load hadoop-lzo-<version>.jar from hadoop-lzo and hadoop-gpl-compression-<version>.jar from hadoop-gpl.
- Restart Datameer.
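To sanity-check the installation, you can list the folders touched in the steps above. This is only an illustrative sketch; the exact file names depend on your platform and the hadoop-lzo version you downloaded:

# Native LZO and gplcompression libraries added above
ls -l <datameer-install-path>/lib/native/
# hadoop-lzo / hadoop-gpl-compression JARs added above
ls -l <datameer-install-path>/etc/custom-jars/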
Setup steps for LZO compression on a Hadoop cluster
To work with LZO compression on a Hadoop cluster, add the corresponding Java libraries to the Hadoop configuration folders and set the appropriate properties in the configuration files.
- Copy liblzo2* and libgplcompression.* libraries from <datameer-install-path>/lib/native/ to the corresponding folder of the Hadoop distribution. Execute the command hadoop checknative -a to find the location of Hadoop's native libs. Here are possible locations for different versions:

/usr/lib/hadoop/lib/native
/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
/usr/hdp/current/hadoop/lib/native/

For some configurations, the libgplcompression.* libraries should be moved from the [Linux-amd64-64/i386-32] folder directly to /lib/native. Ensure that all symlinks remain the same after copying.
- Copy the appropriate JAR library files from <datameer-install-path>/etc/custom-jars/ to the corresponding /lib folder of the Hadoop distribution (this is usually the parent folder of the native library).
- Add the LZO codec information to the Hadoop configuration files: open $HADOOP_HOME/conf/core-site.xml and append the values below to the io.compression.codecs property:

core-site.xml
com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec

Then add the following directly after the io.compression.codecs property:

core-site.xml
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
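For reference, after the edit the combined io.compression.codecs property could look like the example below (the same codec list appears in the core-site.xml example in the "Setting Up lzo on Hadoop" section later in this document); the exact set and order of codecs may differ in your environment:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>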
Add the following custom properties in Datameer on the Hadoop Cluster page:
Datameer Custom Properties
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
The above settings should be implemented on all cluster nodes. It is recommended to restart cluster services after setup.
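After the restart, a minimal smoke test could look like the sketch below, assuming lzop is installed on the client and the hadoop-lzo JAR was copied to the Hadoop lib folder as described above (adjust the JAR path and version to your setup):

echo "hello lzo" > test.txt
lzop test.txt                     # creates test.txt.lzo, keeps the original
hadoop fs -mkdir -p /tmp/lzo-test
hadoop fs -copyFromLocal test.txt.lzo /tmp/lzo-test/
hadoop jar /path/to/hadoop-lzo-<version>.jar com.hadoop.compression.lzo.LzoIndexer /tmp/lzo-test/test.txt.lzo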
How do I configure Datameer/Hadoop to use Snappy native compression?
The Snappy compression codec provides high-speed compression with a reasonable compression ratio. See the original documentation for more details.
- CDH3u1 and newer versions already contain the Snappy compression codec. This page contains the configuration instructions. In addition, Snappy is integrated into Apache Hadoop versions 1.0.2 and 0.23.
When using Cloudera's distribution of Hadoop, it is still required to enable the codec inside the Datameer application, either in the Hadoop Cluster settings or on a per-job basis. Add the following settings:
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compression.type=BLOCK
If you encounter errors, the following custom Hadoop properties might be needed for the native library in CDH 5.3+, HDP 2.2+, and Apache Hadoop 2.6+:
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
mapreduce.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
Make sure matching versions of Snappy are installed in Datameer and on your Hadoop cluster (e.g., verify via checksums of the library files). A version mismatch can result in erroneous behavior during codec loading or job execution.
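For example, one way to compare the libraries is to checksum them on both sides. This is only a sketch, with paths taken from the examples in this section; your locations may differ:

# On the Datameer host
md5sum <datameer-install-path>/lib/native/libsnappy.so*
# On a Hadoop node
md5sum /usr/lib/hadoop/lib/native/libsnappy.so*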
As of Datameer Version 5.6
To prevent out of memory errors, set
das.job.container.memory-heap-fraction=0.7
Using 3rd Party Compression Libraries
In general, Datameer can deal with all sorts of compression codecs that are supported by Hadoop, with the exception of EMR. The following examples show you how to enable compression.
Enabling LZO compression
For more information about where to download the LZO codec JAR files, see the Apache Hadoop wiki - UsingLzoCompression page.
Copy native LZO compression libraries to your distribution:
cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-i386-32
or on 64 bit machines:
cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-amd64-64
Copy the codec JAR file to etc/custom-jars:
cp /<path>/hadoop-gpl-compression-0.1.0-dev.jar etc/custom-jars
Restart the conductor.
bin/conductor.sh restart
- Make sure that the compression libraries are installed on your Hadoop cluster by following the documentation under Apache Hadoop wiki - UsingLzoCompression.
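To locate the LZO libraries on a cluster node, a search similar to the Snappy example in the next section can help. This is only a sketch; adjust the search roots to your distribution (the native lib locations listed earlier in this FAQ are good candidates):

find /usr/lib/hadoop /opt/cloudera/parcels /usr/hdp -name 'liblzo2*' -o -name 'libgplcompression*' 2>/dev/null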
Enabling SNAPPY Compression
If you are using a Datameer version for Cloudera (in this example, CDH 5.2.1-MR1), the native libraries are already available on the Datameer host.
Check the native SNAPPY compression libraries on the Datameer host:
[datameer@datameer-node Datameer-5.3.0-cdh-5.2.1-mr1]$ ll lib/native/
...
-rw-r--r-- 1 datameer datameer 19848 23. Dez 06:39 libsnappy.so
-rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1
-rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1.1.3
Make sure that the compression libraries are installed on your Hadoop cluster.
[root@cloudera-cluster lib]# pwd
/usr/lib
[root@cloudera-cluster lib]# find . -name 'snappy*'
./hadoop/lib/snappy-java-1.0.4.1.jar
...
./hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar
./hadoop-mapreduce/snappy-java-1.0.4.1.jar
Copy the snappy-java-1.0.4.1.jar provided by Cloudera to the Datameer node and put the codec into the etc/custom-jars folder.

[root@datameer-node]# cp snappy-java-1.0.4.1.jar /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
[root@datameer-node]# chown datameer:datameer /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars/*.*
[root@datameer-node]# ll /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
-rw-r--r-- 1 datameer datameer 960374 17. Okt 08:05 mysql-connector-java-5.1.34-bin.jar
-rw-r--r-- 1 datameer datameer 995968 23. Feb 05:16 snappy-java-1.0.4.1.jar
Restart the conductor.
bin/conductor.sh restart
Check in logs/conductor.log that the SNAPPY plugin was loaded. Creating a data link and importing SNAPPY-compressed Avro files is now possible.
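For example, a quick (illustrative) way to confirm this is to search the log after the restart; the exact wording of the log message depends on the Datameer version:

grep -i snappy logs/conductor.log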
Hadoop LZO Compression on Linux
Installing Hadoop and lzo compression on Linux
You can find some precompiled lzo-libs in our jungledisk:
- hadoop-lzo/mac_64 - for mac os x
- hadoop-lzo/lzo-hadoop-libs-linux-.tar - for linux 32/64 bit
Install the following:
- Useful tools
- LZO libs and headers
- liblzo2-2 liblzo2-dev on Debian-based systems (apt-get/aptitude)
- lzo lzo-devel on Red Hat-based systems (rpm/yum)
- Git
- Java
- Ant
You must change the hostnames and paths in the config files to match your environment.
Installing Hadoop-lzo (32 bit)
mkdir lzo
cd lzo
git clone http://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant clean compile-native test tar
cd ../..
wget http://mirror.synyx.de/apache//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop
cp ../lzo/hadoop-lzo/build/native/Linux-i386-32/lib/* lib/native/Linux-i386-32
cp ../lzo/hadoop-lzo/build/hadoop-lzo-0.4.9.jar lib/
Installing Hadoop-lzo (64 bit)
mkdir lzo
cd lzo
git clone http://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant clean compile-native test tar
cd ../..
wget http://mirror.synyx.de/apache//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop
cp ../lzo/hadoop-lzo/build/native/Linux-amd64-64/lib/* lib/native/Linux-amd64-64
cp ../lzo/hadoop-lzo/build/hadoop-lzo-0.4.9.jar lib/
Configuring Hadoop to use lzo
You have to set $JAVA_HOME for Hadoop:
nano -w $HADOOP_HOME/conf/hadoop-env.sh
You need to modify core-site.xml / mapred-site.xml / hdfs-site.xml:
nano -w $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://alpha:9000</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/data/hadoop</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>
nano -w $HADOOP_HOME/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>alpha:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/tmp/mapred/local</value>
    <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/data/mapred</value>
    <description>The shared directory where MapReduce stores control files.</description>
  </property>
</configuration>
nano -w $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
    <description>If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
</configuration>
Configure passwordless SSH access:
cd
ssh-keygen
cat .ssh/id_rsa.pub > .ssh/authorized_keys
ssh localhost
exit
Modify your .bashrc, .bash_profile:
cd
nano -w ~/.bashrc
export HADOOP_HOME=/home/tester/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
source .bashrc
Prepare Hadoop and run a test:
hadoop namenode -format
start-all.sh
hadoop jar /home/hadoop/hadoop/hadoop-0.20.2-test.jar TestDFSIO -Dmapred.child.java.opts=-Xmx600M -write -nrFiles 100 -fileSize 10
Prerequisites for compiling LZO
- Get the latest master branch tarball from http://www.github.com/kevinweil/hadoop-lzo
- Ensure you have jdk, ant, and gcc installed on your box. The jdk version needs to be the same as the jdk deployed on the Hadoop cluster.
- Ensure you have lzo and lzo-devel installed on your compile box and on the Hadoop cluster.
- Unpack the tarball from the first step.
- Run ant package in the unpacked directory.
Setting Up lzo on Hadoop
- Copy build/hadoop-lzo*.jar to /path/to/hadoop/lib on all the Hadoop nodes
- Copy build/hadoop-lzo*/lib/native to /path/to/hadoop/lib/ on all the Hadoop nodes
- Modify core-site.xml and mapred-site.xml:

core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
mapred-site.xml:
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Restart the Hadoop cluster (dfs and mapred).
Setting Up lzo on das
- Copy build/hadoop-lzo*.jar to /path/to/das/etc/custom-jars on the conductor.
- Copy build/hadoop-lzo*/lib/native to /path/to/das/lib/ on the conductor. No configuration modification is needed for das to pick up lzo compression.
- Restart das.
LZO Compression on a Mac
Contact your Datameer customer support representative.
Make a Working Directory
mkdir $workdir
cd $workdir
On a different OS, get a C compiler if you don't have one
sudo apt-get install build-essential
Get the LZO library
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.03.tar.gz
tar xzf lzo-2.03.tar.gz
cd lzo-2.03
./configure --prefix=/opt/local
make
sudo make install
cd ..
Download and install the LZO libraries
sudo aptitude search lzo

You see something like this:
> liblzo2-2
> liblzo2-dev
> lzop

You might need to perform this if the install fails:
sudo aptitude update
(or)
sudo apt-get update

Install this with:
sudo aptitude install liblzo2-2 liblzo2-dev
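Optionally, verify that the library and the lzop tool are in place before building hadoop-lzo (illustrative commands for Debian-based systems):

dpkg -l | grep lzo
lzop --version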
Build hadoop-lzo
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7.0/Home
sudo apt-get install git
git clone git://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
sudo apt-get install ant
ant clean compile-native test tar
cd ..
Install and configure Hadoop
wget http://mirror.synyx.de/apache//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop
Download and install the lzo-libraries as above, for each Hadoop node.
Change JAVA_HOME
nano -w conf/hadoop-env.sh
nano -w conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
nano -w conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
nano -w conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
cd $workdir/hadoop-lzo
mkdir ../hadoop/lib/native/Mac_OS_X-x86_64-64
cp build/native/Mac_OS_X-x86_64-64/lib/* ../hadoop/lib/native/Mac_OS_X-x86_64-64/
cp build/hadoop-lzo-0.4.4.jar ../hadoop/
Formatting HDFS
cd $workdir/hadoop
bin/hadoop namenode -format
Starting Hadoop
bin/start-all.sh
Install and configure das
Install and configure das as shown in the Installation Guide, but don't start the application:
cd $workdir/hadoop-lzo
cp build/hadoop-lzo-0.4.4.jar ../das-0.40/etc/custom-jars/
mkdir ../das-0.40/lib/native/Mac_OS_X-x86_64-64
cp build/native/Mac_OS_X-x86_64-64/lib/* ../das-0.40/lib/native/Mac_OS_X-x86_64-64/
Now you can start the application:
cd $workdir/das-0.40/
bin/conductor.sh start
Configure the application to use LZO compression, as described above, then test on the command line and with das.
Hadoop, Hive, Datameer, and LZO
Environment
Create a directory on your desktop that contains the Datameer distribution, Hadoop distribution, and Hive distribution:
cd ~/Desktop
mkdir DAS
Set up Java to include header files:
sudo ln -s /System/Library/Frameworks/JavaVM.framework/Headers /System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home/include
Install LZO dependencies
LZO C-Libraries
You need the following LZO c-libraries:
- LZO c-binaries to be able to (de)compress data on your system
- LZOP a cmd tool to (de)compress files
Do this with HomeBrew or MacPorts
For HomeBrew use:
brew search lzo
brew install lzo
brew install lzop
Create a symlink on your machine to link lzo libraries into another folder:
cd /usr/local
ln -s /usr/local/Cellar/lzo/2.06 lzo64
For MacPorts use:
sudo port install lzo2
sudo port install lzop
LZO Hadoop-Libraries
For HomeBrew use:
cd ~/Desktop/DAS
git clone git://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant clean compile-native test tar
For MacPorts use:
cd ~/Desktop/DAS
git clone git://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
env JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home/ \
  C_INCLUDE_PATH=/opt/local/include \
  LIBRARY_PATH=/opt/local/lib \
  CFLAGS="-arch x86_64" ant clean compile-native test tar
Find the Java library under:
build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar
Find the native libraries under:
build/hadoop-lzo-0.4.15/lib/native/Mac_OS_X-x86_64-64
Install and set up Hadoop and Hive with LZO
Install and set up a pseudo Hadoop distribution
Download and extract Hadoop 0.20.2:
cd ~/Desktop/DAS
curl -O http://apache.easy-webs.de/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
gunzip hadoop-0.20.2.tar.gz
tar -xf hadoop-0.20.2.tar
rm hadoop-0.20.2.tar
cp -r hadoop-0.20.2 pseudo-lzo-hadoop-0.20.2
cd pseudo-lzo-hadoop-0.20.2
Edit the conf/hadoop-env.sh script and add JAVA_HOME:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home
Edit the XML config files (core-site.xml, hdfs-site.xml, mapred-site.xml) to create a pseudo-Hadoop that works with HDFS and add LZO support:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/dfs/secondary</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/tmp</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/dfs/name</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/mapred/local</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/mapred/system</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
</configuration>
Copy the native libs from the hadoop-lzo project into the native lib folder of the Hadoop dist:
mkdir ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2/lib/native/Mac_OS_X-x86_64-64
cp ~/Desktop/DAS/hadoop-lzo/build/hadoop-lzo-0.4.15/lib/native/Mac_OS_X-x86_64-64/* ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2/lib/native/Mac_OS_X-x86_64-64
Copy the Hadoop LZO jar into the lib folder of your Hadoop dist:
cp ~/Desktop/DAS/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2/lib
Format your file system and start Hadoop:
cd ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2
bin/hadoop namenode -format
bin/start-all.sh
Compress a file and upload it to HDFS
Download the file books.txt and compress this file with lzop to get an .lzo file:

lzop books.txt

An output file called books.txt.lzo should be created. Upload this file to HDFS. Later, create an import job for this file to see if Hadoop and Datameer work.
bin/hadoop fs -mkdir /books
bin/hadoop fs -copyFromLocal books.txt.lzo /books/
To import this file into a Hive table, create an index file:
bin/hadoop jar lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /books/books.txt.lzo
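The indexer should create an index file (books.txt.lzo.index) next to the data file. A quick check:

bin/hadoop fs -ls /books/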
When you print the file with Hadoop, you should see compressed (unreadable) content:
bin/hadoop fs -cat /books/books.txt.lzo
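To double-check that the data itself is intact, you can pipe the compressed stream through lzop on the client (a sketch; assumes lzop is installed locally):

bin/hadoop fs -cat /books/books.txt.lzo | lzop -dc | head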
Install and set up Hive and connect to the LZO cluster
Download Hive and extract the archive to your ~/Desktop/DAS folder. Open the conf/hive-default.xml file and change the following property:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hive-warehouse</value>
</property>
Copy the hadoop-lzo.jar to the lib folder of the Hive installation:
cp ~/Desktop/DAS/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar ~/Desktop/DAS/hive-0.7.1/lib/
Create a table that supports LZO and link the books.lzo file:
cd ~/Desktop/DAS/hive-0.7.1
export HADOOP_HOME=~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2
bin/hive

CREATE EXTERNAL TABLE books(userId BIGINT, userName STRING, books ARRAY<STRING>, orderDetails MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY '.'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
LOCATION '/books';
Verify that Hive is working with your connected Hadoop cluster. You should see non-compressed output instead of LZO content.
select * from books;
Start Hive as a service. First, quit the Hive console by pressing CTRL-C.
bin/hive --service metastore
Install and set up Datameer and connect to the LZO cluster
Download and install a Datameer distribution and copy the native Hadoop lzo libs and the Hadoop lzo jar into the related folders using the following code:
cd ~/Desktop/DAS/hadoop-lzo
cp build/hadoop-lzo-0.4.15.jar ~/Desktop/DAS/das/etc/custom-jars/
mkdir ~/Desktop/DAS/das/lib/native/Mac_OS_X-x86_64-64
cp build/native/Mac_OS_X-x86_64-64/lib/* ~/Desktop/DAS/das/lib/native/Mac_OS_X-x86_64-64/
Edit the das-common.properties file, add the compression info, and list files that shouldn't be imported into Datameer using the das.import.hidden-file-suffixes property. Add the .index suffix for the index file you created before.
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
das.import.hidden-file-suffixes=.index
Start Datameer and connect to your pseudo-Hadoop cluster.
bin/conductor.sh start
On the Administration tab, connect to your Hadoop cluster:
- Create a file import job for the LZO folder that exists on your cluster under /books/*. You should see no garbage in the preview and snapshot view.
- Create a Hive import job and import the books table. You should see no garbage in preview and snapshot mode.
Building Hadoop Native Libs on OSX
Problem
One pain point associated with developing for Hadoop on OS X is the lack of native binary support. This can be especially irritating, as there is a bug with the non-native GZip codec which blocks DAS on OS X from reading SequenceFiles with BLOCK or RECORD compression enabled. If you are trying to process a SequenceFile and see an exception such as the following:
Caused by: java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
Then you are likely running into this issue.
The 64-bit OSX native libs are now checked into the project and distributed with DAS, so building the native libs shouldn't be necessary anymore unless you are running on 32 bit OSX.
Solution
You must have Xcode with the development tools installed in order to build these libraries.
Download the Apache Hadoop distribution, version 0.20.2, and extract the archive somewhere on your machine.
tar xzvf hadoop-0.20.2.tar.gz
cd hadoop-0.20.2/
Download the patch file from https://issues.apache.org/jira/browse/HADOOP-3659 and place it in the root of the Hadoop distribution (hadoop-0.20.2). Next, apply the patch and compile the native libraries:
cd hadoop-0.20.2/
patch -p0 < hadoop-0.20-compile-native-on-mac.patch
ant -Dcompile.native=true compile
This process generates the native libraries in a directory under build/native:
dobbsferry:hadoop-0.20.2$ ls build/native/Mac_OS_X-x86_64-64/lib/
Makefile  libhadoop.1.0.0.dylib  libhadoop.1.dylib  libhadoop.a  libhadoop.dylib  libhadoop.la
Now, pass the location of these libs to DAS on startup:
ANT_OPTS="-Xmx512m -XX:MaxPermSize=384m -Ddb.mode=mysql -Djava.library.path=~/resources/hadoop-0.20.2/build/native/Mac_OS_X-x86_64-64/lib/" ant jetty.run
Notes
This process has only been tested and verified with the Apache Hadoop distribution. See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html for more details on native libraries.