Using Compression with Hadoop and Datameer
- 1 Datameer FAQ About Compression
- 2 Hadoop LZO Compression on Linux
  - 2.2 LZO Compression on a Mac
  - 2.3 Make a Working Directory
    - 2.3.1 Get the LZO library
    - 2.3.2 Download and install the LZO libraries
    - 2.3.3 Build hadoop-lzo
    - 2.3.4 Install and configure Hadoop
    - 2.3.5 Change JAVA_HOME
    - 2.3.6 Formatting HDFS
    - 2.3.7 Starting Hadoop
    - 2.3.8 Install and configure das
- 3 Hadoop, Hive, Datameer, and LZO
- 4 Building Hadoop Native Libs on OSX
- 5 Notes
Datameer FAQ About Compression
How do I configure Datameer/Hadoop to use native compression?
When working with large data volumes, native compression can drastically improve the performance of a Hadoop cluster. There are multiple options for compression algorithms, each with its own benefits: e.g., GZIP is better in terms of disk space, LZO in terms of speed.
1. First, determine the best compression algorithm for your environment (see the "compression" topic under Hadoop Cluster Configuration Tips).
2. Install the native compression libraries (platform-dependent) on both the Hadoop cluster (<HADOOP_HOME>/lib/native/Linux-[i386-32 | amd64-64]) and the Datameer machine (<das install dir>/lib/native/Linux-[i386-32 | amd64-64]).
3. Configure the codecs to use as custom properties in the Hadoop Cluster section in Datameer:
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Custom Hadoop properties are needed for the native library in CDH 5.3+, HDP 2.2+, and Apache Hadoop 2.6+:
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>

You can find the correct absolute path in your cluster settings. Depending on your distribution, the absolute path to the native library might look like one of the following examples:
/usr/lib/hadoop/lib/native
/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
/usr/hdp/current/hadoop/lib/native/

Or you can use settings such as:
LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:<absolute_path_to_native_lib>
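Since the correct path varies by distribution, a small shell probe over the common locations can help find it. This is only a sketch: the candidate list below repeats the three example paths above and may need extending for your cluster.

```shell
# Print the first existing directory from a list of candidate
# native-library locations (distribution defaults mentioned above).
find_native_dir() {
  for d in "$@"; do
    if [ -d "$d" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1
}

find_native_dir \
  /usr/lib/hadoop/lib/native \
  /opt/cloudera/parcels/CDH/lib/hadoop/lib/native \
  /usr/hdp/current/hadoop/lib/native \
  || echo "no native library directory found" >&2
```

The directory this prints is the value to substitute for `<absolute_path_to_native_lib>` in the properties above.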
How do I configure Datameer/Hadoop to use LZO native compression?
Add corresponding Java libraries to Datameer/Hadoop and follow the step-by-step guide below to implement the LZO compression codec.
Setup steps for LZO compression in Datameer
In order for Datameer to work with files compressed by LZO, it is required to add the corresponding Java libraries to the Datameer installation folder and point the application to the list of compression codecs to use.
The following steps are only applicable for the local execution framework on respective Datameer Workgroup editions.
1. Open the <datameer-install-path>/conf/das-conductor.properties file and ensure that LZO is listed in the das.import.compression.known-file-suffixes property value (this should be a default value).
2. Open the <datameer-install-path>/conf/das-common.properties file and add the values below under the section ## HADOOP Properties Additions.
3. Add the liblzo2* libraries to <datameer-install-path>/lib/native/. Usually there are three files: *.so, *.so.2, and *.so.2.0.0. The LZO libraries might already be installed somewhere on the machine; if they aren't, add them by installing the LZO-related packages, following the directions below.
4. Add the libgplcompression.* libraries to <datameer-install-path>/lib/native/[Linux-amd64-64 | Linux-i386-32]. Usually there are five files: *.a, *.la, *.so, *.so.0, and *.so.0.0.0. These libraries might already be installed somewhere on the machine; if they aren't, add them by following Hadoop gpl packaging or Install hadoop-gpl-packaging.
5. Add the corresponding JAR library files to <datameer-install-path>/etc/custom-jars/. You can download hadoop-lzo-<version>.jar from hadoop-lzo and hadoop-gpl-compression-<version>.jar from hadoop-gpl.
6. Restart Datameer.
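Once the liblzo2* and libgplcompression.* libraries are in place, a quick sanity check can confirm that both library families were actually copied. This is a sketch; NATIVE_DIR is a placeholder you must point at your actual <datameer-install-path>/lib/native directory.

```shell
# Check that both native library families are present in a directory.
# Prints what is missing and returns non-zero if anything is absent.
check_native_libs() {
  dir="$1"
  status=0
  if ! ls "$dir"/liblzo2* >/dev/null 2>&1; then
    echo "missing liblzo2* in $dir"
    status=1
  fi
  if ! ls "$dir"/libgplcompression.* >/dev/null 2>&1; then
    echo "missing libgplcompression.* in $dir"
    status=1
  fi
  return $status
}

# NATIVE_DIR is an assumed example path; adjust it for your install.
if check_native_libs "${NATIVE_DIR:-/opt/datameer/lib/native}"; then
  echo "native libraries look complete"
fi
```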
Setup steps for LZO compression on a Hadoop cluster
To work with LZO compression on a Hadoop cluster, it is required to add the corresponding Java libraries to the Hadoop configuration folders and set the appropriate properties in the configuration files.
1. Copy the liblzo2* and libgplcompression.* libraries from <datameer-install-path>/lib/native/ to the corresponding folder of the Hadoop distribution. Execute the command hadoop checknative -a to find the location of Hadoop's native libraries; the possible locations differ between versions.
2. Copy the appropriate JAR library files from <datameer-install-path>/etc/custom-jars/ to the corresponding /lib folder of the Hadoop distribution (this is usually the parent folder of the native library folder).
3. Add the LZO codec information to the Hadoop configuration files: open $HADOOP_HOME/conf/core-site.xml, insert the data provided below into the io.compression.codecs property, and add the following directly after the io.compression.codecs property:
4. Add the following custom properties in Datameer on the Hadoop Cluster page:
The above settings should be applied on all cluster nodes. It is recommended to restart the cluster services after setup.
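For reference, the codec registration in core-site.xml usually follows the pattern below. The com.hadoop.compression.lzo.* class names are the ones shipped by the hadoop-lzo project; this is a sketch, and the class names must match the JAR you actually installed.

```xml
<!-- Sketch of an LZO codec registration in core-site.xml.
     Verify the com.hadoop.compression.lzo.* class names against your JAR. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```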
Snappy Compression
The Snappy compression codec provides high-speed compression with a reasonable compression ratio. See the original documentation for more details.
CDH3u1 and newer versions already contain the Snappy compression codec. This page contains the configuration instructions. In addition, Snappy is integrated into Apache Hadoop versions 1.0.2 and 0.23.
If you are using Cloudera's distribution of Hadoop, the codec must be enabled inside the Datameer application, either in the Hadoop Cluster settings or on a per-job basis. Add the following settings:
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compression.type=BLOCK

If you encounter errors, the following custom Hadoop properties might be needed for the native library in CDH 5.3+, HDP 2.2+, and Apache Hadoop 2.6+:
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
mapreduce.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
Using 3rd Party Compression Libraries
Datameer can generally deal with all sorts of compression codecs that are supported by Hadoop, with the exception of EMR. The following examples show you how to enable compression.
Enabling LZO compression
For more information on where to download the LZO codec JAR files, see the Apache Hadoop wiki - UsingLzoCompression page.
1. Copy the native LZO compression libraries to your distribution:
cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-i386-32
or on 64-bit machines:
cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-amd64-64
2. Copy the codec JAR file to etc/custom-jars:
cp /<path>/hadoop-gpl-compression-0.1.0-dev.jar etc/custom-jars
3. Restart the conductor:
bin/conductor.sh restart
4. Make sure that the compression libraries are installed on your Hadoop cluster by following the documentation under Apache Hadoop wiki - UsingLzoCompression.
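The choice between the two target directories depends on the machine architecture. A small sketch that maps `uname -m` output to the directory name used in the copy commands above (the mapping itself is an assumption based on the two directory names shown there):

```shell
# Map a machine architecture (as reported by `uname -m`) to the
# native-library subdirectory used in the copy commands above.
native_subdir() {
  case "$1" in
    x86_64)              echo "Linux-amd64-64" ;;
    i386|i486|i586|i686) echo "Linux-i386-32" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Print the subdirectory for the current machine, if supported.
native_subdir "$(uname -m)" || true
```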
Enabling SNAPPY Compression
If you are using a Datameer version built for Cloudera, in this example CDH5.2.1-MR1, the native libraries are already available on the Datameer host.
1. Check the native SNAPPY compression libraries on the Datameer host:
[datameer@datameer-node Datameer-5.3.0-cdh-5.2.1-mr1]$ ll lib/native/
...
-rw-r--r-- 1 datameer datameer 19848 23. Dez 06:39 libsnappy.so
-rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1
-rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1.1.3
2. Make sure that the compression libraries are installed on your Hadoop cluster:
[root@cloudera-cluster lib]# pwd
/usr/lib
[root@cloudera-cluster lib]# find . -name 'snappy*'
./hadoop/lib/snappy-java-1.0.4.1.jar
...
./hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar
./hadoop-mapreduce/snappy-java-1.0.4.1.jar
3. Copy the Cloudera-provided snappy-java-1.0.4.1.jar to the Datameer node and put the codec into the etc/custom-jars folder:
[root@datameer-node]# cp snappy-java-1.0.4.1.jar /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
[root@datameer-node]# chown datameer:datameer /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars/*.*
[root@datameer-node]# ll /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
-rw-r--r-- 1 datameer datameer 960374 17. Okt 08:05 mysql-connector-java-5.1.34-bin.jar
-rw-r--r-- 1 datameer datameer 995968 23. Feb 05:16 snappy-java-1.0.4.1.jar
4. Restart the conductor:
bin/conductor.sh restart
5. Check in logs/conductor.log that the SNAPPY plugin was loaded. Creating a data link and importing SNAPPY-compressed AVRO files is now possible.