Using Compression with Hadoop and Datameer

Datameer FAQ About Compression

How do I configure Datameer/Hadoop to use native compression?

When working with large data volumes, native compression can drastically improve the performance of a Hadoop cluster. There are multiple compression algorithms to choose from, each with its own trade-offs: for example, GZIP is better in terms of disk space, LZO in terms of speed.

  1. First determine the best compression algorithm for your environment (see the "compression" topic, under Hadoop Cluster Configuration Tips)

  2. Install the native compression libraries (platform-dependent) on both the Hadoop cluster (<HADOOP_HOME>/lib/native/Linux-[i386-32 | amd64-64]) and the Datameer machine (<das install dir>/lib/native/Linux-[i386-32 | amd64-64])
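To confirm the libraries are actually in place, a small helper like the one below can search an installation tree for the Hadoop native objects. This is only a sketch: find_native_dir is a name made up for this example, and the /usr/lib/hadoop root is just one common location; running hadoop checknative -a on the cluster gives the authoritative answer.

```shell
# Sketch: print the first directory containing a libhadoop shared object
# under a given root. Helper name and example root are illustrative only.
find_native_dir() {
  find "$1" -name 'libhadoop.so*' 2>/dev/null | head -n 1 | xargs -r dirname
}

find_native_dir /usr/lib/hadoop   # example root; adjust for your distribution
```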


Configure the codecs to use as custom properties in the Hadoop Cluster section in Datameer:

mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Custom Hadoop properties needed for the native library in CDH 5.3+, HDP2.2+, APACHE-2.6+:

tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>

You can find the correct absolute path in your cluster settings. Depending on your distribution, the absolute path to the native library might look like one of the following examples:

/usr/lib/hadoop/lib/native
/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
/usr/hdp/current/hadoop/lib/native/

Or you can use a value such as:

LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:<absolute_path_to_native_lib>
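Since the same LD_LIBRARY_PATH value is repeated across four properties, it can help to generate the lines once and paste them into Datameer. The sketch below does exactly that; emit_native_lib_props is a hypothetical helper name, and the example path is one of the common locations listed above.

```shell
# Sketch: print the four launch-env custom properties for a given native
# library path. Function name and example path are illustrative only.
emit_native_lib_props() {
  local path="$1"
  local key
  for key in tez.am.launch.env tez.task.launch.env \
             yarn.app.mapreduce.am.env yarn.app.mapreduce.am.admin.user.env; do
    printf '%s=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:%s\n' "$key" "$path"
  done
}

emit_native_lib_props /usr/lib/hadoop/lib/native
```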


How do I configure Datameer/Hadoop to use LZO native compression?

Add the corresponding Java libraries to Datameer/Hadoop and follow the step-by-step guide below to implement the LZO compression codec.

Setup steps for LZO compression in Datameer

In order to let Datameer work with files compressed by LZO, it is required to add the corresponding Java libraries to the Datameer installation folder and to point the application to the list of compression codecs to use.

The following steps are only applicable for the local execution framework on respective Datameer Workgroup editions.

  1. Open the <datameer-install-path>/conf/das-conductor.properties file and ensure that LZO is listed as a value of the das.import.compression.known-file-suffixes property (this should be a default value).

  2. Open the <datameer-install-path>/conf/das-common.properties file and add the values below under the section ## HADOOP Properties Additions.

  3. Add the liblzo2* libraries to <datameer-install-path>/lib/native/. Usually there are three files: *.so, *.so.2, and *.so.2.0.0. The LZO libraries might already be installed somewhere on the machine; if they aren't, install the LZO-related packages by following the directions below:

  4. Add the libgplcompression.* libraries to <datameer-install-path>/lib/native/[Linux-amd64-64/i386-32]. Usually there are five files: *.a, *.la, *.so, *.so.0, and *.so.0.0.0. The libraries might already be installed somewhere on the machine, but if they aren't, they must be added. Follow Hadoop gpl packaging or Install hadoop-gpl-packaging to do this.

  5. Add the corresponding JAR library files to <datameer-install-path>/etc/custom-jars/. You can download hadoop-lzo-<version>.jar from hadoop-lzo and hadoop-gpl-compression-<version>.jar from hadoop-gpl.

  6. Restart Datameer.
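The steps above can be sanity-checked with a short script before restarting. This is a sketch only: check_lzo_setup is a hypothetical helper, it merely tests that the files from steps 1, 3, and 5 are present, and its messages are illustrative.

```shell
# Sketch: verify the LZO-related files from the steps above exist in a
# Datameer install directory. Helper name and messages are illustrative;
# step 4's libgplcompression files are architecture-specific and not checked.
check_lzo_setup() {
  local dm="$1"
  grep -qi 'lzo' "$dm/conf/das-conductor.properties" 2>/dev/null \
    || { echo "lzo suffix not configured"; return 1; }
  ls "$dm"/lib/native/liblzo2* >/dev/null 2>&1 \
    || { echo "liblzo2 libraries missing"; return 1; }
  ls "$dm"/etc/custom-jars/hadoop-lzo-*.jar >/dev/null 2>&1 \
    || { echo "hadoop-lzo jar missing"; return 1; }
  echo "LZO setup looks complete"
}
```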

Setup steps for LZO compression on a Hadoop cluster

In order to work with LZO compression on a Hadoop cluster, it is required to add the corresponding Java libraries to the Hadoop configuration folders and to set the appropriate properties in the configuration files.

  1. Copy the liblzo2* and libgplcompression.* libraries from <datameer-install-path>/lib/native/ to the corresponding folder of the Hadoop distribution. Execute the command hadoop checknative -a to find the location of Hadoop's native libraries. Here are possible locations for different versions:

  2. Copy appropriate JAR library files from <datameer-install-path>/etc/custom-jars/ to the corresponding /lib folder of the Hadoop distribution (this is usually the parent folder for the native library).

  3. Add the LZO codec information to the Hadoop configuration files: open $HADOOP_HOME/conf/core-site.xml and insert the data provided below into the io.compression.codecs property.

    Then add the following directly after the io.compression.codecs property:

  4. Add the following custom properties in Datameer on the Hadoop Cluster page:

The above settings should be implemented on all cluster nodes. It is recommended to restart the cluster services after setup.
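For reference, the core-site.xml change from step 3 typically ends up looking like the fragment below (printed here via a shell function for illustration). The com.hadoop.compression.lzo.* class names come from the hadoop-lzo packaging; verify them against the jar you actually deployed.

```shell
# Sketch: print a core-site.xml fragment registering the LZO codecs.
# Class names assume the hadoop-lzo packaging; adjust to your jar.
print_lzo_codec_config() {
  cat <<'EOF'
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
EOF
}

print_lzo_codec_config
```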

How do I configure Datameer/Hadoop to use Snappy native compression?

The Snappy compression codec provides high-speed compression with a reasonable compression ratio. See the original documentation for more details.

  1. CDH3u1 and newer versions already contain the Snappy compression codec. This page contains the configuration instructions. In addition, Snappy is integrated into Apache Hadoop versions 1.0.2 and 0.23.

  2. When using Cloudera's distribution of Hadoop, it is required to enable the codec inside the Datameer application, either in the Hadoop Cluster settings or on a per-job basis. Add the following settings:

    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compress=true
    mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compression.type=BLOCK

    If you encounter errors, the following custom Hadoop properties might be needed for the native library in CDH 5.3+, HDP 2.2+, and Apache 2.6+:

    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
    tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    mapreduce.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
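Before adding the LD_LIBRARY_PATH overrides above, it is worth confirming that the Snappy native objects actually exist under the candidate directory. The sketch below does that; has_snappy is a helper name made up for this example, and the example path is only one common location.

```shell
# Sketch: check whether libsnappy shared objects exist in a directory.
# Helper name and example path are illustrative only.
has_snappy() {
  if ls "$1"/libsnappy.so* >/dev/null 2>&1; then
    echo "snappy natives found in $1"
  else
    echo "no snappy natives in $1"
    return 1
  fi
}

has_snappy /usr/lib/hadoop/lib/native || true   # example path; yours may differ
```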

Using 3rd Party Compression Libraries 

Datameer can in general deal with all sorts of compression codecs that are supported by Hadoop, with the exception of EMR. The following examples show you how to enable compression.

Enabling LZO compression

For more information on where to download the LZO codec JAR files, see the Apache Hadoop wiki - UsingLzoCompression page.

  1. Copy native LZO compression libraries to your distribution:

    cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-i386-32

    or on 64 bit machines:

    cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-amd64-64
  2. Copy the codec jar file to etc/custom-jars

    cp /<path>/hadoop-gpl-compression-0.1.0-dev.jar etc/custom-jars
  3. Restart the conductor.

    bin/conductor.sh restart
  4. Make sure that the compression libraries are installed on your Hadoop cluster by following the documentation under Apache Hadoop wiki - UsingLzoCompression.
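Steps 1-3 above can be scripted so the architecture-specific native folder is chosen automatically. This is a sketch under assumptions: install_lzo is a hypothetical helper, and both paths are placeholders to adjust for your machine; restart the conductor afterwards as in step 3.

```shell
# Sketch of steps 1-3: copy the LZO natives and the codec jar into a
# Datameer install, choosing the native subfolder by CPU architecture.
install_lzo() {
  local src="$1" dm="$2"
  local arch_dir="Linux-i386-32"
  [ "$(uname -m)" = "x86_64" ] && arch_dir="Linux-amd64-64"
  mkdir -p "$dm/lib/native/$arch_dir" "$dm/etc/custom-jars"
  cp "$src"/liblzo2* "$dm/lib/native/$arch_dir/"
  cp "$src"/hadoop-gpl-compression-*.jar "$dm/etc/custom-jars/"
}

# Example invocation with placeholder paths:
# install_lzo /tmp/lzo-downloads /opt/datameer
```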

Enabling SNAPPY Compression

If you are using a Datameer version for Cloudera, in this example CDH5.2.1-MR1, the native libraries are already available on the Datameer host.

  1. Check the native SNAPPY compression libraries on the Datameer host.

    [datameer@datameer-node Datameer-5.3.0-cdh-5.2.1-mr1]$ ll lib/native/
    ...
    -rw-r--r-- 1 datameer datameer 19848 23. Dez 06:39 libsnappy.so
    -rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1
    -rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1.1.3
  2. Make sure that the compression libraries are installed on your Hadoop cluster.

    [root@cloudera-cluster lib]# pwd
    /usr/lib
    [root@cloudera-cluster lib]# find . -name 'snappy*'
    ./hadoop/lib/snappy-java-1.0.4.1.jar
    ...
    ./hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar
    ./hadoop-mapreduce/snappy-java-1.0.4.1.jar
  3. Copy the snappy-java-1.0.4.1.jar provided by Cloudera to the Datameer node and put the codec into the etc/custom-jars folder.

    [root@datameer-node]# cp snappy-java-1.0.4.1.jar /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
    [root@datameer-node]# chown datameer:datameer /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars/*.*
    [root@datameer-node]# ll /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
    -rw-r--r-- 1 datameer datameer 960374 17. Okt 08:05 mysql-connector-java-5.1.34-bin.jar
    -rw-r--r-- 1 datameer datameer 995968 23. Feb 05:16 snappy-java-1.0.4.1.jar
  4. Restart the conductor.

    bin/conductor.sh restart
  5. Check in logs/conductor.log that the SNAPPY plugin has been loaded. Creating a data link and importing SNAPPY-compressed Avro files is now possible.
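As a final sanity check, the steps above can be verified with a small helper. check_snappy_install is a name made up for this sketch; it only confirms the codec jar landed in etc/custom-jars, since the exact wording of the conductor.log plugin message varies by Datameer version.

```shell
# Sketch: confirm the snappy-java codec jar is present in custom-jars.
# Helper name and messages are illustrative only.
check_snappy_install() {
  local dm="$1"
  if ls "$dm"/etc/custom-jars/snappy-java-*.jar >/dev/null 2>&1; then
    echo "snappy jar present"
  else
    echo "snappy jar missing"
    return 1
  fi
}

# Example invocation with a placeholder install path:
# check_snappy_install /opt/Datameer-5.3.0-cdh-5.2.1-mr1
```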