
Datameer X FAQ About Compression

How do I configure Datameer/Hadoop to use native compression?

...

  1. First determine the best compression algorithm for your environment (see the "compression" topic, under Hadoop Cluster Configuration Tips)
  2. Install native compression libraries (platform-dependent) on both the Hadoop cluster (<HADOOP_HOME>/lib/native/Linux-[i386-32 | amd64-64]) and the Datameer X machine (<das install dir>/lib/native/Linux-[i386-32 | amd64-64] )
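The second step can be sanity-checked with a small shell sketch. Everything below runs against a simulated install directory created in a temp folder; the paths are placeholders, not your actual install directories:

```shell
# Sketch: report whether a directory contains native compression libraries.
check_native_libs() {
  dir="$1"
  if ls "$dir"/lib*.so* >/dev/null 2>&1; then
    echo "native libs found in $dir"
  else
    echo "no native libs in $dir"
  fi
}

# Simulated stand-in for <das install dir>/lib/native/Linux-amd64-64:
demo=$(mktemp -d)
mkdir -p "$demo/lib/native/Linux-amd64-64"
touch "$demo/lib/native/Linux-amd64-64/libsnappy.so"
check_native_libs "$demo/lib/native/Linux-amd64-64"
```

Run the same check against both the Hadoop cluster path and the Datameer X path before configuring the codecs.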

...

Configure the codec usage as custom properties in the Hadoop Cluster section in Datameer X:

No Format
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
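One illustrative way to add these properties to a properties file without duplicating existing keys is sketched below. The temp file stands in for your actual configuration file, and the helper is an assumption for illustration, not Datameer X tooling:

```shell
# Sketch: append a key=value property only if the key is not already present.
conf=$(mktemp)   # stands in for your custom-properties file
add_prop() {
  key="${1%%=*}"
  grep -q "^${key}=" "$conf" || echo "$1" >> "$conf"
}

add_prop "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
add_prop "mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
add_prop "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"  # no-op
```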

...

Setup steps for LZO compression in Datameer

To let Datameer X work with LZO-compressed files, it's required to add the corresponding Java libraries to the Datameer X installation folder and point the application to the list of compression codecs to use.

Note

The following steps are only applicable for the local execution framework on respective Datameer X Workgroup editions.

  1. Open the <datameer-install-path>/conf/das-conductor.properties file and ensure that LZO is listed as a das.import.compression.known-file-suffixes property value (this should be a default value).

    Panel
    titleconf/das-conductor.properties

    ## Importing files ending with one of those specified suffixes will result in an exception if

    ## the proper compression-codec can't be loaded. This helps to fail fast and clear instead of displaying spaghetti data.
    das.import.compression.known-file-suffixes=zip,gz,bz2,lzo,lzo_deflate,Z
  2. Open the <datameer-install-path>/conf/das-common.properties file and add the values below under the section ## HADOOP Properties Additions.

    Panel
    titleconf/das-common.properties

    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec

    io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
  3. Add liblzo2* libraries to <datameer-install-path>/lib/native/. Usually there are three files: *.so, *.so.2, and *.so.2.0.0. The LZO libraries might already be installed somewhere on the machine; if they aren't, install the LZO-related packages by following the directions below:

    Panel
    titleInstalling LZO packages
    Execute the following command at all the nodes in your cluster:

    RHEL/CentOS/Oracle Linux:
    yum install lzo lzo-devel hadooplzo hadooplzo-native

    For SLES:
    zypper install lzo lzo-devel hadooplzo hadooplzo-native

    For Ubuntu/Debian:
    HDP support for Debian 6 is deprecated with HDP 2.4.2. Future versions of HDP will no longer be supported on Debian 6.
    apt-get install liblzo2-2 liblzo2-dev hadooplzo
  4. Add libgplcompression.* libraries to <datameer-install-path>/lib/native/[Linux-amd64-64/i386-32]. Usually there are five files: *.a, *.la, *.so, *.so.0, and *.so.0.0.0. The libraries might already be installed somewhere on the machine; if they aren't, follow the Hadoop GPL packaging instructions or install hadoop-gpl-packaging to add them.

    Note

    At this point, it might be required to move the library files directly into the native folder.

  5. Add the corresponding JAR library files to <datameer-install-path>/etc/custom-jars/. You could load hadoop-lzo-<version>.jar from hadoop-lzo and hadoop-gpl-compression-<version>.jar from hadoop-gpl.
  6. Restart Datameer.
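The local-setup steps above can be sketched end-to-end in shell. Everything below runs against a simulated install directory; the jar name uses VERSION as a placeholder and the placeholder files only illustrate where the real artifacts go:

```shell
# Sketch of steps 1-6 against a simulated Datameer X install directory.
DM=$(mktemp -d)   # stands in for <datameer-install-path>
mkdir -p "$DM/conf" "$DM/lib/native" "$DM/etc/custom-jars"
echo "das.import.compression.known-file-suffixes=zip,gz,bz2,lzo,lzo_deflate,Z" \
  > "$DM/conf/das-conductor.properties"

# Step 1: verify that 'lzo' is among the known file suffixes.
grep -q "^das.import.compression.known-file-suffixes=.*lzo" \
  "$DM/conf/das-conductor.properties" && echo "lzo suffix configured"

# Step 2: add the codec properties to das-common.properties.
cat >> "$DM/conf/das-common.properties" <<'EOF'
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
EOF

# Steps 3-5: stage native libs and jars (empty placeholder files here;
# VERSION is a placeholder, not a recommendation).
touch "$DM/lib/native/liblzo2.so" "$DM/etc/custom-jars/hadoop-lzo-VERSION.jar"
# Step 6 would be: bin/conductor.sh restart
```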

...

  1. Copy liblzo2* and libgplcompression.* libraries from <datameer-install-path>/lib/native/ to the corresponding folder of the Hadoop distribution. Execute the command hadoop checknative -a to find the location of Hadoop's native libraries. Possible locations for different versions include:

    Panel
    titlehadoop checknative -a

    /usr/lib/hadoop/lib/native
    /opt/cloudera/parcels/CDH/lib/hadoop/lib/native
    /usr/hdp/current/hadoop/lib/native/

    Note

    For some configurations libgplcompression.* libraries should be moved from [Linux-amd64-64/i386-32] folder directly to /lib/native. Ensure that all symlinks remain the same after copying.

  2. Copy appropriate JAR library files from <datameer-install-path>/etc/custom-jars/ to the corresponding /lib folder of the Hadoop distribution (this is usually the parent folder for the native library).
  3. Add the LZO codec information to the Hadoop configuration files: open $HADOOP_HOME/conf/core-site.xml and insert the data provided below into the io.compression.codecs property.

    Panel
    titlecore-site.xml

    com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec 

    Then add the following directly after the io.compression.codecs property:

    Panel
    titlecore-site.xml

    <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>

  4. Add the following custom properties in Datameer X on the Hadoop Cluster page:

    Panel
    titleDatameer X Custom Properties

    tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
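For the copy in step 1, using `cp -a` keeps symlinks as symlinks, which satisfies the note above about keeping all symlinks the same. A minimal illustration with a simulated symlink chain:

```shell
# Sketch: copy native libraries while preserving symlinks (cp -a).
src=$(mktemp -d); dst=$(mktemp -d)
touch "$src/liblzo2.so.2.0.0"
ln -s liblzo2.so.2.0.0 "$src/liblzo2.so.2"
ln -s liblzo2.so.2     "$src/liblzo2.so"

cp -a "$src/." "$dst/"
ls -l "$dst"   # the symlink chain is reproduced, not flattened into copies
```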

...

  1. CDH3u1 and newer versions already contain the Snappy compression codec. This page contains the configuration instructions. In addition, Snappy is integrated into Apache Hadoop versions 1.0.2 and 0.23.
  2. When using Cloudera's distribution of Hadoop, the codec must be enabled inside the Datameer X application, either in the Hadoop Cluster settings or on a per-job basis. Add the following settings:

    No Format
    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compress=true
    mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compression.type=BLOCK

    If you encounter errors, the following custom Hadoop properties might be needed for the native library in CDH 5.3+, HDP 2.2+, and Apache Hadoop 2.6+:

    Code Block
    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
    tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    mapreduce.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    Note
    titleAttention

    Make sure matching versions of Snappy are installed in Datameer X and on your Hadoop cluster (e.g. verify via checksums from the library files). Version mismatch can result in erroneous behavior during codec loading or job execution.

    Note
    titleAs of Datameer X Version 5.6

    To prevent out-of-memory errors, set das.job.container.memory-heap-fraction=0.7.
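The version check in the attention note above can be done by comparing checksums of the library file as seen on both machines. In this sketch, two local temp files simulate the Datameer X copy and the cluster copy:

```shell
# Sketch: compare checksums of the Snappy library on the Datameer X host
# and on a cluster node (two temp files simulate the two copies).
a=$(mktemp); b=$(mktemp)
printf 'snappy-lib-bytes' > "$a"
printf 'snappy-lib-bytes' > "$b"

if [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ]; then
  echo "library versions match"
else
  echo "version mismatch - install matching libraries"
fi
```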

Using 3rd Party Compression Libraries 

Datameer X can generally deal with all sorts of compression codecs that are supported by Hadoop, with the exception of EMR. The following examples show you how to enable compression.

...

If you are using a Datameer X version for Cloudera, in this example CDH5.2.1-MR1, on the Datameer X host the native libraries are already available. 

  1. Check the native SNAPPY compression libraries on the Datameer X host.

    No Format
    [datameer@datameer-node Datameer-5.3.0-cdh-5.2.1-mr1]$ ll lib/native/ 
    ... 
    -rw-r--r-- 1 datameer datameer 19848 23. Dez 06:39 libsnappy.so 
    -rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1 
    -rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1.1.3
  2. Make sure that the compression libraries are installed on your Hadoop cluster.

    No Format
    [root@cloudera-cluster lib]# pwd 
    /usr/lib 
    [root@cloudera-cluster lib]# find . -name 'snappy*' 
    ./hadoop/lib/snappy-java-1.0.4.1.jar 
    ... 
    ./hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar 
    ./hadoop-mapreduce/snappy-java-1.0.4.1.jar
  3. Copy the snappy-java-1.0.4.1.jar provided by Cloudera to the Datameer X node and put the codec into the etc/custom-jars folder.

    No Format
    [root@datameer-node]# cp snappy-java-1.0.4.1.jar /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars 
    [root@datameer-node]# chown datameer:datameer /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars/*.* 
    [root@datameer-node]# ll /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars 
    -rw-r--r-- 1 datameer datameer 960374 17. Okt 08:05 mysql-connector-java-5.1.34-bin.jar 
    -rw-r--r-- 1 datameer datameer 995968 23. Feb 05:16 snappy-java-1.0.4.1.jar
  4. Restart the conductor.

    No Format
    bin/conductor.sh restart
  5. Check in logs/conductor.log that the SNAPPY plugin was loaded. You can now create a data link and import SNAPPY-compressed AVRO files. 
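The log check in step 5 can be done with grep. The log line below is fabricated for illustration and the exact message text varies by version; only the idea of searching the conductor log for the codec name comes from the step above:

```shell
# Sketch: grep a (simulated) conductor log for a codec-load message.
log=$(mktemp)
echo "INFO  compress - loaded native snappy library" > "$log"

grep -qi 'snappy' "$log" && echo "snappy codec referenced in log"
```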

...

  1. Copy build/hadoop-lzo*.jar to /path/to/das/etc/custom-jars on the conductor.
  2. Copy build/hadoop-lzo*/lib/native to /path/to/das/lib/ on the conductor. No configuration modification is needed for das to pick up lzo compression.
  3. Restart das.
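The three steps above amount to the following sketch. All paths are simulated in temp directories: `build/` stands for a local hadoop-lzo build tree, and VERSION is a placeholder for the actual jar version:

```shell
# Sketch: stage hadoop-lzo artifacts into a simulated das install.
build=$(mktemp -d); das=$(mktemp -d)
mkdir -p "$build/native" "$das/etc/custom-jars" "$das/lib"
touch "$build/hadoop-lzo-VERSION.jar" "$build/native/libgplcompression.so"

cp "$build"/hadoop-lzo*.jar "$das/etc/custom-jars/"   # step 1
cp -a "$build/native" "$das/lib/"                     # step 2
# step 3 would be restarting das, e.g. bin/conductor.sh restart
```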

LZO Compression on a Mac

Contact your Datameer X customer support representative. 

...

Install and configure das as shown in the Installation Guide, but don't start the application:

...

Configure the application to use LZO compression, as described above, then test on the command line and with das.

Hadoop, Hive, Datameer

...

X and LZO

Environment

Create a directory on your desktop that contains the Datameer X distribution, Hadoop distribution, and Hive distribution:

...

An output file called books.txt.lzo should be created. Upload this file to HDFS. Later, create an import job for this file to verify that Hadoop and Datameer X work together.

Code Block
languagebash
 bin/hadoop fs -mkdir /books
 bin/hadoop fs -copyFromLocal books.txt.lzo /books/

...

Code Block
languagebash
bin/hive --service metastore

Install and set up Datameer X and connect to the LZO cluster

Download and install a Datameer X distribution and copy the native Hadoop LZO libs and the Hadoop LZO jar into the related folders using the following code:

...

Edit the das-common.properties file to add the compression information, and use the das.import.hidden-file-suffixes property to exclude files that shouldn't be imported into Datameer X. Add the .index suffix for the index file you created before.

Code Block
languagebash
titleconf/das-common.properties
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
das.import.hidden-file-suffixes=.index
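The effect of the das.import.hidden-file-suffixes setting is that LZO .index side files stay out of imports. The sketch below illustrates the assumed suffix-filtering semantics; it is a plain shell illustration, not Datameer X code:

```shell
# Sketch (assumed semantics): files whose names end in a hidden suffix
# (.index here) are skipped during import.
is_hidden() {
  case "$1" in
    *.index) return 0 ;;
    *)       return 1 ;;
  esac
}

for f in books.txt.lzo books.txt.lzo.index; do
  if is_hidden "$f"; then echo "skip $f"; else echo "import $f"; fi
done
# prints: import books.txt.lzo
#         skip books.txt.lzo.index
```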

Start Datameer X and connect to your pseudo-Hadoop cluster.

...