Datameer X FAQ About Compression
...
- First determine the best compression algorithm for your environment (see the "compression" topic, under Hadoop Cluster Configuration Tips)
- Install native compression libraries (platform-dependent) on both the Hadoop cluster (<HADOOP_HOME>/lib/native/Linux-[i386-32 | amd64-64]) and the Datameer X machine (<das install dir>/lib/native/Linux-[i386-32 | amd64-64])
...
Configure the codecs to use as custom properties in the Hadoop Cluster section in Datameer X:
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
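With GzipCodec configured for job output, the resulting part files can be verified from the command line. A minimal sketch, assuming a hypothetical job output directory (the OUT path and the sample part file below are illustrative stand-ins, not part of the original setup):

```shell
# Verification sketch; OUT stands in for a job output directory.
OUT="${OUT:-$(mktemp -d)}"
echo "hello" | gzip > "$OUT/part-00000.gz"   # simulated gzip-compressed part file

# gzip -t tests archive integrity without extracting:
for f in "$OUT"/part-*.gz; do
  gzip -t "$f" && echo "valid gzip: ${f##*/}"
done
```

Against a real cluster you would run the same `gzip -t` loop over files fetched with `hadoop fs -get` from the job output directory.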
...
Setup steps for LZO compression in Datameer
To let Datameer X work with files compressed by LZO, it is required to add the corresponding Java libraries to the Datameer X installation folder and point the application to the list of compression codecs to use.
Note: The following steps are only applicable to the local execution framework on the respective Datameer X Workgroup editions.
Open the <datameer-install-path>/conf/das-conductor.properties file and ensure that lzo is listed in the das.import.compression.known-file-suffixes property value (this should be a default value):

conf/das-conductor.properties:
## Importing files ending with one of those specified suffixes will result in an exception if
## the proper compression-codec can't be loaded. This helps to fail fast and clear instead of displaying spaghetti data.
das.import.compression.known-file-suffixes=zip,gz,bz2,lzo,lzo_deflate,Z
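Rather than opening the file by hand, the suffix check can be scripted. A minimal sketch, assuming DATAMEER_HOME points at the install directory (here a temp directory seeded with the shipped default is created purely for illustration):

```shell
# A minimal check sketch; DATAMEER_HOME is a placeholder for your real install path.
DATAMEER_HOME="${DATAMEER_HOME:-$(mktemp -d)}"
PROPS="$DATAMEER_HOME/conf/das-conductor.properties"
mkdir -p "${PROPS%/*}"
# Seed the shipped default purely for demonstration (skip against a real install):
[ -f "$PROPS" ] || echo 'das.import.compression.known-file-suffixes=zip,gz,bz2,lzo,lzo_deflate,Z' > "$PROPS"

# Fail fast if lzo is not among the known suffixes:
if grep -q '^das\.import\.compression\.known-file-suffixes=.*lzo' "$PROPS"; then
  echo "lzo suffix configured"
else
  echo "lzo suffix missing - add it to $PROPS" >&2
fi
```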
Open the <datameer-install-path>/conf/das-common.properties file and add the values below under the section ## HADOOP Properties Additions:

conf/das-common.properties:
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
Add the liblzo2* libraries to <datameer-install-path>/lib/native/. Usually there are 3 files: *.so, *.so.2, and *.so.2.0.0. The LZO libraries might already be installed somewhere on the machine; if they aren't, add them by installing the LZO-related packages. Execute the following command on all the nodes in your cluster:

RHEL/CentOS/Oracle Linux:
yum install lzo lzo-devel hadooplzo hadooplzo-native

SLES:
zypper install lzo lzo-devel hadooplzo hadooplzo-native

Ubuntu/Debian (note that HDP support for Debian 6 is deprecated with HDP 2.4.2; future versions of HDP will no longer be supported on Debian 6):
apt-get install liblzo2-2 liblzo2-dev hadooplzo

Add the libgplcompression.* libraries to <datameer-install-path>/lib/native/[Linux-amd64-64/i386-32]. Usually there are 5 files: *.a, *.la, *.so, *.so.0, and *.so.0.0.0. These libraries might already be installed somewhere on the machine; if they aren't, add them by following the Hadoop gpl packaging or Install hadoop-gpl-packaging instructions. Note: at this point, it might be required to move the library files directly into the native folder.
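A quick way to confirm the native libraries landed where expected is to check that all three liblzo2 names resolve. A sketch with a hypothetical NATIVE_DIR standing in for <datameer-install-path>/lib/native (the files are simulated here; a real install gets them from the packages above):

```shell
# Sanity-check sketch; NATIVE_DIR stands in for <datameer-install-path>/lib/native.
NATIVE_DIR="${NATIVE_DIR:-$(mktemp -d)}"
# Simulate the usual trio for demonstration; real installs get these from the packages above.
touch "$NATIVE_DIR/liblzo2.so.2.0.0"
ln -sf liblzo2.so.2.0.0 "$NATIVE_DIR/liblzo2.so.2"
ln -sf liblzo2.so.2.0.0 "$NATIVE_DIR/liblzo2.so"

count=0
for f in liblzo2.so liblzo2.so.2 liblzo2.so.2.0.0; do
  if [ -e "$NATIVE_DIR/$f" ]; then
    count=$((count + 1))
    echo "found $f"
  fi
done
echo "$count of 3 LZO library names resolve"
```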
- Add the corresponding JAR library files to <datameer-install-path>/etc/custom-jars/. You could load hadoop-lzo-<version>.jar from hadoop-lzo and hadoop-gpl-compression-<version>.jar from hadoop-gpl.
- Restart Datameer X.
...
Copy the liblzo2* and libgplcompression.* libraries from <datameer-install-path>/lib/native/ to the corresponding folder of the Hadoop distribution. Execute the command hadoop checknative -a to find the location of Hadoop's native libraries. Here are possible locations for different versions:

/usr/lib/hadoop/lib/native
/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
/usr/hdp/current/hadoop/lib/native/

Note: For some configurations, the libgplcompression.* libraries should be moved from the [Linux-amd64-64/i386-32] folder directly to /lib/native. Ensure that all symlinks remain the same after copying.
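Because the symlinks must survive the copy, cp -a (archive mode) is safer than a plain cp. A sketch with stand-in SRC and DEST directories (on a real cluster, DEST is the native-lib folder reported by hadoop checknative -a):

```shell
# Symlink-preserving copy sketch; SRC and DEST are stand-ins.
# On a real cluster, DEST is the native-lib folder reported by `hadoop checknative -a`.
SRC="${SRC:-$(mktemp -d)}"
DEST="${DEST:-$(mktemp -d)}"
touch "$SRC/liblzo2.so.2.0.0"                # simulated library file
ln -sf liblzo2.so.2.0.0 "$SRC/liblzo2.so.2"  # simulated symlink

# cp -a (archive mode) copies the symlink itself instead of the file it points to:
cp -a "$SRC"/liblzo2* "$DEST"/
[ -L "$DEST/liblzo2.so.2" ] && echo "symlink preserved"
```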
- Copy the appropriate JAR library files from <datameer-install-path>/etc/custom-jars/ to the corresponding /lib folder of the Hadoop distribution (this is usually the parent folder of the native library folder). Add the LZO codec information to the Hadoop configuration files by opening $HADOOP_HOME/conf/core-site.xml and appending the data below to the io.compression.codecs property:

core-site.xml:
com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec

Then add the following directly after the io.compression.codecs property:

<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
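The edited core-site.xml can be sanity-checked with a simple grep for the codec class names. A sketch that creates a sample fragment purely for illustration (point CORE_SITE at your real $HADOOP_HOME/conf/core-site.xml instead):

```shell
# Sanity-check sketch for core-site.xml; a sample fragment is created for illustration.
CORE_SITE="${CORE_SITE:-$(mktemp)}"   # normally $HADOOP_HOME/conf/core-site.xml
cat > "$CORE_SITE" <<'EOF'
<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>
EOF

# Both LZO entries should be present after the edit:
grep -q 'com.hadoop.compression.lzo.LzopCodec' "$CORE_SITE" \
  && grep -q 'io.compression.codec.lzo.class' "$CORE_SITE" \
  && echo "LZO codec entries present"
```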
Add the following custom properties in Datameer X on the Hadoop Cluster page:
Datameer X Custom Properties:
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
...
- CDH3u1 and newer versions already contain the Snappy compression codec; this page contains the configuration instructions. In addition, Snappy is integrated into Apache Hadoop versions 1.0.2 and 0.23.
When using Cloudera's distribution of Hadoop, the codec must be enabled inside the Datameer X application, either in the Hadoop Cluster settings or on a per-job basis. Add the following settings:
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compression.type=BLOCK
If you encounter errors, setting the following custom Hadoop properties might be needed for the native library in CDH 5.3+, HDP 2.2+, and Apache Hadoop 2.6+:
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
mapreduce.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
Attention: Make sure matching versions of Snappy are installed in Datameer X and on your Hadoop cluster (e.g., verify via checksums of the library files). A version mismatch can result in erroneous behavior during codec loading or job execution.
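One way to compare the installed Snappy libraries, as the note above suggests, is by checksum. A sketch using cksum with stand-in files (A and B are hypothetical placeholders; point them at the actual libsnappy.so copies on the Datameer X host and on a cluster node):

```shell
# Checksum-comparison sketch; A and B are stand-in files.
# Point them at the libsnappy.so copies on the Datameer host and on a cluster node.
A="$(mktemp)"
B="$(mktemp)"
printf 'snappy-1.1.3\n' > "$A"   # simulated library contents
printf 'snappy-1.1.3\n' > "$B"

if [ "$(cksum < "$A")" = "$(cksum < "$B")" ]; then
  echo "Snappy libraries match"
else
  echo "Snappy version mismatch" >&2
fi
```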
Note: As of Datameer X version 5.6, to prevent out-of-memory errors, set
das.job.container.memory-heap-fraction=0.7
Using 3rd Party Compression Libraries
In general, Datameer X can deal with all sorts of compression codecs supported by Hadoop, with the exception of EMR. The following examples show you how to enable compression.
...
If you are using a Datameer X version built for Cloudera (in this example, CDH5.2.1-MR1), the native libraries are already available on the Datameer X host.
Check the native SNAPPY compression libraries on the Datameer X host:

[datameer@datameer-node Datameer-5.3.0-cdh-5.2.1-mr1]$ ll lib/native/
...
-rw-r--r-- 1 datameer datameer 19848 23. Dez 06:39 libsnappy.so
-rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1
-rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1.1.3
Make sure that the compression libraries are installed on your Hadoop cluster.
[root@cloudera-cluster lib]# pwd
/usr/lib
[root@cloudera-cluster lib]# find . -name 'snappy*'
./hadoop/lib/snappy-java-1.0.4.1.jar
...
./hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar
./hadoop-mapreduce/snappy-java-1.0.4.1.jar
Copy the Cloudera-provided snappy-java-1.0.4.1.jar to the Datameer X node and put the codec into the etc/custom-jars folder.

[root@datameer-node]# cp snappy-java-1.0.4.1.jar /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
[root@datameer-node]# chown datameer:datameer /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars/*.*
[root@datameer-node]# ll /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars
-rw-r--r-- 1 datameer datameer 960374 17. Okt 08:05 mysql-connector-java-5.1.34-bin.jar
-rw-r--r-- 1 datameer datameer 995968 23. Feb 05:16 snappy-java-1.0.4.1.jar
Restart the conductor.
bin/conductor.sh restart
Check in logs/conductor.log that the SNAPPY plugin was loaded. You can now create a Data Link and import SNAPPY-compressed AVRO files.
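The log check can be scripted as well. A sketch, assuming a conductor.log-like file (LOG is a placeholder, and the log line written below is a made-up example purely for illustration; the real message format may differ):

```shell
# Log-check sketch; LOG stands in for logs/conductor.log.
LOG="${LOG:-$(mktemp)}"
# Made-up example line for illustration; the real message format may differ:
echo "INFO plugin loaded: org.apache.hadoop.io.compress.SnappyCodec" >> "$LOG"

grep -qi 'snappy' "$LOG" && echo "Snappy codec mentioned in conductor.log"
```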
...
- Copy build/hadoop-lzo*.jar to /path/to/das/etc/custom-jars on the conductor.
- Copy build/hadoop-lzo*/lib/native to /path/to/das/lib/ on the conductor. No configuration modification is needed for das to pick up LZO compression.
- Restart das.
LZO Compression on a Mac
Contact your Datameer X customer support representative.
...
Install and configure das as shown in the Installation Guide, but don't start the application:
...
Configure the application to use LZO compression, as described above, then test on the command line and with das.
Hadoop, Hive, Datameer X, and LZO
Environment
Create a directory on your desktop that contains the Datameer X distribution, Hadoop distribution, and Hive distribution:
...
An output file called books.txt.lzo should be created. Upload this file to HDFS. Later, create an import job for this file to check that Hadoop and Datameer X work together.
bin/hadoop fs -mkdir /books
bin/hadoop fs -copyFromLocal books.txt.lzo /books/
...
bin/hive --service metastore
Install and set up Datameer X and connect to the LZO cluster
Download and install a Datameer X distribution and copy the native Hadoop lzo libs and the Hadoop lzo jar into the related folders using the following code:
...
Edit the das-common.properties file and add the compression info; then exclude files that shouldn't be imported into Datameer X using the das.import.hidden-file-suffixes property, adding the .index file suffix you created before.
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
das.import.hidden-file-suffixes=.index
Start Datameer X and connect to your pseudo-Hadoop cluster.
...