Amazon EMR is an Amazon service, supported by Datameer, that lets you run use cases on single-purpose, short-lived clusters that automatically scale to meet demand, or on long-running, highly available clusters using multi-master deployment mode. The ability to expand or shrink your hardware based on your needs is useful for scheduling jobs that require resources that are limited or unavailable at busy times, and for ad hoc workloads with fluctuating resource requirements.

Note

As of version 7.5, Datameer supports Hive within EMR 5.24. If you require more specific information about Hive integration, please contact your Datameer service team member.

...

Setting up Datameer on EMR

You must log on with Datameer Administrator privileges to set up EMR.

In the Admin tab, select Hadoop Cluster. The current configurations for your Hadoop cluster are displayed. Click Edit at the bottom.

Select EMR Hadoop Cluster from the drop-down menu under Cluster Mode.

Enter your Amazon S3 bucket address and the path to the storage folder.

Datameer uses S3 as storage for all files, both permanent and intermediate, for additional security.

Authenticate to your S3 bucket using your key/secret, or select the box Use EC2 IAM Role? to authenticate via IAM role.

Under Cluster Settings, select the mode for connecting to the EMR cluster:

  • EMR Cluster Name
  • EMR Cluster Id
  • Explicit YARN Resource Manager hostname

With EMR Cluster Name mode: enter the name of the cluster running EMR.

Set the polling interval time in seconds for Datameer to check if there is a cluster with the name entered above.

With EMR Cluster Id mode: enter the ID of the cluster running EMR.

With YARN Resource Manager mode you can provide the EMR Cluster master node hostname directly.

Under Hadoop Properties, you can configure default property values or enter additional Hadoop distribution specific properties as well as custom properties. 

Under Logging Settings, select the severity of messages to be logged. The logging customization field allows you to record exactly what is needed.

Click Save to complete the EMR Cluster setup.

...

Datameer uses the Amazon S3 REST API, which in turn uses a custom HTTP scheme based on a keyed-HMAC (Hash Message Authentication Code) for authentication. To authenticate a request, you first concatenate selected elements of the request to form a string. You then use your AWS secret access key to calculate the HMAC of that string. The output of the HMAC algorithm is the signature. It simulates the security properties of a real signature. This signature is added to the request in the standard HTTP Authorization header using the syntax "Authorization: AWS AWSAccessKeyId:Signature".

When the system receives an authenticated request, it fetches the AWS secret access key that you claim to have and uses it in the same way to compute a signature for the message it received. It then compares the signature it calculated against the signature presented by the requester. If the two signatures match, the system concludes that the requester must have access to the AWS secret access key and therefore acts with the authority of the principal to whom the key was issued. If the two signatures do not match, the request is dropped and the system responds with an error message.
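The signing and verification steps described above can be sketched in Python. This is a minimal illustration, not Datameer's or Amazon's implementation. It assumes HMAC-SHA1 with a Base64-encoded result, and the function names, key ID, secret, and string to sign are hypothetical placeholders. How the request elements are selected and concatenated into the string to sign is not covered here.

```python
import base64
import hashlib
import hmac


def sign(secret_key: str, string_to_sign: str) -> str:
    """Return the Base64-encoded HMAC of string_to_sign (SHA-1 is an assumption)."""
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key_id: str, secret_key: str, string_to_sign: str) -> str:
    """Build the header value in the form 'AWS AWSAccessKeyId:Signature'."""
    return f"AWS {access_key_id}:{sign(secret_key, string_to_sign)}"


def verify(secret_key: str, string_to_sign: str, presented_signature: str) -> bool:
    """Recompute the signature on the receiving side and compare in constant time."""
    expected = sign(secret_key, string_to_sign)
    return hmac.compare_digest(expected, presented_signature)
```

A receiver that holds the same secret accepts the request only if `verify` returns true. Changing either the string to sign or the secret produces a different signature, so the comparison fails.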

...

IAM roles provide a convenient alternative to using access key/secret for authenticating to S3 from Amazon EC2 instances. When this option is selected, Datameer's S3 client uses the instance profile credentials to sign and authenticate the S3 requests. Instance profile credentials exist within the instance metadata associated with the IAM role for the EC2 instance. The EC2 instance on which Datameer runs is launched with the appropriate IAM role/instance profile. The same is used for launching the EMR Cluster. It is usually sufficient to use the default EC2 instance profile, EMR_EC2_DefaultRole, to launch both the EMR Cluster and the Datameer EC2 instance. The EMR instance, EC2 instance, and S3 bucket must be in the same AWS Region.

Encryption

Datameer uses S3 as storage for both permanent and intermediate files. Datameer does not write any intermediate or cached data locally on the cluster or to HDFS. The following diagram gives a high-level overview of supported encryption mechanisms.

...

Datameer supports encrypting data at rest on S3. The following server-side encryption mechanisms for S3 are supported within Datameer. Datameer does not support Amazon S3 Client-Side Encryption.

Note that Datameer does not support explicit encryption of cached preview data, properties, configuration, deployment artifacts, or log files on the Datameer EC2 instance, although Amazon EC2 supports encryption of local disks.

LUKS encryption:

The Amazon EC2 instance store volumes and the attached Amazon EBS volumes of cluster instances are encrypted using LUKS. For more information about LUKS encryption, see the LUKS on-disk specification. At-rest encryption doesn't encrypt the EBS root device volume (boot volume).

Amazon EMR version 5.7.0 or later supports encryption of the EBS root device volume by specifying a custom AMI. For more information, see Customizing an AMI in the Amazon EMR Management Guide and How to protect data at rest with Amazon EC2 instance store encryption.

Server-side encryption with Amazon S3-Managed encryption keys (SSE-S3):

Server-side encryption with Amazon S3-managed encryption keys (SSE-S3) employs strong multi-factor encryption. Amazon S3 encrypts each object with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses 256-bit Advanced Encryption Standard (AES-256) to encrypt your data.

To use SSE-S3, you need to create a bucket policy that enforces encryption as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html

Here is an example of a typical bucket policy:

Code Block
{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aes-encrypted-test-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aes-encrypted-test-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=AES. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.
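To illustrate what the two Deny statements in a bucket policy like the one above accomplish, the following Python sketch mimics their combined effect on the headers of a PutObject request. It is a deliberately simplified model and not the AWS policy engine: the function name is hypothetical, and the request is reduced to a plain dictionary of headers.

```python
from typing import Mapping

# Header that carries the requested server-side encryption mode.
SSE_HEADER = "x-amz-server-side-encryption"


def put_object_denied(headers: Mapping[str, str], required: str = "AES256") -> bool:
    """Simplified model of the policy: True means the upload would be denied."""
    value = headers.get(SSE_HEADER)
    # DenyUnEncryptedObjectUploads: the "Null" condition matches when the
    # encryption header is absent from the request.
    if value is None:
        return True
    # DenyIncorrectEncryptionHeader: "StringNotEquals" matches when the
    # header is present but names a different encryption mode.
    return value != required
```

Under this model, a request is accepted only when it carries the encryption header with the required value, which is why Datameer adds the header to all S3 requests once the encryption property is set.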

Server-side encryption with AWS KMS–managed keys (SSE-KMS):

AWS Key Management Service (AWS KMS) is a service that combines secure highly available hardware and software to provide a key management system scaled for the cloud. AWS KMS uses customer master keys (CMKs) to encrypt your Amazon S3 objects. You use AWS KMS via the Encryption Keys section in the IAM console or via AWS KMS APIs to centrally create encryption keys, define the policies that control how keys can be used, and audit key usage to prove they are being used correctly. The first time you add an SSE-KMS–encrypted object to a bucket in a region, a default CMK is created for you automatically. This key is used for SSE-KMS encryption unless you select a CMK that you created separately using AWS Key Management Service. Creating your own CMK gives you more flexibility, including the ability to create, rotate, disable, and define access controls, and to audit the encryption keys used to protect your data.

Besides creating a KMS key and a key policy, you need to create a bucket policy that enforces encryption as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html

Here is an example of a typical bucket policy:

Code Block
{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::kms-encrypted-test-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms",
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": "arn:aws:kms:us-east-1:123456789012:key/6789abcd-89ab-ef01-3456-456789abcdef"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::kms-encrypted-test-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=KMS. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.

Encrypting data in transit

Datameer implicitly supports encryption mechanisms for data in transit that are supported by Amazon EMR and S3. The encryption mechanisms are EMR release version and application (e.g., Hadoop, Tez, etc.) specific, and do not require any special handling in Datameer other than configuration.

  • Hadoop

See Hadoop in Secure Mode in Apache Hadoop documentation. In this configuration, Hadoop RPC is set to "Privacy" and uses SASL. This is activated in Amazon EMR when in-transit encryption is enabled.

...

In Datameer, the following properties need to be specified in the Hadoop Cluster page custom properties section:

Panel

hadoop.rpc.protection=privacy

  • Tez

See Tez Runtime Configuration in Tez documentation. In this configuration, Tez Shuffle Handler uses TLS (tez.runtime.ssl.enable).

...

Panel

tez.runtime.shuffle.ssl.enable=true
tez.runtime.shuffle.keep-alive.enabled=true

  • Amazon S3

Calls are made to Amazon S3 using REST over HTTPS in Datameer. HTTPS is the default protocol for the S3 client SDK used by Datameer. No further configuration is necessary.

  • MySQL

When MySQL is installed on Amazon RDS, SSL needs to be enabled on MySQL. Install the corresponding server certificate on the Datameer instance.

  • Datameer Web/REST API

To enforce in-transit encryption for all calls from the client browser to the Datameer app server, SSL must be enabled in Jetty. If using a custom certificate, install it on the Datameer EC2 instance. These instructions are provided in Datameer's Installation Guide.
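The Hadoop and Tez in-transit settings listed above are plain key=value custom properties. As a small convenience sketch, the following Python snippet renders them in that form so they can be pasted into the custom properties section of the Hadoop Cluster page. The helper name and the idea of generating the text programmatically are assumptions for illustration and not part of Datameer.

```python
from typing import Dict

# Property names and values as listed in the Hadoop and Tez items above.
IN_TRANSIT_PROPERTIES: Dict[str, str] = {
    "hadoop.rpc.protection": "privacy",
    "tez.runtime.shuffle.ssl.enable": "true",
    "tez.runtime.shuffle.keep-alive.enabled": "true",
}


def render_properties(props: Dict[str, str]) -> str:
    """Render properties as key=value lines, one per line, rejecting malformed entries."""
    lines = []
    for key, value in props.items():
        # A key must be non-empty and must not contain '=' or a line break;
        # a value must stay on a single line.
        if not key or "=" in key or "\n" in key or "\n" in value:
            raise ValueError(f"invalid property: {key!r}")
        lines.append(f"{key}={value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties(IN_TRANSIT_PROPERTIES))
```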

Using the REST API for EMR

Datameer's REST API is available to view and update EMR configurations. See Datameer's EMR REST API.