INFO

EMR is an Amazon service that lets you run workloads on single-purpose, short-lived clusters that scale automatically to meet demand, or on long-running, highly available clusters using multi-master deployment mode. The ability to expand or shrink hardware processing hours based on your needs is useful for scheduling jobs that require resources that are limited or unavailable at busy times, and for ad hoc workloads with fluctuating resource requirements.


Setting up Datameer on EMR

INFO: As of version 7.5, Datameer supports Hive within EMR 5.24 and newer. If you require more specific information about Hive integration, please contact your Datameer service team member.


INFO: To set up Datameer on Amazon EMR, you must log on with Datameer Administrator privileges.

To set up Datameer:

  1. Open the Admin tab and select "Hadoop Cluster". The current configuration for your Hadoop cluster is displayed.
  2. Click "Edit" at the bottom. The configuration page opens.
  3. Select "EMR Hadoop Cluster" from the drop-down under section 'Cluster Mode'.
  4. Enter your Amazon S3 bucket address and the path to the storage folder.
    INFO: Datameer uses S3 as storage for all files, both permanent and intermediate, for additional security.
  5. If needed, activate the check box "Use EC2 IAM Role?" to authenticate via IAM role.
  6. If needed, authenticate to your S3 bucket using your key/secret.
  7. Select the mode for connecting to the EMR Cluster from the drop-down and set the configuration.
    INFO: EMR Cluster Name and EMR Cluster ID are validated when saving the configuration.
    INFO: With EMR Cluster Name mode, enter the name of the cluster running EMR. Set the polling interval time in seconds for Datameer to check if there is a cluster with the name entered above.
    INFO: With EMR Cluster Id mode, enter the ID of the cluster running EMR.
    INFO: With YARN Resource Manager mode, you can provide the EMR Cluster master node hostname directly.
  8. Configure the default property values or enter additional Hadoop distribution specific properties as well as custom properties.
  9. Select the severity of messages to be logged and confirm with "Save". Configuring the EMR Cluster is finished.
    INFO: The logging customization field allows you to record exactly what is needed.
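The EMR Cluster Name mode described in the setup steps can be pictured as a polling loop. The following is a minimal sketch under stated assumptions, not Datameer's implementation: `fetch_clusters` is a hypothetical callable standing in for whatever lists clusters (for example, a wrapper around the EMR ListClusters API), the dictionary keys `Id`, `Name` and `Status.State` follow the shape of that API's cluster summaries, and the set of states treated as usable is an assumption.

```python
import time
from typing import Callable, Iterable, Optional

# Assumption: only clusters in these states are considered usable.
USABLE_STATES = {"RUNNING", "WAITING"}


def find_cluster_id(clusters: Iterable[dict], name: str) -> Optional[str]:
    """Return the ID of the first usable cluster with the given name, if any."""
    for cluster in clusters:
        state = cluster.get("Status", {}).get("State")
        if cluster.get("Name") == name and state in USABLE_STATES:
            return cluster.get("Id")
    return None


def wait_for_cluster(
    fetch_clusters: Callable[[], Iterable[dict]],
    name: str,
    polling_interval_seconds: float,
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until a usable cluster with the given name appears or attempts run out."""
    for attempt in range(max_attempts):
        cluster_id = find_cluster_id(fetch_clusters(), name)
        if cluster_id is not None:
            return cluster_id
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(polling_interval_seconds)
    return None
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in practice the default `time.sleep` would be used with the polling interval configured in the EMR Cluster Name settings.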

Security

S3 Authentication

When configuring the EMR Hadoop Cluster, you are presented with two options for authenticating to S3:

  • Access Key/Secret
  • IAM Role

Access Key/Secret

Datameer uses the Amazon S3 REST API which in turn uses a custom HTTP scheme based on a keyed-HMAC (Hash Message Authentication Code) for authentication. To authenticate a request, you first concatenate selected elements of the request to form a string. You then use your AWS secret access key to calculate the HMAC of that string. The output of the HMAC algorithm is the signature. It simulates the security properties of a real signature. This signature is added to the request in the standard HTTP Authorization header using the syntax "Authorization: AWS AWSAccessKeyId:Signature".

When the system receives an authenticated request, it fetches the AWS secret access key that you claim to have and uses it in the same way to compute a signature for the message it received. It then compares the signature it calculated against the signature presented by the requester. If the two signatures match, the system concludes that the requester must have access to the AWS secret access key and therefore acts with the authority of the principal to whom the key was issued. If the two signatures do not match, the request is dropped and the system responds with an error message.
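The signing and verification steps described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the legacy AWS Signature Version 2 layout, in which the signature is the Base64-encoded HMAC-SHA1 of the string to sign, and it simplifies the canonicalization of headers and resources. The function names and the example key values are hypothetical, and the sketch is not Datameer's S3 client code.

```python
import base64
import hashlib
import hmac


def build_string_to_sign(method, date, resource,
                         content_md5="", content_type="", amz_headers=""):
    """Concatenate the selected request elements into the string to sign."""
    return f"{method}\n{content_md5}\n{content_type}\n{date}\n{amz_headers}{resource}"


def sign(secret_access_key: str, string_to_sign: str) -> str:
    """Compute the Base64-encoded HMAC-SHA1 of the string to sign."""
    digest = hmac.new(
        secret_access_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key_id: str, secret_access_key: str,
                         string_to_sign: str) -> str:
    """Build the value of the HTTP Authorization header."""
    return f"AWS {access_key_id}:{sign(secret_access_key, string_to_sign)}"


def verify(secret_access_key: str, string_to_sign: str,
           presented_signature: str) -> bool:
    """Recompute the signature on the receiving side and compare the two."""
    expected = sign(secret_access_key, string_to_sign)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, presented_signature)
```

A request signed with one secret verifies only against the same secret and the same request elements; changing either the key or any signed element (for example, the HTTP method) yields a different signature and the request is rejected.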

IAM Role

IAM roles provide a convenient alternative to using access key/secret for authenticating to S3 from Amazon EC2 instances. When this option is selected, Datameer's S3 client uses the instance profile credentials to sign and authenticate the S3 requests. Instance profile credentials exist within the instance metadata associated with the IAM role for the EC2 instance. The EC2 instance on which Datameer runs is launched with the appropriate IAM role/instance profile. The same is used for launching the EMR Cluster. It is usually sufficient to use the default EC2 instance profile, EMR_EC2_DefaultRole, to launch both the EMR Cluster and the Datameer EC2 instance. The EMR instance, EC2 instance and S3 Bucket must be in the same AWS Region.
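To make the idea of instance profile credentials more concrete, the sketch below parses a credentials document of the kind the EC2 instance metadata service returns for an IAM role. The field names (`Code`, `AccessKeyId`, `SecretAccessKey`, `Token`, `Expiration`) follow that document's format; the function names and the five-minute refresh margin are assumptions made for illustration. Datameer's S3 client handles this internally, so this is not something you need to implement.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_instance_credentials(document: str) -> dict:
    """Parse the JSON credentials document associated with an IAM role."""
    data = json.loads(document)
    if data.get("Code") != "Success":
        raise ValueError(f"credentials not available: {data.get('Code')}")
    expiration = datetime.strptime(
        data["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "access_key_id": data["AccessKeyId"],
        "secret_access_key": data["SecretAccessKey"],
        "session_token": data["Token"],
        "expiration": expiration,
    }


def needs_refresh(credentials: dict, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """Temporary credentials should be re-fetched shortly before they expire."""
    return now >= credentials["expiration"] - margin
```

Because these credentials are temporary and rotated automatically, no long-lived secret has to be stored in the Datameer configuration when the IAM role option is used.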

Encryption

Datameer uses S3 as storage for both permanent and intermediate files. Datameer does not write any intermediate or cached data locally on the cluster or to HDFS. The following diagram gives a high-level overview of supported encryption mechanisms.

Encrypting Data at Rest

Datameer supports encrypting data at rest on S3. The following server-side encryption mechanisms for S3 are supported within Datameer. Datameer does not support Amazon S3 Client-Side Encryption.

Note that Datameer does not support explicit encryption of cached preview data, properties, configuration, deployment artifacts, or log files on the Datameer EC2 instance, although Amazon EC2 supports encryption of local disks.

LUKS encryption:

The Amazon EC2 instance store volumes and the attached Amazon EBS volumes of cluster instances are encrypted using LUKS. For more information about LUKS encryption, see the LUKS on-disk specification. At-rest encryption doesn't encrypt the EBS root device volume (boot volume).

Amazon EMR version 5.7.0 or later supports encryption of the EBS root device volume by specifying a custom AMI. For more information, see Customizing an AMI in the Amazon EMR Management Guide and How to protect data at rest with Amazon EC2 instance store encryption.

Server-side encryption with Amazon S3-Managed encryption keys (SSE-S3):

Server-side encryption with Amazon S3-managed encryption keys (SSE-S3) employs strong multi-factor encryption. Amazon S3 encrypts each object with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses 256-bit Advanced Encryption Standard (AES-256) to encrypt your data.

To use SSE-S3, you need to create a bucket policy that enforces encryption as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html

Here is an example of a typical bucket policy:

Code Block
{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aes-encrypted-test-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aes-encrypted-test-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=AES. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.
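The effect of the SSE-S3 bucket policy above on an upload can be illustrated with a small simulation. This is a simplified sketch, not the AWS policy engine: it only models the two Deny statements shown in the example policy, and the function name `denying_statement` is hypothetical.

```python
from typing import Mapping, Optional

SSE_HEADER = "x-amz-server-side-encryption"


def denying_statement(headers: Mapping[str, str],
                      required: str = "AES256") -> Optional[str]:
    """Return the Sid of the statement that would deny a PutObject, or None."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get(SSE_HEADER)
    if value is None:
        # The "Null" condition matches requests that omit the header entirely.
        return "DenyUnEncryptedObjectUploads"
    if value != required:
        # The "StringNotEquals" condition matches any other header value.
        return "DenyIncorrectEncryptionHeader"
    return None
```

An upload without the header, or with a value other than `AES256`, is rejected; only requests carrying `x-amz-server-side-encryption: AES256` pass both statements.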

Server-side encryption with AWS KMS–managed keys (SSE-KMS):

AWS Key Management Service (AWS KMS) is a service that combines secure, highly available hardware and software to provide a key management system scaled for the cloud. AWS KMS uses customer master keys (CMKs) to encrypt your Amazon S3 objects. You use AWS KMS via the Encryption Keys section in the IAM console or via AWS KMS APIs to centrally create encryption keys, define the policies that control how keys can be used, and audit key usage to prove they are being used correctly. The first time you add an SSE-KMS–encrypted object to a bucket in a region, a default CMK is created for you automatically. This key is used for SSE-KMS encryption unless you select a CMK that you created separately using AWS Key Management Service. Creating your own CMK gives you more flexibility, including the ability to create, rotate, disable, and define access controls, and to audit the encryption keys used to protect your data.

Besides creating a KMS key and a key policy, you need to create a bucket policy that enforces encryption as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html

Here is an example of a typical bucket policy:

Code Block
{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::kms-encrypted-test-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms",
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": "arn:aws:kms:us-east-1:123456789012:key/6789abcd-89ab-ef01-3456-456789abcdef"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::kms-encrypted-test-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=KMS. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.
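The relationship between the das.fs.s3-bucket.encryption.type property and the encryption headers that the bucket policies check for can be sketched as follows. The mapping is an assumption inferred from the header values in the example policies, not a description of Datameer's internal code; the function name and the `kms_key_id` parameter are hypothetical, and the source does not state how a specific KMS key is selected.

```python
from typing import Dict, Optional


def encryption_headers(encryption_type: str,
                       kms_key_id: Optional[str] = None) -> Dict[str, str]:
    """Map an encryption type (AES or KMS) to S3 server-side encryption headers."""
    if encryption_type == "AES":
        return {"x-amz-server-side-encryption": "AES256"}
    if encryption_type == "KMS":
        headers = {"x-amz-server-side-encryption": "aws:kms"}
        if kms_key_id:
            # Assumption: a specific key is named only when one is provided.
            headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
        return headers
    raise ValueError(f"unsupported encryption type: {encryption_type}")
```

With `AES`, every request carries the header that the SSE-S3 policy requires; with `KMS`, the header value is `aws:kms`, optionally accompanied by the key ARN that a policy can also check.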

Encrypting Data in Transit

Datameer implicitly supports encryption mechanisms for data in transit that are supported by Amazon EMR and S3. The encryption mechanisms are EMR release version and application (e.g., Hadoop, Tez, etc.) specific, and do not require any special handling in Datameer other than configuration.

...


tez.runtime.shuffle.ssl.enable=true
tez.runtime.shuffle.keep-alive.enabled=true

  • Amazon S3

Datameer calls Amazon S3 using REST over HTTPS. HTTPS is the default protocol for the S3 client SDK used by Datameer. No further configuration is necessary.

...

To enforce in-transit encryption for all calls from the client browser to the Datameer app server, SSL must be enabled in Jetty. If you use a custom certificate, install it on the Datameer EC2 instance. Instructions are provided in Datameer's Installation Guide.

Using the REST API for EMR

Datameer's REST API is available to view and update EMR configurations. See Datameer's EMR REST API.