Set up Datameer on EMR

INFO

EMR is an Amazon service that lets you run use cases on single-purpose, short-lived clusters that automatically scale to meet demand, or on long-running, highly available clusters using the multi-master deployment mode. The ability to expand or shrink hardware processing hours based on your needs is useful for scheduling jobs whose resources are limited or unavailable at busy times, and for ad hoc workloads with fluctuating resource requirements.


Setting up Datameer on EMR

INFO

To set up Datameer on Amazon EMR, you have to be an administrator.

To set up Datameer: 

  1. Open the Admin Tab and select "Hadoop Cluster". The current configuration for your Hadoop cluster is displayed.
     
  2. Click "Edit". The configuration page opens.
     
  3. Select "EMR Hadoop Cluster" from the drop-down in the "Cluster Mode" section.
  4. Enter your Amazon S3 bucket address and the path to the storage folder. 
    INFO: Datameer uses S3 as storage for all files, both permanent and intermediate, for additional security.

  5. If needed, activate the check box "Use EC2 IAM Role?" to authenticate via IAM role.
  6. If needed, authenticate to your S3 bucket using your access key/secret.
  7. Select the mode for connecting to the EMR Cluster from the drop-down and set the configuration. 
    INFO: EMR Cluster Name and EMR Cluster ID are validated when saving the configuration.
    INFO: With EMR Cluster Name mode, enter the name of the cluster running EMR. Set the polling interval time in seconds for Datameer to check if there is a cluster with the name entered above.

    INFO: With EMR Cluster ID mode, enter the ID of the cluster running EMR.

    INFO: With YARN Resource Manager mode you can provide the EMR Cluster master node hostname directly.
  8. Configure the default property values or enter additional Hadoop distribution-specific properties as well as custom properties.
  9. Select the severity of messages to be logged and confirm with "Save". The EMR cluster configuration is now complete.
    INFO: The logging customization field allows you to record exactly what is needed.

Security

S3 Authentication

When configuring the EMR Hadoop Cluster, you are presented with two options for authenticating to S3:

  • Access Key/Secret
  • IAM Role

Access Key/Secret

Datameer uses the Amazon S3 REST API, which in turn uses a custom HTTP scheme based on a keyed-HMAC (Hash-based Message Authentication Code) for authentication. To authenticate a request, you first concatenate selected elements of the request to form a string. You then use your AWS secret access key to calculate the HMAC of that string. The output of the HMAC algorithm is called the signature because it simulates the security properties of a real signature. This signature is added to the request in the standard HTTP Authorization header using the syntax "Authorization: AWS AWSAccessKeyId:Signature".

When the system receives an authenticated request, it fetches the AWS secret access key that you claim to have and uses it in the same way to compute a signature for the message it received. It then compares the signature it calculated against the signature presented by the requester. If the two signatures match, the system concludes that the requester must have access to the AWS secret access key and therefore acts with the authority of the principal to whom the key was issued. If the two signatures do not match, the request is dropped and the system responds with an error message.
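The sign-and-verify flow described above can be sketched in a few lines of Python. This is an illustration only, not Datameer's or Amazon's implementation: the function names are hypothetical, the string to sign is simplified (it omits the canonicalized x-amz-* headers), and the use of HMAC-SHA1 with Base64 encoding is an assumption based on the legacy AWS scheme that uses this header format.

```python
import base64
import hashlib
import hmac


def build_string_to_sign(method, content_md5, content_type, date, resource):
    # Concatenate selected request elements, separated by newlines.
    # Simplification: canonicalized x-amz-* headers are left out.
    return "\n".join([method, content_md5, content_type, date]) + "\n" + resource


def sign(secret_key, string_to_sign):
    # Assumption: HMAC-SHA1 over the UTF-8 string, Base64-encoded.
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key_id, signature):
    # Value of the HTTP Authorization header: "AWS AWSAccessKeyId:Signature".
    return f"AWS {access_key_id}:{signature}"


def verify(secret_key, string_to_sign, presented_signature):
    # Server side: recompute the signature with the stored secret key and
    # compare it with the presented one in constant time.
    expected = sign(secret_key, string_to_sign)
    return hmac.compare_digest(expected, presented_signature)


if __name__ == "__main__":
    # Placeholder credentials and resource, for illustration only.
    to_sign = build_string_to_sign(
        "GET", "", "", "Tue, 27 Mar 2007 19:36:42 +0000", "/example-bucket/data.csv"
    )
    signature = sign("EXAMPLE-SECRET-KEY", to_sign)
    print("Authorization:", authorization_header("EXAMPLEACCESSKEYID", signature))
    print("verified:", verify("EXAMPLE-SECRET-KEY", to_sign, signature))
```

Because the secret key itself never travels with the request, a matching signature shows only that the requester holds the key; changing any signed element of the request produces a different signature and the comparison fails.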

IAM Role

IAM roles provide a convenient alternative to using an access key/secret for authenticating to S3 from Amazon EC2 instances. When this option is selected, Datameer's S3 client uses the instance profile credentials to sign and authenticate the S3 requests. Instance profile credentials exist within the instance metadata associated with the IAM role for the EC2 instance. The EC2 instance on which Datameer runs is launched with the appropriate IAM role/instance profile, and the same role is used to launch the EMR cluster. It is usually sufficient to use the default EC2 instance profile, EMR_EC2_DefaultRole, to launch both the EMR cluster and the Datameer EC2 instance. The EMR instance, EC2 instance, and S3 bucket must be in the same AWS Region.

Encryption

Datameer uses S3 as storage for both permanent and intermediate files. Datameer does not write any intermediate or cached data locally on the cluster or to HDFS. The following diagram gives a high-level overview of supported encryption mechanisms.

Encrypting Data at Rest

Datameer supports encrypting data at rest on S3. The following server-side encryption mechanisms for S3 are supported within Datameer. Datameer does not support Amazon S3 Client-Side Encryption.

Note that Datameer does not support explicit encryption of cached preview data, properties, configuration, deployment artifacts, or log files on the Datameer EC2 instance, although Amazon EC2 supports encryption of local disks.

LUKS encryption:

The Amazon EC2 instance store volumes and the attached Amazon EBS volumes of cluster instances are encrypted using LUKS. For more information about LUKS encryption, see the LUKS on-disk specification. At-rest encryption doesn't encrypt the EBS root device volume (boot volume).

Amazon EMR version 5.7.0 or later supports encryption of the EBS root device volume by specifying a custom AMI. For more information, see Customizing an AMI in the Amazon EMR Management Guide and How to protect data at rest with Amazon EC2 instance store encryption.

Server-side encryption with Amazon S3-Managed encryption keys (SSE-S3):

Server-side encryption with Amazon S3-managed encryption keys (SSE-S3) employs strong multi-factor encryption. Amazon S3 encrypts each object with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses 256-bit Advanced Encryption Standard (AES-256) to encrypt your data.

To use SSE-S3, you need to create a bucket policy that enforces encryption as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html

Here is an example of a typical bucket policy:

{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aes-encrypted-test-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::aes-encrypted-test-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=AES. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.
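To make the effect of the two Deny statements in the SSE-S3 bucket policy above concrete, here is a small Python sketch. It is a toy model written for this page, not an IAM policy evaluator, and the function name is_put_denied is hypothetical. It only mirrors the two conditions: a PutObject request is denied when the x-amz-server-side-encryption header is absent, or when it is present with a value other than AES256.

```python
ENCRYPTION_HEADER = "x-amz-server-side-encryption"


def is_put_denied(request_headers, required_value="AES256"):
    """Return True if the example bucket policy would deny this PutObject."""
    value = request_headers.get(ENCRYPTION_HEADER)
    # "DenyUnEncryptedObjectUploads": the Null condition matches when the
    # encryption header is missing from the request.
    if value is None:
        return True
    # "DenyIncorrectEncryptionHeader": StringNotEquals matches when the
    # header carries any value other than the required one.
    return value != required_value


if __name__ == "__main__":
    print(is_put_denied({}))                              # header missing
    print(is_put_denied({ENCRYPTION_HEADER: "aws:kms"}))  # wrong value
    print(is_put_denied({ENCRYPTION_HEADER: "AES256"}))   # accepted
```

This is why the Datameer property matters: without it, uploads would not carry the header and the policy would reject them.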

Server-side encryption with AWS KMS–managed keys (SSE-KMS):

AWS Key Management Service (AWS KMS) is a service that combines secure, highly available hardware and software to provide a key management system scaled for the cloud. AWS KMS uses customer master keys (CMKs) to encrypt your Amazon S3 objects. You use AWS KMS via the Encryption Keys section in the IAM console or via AWS KMS APIs to centrally create encryption keys, define the policies that control how keys can be used, and audit key usage to prove they are being used correctly. The first time you add an SSE-KMS–encrypted object to a bucket in a region, a default CMK is created for you automatically. This key is used for SSE-KMS encryption unless you select a CMK that you created separately using AWS Key Management Service. Creating your own CMK gives you more flexibility, including the ability to create, rotate, disable, and define access controls, and to audit the encryption keys used to protect your data.

Besides creating a KMS key and a key policy, you need to create a bucket policy that enforces encryption as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html

Here is an example of a typical bucket policy:

{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::kms-encrypted-test-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms",
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": "arn:aws:kms:us-east-1:123456789012:key/6789abcd-89ab-ef01-3456-456789abcdef"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::kms-encrypted-test-bucket/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=KMS. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.

Encrypting Data In Transit

Datameer implicitly supports encryption mechanisms for data in transit that are supported by Amazon EMR and S3. The encryption mechanisms are EMR release version and application (e.g., Hadoop, Tez, etc.) specific, and do not require any special handling in Datameer other than configuration.

  • Hadoop

See Hadoop in Secure Mode in Apache Hadoop documentation. In this configuration, Hadoop RPC is set to "Privacy" and uses SASL. This is activated in Amazon EMR when in-transit encryption is enabled.

To enable in-transit encryption on the EMR cluster:

  1. Rename your certificates and put them in a zip file as documented in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-certificates.html.
  2. Upload the zip file to S3.
  3. Create a new Security Configuration in the EMR console. Disable at-rest encryption and enable in-transit encryption.
  4. In the TLS certificate provider section, choose "PEM" as the certificate provider type and enter the fully qualified S3 path of the certificates zip file (e.g., "s3://test-bucket/my-certs.zip") in the S3 object field.
  5. Choose this Security Configuration when creating your EMR cluster.
    INFO: If the path to the certificates zip file is wrong, the Create Security Configuration and Create Cluster operations still succeed, but the cluster is then terminated with the error "Access denied when trying to download from s3://test-bucket/my-certs.zip".

In Datameer, the following properties need to be specified in the Hadoop Cluster page custom properties section:

hadoop.rpc.protection=privacy
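The certificate packaging step for Hadoop in-transit encryption can be sketched in Python as follows. The helper build_cert_zip is hypothetical, and the file names it expects (privateKey.pem and certificateChain.pem as required, trustedCertificates.pem as optional, all at the root of the archive) are an assumption based on the Amazon EMR certificate documentation linked above; verify them against that page for your EMR release.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumed artifact names; check the linked Amazon EMR documentation.
REQUIRED = ("privateKey.pem", "certificateChain.pem")
OPTIONAL = ("trustedCertificates.pem",)


def build_cert_zip(cert_dir, zip_path):
    """Pack already-renamed PEM files from cert_dir into zip_path."""
    cert_dir = Path(cert_dir)
    missing = [name for name in REQUIRED if not (cert_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing required PEM files: {missing}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED + OPTIONAL:
            source = cert_dir / name
            if source.is_file():
                # arcname keeps each file at the root of the archive.
                archive.write(source, arcname=name)
    return Path(zip_path)


if __name__ == "__main__":
    # Demo with dummy file contents in a temporary directory.
    with tempfile.TemporaryDirectory() as workdir:
        base = Path(workdir)
        (base / "privateKey.pem").write_text("dummy key")
        (base / "certificateChain.pem").write_text("dummy chain")
        archive_path = build_cert_zip(base, base / "my-certs.zip")
        with zipfile.ZipFile(archive_path) as archive:
            print(archive.namelist())
```

The resulting zip file is what you upload to S3 and reference in the S3 object field of the Security Configuration.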

  • Tez

See Tez Runtime Configuration in the Tez documentation. In this configuration, the Tez Shuffle Handler uses TLS (tez.runtime.shuffle.ssl.enable).

In Datameer, the following properties need to be specified in the Hadoop Cluster page custom properties section:

tez.runtime.shuffle.ssl.enable=true
tez.runtime.shuffle.keep-alive.enabled=true

  • Amazon S3

Calls are made to Amazon S3 using REST over HTTPS in Datameer. HTTPS is the default protocol for the S3 client SDK used by Datameer. No further configuration is necessary.

  • MySQL

When MySQL is installed on Amazon RDS, SSL needs to be enabled on MySQL. Install the corresponding server certificate on the Datameer instance.

  • Datameer Web/REST API

To enforce in-transit encryption for all calls from the client browser to the Datameer app server, SSL must be enabled in Jetty. If using a custom certificate, install it on the Datameer EC2 instance. These instructions are provided in Datameer's Installation Guide.
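The custom properties named in the encryption sections above are entered as key=value lines in the custom properties section of the Hadoop Cluster page. As a convenience, the following Python sketch assembles those lines from a few switches. The function custom_properties and its parameters are hypothetical; only the property names and values come from this page.

```python
def custom_properties(encryption=None, hadoop_rpc_privacy=False, tez_shuffle_tls=False):
    """Return the key=value lines for the selected encryption options."""
    props = {}
    if encryption is not None:
        # This page documents the values AES (SSE-S3) and KMS (SSE-KMS).
        if encryption not in ("AES", "KMS"):
            raise ValueError(f"unsupported encryption type: {encryption}")
        props["das.fs.s3-bucket.encryption.type"] = encryption
    if hadoop_rpc_privacy:
        props["hadoop.rpc.protection"] = "privacy"
    if tez_shuffle_tls:
        props["tez.runtime.shuffle.ssl.enable"] = "true"
        props["tez.runtime.shuffle.keep-alive.enabled"] = "true"
    # Dictionaries preserve insertion order, so the output order is stable.
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    print(custom_properties(encryption="KMS", hadoop_rpc_privacy=True, tez_shuffle_tls=True))
```

Keeping the lines in one place makes it easier to check that the Datameer settings match the Security Configuration chosen for the EMR cluster.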

Using the REST API for EMR

Datameer's REST API is available to view and update EMR configurations. See Datameer's EMR REST API.