Set up Datameer on Google Dataproc

INFO

Datameer X can use Google Dataproc as its execution engine if it is deployed in Google Cloud Platform against a Google Cloud Platform 1.4 cluster.

INFO

Find Google's documentation here.

Configure Datameer X to Use the Google Dataproc Cluster

INFO

We recommend installing Datameer X on a separate machine. This machine can be on-premises or within the Google cloud. For best performance, use a machine in the Google cloud within the same network as the cluster.

Datameer X and the Google Dataproc cluster must be in the same VPC and subnet.
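Because Datameer X and the cluster must share a VPC and subnet, it can help to confirm from the Datameer X machine that the cluster's master node is reachable before configuring anything. The following is a minimal sketch, not part of Datameer X. It assumes the master node is named "<clustername>-m" and listens on the YARN ports used in the configuration steps in this section (8032, 8030 and 8088). The function name `can_connect` and the host "my-cluster-m" are placeholders. A successful connection only shows that the port is reachable; it does not prove that both machines are in the same subnet.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder host: replace "my-cluster" with your cluster name.
    master = "my-cluster-m"
    for port in (8032, 8030, 8088):
        status = "reachable" if can_connect(master, port) else "NOT reachable"
        print(f"{master}:{port} is {status}")
```

If a port is reported as not reachable, check the firewall rules and the network placement of both machines before continuing.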

Requirements for the configuration of the Google Dataproc cluster are:

To configure Datameer X

  1. Open the "Admin" tab and select "Hadoop Cluster"
    INFO: You have to be logged in as an administrator.
     
  2. Select "Edit" at the bottom. The settings page opens.
     
  3. Select "Google Cloud DataProc" from the drop-down 'Mode'. 
    INFO: For Kerberos security, select "Google Cloud Dataproc (secured)".
     
  4. Enter "gs://<bucket-name>" in the field 'Name Node'. 
     
  5. Enter the path for the Datameer X 'Private Folder'. 
    INFO: If you enter a path, Datameer X stores the private information in Hadoop's file system.
    INFO: If you leave the field empty, Datameer X stores the information in the user's home folder.
     
  6. If needed, mark the checkbox to enable impersonation.
     
  7. If needed, mark the checkbox to enable HDFS transparent encryption. 
    INFO: If a Google Cloud Storage bucket is used as private folder, it is not recommended to mark the checkbox. 
    INFO: If it is required to use transparent encryption, then the Google Cloud Storage bucket has to be set up with 'encryption managed by Google'.
     
  8. Mark the checkbox 'Use Google IAM Role' if Datameer X is deployed on an IAM enabled Google Cloud instance. 
    INFO: When the checkbox is not marked, a Google Service Account JSON key is used instead. Choose the required file by clicking on "Choose File".  
     
  9. Enter "<clustername>-m:8032" in the field 'YARN Resource Manager Address'.  
     
  10. Enter "http://<clustername>-m:8088" in the field 'YARN Resource Manager Webapp Address'.
     
  11. Enter "<clustername>-m:8030" in the field 'YARN Resource Manager Scheduler Address'.
     
  12. If needed, change the YARN application class path to match your yarn-site.xml. To look up the class path, run 'yarn classpath' on the command line.
     
  13. Set the maximum number of jobs that may run simultaneously.
     
  14. View the prefilled properties in section 'Hadoop Properties'.
    INFO: To connect to a secured cluster (Kerberos), uncomment the last line, 'hadoop.rpc.protection=privacy', in the field 'Hadoop Specific Properties'.
     
  15. If needed, change the entries in the section 'Logging Settings'.
     
  16. Confirm the settings with "Save". The settings are saved and the 'File Browser' opens. Configuring Datameer X to use the Google Dataproc cluster is finished.
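
The values entered in steps 4 and 9 to 11 all follow from two names: the bucket and the cluster. As a minimal sketch, the field values can be derived as shown below. The helper `dataproc_settings` and the sample names "my-cluster" and "my-bucket" are hypothetical and not part of Datameer X; the helper only formats strings according to the patterns given in the steps above.

```python
def dataproc_settings(cluster_name: str, bucket_name: str) -> dict:
    """Return the values to type into the 'Hadoop Cluster' form.

    Hypothetical helper for illustration only. It assumes the master
    node host follows the "<clustername>-m" pattern from the steps above.
    """
    master = f"{cluster_name}-m"
    return {
        "Name Node": f"gs://{bucket_name}",
        "YARN Resource Manager Address": f"{master}:8032",
        "YARN Resource Manager Webapp Address": f"http://{master}:8088",
        "YARN Resource Manager Scheduler Address": f"{master}:8030",
    }


if __name__ == "__main__":
    # "my-cluster" and "my-bucket" are placeholder names.
    for field, value in dataproc_settings("my-cluster", "my-bucket").items():
        print(f"{field}: {value}")
```

Printing the values side by side makes it easier to spot a mistyped port or a missing "-m" suffix before saving the form.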