Hadoop provides scalable data storage using the Hadoop Distributed File System (HDFS) and fast parallel data processing on a fault-tolerant cluster of computers. Learn more about Hadoop.

See Hadoop and Datameer to learn more about Hadoop and how to use it with Datameer.

...

General configuration

  1. Click the Admin tab.
  2. Click the Hadoop Cluster tab at the left side. The current settings are shown.
  3. Click Edit to make changes.
  4. Click Save when you are finished making changes.

...

  1. Click the Admin tab.
  2. Click the Hadoop Cluster tab at the left side. The current settings are shown.
  3. Click Edit to make changes.
  4. Select Hadoop Cluster for the mode.
  5. Specify the name node and add a private folder path or use impersonation if applicable.
    Whitespace isn't supported in file or folder paths. Avoid setting up Datameer storage directories (storage root path, temp paths, execution framework specific staging directories, etc.) with whitespace in the path.

    Note

    Impersonation notes:
    - There is one-to-one mapping between the Datameer user and the OS user.  
    - The OS user who is launching the Datameer process must be a sudoer.
    - The temp folders on the local file system of the Datameer installation and in the Hadoop cluster (for Datameer) must have read/write access:

      • <Datameer_Installation_Folder>/tmp (Local FileSystem)
      • <Datameer_Private_Folder>/temp (Hadoop Cluster and MapR)

    Learn about Enabling Secure Impersonation with Datameer.

  6. Specify YARN settings.
  7. Use the properties text boxes to add Hadoop and custom properties. 
    Enter a name and value to add a property, or delete a name and value pair to delete the property.

    Note

    Within these edit fields, backslash (\) characters are interpreted by Datameer as an escape character rather than a plain text character. In order to produce the actual backslash character, you have to type two backslashes:

    Code Block
    example.property=example text, a backslash \\ and further text

    The second backslash is needed as you are effectively editing a Java properties file in these edit fields.


  8. Logging options. Select the severity of messages to be logged. Use the logging customization field to record exactly what is needed.
  9. Click Save when you are finished making changes.
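The two input rules from the steps above (no whitespace in storage paths, doubled backslashes in property values) can be pre-checked before pasting values into the form. The following Python sketch is only an illustration; the helper names are hypothetical and not part of Datameer.

```python
def has_whitespace(path: str) -> bool:
    """Return True if the path contains any whitespace character."""
    return any(ch.isspace() for ch in path)


def escape_backslashes(value: str) -> str:
    """Double every backslash so the edit field yields a literal backslash."""
    return value.replace("\\", "\\\\")


# Example values (assumed for illustration only).
print(has_whitespace("/user/datameer/private folder"))
print("example.property=" + escape_backslashes("example text, a backslash \\ and further text"))
```

The second call prints the same line as the example.property entry shown in step 7.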

...


Autoconfigure grid mode 

...

Warning

This feature is not supported with Cloudera Manager Safety Valve.

...

  1. Go to Admin tab > Hadoop Cluster.
  2. Select Autoconfigure Cluster in the Mode field.
  3. In the Hadoop Configuration Directory field, enter the directory where the configuration files for Hadoop are located.
  4. Enter the path to the folder where Datameer puts its private information in Hadoop's file system in the Datameer Private Folder field. 
     
  5. Enter the number of concurrent jobs.
  6. Select whether to use secure impersonation. 
  7. Edit the Hadoop or custom properties as necessary.
    If the yarn.timeline-service.enabled property is true in the Hadoop conf files, set yarn.timeline-service.enabled=false as a custom property. (This change is not needed as of Datameer v6.1)
     
  8. Click Save.
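To decide whether the yarn.timeline-service.enabled override from step 7 is needed, the Hadoop configuration files can be inspected first. The sketch below assumes the standard Hadoop configuration XML layout (property elements with name and value children); the function names and the sample XML are illustrative only.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text: str, name: str):
    """Return the value of a property from Hadoop-style configuration XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def needs_timeline_override(xml_text: str) -> bool:
    """True when yarn.timeline-service.enabled is set to true in the given XML."""
    value = read_hadoop_property(xml_text, "yarn.timeline-service.enabled")
    return value is not None and value.lower() == "true"


# Sample content standing in for a yarn-site.xml file (assumed for illustration).
SAMPLE = """<configuration>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

if needs_timeline_override(SAMPLE):
    # This is the custom property to enter in Datameer.
    print("yarn.timeline-service.enabled=false")
```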

To update your Datameer Autoconfig Grid Mode cluster configuration, click Edit on the Hadoop Cluster page and then click Save. Once saving completes, the updated configuration is applied. Updates to the settings are not shown in conductor.log, but they are displayed under the Hadoop Properties label when you click Edit again for Autoconfigure Grid Mode.

...

  1. Click the Admin tab at the top of the page.
  2. Click the Hadoop Cluster tab at the left side. The current settings are shown.
  3. Click Edit to make changes and choose MapR in the mode list.
  4. Add the cluster name, the Datameer private folder, and the Max Concurrent Jobs value, and check the boxes if you are using Simple Impersonation for Datameer to submit jobs and access HDFS on behalf of the Datameer user.
     
    There is one-to-one mapping between the Datameer user and the OS user.  
    The OS user who is launching the Datameer process must be a sudoer.
    The temp folders on the local file system of the Datameer installation and in the Hadoop cluster (for Datameer) must have read/write access.

      • <Datameer_Installation_Folder>/tmp (Local FileSystem)
      • <Datameer_Private_Folder>/temp (Hadoop Cluster and MapR)
    Note

    Connecting to a secure MapR cluster

    1) Obtain the MapR ticket for the user who is running the Datameer application. Execute the following command on the shell:

    Code Block
    maprlogin password -user <user_who_starts_datameer>

    2) Install Datameer and open <Datameer_Home>/etc/das-env.sh and add the following system property to the Java arguments:

    Code Block
    -Dmapr.secure.mode=true

    3) Start and configure Datameer using MapR Grid Mode.

    The option to connect using Secure Impersonation is now available.

    4) (Optional) If there is a failure in saving the configuration:

    Code Block
    Caused by: java.io.IOException: Can't get Master Kerberos principal for use as renewer

    Add the following custom Hadoop properties under the Hadoop Admin page: 

    Code Block
    yarn.resourcemanager.principal=<value>

    The value for this property can be found in the yarn-site.xml file in your Hadoop cluster configuration.

    The steps to achieve impersonation are the same as for a secured Kerberos cluster.

  5. If required, enter properties. Enter a name and value to add a property, or delete a name and value pair to delete that property.
  6. Logging options. Select the severity of messages to be logged. It is also possible to write custom log settings to record exactly what is needed.
  7. Click Save when you are finished making changes.
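The read/write requirement on the local temp folder from step 4 can be checked for the OS user who launches Datameer. This is a minimal sketch that covers the local file system only, not the folder in the Hadoop cluster; the function name and the example path are assumptions.

```python
import os


def is_read_write(path: str) -> bool:
    """Return True if path is a directory the current OS user can read and write."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)


# Hypothetical installation path; substitute <Datameer_Installation_Folder>/tmp.
print(is_read_write("/opt/datameer/tmp"))
```

Run the check as the same OS user that starts the Datameer process, since access rights are evaluated for the current user.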

 Configuring High Availability

...

  1. Specify the NameService that the Datameer instance should work with in the HDFS NameNode text box: hdfs://nameservice1
  2. Specify the following Hadoop properties in the Custom Property field:

    Code Block
    Custom Properties
    ### HDFS Name Node (NN) High Availability (HA) ###
    dfs.nameservices=nameservice1
    dfs.ha.namenodes.nameservice1=nn1,nn2
    dfs.namenode.rpc-address.nameservice1.nn1=<server-address>:8020
    dfs.namenode.rpc-address.nameservice1.nn2=<server-address>:8020
    dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

    If HDFS HA is configured for automatic failover by using Quorum Journal Manager (QJM), you need to add the following additional custom properties:

    Code Block
    ### HDFS HA Automatic Failover
    # By using the Quorum Journal Manager (QJM)
    dfs.ha.automatic-failover.enabled=true
    ha.zookeeper.quorum=<zookeeperHost1>.<domain>.<tld>:2181,<zookeeperHostn>.<domain>.<tld>:2181
  3. Check the current NameNode (NN) setting within the database:

    Code Block
    mysql -udap -pdap dap -Bse "SELECT uri FROM data" | cut -d"/" -f3 | sort | uniq

    The command above should return only one result, the former <host>.<domain>.<tld>:<port> value configured under Admin tab > Hadoop Cluster > Storage Settings > HDFS NameNode.

  4. Update the paths to the new location in the Datameer DB:

    Code Block
    ./bin/update_paths.sh hdfs://<old.namenode>:8020/<root-path> hdfs://nameservice1/<root-path>
  5. Check to ensure the new NameNode has been applied to the database and that the path is correct:

    Code Block
    mysql -udap -pdap dap -Bse "SELECT uri FROM data" | cut -d"/" -f3,4,5 | sort | uniq

    The command above should return only one result, the virtual NameNode value including the path configured under Admin tab > Hadoop Cluster > Storage Settings > Datameer Private Folder.
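To see what the cut, sort, and uniq pipeline in steps 3 and 5 extracts from each URI, the same logic can be mimicked in a few lines of Python. The URIs below are made-up examples, and the helper names are hypothetical.

```python
def cut_fields(uri: str, fields):
    """Mimic `cut -d"/" -f<fields>` for one line (field numbers are 1-based)."""
    parts = uri.split("/")
    return "/".join(parts[i - 1] for i in fields if i <= len(parts))


def distinct_values(uris, fields):
    """Mimic `cut ... | sort | uniq` over many URIs."""
    return sorted({cut_fields(u, fields) for u in uris})


# Example rows as they might be returned by "SELECT uri FROM data" (assumed).
uris = [
    "hdfs://nameservice1/user/datameer/importjobs/1",
    "hdfs://nameservice1/user/datameer/workbooks/2",
]
print(distinct_values(uris, [3]))        # ['nameservice1']
print(distinct_values(uris, [3, 4, 5]))  # ['nameservice1/user/datameer']
```

Field 3 is the NameNode or NameService part because the two slashes after the scheme produce an empty second field. More than one entry in the result means some paths still point to another location.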

YARN

...

  1. Specify the resource manager that the Datameer instance should work with in the Yarn Resource Manager Address field: yarnRM.

     

      YARN Application Classpath is a comma-separated list of CLASSPATH entries.


  2. Specify the following Hadoop properties in the Custom Property field:

    Code Block
    Custom Properties
    ### Resource Manager (RM) YARN High Availability (HA) ###
    yarn.resourcemanager.cluster-id=yarnRM
    yarn.resourcemanager.ha.enabled=true
    yarn.resourcemanager.ha.rm-ids=rm1,rm2
    yarn.resourcemanager.recovery.enabled=true
    yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
    yarn.resourcemanager.zk-address=<server-address>:2181
    ## RM1 ##
    yarn.resourcemanager.hostname.rm1=<server>
    yarn.resourcemanager.address.rm1=<server-address>:8032
    yarn.resourcemanager.scheduler.address.rm1=<server-address>:8030
    yarn.resourcemanager.webapp.address.rm1=<server-address>:8088
    yarn.resourcemanager.resource-tracker.address.rm1=<server-address>:8031
    ## RM2 ##
    yarn.resourcemanager.hostname.rm2=<server>
    yarn.resourcemanager.address.rm2=<server-address>:8032
    yarn.resourcemanager.scheduler.address.rm2=<server-address>:8030
    yarn.resourcemanager.webapp.address.rm2=<server-address>:8088
    yarn.resourcemanager.resource-tracker.address.rm2=<server-address>:8031
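Because the per-ResourceManager entries in the listing above follow a regular pattern, they can be generated rather than typed by hand. The sketch below reuses the property names and ports from that listing; the function name and the host names are assumptions for illustration.

```python
def yarn_ha_properties(cluster_id, rm_hosts, zk_address):
    """Build the YARN ResourceManager HA custom property lines for the given hosts."""
    # Ports taken from the listing above.
    ports = {
        "address": 8032,
        "scheduler.address": 8030,
        "webapp.address": 8088,
        "resource-tracker.address": 8031,
    }
    rm_ids = [f"rm{i}" for i in range(1, len(rm_hosts) + 1)]
    lines = [
        f"yarn.resourcemanager.cluster-id={cluster_id}",
        "yarn.resourcemanager.ha.enabled=true",
        f"yarn.resourcemanager.ha.rm-ids={','.join(rm_ids)}",
        "yarn.resourcemanager.recovery.enabled=true",
        "yarn.resourcemanager.store.class="
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
        f"yarn.resourcemanager.zk-address={zk_address}",
    ]
    for rm_id, host in zip(rm_ids, rm_hosts):
        lines.append(f"yarn.resourcemanager.hostname.{rm_id}={host}")
        for suffix, port in ports.items():
            lines.append(f"yarn.resourcemanager.{suffix}.{rm_id}={host}:{port}")
    return lines


# Hypothetical hosts; replace them with the values of your cluster.
for line in yarn_ha_properties(
    "yarnRM", ["rm-a.example.com", "rm-b.example.com"], "zk.example.com:2181"
):
    print(line)
```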

...

Custom properties consist of a name (key) and a value, separated by '='. These properties are used to configure Hadoop jobs.
For example, you can specify the output compression codec for jobs by entering mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec into the custom property field.
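The name=value structure can be illustrated with a small parser that splits each line at the first '='. This is a simplified sketch and not Datameer's own parser: it skips blank lines and # comments and ignores the backslash escapes and line continuations of the full Java properties format.

```python
def parse_custom_properties(text: str) -> dict:
    """Parse name=value lines into a dict, splitting at the first '='."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line: {raw!r}")
        props[name.strip()] = value.strip()
    return props


example = """
# output compression
mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
"""
print(parse_custom_properties(example))
```

Splitting at the first '=' only means that a value may itself contain further '=' characters.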

...

There are some additional Datameer specific properties as well. These are:

| Name | Default value | Description | Location | Since |
|---|---|---|---|---|
| das.error.log.max-entries | -1 (unlimited) | The maximum number of errors that should be written per MR task. | das-job.properties | |
| das.error.log.replication-factor | 1 | The replication factor for the error logs that are written for dropped records. | das-job.properties | |
| das.preview.expression.records.max | 100 | Maximum array size that a function may return for previews. If this value is exceeded, the excess records are chopped off. This is only used on the Datameer server to keep the previews small. | das-conductor.properties | |
| das.join.deferred-file.memory-threshold | 52428800 (50MB) | Maximum size of memory which is used to cache values before falling back to file. This property is used in self joins and by the join strategies REDUCE_SIDE and PRE_SORT_MAP_SIDE. | das-job.properties | |
| das.join.map-side.memory.max-file-size | 15728640 (15MB) | Maximum size for the smaller join file of a join. Datameer uses different join strategies. The fastest join strategy is when one input of the two fits into memory. This property marks the upper bound when a file, depending on its size, is assumed to fit into memory. Note that depending on compression and record structure, a 15 MB file on disk can easily blow up to 500 MB in memory. | das-job.properties | |
| das.join.disabled-strategies | empty | A comma-separated list of join strategies that should explicitly not be used. Available are MEMORY_BACKED_MAP_SIDE, REDUCE_SIDE and PRE_SORT_MAP_SIDE. | das-job.properties | |
| das.io.compression.codecs.addition | datameer.dap.common.util.ZipCodec | Additional compression codecs that are added to io.compression.codecs. | das-common.properties | |
| das.splitting.max-split-count | | Datameer merges multiple small files into combined splits and splits large files into multiple smaller splits. Setting this property controls the maximum number of splits that should be created. This helps reduce the number of file handlers when RPC resources are lacking. | | |
| das.splitting.min-split-count | | Sets the minimum number of splits that should be created. This can be used to increase the number of map tasks for smaller jobs to better utilize the cluster. | | |
| das.jdbc.records.per.commit | 50 | Sets the JDBC batch size for export jobs. Helpful for debugging export issues during JDBC batch commit. | | |
| das.jdbc.import.transaction-isolation | TRANSACTION_READ_COMMITTED | The JDBC transaction isolation level that should be used for imports from a database. You can choose between TRANSACTION_NONE, TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ and TRANSACTION_SERIALIZABLE, which correspond to the transaction isolation levels from java.sql.Connection. | das-job.properties | |
| das.debug.job | false | Shortcut property for enabling a set of debug-related properties for map-reduce jobs, such as periodic thread dumps, periodic counter dumps, task-log retrieval, etc. | | 1.4.8 |
| das.debug.periodic-thread-dumps-period | (-1=disabled, >0 enabled) | A thread dump is taken every x ms and logged to stdout (so it ends up in the task logs). | | |
| das.debug.periodic-counter-dumps-period | (-1=disabled, >0 enabled) | A Hadoop counter dump is taken every x ms and logged to stdout (so it ends up in the task logs). | | 1.4.8 |
| das.debug.tasks.logs.collect.max | 20 | Number of task logs that are copied into the job-artifacts folder when the job completes. | | |
| das.debug.tasks.logs.collect.force | false | When true, copies the task logs into the job-artifacts folder even when the job completes successfully. | | |
| das.sampling.lookahead.maxdepth | 5 | Smart sampling uses information from the downstream sheets in a workbook while sampling data for the preview of a sheet. This parameter sets the number of downstream levels which are considered. | | 2.0 |
| das.job.trigger.type | | Tells who/what triggered a job. (USER, SCHEDULER, RESTAPI, IMPORTJOB, EXPORTJOB, WORKBOOK) | | 3.1 |
| das.job.hdfs.ownership | | The name of the owner for each file and folder that will be created during a job run in HDFS. | | 3.1 |
| das.job.hdfs.groupname | | The name of the group for each file and folder that will be created during a job run in HDFS. | | 3.1 |
| das.job.execution.username | | The name of the user that executes the job. | | 3.1 |
| das.job.configuration.owner.username | | The name of the user that owns the configuration of the job. | | 3.1 |
| das.map-tasks-per-node | 10 | Set the number of map tasks per node. | | 5.0 |
| das.reduce-tasks-per-node | 7 | Set the number of reduce tasks per node. | | 5.0 |
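The memory-related defaults in the table above are byte counts based on 1 MB = 1024 * 1024 bytes. A short sketch for converting a size in megabytes into the value to enter (the function name and the 30 MB figure are illustrative assumptions, not recommended settings):

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes to bytes using 1 MB = 1024 * 1024 bytes."""
    return megabytes * 1024 * 1024


print(mb_to_bytes(15))  # 15728640, the default of das.join.map-side.memory.max-file-size
print(mb_to_bytes(50))  # 52428800, the default of das.join.deferred-file.memory-threshold

# Example custom property line for a 30 MB limit.
print(f"das.join.map-side.memory.max-file-size={mb_to_bytes(30)}")
```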

Custom properties exclusive for Datameer Smart Execution 

Note

Available as of 5.0 with Smart Execution license

| Name | Default value | Description | Location | Since |
|---|---|---|---|---|
| tez.am.shuffle-vertex-manager.desired-task-input-size | 50 | Set the input size for each TEZ sub-task | das-common.properties | 5.0 |
| das.execution-framework | smart | Force a job to be run with a specific execution framework. Smart Execution = Smart; Optimized MapReduce = Tez; SmallJob Execution = SmallJob; Spark Smart Execution = Smart; Local Mode = local (default and the only execution framework in local mode) | | 5.0 |
| das.execution-framework.small-job.max-records | 1000000 | Set a threshold size for number of records to choose execution framework | | 5.0 |
| das.execution-framework.small-job.max-uncompressed | 1000000000 | Set a threshold size for uncompressed file size to choose execution framework | | 5.0 |

Configuring Number of Hadoop Cluster Nodes

...

  1. Add all available vcores from both head and worker nodes.
  2. Divide total number of vcores by the number of nodes in the cluster. This average constitutes the size of a single node. (If the node-based license allows 3 nodes, 3 times the average number of vcores can be used.)
  3. Update the YARN scheduler on the Hadoop cluster to the number of specified vcores.

    Code Block
    yarn.scheduler.capacity.<queue_name>.maximum-allocation-vcores=12
  4. Add a custom Hadoop property in Datameer to recognize the number of available vcores.

    For Tez:

    Code Block
    tez.queue.name=<queue_name>

    For Spark:

    Code Block
    spark.yarn.queue=<queue_name>
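The arithmetic in steps 1 and 2 can be written out as follows. This is a sketch under assumptions: the function name is hypothetical, the sample vcore counts are invented, and rounding down to a whole number of vcores is a choice made here, not a documented rule.

```python
def licensed_vcores(vcores_per_node, licensed_nodes):
    """Average vcores per node multiplied by the licensed node count, rounded down."""
    total = sum(vcores_per_node)                           # step 1: add all vcores
    return (total * licensed_nodes) // len(vcores_per_node)  # step 2: average * nodes


# Example: five nodes with 4 vcores each and a license for 3 nodes.
print(licensed_vcores([4, 4, 4, 4, 4], 3))  # 12
```

The result 12 matches the maximum-allocation-vcores value used in the example in step 3.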

...

Datameer's <datameer-install-path>/tmp folder stores temporary files required for Datameer.

...

  • tmp/tez-plugin-jars

If the /tmp folder is consuming too much space and other files are present in the folder, files that are older than three days can safely be purged.
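Before purging, it can help to list which files exceed the three-day age first. The sketch below only reports candidates and deletes nothing; it uses the file modification time as the age, which is an assumption, and the function name and example path are hypothetical.

```python
import os
import time
from typing import List, Optional


def find_stale_files(root: str, max_age_days: float = 3.0,
                     now: Optional[float] = None) -> List[str]:
    """Return files under root whose modification time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


# Hypothetical path; substitute <datameer-install-path>/tmp.
for path in find_stale_files("/opt/datameer/tmp"):
    print(path)
```

Review the printed list before removing anything, since files still in use by running jobs should be kept.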

...

Datameer's Distributed File System cache <datameer-install-path>cache/dfscache is used for caching data link and import job sample records as well as log and JDBC files.

...

Code Block
 <cache
        name="dfsCache"
        maxEntriesLocalHeap="10000"
        timeToIdleSeconds="60"
        timeToLiveSeconds="60"
        eternal="false"
        diskPersistent="false"
        overflowToDisk="false"
    >

This dfscache grows in size as the number of sample records from import jobs and data links increases. Follow the troubleshooting guide if the Datameer filesystem is full due to excessive reduce-part-#### files generated in the dfsCache folder.

The dfscache folder and its contents can be moved to a different location if needed by following the guide in our knowledge base.

...

Define the time zone that should be used for displaying dates in the UI and for parsing date strings that do not specify a time zone. Use default as the value to use the local server's time zone.

In conf/default.properties you can change the value designating the time zone:

system.property.das.default-timezone=default

If the time zone is changed on the machine where Datameer is running, Datameer must be restarted to show the new default time zone configuration.

Examples

| Time zone | Description |
|---|---|
| default | Local server time |
| PST | Pacific Standard Time |
| PST8PDT | This time zone changes to daylight saving time (DST) in the spring. The GMT offset is UTC/GMT -7 hours (PDT) during this time. In the fall it changes back to standard time; the GMT offset is then UTC/GMT -8 hours (PST). |
| CST | Central Standard Time |
| America/Los_Angeles | Time zone for Los Angeles (USA). This time zone changes to daylight saving time (DST) in the spring. The GMT offset is UTC/GMT -7 hours during this time. In the fall it changes back to standard time; the GMT offset is then UTC/GMT -8 hours. |
| EST5EDT | This time zone changes to daylight saving time (DST) in the spring. The GMT offset is UTC/GMT -4 hours (EDT) during this time. In the fall it changes back to standard time; the GMT offset is then UTC/GMT -5 hours (EST). |
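Changing the system.property.das.default-timezone entry in conf/default.properties can also be scripted. The following sketch works on the file content as a string, so reading and writing the file is left to the caller; the function name is hypothetical, and Datameer still has to be restarted afterwards as noted above.

```python
def set_default_timezone(properties_text: str, time_zone: str) -> str:
    """Return the properties text with the default time zone entry set to time_zone."""
    key = "system.property.das.default-timezone"
    out, found = [], False
    for line in properties_text.splitlines():
        if line.strip().startswith(key + "="):
            out.append(f"{key}={time_zone}")  # replace the existing entry
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}={time_zone}")  # append the entry if it was missing
    return "\n".join(out) + "\n"


current = "system.property.das.default-timezone=default\n"
print(set_default_timezone(current, "America/Los_Angeles"), end="")
```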