General Information about Custom Properties

A custom property consists of a name (key) and a value, separated by "=".

Example: 'das.job.health.check=true'
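
As a minimal sketch of the name/value format described above, the following Python helper splits a property line at the first "=". The function name `parse_custom_property` is hypothetical and not part of Datameer X; splitting only at the first "=" and trimming surrounding whitespace are assumptions made for illustration.

```python
def parse_custom_property(line: str) -> tuple:
    """Split a 'name=value' line into a (name, value) pair."""
    # Assumption: only the first '=' separates name and value,
    # so values may themselves contain '='.
    name, separator, value = line.partition("=")
    if not separator or not name.strip():
        raise ValueError(f"not a name=value pair: {line!r}")
    return name.strip(), value.strip()


print(parse_custom_property("das.job.health.check=true"))
```

A line without "=" raises a `ValueError` instead of being silently accepted, which makes typos in a property list easier to spot.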

List of DAS Properties

Set up the custom properties in the '/conf' folder within your Datameer X instance.

execution related properties - define parameters for a job that Datameer X sends to the cluster, e.g. container memory, split size.
conductor related properties - define the application behavior, e.g. proxy settings, preview size, thread pool.

Name Default Value Description

Execution Related Properties

das.io.compression.codecs.addition datameer.dap.common.util.ZipCodec Additional compression codecs that are added to io.compression.codecs

das.error.log.max-entries -1 (unlimited) The maximum number of errors that should be written per MR task.

das.error.log.replication-factor 1 The replication factor for the error logs that are written for dropped records.

das.join.deferred-file.memory-threshold 52428800 (50MB) Maximum amount of memory used to cache values before falling back to file. This property is used in self joins and by the join strategies REDUCE_SIDE and PRE_SORT_MAP_SIDE.

das.join.map-side.memory.max-file-size 15728640 (15MB) Maximum size for the smaller file of a join. Datameer X uses different join strategies. The fastest join strategy applies when one of the two inputs fits into memory. This property sets the upper bound up to which a file, based on its size, is assumed to fit into memory. Note that, depending on compression and record structure, a 15 MB file on disk can easily blow up to 500 MB in memory.

das.join.disabled-strategies empty A comma-separated list of join strategies that should explicitly not be used. Available are MEMORY_BACKED_MAP_SIDE, REDUCE_SIDE and PRE_SORT_MAP_SIDE.

das.jdbc.import.transaction-isolation TRANSACTION_READ_COMMITTED The JDBC transaction isolation level that should be used for imports from a database. You can choose between TRANSACTION_NONE, TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ and TRANSACTION_SERIALIZABLE, which correspond to the transaction isolation levels from java.sql.Connection.

das.job.trigger.type Tells who/what triggered a job. (USER, SCHEDULER, RESTAPI, IMPORTJOB, EXPORTJOB, WORKBOOK).

das.job.hdfs.ownership The name of the owner for each file and folder that will be created during a job run in HDFS.

das.job.hdfs.groupname The name of the group for each file and folder that will be created during a job run in HDFS.

das.job.execution.username The name of the user that executes the job.

das.job.configuration.owner.username The name of the user that owns the configuration of the job.

das.debug.tasks.logs.collect.max 20 Number of task logs that are copied into the job-artifacts folder when a job completes.

das.debug.tasks.logs.collect.force false When true, copies the task logs into the job-artifacts folder even when the job completes successfully.

das.jdbc.import.transaction-isolation TRANSACTION_READ_COMMITTED Sets a custom transaction isolation level for a JDBC connection to read data. Other possible values include TRANSACTION_NONE, TRANSACTION_READ_UNCOMMITTED, TRANSACTION_REPEATABLE_READ, and TRANSACTION_SERIALIZABLE.

das.import.hidden-file-prefixes "." Comma separated list of file prefixes which will be used to exclude files from import.

das.import.hidden-file-suffixes Comma separated list of file suffixes which will be used to exclude files from import.

das.import.wizard.max-splits 5 The maximum number of files/splits which are retrieved by the import-job wizard in order to create import preview records.

das.import.ssh.default-max-mappers 5 Default number to use for max mappers when using the ScpProtocol for import.

das.import.whole-text-file.max-size ByteUnit.MB.toBytes(4) Max size in bytes to read when importing/uploading an HTML file.

das.compute.column-metrics true Whether or not column metrics should be computed (Column metrics are a requirement for the flipside of a sheet).

das.merge-sort.max-files 200 Maximum number of files that should be opened at once when doing a merge sort. If this number is exceeded, the merge sort happens in multiple steps.

das.merge-sort.max-file-size 5242880 (5 MB) Maximum size of a file that can be included in a multi-step merge sort. Files bigger than that are not combined.

das.network.proxy.type HTTP Proxy settings to be used for outgoing HTTP/HTTPS connections, for example for web service import jobs. das.network.proxy.type can be HTTP, SOCKS, or SOCKSV4 (SOCKS v4).

das.network.proxy.host localhost

das.network.proxy.port 3128

das.network.proxy.blacklist localhost|127.0.0.1|192.168.*.*|*.datameer.com To exclude hosts from connecting through the configured proxy, specify a blacklist of hosts, IPs, and masks (with * acting as a wildcard).

das.network.proxy.username For proxies requiring authentication, set the username.

das.network.proxy.password For proxies requiring authentication, set the password.

das.tde-export.temp-location=/path java.io.tmpdir Controls the hard disk space used for TDE files if the DataNode hard disk is limited.

das.job.hadoop-progress.timeout (no default) Sets a duration, in seconds, after which a Datameer job will terminate with a "Timed out" status if there is no progress on the job. Progress is calculated as a percentage of overall job volume. Introduced in Datameer 7.5.

das.job.production-mode Set to true to apply Production Mode to an environment globally.

count Sets the minimum number of splits that should be created. This can be used to increase the number of map tasks for smaller jobs to better utilize the cluster.

das.jdbc.records.per.commit 50 Sets the JDBC batch size for export jobs. Helpful for debugging export issues during JDBC batch commit.

das.debug.job false Shortcut property for enabling a set of debug-related properties for map-reduce jobs, such as periodic thread dumps, periodic counter dumps, and task-log retrieval.

das.debug.periodic-thread-dumps-period (-1=disabled, >0 enabled) A thread dump is taken every x ms and logged to stdout (so it ends up in the task logs).

das.debug.periodic-counter-dumps-period (-1=disabled, >0 enabled) A Hadoop counter dump is taken every x ms and logged to stdout (so it ends up in the task logs).

das.sampling.lookahead.maxdepth 5 Smart sampling uses information from the downstream sheets in a workbook while sampling data for the preview of a sheet. This parameter sets the number of downstream levels which are considered.

das.map-tasks-per-node 10 Set the number of map tasks per node.

das.reduce-tasks-per-node 7 Set the number of reduce tasks per node.

das.job.reduce.tasks.correct true Whether the reducer count should be adapted based on the split count (map-task count).

das.job.concurrent-mr-jobs.new-graph-api 5 Only for new graph API: The maximum number of MR jobs that can run concurrently for each Datameer X job.

das.aggregation.map-cache-size ByteUnit.MB.toBytes(16) Size of Cache for Map-Side Aggregation in bytes. Less than or equal to 0 means disable Map-Side Aggregation.

das.debug.job-metadata enabled Possible values are those of the enum JobMetaDataMode (ENABLED, DISABLED, DEFLATED).

das.debug.job.report-progress-interval -1 Interval in milliseconds at which a reporting thread reports progress to Hadoop's TaskTracker. If set to <= 0, no thread is installed and progress is reported by mappers, reducers, readers, etc. as usual.

das.jobconf.cache-properties Properties like "das.plugin-registry.conf,map-runnable,reducer,mapred.input.dir" are evacuated to the distributed cache. Evacuation means that, before the job is submitted, the properties are stripped out of the job-conf and written into one file in the distributed cache. On the task side, before any serious work happens, the properties are read from the distributed cache and added back to the job-conf. Disable the whole mechanism by setting 'das.jobconf.cache-properties' to ''.

das.job.map-task.memory 2048

das.job.reduce-task.memory 2048

das.job.application-manager.memory 2048

das.job.reduce.tasks -1 The reduce count (datameer.dap.sdk.util.HadoopConstants.MAPRED#REDUCE_TASKS) is calculated dynamically. If for any reason you want to enforce a specific reduce count, set this property to a value greater than zero.

das.error.log.replication-factor 2 The replication factor for the error-logs which are getting written for dropped records.

das.splitting.map-wave-count 3 The map-wave count, which influences the number of map tasks DAS uses for internal and file-based import jobs.

das.splitting.map-slots.safety-factor 1 The number of map slots that are not used for the last map wave in order to absorb node failures.

das.splitting.disable-combining false Determines how splitting occurs. Setting the value to true ensures that files are not split based on blocks.

das.splitting.disable-individual-file-splitting false Determines how splitting occurs for an individual file. Setting the value to true ensures that a single file isn't split based on its blocks.

das.splitting.max-rewind-count 1024 * 1024 * 10 For support of MultipleSourceRecordParser. Determines the maximum number of bytes a middle split is rewound in case the split doesn't start with a valid raw record.

das.splitting.max-split-count Datameer X merges multiple small files into combined splits and splits large files into multiple smaller splits. Setting this property can control the maximum number of splits that should be created. This is good to reduce the number of file handlers due to a lack of RPC resources.

das.splitting.non-splittable-encodings UTF-16, UTF-16BE, UTF-16LE Lists the file encodings for which splitting is restricted to only file level. For example, a file with this kind of encoding is not split at the block level.

das.join.deferred-file.memory-threshold 52428800 Maximum size of memory which is used to cache values during join before falling back to file. 50 MB default.

das.join.map-side.memory.max-uncompressed-file-size 31457280

das.join.map-side.memory.max-file-size 10485760

das.join.disabled-strategies

das.range-join.map-side.max-uncompressed-file-size 31457280 Max uncompressed file size the file can have that we attempt to load into memory for a map-side range join.

das.range-join.map-side.max-records 10000 Max number of records the file can have that we attempt to load into memory for a map-side range join.

das.range-join.reduce-side.max-reducers -1 Max number of reducers to use for the reduce-side range join implementation; -1 means the number is determined based on input sizing.

das.range-join.reduce-side.min-bytes-per-reducer 536870912 Minimum number of bytes each reducer should see, used by auto reducer count algorithm to help determine the number of reducers.

das.range-join.reduce-side.volume-weighting-factor 0.7 Factor to control weighting of the reducer data volume when determining the number of reducers for the reduce-side range join.

das.cluster.hadoop.fs.cache.scheme.whitelist scp,sftp Allows using the Hadoop FileServer cache on the cluster. With the FileServer cache enabled for the SCP/SFTP schemes, there will be a single SSH connection for each 'das.cluster.hadoop.fs.cache.scheme.whitelist=scp,sftp'.

das.import.parquet.int96.timestamp.adjusted-to-utc, das.import.parquet.int64.timestamp.adjusted-to-utc true Timezone control for importing timestamps from Parquet.
- true: timezone values in Parquet files have been adjusted to UTC.
- false: timezone values in Parquet files have NOT been adjusted to UTC, and the assumed timezone is controlled by the `system.property.das.default-timezone` value in the default.properties config file.
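
The wildcard matching described for das.network.proxy.blacklist above can be illustrated with a short Python sketch. This is not Datameer's implementation: the function name `bypasses_proxy` is hypothetical, and using `fnmatch` for the `*` wildcard is an assumption (it also interprets `?` and `[...]`, which the real matcher may not).

```python
from fnmatch import fnmatchcase

# Default value taken from the das.network.proxy.blacklist entry.
DEFAULT_BLACKLIST = "localhost|127.0.0.1|192.168.*.*|*.datameer.com"


def bypasses_proxy(host: str, blacklist: str = DEFAULT_BLACKLIST) -> bool:
    """Return True if the host matches any '|'-separated blacklist pattern."""
    # Assumption: entries are separated by '|' and '*' matches any run of characters.
    patterns = [pattern for pattern in blacklist.split("|") if pattern]
    return any(fnmatchcase(host, pattern) for pattern in patterns)


for candidate in ("localhost", "192.168.1.5", "www.datameer.com", "example.org"):
    print(candidate, bypasses_proxy(candidate))
```

Under these assumptions, hosts on the blacklist connect directly, while a host such as `example.org` matches no pattern and would go through the configured proxy.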

das-conductor.properties

das.preview.expression.records.max 100 Maximum array size that a function may return for previews. If this value is exceeded the excess records are chopped off. This is only used on the Datameer X server to keep the previews small.

das.big-decimal.precision 32 During data ingestion, data field types of big integer and big decimal have the maximum precision rounded to this value.

das.expression.records.max Integer.MAX_VALUE Maximum array size that a function may return. If this value is exceeded the excess records are chopped off.

das.group-series-expression.records.max -1 Maximum number of records a group series expression can generate.

das.preview.generation.additional-records-ratio 0.2f All tasks together try to create a preview of maxPreviewSize*the-configured-ratio in order to balance tasks with fewer records.

das.preview.group-series-expression.records.max 100 Maximum number of records a group series expression can generate for a preview.

das.jdbc.import.sql.badwords commit, savepoint, rollback, set, checkpoint, grant, revoke, insert, update, delete, index, alter, create, lock, drop, truncate, ;, rename, comment, call, merge, explain, shutdown, show, describe, exec, go, declare List of words that are not allowed in SQL entered by the user in an SQL import. The string is split at each ',' and any spaces are part of the bad words. The string is parsed as a String[]. Examples of SQL containing bad words:
- select * from mytable; -- semicolon is invalid
- drop database dap -- drop is invalid
- create table ... -- create is invalid
- select "savepoint", * from table -- currently the bad words can't show up in column names or data either

This property should be updated on the das-conductor.properties file. After updating this property, restart Datameer.

das.conductor.default.record.sample.size 1,000 Controls the 'Records for schema detection' value that is set by default whenever a user creates a new Import Job, Data Link, or File Upload.
- Should be adjusted, e.g. when working with large JSON objects (should be decreased to 250 - 500).
- Available as of Datameer 11.2.3.
- A restart is needed after changing the property in the 'conf/default.properties' file.
- Once updated, it affects only newly created Import Jobs; the 'Records for schema detection' value for existing artifacts remains intact.
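
The comma splitting described for das.jdbc.import.sql.badwords above has a consequence that is easy to miss: because spaces are kept, " drop" and "drop" are different bad words. The following Python sketch illustrates this. The function names are hypothetical, and both the plain substring match and the lowercasing of the SQL are assumptions, not the documented Datameer behavior.

```python
def split_badwords(raw: str) -> list:
    """Split the property value at ',' only; spaces stay part of each word."""
    return [word for word in raw.split(",") if word]


def find_badwords(sql: str, raw: str) -> list:
    """Return the configured bad words that occur anywhere in the SQL text."""
    # Assumption: a case-insensitive substring match, which would also explain
    # why bad words cannot appear in column names or data.
    lowered = sql.lower()
    return [word for word in split_badwords(raw) if word in lowered]


print(find_badwords("drop database dap", "commit,drop,;"))
print(find_badwords('select "savepoint", * from table', "savepoint,drop"))
print(find_badwords("drop database dap", "commit, drop"))
```

In the last call the configured word is " drop" with a leading space, so under these assumptions it no longer matches a statement that starts with "drop".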

List of Hadoop Properties

Name	Default Value	Description
fs.AbstractFileSystem.hdfs.impl	org.apache.hadoop.fs.Hdfs	Required for successfully starting the YARN application, because these properties are stripped out when the DAS framework sets up the overlay.
io.compression.codecs.addition	datameer.dap.common.util.ZipCodec,datameer.dap.common.util.LzwCodec	Required for successfully starting the YARN application, because these properties are stripped out when the DAS framework sets up the overlay. Additional comma-separated compression codecs that are added to io.compression.codecs.
das.yarn.available-node-vcores	auto	This property sets the number of available node vcores and is used in order to calculate the optimal number of splits and tasks (numberOfNodes * das.yarn.available-node-vcores). It can be set either to the number of free CPUs per node, or it can be set to auto. In auto mode, Datameer X fetches information about the vCores from every node and sets das.yarn.available-node-vcores to the average of the available vCores.
das.job.health.check	true	Turns the DAS job configuration health check on for the cluster configuration.
das.local.exec.map.count	5	DAS property to influence the number of mappers in Local Execution mode.
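
The calculation described for das.yarn.available-node-vcores (numberOfNodes * das.yarn.available-node-vcores, with auto taking the average) can be sketched as follows. The function names are hypothetical, and rounding the average down to a whole number is an assumption; the source does not say how Datameer X rounds.

```python
def resolve_node_vcores(setting: str, vcores_per_node: list) -> int:
    """Resolve das.yarn.available-node-vcores to a number of vcores per node."""
    if not vcores_per_node:
        raise ValueError("at least one node is required")
    if setting == "auto":
        # Assumption: the average over all nodes, rounded down.
        return sum(vcores_per_node) // len(vcores_per_node)
    return int(setting)


def parallelism_basis(setting: str, vcores_per_node: list) -> int:
    """numberOfNodes * das.yarn.available-node-vcores."""
    return len(vcores_per_node) * resolve_node_vcores(setting, vcores_per_node)


print(parallelism_basis("auto", [8, 8, 4, 4]))
print(parallelism_basis("2", [8, 8, 4, 4]))
```

With four nodes reporting 8, 8, 4 and 4 vcores, auto resolves to 6 vcores per node under this sketch, while an explicit setting of 2 is used as given.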

List of Hive Properties

Name	Default Value	Description
dap.hive.use-datameer-file-splitter	false	Decides whether Datameer X uses Datameer's FileSplitter logic to generate RewindableFileSplits or Hive's own splitting logic.

List of Execution Framework Properties

Name	Default Value	Description
General
das.execution-framework	Tez	Sets the execution framework e.g. 'das.execution-framework=Spark' to run Spark as the execution framework or 'das.execution-framework=Tez' to run Tez as the execution framework.
TEZ
das.tez.session-pool.max-cached-sessions	auto	Maximum number of idle sessions to be held in cache. 0 means no session pooling; auto uses the number of nodes in the cluster as the maximum number of sessions.
das.tez.session-pool.max-idle-time	2m	The timeout for the sessions held in pool for which they wait for DAG to be submitted.
das.tez.session-pool.max-time-to-live	2h	Maximum amount of time a Tez session stays alive since its start time, irrespective of being idle or active.
tez.runtime.compress	true	Settings for intermediate compression with Tez.
tez.runtime.compress.codec	org.apache.hadoop.io.compress.SnappyCodec	Settings for intermediate compression with Tez.
framework.local.das.parquet-storage.max-parquet-block-size	67108864	For the local execution framework, sets the max Parquet block size to 64 MB by default (can't be raised beyond 64 MB, but can be lowered).
tez.shuffle-vertex-manager.desired-task-input-size	52428800	Sets the input size for each TEZ sub-task.
das.cluster.plugin.resources.fixer.enabled	true	Intercepts the YARN errors YARN-3591 and YARN-6641 and rebuilds the missing plug-in class files on the fly. Setting the value to 'false' disables this behavior.
YARN MR2
das.yarn.base-counter-wait-time	40000	YARN Counters take a while to show up at the Job History Server, so we use this base wait time + a small factor * the number of executed tasks to wait for the counters to finally show up.

List of Smart Execution Properties

These properties are exclusively for the Smart Execution license.

Name	Default Value	Description
das.execution-framework	smart	Forces a job to run with a specific execution framework: Smart Execution = Smart, Optimized MapReduce = Tez, SmallJob Execution = SmallJob, Spark Smart Execution = Smart, Local Mode = local (default and the only execution framework in local mode).
das.execution-framework.small-job.max-records	1000000	Set a threshold size for number of records to choose execution framework.
das.execution-framework.small-job.max-uncompressed	1000000000	Set a threshold size for uncompressed file size to choose execution framework.
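
To illustrate how the two small-job thresholds above might be applied, here is a hedged Python sketch. The function `choose_framework` is hypothetical. That both thresholds must be met, that the comparison is inclusive, and that Tez is the fallback are all assumptions; the source only states that the thresholds are used to choose the execution framework.

```python
# Defaults taken from the table above.
SMALL_JOB_MAX_RECORDS = 1_000_000
SMALL_JOB_MAX_UNCOMPRESSED = 1_000_000_000


def choose_framework(
    record_count: int,
    uncompressed_bytes: int,
    max_records: int = SMALL_JOB_MAX_RECORDS,
    max_uncompressed: int = SMALL_JOB_MAX_UNCOMPRESSED,
) -> str:
    """Pick an execution framework name from the input size."""
    # Assumption: a job counts as "small" only if it stays within both limits.
    if record_count <= max_records and uncompressed_bytes <= max_uncompressed:
        return "SmallJob"
    # Assumption: everything else runs on Tez.
    return "Tez"


print(choose_framework(50_000, 20_000_000))
print(choose_framework(5_000_000, 20_000_000))
```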

List of Windows Properties

Name	Default Value	Description
windows.das.join.disabled-strategies	MEMORY_BACKED_MAP_SIDE	By default, memory-backed joins are enabled on the Windows platform. To disable them, uncomment the property.
mapreduce.input.linerecordreader.line.maxlength	2097152	Controls the maximum line size (in characters) allowed before filtering it on read. The default value corresponds to a maximum of around 4 MB per line.

List of Error Handling Properties

Name	Default Value	Description
workbook.error-handling.default	IGNORE	Legal values for workbook error handling are IGNORE, DROP_RECORD, ABORT_JOB. The equivalent error handling modes in the UI are Ignore, Skip and Abort.
export-job.error-handling.default	DROP_RECORD	Legal values for export job error handling are IGNORE, DROP_RECORD, ABORT_JOB. The equivalent error handling modes in the UI are Ignore error, Drop record and Abort job.
import-job.error-handling.default	DROP_RECORD	Legal values for import job error handling are DROP_RECORD, ABORT_JOB. The equivalent error handling modes in the UI are Drop record and Abort job.
data-link.error-handling.default	DROP_RECORD	Legal values for data link error handling are DROP_RECORD, ABORT_JOB. The equivalent error handling modes in the UI are Drop record and Abort job.
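
The legal values in the table above differ per job type: import jobs and data links do not accept IGNORE. A minimal Python sketch of a validation check follows. The names `LEGAL_MODES` and `is_legal_mode` are hypothetical, and the check is only an illustration, not how Datameer X validates these properties.

```python
ALL_MODES = {"IGNORE", "DROP_RECORD", "ABORT_JOB"}
NO_IGNORE = {"DROP_RECORD", "ABORT_JOB"}

# Legal values per property, copied from the table above.
LEGAL_MODES = {
    "workbook.error-handling.default": ALL_MODES,
    "export-job.error-handling.default": ALL_MODES,
    "import-job.error-handling.default": NO_IGNORE,
    "data-link.error-handling.default": NO_IGNORE,
}


def is_legal_mode(property_name: str, value: str) -> bool:
    """Return True if the value is listed as legal for the given property."""
    return value in LEGAL_MODES.get(property_name, set())


print(is_legal_mode("workbook.error-handling.default", "IGNORE"))
print(is_legal_mode("import-job.error-handling.default", "IGNORE"))
```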

...

To add a system-wide custom property:

Click on the "Admin" tab and select the menu tab "Hadoop Cluster". The overview page for the Hadoop Cluster configuration opens.
Click on "Enter". The Hadoop Cluster configuration page opens.
Enter the custom properties in the field "Custom Properties" and confirm with "Enter". The configuration is saved.

...

Double-click the required workbook in the File Browser. The workbook opens.
Select "File" and select "Configure". The workbook configuration page opens.
Enter the required custom properties in the field "Custom Properties" and confirm with "Save". The workbook job configuration is saved.

Editing Custom Properties for Compressed Data

When importing compressed data, the requirements for setting custom properties differ from those for other custom properties.

To decompress data you're importing, all compression codec details need to be put in the conf/das-common.properties file.

This includes, for LZO compression:

io.compression.codecs
io.compression.codec.lzo.class
mapred.map.output.compression.codec

As a best practice, put all compression details in the properties files, and adjust their tuning parameters, if necessary, in the interface.

...
