...

If the cluster is shared with other Datameer jobs and other Yarn applications, each Hadoop node should have at least 16 vcores and enough memory for the expected datasets (see capacity planning).

When deploying with cloud Hadoop vendors, the following instance types are good starting points (and also account for additional memory requirements).

...

  1. Log in to Datameer through the user interface.
  2. Open the Admin tab.
  3. Select Plug-ins from the side menu.
  4. Find Datameer Runtime Analytics Plug-in and click the gear icon to configure the plug-in.
  5. Under settings, select Yarn as the backend type for a production environment. The Embedded type can be used for testing and debugging on small datasets because it provisions RTA in the same JVM as Datameer, but it is not recommended for production deployments.
  6. Configure additional settings for your specific environment. (See Capacity Planning below.)
  7. Click Save.
  8. Select the new menu option Runtime Analytics under the Admin tab.
  9. Click Start to enable Runtime Analytics in Datameer.

...

Active columns refers to how many columns of a particular dataset users are expected to use. Because RTA indexes selected columns on demand, knowing which columns users are likely to use helps you decide whether to allocate more memory or to constrain the datasets to a smaller number of columns for visual exploration.

...

After estimating how much memory the expected datasets need (and adjusting for the unexpected), consider compute and parallelism, since quick response times depend on both. One way to approach this is to think in terms of target response times: does the business need responses in 2 seconds or less, or is 10 seconds or less acceptable? Datameer has observed response times of less than 5 seconds when approximately 1 vcore is allocated per 2.5 million records, and response times between 3 and 8 seconds when 1 vcore is allocated per 8 million records.
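The two ratios above can be turned into a rough vcore estimate. The following is a minimal sketch, not a Datameer tool: the function and constant names are hypothetical, and the ratios are simply the observed figures quoted above, so treat the result as a starting point rather than a guarantee.

```python
# Records served per vcore, taken from the observations quoted above.
RECORDS_PER_VCORE_FAST = 2_500_000      # responses observed under ~5 seconds
RECORDS_PER_VCORE_MODERATE = 8_000_000  # responses observed between ~3 and 8 seconds


def vcores_needed(record_count: int, records_per_vcore: int) -> int:
    """Estimate vcores for a dataset, rounding up to a whole vcore."""
    if record_count < 0 or records_per_vcore <= 0:
        raise ValueError("record_count must be >= 0 and records_per_vcore > 0")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (record_count + records_per_vcore - 1) // records_per_vcore


# Example: a dataset of 100 million records.
fast = vcores_needed(100_000_000, RECORDS_PER_VCORE_FAST)
moderate = vcores_needed(100_000_000, RECORDS_PER_VCORE_MODERATE)
print(f"fast target: {fast} vcores, moderate target: {moderate} vcores")
```

For 100 million records this gives 40 vcores at the faster ratio and 13 vcores at the more relaxed one.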

...

  • In this type of case, it is recommended to provision additional Hadoop nodes to support RTA. But first, calculate the requirement based on the target datasets and desired response times.
  • CPU:
    • From the previous example, you know that you need 40 vcores to process 100 million records optimally.
    • 40 vcores needed / 12 vcores per server (assuming the same compute density with new hardware) = ~3.3, rounded up to 4 additional nodes.
  • Memory:
    • From the initial sizing, you figured that 140 GB are needed to fit the target data into memory.
    • 140 GB / 32 GB per node = ~4.4, rounded up to 5 additional nodes.
  • Conclusion: Memory is the larger requirement, so add at least 5 additional nodes to the cluster to support the targeted datasets.
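The node arithmetic in this example can be sketched as follows. This is an illustrative calculation only: the function name and return structure are hypothetical, and it assumes every added node contributes the same vcores and memory.

```python
import math


def additional_nodes(vcores_required: int, vcores_per_node: int,
                     memory_gb_required: float, memory_gb_per_node: float) -> dict:
    """Return nodes needed for CPU, for memory, and the larger of the two."""
    nodes_for_cpu = math.ceil(vcores_required / vcores_per_node)
    nodes_for_memory = math.ceil(memory_gb_required / memory_gb_per_node)
    return {
        "cpu": nodes_for_cpu,
        "memory": nodes_for_memory,
        # Both constraints must be satisfied, so the larger count wins.
        "total": max(nodes_for_cpu, nodes_for_memory),
    }


# Figures from the example: 40 vcores and 140 GB on 12-vcore, 32 GB nodes.
plan = additional_nodes(40, 12, 140, 32)
print(plan)
```

With the figures from the example, CPU calls for 4 nodes and memory for 5, so the plan is 5 nodes.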

Memory for RTA master node

RTA launches one container that is assigned as the master node for RTA (see Deployment above for an architectural reference). Generally, assign the master half the amount of RAM configured for the slave containers, with a minimum of 2 GB.
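As a small illustration of this rule of thumb, the master allocation is half of the slave container memory with a 2 GB floor. The function name below is hypothetical and is not part of any Datameer configuration.

```python
def master_memory_gb(slave_container_memory_gb: float, minimum_gb: float = 2.0) -> float:
    """Half of the slave container memory, but never below the minimum."""
    if slave_container_memory_gb <= 0:
        raise ValueError("slave_container_memory_gb must be positive")
    return max(slave_container_memory_gb / 2, minimum_gb)


# A 32 GB slave container suggests a 16 GB master; a 3 GB one falls back to 2 GB.
print(master_memory_gb(32), master_memory_gb(3))
```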

...

CPU capacity has the highest probability of being the limiting factor. When you optimize for the best performance on the largest datasets, you reduce the likelihood that RTA calculates multiple requests at the same time. RTA can process concurrent requests, and the impact on response times varies with the deployment configuration. A deployment optimized for the best performance on the largest datasets should handle 10 concurrent queries comfortably. Keep in mind, though, that concurrency is a function of the amount of resources allocated to RTA in the cluster.

Runtime Analytics REST API

The Runtime Analytics process can be started, stopped, and checked for status through Datameer's REST API.

...

Materialized data produced by a workbook execution tightly couples the schema and the data. If you change the schema for a processed worksheet, you must run the workbook again before Visual Explorer is available for data exploration.

...

  • Has proper sizing been done? See the Capacity Planning section.
  • Ensure you select MultiParquetDataLoader in the Data loader field in Advanced Configurations. This ensures optimal parallelism when Datameer's job output writes one large part file.
  • Have you used all configured memory? Delays can occur if new datasets start evicting other active indices in the memory cache. See the Capacity Planning section.
  • Are you running on EMR? If so, note that the initial column reads and the Visual Explorer load time take longer than on an on-premises cluster due to EMR ↔ S3 latency.

...

  • conductor.log & runtime-analytics.log
    • Useful to investigate problems with Runtime Analytics starting, stopping, or crashing.
  • runtime-analytics-query.log
    • Read this log to investigate problems with Runtime Analytics crashing, queries failing, or the wrong results being delivered.

...