Set Up Runtime Analytics

Runtime Analytics Basics

Runtime Analytics (RTA) is a patented Datameer processing engine for sub-second aggregation of massive datasets, enabling instant responses to analytical queries through Datameer's Visual Explorer.

RTA keeps a persistent container running on the cluster that indexes a dataset, leveraging the best of columnar stores for fast data access along with on-demand, in-memory indices built in real time (Lazy Indexing). Combined with fully distributed search technology, it provides exceptionally fast aggregation, filtering, and faceting on data volumes comprising billions of records. By removing the need for precomputed indices, RTA allows unconstrained data exploration without straining compute and storage resources.

If you do not have the RTA module, contact support@datameer.com for purchase information. 

Visual Explorer

The Runtime Analytics plug-in is required to use the Visual Explorer feature in Datameer. Visual Explorer is an interactive data analysis feature designed to help you quickly and easily understand your data. With the purchase of Visual Explorer, you can instantly explore up to billions of records with sub-second response times when optimally configured. The data is first displayed visually across different chart types, where you can drill in and add or adjust attributes and aggregations. When the initial data exploration is complete, the findings are added to a worksheet within your workbook for further analysis. Visual Explorer integrates with Hadoop via Yarn and works seamlessly on multi-tenant clusters.

Deployment

System Requirements

A Hadoop cluster from a currently supported Hadoop vendor.

If the cluster is shared with other Datameer jobs and other Yarn applications, it's recommended that each Hadoop node have at least 16 vcores and ample memory, depending on the datasets (see Capacity Planning).

When deploying with cloud Hadoop vendors, the following instance types are good starting points (and also account for additional memory requirements).

  • AWS: EMR: c4.4xlarge (or newer equivalents)

  • Azure: HDI: d5v2 (or newer equivalents)

Queues and preemption

RTA gives Datameer administrators the option to launch the RTA Yarn app in a completely separate Yarn queue. This can be helpful on oversubscribed Hadoop clusters, because Yarn applications launched in higher-priority queues can start cannibalizing applications provisioned in lower-priority queues.

If response time and stability are important for a particular RTA deployment, it is highly recommended that RTA is deployed in a Yarn queue that is exempt from scheduler preemption rules in Hadoop.

Resource management

RTA operates with a fixed, configurable amount of memory and containers. Once RTA is started and successfully allocates its configured resources, it fully supports the analytical queries sent to it. In Hadoop clusters that are shared among multiple types of applications and operating at high capacity, it can be difficult to reacquire resources once they have been allocated to RTA.

Preferred upgrade path

Starting in Datameer 6.3, Datameer stores job outputs as Parquet files. If upgrading from 5.11 or 6.1, customers must re-run targeted workbooks for Visual Explorer support. Customers upgrading from 6.3 or 6.4 should already have data for job outputs in Parquet format.

Deployment architecture

Configuring and Enabling Runtime Analytics

Steps to install and start Runtime Analytics

  1. Log in to Datameer through the user interface.

  2. Open the Admin tab.

  3. Select Plug-ins from the side menu.

  4. Find Datameer Runtime Analytics Plug-in and click the gear icon to configure the plug-in.

  5. Under settings, select Yarn as the backend type for a production environment. (The Embedded type can be used for testing and debugging on small datasets, since it provisions RTA in the same JVM as Datameer.) Embedded mode is not recommended for production deployments.

  6. Configure additional settings for your specific environment. (See Capacity Planning below.)

  7. Click Save.

  8. Select the new menu option Runtime Analytics under the Admin Tab.

  9. Click Start to enable Runtime Analytics in Datameer.

Steps to stop Runtime Analytics

  1. Log in to Datameer through the user interface.

  2. Open the Admin tab.

  3. Select Runtime Analytics from the side menu.

  4. Click Stop.

Capacity Planning

Before deploying the Runtime Analytics Engine (RTA), plan resource allocation for the target use cases. This starts with your desired response time for targeted datasets. While there isn't a one-size-fits-all rule for sizing Runtime Analytics, the following guidelines and examples provide a good starting point.

Before starting, it is important to understand that RTA deploys natively on Yarn as a "long-lived" Yarn application (see Deployment). Once the Datameer administrator launches RTA (see Configuring and Enabling Runtime Analytics), the RTA Yarn application holds its configured containers and memory for as long as it runs. In other words, unless a cluster issue kills it, the RTA Yarn app stays active until explicitly stopped.

Understanding data characteristics that impact capacity planning

The number of records is an important dimension when estimating resource requirements, but other dimensions matter just as much: the number of "active" columns, their data types, and their cardinality.

Number of "active" columns

Active columns refers to how many columns of a particular dataset users are expected to use. Given that RTA indexes selected columns on demand, understanding which columns users are likely to leverage helps determine whether more memory should be allocated or whether the datasets should be constrained to a smaller number of columns for visual exploration.

Data types & cardinality

Not all data types are created equal. For supported data types, the cardinality (uniqueness) of the data has a huge impact on the memory a column uses once indexed. At the highest level, this engine has two data types: numerics and strings.

  • Numerics: include dates, floats, integers, and Boolean. 

  • Strings: the explicit string column in Datameer. RTA is designed to handle string columns as categorical information, not to handle large string columns for search purposes. By default, RTA truncates text values to 16 characters, though it can be configured to allow up to 64 characters using the Maximum characters for prefix field in the advanced configuration for the plug-in.

  • The Datameer list data type isn't supported by RTA. If you have data of this type, the recommendation is to expand the lists in the workbook into a supported data type for analysis.

Memory planning

RTA loads into memory the columns users interact with (the active columns) for aggregation and filtering operations. Because so many variables are involved (data types, the type of interaction with a column (filtering or aggregation), number of records, cardinality, length of string values, etc.), it is difficult to provide a single sizing formula that covers every case. The best way to plan memory capacity is to start with the largest datasets planned for use with Visual Explorer. For an initial deployment, you can use this very conservative formula: 1GB of memory per 2.5 million to 8 million records (assuming 10 active columns). Next, use the URL http://<RTA Master Node>:9200/_parlenecache/stats?pretty (requires the Elasticsearch HTTP interface to be enabled in the plug-in configuration) so that a Datameer administrator can test potential columns of interest and get a sense of how much memory the largest dataset requires. Note that additional users working with the same dataset don't increase memory requirements. See the Troubleshooting section below for information on finding the RTA Master Node URL.
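The conservative rule of thumb above can be sketched as a quick calculation. This is an illustrative helper, not a Datameer API; the 2.5 million records per GB figure is the safe (low) end of the stated range:

```python
def estimate_memory_gb(records, records_per_gb=2_500_000):
    """Conservative memory estimate: 1 GB per 2.5M records,
    assuming roughly 10 active columns (per the guideline above)."""
    return records / records_per_gb

# A 100-million-record dataset at the conservative end of the range:
print(estimate_memory_gb(100_000_000))  # -> 40.0 (GB)
```

Note that real usage is often lower (the worked example below needs only about 7GB for 12 columns of a 100-million-record dataset), which is why testing with the stats URL is recommended.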

Example:

You have a dataset with 100 million records. After testing different dimensions/measures of interest, you learn that you need about 7GB of memory when touching 12 columns of this dataset. This is summarized by the top-level "SumBytes" value returned by the URL above. The following is sample output from http://<RTA Master Node>:9200/_parlenecache/stats?pretty:

{
  "SumBytes" : "6.9 GB",
  "EntryTypes" : [
    {
      "name" : "TERMS",
      "SumBytes" : "1 GB",
      "Indices" : [
        {
          "index" : "235_ca2a1a3c-2f0d-4f01-bdf1-c07557b8cccc_1216_np",
          "SumBytes" : "1 GB",
          "Fields" : [
            { "field" : "nppes_provider_zip", "size" : "1 GB" }
          ]
        }
      ]
    },
    {
      "name" : "SORTED_DOC_VALUES",
      "SumBytes" : "2.4 GB",
      "Indices" : [
        {
          "index" : "235_ca2a1a3c-2f0d-4f01-bdf1-c07557b8cccc_1216_np",
          "SumBytes" : "2.4 GB",
          "Fields" : [
            { "field" : "hcpcs_description", "size" : "227.8 MB" },
            { "field" : "nppes_entity_code", "size" : "95.4 MB" },
            { "field" : "nppes_provider_city", "size" : "288.8 MB" },
            { "field" : "nppes_provider_state", "size" : "147.2 MB" },
            { "field" : "nppes_provider_zip", "size" : "1.5 GB" },
            { "field" : "provider_type", "size" : "96.7 MB" }
          ]
        }
      ]
    },
    {
      "name" : "NUMERIC_DOC_VALUES",
      "SumBytes" : "3.5 GB",
      "Indices" : [
        {
          "index" : "235_ca2a1a3c-2f0d-4f01-bdf1-c07557b8cccc_1216_np",
          "SumBytes" : "3.5 GB",
          "Fields" : [
            { "field" : "date_of_service", "size" : "426.3 MB" },
            { "field" : "medicare_allowed_amt", "size" : "770.2 MB" },
            { "field" : "medicare_payment_amt", "size" : "770.2 MB" },
            { "field" : "nppes_provider_latitude", "size" : "822.9 MB" },
            { "field" : "nppes_provider_longitude", "size" : "822.9 MB" },
            { "field" : "partition_column", "size" : "7.3 MB" }
          ]
        }
      ]
    }
  ]
}
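When the stats payload grows large, a short script can summarize which fields dominate cache memory. This is a sketch, not part of Datameer: the payload below is an abbreviated copy of the sample output above, and in practice you would fetch the JSON from the stats URL instead of embedding it:

```python
import json

# Abbreviated stats payload, following the shape of the sample output above.
stats = json.loads("""
{ "SumBytes": "6.9 GB",
  "EntryTypes": [
    { "name": "NUMERIC_DOC_VALUES", "SumBytes": "3.5 GB",
      "Indices": [
        { "index": "235_ca2a1a3c-2f0d-4f01-bdf1-c07557b8cccc_1216_np",
          "SumBytes": "3.5 GB",
          "Fields": [
            { "field": "nppes_provider_latitude", "size": "822.9 MB" },
            { "field": "medicare_payment_amt", "size": "770.2 MB" } ] } ] } ] }
""")

# Print a per-entry-type, per-field breakdown of cache usage.
for entry in stats["EntryTypes"]:
    print(entry["name"], entry["SumBytes"])
    for index in entry["Indices"]:
        for field in index["Fields"]:
            print("  ", field["field"], field["size"])
```

Sorting this breakdown by size quickly reveals candidates for exclusion, such as high-cardinality string columns.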

Assume that after testing other datasets, you conclude that you need 100GB of memory to fit all datasets of interest in memory before the cache starts to cycle the in-memory indices (an LRU cache strategy is used). Datameer recommends provisioning that 100GB plus an additional 30% to 50%, to handle the unexpected and to leave room for the Fields Cache Heap Fraction requirement (found in the Advanced Settings of the plug-in configuration). RTA requires additional memory to operate beyond what it needs to hold the data: once RTA uses the configured fraction of its memory, it starts cycling indexes out of the cache.

With the example of 100GB of memory requirement for the dataset and a configuration value for Fields Cache Heap Fraction at 80%, that means that you would need about 125GB of memory configured for the Yarn app to support the 100GB memory requirements for the dataset (100/0.8). In addition, Datameer recommends configuring a few more GB to handle other datasets users might want to use with Visual Explorer, in this case 140GB would be a safer decision.
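The heap-fraction arithmetic above can be expressed as a one-line calculation. This is an illustrative sketch using the numbers from the example, not a Datameer utility:

```python
def yarn_app_memory_gb(data_gb, cache_heap_fraction):
    """Total Yarn app memory needed so that `data_gb` fits within the
    Fields Cache Heap Fraction (e.g. 0.8 for 80%)."""
    return data_gb / cache_heap_fraction

# 100 GB of data with an 80% heap fraction:
print(yarn_app_memory_gb(100, 0.8))  # -> 125.0 (GB minimum; pad to ~140 GB)
```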

CPU planning

After understanding how much memory is needed for expected datasets (and adjusting for the unexpected), consider compute and parallelism, which drive response times. Are you looking for response times of 2 seconds or less, or would the business be satisfied with 10 seconds or less? Datameer has observed outstanding results (under 5 seconds) with approximately 1 vcore allocated per 2.5 million records, and strong performance with response times between 3 and 8 seconds when allocating 1 vcore per 8 million records.

Before analyzing these details, it's worth understanding how Yarn applications work with CPU resources. In general, Yarn is good at locking down memory resources while allowing an application to use as many CPU resources as are available. This is good news for RTA: traditional Hadoop workloads tend to be I/O-intensive, which leaves CPU resources free. In Hadoop clusters where CPU resources are constrained, you can configure RTA to use only a fixed number of vcores per server by changing the value of the Virtual CPUs per container field and checking the "Elasticsearch processors" checkbox on the plug-in configuration screen.

Deciding on the number of containers

With a clear understanding of memory and CPU requirements for the desired datasets, the next step is to split this by the number of optimal containers for RTA deployment.

Generally, it is good practice to map 1 Yarn container per worker/slave node. Adding more containers than servers doesn't help and in some cases can slow response times, since the same vcore count now carries more workload plus additional container overhead.

Example 1

Using the example dataset in this document, you have 20 nodes with 16 vcores each and 64GB of memory assigned to Yarn. Hadoop administrators inform you that on average the cluster capacity is at 50% in terms of available containers with peaks around 75% with busy workloads.

  • CPU: For optimal response times, you use the recommendation of 2.5 million records per vcore:

    • Get an idea of how many vcores you have to work with: 20 nodes * 16 vcores per node = 320 vcores total.

    • The largest dataset has 100 million records so you divide that by 2.5 million records per vcore and conclude that 40 vcores is sufficient to process 100 million records optimally.

    • This equates to roughly 3 nodes rounding up (40/16=2.5) as a good estimate to start with.

    • Since the cluster is used 50% on average, you can probably increase this to 5 nodes (almost twice as much).

  • Memory: You previously concluded 140GB of memory is a good fit for the targeted dataset.

    • With these assumptions, you estimate that only half of the RAM (32GB) will be available to RTA per node at any given time, on average.

    • Using 32GB per node as the baseline, divide the 140GB requirement by 32GB (~4.4) and conclude that 5 nodes with 32GB of memory each will be sufficient.

  • Conclusion: Due to memory requirements, fitting all the targeted data requires 5 nodes.
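The reasoning in Example 1 can be sketched as a small calculation. All inputs mirror the example's numbers; substitute your own cluster figures:

```python
import math

# Inputs from Example 1 (adjust for your own cluster).
records = 100_000_000
records_per_vcore = 2_500_000      # guideline for optimal response times
vcores_per_node = 16
memory_needed_gb = 140             # from the memory-planning example
usable_ram_per_node_gb = 32        # half of 64 GB, given ~50% avg utilization

# Nodes needed to satisfy the CPU guideline (before the 50%-utilization bump).
cpu_nodes = math.ceil(records / records_per_vcore / vcores_per_node)

# Nodes needed to satisfy the memory requirement.
mem_nodes = math.ceil(memory_needed_gb / usable_ram_per_node_gb)

# The larger requirement wins; here memory drives the node count.
print(cpu_nodes, mem_nodes, max(cpu_nodes, mem_nodes))  # -> 3 5 5
```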