Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Runtime Analytics Basics

Runtime Analytics (RTA) is a data processing engine highly optimized to provide answers to analytical questions on large volumes of data at the speed of thought. RTA is unique because it leverages patented Datameer processing engine for sub-second aggregation of massive datasets that enable instant response to analytical queries through Datameer's Visual Explorer.

RTA keeps a persistent container running on the cluster that indexes a dataset, leveraging the best of columnar stores for fast data access and for creating on-demand, in-memory indices in real-time (Lazy Indexing). This is combined with the best of distributed search technologies to provide users Combined with fully distributed search technology it provides exceptionally fast aggregation, filterfiltering, and faceting capabilities on large volumes of data (billions of records). This approach eliminates the need of precomputing indices ahead of time which typically increases compute and storage requirements on the supported data platforms unleashing unconstrained data exploration capabilities.If you haven't yet purchased the module and are interested in using RTA for Datameercomprising billions of records. By removing the need for precomputed indices, RTA allows for unconstrained data exploration without straining compute and storage resources.

If you do not have the RTA module, contact support@datameer.com for additional purchase information. 

Visual Explorer

The Runtime Analytics plug-in is needed required to utilize the Visual Explorer feature in Datameer. Visual Explorer is  is an interactive data analysis feature designed to quickly and easily understand your data. With the purchase of Visual Explorer, you can instantly explore up to billions of records with sub-second response times when optimally configured. The data is first displayed visually over different types of charts where you can drill in and add/adjust attributes or aggregations. When the initial data exploration has been completed, the findings are added to a worksheet within your workbook where further analysis can be made. Visual Explorer integrates with Hadoop via YARN working seamlessly on multi-tenant clusters.

...

Queues and preemption

RTA provides Datameer administators administrators the option to launch the RTA Yarn App in a completely separate Yarn queue. This can be helpful for oversubscribed Hadoop clusters because Yarn applications launched in queues with a higher priority can start cannibalizing other Yarn applications provisioned in lower priority queues.

...

RTA requires and operates with a fixed amount or of memory and containers that are configurable. Once RTA starts is started and it successfully allocates the its configured resources it is able to support the analytical queries coming in. RTA has been designed and optimized for predictable response times, it it will fully support the analytical queries sent to it. In Hadoop clusters that are shared with multiple types of applications and operating at high capacity, it can be difficult to get back the needed resources to support the business if resources are returned to the clusterreacquire resources that have been allocated to RTA.

Preferred upgrade path

Starting in Datameer 6.3, Datameer stores job outputs as Parquet files. If upgrading from 5.11 or 6.1, customers must re-run targeted workbooks for for Visual Explorer support support. Customers upgrading from 6.3 or 6.4 should already have data for job outputs in Parquet format.

Deployment architecture

Image Modified

Anchor
configure
configure
Configuring and Enabling Runtime Analytics

...

  1. Log in to Datameer through the user interface.
  2. Open the the Admin tab tab.
  3. Select Select Plug-ins from  from the side menu.
  4. Find Datameer  Datameer Runtime Analytics Plug-in and  and click the the gear icon to  to configure the plug-in.
  5. Under settings, select select Yarn as the backend type for a production environment. (The Embedded type can be used for testing and debugging on small datasets since it provisions RTA in same JVM as Datameer). Embedded mode is not recommended for production deployments.
  6. Configure additional settings for your specific environment.  (See See Capacity Planning below below.)
  7. Click Click Save.
  8. Select the new menu option Runtime Analytics under the Admin Tab.
  9. Click Click Start to  to enable Runtime Analytics in Datameer.

...

  1. Log in to Datameer through the user interface.
  2. Open the Admin tab.
  3. Select Runtime Analytics from  from the side menu.
  4. Click Click Stop.

Anchor
capacity
capacity
Capacity Planning

...

Before starting, it is important to understand that RTA deploys natively on Yarn as a "long-lived" Yarn application (See See Deployment). This means that once the Datameer administrator launches RTA through the Datameer administration menu under the Admin tab (See once the Datameer administrator launches RTA (see Configuring and Enabling Runtime Analytics), the RTA Yarn application occupies the configured amount of containers and memory for the duration the RTA Yarn app is live. In other words, unless there are other issues in the cluster that kill the Yarn app, the RTA Yarn app lives as long as the Datameer administrator doesn't press the stop button in the Runtime Analytics menu under the Admin tabis active unless explicitly stopped.

Understanding data characteristics that impact capacity planning

...

Active columns refers to how many columns of a particular dataset the users are expected to use. Given that RTA indexes selected columns on demand, understanding what columns users could be expected to leverage can help plan if more memory should be allocated or if the datasets datsets should be constrained to a smaller number of columns for visual exploration.

...

Not all data types are created equal. For supported data types, cardinality (uniqueness) of the data plays has a massive role huge impact on the memory resources it uses once indexed into memory. At the highest level for this engine, there are two data types, numerics and strings.

  • Numerics: This includes include dates, floats, integers, and Boolean. 
  • Strings: This is the explicit string column in Datameer. RTA is built design to handle string columns in the context of categorical information and , not to handle large string columns for search purposes. RTA by default truncates text values down to 16 characters, but though it can may be configured to allow up to 64 characters with the using the Maximum characters for prefix field  field in the advanced configuration for the plug-in.
  • The Datameer list data type isn't supported by RTA. If you have data of this data type, the recommendation is to expand the lists in the workbook to a supported data type for analysis.

...

RTA loads into memory the columns users start to interact with (aka Active Columns) for aggregation and filtering operations. Since there are too many variables such as data types, the type of interaction with a column (filtering or aggregation), number of records, cardinality, length lenght of characters for string columns, etc it is difficult to provide a single formula that covers all the different cases for sizing. The best way to plan memory capacity is to start with the largest datasets known that are planned for use with Visual Explorer.  For this initial deployment for sizing you can use this very conservative formula: 1GB of memory per 2.5 million to 8 million records (assuming 10 active columns). Next, use the URL:  http://<RTA Master Node>:9200/_parlenecache/stats?pretty (requires having Elasticsearch HTTP interface enabled in the plug-in configuration) so that a Datameer administrator can test potential columns of interest to get a sense how much memory is required for the largest dataset. Please note that additional user utilization for the same dataset doesn't increase memory requirements.  See See the Troubleshooting section below for additional information on finding the RTA Master Node URL.

Example:

You have a dataset with 100 million records. After testing different dimensions/measures of interest you learn that you need about 7GB of memory for touching 12 columns of this dataset. This is summarized by the top level "SumBytes" value with the URL above. This is the sample output from http://<RTA Master Node>:9200/_parlenecache/stats?pretty

...

Assume that after doing some other testing with other datasets, you conclude that you need 100GB of memory to fit all the datasets of interest in memory before the cache starts to cycle the in memory indices (LRU cache strategy is used). Datameer recommends to conclude the amount of memory needed to be the 100GB + 30% to 50% more to handle the unexpected and to have wiggle room to satisfy the the Fields Cache Heap Fraction field  field requirement (this is found in the the Advanced Settings for  for the plug-in configuration). RTA requires additional memory to operate outside of the memory it needs to hold on to the data. This means that once RTA uses the configured percentage amount of memory specified, it starts cycling the indexes in the cache.

...

With the example of 100GB of memory requirement for the dataset and a configuration value for for Fields Cache Heap Fraction at  at 80%, that means that you would need about 125GB of memory configured for the Yarn app to support the 100GB memory requirements for the dataset (100/0.8). In addition, Datameer recommends configuring a few more GB to handle other datasets users might want to use with Visual Explorer, in this case 140GB would be a safer decision.

...

After understanding how much memory is needed for expected datasets (and adjusting for the unexpected), you need to start thinking about compute and parallelism. In order to get quick response times, you need to consider parallelism paralelism and CPUs. One way of looking at this is to think in terms of response times. Are you looking for response times in 2 seconds or less or would the business be happy with response times of 10 seconds or less? Datameer observed outstanding results in a few seconds (less than 5 seconds) when there is approximately a mapping between 1 vcore allocated per 2.5 million records. Datameer has also observed great performance numbers with response times between 3 and 8 seconds when allocating 1 vcore per 8 million records.

Before analyzing these details, it's worth understanding how Yarn applications behave and work with CPU resources. In general, Yarn is good at locking down memory resources while giving the freedom of using as many CPU resources as available for a given Yarn application. This is good news for RTA because due to traditional Hadoop workloads that tend to be intensive on I/O, this leaves CPU resources available. Now in Hadoop clusters where CPU resources are a bit constrained, you can also configure RTA to only use up to a fixed number of vcores in the server by changing the value of the the Virtual CPUs per container field and  and checking the checkbox for for "Elasticsearch processors" on the plug-in configuration screen.

...

  • In this type of case, it is recommended to provision additional Hadoop nodes to support RTA. But first, calculate caculate based on the target datasets and desired response times.
  • CPU:
    • From the previous example, you know that you need 40 vcores to process 100 million records optimally.
    • 40 vcores needed / 12 vcores per server (assuming they maintain the same compute density densitiy with new hardware) = ~ 3.3 = 4 additional nodes.
  • Memory:
    • From the initial sizing, you figured 140GB are needed to fit the target data into memory.
    • 140 GB / 32GB per node = ~ 4.4 = 5 additional nodes.
  • Conclusion: Add  Add at least 5 additional nodes to the cluster to support the targeted datasets.

Memory for RTA master node

RTA launches launchs one container that is assigned as the master node for RTA (See See Deployment above  above for architectural architectual reference). Generally, assign half the amount of RAM configured to the slave containers, but with a minimum of 2GB.

...

CPU capacity has the highest probability of being the limiting factor. When optimizing for the best performance for largest datasets, you are diminishing the chances RTA calculates multiple requests at the same time. RTA can process concurrent requests and the impact on response times vary depending on deployment configuration. When deployment is optimized for the best performance for largest datasets, RTA should be able to handle 10 concurrent queries comfortably. It is important to understand though that concurrency is a function of the amount of resources allocated for RTA in the cluster.

Runtime Analytics REST API

The Runtime Analytics process can be started, stopped, and have the status checked though Datameer's REST API

...

  • RTA deploys and it's compatible in Secured Hadoop clusters with Kerberos and Impersonation enabled.
  • In Secured Hadoop clusters, the RTA Yarn application is launched as the Datameer service user.
    • It is important to understand that the service user that launches the RTA Yarn application doesn't attempt to access data.
    • Every query that comes from Visual Explorer is answered using the user in context through the authentication and Kerberos ticket information in RTA.
  • In Secured MapR clusters, the maximum uptime of RTA is limited to the configured configured yarn.mapr.ticket.expiration property in  property in yarn-site.xml (default observed to be 7 days).

...

  • String columns: The length of a string is limited to the configured configured Maximum character for prefix field  field value. By default this is 16 characters and can be configured to 64. If the datasets have categorical data that exceeds typical datasets with 16 characters, change the value and restart RTA for the change to apply. See See Capacity Planning above above.
  • List columns: Lists are not supported in RTA and list columns don't show as a column option in Visual Explorer. To work around this limitation, use the the EXPAND function  function in the workbook to flatten out the data in the list. Note that this increases your data size. Consider additional data preparation steps if needed.

...

Materialized data as the product of a workbook execution tightly tighly couples the schema and the data. If you change the schema for a processed worksheet, you must run the workbook again before Visual Explorer is available for data exploration.

...

  • Has proper sizing being done? See the Capacity Planning section section.
  • Ensure you select MultiParquetDataLoader in the  in the Data loader field in  field in Advanced Configurations. This ensures optimal parallelism when Datameer's job output writes one large part file.
  • Have you used all configured memory? Delays can occur if new datasets start evicting other active indices in the memory cache. See the Capacity Planning section.
  • Are you running on EMR? If so, note that the initial column reads and the Visual Explorer load time take longer than an on premise cluster due to EMR ↔ S3 latency.

RTA takes a long time to start

...

  • This is a common occurrence when the configured amount of resources (number of containers and memory) aren't available at that point in time. The recommendation is to start RTA when it is known that the configured resources are available in the cluster.

Relevant log files

  • conductor.log & runtime runtime-analytics.log 
    • Useful to investigate problems with Runtime Analytics starting, stopping, or crashing.
  • runtime-analytics-query.log
    • Read this log to investigate problems with Runtime Analytics crashing, queries querys failing, or the wrong results being delivered.

...

  • Find the Yarn app with the application type type DRA. This is the long-lived Yarn app for RTA. Click on the Application ID.
  • Different attempts to launch the Yarn app are displayed on this page. Find the active attempt that succeeded and copy of the Node URL.
  • Select the option below that applies to you:
    • If you are in an intranet and you have access to that host, copy the URL to your clipboard and paste it in a new tab.
    • If you don't have access to the internal network of the Hadoop cluster and can't tunnel into it (e.g., EMR Cluster with a locked down security group config),
      • Copy the URL. 
      • SSH into the Datameer server that has access to all of the Hadoop nodes.
      • Paste the URL in a wget statement. For example: wget http://<url>

    • In both options:
      • Dismiss the current port and use 9200.
      • URL: The desired debugging URL. In this case use http://<URL>:9200/_cat/nodes?pretty
      • The output would similar to the example below. The IP with the "master" label would be the one of interest.

        Code Block
        172.30.3.31 172.30.3.31  9 59 0.33 d - node-0132_01_000013_marketing-eureka-demo-hdp-s03-prod
        172.30.3.36 172.30.3.36  9 54 0.00 d - node-0132_01_000014_marketing-eureka-demo-hdp-s13-prod
        172.30.3.21 172.30.3.21  9 56 0.01 d - node-0132_01_000011_marketing-eureka-demo-hdp-s12-prod
        172.30.3.25 172.30.3.25  9 56 0.11 d - node-0132_01_000016_marketing-eureka-demo-hdp-s02-prod
        172.30.3.9  172.30.3.9   9 57 0.01 d - node-0132_01_000012_marketing-eureka-demo-hdp-s05-prod
        172.30.3.6  172.30.3.6   2 99 0.02 - * master
        172.30.3.13 172.30.3.13  9 57 0.06 d - node-0132_01_000006_marketing-eureka-demo-hdp-m01-prod
        172.30.3.57 172.30.3.57  9 56 0.02 d - node-0132_01_000005_marketing-eureka-demo-hdp-s08-prod
        172.30.3.62 172.30.3.62  9 53 0.00 d - node-0132_01_000015_marketing-eureka-demo-hdp-s15-prod
        172.30.3.15 172.30.3.15  9 55 0.05 d - node-0132_01_000008_marketing-eureka-demo-hdp-s07-prod
        172.30.3.32 172.30.3.32  9 56 0.16 d - node-0132_01_000004_marketing-eureka-demo-hdp-s04-prod
        172.30.3.49 172.30.3.49  9 55 0.02 d - node-0132_01_000010_marketing-eureka-demo-hdp-s14-prod
        172.30.3.60 172.30.3.60  9 56 0.03 d - node-0132_01_000009_marketing-eureka-demo-hdp-s09-prod
        172.30.3.4  172.30.3.4  10 54 0.00 d - node-0132_01_000002_marketing-eureka-demo-hdp-s06-prod
        172.30.3.14 172.30.3.14  9 56 0.01 d - node-0132_01_000007_marketing-eureka-demo-hdp-s01-prod
        172.30.3.17 172.30.3.17  9 41 0.03 d - node-0132_01_000003_marketing-eureka-demo-hdp-s11-prod
    • (Optional)

      • If the URL used doesn't work, use the the _cat/nodes?pretty URL  URL from any other node that has RTA deployed. Do this by clicking on the Application ID link that gives you details of what containers are deployed and their respective nodes.
      • On the following displayed page, scroll down and find the node URLs for that Yarn app.

...