Smart Sampling

Datameer offers many features that enable users to work with large data sets. One of those features is Smart Sampling. Smart Sampling reliably provides a view on your data that enables you to build an analytic model with real-time feedback.

For optimized performance and an accurate representation of your data, Datameer provides a default initial sample of 5000 records (the size of which can be customized for data links, import jobs and file uploads) when you start your analysis. This is a representative sample generated via the distributed reservoir sampling technique:

  • Each task generates a sample of size (k / n-tasks) * oversample-factor
  • At the end of the job, the samples are merged to generate a k-sized final sample
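
The two steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Datameer's implementation: the function names, the default oversample factor of 2.0, and the simplified merge step (a uniform down-sample of the combined partial samples) are all assumptions made for the example.

```python
import random


def reservoir_sample(records, size, rng):
    """Classic single-pass reservoir sampling over one task's records."""
    sample = []
    for seen, record in enumerate(records, start=1):
        if len(sample) < size:
            sample.append(record)
        else:
            # Replace a random slot with probability size / seen.
            slot = rng.randrange(seen)
            if slot < size:
                sample[slot] = record
    return sample


def distributed_sample(task_inputs, k, oversample_factor=2.0, seed=0):
    """Hypothetical helper: per-task samples merged into a k-sized sample."""
    rng = random.Random(seed)
    n_tasks = len(task_inputs)
    # Each task keeps (k / n-tasks) * oversample-factor records.
    per_task = max(1, int((k / n_tasks) * oversample_factor))
    partials = [reservoir_sample(task, per_task, rng) for task in task_inputs]
    merged = [record for part in partials for record in part]
    # Simplified merge (assumption): uniform down-sample to k records.
    # A more careful merge would weight each partial sample by the
    # number of records its task actually saw.
    return rng.sample(merged, min(k, len(merged)))


tasks = [range(0, 1000), range(1000, 2000), range(2000, 3000), range(3000, 4000)]
final_sample = distributed_sample(tasks, k=100)
print(len(final_sample))
```

Oversampling in each task gives the merge step spare records to choose from, so the final sample can still reach k records when some tasks see only a little data.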

While designing an analysis on the initial representative sample, you might end up with an empty sheet (e.g., because of a filter) even though the full data set contains records that would pass the filter. Smart Sampling overcomes this problem by resampling the data, taking the operations designed up to that point into account. Run your partially designed workbook and the new smart sample populates the sheet with records. Flip Side occasionally uses this preview data.

Unbiased Randomized Samples

Unbiased randomized samples do have limitations:

  • Increasing the sample size can reduce some of these limitations, but not all of them
  • Outliers could be missed or included, which affects the preview calculations
  • Join and filter actions can result in empty sheets when trying to match against an underrepresented subset of the records

In order to deal with the empty sheet problem, Datameer biases the samples.

Filtered and Joined Sheets

For filtered sheets, we generate a biased sample based on which records pass through the filter by assigning a larger weight to records that pass. This requires computing the formulas and filter twice, but CPU time isn’t usually the bottleneck.
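
One common way to weight records during sampling is weighted reservoir sampling, sketched below. It shows the general idea of favoring records that pass a filter and should not be read as Datameer's actual algorithm. The function name, the `pass_weight` parameter and its value are hypothetical.

```python
import heapq
import random


def biased_filter_sample(records, passes_filter, k, pass_weight=10.0, seed=0):
    """Hypothetical sketch: sample k records, favoring those that pass."""
    rng = random.Random(seed)
    heap = []  # min-heap of (priority, index, record); holds the k best so far
    for index, record in enumerate(records):
        # Records that pass the filter get a larger weight (assumed value).
        weight = pass_weight if passes_filter(record) else 1.0
        # Weighted reservoir sampling: priority = u ** (1 / weight).
        priority = rng.random() ** (1.0 / weight)
        entry = (priority, index, record)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif priority > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [record for _, _, record in heap]


# Only 1% of the records pass this filter.
is_match = lambda value: value % 100 == 0
sample = biased_filter_sample(range(10000), is_match, k=100, pass_weight=1000.0)
print(sum(1 for value in sample if is_match(value)), "of", len(sample), "pass")
```

With a uniform sample of 100 records, only about one record would be expected to pass a filter this selective. The weighting makes an empty filtered preview far less likely.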

For join sheets we generate a biased sample by evenly distributing the join keys, which lowers the chance of sampling a record if Datameer has seen its join key too many times.
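
The sketch below shows one way to lower the sampling chance of a record whose join key has already been seen many times. It is an assumption-laden illustration rather than Datameer's method: the weight of 1 / (times the key has been seen) and all names are invented for the example.

```python
import heapq
import random
from collections import Counter


def key_balanced_sample(records, join_key, k, seed=0):
    """Hypothetical sketch: sample k records while spreading out join keys."""
    rng = random.Random(seed)
    seen = Counter()
    heap = []  # min-heap of (priority, index, record)
    for index, record in enumerate(records):
        key = join_key(record)
        seen[key] += 1
        # The more often a key has been seen, the smaller the weight.
        weight = 1.0 / seen[key]
        priority = rng.random() ** (1.0 / weight)
        entry = (priority, index, record)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif priority > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [record for _, _, record in heap]


# 9900 records share one key; 100 records each have a unique key.
rows = [("common", i) for i in range(9900)]
rows += [("rare-%d" % i, i) for i in range(100)]
sample = key_balanced_sample(rows, join_key=lambda row: row[0], k=100)
print(len({row[0] for row in sample}), "distinct join keys in the sample")
```

A uniform sample would be dominated by the common key, so a join against a sheet that holds mostly the rare keys could come back empty. Spreading the keys keeps more of them represented in the preview.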

Optimized Previews

The optimized previews (smart samples) can only be generated when the workbook is being run, and they are generated for the data source sheets and kept sheets. Optimized preview generation can be disabled by setting the property das.sampling.lookahead.maxdepth to 0, which can be useful for critical, time-constrained workbooks.

By default, Datameer reads 5GB (5368709120 bytes) of the file and then selects 5000 rows as sample data.

The amount of data that Datameer reads for a data link can be adjusted. If a cluster is busy, the amount can be decreased to improve performance by adding or editing the following custom property:

das.splitting.datalink.max-sample-size=

The value for the property is measured in bytes. E.g., 1073741824 = 1GB

This property can be added at a server level under the Administration tab -> Hadoop Cluster -> Configuration -> Custom Properties or at a job level through the data link wizard under Schedule -> Advanced -> Custom Properties.
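
Because the value must be given in bytes, a small helper can do the conversion. The following is a minimal sketch. The helper names are hypothetical, and 1GB is taken as 1024^3 bytes, which matches the 1073741824 figure above.

```python
BYTES_PER_GB = 1024 ** 3  # 1GB = 1073741824 bytes, as used in this document


def gb_to_bytes(gigabytes):
    """Convert a size in GB to the byte count the property expects."""
    return int(gigabytes * BYTES_PER_GB)


def max_sample_size_property(gigabytes):
    """Hypothetical helper: build the custom property line for a given size."""
    return "das.splitting.datalink.max-sample-size=%d" % gb_to_bytes(gigabytes)


# Example: limit the amount of data read for a data link to 1GB.
print(max_sample_size_property(1))
```

The printed line can then be pasted into the Custom Properties field at the server or job level.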

Sampling from Partitions

Datameer has the ability to divide up the importing/linking/uploading of your data through date partitioning. Each partition is read by Datameer and has its own sample to get the best representation of the incoming data. When running an import job with multiple partitions, take into account that the sample size set while configuring or editing the settings is applied to each partition. (E.g., an import job that has three partitions with a sample size of 5000 brings in up to 15,000 sample records.) The sample size can be decreased to improve performance, though a larger sample size better represents the data.
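
The arithmetic behind the per-partition sample size can be checked with a short calculation. The function name is hypothetical. The result is an upper bound, because a partition that holds fewer records than the sample size contributes fewer sample records.

```python
def max_sample_records(partition_count, sample_size):
    """Upper bound on sample records when each partition is sampled separately."""
    return partition_count * sample_size


# Three partitions with a sample size of 5000 each.
print(max_sample_records(3, 5000))
```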