Datameer offers many features that enable users to work with large data sets. One of those features is Smart Sampling. Smart Sampling reliably provides a view on your data that enables you to build an analytic model with real-time feedback.
For optimized performance and an accurate representation of your data, Datameer provides a default initial sample of 5000 records (the size of which can be customized for data links, import jobs and file uploads) when you start your analysis. This is a representative sample generated via the distributed reservoir sampling technique:
- Each task generates a sample of size ((k / n-tasks) * oversample-factor)
- At the end of the job, the samples are merged to generate a k-sized final sample
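The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not Datameer's implementation: it assumes a random-key variant of reservoir sampling, in which every record gets a random key and the records with the smallest keys are kept. The function names (task_sample, merge_samples) and the parameter names are hypothetical.

```python
import heapq
import math
import random
from itertools import chain


def task_sample(records, k, n_tasks, oversample_factor, rng):
    """Per-task step: tag each record with a random key and keep the
    ceil((k / n_tasks) * oversample_factor) records with the smallest keys."""
    per_task = math.ceil((k / n_tasks) * oversample_factor)
    keyed = ((rng.random(), record) for record in records)
    # nsmallest streams over the input and holds only per_task entries at a time.
    return heapq.nsmallest(per_task, keyed, key=lambda pair: pair[0])


def merge_samples(task_samples, k):
    """Final step: merge the per-task samples and keep the k smallest keys."""
    merged = chain.from_iterable(task_samples)
    smallest = heapq.nsmallest(k, merged, key=lambda pair: pair[0])
    return [record for _, record in smallest]


if __name__ == "__main__":
    rng = random.Random(42)
    # Four hypothetical tasks, each reading 1,000 records.
    splits = [range(i * 1000, (i + 1) * 1000) for i in range(4)]
    partial = [task_sample(s, k=100, n_tasks=4, oversample_factor=2.0, rng=rng)
               for s in splits]
    print(len(merge_samples(partial, k=100)))
```

In this sketch the oversample factor gives each task some slack: the merged result is a uniform sample as long as no task had to discard a record that would have ranked among the k smallest keys overall.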
While designing an analysis with the initial representative sample, you might end up with an empty sheet (e.g., because of a filter) even though the full data set contains records that would pass the filter. Smart Sampling overcomes this problem by resampling the data, taking the operations designed up to that point into account. Run your partially designed workbook and the new smart sample populates the sheet with records. Flip Side occasionally uses this preview data as well.
Unbiased Randomized Samples
Unbiased randomized samples do have limitations:
- Increasing the sample size can reduce some of these limitations
- Outliers could be missed or included, which affects the preview calculations
- Join and filter actions can result in empty sheets when trying to match against an underrepresented subset of the records
In order to deal with the empty sheet problem, Datameer biases the samples.
Filtered and Joined Sheets
For filtered sheets, we generate a biased sample based on which records pass through the filter by assigning a larger weight to records that pass. This requires computing the formulas and filter twice, but CPU time isn’t usually the bottleneck.
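One way to picture such a weighted sample is weighted reservoir sampling, where each record gets the key u ** (1 / weight) for a uniform random u and the k largest keys are kept. The sketch below assumes this technique for illustration only; the source does not state which weighting scheme Datameer uses, and the names filter_biased_sample and pass_weight are hypothetical.

```python
import heapq
import random


def filter_biased_sample(records, passes_filter, k, pass_weight=10.0, rng=None):
    """Keep k records, favoring those that pass the filter.

    Assumption: records that pass get weight `pass_weight`, all others weight 1.
    """
    rng = rng or random.Random()

    def keyed():
        for record in records:
            weight = pass_weight if passes_filter(record) else 1.0
            # A larger weight pushes the key toward 1, so the record is
            # more likely to end up among the k largest keys.
            yield rng.random() ** (1.0 / weight), record

    largest = heapq.nlargest(k, keyed(), key=lambda pair: pair[0])
    return [record for _, record in largest]


if __name__ == "__main__":
    # Only 1% of these hypothetical records pass the filter.
    data = range(10_000)
    sample = filter_biased_sample(data, lambda r: r % 100 == 0, k=50,
                                  pass_weight=1000.0, rng=random.Random(1))
    print(sum(1 for r in sample if r % 100 == 0), "of", len(sample), "pass")
```

Records that fail the filter still have a chance of being sampled, so sheets upstream of the filter keep a usable preview.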
For join sheets, we generate a biased sample by evenly distributing the join keys: the chance of sampling a record drops if Datameer has already seen its join key too many times.
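The idea of down-weighting frequently seen join keys can be sketched as follows. This is an assumption-laden illustration, not Datameer's algorithm: it gives each record the weight 1 / (number of times its join key has been seen so far) and feeds that weight into weighted reservoir sampling. The function name key_balanced_sample is hypothetical.

```python
import heapq
import random
from collections import Counter


def key_balanced_sample(records, join_key, k, rng=None):
    """Keep k records while spreading the sample across join keys."""
    rng = rng or random.Random()
    seen = Counter()

    def keyed():
        for record in records:
            key = join_key(record)
            seen[key] += 1
            # Assumed weight: 1 / seen[key]. The weighted-reservoir key
            # u ** (1 / weight) then simplifies to u ** seen[key], which
            # shrinks as the same join key keeps reappearing.
            yield rng.random() ** seen[key], record

    largest = heapq.nlargest(k, keyed(), key=lambda pair: pair[0])
    return [record for _, record in largest]


if __name__ == "__main__":
    # One dominant key plus 100 rare keys (hypothetical data).
    rows = [("common", i) for i in range(9_900)]
    rows += [(f"rare{i}", i) for i in range(100)]
    sample = key_balanced_sample(rows, lambda row: row[0], k=50,
                                 rng=random.Random(7))
    print(len({row[0] for row in sample}), "distinct join keys in the sample")
```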
Optimized Previews
The optimized previews (smart samples) can only be generated when the workbook is being run, and they are generated for the data source sheets and kept sheets. Optimized preview generation can be disabled by setting the property das.sampling.lookahead.maxdepth to 0, which can be useful for critical, time-constrained workbooks.
Data Links
The initial sample for a data link isn't representative. Currently, the sample is just the first k records of the data source. Once the data link is used in a workbook, an optimized preview is generated for it.
Sampling from Partitions
Datameer can divide up the importing, linking, or uploading of data through date partitioning. Datameer reads each partition and samples it separately to get the best representation of the incoming data. When running an import job with multiple partitions, take into account that the sample size set while configuring or editing the job settings is applied to each partition. (E.g., an import job that has three partitions with a sample size value set at 5000 brings in up to 15,000 sample records.) The sample size can be decreased to improve performance, though a larger sample size better represents the data.
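The per-partition arithmetic can be checked with a short sketch. It assumes that a partition holding fewer records than the configured sample size contributes all of its records; the function name and the partition sizes are hypothetical.

```python
def total_sample_records(partition_sizes, sample_size):
    """Upper bound on sample records when the sample size applies per partition."""
    # Assumption: a partition cannot contribute more records than it holds.
    return sum(min(size, sample_size) for size in partition_sizes)


if __name__ == "__main__":
    # Three partitions, each larger than the configured sample size of 5000.
    print(total_sample_records([10_000, 20_000, 7_000], 5000))
```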