Datameer X offers many features that enable users to work with large data sets. One of those features is Smart Sampling. Smart Sampling reliably provides a view on your data that enables you to build an analytic model with real-time feedback.

...

For optimized performance and an accurate representation of your data, Datameer X provides a default initial sample of 5000 records (the size of which can be customized for data links, import jobs and file uploads) when you start your analysis. This is a representative sample generated via the distributed reservoir sampling technique:

...
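As a rough illustration of reservoir sampling in general (not Datameer X's actual implementation), the sketch below keeps a uniform sample of up to k records from a stream and then merges two per-partition reservoirs into one. The function names `reservoir_sample` and `merge_reservoirs` are hypothetical and chosen only for this example.

```python
import random

def reservoir_sample(records, k, rng=None):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

def merge_reservoirs(a, n_a, b, n_b, k, rng=None):
    """Merge two reservoirs drawn from n_a and n_b records into one of size k.

    Each pick comes from a side with probability proportional to the number
    of records that side still represents, so the result stays uniform.
    """
    rng = rng or random.Random()
    a, b = list(a), list(b)
    merged = []
    while len(merged) < k and (n_a + n_b) > 0:
        if rng.random() < n_a / (n_a + n_b):
            merged.append(a.pop(rng.randrange(len(a))))
            n_a -= 1
        else:
            merged.append(b.pop(rng.randrange(len(b))))
            n_b -= 1
    return merged
```

In a distributed setting, each worker could run `reservoir_sample` on its own split and report the reservoir together with the number of records it read, so that a final step combines them with `merge_reservoirs`.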

While designing an analysis on the initial representative sample, you might end up with an empty sheet (e.g., because of filters) even though the full data set might contain records that would pass the filter. Smart Sampling overcomes this problem by resampling the data, taking the operations designed up to that point into account. Run your partially designed workbook and the new smart sample populates the sheet with records. Flip Side occasionally uses this preview data as well.
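The empty sheet problem and the idea behind resampling can be sketched as follows. This is a simplified model under stated assumptions, not Datameer X's implementation: `initial_sample` and `smart_sample` are hypothetical helpers, and the full data set is assumed to fit in a Python list.

```python
import random

def initial_sample(records, k, rng):
    """Uniform sample taken before any workbook operation is known."""
    return rng.sample(records, min(k, len(records)))

def smart_sample(records, predicate, k, rng):
    """Resample with the filter taken into account: only records that
    pass the designed filter are candidates for the sample."""
    passing = [r for r in records if predicate(r)]
    return rng.sample(passing, min(k, len(passing)))

if __name__ == "__main__":
    rng = random.Random(0)
    records = list(range(10_000))
    is_rare = lambda r: r % 1000 == 0  # only 10 of 10,000 records pass

    preview = [r for r in initial_sample(records, 50, rng) if is_rare(r)]
    print("filtered initial sample:", preview)  # often empty for rare matches
    print("smart sample:", smart_sample(records, is_rare, 5, rng))
```

Filtering a small uniform sample frequently leaves nothing when few records match, whereas sampling after the filter always returns matching records if any exist.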

...

In order to deal with the empty sheet problem, Datameer X biases the samples.

Filtered and Joined Sheets

...

For joined sheets, Datameer X generates a biased sample by evenly distributing the join keys: the chance of sampling a record is lowered if Datameer X has already seen its join key too many times.
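One possible way to express this kind of bias is shown below. The acceptance rule (accept a record with probability 1 / number of times its key has been seen) is an assumption made for illustration only; the source does not state the rule Datameer X uses, and `key_balanced_sample` is a hypothetical name.

```python
import random
from collections import Counter

def key_balanced_sample(records, key_of, k, rng=None):
    """Sample up to k records while damping frequently seen join keys."""
    rng = rng or random.Random()
    seen = Counter()
    sample = []
    for record in records:
        if len(sample) >= k:
            break
        key = key_of(record)
        seen[key] += 1
        # Assumed rule: the first record of a key is always kept, later
        # records of the same key are kept with decreasing probability.
        if rng.random() < 1.0 / seen[key]:
            sample.append(record)
    return sample
```

With a rule like this, a key that occurs once is as likely to appear in the sample as a key that occurs thousands of times, which raises the chance that both sides of a join have matching keys in their samples.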

...

The optimized previews (smart samples) can only be generated while the workbook is being run, and they are generated for the data source sheets and kept sheets. Optimized preview generation can be disabled by setting the property das.sampling.lookahead.maxdepth to 0, which can be useful for critical, time-constrained workbooks.

By default, Datameer X reads 5 GB (5368709120 bytes) of the file and then selects 5000 rows as sample data.
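The byte value above corresponds to 5 × 1024³. When choosing a different read limit, a small helper like the following can convert a size in binary gigabytes to bytes. The name `gib_to_bytes` is hypothetical and not part of Datameer X.

```python
def gib_to_bytes(gib):
    """Convert a size in binary gigabytes (1024**3 bytes) to whole bytes."""
    return int(gib * 1024 ** 3)

if __name__ == "__main__":
    print(gib_to_bytes(5))  # 5368709120, the default read limit
```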

The amount of data that Datameer X reads for a data link can be adjusted. If a cluster is busy, the amount can be decreased to improve performance by adding or editing the following custom property:

...

Sampling from Partitions

Datameer X can divide the data you import, link, or upload through date-based partitioning. Datameer X reads each partition and samples it separately to get the best representation of the incoming data. When importing a job with multiple partitions, take into account that the sample size set while configuring or editing the settings is applied to each partition. (E.g., an import job that has three partitions with a sample size of 5000 brings in up to 15,000 sample records.) The sample size can be decreased to improve performance, though a larger sample size better represents the data.
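The upper bound on sample records for a partitioned job follows directly from the per-partition rule. The helper below is a minimal sketch of that arithmetic; `max_sample_records` is a hypothetical name used only for this example.

```python
def max_sample_records(partition_count, sample_size_per_partition=5000):
    """Upper bound on sample records when each partition is sampled separately."""
    if partition_count < 0 or sample_size_per_partition < 0:
        raise ValueError("counts must be non-negative")
    return partition_count * sample_size_per_partition

if __name__ == "__main__":
    # Three partitions at the default sample size of 5000 records each.
    print(max_sample_records(3))  # 15000
```

The actual number can be lower, because a partition that holds fewer records than the sample size contributes only the records it has.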