
Duplicate Records Processed in Workbook Functions 

Details

Functions output duplicate records on some workbook sheets. Specifically, source Parquet files smaller than the split threshold may be read twice, and any downstream calculations then include the duplicated data.

The problem occurs only if the job processes at least one small file (smaller than the threshold) and at least one large file (larger than the threshold). The default threshold is 256 MB but may have been altered by system settings.
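For illustration only, the Python sketch below (not Datameer's actual code; the names plan_splits and SPLIT_THRESHOLD are hypothetical) shows how a split planner that handles large and small files on separate code paths could emit a small file twice when a job contains both kinds:

    # Hypothetical illustration of the flaw; not Datameer source code.
    SPLIT_THRESHOLD = 256 * 1024 * 1024  # default 256 MB, configurable via system settings

    def plan_splits(files):
        """files: list of (path, size_in_bytes). Returns the splits to read."""
        splits = []
        small = [f for f in files if f[1] < SPLIT_THRESHOLD]
        for path, size in files:
            if size >= SPLIT_THRESHOLD:
                # Large files are divided into threshold-sized splits.
                for offset in range(0, size, SPLIT_THRESHOLD):
                    splits.append((path, offset, min(SPLIT_THRESHOLD, size - offset)))
            else:
                # Small files are read whole ...
                splits.append((path, 0, size))
        # ... but a second code path that batches small files together is also
        # triggered whenever the job contains at least one large file, so each
        # small file lands in two splits and is read twice.
        if small and len(small) < len(files):
            for path, size in small:
                splits.append((path, 0, size))
        return splits

    # A job with one small and one large file duplicates the small one:
    print(plan_splits([("small.parquet", 10 * 1024 * 1024),
                       ("big.parquet", 600 * 1024 * 1024)]))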

Versions Affected:

  • Datameer 7.1: 7.1.3, 7.1.4 and 7.1.5
  • Datameer 6.4: 6.4.7, 6.4.8 and 6.4.9
  • Datameer 6.3: 6.3.9 and 6.3.10

Users Affected:

Datameer users who reference data sources stored in Parquet where at least one file is below the split threshold; that small file is read twice. Users of downstream workbooks may be indirectly impacted.
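As a quick risk check, the Python sketch below scans a locally accessible copy of a Parquet data source and reports whether it mixes files below and above the 256 MB default threshold (the directory path and helper name are hypothetical; on HDFS the same size information is available from a directory listing):

    import os

    THRESHOLD = 256 * 1024 * 1024  # adjust if system settings override the default

    def has_mixed_sizes(directory):
        """True if the directory holds Parquet files both below and above the threshold."""
        sizes = [os.path.getsize(os.path.join(root, name))
                 for root, _, names in os.walk(directory)
                 for name in names if name.endswith(".parquet")]
        return any(s < THRESHOLD for s in sizes) and any(s >= THRESHOLD for s in sizes)

    # Hypothetical path; replace with the data source's actual location.
    print(has_mixed_sizes("/data/warehouse/events"))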

Severity: 

High

Resolved Versions:

It is recommended to install the latest Datameer Maintenance Release. At a minimum, this issue is resolved in the following Datameer versions:

  • 6.4.10
  • 7.1.6

Datameer versions beyond 7.1.6 also include the fix. 

Immediate Action Required: 

To work around this issue, the following option is immediately available: 

  • Add the custom property das.splitting.disable-individual-file-splitting=true to the Hadoop Cluster's Custom Properties section, entered as shown below. (Note: this option may cause a significant performance penalty for very large Parquet files, which will no longer be split and therefore can no longer be processed in parallel.)
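In the Custom Properties field, the workaround is a single key=value line:

    das.splitting.disable-individual-file-splitting=true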

More information and updates may be available through the following KB article: Duplicate Records Processed in Workbook Functions
