Technical Bulletins
The Output Count of Union Sheet Changes If Some Sheets Aren't Kept
Details
Union sheets that aren't kept can output partial results in workbooks that combine hierarchical unions with map-side-join sheets. The problem only occurs in a very specific configuration that meets all of the following requirements:
- The workbook contains at least three union sheets:
  - The results of two union sheets are the sources for the third union sheet.
  - The union sheets aren't kept.
- The workbook contains at least one join sheet:
  - The join sheet must process as a map-side-join at runtime.
  - The join sheet must process data before at least one of the union sheets.
If all of the above requirements are met, the problem is as follows: if two union sheets ("UnionA" and "UnionB") feed another union sheet ("UnionResult") downstream in the same workbook, and the source union sheets (UnionA and/or UnionB) aren't kept sheets, the final union (UnionResult) lacks the records of three of the four initial input sources.
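The static part of this configuration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sheet` class, its fields, and `possibly_affected_unions` are hypothetical and are not part of Datameer or its database schema. Whether a join runs as a map-side-join, and in which order sheets are processed, is decided at runtime, so the sketch can only flag candidates.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:
    """Hypothetical, simplified description of a workbook sheet."""
    name: str
    kind: str                                     # "union", "join", or anything else
    kept: bool = True
    sources: list = field(default_factory=list)   # names of upstream sheets


def possibly_affected_unions(sheets):
    """Return names of union sheets that match the static part of the pattern.

    Runtime conditions (map-side-join execution, processing order) are
    NOT checked here, so a result only means "may be affected".
    """
    by_name = {s.name: s for s in sheets}

    # The bulletin requires at least one join sheet in the workbook.
    if not any(s.kind == "join" for s in sheets):
        return []

    flagged = []
    for sheet in sheets:
        if sheet.kind != "union":
            continue
        # Upstream sheets of this union that are themselves unions.
        union_sources = [by_name[n] for n in sheet.sources
                         if n in by_name and by_name[n].kind == "union"]
        # Two or more source unions, at least one of them not kept.
        if len(union_sources) >= 2 and any(not u.kept for u in union_sources):
            flagged.append(sheet.name)
    return flagged


# Example mirroring the UnionA / UnionB / UnionResult scenario above.
workbook = [
    Sheet("Input1", "data"), Sheet("Input2", "data"),
    Sheet("Input3", "data"), Sheet("Input4", "data"),
    Sheet("Lookup", "join", sources=["Input1", "Input2"]),
    Sheet("UnionA", "union", kept=False, sources=["Input1", "Input2"]),
    Sheet("UnionB", "union", kept=False, sources=["Input3", "Input4"]),
    Sheet("UnionResult", "union", sources=["UnionA", "UnionB"]),
]
print(possibly_affected_unions(workbook))
```

Setting `kept=True` on both source unions in this model removes `UnionResult` from the result.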
Versions affected
- Datameer 7.2: 7.2.3
- Datameer 7.1: 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7 and 7.1.8
- Datameer 6.4: 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7, 6.4.8, 6.4.9, 6.4.10, 6.4.11 and 6.4.12
- Datameer 6.3: All released versions
Users affected
Datameer users that build workbooks with hierarchical unions where the union sheets aren't kept may be affected. Users of downstream workbooks may be indirectly impacted.
Datameer administrators may verify if any workbooks are affected by using the following SQL command:
|
If the output of this query is 0 rows, then no workbooks are affected. If the output of this query contains workbook IDs, they may be affected and need attention in affected versions of Datameer.
Severity
High
Resolved versions
It is recommended to install the latest Datameer Maintenance Release. At a minimum, this issue is resolved in the following Datameer versions:
- 6.4.13
- 7.1.9
- 7.2.4
Datameer versions beyond 7.2.4 will also include the fix.
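As a quick aid for checking an installed version against the lists in this bulletin, the following Python sketch encodes the affected versions per release line. The function name and data layout are illustrative assumptions, not a Datameer tool. A version that isn't listed in this bulletin returns `False`, which means only "not listed here".

```python
# Affected patch ranges (inclusive) per release line, as listed in this bulletin.
AFFECTED_RANGES = {
    (7, 2): (3, 3),    # 7.2.3 only
    (7, 1): (3, 8),    # 7.1.3 through 7.1.8
    (6, 4): (0, 12),   # 6.4.0 through 6.4.12
}


def is_listed_as_affected(version: str) -> bool:
    """Return True if a 'major.minor.patch' version is listed as affected."""
    major, minor, patch = (int(part) for part in version.split("."))
    # All released 6.3 versions are listed as affected.
    if (major, minor) == (6, 3):
        return True
    bounds = AFFECTED_RANGES.get((major, minor))
    if bounds is None:
        return False   # release line not mentioned in this bulletin
    low, high = bounds
    return low <= patch <= high


for v in ("7.2.3", "7.2.4", "7.1.8", "7.1.9", "6.4.12", "6.4.13", "6.3.5"):
    print(v, is_listed_as_affected(v))
```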
Immediate action required
To work around this issue, there is an immediately available option:
In affected workbooks, configure all union sheets to be kept.
This workaround might impact the execution time of workbooks that contain newly saved sheets. It is recommended to schedule an update to the latest maintenance patch as soon as possible.
More information and updates may be available through the following KB article: Output count of Union Sheet changes if some sheets are not kept
Duplicate Records Processed in Workbook Functions
Details
Functions output duplicate records on some workbook sheets. Specifically, source Parquet files smaller than a size threshold may be read twice. Any downstream calculations include the duplicated data.
The problem only occurs if the job processes at least one small file (smaller than the threshold) and at least one large file (bigger than the threshold). The default threshold is 256 MB but can be altered through system settings.
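The trigger condition can be expressed as a small check over file sizes. The Python sketch below is illustrative only: the function name is hypothetical, the threshold is assumed to be 256 × 1024 × 1024 bytes (the bulletin states only "256MB"), and obtaining the sizes of the Parquet files from the actual storage system is left to the reader.

```python
# Assumed default threshold: 256 MB interpreted as 256 * 1024 * 1024 bytes.
DEFAULT_THRESHOLD_BYTES = 256 * 1024 * 1024


def has_small_and_large_files(sizes, threshold=DEFAULT_THRESHOLD_BYTES):
    """Return True if the sizes include at least one file below the
    threshold and at least one file above it."""
    has_small = any(size < threshold for size in sizes)
    has_large = any(size > threshold for size in sizes)
    return has_small and has_large


# Example: one 10 MB file and one 300 MB file in the same job.
mb = 1024 * 1024
print(has_small_and_large_files([10 * mb, 300 * mb]))
# Example: only small files, so the condition is not met.
print(has_small_and_large_files([10 * mb, 20 * mb]))
```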
Versions affected
- Datameer 7.1: 7.1.3, 7.1.4 and 7.1.5
- Datameer 6.4: 6.4.7, 6.4.8 and 6.4.9
- Datameer 6.3: 6.3.9 and 6.3.10
Users affected
Datameer users that reference data from sources stored as Parquet files may be affected: files below a certain size may be read twice. Users of downstream workbooks may be indirectly impacted.
Severity
High
Resolved versions
It is recommended to install the latest Datameer Maintenance Release. At a minimum, this issue is resolved in the following Datameer versions:
- 6.4.10
- 7.1.6
Datameer versions beyond 7.1.6 also include the fix.
Immediate action required
To work around this issue, there is an immediately available option:
- Add the following custom property to the Hadoop Cluster's Custom Properties section:
das.splitting.disable-individual-file-splitting=true
Note: this option may significantly degrade performance for very large Parquet files, which will no longer be split and therefore can no longer be processed in parallel.
More information and updates may be available through the following KB article: Duplicate Records Processed in Workbook Functions