GROUPCOUNTDISTINCT
Syntax
GROUPCOUNTDISTINCT(<any>)
Description
Counts the distinct values of a group.
This is an aggregate function.
Example
Given the following data:
Customer_ | Name | Item |
---|---|---|
2535 | Jon | Zamioculcas |
2888 | Mike | Spider Lily |
2535 | Jon | Datura |
2535 | Jon | Dahlia |
5788 | Jon | Snapdragon |
2888 | Mike | Marguerite |
2535 | Jon | Windflower |
2888 | Mike | Elephant's Ear |
2535 | Jon | Chinese Evergreen |
5788 | Jon | Begonia |
5788 | Jon | Starfish Plant |
2535 | Jon | Hare's Foot Fern |
2535 | Jon | Venus Flytrap |
3545 | Monique | Red Flame Ivy |
First create a group using GROUPBY(#RawData!Name)
Name |
---|
Jon |
Mike |
Monique |
Then use the GROUPCOUNTDISTINCT(#RawData!Customer_), and the results are a count of each time an individual value appears in that column in relation to the GROUPBY() column.
Name | Customer_COUNTDISTINCT |
---|---|
Jon | 2 |
Mike | 1 |
Monique | 1 |
Known issue:
When working with the GROUPCOUNTDISTINCT function on a dataset with a very high number of records in a group, out of memory exceptions might occur.
Cause:
In order to determine the true distinct value, GROUPCOUNTDISTINCT shifts all mapped information to disk on one reducer. This can cause disk space and performance/memory issues on large datasets. This is a known limitation that Datameer is working to resolve.
Solution:
In order to work around this issue, an intermediary sheet along with a combination of GROUPBY and GROUPCOUNT functions is suggested as described below.
Original:
Sheet 1: user_id, product
Sheet 2: GROUPBY(Sheet1.product), GROUPCOUNTDISTINCT(Sheet2.user_id)
Workaround:
Sheet 1: user_id, produt
Sheet 2: groupby(Sheet1.product), groupby(Sheet1.user_id)
Sheet 3: group by(Sheet2.product), groupcount
This work around gives the same desired output of the GROUPCOUNTDISTINCT function.