Multiple Correlation Coefficient
This app calculates the coefficient of multiple correlation of every possible combination of features of a data set. The app then ranks each combination and finds the feature sets with the best descriptive/predictive power.
Multiple Correlation Coefficient Example
In many analyses, you try to describe or predict the values in a specific column using a collection of other columns in the data sheet. However, data sheets can contain hundreds of columns, making a brute-force approach computationally inconvenient.
Datameer can make many analyses on high-dimensional data sheets simpler if you can find a smaller subset of columns that correlate highly with the single column you are trying to predic.
This concept is highlighted by the example in the image below, where several different cars have been driven over varying distances and data was recorded about the results of the drive and the technical specifications of the cars themselves.
Data recorded includes:
- Distance travelled
- Price of the car
- Horsepower
- Weight
- Acceleration
- Length of the car frame
- Width of the car frame
- Wheelbase (the distance between front and back wheels)
- Rate of fuel consumption (mpg.)
The column of interest is mpg, so you use the data to find out which factors most affect the rate of fuel consumption and which combination of columns is the most accurate and efficient predictor of mpg.
With mpg as the target column, adopt a naming convention to help distinguish between two different kinds of columns. Each column constitutes one variable and there are two kinds of variable: dependent, which is the thing that you are trying to predict and independent, which is a column that is used in the prediction of the dependent variable.
Once your data sheet has been added in to the workbook and connected with the relevant sheets, you can run the workbook and view the results page.
Example results
The column in red is the MCC (sorted in descending order), which is the rating of how well correlated each group of independent variables is with the dependent variable. In this case, the group (highlighted in blue) with acceleration, distance, horsepower and weight is the most highly correlated and thus the best predictor.
While the second group contains an extra variable (carWidth) it doesn't provide enough predictive power to justify its addition.
Results sheet
The results sheet, adjustedRSquared, is the final sheet of the two workbooks that comprise the app. The workbook with the results has five columns:
- CombinationID: Shows each combination of features is designated with an ID number
- Combination: Lists which features are being considered. Each row has a different combination.
- RSquared: Shows the first value that is calculated is the R-squared. This measures the set of features' ability to explain the variance of the output.
- adjustedRSquared: Lists the adjusted R-squared, which penalises each feature unless it provides enough descriptive/predictive power. In comparison, R-squared always increases, regardless of the efficacy (or lack thereof) of the feature.
- MCC: Lists the MCC, which is the square root of the adjustedRSquared column. This Multiple Correlation Coefficient rates how well the features can be combined in to a linear equation to describe the output.
Connection and Import
The upload is used in the Input sheet in the _01_CorrelationMatrixInput workbook, but can be swapped with real life data by adding a new data sheet, using it to create a column dependencies sheet, and then adjusting the following sheets to reference the new column dependencies.
Workbook
_01_CorrelationMatrixInput
In this workbook, the individual correlation coefficients are calculated and prepared for the linear algebra functions. Pearson's correlation coefficient is calculated for every possible pair of features using column dependencies. Then, every possible pair of features is generated in the PairFinder sheet and the relevant Pearson coefficients from the Interdependency sheet are joined with each combination.
Next, the Pearson coefficient between the output variable and each feature stored in the YCoefficients sheet is joined with the relevant combination from the Joined3 sheet. This join sheet is duplicated and then used as input for the next workbook.
_02_Correlations
This workbook contains the linear algebra that actually calculates the MCC. It takes the MultipleCorrelationInput sheet from the previous workbook as input. The original Input sheet is also added as a data sheet in order to obtain the count of data points. The GROUP_R_SQUARED function is applied to the first input sheet and a GROUPCOUNT is applied to the second input sheet. The results of these functions are joined and the adjusted R-Squared (and consequently the MCC) is calculated in the final adjustedRSquared sheet.