Fraud Walkthrough
Statistical Analysis, Outlier Study, Seasonality
This answers questions such as:
- What are outlying transactions by spend category?
- Are there trends in fraudulent transitions across industries?
- Are there trends in fraudulent transitions during certain time frames?
- Who are the most common vendors targeted for fraud?
- What industries are most susceptible to fraud and when does it occur?
This is transferable to:
- Analytics designed to identify outliers
- Studies looking for seasonality trends
- Statistical analysis using standard deviation and function mathematics
Building the Use Case
About the data
Name/description | Type | Size | Location |
---|---|---|---|
CreditCardData | CSV | ~100,000 Records | |
CatCodes | CSV | ~900 Records |
About the use case
This use case builds three workbooks in order of complexity starting with the simplest first. The entire analytic use case includes:
- Outliers by Category
- Outliers by Spend
- Affected Vendors
Workbook 1 - Outliers by Category
About outliers by category
This workbook shows you if there are any category or industry trends when looking at time frames of three days. You can answer questions such as “What are the top five categories generating flagged transactions when looking at three-day time periods?”
Add the first data set
CreditCardData is a file upload using a CSV connector.
Schema notes:
- Delimiter: ,
- Column Names: are contained in the first row
- ADVANCED- Quote Character: “
Define a season
By creating small, three-day buckets, you can look for seasonality trends over small time frames. You can do this using the GROUPBYBIN function and setting the perimeters to 3d. This sheet references CreditCardData.
Create a new sheet and name it ThreeDayStats.
Ref | Column name | Formula |
---|---|---|
#A | CatCode | GROUPBY(#CreditCardData!VendorCategoryCode) |
#B | BinStart | GROUPBYBIN(#CreditCardData!Date;3d) |
#C | BinEnd | GROUPMAX(#CreditCardData!Date) |
#D | FlaggedTransacitnos | GROUPSUM(IF(DENULLIFY(#CreditCardData!isFraud);1;0)) |
This function converts a Boolean value from the original data set to an integer writing true values as a “1” and false as a “0”. It also accommodates null values by assuming it is false and writing “0” using the DENULLIFY function.
Add the second data set
CatCodes is a CSV file upload with no special schema notes, quote character is ( “ ). Column names are contained in the first row.
Join CatCodes to your ThreeDayStats
This join gives you the Category translation for the Category Code mentioned in your stats sheet. For example, CatCode 1536 could be Airlines. Use the join keys CatCode and CategoryCode, name the joined sheet Joined.
Determine the top 5 categories
Using your now enriched data, determine the top five categories generating flagged transactions by day.
Create a new sheet and name it top5FlaggedCategoriesByDay.
Ref | Column name | Formula |
---|---|---|
#A | Date | GROUPBY(#Joined!ThreeDayStats.BinStart) |
#B | Count | GROUPTOPN(#Joined!ThreeDayStats.FlaggedTransactions;5)* |
#C | Category | COPY(#Joined!CatCodes.Category) |
*GROUPTOPN allows you to expand the topN values in any given group, avoiding unnecessary sorts and filters. In this case, group on the date and reveal only the top five industries makes for a good chart.
Workbook 2 - Outliers by Spend
About outliers by spend
This workbook includes some more advanced analytics using statistical analysis and multi-column joins to build enriched datasets. Using those enriched records, you can find statistical outliers and build a defined seasonality model based on those extremes.
Add both datasets
CreditCardData was uploaded in the previous exercise and should be reused.
CatCodes was uploaded in the previous exercise and should be reused.
Set a baseline
In order to determine the outlying spends, a baseline must be set. You identify the average and the standard deviation in order to define the extremes in your dataset. This sheet references the CreditCardData set.
Create a new sheet and name it TransactionStats.
Ref | Column name | Formula |
---|---|---|
#A | Category | GROUPBY(#CreditCardData!VendorCategoryCode) |
#B | AverageSpend | GROUPAVERAGE(#CreditCardData!AmountUSD) |
#C | SpendStandardDeviation | GROUPSTDEVP(#CreditCardData!AmountUSD) |
Enrich the records with a join
The TransactionStats sets a baseline, but it’s missing key points of information. You want to see that baseline, but you also want the category code translation and the transaction details such as date and amount. To accomplish this, you need to multi-column join the CreditCardData with the CatCodes and the TransactionStats. Name this join sheet EnrichedRecords.
Choose your own adventure
There are to approaches to accomplishing the filter that follows. You can do an advanced filter within the sheet or build out the stats by duplicating the sheet and adding math to the end. You see both options listed below, you should feel comfortable demonstrating both methods but only execute one in the final workbook!
Determining outliers option 1: advanced filtering
The first option makes for a smaller workbook with fewer sheets, but also asks that you be comfortable with conditional language syntax. In natural language, this filter says “Show results if the amount minus the average is more than twice the size of the standard deviation in either direction (positive or negative).”
Duplicate EnrichedRecords. Name the sheet Outliers and include:
- CreditCardData.Date
- CatCodes.Category
- CreditCardData.AmountUSD
- TransactionStats.AverageSpend
- TransactionStats.SpendStandardDeviantion
Rename the columns to friendlier titles:
- PurchaseDate
- Category
- AmountUSD
- AverageSpend
- SpendStandardDeviation
Create a new filter within the sheet, click the Advanced tab, apply the following filter:
ABS(#AmountUSD - #AverageSpend) > 2* #SpendStandardDeviation
Determining outliers option 2: duplicate sheet with math column
For introductory classes, the math sheet can make more sense than an advanced filter. This allows you to leverage the auto-complete feature.
Duplicate EnrichedRecords. Name the sheet Outliers and include:
- CreditCardData.Date
- CatCodes.Category
- CreditCardData.AmountUSD
- TransactionStats.AverageSpend
- TransactionStats.SpendStandardDeviation
Rename the columns to friendlier titles:
- PurchaseDate
- Category
- AmountUSD
- AverageSpend
- SpendStandardDeviation
Ref | Column name | Formula |
---|---|---|
#A | PurchaseDate | #EnrichedRecords!CreditCardData.Date |
#B | Category | #EnrichedRecords!CatCodes.Category |
#C | AmountUS | #EnrichedRecords!CreditCardData.AmountUSD |
#D | AverageSpend | #EnrichedRecords!TransactionStats.AverageSpend |
#E | SpendStandardDeviation | #EnrichedRecords!TransactionStats.SpendStandardDeviation |
#F | isOutlier | ABS(#Outliers!AmountUSD-#Outliers!AverageSpend)>2*#Outliers!SpendStandardDeviation |
Datameer recommends starting with the basic math and wrapping ABS around it after demonstrating the basic formula.
Determine outlier trends by day
Now that you have a definition of an outlier, you can better understand if there are trends by industry or category on a given day or days throughout the year. For example, are the certain industries that are more likely to be targeted for fraud over the holiday season? Building this sheet with a filter could give you an idea, but a chart in an infographic shows you trends over the year. This sheet references the Outliers sheet.
Create a new sheet and name it OutlyingByDate.
Ref | Column name | Formula |
---|---|---|
#A | Date | GROUPBY(#Outliers!PurchaseDate) |
#B | Category | GROUPBY(#Outliers!Category) |
#C | OutlyingSpends | GROUPCOUNT() |
Within the sheet, create an ascending sort on the Date column [+].
Within the sheet, add a descending sort on the OutlyingSpends column.
- This sort highlights the number outlying spend for each date.
Create a new simple filter within the sheet where Date is before November 1st 2008.
This filter is in place to focus on a time frame of data that you know is interesting.
Workbook 3 - Affected Vendors
About affected vendors
This workbook is the most advanced of the use case because it encompasses concepts from the previous two workbooks and introduces new ideas like date format and column styling. This workbook tells you about significant fraud trends across vendors and categories.
You could have included these analytics in our previous workbooks, but it creates a large amount of sheets, limits your ability to share analytics across multiple teams, and makes it difficult to understand intended purpose of each workbook.
Add both data sets
CreditCardData was uploaded in the previous exercise and should be reused.
CatCodes was uploaded in the previous exercise and should be reused.
Determine the average fraud across vendors
Using some basic grouping and averages, You can determine the average occurrence of fraud for a given vendor. This sheet references the CreditCardData set.
Create a new sheet and name it VendorStats.
Ref | Column name | Formula |
---|---|---|
#A | VendorID | GROUPBY(#CreditCardData!VendorID) |
#B | Vendor | GROUPANY(#CreditCardData!Vendor)* |
#C | VendorFraudAvg | GROUPAVERAGE(IF(DENULLIFY(#CreditCardData!isFraud);1;0))** |
* GROUPANY takes any record from a given group and returns it. A VendorID is unique to each vendor, so we can take any VendorID from the group and it is an accurate name.
** You generate an average occurrence of fraud, this column is converting Boolean into integer in order to apply the average math. You also account for null values and converting them to 0s using DENULLIFY.
Determine the average fraud across categories
Using some basic grouping and averages, you can determine the average occurrence of fraud for a given category. This sheet is made referencing the CreditCardData set. You aren't using a GROUPANY in this case because the category translation doesn’t exist in your CreditCardDataset, and you would have to create a join ahead of time and this information is not valuable yet. You create a join later in the workbook when the category translation provides more value.
Create a new sheet and name it CategoryStats.
Ref | Column name | Formula |
---|---|---|
#A | VendorCategoryCode | GROUPBY(#CreditCardData!VendorCategoryCode) |
#B | CategoryFraudAvg | GROUPAVERAGE(IF(DENULLIFY(#CreditCardData!isFraud);1;0)) |
You are generating an average occurrence of fraud just as in the previous table. This column is converting Boolean into integer in order to apply the average math.
Build robust three-day statistics
The next sheet builds very powerful analytics and is used to further enrich our records. You build a three-day time window and measure fraud by vendor, their industry, how many instances of fraud occur, and what the average is during that time frame. Some of these concepts should resonate from the previous workbooks as the process is similar. This sheet references the CreditCardData set.
Create a new sheet and name it ThreeDayStats.
Ref | Column name | Formula |
---|---|---|
#A | Date | GROUPBYBIN(#CreditCardData!Date;3d) |
#B | VendorID | GROUPBY(#CreditCardData!VendorID) |
#C | VendorCategoryCode | GROUPANY(#CreditCardData!VendorCategoryCode) |
#D | FraudCount | GROUPSUM(IF(DENULLIFY(#CreditCardData!isFraud);1;0)) |
#E | ThreeDayVendorAvg | GROUPAVERAGE(IF(DENULLIFY(#CreditCardData!isFraud);1;0)) |
Enrich the ThreeDayStats with a join
In order to create meaningful analytics, you need to collect information that’s speed across four different sheets. You want vendor information, category information, and statistical results you've created about vendors and categories. Use Inner Joins and name the sheet EnrichedRecords.
Find the “hot vendors”
The meaningful insight of this use case lies in the ability to target specific vendors who are susceptible to fraudulent transactions. You can determine this using advanced conditional logic in the filter window. There are multiple ways to perform this logic including spreading the math and conditions across multiple columns, but you use an advanced filter in this case to demonstrate the flexibility and power of this option. If you demonstrated the advanced filter option in the previous workbook, this builds nicely on top of that concept.
Duplicate EnrichedRecords. Name the sheet HotVendors.
Visualize your 3-day timespan
In column A, format the date to make it easier to populate an infographic with the most readable date format. This formula references the 3-day bin you created in a previous sheet and adds three days because the bins display the “minimum” value of the range and you want to see the entire three day timespan. The purpose of this formula is to create a visually appealing infographic.
Ref | Column name | Formula |
---|---|---|
#A | Date | FORMATDATE(#EnrichedRecords!ThreeDayStats.Date;"MMM dd") +" - " +FORMATDATE(#EnrichedRecords!ThreeDayStats.Date + 3d;"MMM dd, yyyy") |
#B | Vendor | #EnrichedRecords!VendorStats.Vendor |
#C | Category | #EnrichedRecords!CatCodes.Category |
#D | FraudCount | #EnrichedRecords!ThreeDayStats.FraudCount |
#E | VendorFraudAvg | #EnrichedRecords!VendorStats.VendorFraudAvg |
#F | CategoryFraudAvg | #EnrichedRecords!CategoryStats.CategoryFraudAvg |
#G | ThreeDayVendorAvg | #EnrichedRecords!ThreeDayStats.ThreeDayVendorAvg |
Find the outliers with an advanced filter
The plain English translation says "Show me results if the Three Vendor Average Fraud is greater than 120% of the General Fraud Average OR if the Three Day Vendor Average is greater than 120% of the General Category Fraud Average AND the fraud count is greater than 0." The 120% is a number defined by the analyst based on their definition of an outlier or significant deviation.
Create a new filter within the sheet, click the Advanced tab, and apply the following filter:
(#ThreeDayVendorAvg > 1.2 * #VendorFraudAvg || #ThreeDayVendorAvg > 1.2 * #CategoryFraudAvg) && #FraudCount > 0
View the top targets
By applying a sort, you can now see when fraud is common, across what industry categories, and at exactly which vendors they typically occur.
Within the sheet, create a descending sort on the FraudCount column.
Building the Infographics
Outlier transactions by fraud flag
When a credit card is being used after it has been reported as missing, it raises a flag. This chart shows which types of vendors are most affected by possible credit card fraud and when, most likely, the fraud occurs.
Widget type | Workbook/sheet | Label | Data | Category |
---|---|---|---|---|
Line and Area Chart | OutliersByCategory/ top5FlaggedCategoriesByDay | Date | Count | Category |
Max Rows | 2500 | |||
Label Spacing | 6 | |||
X-Label Spacing | 12% |
Outlier transactions by spends
Credit cards being used outside of their normal spending patterns is a good way to detect possible fraud. This chart looks at time and vendor categories that are most affected by credit cards being used outside of their normal spend patterns.
Widget type | Workbook/sheet | Label | Data | Category |
---|---|---|---|---|
Line and Area Chart | OutlyingByDate | Date | OutlyingSpends | Category |
Max Rows | 5000 | |||
Label Spacing | 1 |
Affected Vendors
This table shows the vendor name where the most fraud occurrences have taken place within a three day time frame.
Widget type | Workbook/sheet | Data1 | Data2 | Data3 |
---|---|---|---|---|
Table | HotVendors | Vendor | Date | FraudCount |
Max Rows | 100 | |||
Stroke Sizes | 5, 2, 2, 0 | |||
Cell Height, Width | 28, 70 |