Fraud Walkthrough

Statistical Analysis, Outlier Study, Seasonality

This answers questions such as:

  • What are outlying transactions by spend category? 
  • Are there trends in fraudulent transitions across industries? 
  • Are there trends in fraudulent transitions during certain time frames? 
  • Who are the most common vendors targeted for fraud?
  • What industries are most susceptible to fraud and when does it occur?

This is transferable to:

  • Analytics designed to identify outliers
  • Studies looking for seasonality trends
  • Statistical analysis using standard deviation and function mathematics

Building the Use Case

About the data

Name/descriptionTypeSizeLocation
CreditCardDataCSV~100,000 Records
CatCodesCSV~900 Records

About the use case

This use case builds three workbooks in order of complexity starting with the simplest first. The entire analytic use case includes: 

  •  Outliers by Category
  •  Outliers by Spend
  •  Affected Vendors

Workbook 1 - Outliers by Category

About outliers by category

This workbook shows you if there are any category or industry trends when looking at time frames of three days. You can answer questions such as “What are the top five categories generating flagged transactions when looking at three-day time periods?”

Add the first data set

CreditCardData is a file upload using a CSV connector.

Schema notes:

  • Delimiter: ,
  • Column Names: are contained in the first row
  • ADVANCED- Quote Character:

Define a season

By creating small, three-day buckets, you can look for seasonality trends over small time frames. You can do this using the GROUPBYBIN function and setting the perimeters to 3d. This sheet references CreditCardData.

Create a new sheet and name it ThreeDayStats.

RefColumn nameFormula

#A

CatCode

GROUPBY(#CreditCardData!VendorCategoryCode)

#B

BinStart

GROUPBYBIN(#CreditCardData!Date;3d)

#C

BinEnd

GROUPMAX(#CreditCardData!Date)

#D

FlaggedTransacitnos

GROUPSUM(IF(DENULLIFY(#CreditCardData!isFraud);1;0))

This function converts a Boolean value from the original data set to an integer writing true values as a “1” and false as a “0”. It also accommodates null values by assuming it is false and writing “0” using the DENULLIFY function.

Add the second data set

CatCodes is a CSV file upload with no special schema notes, quote character is ( ). Column names are contained in the first row.

Join CatCodes to your ThreeDayStats

This join gives you the Category translation for the Category Code mentioned in your stats sheet. For example, CatCode 1536 could be Airlines. Use the join keys CatCode and CategoryCode, name the joined sheet Joined.

Determine the top 5 categories

Using your now enriched data, determine the top five categories generating flagged transactions by day. 

Create a new sheet and name it top5FlaggedCategoriesByDay.

RefColumn nameFormula
#ADate GROUPBY(#Joined!ThreeDayStats.BinStart)
#BCount GROUPTOPN(#Joined!ThreeDayStats.FlaggedTransactions;5)*

#C

 Category COPY(#Joined!CatCodes.Category)

*GROUPTOPN allows you to expand the topN values in any given group, avoiding unnecessary sorts and filters. In this case, group on the date and reveal only the top five industries makes for a good chart.

Workbook 2 - Outliers by Spend


About outliers by spend

This workbook includes some more advanced analytics using statistical analysis and multi-column joins to build enriched datasets. Using those enriched records, you can find statistical outliers and build a defined seasonality model based on those extremes. 

Add both datasets

CreditCardData was uploaded in the previous exercise and should be reused.

CatCodes was uploaded in the previous exercise and should be reused.

Set a baseline

In order to determine the outlying spends, a baseline must be set. You identify the average and the standard deviation in order to define the extremes in your dataset. This sheet references the CreditCardData set.

Create a new sheet and name it TransactionStats.

RefColumn nameFormula
#ACategory

GROUPBY(#CreditCardData!VendorCategoryCode)

#BAverageSpend

GROUPAVERAGE(#CreditCardData!AmountUSD)

#CSpendStandardDeviation

GROUPSTDEVP(#CreditCardData!AmountUSD)

Enrich the records with a join

The TransactionStats sets a baseline, but it’s missing key points of information. You want to see that baseline, but you also want the category code translation and the transaction details such as date and amount. To accomplish this, you need to multi-column join the CreditCardData with the CatCodes and the TransactionStats. Name this join sheet EnrichedRecords.

Choose your own adventure 

There are to approaches to accomplishing the filter that follows. You can do an advanced filter within the sheet or build out the stats by duplicating the sheet and adding math to the end. You see both options listed below, you should feel comfortable demonstrating both methods but only execute one in the final workbook!

Determining outliers option 1: advanced filtering

The first option makes for a smaller workbook with fewer sheets, but also asks that you be comfortable with conditional language syntax. In natural language, this filter says “Show results if the amount minus the average is more than twice the size of the standard deviation in either direction (positive or negative).”

Duplicate EnrichedRecords. Name the sheet Outliers and include:

  • CreditCardData.Date
  • CatCodes.Category
  • CreditCardData.AmountUSD
  • TransactionStats.AverageSpend
  • TransactionStats.SpendStandardDeviantion

Rename the columns to friendlier titles:

  • PurchaseDate
  • Category
  • AmountUSD
  • AverageSpend
  • SpendStandardDeviation

Create a new filter within the sheet, click the Advanced tab, apply the following filter:

  •   ABS(#AmountUSD - #AverageSpend) > 2* #SpendStandardDeviation

Determining outliers option 2: duplicate sheet with math column

For introductory classes, the math sheet can make more sense than an advanced filter. This allows you to leverage the auto-complete feature.

Duplicate EnrichedRecords. Name the sheet Outliers and include:

  • CreditCardData.Date
  • CatCodes.Category
  • CreditCardData.AmountUSD
  • TransactionStats.AverageSpend
  • TransactionStats.SpendStandardDeviation


Rename the columns to friendlier titles:

  • PurchaseDate
  • Category
  • AmountUSD
  • AverageSpend
  • SpendStandardDeviation


RefColumn nameFormula
#A

PurchaseDate

#EnrichedRecords!CreditCardData.Date

#B

Category

#EnrichedRecords!CatCodes.Category

#C

AmountUS

#EnrichedRecords!CreditCardData.AmountUSD

#D

AverageSpend

#EnrichedRecords!TransactionStats.AverageSpend

#E

SpendStandardDeviation

#EnrichedRecords!TransactionStats.SpendStandardDeviation

#F

isOutlier

ABS(#Outliers!AmountUSD-#Outliers!AverageSpend)>2*#Outliers!SpendStandardDeviation

Datameer recommends starting with the basic math and wrapping ABS around it after demonstrating the basic formula.

Determine outlier trends by day

Now that you have a definition of an outlier, you can better understand if there are trends by industry or category on a given day or days throughout the year. For example, are the certain industries that are more likely to be targeted for fraud over the holiday season? Building this sheet with a filter could give you an idea, but a chart in an infographic shows you trends over the year. This sheet references the Outliers sheet.

Create a new sheet and name it OutlyingByDate.

RefColumn nameFormula
#ADate

GROUPBY(#Outliers!PurchaseDate)

#BCategory

GROUPBY(#Outliers!Category)

#COutlyingSpends

GROUPCOUNT()

Within the sheet, create an ascending sort on the Date column [+].

Within the sheet, add a descending sort on the OutlyingSpends column.

  • This sort highlights the number outlying spend for each date.

Create a new simple filter within the sheet where Date is before November 1st 2008.

  •   This filter is in place to focus on a time frame of data that you know is interesting.

Workbook 3 - Affected Vendors

About affected vendors

This workbook is the most advanced of the use case because it encompasses concepts from the previous two workbooks and introduces new ideas like date format and column styling. This workbook tells you about significant fraud trends across vendors and categories. 

You could have included these analytics in our previous workbooks, but it creates a large amount of sheets, limits your ability to share analytics across multiple teams, and makes it difficult to understand intended purpose of each workbook. 

Add both data sets

CreditCardData was uploaded in the previous exercise and should be reused.

CatCodes was uploaded in the previous exercise and should be reused.

Determine the average fraud across vendors

Using some basic grouping and averages, You can determine the average occurrence of fraud for a given vendor. This sheet references the CreditCardData set. 

Create a new sheet and name it VendorStats.

RefColumn nameFormula
#AVendorID

GROUPBY(#CreditCardData!VendorID)

#BVendor

GROUPANY(#CreditCardData!Vendor)*

#C

VendorFraudAvgGROUPAVERAGE(IF(DENULLIFY(#CreditCardData!isFraud);1;0))**

GROUPANY takes any record from a given group and returns it. A VendorID is unique to each vendor, so we can take any VendorID from the group and it is an accurate name.

** You generate an average occurrence of fraud, this column is converting Boolean into integer in order to apply the average math. You also account for null values and converting them to 0s using DENULLIFY.

Determine the average fraud across categories 

Using some basic grouping and averages, you can determine the average occurrence of fraud for a given category. This sheet is made referencing the CreditCardData set. You aren't using a GROUPANY in this case because the category translation doesn’t exist in your CreditCardDataset, and you would have to create a join ahead of time and this information is not valuable yet. You create a join later in the workbook when the category translation provides more value. 

Create a new sheet and name it CategoryStats.

RefColumn nameFormula
#A

VendorCategoryCode

GROUPBY(#CreditCardData!VendorCategoryCode)

#B

CategoryFraudAvg

GROUPAVERAGE(IF(DENULLIFY(#CreditCardData!isFraud);1;0))

You are generating an average occurrence of fraud just as in the previous table. This column is converting Boolean into integer in order to apply the average math.

Build robust three-day statistics  

The next sheet builds very powerful analytics and is used to further enrich our records. You build a three-day time window and measure fraud by vendor, their industry, how many instances of fraud occur, and what the average is during that time frame. Some of these concepts should resonate from the previous workbooks as the process is similar. This sheet references the CreditCardData set.

Create a new sheet and name it ThreeDayStats.

RefColumn nameFormula
#A

Date

GROUPBYBIN(#CreditCardData!Date;3d)

#B

VendorID

GROUPBY(#CreditCardData!VendorID)

#C

VendorCategoryCode

GROUPANY(#CreditCardData!VendorCategoryCode)

#D

FraudCount

GROUPSUM(IF(DENULLIFY(#CreditCardData!isFraud);1;0))

#E

ThreeDayVendorAvg

GROUPAVERAGE(IF(DENULLIFY(#CreditCardData!isFraud);1;0))

Enrich the ThreeDayStats with a join 

In order to create meaningful analytics, you need to collect information that’s speed across four different sheets. You want vendor information, category information, and statistical results you've created about vendors and categories. Use Inner Joins and name the sheet EnrichedRecords.

Find the “hot vendors” 

The meaningful insight of this use case lies in the ability to target specific vendors who are susceptible to fraudulent transactions. You can determine this using advanced conditional logic in the filter window. There are multiple ways to perform this logic including spreading the math and conditions across multiple columns, but you use an advanced filter in this case to demonstrate the flexibility and power of this option. If you demonstrated the advanced filter option in the previous workbook, this builds nicely on top of that concept.

Duplicate EnrichedRecords. Name the sheet HotVendors.


 Visualize your 3-day timespan

In column A, format the date to make it easier to populate an infographic with the most readable date format. This formula references the 3-day bin you created in a previous sheet and adds three days because the bins display the “minimum” value of the range and you want to see the entire three day timespan. The purpose of this formula is to create a visually appealing infographic. 

RefColumn nameFormula
#A

Date

FORMATDATE(#EnrichedRecords!ThreeDayStats.Date;"MMM dd") +" - " +FORMATDATE(#EnrichedRecords!ThreeDayStats.Date + 3d;"MMM dd, yyyy")
#B

Vendor

#EnrichedRecords!VendorStats.Vendor

#C

Category

#EnrichedRecords!CatCodes.Category

#D

FraudCount

#EnrichedRecords!ThreeDayStats.FraudCount

#E

VendorFraudAvg

#EnrichedRecords!VendorStats.VendorFraudAvg

#F

CategoryFraudAvg

#EnrichedRecords!CategoryStats.CategoryFraudAvg

#G

ThreeDayVendorAvg

#EnrichedRecords!ThreeDayStats.ThreeDayVendorAvg

Find the outliers with an advanced filter

The plain English translation says "Show me results if the Three Vendor Average Fraud is greater than 120% of the General Fraud Average OR if the Three Day Vendor Average is greater than 120% of the General Category Fraud Average AND the fraud count is greater than 0." The 120% is a number defined by the analyst based on their definition of an outlier or significant deviation.

Create a new filter within the sheet, click the Advanced tab, and apply the following filter:

(#ThreeDayVendorAvg > 1.2 * #VendorFraudAvg || #ThreeDayVendorAvg > 1.2 * #CategoryFraudAvg) && #FraudCount > 0



View the top targets 

By applying a sort, you can now see when fraud is common, across what industry categories, and at exactly which vendors they typically occur. 

Within the sheet, create a descending sort on the FraudCount column.

Building the Infographics

Outlier transactions by fraud flag

When a credit card is being used after it has been reported as missing, it raises a flag. This chart shows which types of vendors are most affected by possible credit card fraud and when, most likely, the fraud occurs.

Widget typeWorkbook/sheetLabelDataCategory
Line and Area Chart
 

OutliersByCategory/

top5FlaggedCategoriesByDay

DateCountCategory
Max Rows2500


Label Spacing6


X-Label Spacing12%


Outlier transactions by spends

Credit cards being used outside of their normal spending patterns is a good way to detect possible fraud. This chart looks at time and vendor categories that are most affected by credit cards being used outside of their normal spend patterns.

Widget typeWorkbook/sheetLabelDataCategory

Line and Area Chart

OutlyingByDateDateOutlyingSpendsCategory
Max Rows5000


Label Spacing1


Affected Vendors

This table shows the vendor name where the most fraud occurrences have taken place within a three day time frame.

Widget typeWorkbook/sheetData1Data2Data3

Table

HotVendorsVendorDateFraudCount
Max Rows100


Stroke Sizes

5, 2, 2, 0




Cell Height, Width

28, 70