Identity Fraud and Analytics – An Overview

With consumers increasingly concerned about identity theft, leading financial institutions are leveraging analytics to detect identity fraud as it happens.

Joseph R. Barr, March 2014

Imagine your shock at discovering that, unbeknownst to you, you are now the owner of a Visa credit card from some bank in Utah, and worse, that the credit reporting agency informs you that “your” $3,000 balance on said card is 30 days past due. You react like many of us would, in sheer panic, because you don’t recognize any of the details and can’t figure out how, given your conservative financial habits, this could have happened to you.

Well, congratulations! You are now one of the millions who share this experience: you are a victim of a particular kind of fraud called identity theft.

Identity Theft

Loosely speaking, identity theft is a form of fraud in which a fraudster assumes a different persona, that of some “innocent bystander,” in order to fraudulently receive goods or services they don’t intend to pay for. Fraud is hardly a new thing - impersonation and deceit for personal gain are as old as Methuselah - but in this brave new world of easy credit, given the impersonal means by which credit is obtained, and because of its sheer magnitude and impact, identity theft has recently become a real problem.

Consequences of identity fraud are significant to consumers and credit-awarding enterprises alike. Estimated losses from identity fraud run in the billions (estimates vary; e.g., Javelin Strategy & Research of Pleasanton, California, put them upward of $20 billion in 2012), and millions of people are affected (Javelin estimates that 5% of US adults are affected). Even if you are ultimately not held responsible for the losses, losses to the enterprise are invariably passed on to you, the consumer. The graph below summarizes the number of people affected and total losses.
[Graph: losses from identity theft and number of victims of identity theft]

Fighting identity theft isn’t easy. Absent a rigorous authentication process, it’s difficult to catch, say, an adult daughter assuming her mother’s identity. The reality is that credit-awarding organizations (banks, retailers, cell phone servicers, etc.) face competitive pressure to drive up volume, so any unnecessary friction with consumers is frowned upon. Indeed, many consider identity verification an unnecessary friction point.

In addition, businesses operate under sometimes strict regulatory constraints which, right or wrong, limit their options to vet customers based upon, say, physical appearance. Increasingly, banks and merchants rely on scoring solutions to help fight identity thieves.

To remind the reader, a score is a numerical value thought of as the probability that an application is fraudulent. Score-producing statistical algorithms rely on a narrative that strives to capture the essential features, logic, and mechanics of identity theft (and thieves), from which risk factors are identified and extracted.

The industry recognizes hundreds of risk factors or features, although admittedly some of those provide a rather weak signal. This extraction process results in a vector of features, the input vector, from which a statistician or data scientist must learn a functional relationship that quantifies the level of risk associated with each input vector.

Although the technical details are quite complex, the basic idea is this: applicants provide personally identifying information (PII: name, DOB, SSN, address), which, in some cases, is presumably not true or “authentic.”

Various, often proprietary, algorithms involving, e.g., proximity, matching, and velocity calculations process the raw data and produce an input vector for each record. The training set is labeled with values of +1 and -1, where confirmed fraud is tagged +1 and everything else -1.
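The actual matching and velocity calculations are proprietary, but the spirit of this step can be sketched in plain Python. The record fields (`ssn`, `name`, `card_name`) and the two features below are hypothetical stand-ins chosen for illustration, not the industry’s actual inputs:

```python
from collections import Counter

def extract_features(applications):
    """Sketch: turn raw application records into numeric input vectors.
    Field names and both features are illustrative assumptions."""
    # Velocity proxy: how many applications in the batch share an SSN.
    ssn_counts = Counter(app["ssn"] for app in applications)
    vectors = []
    for app in applications:
        velocity = ssn_counts[app["ssn"]]                   # shared-SSN count
        name_match = int(app["name"] == app["card_name"])   # exact-match flag
        vectors.append([velocity, name_match])
    return vectors

apps = [
    {"ssn": "123", "name": "A. Smith", "card_name": "A. Smith"},
    {"ssn": "123", "name": "B. Jones", "card_name": "A. Smith"},
    {"ssn": "999", "name": "C. Lee",   "card_name": "C. Lee"},
]
print(extract_features(apps))  # [[2, 1], [2, 0], [1, 1]]
```

The second record, which reuses an SSN under a different name, stands out immediately in this toy feature space.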

It is recognized that the fraud class (+1) is rare, normally comprising fewer than 5 percent of the entire portfolio. The training data are therefore input-output pairs as above, with the +1 class up-sampled to ensure near-parity between +1 and -1. As the reader surely knows, the industry is fond of a handful of statistical methods and machine learning algorithms for estimating the probability that an input vector belongs to the class labeled +1.
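The up-sampling step can be illustrated with a minimal pure-Python sketch (the function name and exact-parity target are assumptions; real pipelines may instead down-sample the majority class or reweight examples):

```python
import random

def upsample_minority(X, y, seed=0):
    """Duplicate +1 (fraud) examples until the classes reach parity.
    A sketch of the resampling step, not a production routine."""
    rng = random.Random(seed)
    pos = [(x, label) for x, label in zip(X, y) if label == +1]
    neg = [(x, label) for x, label in zip(X, y) if label == -1]
    # Draw extra copies of fraud cases, with replacement.
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    data = pos + neg + extra
    rng.shuffle(data)
    return [x for x, _ in data], [label for _, label in data]

X = [[i] for i in range(20)]
y = [+1] * 1 + [-1] * 19            # ~5% fraud, as in the text
Xb, yb = upsample_minority(X, y)
print(yb.count(+1), yb.count(-1))   # 19 19
```

Sampling with replacement matters: when the minority class is this small, the same fraud cases must appear many times in the balanced training set.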

At the same time, the industry is far from monolithic: for one reason or another, one organization prefers logistic regression while another prefers boosting. Specifically, ID Analytics/LifeLock uses boosting with stumps (depth-1 decision trees) for its ID Score algorithm, for two reasons:

  1. it seems to offer better resistance to that pesky overfitting problem, and
  2. it’s interpretable, i.e., the weights associated with the input variables help ‘explain’ the score.

Logistic regression (LOGIT) is similarly interpretable, but it requires significant development effort, especially with a high-dimensional input space. LOGIT is a legacy ID Analytics product, but LexisNexis uses it extensively for its fraud score products.
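A minimal LOGIT sketch, assuming a single hypothetical “velocity” feature and plain stochastic gradient descent (a production fraud model would use regularization and far richer inputs):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y01, lr=0.5, epochs=2000):
    """Gradient-descent LOGIT on labels in {0, 1}. The learned weights
    are directly interpretable: a positive weight means the feature
    pushes the score toward fraud."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y01):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t                                   # gradient of log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy data: higher application velocity -> more likely fraud.
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y01 = [0, 0, 0, 1, 1, 1]
w, b = fit_logit(X, y01)
probs = [sigmoid(w[0] * x[0] + b) for x in X]
print(w[0] > 0, all(p > 0.5 for p in probs[3:]))  # True True
```

The output probability is exactly the “score” described earlier: a number interpretable as the probability that an application is fraudulent.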

Originally developed by HNC Software, neural networks are used in FICO’s Falcon transaction-fraud product. A neural network model, however, is not easily interpretable, and a high regulatory hurdle prevents it from being used in FCRA-related scoring solutions.
(FCRA is the Fair Credit Reporting Act, a federal law that regulates consumer credit reporting.)

Other popular methods include decision trees/random forests and support vector machines. The “arms race” between fraudsters and fraud fighters continues in full fury: fraudsters seem to quickly adapt to improvements in scoring algorithms, and the cycle continues, notwithstanding harsh penalties imposed by law.

Joseph Barr
Joseph Barr is a data scientist with a 20-year track record of providing value to customers in government, healthcare, energy/utilities, and finance. He currently teaches machine learning at San Diego State University and is at True Bearing Analytics. He is also an advisor at Analytics Ventures and Cyber United, and an advisory board member at Dataskill.