Big Data Analytics for Lenders and Creditors

Credit scoring means applying a statistical model to assign a risk score to a credit application or to an existing credit account. This article suggests how data science and big data can help lenders make better sense of the different risk factors and produce more accurate predictions.

Larger credit scoring process

Modeling is the process of creating a scoring rule from a set of examples. In order for modeling to be effective, it has to be integrated into a larger process. Let’s look at application scoring. On the input side, before the modeling, the set of example applications has to be prepared. On the output side, after the modeling, the scoring rule has to be executed on a set of new applications so that credit granting decisions can be made.

The collection of performance data sits at both the beginning and the end of the credit scoring process. Before a set of example applications can be prepared, performance data has to be collected so that applications can be tagged as ‘good’ or ‘bad’. After new applications have been scored and decided upon, the performance of the accepted applications again has to be tracked and reported, so that the scoring rule can be validated and possibly replaced, the acceptance policy can be fine-tuned and the current risk exposure can be calculated.

Choosing the right model

With available analytical technologies, it is possible to create a variety of model types, such as scorecards, decision trees or neural networks. When you evaluate which model type is best suited for achieving your goals, you may want to consider criteria such as the ease of applying the model, the ease of understanding it and the ease of justifying it. At the same time, for each particular model of whatever type, it is important to assess its predictive performance, i.e. the accuracy of the scores that the model assigns to the applications and the consequences of the accept/reject decisions that it suggests. A variety of business-relevant quality measures, such as concentration, strategy and profit curves, are used for this (see section Model Assessment in the case study section below). The best model will, therefore, be determined both by the purpose for which the model will be used and by the structure of the data set that it is validated on.


The traditional form of a credit scoring model is a scorecard. This is a table that contains a number of questions that an applicant is asked (called characteristics) and, for each such question, a list of possible answers (called attributes). One such characteristic may, for example, be the age of the applicant, and the attributes for this characteristic are then a number of age ranges that an applicant can fall into. For each answer, the applicant receives a certain number of points: more if the attribute is one of low risk, fewer if it is one of high risk. If the application’s total score exceeds a specified cut-off number of points, it is recommended for acceptance.

The scorecard model, apart from being a long-established method in the industry, still has several advantages when compared with more recent ‘data mining’ types of models, such as decision trees or neural networks. A scorecard is easy to apply: if needed, it can be evaluated on a sheet of paper in the presence of the applicant. It is easy to understand: the number of points for one answer does not depend on any of the other answers, and across the range of possible answers for one question the number of points usually increases in a simple way (often monotonically or even linearly). It is therefore often also easy to justify to the applicant a decision that is made on the basis of a scorecard. It is possible to disclose groups of characteristics where the applicant has a potential for improving the score, and to do so in terms broad enough not to risk manipulated future applications.
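The way a scorecard is applied can be sketched in a few lines of Python. This is only an illustration: the characteristics, attribute ranges, point values and cut-off below are invented for the example and do not come from any real scorecard.

```python
# Hypothetical scorecard: each characteristic maps to a list of
# (attribute test, points) pairs. All values are invented for illustration.
SCORECARD = {
    "age": [
        (lambda v: v < 25, 10),
        (lambda v: 25 <= v < 40, 25),
        (lambda v: v >= 40, 40),
    ],
    "residential_status": [
        (lambda v: v == "renter", 15),
        (lambda v: v == "owner", 35),
    ],
}
CUTOFF = 50  # hypothetical cut-off score


def score_application(application):
    """Sum the points of the one matching attribute per characteristic."""
    total = 0
    for characteristic, attributes in SCORECARD.items():
        value = application[characteristic]
        for matches, points in attributes:
            if matches(value):
                total += points
                break  # points for one answer do not depend on other answers
    return total


def recommend(application):
    """Recommend acceptance if the total score exceeds the cut-off."""
    return "accept" if score_application(application) > CUTOFF else "reject"
```

Because each characteristic contributes its points independently, the total is a plain sum, which is what makes a scorecard possible to evaluate by hand.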

Scorecard development process

Development sample

The development sample (input data set) is a balanced sample consisting of 1,500 good and 1,500 bad accepted applicants. ‘Bad’ has been defined as having been 90 days past due at least once. Everyone who is not ‘bad’ is ‘good’, so there are no ‘indeterminates’. A separate data set contains the data on rejects.

The modeling process, especially the validation charts, requires information about the actual good/bad proportion in the accept population. Sampling weights are used here to simulate that proportion. A weight of 30 is assigned to a good application and a weight of 1 to a bad one. Thereafter, all nodes in the process flow diagram treat the sample as if it consisted of 45,000 good applications and 1,500 bad applications. Figure 3 shows the distribution of good/bad after the application of sampling weights. The bad rate is 3.23%. A Data Partition node then splits a 50% validation data set away from the development sample. Models will later be compared based on this validation data set.
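The arithmetic behind the sampling weights can be checked with a short Python sketch. The function names are illustrative only; the counts and weights are the ones stated above.

```python
GOOD_WEIGHT = 30  # weight assigned to each good application
BAD_WEIGHT = 1    # weight assigned to each bad application


def weighted_counts(n_good, n_bad, good_weight=GOOD_WEIGHT, bad_weight=BAD_WEIGHT):
    """Return the effective (good, bad) counts after applying sampling weights."""
    return n_good * good_weight, n_bad * bad_weight


def bad_rate(n_good, n_bad, good_weight=GOOD_WEIGHT, bad_weight=BAD_WEIGHT):
    """Weighted share of bad applications in the sample."""
    goods, bads = weighted_counts(n_good, n_bad, good_weight, bad_weight)
    return bads / (goods + bads)


goods, bads = weighted_counts(1500, 1500)   # 45000 goods, 1500 bads
rate = bad_rate(1500, 1500)                 # 1500 / 46500, about 3.23 %
```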


Classing is the process of automatically and/or interactively binning and grouping interval, nominal or ordinal input variables in order to

  • manage the number of attributes per characteristic
  • improve the predictive power of the characteristic
  • select predictive characteristics
  • make the Weights of Evidence, and thereby the number of points in the scorecard, vary smoothly or even linearly across the attributes

The amount of points that an attribute is worth in a scorecard is determined by two factors:

  • the risk of the attribute relative to the other attributes of the same characteristic and
  • the relative contribution of the characteristic to the overall score

The relative risk of the attribute is determined by its ‘Weight of Evidence’. The contribution of the characteristic is determined by its coefficient in a logistic regression (see section Regression analysis below).

The Weight of Evidence of an attribute is defined as the logarithm of the ratio of the proportion of goods in the attribute to the proportion of bads in the attribute. High negative values therefore correspond to high risk, and high positive values correspond to low risk. Since an attribute’s number of points in the scorecard is proportional to its Weight of Evidence (see section Score Points Scaling below), the classing process determines how many points an attribute is worth relative to the other attributes of the same characteristic.

After classing has defined the attributes of a characteristic, the characteristic’s predictive power, i.e. its ability to separate high risks from low risks, can be assessed with the so-called Information Value measure. This will aid the selection of characteristics for inclusion in the scorecard. The Information Value is the weighted sum of the Weights of Evidence of the characteristic’s attributes. The sum is weighted by the difference between the proportion of goods and the proportion of bads in the respective attribute. The Information Value should be greater than 0.02 for a characteristic to be considered for inclusion in the scorecard. Information Values lower than 0.1 can be considered weak, lower than 0.3 medium and lower than 0.5 strong. If the Information Value is greater than 0.5, the characteristic may be over-predicting, meaning that it is in some form trivially related to the good/bad information.
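The two definitions above translate directly into code. The following Python sketch assumes that every attribute contains at least one good and one bad (otherwise the logarithm is undefined); the function names and the handling of values that fall exactly on a threshold are assumptions made for this illustration.

```python
import math


def weights_of_evidence(goods, bads):
    """WoE per attribute: ln(proportion of goods / proportion of bads).

    goods, bads: counts of good and bad applications per attribute.
    Assumes every count is positive.
    """
    total_good, total_bad = sum(goods), sum(bads)
    return [
        math.log((g / total_good) / (b / total_bad))
        for g, b in zip(goods, bads)
    ]


def information_value(goods, bads):
    """Sum of WoE values weighted by (proportion of goods - proportion of bads)."""
    total_good, total_bad = sum(goods), sum(bads)
    woes = weights_of_evidence(goods, bads)
    return sum(
        (g / total_good - b / total_bad) * woe
        for g, b, woe in zip(goods, bads, woes)
    )


def iv_strength(iv):
    """Map an Information Value to the bands described in the text."""
    if iv < 0.02:
        return "not considered"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv < 0.5:
        return "strong"
    return "possibly over-predicting"


# Invented counts for a characteristic with three attributes.
example_goods = [100, 300, 600]
example_bads = [300, 400, 300]
example_woes = weights_of_evidence(example_goods, example_bads)
example_iv = information_value(example_goods, example_bads)
```

In this invented example the first attribute holds 10% of the goods but 30% of the bads, so its Weight of Evidence is negative (high risk), while the third attribute holds 60% of the goods and 30% of the bads, giving a positive value of ln 2.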

There is no single criterion for when a grouping can be considered satisfactory. A linear or at least monotone increase or decrease of the Weights of Evidence is often what is desired in order for the scorecard to appear plausible. Some analysts would only ever include those characteristics where a sensible re-grouping can achieve this. Others may consider a smooth variation sufficiently plausible and would include a non-monotone characteristic such as ‘income’, where risk is high for both high and low incomes but low for medium incomes, provided the Information Value is high enough.
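Whether a grouping is monotone can be checked mechanically. The helper below is a minimal sketch with a hypothetical name; it takes the Weights of Evidence in attribute order and says nothing about whether a non-monotone grouping is still acceptable, which remains a judgment call as described above.

```python
def is_monotone(woes):
    """True if the WoE values never decrease or never increase across attributes."""
    pairs = list(zip(woes, woes[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


# Invented values: an age-like characteristic and an income-like one
# where both low and high incomes carry higher risk than medium incomes.
age_like = [-1.10, -0.29, 0.69]
income_like = [-0.50, 0.40, -0.20]
```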

Regression analysis

After the relative risk across attributes of the same characteristic has been quantified, a logistic regression analysis determines how to weigh the characteristics against each other. The Regression node receives one input variable for each characteristic. This variable contains as values the Weights of Evidence of the characteristic’s attributes (see table 1 for an example of Weight of Evidence coding). Note that Weight of Evidence coding is different from dummy variable coding, in that single attributes are not weighted against each other independently, but whole characteristics are, thereby preserving the relative risk structure of the attributes as determined in the classing stage.
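The difference between the two codings can be illustrated with a small Python sketch. The Weight of Evidence table, the attribute labels and the function names are all hypothetical; they are not the contents of table 1.

```python
# Hypothetical WoE values per attribute, as produced by the classing stage.
WOE_TABLE = {
    "age": {"<25": -1.10, "25-39": -0.29, "40+": 0.69},
    "income": {"low": -0.50, "medium": 0.40, "high": -0.20},
}


def woe_code(application, woe_table=WOE_TABLE):
    """One numeric input per characteristic: the WoE of the applicant's attribute."""
    return {
        characteristic: woe_table[characteristic][attribute]
        for characteristic, attribute in application.items()
    }


def dummy_code(application, woe_table=WOE_TABLE):
    """One 0/1 input per attribute, so each attribute gets its own coefficient."""
    return {
        f"{characteristic}={attribute}": int(application[characteristic] == attribute)
        for characteristic, attributes in woe_table.items()
        for attribute in attributes
    }


applicant = {"age": "25-39", "income": "medium"}
woe_inputs = woe_code(applicant)      # 2 regression inputs
dummy_inputs = dummy_code(applicant)  # 6 regression inputs
```

With Weight of Evidence coding the regression estimates a single coefficient per characteristic, so the ordering and spacing of the attributes within a characteristic stay as the classing stage fixed them.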

A variety of further selection methods (forward, backward, stepwise) can be used in the Regression node to eliminate redundant characteristics. In our case, we use a simple regression. In the following step, the resulting regression coefficients are multiplied by the Weights of Evidence of the attributes to form the basis for the score points in the scorecard.

Score points scaling

For each attribute, its Weight of Evidence and the regression coefficient of its characteristic could now be multiplied to give the score points of the attribute. An applicant’s total score would then be proportional to the logarithm of the predicted bad/good odds of that applicant. However, score points are commonly scaled linearly to take friendlier (integer) values and to conform with industry or company standards. We scale the points such that a total score of 600 points corresponds to good/bad odds of 50 to 1 and that an increase in the score of 20 points corresponds to a doubling of the good/bad odds. For the derivation of the scaling rule that transforms the score points of each attribute, see equations 3 and 4. The scaling rule is implemented in the Scorecard node (see Figure 1), where it can be easily parameterized. The resulting scorecard is output as a table in HTML and is shown in table 2. Note how the score points of the various characteristics cover different ranges. The score points develop smoothly and, with the exception of the ‘Income’ variable, also monotonically across the attributes.
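A commonly used form of such a linear scaling is score = offset + factor × ln(good/bad odds). The Python sketch below uses that form with the parameters stated above (600 points at odds of 50 to 1, 20 points to double the odds). It is not necessarily identical to the rule in equations 3 and 4: the function names, the even split of the intercept and offset across characteristics, and the assumption that the regression predicts the logarithm of the good/bad odds are all choices made for this illustration.

```python
import math


def scaling_parameters(target_score=600, target_odds=50, points_to_double=20):
    """Factor and offset such that score = offset + factor * ln(good/bad odds)."""
    factor = points_to_double / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    return factor, offset


def score_from_odds(good_bad_odds, factor, offset):
    """Scaled total score for the given good/bad odds."""
    return offset + factor * math.log(good_bad_odds)


def attribute_points(woe, coefficient, intercept, n_characteristics, factor, offset):
    """Scaled points of one attribute.

    Assumes the regression models ln(good/bad odds) as
    intercept + sum(coefficient * woe), and spreads the intercept and
    offset evenly over the characteristics.
    """
    share = intercept / n_characteristics
    return (coefficient * woe + share) * factor + offset / n_characteristics


factor, offset = scaling_parameters()          # roughly 28.85 and 487.12
score_at_50 = score_from_odds(50, factor, offset)    # 600 by construction
score_at_100 = score_from_odds(100, factor, offset)  # 20 points higher
```

In practice the attribute points would be rounded to integers before the scorecard is published, so the sum of the rounded points can differ slightly from the exact scaled score.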

Reject Inference

The application scoring models we have built so far, even though we have done everything correctly, still suffer from a fundamental bias. They have been built on a population that is structurally different from the population to which they are supposed to be applied. All the example applications in the development sample are applications that have been accepted by the old generic scorecard that has been in place during the last two years. This is so because only for those accepted applications is it possible to evaluate their performance and to define a good/bad variable. However, the through-the-door population that is supposed to be scored is composed of all applicants: those that would have been accepted and those that would have been rejected by the old scorecard. Note that this is only a problem for application scoring, not for behavioral scoring.

As a partial remedy to this fundamental bias, it is common practice to go through a process of reject inference. The idea of this approach is to score the data that is retained on the rejected applications with the model that is built on the accepted applications. The rejects are then classified as inferred goods or inferred bads and are added to the accepts data set that contains the actual goods and bads. This augmented data set then serves as the input data set of a second modeling run. In the case of a scorecard model, this involves the re-adjustment of the classing and the re-calculation of the regression coefficients.
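The reject inference steps can be sketched as follows. The text does not say how rejects are assigned to the inferred good or inferred bad group, so the hard cut-off rule used here is an assumption, as are the function names, the record layout and the toy scoring function.

```python
def infer_rejects(rejects, score_fn, cutoff):
    """Label each rejected application using the model built on the accepts.

    Assumption: a reject scoring at or above the cut-off is an inferred good,
    otherwise an inferred bad.
    """
    inferred = []
    for application in rejects:
        label = "good" if score_fn(application) >= cutoff else "bad"
        inferred.append({**application, "label": label, "inferred": True})
    return inferred


def augment(accepts, rejects, score_fn, cutoff):
    """Accepts with actual labels plus rejects with inferred labels."""
    known = [{**application, "inferred": False} for application in accepts]
    return known + infer_rejects(rejects, score_fn, cutoff)


# Toy data and a stand-in scoring function, invented for illustration.
accepts = [{"age": 45, "label": "good"}, {"age": 23, "label": "bad"}]
rejects = [{"age": 52}, {"age": 19}]
toy_score = lambda application: 400 + 5 * application["age"]

augmented = augment(accepts, rejects, toy_score, cutoff=600)
```

The augmented data set would then be passed through classing and regression again, as described above.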