AI for Fun & Profit: Using the new Genie Cognitive Computing Platform for P2P Lending

This tutorial uses the recently released Genie (an acronym for General Evolving Networked Intelligence Engine) platform to learn from P2P (peer-to-peer) loan data. Experts and non-experts alike can leverage Genie to analyze Big Data, recognize objects, events, and patterns, and more.



Deploying a Genie to a Bottle

Switch to the Genie Manager page.  Find the new genie in the “Your Genies” list, pick a bottle size (the "Developer" bottle is a lower-cost VM useful for testing hypotheses; the "Production" bottle has greater memory and processing capabilities suited to production deployments), and click “spawn”.


In about 2 minutes, the genie’s bottle will be ready.  When the bottle’s name turns green, click on it to go to its web interface.  Log into the bottle using your Genie Factory account and password.

Backtesting Platform

Click on the “Backtesting” link.  From “Upload Data Set”, upload each Lending Club .gdz file.  These are large files, so they may take some time to upload.


When the uploads finish, select the files we’d like to make available for the test.  Then, select the fraction of that data to test.  Since this is a very large data set, I chose to use 1% (0.01) of the available data.
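
For intuition, choosing 0.01 here is equivalent to randomly sampling 1% of the rows before each run.  A minimal sketch of that idea in Python, assuming the loan data were available locally as a CSV (the file name is hypothetical, and pandas merely stands in for whatever Genie does internally):

    import pandas as pd

    # Load the full data set (file name is hypothetical).
    loans = pd.read_csv("lending_club_loans.csv")

    # Randomly sample 1% of the rows, mirroring the 0.01 fraction chosen
    # in the Backtesting interface.  A fixed seed keeps it reproducible.
    sample = loans.sample(frac=0.01, random_state=42)

    print(f"Using {len(sample)} of {len(loans)} rows")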


The right-hand menu shows our current configuration.


Last, pick the “ingress” and “query” nodes.  Ingress nodes are the primitives that receive the incoming data.  Query nodes are the primitives from which we request answers, e.g., predictions and classifications.


Notice that P3 gets the data, but P2 is used to query the response. (Since the input topologies are identical, P2 and P3 will give exactly the same responses.  We could have used P3 as the query node instead of P2, but the point to make is that ingress nodes don't necessarily need to be query nodes, too.  If there's an abstraction, the query nodes are higher in the network hierarchy.)
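
To make the ingress/query distinction concrete, here's a toy sketch in Python (emphatically not the Genie API) of a two-level hierarchy: the lower node receives raw observations and passes a summary upward, while the node above it answers queries.

    class Primitive:
        """Toy stand-in for a Genie primitive (not the real API)."""
        def __init__(self, name, upstream=None):
            self.name = name
            self.upstream = upstream
            self.observations = []

        def observe(self, event):
            # Ingress role: receive data and pass a summary upward.
            self.observations.append(event)
            if self.upstream is not None:
                self.upstream.observe(("summary-of", event))

        def query(self):
            # Query role: answer from whatever this node has seen.
            return self.observations[-1] if self.observations else None

    # P3 is the ingress node; P2 sits above it in the hierarchy.
    p2 = Primitive("P2")
    p3 = Primitive("P3", upstream=p2)

    p3.observe({"loan_status": "Fully Paid"})   # data enters at P3
    print(p2.query())                           # but we ask P2 for the answer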

Click Run to start the test.  Our test will run 5 times; before each run, the genie's knowledge bases are completely erased.
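
Conceptually, the five runs are independent: nothing learned in one run carries into the next.  A toy Python sketch of that discipline, with a trivial memorization "model" standing in for the genie's knowledge bases:

    import random

    def run_backtest(data, n_runs=5, test_frac=0.2):
        results = []
        for _ in range(n_runs):
            model = {}                      # the "knowledge base": erased every run
            random.shuffle(data)
            cut = int(len(data) * (1 - test_frac))
            train, test = data[:cut], data[cut:]
            for features, label in train:   # stand-in "training": memorization
                model[features] = label
            correct = sum(model.get(f) == lbl for f, lbl in test)
            results.append(correct / len(test))
        return results

    # Tiny synthetic (feature, label) data set, purely illustrative.
    data = [(i % 7, "paid" if i % 7 < 5 else "default") for i in range(100)]
    print(run_backtest(data))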

This will take some hours to complete.  (While it's running, though, we can take a peek at the “testing phase” predictions from the “Live Interface” page.)

When testing is done, we're presented with a report.  The report has three sections.

The first section illustrates the input data used for the tests, separated into the training and testing data sets.  We can easily spot random skewing that will likely affect our results by looking at any differences between the training and testing classes.  In our test, there is about a 1.3% skew towards negative data in the testing set.
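
That skew figure is just the difference in class proportions between the two splits, which is easy to check by hand.  In the Python sketch below, the label counts are made up to mirror the ~1.3% figure; they are not the actual Lending Club counts:

    from collections import Counter

    def negative_rate(labels):
        return Counter(labels)["negative"] / len(labels)

    # Illustrative labels only.
    train_labels = ["positive"] * 800 + ["negative"] * 200   # 20.0% negative
    test_labels  = ["positive"] * 787 + ["negative"] * 213   # 21.3% negative

    skew = negative_rate(test_labels) - negative_rate(train_labels)
    print(f"Skew towards negative while testing: {skew:+.1%}")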


The second section is titled "Receiver Operating Characteristic".  The first of its charts plots the False Positive ("FP") Rates against the True Positive ("TP") Rates for each query node.  The second plots the False Negative ("FN") Rates against the True Negative ("TN") Rates for the same query nodes.  We use this section to quantify our best solutions.  The best performers are towards the top-left of both charts, and hovering over the dots provides more details.  If we're more interested in finding loans that are likely to pay back, use the FP vs. TP rates.  "P2" provides the best solution for this requirement.


If we're more interested in avoiding the loans that default, use the FN vs. TN rates.  "P1" provides the best solution for this requirement.
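
Both charts derive from the same four confusion-matrix counts.  Here's how the plotted rates are computed under the textbook definitions (Genie's exact definitions may differ, and the counts below are hypothetical):

    def rates(tp, fp, tn, fn):
        return {
            "TP rate": tp / (tp + fn),  # actual positives we caught
            "FP rate": fp / (fp + tn),  # actual negatives we flagged as positive
            "TN rate": tn / (tn + fp),  # actual negatives we correctly rejected
            "FN rate": fn / (fn + tp),  # actual positives we missed
        }

    # Hypothetical confusion-matrix counts for one query node.
    print(rates(tp=420, fp=90, tn=380, fn=110))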


The third section contains the details from each query node.  Let's look at P1's output.


The Fidelity pie chart shows the percentage of the test data this solution correctly predicted, got wrong, and was unable to predict.  The "Predicted Class Statistics" pie chart shows the number of positive, negative, and neutral predictions.  (The neutral predictions are the "unknowns" in the Fidelity chart.)
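
In terms of raw output, the Fidelity chart boils down to three counts over the test data.  A sketch, assuming each prediction is a class label or None when the genie can't decide (the pairs below are illustrative):

    # Illustrative (prediction, actual) pairs; None means "unable to predict".
    results = [
        ("paid", "paid"), ("default", "paid"), (None, "default"),
        ("paid", "paid"), ("default", "default"), (None, "paid"),
    ]

    total   = len(results)
    correct = sum(p == a for p, a in results if p is not None)
    wrong   = sum(p != a for p, a in results if p is not None)
    unknown = sum(p is None for p, _ in results)

    print(f"correct: {correct/total:.0%}, wrong: {wrong/total:.0%}, "
          f"unknown: {unknown/total:.0%}")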

The Progressive Model Statistics chart is useful for fine-tuning the number of predictions requested from the Cognitive Processor.  The more predictions returned, the more statistically meaningful the results become; fewer predictions are more specific to the observed data and better uncover the long tail of the underlying distribution.

Looking at our results, it seems the solution would benefit from reducing the number of returned predictions to around 5.  So, in our next iteration of this solution, we can change the "top n matches" primitive property from 100 to 5.
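
To see why n matters, here's a generic sketch of turning the top-n matches into a single prediction by majority vote.  It's a stand-in for the Cognitive Processor's internals, which aren't documented here, but it shows the trade-off: a large n averages over many loosely similar records, while a small n listens only to the closest matches.

    from collections import Counter

    def predict(matches, n):
        """Majority vote over the n best-scoring matches.

        `matches` is a list of (similarity_score, class_label) pairs,
        a generic stand-in for what a query node might return.
        """
        top_n = sorted(matches, key=lambda m: m[0], reverse=True)[:n]
        votes = Counter(label for _, label in top_n)
        return votes.most_common(1)[0][0]

    matches = [(0.99, "paid"), (0.97, "paid"), (0.96, "paid"),
               (0.95, "default"), (0.94, "paid"),
               (0.60, "default"), (0.55, "default"),
               (0.50, "default"), (0.45, "default")]

    print(predict(matches, n=5))   # "paid": driven by the closest matches
    print(predict(matches, n=9))   # "default": diluted by weaker matches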