Imperfect Intelligence, Part I – Garbage Data
Artificial Intelligence is here to make your life easier. It can recommend movies you will like, drive you around relatively safely, diagnose your medical ailments, and even predict what you’re going to buy before you know it yourself. In the right hands, AI is undoubtedly here to enhance your life. But what happens when we use AI to support big financial decisions? Could an AI solution turn down a mortgage application from a first-time buyer simply because they spent too wildly on Deliveroo in the previous month?
AI has a glaring Achilles heel: the data that feeds it only has the meaning we, humans, give it. Data cannot be objective. Rather, its significance is constructed by humans, fed into machine programs written by humans, and its results are subsequently interpreted by humans.
As AI and Machine Learning creep into more and more business decisions, we need to be mindful of their shortcomings and how to mitigate them. It is worth remembering the common refrain “garbage in, garbage out”: a computer algorithm cannot produce useful outputs if the data it is fed is biased. While financial services firms can employ algorithms to help with many of their primary functions, such as deciding whom to lend to and where to invest, those algorithms will not work unless they are trained on accurate, realistic and plentiful data.
I have narrowed it down to three key reasons why businesses end up creating ‘garbage’ data that prevents their AI programs from being objective:
Framing the Problem
Behind every solution a deep learning program produces, there must first be a problem. When machines are trained to produce a certain output, they are at the mercy of how that problem has been defined by the business.
Take, for example, an algorithm designed to increase the profits of a retail credit division. Framed in this manner, the solution is likely to identify that customers less able to repay debts on time are more immediately profitable – they accrue interest and late fees – and therefore to make negligent recommendations about product suitability: an outcome the business certainly wasn’t seeking.
The first step to creating a robust, useful and accurate AI solution is a clear and objective articulation of the business problem that considers the end customer benefit that is being aimed for. Without this, the final solution will be riddled with the biases and errors applied before the technical development even begins.
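To make the point concrete, here is a minimal sketch with entirely invented customers and numbers: the same three applicants, ranked under two different problem framings. An objective of short-term revenue puts the riskiest borrower first; an objective of repayment suitability puts the most reliable one first.

```python
# Toy illustration (hypothetical data): how the framing of the problem
# changes which customers an algorithm recommends.

customers = [
    # (name, monthly interest and fees paid, on-time repayment rate)
    ("A", 120.0, 0.55),  # pays late often, so accrues fees and interest
    ("B", 40.0, 0.98),
    ("C", 90.0, 0.70),
]

def short_term_profit(customer):
    _, interest, _ = customer
    return interest  # "maximise revenue" framing

def suitability(customer):
    _, _, repay_rate = customer
    return repay_rate  # "lend responsibly" framing

by_profit = sorted(customers, key=short_term_profit, reverse=True)
by_suitability = sorted(customers, key=suitability, reverse=True)

print([c[0] for c in by_profit])       # the riskiest borrower tops this list
print([c[0] for c in by_suitability])  # the most reliable borrower tops this one
```

Nothing in the data changed between the two rankings; only the stated objective did.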
Selecting the Data
The selection and collation of the data being interpreted by machines hugely influences the results. If the data set does not reflect reality, it will give you skewed results – or, worse still, reinforce existing biases and barriers.
Last year Amazon had to halt an AI recruiting tool that reviewed job applications to improve its talent identification process. As tech in general is a male-dominated field, the programme taught itself that male candidates were preferable and discriminated against female candidates. Any resume containing the word “women”, as in “women’s tennis team” or “women’s leadership group”, was automatically penalised.
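A toy sketch shows how this happens without anyone programming the bias in explicitly. The data below is entirely invented: a naive screening score is learned from historical hiring outcomes, and because the history is skewed, a token associated with under-represented candidates picks up a negative weight.

```python
from collections import Counter

# Toy sketch (invented data): a naive screening score learned from skewed
# historical hiring outcomes. No one wrote a rule against any group; the
# penalty emerges purely from the history the model is trained on.

history = [
    ("captain of chess club", True),
    ("chess club treasurer", True),
    ("women's chess club captain", False),  # historically overlooked
    ("debate team captain", True),
]

hired, rejected = Counter(), Counter()
for text, was_hired in history:
    for token in text.split():
        (hired if was_hired else rejected)[token] += 1

def score(resume: str) -> int:
    # tokens seen mostly in rejected applications drag the score down
    return sum(hired[t] - rejected[t] for t in resume.split())

print(score("chess club captain"))          # positive
print(score("women's chess club captain"))  # lower, purely from skewed history
```

The only difference between the two résumés is one word, yet the learned score treats it as a negative signal, because that is what the historical record taught it.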
Furthermore, the baseline for most statistical models is historical data, which provides the trends on which the model is trained. When there isn’t sufficient historical data, the output is inherently skewed: if there is nothing to compare your findings to, it is very hard to tell how accurate your model is.
Even if it seems you have the necessary quantity of data to train the machine, it’s important to scrutinise whether you have the right data to provide an accurate picture.
Most banks will be at the mercy of data points that have been captured by systems that weren’t originally designed to support a specific AI problem statement – potentially resulting in key information being neglected as an input into the algorithm.
Say you choose to investigate, using internal account and transaction data, the main reasons why customers are unable to make their mortgage repayments; it is plausible you will not have enough context to generate an accurate finding. You may have captured customers’ age, income and Deliveroo habits, but this might not give you the whole picture. Are those most likely to miss a mortgage repayment also caring for an elderly parent? Did they go on holiday and forget to pay their bills ahead of time? Have they had a change in relationship status?
Feature Selection
Feature selection is a key component of data mining, and has been described as the “art” of machine learning. Every data set is made up of different “attributes”, each of which must be judged significant or not before being ingested into a computer algorithm.
The problem arises when feature selection itself is subject to human bias, or when the model is trained on features that are ethically inappropriate. For example, a computer algorithm used in the US to help predict the likelihood of a criminal re-offending erroneously flagged black defendants as being twice as likely to break the law again as their white counterparts.
If an AI/ML program indicates that age is the most important factor in determining creditworthiness – because the older you are, the better you are at paying back loans – does that mean young people should be less eligible for a home loan?
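One common response is to exclude ethically inappropriate attributes before training. The sketch below (with a hypothetical feature set and field names of my own invention) shows what that looks like – and the comment flags the catch: dropping a column is not enough if another feature quietly acts as a proxy for it.

```python
# Sketch (hypothetical feature names): excluding ethically inappropriate
# attributes during feature selection. Note that removing a column does not
# remove its influence if other features correlate with it.

PROTECTED = {"age", "gender", "ethnicity"}

def select_features(record: dict) -> dict:
    """Return only the attributes deemed appropriate for the model."""
    return {k: v for k, v in record.items() if k not in PROTECTED}

applicant = {
    "age": 23,
    "income": 32_000,
    "years_at_address": 1,  # may still correlate strongly with age (a proxy)
    "missed_payments": 0,
}

print(select_features(applicant))
```

This is why feature selection is called an art: deciding which attributes are legitimate signals, and which are stand-ins for something you meant to exclude, is a human judgement call.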
Understanding Garbage Data
You don’t need to be a data scientist or computer programmer to understand that if the data used to feed an AI program is flawed and skewed by human bias, then whatever it tells us will be equally flawed and skewed. If we don’t account for these shortcomings, we can’t be objective – and we might just be reinforcing the very biases we seek to eliminate.
With good data, AI can be put to impressive use, and I’m not just talking about recommending movies on Netflix. Yet even with good data, algorithms can harbour hidden biases that continue to give us misleading results… More on that to come!