Marching into the future: Exploring the Power of Machine Learning in Insurance Pricing
Continuing the pricing sophistication journey, it is time to delve into the world of pure Machine Learning (ML) in insurance pricing. While this subject has been extensively discussed and scrutinised by prominent researchers and actuaries, it still carries a note of uncertainty. The black-box nature of these methods is an obvious elephant in the room for many companies, but perhaps needlessly so. In this blog post, I will attempt to dispel some of the mystery by addressing the following questions:
- What is Machine Learning?
- What ML estimators could be utilised for insurance pricing?
- How to build ML-based insurance pricing models?
- What are the advantages and possible drawbacks?
What is Machine Learning?
Contrary to the image of a repulsive creature straight from fairy-tale swamps, the definition of Machine Learning is way more approachable. According to Wikipedia1:
“ML is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks…”
And according to researchers’ Frankenstein, ChatGPT itself, ML is the following:
“Machine learning is the field of study and practice that enables computers to learn from data and make predictions or decisions without explicit programming.”
Machine Learning has become an integral part of various industries, including insurance, where it enables insurers to extract valuable insights from vast amounts of data and optimize their pricing strategies. By harnessing the power of algorithms and data, Machine Learning revolutionizes the way insurers understand risks, personalise policies, and provide more accurate and tailored pricing to their customers.
Machine Learning methods can serve many different purposes; however, due to the nature of insurance pricing, it is worth focusing on a few supervised Machine Learning algorithms, since actuaries usually deal with labelled data – meaning that each record, or data point, contains features and an associated label. Supervised Machine Learning deals with the following challenges:
- Classification - the problem of identifying which of a set of categories an observation belongs to, e.g. binary classification, where the target variable takes values 0 or 1.
- Regression - estimating the relationships between a dependent variable and one or more independent variables, e.g. house price estimates based on regional and economic data.
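To make the distinction concrete, here is a minimal sketch in plain Python; the records, features, thresholds, and coefficients are all made up for illustration:

```python
# Labelled data: each record contains features and an associated label.
policies = [
    {"driver_age": 22, "annual_km": 25_000, "had_claim": 1},   # classification label (0/1)
    {"driver_age": 45, "annual_km": 12_000, "had_claim": 0},
    {"driver_age": 31, "annual_km": 18_000, "had_claim": 0},
]

# Classification: predict which category (claim / no claim) a record belongs to.
def classify(record):
    return 1 if record["driver_age"] < 28 else 0

# Regression: estimate a continuous quantity, e.g. expected claim cost.
def predict_cost(record):
    return 50 + 0.01 * record["annual_km"]

print([classify(p) for p in policies])   # [1, 0, 0]
print(predict_cost(policies[0]))         # 300.0
```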
What ML estimators could be utilised for insurance pricing?
In the realm of insurance pricing, the ultimate goal of any modelling approach discussed in this or any other insurance blog post is the most accurate prediction of future events – be it claims severity and frequency, persistency rates, demand for policies, the propensity to pick a bundle over a single product, etc.
Considering the definitions mentioned earlier, and especially the Wikipedia one, one can see that the well-known Generalised Linear Models (GLMs) easily fit into this category. Personally, I support this view. It makes talking about and using ML across the insurance industry slightly less stressful if it is just an extension, a more sophisticated version, of something that actuaries and data scientists have been familiar with for quite some time.
While various estimators and statistical models fall into the category of Machine Learning, it is suggested to begin the modern Machine Learning journey (excluding GLMs) with a select few, based on extensive research by prominent actuarial experts2. These include:
- Tree-based estimators, which can be further divided into:
- Decision Tree
- Random Forest
- Gradient Boosting Machine
- Neural network
These methods have been extensively described on numerous occasions; however, it is still worthwhile to recap the key ideas behind them.
Decision Tree
This model is fairly straightforward to understand, especially when only conditional control statements are considered. One starts with a single question (test), e.g. “Is this driver older than 28 years of age?”, and creates a separate node for each possible answer (decision). As further questions are asked, the structure grows additional sets of nodes and starts to resemble a tree, hence the name Decision Tree.
In real business cases, Decision Trees are significantly more complex. However, the underlying idea remains the same. They have been commonly used across different fields, such as decision analysis, and have also emerged as one Machine Learning option for insurance pricing.
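The single-split idea can be sketched in a few lines of plain Python (a so-called decision stump; the ages and claim frequencies below are invented for illustration, and real pricing work would rely on a dedicated library):

```python
# A decision stump: one test ("is the driver older than 28?") and two leaf nodes.
# Each leaf predicts the mean claim frequency of the training records it receives.
data = [(19, 0.30), (23, 0.25), (27, 0.20),
        (35, 0.08), (41, 0.06), (55, 0.05)]   # (age, claim frequency)

SPLIT_AGE = 28
young = [f for age, f in data if age <= SPLIT_AGE]
older = [f for age, f in data if age > SPLIT_AGE]
leaf_young = sum(young) / len(young)   # ≈ 0.25
leaf_older = sum(older) / len(older)   # ≈ 0.063

def predict(age):
    return leaf_young if age <= SPLIT_AGE else leaf_older

print(round(predict(21), 4), round(predict(40), 4))
```

Asking more questions inside each leaf would grow further nodes, turning the stump into a full tree.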
Random Forest
Where a single Decision Tree could be considered too simple, an ensemble of Decision Trees can be used instead: a large number of trees, each trained on a random sample of the data (and typically a random subset of the features), whose predictions are combined to improve accuracy. This is exactly the idea behind the Random Forest.
For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned.
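A hedged, from-scratch sketch of the bagging idea behind a Random Forest, using single-split trees and made-up data; real implementations also randomise the features considered at each split:

```python
import random

# (driver age, observed claim frequency) – made-up training data
data = [(19, 0.30), (23, 0.25), (27, 0.20),
        (35, 0.08), (41, 0.06), (55, 0.05)]

def fit_stump(sample, split):
    """One tree of the forest: a single split with mean-value leaves."""
    left = [f for a, f in sample if a <= split] or [f for _, f in sample]
    right = [f for a, f in sample if a > split] or [f for _, f in sample]
    l, r = sum(left) / len(left), sum(right) / len(right)
    return lambda age: l if age <= split else r

random.seed(0)
forest = []
for _ in range(200):
    sample = [random.choice(data) for _ in data]   # bootstrap sample of the data
    split = random.choice([25, 30, 35, 40])        # randomised split candidate
    forest.append(fit_stump(sample, split))

# Regression: the forest returns the mean prediction of the individual trees.
def predict(age):
    return sum(tree(age) for tree in forest) / len(forest)

print(round(predict(21), 3), round(predict(50), 3))
```

For a classification task, the `sum(...) / len(...)` averaging would be replaced by a majority vote across the trees.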
Gradient Boosting Machines
The constant push for higher predictive accuracy was one of the possible reasons why, when around 1988 Kearns and Valiant asked, "Can a set of weak learners create a single strong learner?", the world did not have to wait long for an answer. In the late 1990s, Leo Breiman and Jerome Friedman were among the main pioneers who formulated and popularised the Gradient Boosting Machine approach.
It follows the ensemble idea introduced previously, but this time with the concept of "weak" and "strong" learners: weak learners are added one by one, each correcting the errors of the ensemble built so far, so that together they form a single strong learner.
They are the most common modern ML approach utilised in insurance pricing, usually via XGBoost or LightGBM frameworks.
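The iterative fitting of weak learners to the ensemble's residuals can be sketched as follows; this is a simplified illustration with invented data and a fixed split, not the actual XGBoost or LightGBM implementation:

```python
# Gradient boosting sketch: each weak learner (here, a one-split stump) is fitted
# to the residuals of the ensemble built so far, and added with a learning rate.
data = [(19, 0.30), (23, 0.25), (27, 0.20),
        (35, 0.08), (41, 0.06), (55, 0.05)]
ages = [a for a, _ in data]
targets = [f for _, f in data]

def fit_stump(xs, ys, split=28):
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    l = sum(left) / len(left) if left else 0.0
    r = sum(right) / len(right) if right else 0.0
    return lambda x: l if x <= split else r

learning_rate = 0.5
base = sum(targets) / len(targets)             # start from the overall mean
ensemble = [lambda x: base]

for _ in range(20):                            # boosting iterations
    current = [sum(m(a) for m in ensemble) for a in ages]
    residuals = [t - c for t, c in zip(targets, current)]
    stump = fit_stump(ages, residuals)         # weak learner fitted to residuals
    ensemble.append(lambda x, s=stump: learning_rate * s(x))

def predict(age):
    return sum(m(age) for m in ensemble)

print(round(predict(21), 3), round(predict(40), 3))
```

With each iteration the residuals shrink, so the ensemble's predictions converge towards the observed frequencies.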
Neural Networks
Observations from one field often encourage researchers to seek similar applications in other fields of study. The human body, mainly the brain and the logic behind the way it operates, was the inspiration for the artificial Neural Network approach, in which some properties of biological neural networks are simulated and utilised within predictive modelling.
The complex calculations are embedded within a set of input, hidden and output layers, where all the magic happens. Neural networks have been applied successfully in fields such as speech recognition and computer vision, and are now also being heavily tested in insurance pricing.
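A minimal illustration of those layers as a single forward pass; the architecture and weights below are hand-picked and untrained, purely to show the mechanics:

```python
import math

# A tiny feed-forward network: 2 inputs -> 2 hidden neurons -> 1 output.
# The weights are illustrative, not trained.
W_hidden = [[0.5, -0.3], [0.8, 0.1]]   # one row of weights per hidden neuron
b_hidden = [0.0, -0.2]
W_out = [1.2, -0.7]
b_out = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    # Hidden layer: weighted sum of the inputs passed through an activation.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    # Output layer: weighted sum of the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(W_out, hidden)) + b_out)

print(round(forward([0.4, 0.9]), 4))   # a value between 0 and 1
```

Training would adjust the weights by backpropagation; here only the layered structure is shown.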
More recently, the topic of Generative AI (Gen AI) has gained popularity through rapid developments in algorithms such as Generative Adversarial Networks (GANs) and the Large Language Models (LLMs) used by tools such as ChatGPT or DALL-E. These models learn the patterns and structure of the input data in order to generate new data with similar characteristics – photos, text, etc. Gen AI will be scrutinised further in the following articles.
How to build ML-based insurance pricing models?
In a nutshell – automatically. The ML pioneers mentioned previously not only had to come up with those brilliant ideas, but also implement them with very limited computational power at their disposal. Those days are luckily long gone, and the reader’s smartphone is most probably significantly more efficient than the tools they were using back in the 1990s.
In 2023, Machine Learning is still a very popular topic and has been widely discussed throughout the last decade or so. This extensive discussion and research means that the need to write programming code manually is slowly fading away. Tools like Quantee allow the user to take advantage of no-code Machine Learning capabilities, and this trend looks set to continue.
Personally, I started learning about Machine Learning using R, Python, or even Excel for simple estimators like Linear Regression. However, my colleagues and I tend to value time spent on analysis more than time spent polishing code. Regardless of the choice of estimator, the approach is fairly standard and similar to what has already been discussed in previous blog posts:
- Load, understand, and clean the data.
- Understand what variables would be used in the model – from this step onwards the process is usually iterative.
- Choose your desired estimator or group of estimators and perform Feature Engineering – fill in missing values and apply the necessary data transformations, e.g. one-hot encoding to reshape categorical variables into columns with a binary representation of each record’s value.
The last step smoothly brings up the topic of hyperparameter tuning, or hyperparameter optimisation. It is the process of choosing a set of optimal (best possible) settings for a particular learning algorithm. Current tools not only allow ML models to be trained and validated smoothly, but also support analysts in choosing the best set of hyperparameters to maximise the utility of the outcome.
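A minimal sketch of hyperparameter tuning as a grid search over a held-out validation set, using a single-split tree whose split age plays the role of the hyperparameter; all data is made up:

```python
# Grid search: try each candidate hyperparameter value, keep the one that
# minimises squared error on a held-out validation set.
train = [(19, 0.30), (23, 0.25), (27, 0.20),
         (35, 0.08), (41, 0.06), (55, 0.05)]
valid = [(21, 0.28), (30, 0.15), (48, 0.05)]

def fit_stump(data, split):
    left = [f for a, f in data if a <= split] or [0.0]
    right = [f for a, f in data if a > split] or [0.0]
    l, r = sum(left) / len(left), sum(right) / len(right)
    return lambda age: l if age <= split else r

def mse(model, data):
    return sum((model(a) - f) ** 2 for a, f in data) / len(data)

grid = [22, 28, 34, 40]                      # candidate hyperparameter values
scores = {s: mse(fit_stump(train, s), valid) for s in grid}
best_split = min(scores, key=scores.get)
print(best_split, round(scores[best_split], 5))
```

Real tuning jobs search over many hyperparameters at once (depth, learning rate, number of trees, etc.), often with smarter strategies than an exhaustive grid.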
What are the advantages and possible drawbacks?
The pros and cons of ML in insurance pricing can be summarised in the following categories:
- Performance
- Interpretability
- Feature importance analysis
- Building time
- Complexity
The comments below compare GBMs against GLMs: GLMs have long been the state-of-the-art technique in pricing, which makes them a worthy benchmark for one of the most popular modern ML solutions.
Performance
GBMs are usually more computationally intensive and take longer to train; however, they often deliver higher predictive accuracy.
The main drawback is that, when not built properly, they tend to show signs of overfitting – the model performs very well on the training data but rather poorly on the validation set. The adverse impact can be reduced with several techniques but not fully mitigated, so constant performance monitoring is crucial.
Modern ML methods usually require large volumes of data to perform properly, meaning they cannot be used in every single business case.
Interpretability
The interpretability of modern ML is still fairly low – if it exists at all – without the layer of Explainable AI (XAI). Tune in: more details on XAI will be provided in the following blog post!
Regulators and authorities require transparency in the financial sector, and the lack of it is the main reason why companies are still reluctant to deploy modern ML in production. More on this topic can be read in one of our previous blog posts.
Feature importance analysis
GBM provides feature importance measures, enabling insurers to identify the most influential factors in pricing. This helps in better understanding risk factors and creating more targeted pricing strategies.
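One simple, model-agnostic way to measure importance is permutation importance: shuffle one feature column and see how much the prediction error grows. A hedged plain-Python sketch with an invented toy model; libraries such as XGBoost also expose built-in importance measures:

```python
import random

# Toy model: claim frequency driven mostly by driver age, slightly by mileage.
def model(age, km):
    return 0.4 - 0.005 * age + 0.000002 * km

data = [(20 + i * 4, 10_000 + i * 1_500) for i in range(10)]
data = [(a, k, model(a, k)) for a, k in data]   # labels taken from the model

def mse(preds, labels):
    return sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(labels)

def permutation_importance(feature_index):
    """Shuffle one feature column and measure how much the error grows."""
    random.seed(42)                      # same permutation for every feature
    rows = [list(r) for r in data]
    column = [r[feature_index] for r in rows]
    random.shuffle(column)
    preds = []
    for r, v in zip(rows, column):
        r[feature_index] = v
        preds.append(model(r[0], r[1]))
    return mse(preds, [r[2] for r in data])

# Shuffling the age column should hurt far more than shuffling mileage.
print(permutation_importance(0), permutation_importance(1))
```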
Building time
Utilising modern ML algorithms requires a lot of expertise, which can lead to long building times. However, when equipped with modern tools, building the first ML models can be a matter of a few clicks and seconds.
Complexity
Modern ML can easily handle very complex non-linear relationships between variables and capture interactions within dataset.
GBMs are also fairly robust to outliers and missing values, which speeds up implementation by reducing the need for thorough data preprocessing.
Summary
This article introduces the idea of using modern Machine Learning models in insurance pricing and sets an informative basis for the upcoming follow-up blog posts, which will focus on methods that actuaries can implement to improve transparency and interpretability.
Sources