Data Science in the Insurance Industry - Updated Applications

Opportunities in Insurance Data Science

The Promise of Big Data

Where it was once difficult to gather data about potential risks, today’s insurers have lots of data to work with.

As Matt Josefowicz noted at an insurance leadership forum, the traditional underwriting process “was designed for a world of information scarcity and is trying to adapt now to information super-abundance.”

On any given day, insurance data scientists may gather data from:

Telematics devices
Smart phones
Social media
CCTV footage
Electoral rolls
Credit reports
Website analytics
Government statistics
Satellite data

What’s more, the advent of cloud computing helps companies to aggregate and store it all.

These sources tell insurers far more than the historical data from policy administration systems, claims management applications, billing systems and the mortality reports of yesteryear ever did. Through a judicious analysis of big data, insurers can improve their pricing accuracy, create customized products and services, forge stronger customer relationships and facilitate more effective loss prevention.

AI-Driven Underwriting and Claims

The single biggest shift in insurance data science recently has been the move from traditional predictive models to generative and agentic AI, now used across underwriting, pricing, fraud detection and claims processing. Insurers began piloting agentic AI — systems that can carry out multi-step tasks with limited human input — in early 2025, and by mid-2025 it was in real-world use for claims triage and fraud flagging at a number of major carriers.

That speed has come with real controversy. In one closely watched case, Kisting-Leung v. Cigna, plaintiffs allege the insurer’s PxDx algorithm auto-denied more than 300,000 claims over a two-month span, averaging roughly 1.2 seconds of review per denial — barely enough time to read a single sentence, let alone a medical claim. UnitedHealthcare and Humana face similar lawsuits over algorithmic claims review. Regulators have responded quickly: the National Association of Insurance Commissioners’ 2023 model bulletin on AI has now been adopted in some form by more than half of U.S. states, and several states have gone further. California’s SB 1120 prohibits health insurers from denying coverage based solely on an AI algorithm, and Colorado now requires insurers to maintain detailed algorithm inventories and conduct regular bias testing. For data scientists entering the field, understanding this regulatory landscape is quickly becoming as important as understanding the modeling itself.
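To make the idea of "meaningful human review" concrete, here is a minimal sketch of a routing gate in which a model may fast-track approvals but can never issue a denial on its own. Everything in it is an assumption for illustration: the `ModelOutput` fields, the `route_claim` function and the 0.95 threshold are hypothetical, and the sketch is not a statement of what any statute or regulator actually requires.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"


@dataclass
class ModelOutput:
    """Hypothetical output of a claims-triage model."""
    claim_id: str
    recommend_approve: bool
    confidence: float  # model's confidence in its recommendation, 0.0 to 1.0


def route_claim(output: ModelOutput, approve_threshold: float = 0.95) -> Route:
    """Fast-track confident approvals; send everything else to a person.

    Note that there is deliberately no AUTO_DENY route: a recommended denial,
    however confident, always lands in front of a human reviewer.
    """
    if output.recommend_approve and output.confidence >= approve_threshold:
        return Route.AUTO_APPROVE
    return Route.HUMAN_REVIEW


if __name__ == "__main__":
    for out in (
        ModelOutput("A-1", recommend_approve=True, confidence=0.99),
        ModelOutput("A-2", recommend_approve=True, confidence=0.70),
        ModelOutput("A-3", recommend_approve=False, confidence=0.99),
    ):
        print(out.claim_id, route_claim(out).value)
```

The design choice worth noticing is the asymmetry: automation is allowed to speed up the outcome that favors the customer, while the adverse outcome always requires a human decision.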

Personalized Risk Pricing

To build detailed knowledge of each customer in the age of decentralization and the Internet, the insurance industry has turned to big data. Insurance data scientists combine analytical applications – e.g., behavioral models based on customer profile data – with a continuous stream of real-time data – e.g., satellite data, weather reports, vehicle sensors – to create detailed and personalized assessments of risk.

Auto Insurance

Picture a world in which wireless “telematics” devices transmit real-time driving data back to an insurance company.

Telematics-based insurance products have been around since the 1990s, when Progressive first launched them. But technology has come a long way in the intervening years. Telematics devices currently include embedded navigation systems (e.g., GM’s OnStar), on-board diagnostics (e.g., Progressive’s Snapshot) and smartphones.

These can be used to create personalized plans, which typically fall under one of two options:

PAYD: Pay-As-You-Drive
PHYD: Pay-How-You-Drive

PAYD is straightforward. It charges customers based on the number of miles or kilometers driven. Hollard Insurance, a South African insurer, has six mileage options.
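As a rough illustration of mileage-band pricing, the sketch below looks up a premium from the distance driven. The six bands and their prices are invented for the example and are not Hollard's (or any insurer's) actual options; `payd_premium` is a hypothetical helper.

```python
from bisect import bisect_left

# Hypothetical upper limits (km per year) for the first five bands; the sixth
# band is "anything above the last limit". All figures are made up.
BAND_LIMITS_KM = [5_000, 10_000, 15_000, 20_000, 30_000]
BAND_PREMIUMS = [300.0, 420.0, 540.0, 660.0, 840.0, 1_080.0]


def payd_premium(annual_km: float) -> float:
    """Return the annual premium for the mileage band containing annual_km."""
    if annual_km < 0:
        raise ValueError("annual_km must be non-negative")
    # bisect_left finds the first limit that is >= annual_km, so a driver who
    # lands exactly on a limit stays in the cheaper band.
    band = bisect_left(BAND_LIMITS_KM, annual_km)
    return BAND_PREMIUMS[band]


if __name__ == "__main__":
    for km in (4_000, 10_000, 45_000):
        print(f"{km:>6} km -> {payd_premium(km):.2f}")
```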

But PAYD does not take into account driving habits. PHYD plans use telematics to monitor a wide variety of factors – speed, acceleration, cornering, braking, lane changing, fuel consumption – as well as geo-location, date and time. If an accident occurs, the insurance company has the ability to recreate the situation.

Auto insurers can then provide customers with driving scores, ideas for improvement and individual pricing.
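One simple way to picture how a driving score might turn into a price is sketched below. It is only a sketch under stated assumptions: the `TripSummary` fields, the penalty weights and the 0.80x to 1.30x multiplier range are all hypothetical, and a real insurer would fit such weights to its own claims experience rather than pick them by hand.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TripSummary:
    """Per-trip aggregates a telematics device might report (hypothetical schema)."""
    km: float
    hard_brakes: int
    rapid_accelerations: int
    speeding_seconds: float
    night_km: float


# Invented penalty weights, for illustration only.
HARD_BRAKE_PENALTY = 4.0    # points per event per 100 km
RAPID_ACCEL_PENALTY = 3.0   # points per event per 100 km
SPEEDING_PENALTY = 2.0      # points per minute of speeding per 100 km
NIGHT_SHARE_PENALTY = 10.0  # points if every km were driven at night


def driving_score(trips: Iterable[TripSummary]) -> float:
    """Score a driver from 0 (worst) to 100 (best) over a set of trips."""
    trips = list(trips)
    total_km = sum(t.km for t in trips)
    if total_km <= 0:
        raise ValueError("need some distance driven to score a driver")
    per_100km = 100.0 / total_km  # normalizes event counts by exposure
    penalty = (
        HARD_BRAKE_PENALTY * sum(t.hard_brakes for t in trips) * per_100km
        + RAPID_ACCEL_PENALTY * sum(t.rapid_accelerations for t in trips) * per_100km
        + SPEEDING_PENALTY * sum(t.speeding_seconds for t in trips) / 60.0 * per_100km
        + NIGHT_SHARE_PENALTY * sum(t.night_km for t in trips) / total_km
    )
    return max(0.0, min(100.0, 100.0 - penalty))


def premium_multiplier(score: float) -> float:
    """Map a 0-100 score linearly onto a 1.30x (worst) to 0.80x (best) multiplier."""
    return 1.30 - 0.005 * score


if __name__ == "__main__":
    week = [TripSummary(60, 1, 0, 60, 10), TripSummary(40, 0, 0, 0, 0)]
    score = driving_score(week)
    print(f"score={score:.1f}, multiplier={premium_multiplier(score):.3f}")
```

Normalizing by distance matters here: without it, a careful driver who covers a lot of ground would collect more penalty points than a reckless one who rarely drives.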

Property Insurance

In a move similar to auto, property insurance companies are assessing how they can use telematics to create usage-based home insurance. Potential data sources include:

Moisture sensors that detect flooding or leaks
Utility and appliance usage records
Security cameras
Sensors that track occupancy

Combine this with information from outside sources (e.g., local crime reports and traffic) and you can arrive at a multi-faceted, comprehensive assessment of one person’s property claim risk.

Going a step further, these sources can be used to protect a customer. For example, with predictive analytics, insurers can calculate the likelihood of an event such as theft or a hurricane and take steps to avoid pain and suffering – as well as, of course, big claims. That calculation has gotten more urgent, not less: as climate-related disasters grow more frequent and severe, property insurers are leaning harder on data science to model wildfire, flood and hurricane risk, and in some of the highest-risk regions, that modeling has led insurers to pull back from writing new policies altogether.

Life and Health Insurance

We live in a monitored world. Life and health insurance companies know this more than anybody. To create profiles of customer health and develop individual “well-being” scores, insurers are now casting the information net very wide indeed. They can collect:

Transactional data – e.g., where and what (junk food?) customers buy
Body sensors – i.e., devices that monitor consumption or alert the wearer to early signs of illness
Exterior monitors – e.g., data from workout machines
Social media – e.g., tweets about one’s personal health or state of mind

For more details on big-data applications in this area, see our related profile of the Health Care Industry.

360-Degree Customer Profiles

Insurers aim to improve customer satisfaction, and they are employing big data to accomplish that. The more an insurer knows about its customers’ quirks, the theory goes, the easier it is to keep them happy – and paying premiums.

Companies are combining all their direct customer connections – e.g., email, call center, adjuster reports, etc. – with indirect sources – e.g., social media, blog comments, website and clickstream data – to create a 360-degree profile of each individual.

With a 360-degree profile in hand, insurers have the means to refine their approach to sales, marketing and existing customer service.
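In code, the core of a 360-degree profile is a merge of many channels onto one customer key. The sketch below assumes the hard part, matching records from different systems to the same person, has already been done and every record carries a shared `customer_id`. The function name and record fields are hypothetical.

```python
from collections import defaultdict


def build_profiles(sources):
    """Merge records from several channels into one profile per customer.

    `sources` maps a channel name (e.g. "email") to a list of records. Each
    record is a dict with at least 'customer_id' and an ISO-8601 'date' string.
    """
    profiles = defaultdict(lambda: {"channels": set(), "touchpoints": []})
    for channel, records in sources.items():
        for record in records:
            profile = profiles[record["customer_id"]]
            profile["channels"].add(channel)
            # Keep everything except the key, and remember where it came from.
            touchpoint = {k: v for k, v in record.items() if k != "customer_id"}
            touchpoint["channel"] = channel
            profile["touchpoints"].append(touchpoint)
    for profile in profiles.values():
        # ISO-8601 date strings sort chronologically as plain text.
        profile["touchpoints"].sort(key=lambda t: t["date"])
    return dict(profiles)


if __name__ == "__main__":
    sources = {
        "call_center": [{"customer_id": "c1", "date": "2024-03-02", "topic": "claim status"}],
        "clickstream": [{"customer_id": "c1", "date": "2024-03-01", "page": "/quote/home"}],
    }
    for customer, profile in build_profiles(sources).items():
        print(customer, sorted(profile["channels"]), profile["touchpoints"])
```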

Call Center Optimization

A call center is a seething cauldron of data. For insurance data scientists, it’s also a golden opportunity. These folks are investigating ways to:

Combine claims data with telecom data from CDRs to analyze call center activities and refine training guidelines.
Analyze raw telecom data, model temporal call patterns, and create a plan for staffing optimization.
Use sentiment analysis – e.g., speech analytics on call center conversations or Natural Language Processing (NLP) and text analytics on social media – to improve customer service.
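The staffing idea in the second item can be sketched in a few lines: average the call volume by hour of day, then convert volume into a headcount. The conversion used here is a deliberately crude workload rule (talk time divided by a target occupancy), and the handle time and occupancy figures are assumptions. A real workforce model would typically use a queueing formula such as Erlang C instead.

```python
import math
from collections import defaultdict
from datetime import datetime


def hourly_call_profile(call_starts):
    """Average number of calls per hour of day across the observed dates."""
    counts = defaultdict(int)  # (date, hour) -> number of calls
    dates = set()
    for ts in call_starts:
        counts[(ts.date(), ts.hour)] += 1
        dates.add(ts.date())
    profile = defaultdict(float)
    for (_, hour), n in counts.items():
        profile[hour] += n / len(dates)
    return dict(profile)


def agents_needed(calls_per_hour, avg_handle_seconds=300, target_occupancy=0.75):
    """Crude headcount: hours of talk time per hour, padded for idle time."""
    workload_hours = calls_per_hour * avg_handle_seconds / 3600
    return math.ceil(workload_hours / target_occupancy)


if __name__ == "__main__":
    starts = [
        datetime(2024, 1, 1, 9, 5), datetime(2024, 1, 1, 9, 30),
        datetime(2024, 1, 2, 9, 15), datetime(2024, 1, 2, 9, 45),
        datetime(2024, 1, 2, 14, 0),
    ]
    for hour, volume in sorted(hourly_call_profile(starts).items()):
        print(f"{hour:02d}:00  avg calls={volume:.1f}  agents={agents_needed(volume)}")
```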

Call-center employees are also in an ideal situation to sell customers additional products. One use of a 360-degree profile is to give that friendly voice on the phone the means to offer you the most relevant product for your particular needs.

Fraud Detection

Fraud costs insurance companies tens of billions of dollars each year. In response, insurers are marshaling their data resources and creating a multi-channel approach to fraud detection. They’re taking a very close look at both traditional structured data (such as claims and policy data), and textual data (such as adjuster notes, police reports and social media).

Using…

Text analytics
Predictive analytics
Behavioral analytics
Pattern, graph and link analysis techniques

… not to mention a host of other handy tools, data scientists are cracking down on suspicious claims.
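Link analysis is the easiest of these to show in miniature. The sketch below groups claims that share a phone number, address or bank account, directly or through a chain of other claims, using a union-find structure. The field names, the `linked_claim_groups` helper and the minimum group size are assumptions for illustration. A shared detail is a reason for a closer look, not proof of fraud.

```python
from collections import defaultdict


def linked_claim_groups(claims, keys=("phone", "address", "bank_account"), min_size=3):
    """Return sets of claim IDs connected through shared attribute values."""
    parent = {c["claim_id"]: c["claim_id"] for c in claims}

    def find(x):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (attribute, value) -> first claim that used it
    for claim in claims:
        for key in keys:
            value = claim.get(key)
            if not value:
                continue  # missing values should never link claims together
            if (key, value) in first_seen:
                union(first_seen[(key, value)], claim["claim_id"])
            else:
                first_seen[(key, value)] = claim["claim_id"]

    groups = defaultdict(set)
    for claim_id in parent:
        groups[find(claim_id)].add(claim_id)
    return [group for group in groups.values() if len(group) >= min_size]


if __name__ == "__main__":
    claims = [
        {"claim_id": "C1", "phone": "555-0100"},
        {"claim_id": "C2", "phone": "555-0100", "address": "1 Elm St"},
        {"claim_id": "C3", "address": "1 Elm St"},
        {"claim_id": "C4", "phone": "555-0199"},
    ]
    print(linked_claim_groups(claims))
```

In the sample data, C1 and C3 share nothing directly, yet they end up in the same group because each is linked to C2. That transitive reach is what separates link analysis from simple duplicate checks.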

Data Risks and Regulations

The Challenges Ahead

Insurance companies still have a few hurdles to cross before they can become fully data-driven. Some of those hurdles are already apparent to the industry. They include:

Siloed data that is challenging to synthesize
Unstructured data
Outdated fraud detection technology that cannot keep pace with today’s level and type of fraud

Elderly Infrastructures

Big companies have their own issues. Some deal with creaky IT infrastructures that are not equipped to handle the volume, velocity or variety of data that are streaming through their doors.

Skill Shortage

Big data can be used to solve many problems, but only if you have employees who are trained to ask the right questions.

And many insurance companies don’t. The insurance industry is replete with statistical ability, however, so it’s only a matter of time before the supply of analytics skills catches up to the demand.

Customer Privacy

But perhaps the most complicated issue centers on a customer’s right to privacy. The Finance Industry in general is subject to a host of federal and state regulations that were enacted to protect consumer privacy and avoid discriminatory practices. These have been joined by a series of stringent rules on data collection – all of which an insurance legal department must be aware of.

Just as importantly, insurance companies may need to think about how they treat customer information. It’s all very well to imagine a world run by telematics, but many consumers are rightly afraid of ceding their personal data to a private company. Even the lure of more affordable premiums may not be enough to change their mind.

Insurance data scientists also have to be very careful they’re not mistakenly assuming the role of Big Brother – whether benevolent or not. Despite the hype, not even big data can tell you everything about a person.

As an early example, I’ll leave you with the cautionary tale of Quebec’s Nathalie Blanchard. In 2009, Blanchard went on disability leave due to a case of severe depression. One day, she went to the bank and discovered her health insurance benefits had been terminated. The reason? Her insurance company, while trawling for data, had captured smiling photos on her Facebook page and decided she wasn’t depressed enough to be disabled.

That was one person and one insurer. The Cigna case described above shows the same underlying risk — an algorithm substituting for genuine human judgment about a person’s circumstances — now happening at a scale of hundreds of thousands of claims at a time, which is exactly why regulators have moved as quickly as they have to require meaningful human review of AI-assisted decisions.

History of Data Analysis and Insurance

“Most problems have either many answers or no answer. Only a few problems have a single answer.”

– Edmund C. Berkeley, Right Answers – A Short Guide for Obtaining Them (September 1969)

Insurance has always been a numbers game. What are the odds of a ship sinking? Of the head of the household dying prematurely? Of a wooden house burning down? Since the third millennium B.C., humans have been trying to protect themselves from the risks of living.

Keeping track of risks means knowing the numbers – the data. Increasingly sophisticated techniques were added over time to better calculate the odds. Three and a half centuries ago, “knowing the numbers” was maturing into the mathematics of risk – actuarial science – one of the foundations of modern data analysis.

The Birth of Actuarial Science

In the late 17th century, demand for long-term insurance (e.g., burial, life and annuities) was becoming hard to ignore.

Insurance companies were happy to offer citizens these products, but they were faced with a variety of statistical conundrums in understanding their data:

What was the likelihood of an insurance-holder dying within a certain time frame?
How should insurers price their products?
What percentage of premiums should they set aside to pay for future benefits (e.g., annuities)?
How much could they afford to invest elsewhere? What would the rate of interest be?

Graunt’s Table and Halley’s Annuities

Fortunately, mathematics had reached a point where it was ready to provide the answers. In 1662, John Graunt, a London haberdasher, conducted a study of mortality rolls in the city.

In his analysis, he found predictable patterns of longevity and death rates in groups of people of the same age. This gave him the means to calculate the probabilities of survival. His work formed the nucleus of the first “life table.”

Thirty years later, in 1693, Edmond Halley took a break from calculating the orbits of comets and descending to the bottom of the Thames in a diving bell to publish an article on life annuities.

Using accurate demographic data from Breslau, a city in Silesia, Halley produced a life table of the population, organized by age and survival. From this, he was able to calculate the premium amount that any man or woman, at each year of age, should pay in order to purchase a life annuity. From this time on, actuarial data multiplied.
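The arithmetic behind a life table is simple enough to sketch. If l(x) is the number of people in the table still alive at age x, the chance that someone aged x survives t more years is l(x+t) / l(x), and the fair price of an annuity paying 1 at the end of each year survived is the sum of those chances, each discounted by (1 + i)^t for an interest rate i. The toy table and the 5% rate below are invented for illustration. They are not Halley's Breslau figures.

```python
def survival_probability(lives, age, years):
    """Probability that a person alive at `age` is still alive `years` later."""
    return lives[age + years] / lives[age]


def annuity_value(lives, age, interest_rate):
    """Present value of 1 paid at the end of every year the annuitant survives.

    `lives` maps each age to the number of people alive at that age; the
    table must run without gaps from `age` up to the age where nobody is left.
    """
    discount = 1.0 / (1.0 + interest_rate)
    total = 0.0
    years = 1
    while age + years in lives:
        total += discount ** years * survival_probability(lives, age, years)
        years += 1
    return total


if __name__ == "__main__":
    # Hypothetical, heavily shortened life table.
    toy_table = {60: 1000, 61: 900, 62: 700, 63: 400, 64: 0}
    print("P(survive 2 years from 60):", survival_probability(toy_table, 60, 2))
    print("Annuity value at 5% interest:", round(annuity_value(toy_table, 60, 0.05), 4))
```

With no interest at all, the annuity value is just the expected number of payments. Raising the interest rate lowers the price, because money set aside today has longer to grow before each payment falls due.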

The Father of the Computer and His Descendants

Over the next few centuries, as the data accumulated, actuarial science grew both in popularity and in the complexity of its calculations. It’s no surprise that Charles Babbage, father of the computer, found time to dabble in it.

During the 1820s, he created actuarial tables from Equitable Society mortality data and published a handy guide to the life insurance industry titled A Comparative View of the Various Institutions for the Assurance of Lives.

But it was with the adoption of punch-card tabulating machines and, subsequently, early computer technology that the insurance industry began the march towards data dominance.

During the late 1930s, Edmund Berkeley of the Prudential Insurance Company began to investigate the potential of shifting work to calculating machines, and, later, computers.

The Post-War Push

Regarded by his colleagues as equal parts nut and genius, Berkeley was a pioneer in computing and data processing. In 1947, he prodded Prudential to purchase one of the first UNIVAC computers from the Eckert-Mauchly Computer Corporation.

Computers were arriving at an interesting time: World War II had recently ended and the Baby Boom was just beginning.

As Joanne Yates points out, in the years between 1948 and 1953:

The number of insurance policies in force rose over 24%
Total employment in the life insurance industry grew almost 14%

Large insurance firms moved fairly quickly. In The Digital Hand: Volume II: How Computers Changed the Work of American Financial, Telecommunications, Media, and Entertainment Industries, James Cortada notes that by the end of 1955, there were over 20 mainframe systems installed in the industry.

Data Consolidation

The next big shift came in the late 1960s and 1970s. More powerful machines and better software were coming into play. Online systems allowed workers to share information freely and conduct inquiries in real time. Investment in technology increased steadily.

By the 1980s, the insurance industry was on top of IT trends.

The Industry Goes Ballistic

The arrival of the Internet in the 1990s reshaped the insurance market, and insurance data science along with it:

Individuals were able to bypass intermediaries and shop for coverage on their own terms.
Company and consumer websites sprang up to satisfy demand.
Banks seized the opportunity to expand into the industry.

As a consequence, the amount of customer data being gathered and exchanged exploded.

At the same time, the costs of data processing and storage were dropping rapidly. In lieu of the mass modeling of the past, insurers were gaining the capabilities (and the technical tools) to calculate risk on an individual level. The era of big data was just around the corner and, three decades later, the era of AI-driven insurance has followed close behind it.