Data Science in the Pharmaceutical Industry

Opportunities in Pharmaceutical Data Science

The Promise of Big Data

The modern pharmaceutical industry is used to dealing with big numbers – big profits, big losses, big data sets. As Omudhome Ogbru points out in Why Drugs Cost So Much:

  • Only one out of every 10,000 discovered compounds becomes an approved drug for sale
  • Only three out of every 20 approved drugs bring in enough revenue to cover developmental costs
  • Only one out of every three approved drugs can create enough income to cover the development costs of prior failures

Moreover, developing each new drug takes approximately 7-10 years and costs an average of $500 million. That adds up to a vast amount of molecular and clinical data stored in proprietary networks, ripe for analytics.

Nevertheless, it’s taken pharmas a little while to catch up to the possibilities of big data. Why? As Scott Evangelista, principal at Deloitte Consulting, notes in The Case for Advanced Analytics:

“They’ve been flush with cash… The more profitable companies are, the less they look for the pennies and the minor tweaks and twists that would boost efficiency and return on investment.”

Thanks to the impact of recent healthcare reforms, they’re beginning to look a lot harder.

Increased Collaboration

Though drug companies are notoriously guarded, there has been a push in recent years to increase collaboration – both internally and with the outside world. To gain a competitive edge, increase their expertise and enlarge their ever-growing databanks, pharmas are now working with:

  • External Partners: These might include Contract Research Organizations (CROs) or data management companies. For example, in 2013, GlaxoSmithKline announced a partnership with SAS to provide a globally accessible private cloud where the pharmaceutical industry can securely collaborate around anonymous clinical trial information.
  • Academic Collaborators: To get a first look at compounds being developed outside of the company, Eli Lilly created the Phenotypic Drug Discovery Initiative. External researchers submit their compounds for screening and Lilly uses its proprietary tools and data to identify whether any of them have the potential to become drugs.
  • Customers and Health Professionals: Thanks to the explosive growth of social media, pharmas can personally reach out to their customers and physicians. They’re also conducting sentiment analysis of online physician communities, electronic medical records, and consumer-generated media to flag potential safety issues. These data can then be used to shape strategy throughout the pipeline progression.
  • Insurance Companies: By creating proprietary data networks where payors and providers can share, analyze and respond to outcomes and claims data, pharmas are able to enlarge their databanks far beyond clinical trials.

Predictive Analytics

Over the past decade, predictive models have become increasingly intelligent, accepted and widespread. The power to forecast the future has applications for drug discovery and avoiding negative outcomes.

In terms of drug discovery, pharmas spend a vast amount of money screening compounds to test in preclinical trials. To speed up the process, drug companies are using predictive models to search gargantuan virtual databases of molecular and clinical data. Analysts zoom in on likely drug candidates with the help of criteria based on chemical structure, diseases/targets and other characteristics.

For example, Numerate, which works with companies like Boehringer Ingelheim and Merck, designs its predictive models with specific drug target and treatment goals in mind.
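
The criteria-based narrowing described above can be illustrated with a toy pre-filter. The sketch below is not how Numerate or any drug company builds its models; it swaps in Lipinski's rule of five, a widely known drug-likeness heuristic, as a stand-in for "criteria based on chemical structure". The Compound class, the compound names and every property value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def passes_rule_of_five(c: Compound) -> bool:
    """Lipinski-style screen: tolerate at most one violation."""
    violations = sum([
        c.molecular_weight > 500,
        c.log_p > 5,
        c.h_bond_donors > 5,
        c.h_bond_acceptors > 10,
    ])
    return violations <= 1


# Hypothetical virtual library; names and values are made up for illustration.
library = [
    Compound("cmpd-A", 320.4, 2.1, 2, 5),
    Compound("cmpd-B", 712.9, 6.3, 7, 12),
    Compound("cmpd-C", 498.0, 4.8, 1, 9),
]

candidates = [c.name for c in library if passes_rule_of_five(c)]
print(candidates)  # ['cmpd-A', 'cmpd-C']
```

A rule-based filter like this only discards unlikely candidates cheaply; a predictive model would typically be applied to whatever survives it.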

As for avoiding negative outcomes, predictive modeling can also be used to head off potential disasters, such as deaths linked to known risk factors.

In 2011, GNS Healthcare partnered with Brigham and Women’s Hospital in Boston to develop predictive models and tools, which will initially be used to circumvent adverse drug reactions and readmissions for patients with congestive heart failure.

Crowdsourced Competitions

In recent years, pharmaceutical companies and institutions have sponsored crowdsourced contests to predict patient and clinical outcomes, sales patterns, molecule activity, and anything else involving big data.

Examples of these include:

  • Heritage Health Prize (2012): Teams were challenged to develop an algorithm that uses available patient data to predict and prevent unnecessary hospitalizations.
  • Predicting a Biological Response (2012): Teaming up with Kaggle, Boehringer Ingelheim sponsored a competition to assess the likelihood of mutagenicity (the tendency to cause DNA damage), a key side effect to avoid in new drug development.
  • ALS Biomarker Prize (2006): A $1 million Grand Challenge designed to find a biomarker to measure the progression of ALS – also known as Lou Gehrig’s disease – in patients.

More Effective Drug Trials

Clinical trials can often be long, overpopulated and expensive. Data scientists can help to reduce these costs by enabling drug companies to implement:

  • Data-Based Patient Selection: Pharmas use multiple data sources – including social media and public health databases – and more targeted criteria (e.g., genetic information) to identify which populations would work best in trials.
  • Real-Time Monitoring: Companies now monitor real-time data from trials to identify safety or operational risks and nip problems in the bud.
  • Drug Safety Assurance: Data scientists can even tap into side-effect data to predict whether a compound will provoke an adverse reaction before it reaches trial. Working with the University of California, San Francisco, researchers at Novartis have built computer models to do just that.
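
As a rough illustration of the real-time monitoring idea in the list above, the sketch below scans a stream of trial observations and flags sites whose running adverse-event rate crosses a threshold. The event format, site names, threshold and minimum sample size are all assumptions made for the example, not values taken from any regulator or company.

```python
from collections import defaultdict

# Hypothetical event stream: (site_id, patient_id, had_adverse_event)
events = [
    ("site-1", "p01", False),
    ("site-1", "p02", False),
    ("site-1", "p03", True),
    ("site-1", "p04", False),
    ("site-2", "p05", True),
    ("site-2", "p06", True),
    ("site-2", "p07", False),
    ("site-2", "p08", True),
]

ALERT_RATE = 0.5   # assumed threshold, not a regulatory value
MIN_PATIENTS = 4   # avoid alerting on very small samples


def flag_sites(stream, alert_rate=ALERT_RATE, min_patients=MIN_PATIENTS):
    """Return sites whose running adverse-event rate reached alert_rate
    at any point after min_patients observations were seen."""
    totals = defaultdict(int)
    adverse = defaultdict(int)
    flagged = set()
    for site, _patient, had_event in stream:
        totals[site] += 1
        adverse[site] += int(had_event)
        if totals[site] >= min_patients and adverse[site] / totals[site] >= alert_rate:
            flagged.add(site)
    return sorted(flagged)


print(flag_sites(events))  # ['site-2']
```

Because the check runs as each observation arrives, a problem site can be flagged while the trial is still under way rather than at final analysis.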

Targeted Marketing and Sales

Once upon a time, pharmaceutical companies would send their reps on lengthy doctors’ visits and invest in expensive, broad-scale product promotion.

Those times are gone. In a March 2013 survey from Accenture, Life in the New Normal: The Customer Engagement Revolution, respondents noted that around 25% of their pharmaceutical marketing is delivered over a digital platform; 87% intend to increase their use of analytics to target spending and drive improved ROI.

Some of that money is likely to go into monitoring doctors’ therapeutic tastes, geographic trends, peak prescription rates – anything that has a direct relevance to the sales cycle. This data then feeds into:

  • Predictive Analytics: Drug companies are employing predictive methods to determine which consumers and physicians are most likely to utilize a drug and create more targeted on-the-ground marketing efforts.
  • Sophisticated Sales: Pharmas are providing drug reps with mobile devices and real-time analytics on their prospects. Reps can then tailor their agenda to suit the physician. Afterward, the sales team can analyze the results to determine whether the approach was effective.

Better Patient Follow-Ups

With the development of miniature biosensors, sophisticated at-home devices, smart pills and bottles, smartphones and health apps, monitoring a patient’s health has never been easier. Pharmaceutical companies are increasingly interested in how the real-time data from these tools can be used to support R&D, analyze efficacy and increase drug sales.

In addition to knowing how their drugs are being used, companies also want to hear how customers view their products. Opinions about new drugs are often generated through patient/physician and patient/patient experiences in a way that creates messy, unstructured data sets.

However, if properly organized and analyzed, this data can be a rich trove of information on:

  • Patterns in drug-drug interactions
  • What drives patients to stop taking medications
  • Which patients will not stick to their prescriptions
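
To give a feel for what "organizing" messy patient feedback can mean, here is a deliberately simple sketch that tags free-text posts with likely reasons for stopping a medication. The keyword lists and the sample posts are invented for illustration; real systems would more plausibly rely on curated medical vocabularies and trained language models than on keyword matching.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a production system would use a curated vocabulary.
REASON_KEYWORDS = {
    "side_effects": ["nausea", "dizzy", "rash", "headache"],
    "cost": ["expensive", "afford", "copay"],
    "forgetfulness": ["forgot", "forget", "missed"],
}


def tag_reasons(post: str) -> set:
    """Return every reason whose keywords appear as whole words in the post."""
    words = set(re.findall(r"[a-z]+", post.lower()))
    return {reason for reason, kws in REASON_KEYWORDS.items() if words & set(kws)}


# Made-up posts standing in for consumer-generated media.
posts = [
    "Stopped after a week, the nausea was unbearable.",
    "I can't afford the copay anymore so I quit.",
    "Kept getting a rash and honestly I forgot doses anyway.",
]

# Count how many posts mention each reason.
counts = Counter(reason for post in posts for reason in tag_reasons(post))
print(counts["side_effects"], counts["cost"], counts["forgetfulness"])  # 2 1 1
```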

Data Risks and Regulations

The Challenges Ahead

In a way, the pharmaceutical industry is a victim of its early success. By the time it reached the 21st century, it had amassed enormous quantities of structured and semi-structured data in separate, mutually inaccessible “silos”.

Unfortunately, it’s having trouble bringing those silos together. Writing for FierceBiotechIT, Lee Feigenbaum, Cambridge Semantics VP of Marketing, noted that the real issue centers around dealing with different types of datasets. However analysts slice the numbers, the ultimate goal is to turn the data into information that can play a strategic business role. The prerequisite for that is making the data sources communicate with one another in a meaningful way.

What’s more, drug companies are coping with a flood of unstructured information – including social sentiment – that’s coming at them from outside sources. Integrating, manipulating, organizing, and interpreting this data to support some coherent course of action is causing more than a few headaches.

Working with the Fed

Data integration raises the issue of patient privacy. The more databases are shared among institutions, CROs, partners, software companies, etc., the more pharmas run the risk of exposing sensitive patient information to the eyes of those who shouldn’t be seeing it.

That means drug companies need to reduce their risk of running afoul of federal laws and regulations. For example, HIPAA and its younger brother, the 2009 HITECH Act, clearly state that covered entities must:

“Protect individuals’ health records and other identifiable health information by requiring appropriate safeguards to protect privacy, and setting limits and conditions on the uses and disclosures that may be made of such information without patient authorization.”

This warning is reiterated in the 2013 McKinsey report, How Big Data Can Revolutionize Pharmaceutical R&D:

“Pharmaceutical enterprises must understand and mitigate the legal, regulatory, and intellectual-property risks associated with a more collaborative approach.”
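
One common way to limit the exposure described above is to strip direct identifiers and replace patient keys with pseudonyms before a dataset is shared. The sketch below shows the idea using Python's standard hmac and hashlib modules. The field names, the sample record and the key are hypothetical, and a step like this is nowhere near sufficient on its own to meet HIPAA requirements; it is only meant to make the concept concrete.

```python
import hashlib
import hmac

# Assumption: these field names are hypothetical; real schemas will differ.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "street_address"}


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace a patient ID with a keyed hash so records can still be linked."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap the patient ID for a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["patient_id"] = pseudonymize(record["patient_id"], secret_key)
    return cleaned


# Placeholder key; in practice it would be stored separately from the data.
key = b"replace-with-a-secret-kept-outside-the-dataset"

# Made-up record for illustration.
record = {
    "patient_id": "MRN-0042",
    "name": "Jane Doe",
    "email": "jane@example.com",
    "diagnosis": "congestive heart failure",
    "readmitted": True,
}

shared = scrub(record, key)
print(sorted(shared))  # ['diagnosis', 'patient_id', 'readmitted']
```

Using a keyed hash rather than a plain one means that a partner who receives the data cannot rebuild the mapping simply by hashing a list of known patient IDs, provided the key stays private.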

Data scientists should also be aware of the FDA’s Sentinel Initiative. This is a legally mandated electronic-surveillance system linking and analyzing health care data on millions of patients from multiple sources. Its purpose is to collect data on safety issues and enable regulators to take quick action.

History of Data Analysis & Pharma

“We try never to forget that medicine is for the people. It is not for the profits. The profits follow, and if we have remembered that, they have never failed to appear. The better we have remembered it, the larger they have been.” – George Wilhelm Merck

Summer, 1854 or thereabouts. A 16-year-old boy from Indiana hops on a train and travels 60 miles north to visit his Uncle Caleb and Aunt Hennie. On the dusty streets of Lafayette, he passes Good Samaritan Drugstore. He stops. He stares. He soaks in the smells. In that moment, he decides to become a pharmacist.

By the late 19th century, Colonel Eli Lilly had:

  • Become an independent drug manufacturer
  • Automated the creation of pills and capsules
  • Hired a permanent R&D staff
  • Instituted a raft of quality assurance measures

By the end of his life, he was a millionaire.

Lilly, Heinrich Emanuel Merck, Charles Pfizer, Friedrich Bayer, Edward Robinson Squibb – before they were brand names, they were pioneers of the pharmaceutical industry. They were also some of the first men to use data-driven methods to cut costs and improve products.

The Fog Of War

Fast-forward to the 20th century. Witness the rise of air travel, fortune cookies, Einstein’s Theory of Relativity, jazz – and the megalithic manufacturers we know today.

The shift from small independent companies to conglomerates got started in the 1940s. With so many lives on the line, World War II spurred intense collaboration between governments and pharmaceutical companies. One mind-boggling collaborative effort between the government, Merck, Pfizer and Squibb (among others) resulted in the mass production of penicillin.

The Golden Age of Development

By the time the boys (and girls) came back from the war, pharma was becoming big business. The second half of the 20th century saw the development of ibuprofen, paracetamol, the contraceptive pill, thalidomide, Valium and ACE inhibitors, as well as the war on cancer. Advances in genetics – including automated protein and DNA sequencing – and psychiatric treatments opened new markets.

This was also the age when data began to make its mark.

Take Electronic Data Capture (EDC). In the years leading up to the 1970s, pharmaceutical companies had been receiving clinical research data on paper forms. This often resulted in data entry errors and delays.

  • To circumvent the problem, the Institute for Biological Research and Development (IBRD) formed an alliance with Abbott Pharmaceuticals.
  • Each clinical investigator would have access to a computer and be able to enter clinical data directly into the IBRD mainframe.
  • After cleaning up the data, IBRD supplied reports directly to Abbott.

The 1970s also saw the introduction of the Cambridge Structural Database (CSD) and the Protein Data Bank (PDB), as well as genetics-focused resources like the Staden Package for DNA Sequences.

The Golden Age Goes Platinum

With the arrival of Professor Norman Allinger’s baby, the Journal of Computational Chemistry, in 1980, the first decade of computer-assisted drug development had begun. As Sean Ekins discusses in Computer Applications in Pharmaceutical Research and Development, scientists were now empowered to use computational chemistry programs on personal computers.

Software companies blossomed, anxious to provide drug companies with useful tools. Examples of their functions included:

  • Predictive Analytics: Based on statistical models, Dr. Kurt Enslein’s TOPKAT software could predict the toxicity of a molecule from its structural components.
  • 3D Molecular Modeling: Graphics software gave chemists the ability to view molecular structures in 3D and create virtual models.
  • Data Analysis: Statisticians used data management programs like SAS to analyze clinical data; computational chemists could now get, in minutes or hours, statistical analyses that previously took weeks or months.

Faced with a massive influx of data, Lilly became the first pharmaceutical company to purchase a supercomputer (the Cray-2), and many of its competitors soon followed.


The volume of information only increased with the advent of the World Wide Web and the new millennium:

  • Collaboration between internal departments and external research institutions became the norm.
  • Pharmaceutical companies started to market directly to consumers.
  • Demand for alternative medicines and nutritional supplements skyrocketed.
  • Software programs grew increasingly complex and sophisticated.
  • Discoveries in genetics and the sequencing of the genome generated new drugs and sources of revenue.

Effective data mining became critical to drug development.