Data Science in the Pharmaceutical Industry

Opportunities in Pharmaceutical Data Science

The Promise of Big Data
For every 5,000 compounds starting in the laboratory, five are tested in humans and one makes it to market.

Moreover, developing each new drug takes approximately 10 years and costs an average of $2-3 billion. That adds up to a vast amount of molecular and clinical data stored in proprietary networks, just ripe for analytics.

Increased Collaboration

There has been a push in recent years to increase collaboration – both internally and with the outside world. To gain a competitive edge, increase their expertise and enlarge their ever-growing databanks, pharmas are now working with:

  • External Partners: These might include Contract Research Organizations (CROs) or data management companies. For example, in 2013, GlaxoSmithKline announced a partnership with SAS to provide a globally accessible private cloud where the pharmaceutical industry can securely collaborate around anonymized clinical trial information.
  • Academic Collaborators: To get a first look at compounds being developed outside of the company, Eli Lilly created the Phenotypic Drug Discovery Initiative. External researchers submit their compounds for screening and Lilly uses its proprietary tools and data to identify whether any of them have the potential to become drugs.
  • Customers and Health Professionals: Thanks to the explosive growth of social media, pharmas can personally reach out to their customers and physicians. They’re also conducting sentiment analysis of online physician communities, electronic medical records, and consumer-generated media to flag potential safety issues. These data can then be used to shape strategy throughout the pipeline progression.
  • Insurance Companies: By creating proprietary data networks where payors and providers can share, analyze and respond to outcomes and claims data, pharmas are able to enlarge their databanks far beyond clinical trials.

Predictive Analytics

The power to forecast the future has applications for drug discovery and avoiding negative outcomes.

In terms of drug discovery, pharmas spend a vast amount of money screening compounds to test in preclinical trials. To speed up the process, drug companies are using predictive models to search gargantuan virtual databases of molecular and clinical data. Analysts zoom in on likely drug candidates with the help of criteria based on chemical structure, diseases/targets and other characteristics.

For example, Numerate, which works with companies like Boehringer Ingelheim and Merck, designs its predictive models with specific drug targets and treatment goals in mind.
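To make the screening idea concrete, here is a minimal sketch of filtering a virtual compound library by target and by simple cutoffs. Everything in it is an assumption made for illustration: the Compound fields, the shortlist function, the cutoff values and the sample data are hypothetical and do not describe any company's actual models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """One row of a hypothetical virtual compound library."""
    name: str
    target: str                # biological target the compound was modeled against
    molecular_weight: float    # in daltons
    predicted_activity: float  # model score in [0, 1]; higher is more promising


def shortlist(compounds, target, max_weight=500.0, min_activity=0.7):
    """Return compounds for `target` that pass both cutoffs, best score first."""
    hits = [
        c for c in compounds
        if c.target == target
        and c.molecular_weight <= max_weight
        and c.predicted_activity >= min_activity
    ]
    return sorted(hits, key=lambda c: c.predicted_activity, reverse=True)


# Made-up sample data, for illustration only.
library = [
    Compound("cmpd-001", "kinase-A", 342.4, 0.91),
    Compound("cmpd-002", "kinase-A", 612.8, 0.95),    # fails the weight cutoff
    Compound("cmpd-003", "kinase-A", 288.3, 0.42),    # fails the activity cutoff
    Compound("cmpd-004", "protease-B", 401.5, 0.88),  # different target
    Compound("cmpd-005", "kinase-A", 455.0, 0.78),
]

for c in shortlist(library, "kinase-A"):
    print(c.name, c.predicted_activity)
```

In a real pipeline the activity score would come from a trained model rather than a stored column, and the criteria would be far richer than two thresholds.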

When it comes to avoiding negative outcomes, predictive modeling can be used to head off potential disasters, such as deaths linked to identifiable risk factors.

Predictive analytics can also be used to optimize clinical trials, for example by selecting suitable patients through genetic clustering, and to improve marketing efforts.

Crowdsourced Competitions

In recent years, pharmaceutical companies and institutions have sponsored crowdsourced contests to predict patient and clinical outcomes, sales patterns, molecule activity, and anything else involving big data.

More Effective Drug Trials

Data scientists can help to reduce the costs of clinical trials by enabling drug companies to implement:

  • Data-Based Patient Selection: Pharmas use multiple data sources – including social media and public health databases – and more targeted criteria (e.g., genetic information) to identify which populations would work best in trials.
  • Real-Time Monitoring: Companies now monitor real-time data from trials to identify safety or operational risks and nip problems in the bud.
  • Drug Safety Assurance: Data scientists can even tap into side-effect data to predict whether a compound will provoke an adverse reaction before it even reaches trial. Working with the University of California, San Francisco, researchers at Novartis have built computer models to do just that.
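As a rough illustration of the real-time monitoring item above, the sketch below flags trial sites whose share of patients with an adverse event exceeds a threshold. The flag_sites function, the 10% threshold and the sample data are all assumptions made for this example; an actual trial would rely on its own protocol-defined safety rules.

```python
from collections import defaultdict


def flag_sites(events, enrolled, max_rate=0.10):
    """Return {site_id: rate} for sites whose adverse-event rate exceeds max_rate.

    events   -- iterable of (site_id, patient_id) pairs, one per adverse event
    enrolled -- dict mapping site_id to the number of enrolled patients
    """
    affected = defaultdict(set)
    for site_id, patient_id in events:
        affected[site_id].add(patient_id)  # count each patient only once

    flagged = {}
    for site_id, n_enrolled in enrolled.items():
        if n_enrolled == 0:
            continue  # no patients yet, so no meaningful rate
        rate = len(affected[site_id]) / n_enrolled
        if rate > max_rate:
            flagged[site_id] = rate
    return flagged


# Made-up sample data, for illustration only.
enrolled = {"site-1": 40, "site-2": 25, "site-3": 50}
events = [
    ("site-1", "p01"), ("site-1", "p02"),
    ("site-2", "p10"), ("site-2", "p11"),
    ("site-2", "p12"), ("site-2", "p12"),  # repeat event, same patient
    ("site-3", "p30"),
]

flagged = flag_sites(events, enrolled)
print(flagged)
```

With this sample data only site-2 is flagged, because three of its 25 patients (12%) have reported an event.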

Targeted Marketing and Sales

Once upon a time, pharmaceutical companies would send their reps on lengthy doctors’ visits and invest in expensive, broad-scale product promotion.

In a March 2013 survey from Accenture, "Life in the New Normal: The Customer Engagement Revolution", respondents noted that around 25% of their pharmaceutical marketing was delivered over a digital platform and that 87% intended to increase their use of analytics to target spending and drive improved ROI. Six years later, in 2019, many pharmaceutical companies were planning to spend more than half of their budgets on digital marketing.

Some of that money is likely to go into monitoring doctors’ therapeutic tastes, geographic trends, peak prescription rates – anything that has a direct relevance to the sales cycle. This data then feeds into:

  • Predictive Analytics: Drug companies are employing predictive methods to determine which consumers and physicians are most likely to utilize a drug and create more targeted on-the-ground marketing efforts.
  • Sophisticated Sales: Pharmas are providing drug reps with mobile devices and real-time analytics on their prospects. Reps can then tailor their agenda to suit the physician. Afterward, the sales team can analyze the results to determine whether the approach was effective.

Better Patient Follow-Ups

With the development of miniature biosensors, sophisticated at-home devices, smart pills and bottles, smartphones and health apps, monitoring a patient’s health has never been easier. Pharmaceutical companies are increasingly interested in how the real-time data from these tools can be used to support R&D, analyze efficacy and increase drug sales.

In addition to knowing how their drugs are being used, companies also typically want to hear how customers view their products. Opinions about new drugs are often generated through patient/physician and patient/patient experiences in a way that creates messy, unstructured data sets.

However, if properly organized and analyzed, this data can be a rich trove of information on:

  • Patterns in drug-drug interactions
  • What drives patients to stop taking medications
  • Which patients will not stick to their prescriptions
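One simple way to approach the last item is to measure adherence from pharmacy refill records. The sketch below computes the proportion of days covered (PDC), a widely used adherence measure, for a single patient. The function name, the record layout and the sample dates are assumptions made for this example, and the 0.8 cutoff is a commonly cited convention rather than anything stated in this article.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, start, end):
    """Fraction of days in [start, end] on which the patient had medication.

    fills -- iterable of (fill_date, days_supply) pairs
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)  # overlapping fills are counted only once
    total_days = (end - start).days + 1
    return len(covered) / total_days


# Made-up refill history: two 30-day fills in a 90-day window.
fills = [(date(2020, 1, 1), 30), (date(2020, 2, 10), 30)]
pdc = proportion_of_days_covered(fills, date(2020, 1, 1), date(2020, 3, 30))

ADHERENCE_CUTOFF = 0.8  # assumed threshold for this sketch
print(round(pdc, 2), "non-adherent" if pdc < ADHERENCE_CUTOFF else "adherent")
```

A score like this only describes past behavior; predicting who will stop taking a drug would require a model trained on features such as these.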

Data Risks and Regulations

The Challenges Ahead

By the turn of the 21st century, the pharmaceutical industry had amassed vast quantities of structured and semi-structured data in separate, mutually inaccessible “silos”.

Unfortunately, it’s having trouble bringing those silos together. However analysts slice the numbers, the ultimate goal is to turn the data into information that can play a strategic business role. The prerequisite for that is making the data sources communicate with one another in a meaningful way.

What’s more, drug companies are coping with a flood of unstructured information – including social sentiment – that’s coming at them from outside sources. Integrating, manipulating, organizing, and interpreting this data to support some coherent course of action is causing more than a few headaches.

Working with the Fed

Data integration raises the issue of patient privacy. The more databases are shared among institutions, CROs, partners, software companies, etc., the more pharmas run the risk of exposing sensitive patient information to the eyes of those who shouldn’t be seeing it.

That means drug companies need to limit their risk of running afoul of federal laws and regulations. For example, HIPAA and its younger sibling, the 2009 HITECH Act, require covered entities to:

“Protect individuals’ health records and other identifiable health information by requiring appropriate safeguards to protect privacy, and setting limits and conditions on the uses and disclosures that may be made of such information without patient authorization.”

Data scientists should also be aware of the FDA’s Sentinel Initiative. This is a legally mandated electronic-surveillance system linking and analyzing health care data on millions of patients from multiple sources. Its purpose is to collect data on safety issues and enable regulators to take quick action.

History of Data Analysis & Pharma

In the late 19th century, Colonel Eli Lilly, founder of Eli Lilly and Company, had:

  • Become an independent drug manufacturer
  • Automated the creation of pills and capsules
  • Hired a permanent R&D staff
  • Instituted a raft of quality assurance measures

By the end of his life, he was a millionaire.

Lilly, Heinrich Emanuel Merck, Charles Pfizer, Friedrich Bayer, Edward Robinson Squibb – before they were brand names, they were pioneers of the pharmaceutical industry. They were also some of the first men to use data-driven methods to cut costs and improve products.

The Fog Of War

Fast-forward to the 20th century. Witness the rise of air travel, fortune cookies, Einstein’s Theory of Relativity, jazz – and the megalithic manufacturers we know today.

The shift from small independent companies to conglomerates got started in the 1940s. With so many lives on the line, World War II spurred intense collaboration between governments and pharmaceutical companies. One mind-boggling joint effort, involving the government, Merck, Pfizer and Squibb (among others), resulted in the mass production of penicillin.

The Golden Age of Development

By the time the troops came back from the war, pharma was becoming big business. The second half of the 20th century saw many pharmaceutical breakthroughs, including ibuprofen, the contraceptive pill and Valium, as well as the launch of the war on cancer. Advances in genetics – including automated protein and DNA sequencing – and psychiatric treatments opened new markets.

This was also the age when data began to make its mark.

Take Electronic Data Capture (EDC). In the years leading up to the 1970s, pharmaceutical companies had been receiving clinical research data on paper forms. This often resulted in data entry errors and delays.

  • To circumvent the problem, the Institute for Biological Research and Development (IBRD) formed an alliance with Abbott Pharmaceuticals.
  • Each clinical investigator would have access to a computer and be able to enter clinical data directly into the IBRD mainframe.
  • After cleaning up the data, IBRD supplied reports directly to Abbott.

The 1970s also saw the introduction of the Cambridge Structural Database (CSD) and the Protein Data Bank (PDB), as well as genetics-focused resources like the Staden Package for DNA sequences.

The Golden Age Goes Platinum

With the launch of Professor Norman Allinger’s Journal of Computational Chemistry in 1980, the first decade of computer-assisted drug development had begun. As Sean Ekins discusses in his book Computer Applications in Pharmaceutical Research and Development, scientists were now empowered to use computational chemistry programs on personal computers.

Software companies blossomed, anxious to provide drug companies with useful tools. Examples of their functions included:

  • Predictive Analytics: Based on statistical models, Dr. Kurt Enslein’s TOPKAT software could predict the toxicity of a molecule from its structural components.
  • 3D Molecular Modeling: Graphics software gave chemists the ability to view molecular structures in 3D and create virtual models.
  • Data Analysis: Statisticians used data management programs like SAS to analyze clinical data; computational chemists could now get, in minutes or hours, statistical analyses that previously took weeks or months.

Faced with a massive influx of data, Lilly became the first pharmaceutical company to purchase a supercomputer (the Cray-2), and many of its competitors soon followed.


The volume of information only increased with the advent of the World Wide Web and the new millennium:

  • Collaboration between internal departments and external research institutions became the norm.
  • Pharmaceutical companies started to market directly to consumers.
  • Demand for alternative medicines and nutritional supplements skyrocketed.
  • Software programs grew increasingly complex and sophisticated.
  • Discoveries in genetics and the sequencing of the genome generated new drugs and sources of revenue.

Effective data mining became critical to drug development.

Last updated: June 2020