Opportunities in Government Data Science
The Promise of Big Data
The growth of mobile devices and smart sensors, the move to cloud-based storage, the explosion of social media and Internet traffic – all these developments and more are creating new opportunities in big data analytics. Data science has suddenly become the “sexiest job of the 21st century.”
Whether they’re sexy or not, government data scientists are nonetheless needed to:
- Prevent waste, fraud and abuse
- Combat cyber-attacks and safeguard sensitive information
- Use business intelligence to make better financial decisions
- Improve defense systems and protect soldiers on the ground
To do this, they’ll have to sort through a complex mix of information management, storage and security systems. They’ll have to collect, process and analyze an enormous volume of data from a staggering variety of sources. And they’ll have to do it as soon as possible.
Taking the Initiative
The government made this clear in March 2012 when it announced the $200 million Big Data Research and Development Initiative. This wake-up call detailed the need for each agency to have a big data strategy and to improve its analytic tools and techniques.
- The Department of Defense began looking at autonomous systems that could learn from experience, maneuver and make decisions on their own.
- DARPA started the XDATA program, which aims to develop computational techniques and software tools for processing and visualizing imperfect and incomplete data in order to achieve greater battlefield awareness for many types of personnel, whether in planning or on missions.
- The Department of Energy established the Scalable Data Management, Analysis, and Visualization (SDAV) Institute, an effort to bring together the expertise of six national laboratories and seven universities to manage, analyze and visualize data on the department’s supercomputers.
- The U.S. Geological Survey (USGS) created Big Data for Earth System Science, an initiative to support scientists researching issues including climate change, earthquake recurrence and ecological indicators.
Show Me the Money
Though $200 million may seem like small change in government circles, it can yield big results. A 2013 MeriTalk survey found that federal IT experts believe big data could help the government free up nearly $500 billion per year.
Big vendors – those who can handle the government’s strict security demands – are throwing their weight behind the effort. In 2013 alone:
- IBM finalized a five-year, $30 million deal to provide cloud infrastructure for a new order management system with the U.S. General Services Administration (GSA)
- Microsoft announced that it had created the “Windows Azure U.S. Government Cloud.” This closely resembles its enterprise cloud solution, with additional security measures.
- Amazon Web Services won a $600 million contract to build a private cloud service inside the CIA’s data centers. (IBM challenged the deal.)
But you don’t have to be IBM to get a glimpse of government data. As we’ve seen, the Obama administration has been particularly interested in encouraging data research. As part of the first U.S. National Action Plan (September 2011), the government made over 390,000 agency data sets available for public consumption.
- The National Science Foundation has underwritten “EarthCube,” a project intended to provide free access internationally to Earth science data.
- The Institute of Medicine (IOM) and the U.S. Department of Health and Human Services (HHS) have launched the Health Data Initiative (HDI). The goal is to make health data more accessible in order to raise awareness of health issues and solutions to improve healthcare quality and outcomes.
- The U.S. Census Bureau has released an API that gives the public greater access to census data and demographic, socioeconomic and housing statistics.
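For readers who want a feel for what working with such an API involves, here is a minimal Python sketch. It only builds a query URL and reshapes a response; it does not contact the service. The base URL, the dataset path (acs/acs5), the variable code (B01001_001E) and the header-row-first response layout are assumptions based on the Census Bureau's public documentation and should be checked there. The sample rows are made-up placeholders, not real statistics.

```python
from urllib.parse import urlencode

# Assumed base URL for the Census Bureau's data API (verify before use).
BASE_URL = "https://api.census.gov/data"


def build_query_url(year, dataset, variables, geography):
    """Build a query URL from a year, dataset path, variable codes and geography."""
    query = urlencode(
        {"get": ",".join(variables), "for": geography},
        safe=",:*",  # keep commas, colons and wildcards readable in the URL
    )
    return f"{BASE_URL}/{year}/{dataset}?{query}"


def rows_to_records(payload):
    """Turn a list-of-lists response (header row first) into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


# Hypothetical query: total population estimate for every state.
url = build_query_url(2019, "acs/acs5", ["NAME", "B01001_001E"], "state:*")

# Placeholder response in the assumed layout; the values are NOT real data.
sample_payload = [
    ["NAME", "B01001_001E", "state"],
    ["Example State A", "1000", "00"],
    ["Example State B", "2000", "99"],
]
records = rows_to_records(sample_payload)

print(url)
print(records[0]["NAME"], records[0]["B01001_001E"])
```

Keeping URL construction separate from response parsing means either piece can be checked on its own before any real request is made.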
Healthy Data, Healthy Body
Health care agencies have it rough. Rapid breakthroughs in the delivery of care continue to swamp them with data. Recent health care reforms and ongoing changes to regulations have further increased the workload.
Still, you have to give them credit for trying:
- To examine the relationships between quality measures reported by providers and real-world patient outcomes, the Centers for Medicare & Medicaid Services (CMS) and the Health Services Advisory Group began collaborating with GNS Healthcare (and its big data tools) in 2013.
- To modernize its network and improve data storage and handling, the National Institutes of Health (NIH) launched the Big Data to Knowledge Initiative (BD2K) and began developing a new computing environment on the NIH campus called InfrastructurePlus.
- To cut down on Medicare fraud, CMS started in 2011 to construct algorithms that would target high-risk providers.
Through a Prism Darkly
Yet by far the most controversial federal data initiative concerns the National Security Agency.
In June 2013, the Guardian and the Washington Post broke a story on a top-secret program called Prism. According to information contained in an NSA PowerPoint presentation supplied by whistleblower Edward Snowden, the NSA has direct access to data in the systems of Google, Apple, Microsoft, Skype and other giant Internet companies.
Examples of these data include:
- Video and voice chat
- File transfers
- Social networking details
Prism allows for extensive, in-depth monitoring of live communications and stored information, including data from overseas. Privacy advocates have described it as a big step toward a police state.
In response, the U.S. government denied the charge that Prism can be used on domestic targets without a warrant. Officials also noted that it receives independent oversight from all three branches of the federal government.
Revelations about other surveillance and big data security projects – Boundless Informant, Bullrun, the British black-ops surveillance program Tempora – continued throughout the summer and fall of 2013.
Data Risks and Regulations
The End of Privacy?
Which brings us back to President Ford’s 1974 Privacy Act. For although the law asserts that agencies must follow certain principles – “fair information practices” – when gathering and handling personal data, it also allows law enforcement agencies to excuse themselves from the Act.
What’s more, the 2001 USA PATRIOT Act significantly increased the government’s powers of surveillance and investigation. Sweeping amendments were made to the:
- Wiretap Statute
- Electronic Communications Privacy Act
- Computer Fraud and Abuse Act
- Foreign Intelligence Surveillance Act
- Family Educational Rights and Privacy Act
- Pen Register and Trap and Trace Statute
- Money Laundering Act
- Immigration and Nationality Act
- Money Laundering Control Act
- Bank Secrecy Act
- Right to Financial Privacy Act
- Fair Credit Reporting Act
These amendments changed the rules governing voice mail communications, secret searches, surveillance orders, search warrants and a host of other law enforcement tools.
Amidst the debate about the NSA, it’s ironic to note the federal government is also responsible for legislation such as the:
- Health Insurance Portability and Accountability Act (HIPAA)
- Equal Credit Opportunity Act (ECOA)
- Telecommunications Act of 1996
Many of these laws are explicitly concerned with protecting the privacy of U.S. citizens and preventing businesses from misusing personal information. Industries including retail and manufacturing are already testing the limits on the uses of individual data for predictive and behavioral analytics.
As data gets bigger and boundaries grow blurrier, arguments in Congress may become much louder, indeed.
History of Data Science and Government
“I think the government is awakening to the idea that data science can provide models that have great utility for a variety of missions.” – Robert Hummel, from the 3rd Annual Government Big Data Forum
Governments have forever been poking their noses into the lives of their citizens:
- The Babylonians were busy doing headcounts and tutting over butter production as early as 3800 B.C.
- In A.D. 2, China’s Han Dynasty proudly recorded a population of 59.6 million.
- After overrunning the Anglo-Saxons, William the Conqueror commissioned a comprehensive look at his new territory in the Domesday Book (A.D. 1086).
But until the 19th century, collecting and recording government data was manual and extremely labor-intensive. It would take a minor revolution to shake things up.
The Father of Automatic Computation
In the late 1880s, Herman Hollerith submitted his Ph.D. thesis, An Electric Tabulating System, at New York’s Columbia University.
Hollerith’s invention – an electromechanical tabulating machine that employed electrical circuits to count and sort punch cards – was used to complete the 1890 U.S. census. A task that had been predicted to take more than ten years took less than three. The U.S. government saved $5 million.
Hollerith capitalized on his success by forming the Tabulating Machine Company, one of four companies that merged and eventually became International Business Machines Corporation (IBM).
During the Great Depression, IBM would be contracted by the U.S. government to keep employment records on 26 million working Americans and 3 million employers.
Victory or Death
When World War II arrived, the Western powers threw their might behind data intelligence projects:
- Late 1930s: A British officer named Major A.V. Kerrison developed one of the first fully automated anti-aircraft “predictors.” The Kerrison electromechanical analog computer was fast enough to be used in a low-altitude anti-aircraft role where high speed was needed to calculate ballistics in real time.
- 1942: Cryptography requires big data calculations. To help break Nazi spy codes, engineers at Britain’s famous Bletchley Park invented a series of mass data-processing machines. The first programmable electronic computer – Colossus – was able to read paper tape at 5,000 characters per second.
- 1943: Work on the top-secret Project PX began at the University of Pennsylvania. The Electronic Numerical Integrator And Computer (ENIAC) – a thirty-ton behemoth that covered 1,800 square feet – was originally created to compute artillery ballistic tables.
- 1945-1946: Mathematicians working at Los Alamos used ENIAC to run calculations for the hydrogen bomb.
War work had significant post-war applications. By the end of March 1950, ENIAC was able to generate the first models to forecast the weather.
The Russians Are Coming!
During the Cold War, the government desperately wanted to know what the Soviets were up to. And to do that, it needed data.
Formed in 1952, the National Security Agency (NSA) was a successor to the Armed Forces Security Agency. Its mission was to monitor, collect, decode and analyze foreign intelligence and counterintelligence data.
It quickly ran into a volume problem:
- In 1955, the U.S. had more than 2,000 round-the-clock listening positions sending 37 tons of intercept material to the NSA each month. Thirty million words of intercept were sent by teletype.
- In 1961, the NSA had more than 12,000 cryptologists on staff. On a monthly basis, it was attempting to file thousands of reels of analog magnetic tape (17,000 reels in July). It had also begun to collect and process signals intelligence with computers.
Giant supercomputers like the IBM 7950 (Harvest), a customized version of IBM’s Stretch, were installed to handle the flood. At its peak, Harvest could scan over 7 million decrypts to pinpoint over 7,000 key words in less than four hours.
The Privacy Act of 1974
It wasn’t all James Bond. Social scientists – long-time lovers of large data sets – were equally interested in the possibilities of big data.
For example, SAS, one of the world’s largest advanced analytics companies, began its life as a university research project for the U.S. Department of Agriculture. Its purpose was to analyze crop data and make recommendations to increase output.
But researchers were tired of rounding up information from scattered sources. In 1965 and 1966, the federal government looked at the possibility of a national data center that could centrally store information collected by various statistical agencies.
The outcry from citizens was loud and long. There were shouts of Orwell’s 1984. Congress stepped in, holding a series of sessions to discuss the effect of computerized databases on individuals’ privacy rights. Bills were presented. On New Year’s Eve, President Ford signed the Privacy Act of 1974 into law.
One Second vs. 30,000 Years
As the days of free love gave way to the wrath of punk, the government continued to gobble up data for research and defense. It was helped along in its work by the development of a new channel of communication.
The Internet, which had its roots in a Defense Advanced Research Projects Agency (DARPA) project called ARPANET, quickly became entrenched in universities and government research institutions.
In 1985, the U.S. National Science Foundation established the National Science Foundation Network (NSFNET), a hub that connected five supercomputing centers to the National Center for Atmospheric Research.
Then Tim Berners-Lee came up with the idea of setting information free in a World Wide Web. This democratic concept quickly took hold in the collective imagination. By the mid-1990s, there were hundreds, thousands, millions of new data streams crisscrossing the globe.
Which, of course, the government wanted to track:
“We are developing a supercomputer that will do more calculating in a second than a person with a hand-held calculator can do in 30,000 years,” President Clinton boasted in 1996.
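To put Clinton’s comparison in numbers, a quick back-of-the-envelope calculation helps. The sketch below assumes (this is not in the quote) that a person with a calculator completes one calculation per second and never rests. Under that assumption, the quote implies a machine performing roughly a trillion calculations per second.

```python
# Seconds in an average year, counting leap years (365.25 days).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def implied_ops_per_second(years, human_ops_per_second=1.0):
    """Calculations a machine must do in one second to match `years` of
    nonstop human work at `human_ops_per_second` (an assumed rate)."""
    return years * SECONDS_PER_YEAR * human_ops_per_second


rate = implied_ops_per_second(30_000)
print(f"{rate:.3e} calculations per second")  # prints 9.467e+11 calculations per second
```

A slower assumed human pace lowers the figure proportionally; at one calculation every ten seconds, the implied rate drops to about 9.5 x 10^10 per second.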
The Day the Towers Fell
In the wake of 9/11, the Defense Department – which had already been experimenting with large-scale data mining and analysis – stepped up its efforts.
In 2002, DARPA started to develop the Total Information Awareness System. This project would use biometrics, language processing, predictive modeling and database technologies to analyze government data sets. It would examine everything from communications to medical and travel records to identify suspicious individuals. Though the project was shuttered in 2003, many aspects of it migrated to other agencies.
The government was also quickly realizing that counterterrorism agencies needed – as a preschool teacher might put it politely – to learn how to share. In 2004, the 9/11 Commission suggested a “network-based information sharing system that transcends traditional government boundaries.”
Data for All
Agencies were aided in these efforts by an escalating interest in the potential of big data applications – from e-commerce to search engines to science. The lines between commerce, research and government became increasingly blurred.
- 2004: The CIA’s not-for-profit venture arm, In-Q-Tel, backed a new company called Palantir. Building on technology developed at PayPal to detect fraudulent activity, Palantir provided the Pentagon and CIA with sophisticated terrorism analysis software.
- 2009: As part of his Open Government Initiative, President Obama launched data.gov, increasing public and corporate access to thousands of data sets generated by the federal government.
- 2012: U.S. Secretary of State Hillary Clinton announced Data2X, a public-private partnership to collect statistics on the economic, political and social status of women and girls around the world.