Data engineers build massive reservoirs for big data. They develop, construct, test and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to – and from – these huge “pools” of filtered information, data scientists can pull relevant data sets for their analyses.
Data Engineer Responsibilities
In their role as a hardcore builder, a data engineer may be required to:
- Design, construct, install, test and maintain highly scalable data management systems
- Ensure systems meet business requirements and industry practices
- Build high-performance algorithms, prototypes, predictive models and proofs of concept
- Research opportunities for data acquisition and new uses for existing data
- Develop data set processes for data modeling, mining and production
- Integrate new data management technologies and software engineering tools into existing structures
- Create custom software components (e.g. specialized UDFs) and analytics applications
- Employ a variety of languages and tools (e.g. scripting languages) to marry systems together
- Install and update disaster recovery procedures
- Recommend ways to improve data reliability, efficiency and quality
- Collaborate with data architects, modelers and IT team members on project goals
Data engineers may work closely with data architects (to determine what data management systems are appropriate) and data scientists (to determine which data are needed for analysis). They often wrestle with problems associated with database integration and messy, unstructured data sets. Their ultimate aim is to provide clean, usable data to whoever may require it.
How to Become a Data Engineer
1. Pursue higher education degrees in computer science, engineering, applied mathematics, physics or a related field.
You will need a bachelor’s degree in computer science, software/computer engineering, applied math, physics, statistics or a related field, plus a lot of real-world skills, to qualify for most entry-level positions. You may also consider a master’s degree in computer engineering or computer science to fine-tune your skills and expand your knowledge.
Is a master’s required? It depends on the job. Some employers are willing to accept relevant work experience and proof of technical expertise in lieu of a higher degree.
2. Fine-tune your analysis, computer engineering and big data skills.
Technical Skills for Data Engineers
- Statistical analysis and modeling
- Database architectures
- Hadoop-based technologies (e.g. MapReduce, Hive and Pig)
- SQL-based technologies (e.g. PostgreSQL and MySQL)
- NoSQL technologies (e.g. Cassandra and MongoDB)
- Data modeling tools (e.g. ERWin, Enterprise Architect and Visio)
- Python, C/C++, Java, Perl
- MATLAB, SAS, R
- Data warehousing solutions
- Predictive modeling, NLP and text analysis
- Machine learning
- Data mining
- UNIX, Linux, Solaris and MS Windows
Since new data management technologies are appearing every day, this list is subject to change.
Business Skills for Data Engineers
- Creative Problem-Solving: Approaching data organization challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
- Effective Collaboration: Carefully listening to management, data scientists and data architects to establish their needs.
- Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve data management problems.
- Industry Knowledge: Understanding the way your chosen industry functions and how data can be collected, analyzed and utilized; maintaining flexibility in the face of big data developments.
3. Consider pursuing additional professional engineering or big data certifications.
If you’re interested in buffing up specific skills, you’ll find a lot of vendor-specific certifications (e.g. Oracle, Microsoft, IBM and Cloudera). To determine which ones are worth your investment, ask your mentors for advice, and examine recent job descriptions for ideas.
Developed by the Data Management Association International (DAMA), the CDMP is a solid, all-round credential for general database professionals. Many employers will recognize the acronym on your résumé.
The CDMP is offered at four levels – associate, practitioner, master and fellow – and awarded to candidates who provide evidence of education, experience and passing results on the CDMP’s professional knowledge exam. Proof of continuing education and professional activity is required to re-certify.
Cloudera awards its certification to candidates who pass a hands-on exam demonstrating skills in data ingestion; transformation, staging and storage of data; analysis of data in Parquet, Avro, JSON and other formats; and the development of linear and branching workflows, among other abilities.
After passing a two-hour exam, Google Cloud data engineers can earn this credential and recognition. Google awards the certification to data engineers who can demonstrate their ability to build and maintain data structures, analyze data, design processing systems, and incorporate security, compliance and reliability into processes and structures.
An Interview with a Real Data Engineer
David Bianco has made a career of building geospatial data pipelines. He began his career first with Esri’s ArcGIS Online, then with Booz Allen Hamilton, and now with Urthecast. In 2014, he completed a two-month fellowship with INSIGHT Data Engineering. He is based in San Francisco, CA, and you can find him online at twitter.com/talldave.
A: One of the primary facets of Urthecast’s business is to provide our data for others to consume in a variety of ways, whether their focus involves science, government, education, or business. In essence, we provide data for others to analyze and build upon. Thus, the impact of our data engineers is extremely important. Our success lies in both the quality and the quantity of what we can offer. Because of this, we have data engineers working in a variety of capacities along the entire data pipeline – from working with the raw data, perfecting it with geospatial raster processes (georeferencing, orthorectification, and mosaicing), all the way to building APIs for developer access.
In addition, as with most companies, we manage data beyond our core product. For example, we analyze log files and customer-use patterns. All companies benefit from this knowledge, which turns into useful business metrics.
A: The skills and tools that are utilized on the job are highly dependent on which part of the data pipeline you focus on. For myself, I’m at the tail end of the pipeline building APIs for data consumption, integrating external datasets, and analyzing how our data is used to further improve our end product.
With APIs, I really feel web languages are sufficiently robust, so it’s not as important which one you choose as long as it is embraced as a common language amongst your team. Our environment relies heavily on both PHP and Python. Almost all of my code relating to data ingestion from other providers is written in Python. It is uncomplicated and robust and can talk to any datastore whether it’s RDBMS or NoSQL. Lastly, for data analysis, I use Big Data technologies, such as Spark, to forecast and recommend improvements based on how the data is consumed.
A: We are in a data revolution, and these are exciting times. Data used to be viewed as a simple necessity and lower on the totem pole. Now it is more widely recognized as the source of truth. As we move into more complex systems of data management, the role of the data engineer becomes extremely important as a bridge between the DBA and the data consumer. Beyond the ubiquitous spreadsheet, graduating from RDBMS (which will always have a place in the data stack), we now work with NoSQL and Big Data technologies.
As the tools and processes become more complex, and because raw data is always dirty, data engineers will always have a place in the workforce. I do think the tools we use will become more refined and more powerful, but I don’t see raw data ever arriving clean. Also, I think we will have new, as-yet-unknown to us, data models that will keep things fresh and keep data engineers always learning.
A: Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers giving meaning to an otherwise static entity. Simply put, data engineers clean, prepare and optimize data for consumption. Once the data becomes useful, data scientists can perform a variety of analyses and visualization techniques to truly understand the data, and eventually, tell a story from the data. All data has a story to tell.
The communication between a data engineer and a data scientist is vital. Typically, data is not just thrown in a database awaiting consumption. It needs to be optimized to the use case of the data scientist. Having a clear understanding of how this handshake occurs is important in reducing the human error component of the data pipeline.
Personally, I’m a fan of providing data access via an API. This allows scientists to focus on what they can do with the data rather than how to access the data. Not everyone understands SQL, and not everyone writes good SQL. PDFs and spreadsheets have their place in the board room. With a well-written RESTful API, the data engineer is able to provide the data scientist with either exactly what they want or the means to access raw data and then build their final product.
Lastly, I’ll just say that it’s important for data scientists to be appreciative of an engineer’s work. Last year, the NY Times wrote that 50 to 80 percent of a data scientist’s job is cleaning data. That is not the case once you have a team of data engineers on board, allowing the data scientist to focus on analytics.
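To make the "clean, prepare and optimize" step described above more concrete, here is a minimal Python sketch of the kind of cleanup a data engineer might do before handing records to a data scientist. It is illustrative only and does not come from the interview: the field names (`name`, `value`) and the cleaning rules (trim whitespace, drop unparseable numbers, de-duplicate on name) are assumptions chosen for the example.

```python
from typing import Iterable, List, Optional


def clean_record(raw: dict) -> Optional[dict]:
    """Normalize one raw record; return None if it cannot be used.

    The 'name' and 'value' fields are hypothetical, chosen for illustration.
    """
    name = (raw.get("name") or "").strip().lower()
    if not name:
        return None  # no usable key, so drop the record
    try:
        # Accept numbers and strings such as "1,200.5"
        value = float(str(raw.get("value", "")).replace(",", ""))
    except ValueError:
        return None  # value is not numeric, so drop the record
    return {"name": name, "value": value}


def clean_all(rows: Iterable[dict]) -> List[dict]:
    """Clean every record and keep only the first occurrence of each name."""
    seen = set()
    cleaned = []
    for raw in rows:
        record = clean_record(raw)
        if record is None or record["name"] in seen:
            continue
        seen.add(record["name"])
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    dirty = [
        {"name": " Alpha ", "value": "1,200.5"},
        {"name": "alpha", "value": "3"},    # duplicate after normalization
        {"name": "", "value": "1"},         # missing name
        {"name": "beta", "value": "n/a"},   # unparseable value
        {"name": "Gamma", "value": 7},
    ]
    print(clean_all(dirty))
```

In a real pipeline the rules would be agreed with the data scientist, which is the "handshake" the answer above refers to.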
A: I feel a data engineer should have the following traits:
- Mechanical tendencies. A curiosity to know how things work and how to make them better.
- Patience. Nothing will work the first time; there are just too many moving parts.
- Humility. Data engineers are the wizards of Oz. Ultimately, you are in a support role; you help build the underlying infrastructure. Be proud of your work and know that others may get more of the limelight because of your efforts.
- Focus. Designing data is one of my favorite aspects of my work, but it tends to be a smaller percentage of my day. A data engineer should want to be in the weeds, understanding the intricacies of how and why a data pipeline works as it does.
A: Of course it’s important to be fluent in the languages and tools that will help you get hired. But more important, I believe, is to understand what the tools are helping you to accomplish. Languages come and go, so it’s better to gain a full understanding of the concepts behind building a robust pipeline.
Also, be extremely comfortable at the command line. Text files still reign supreme, whether it’s your own code, csv/json/xml data, or log files.
Lastly, find a community and get involved! Check meetup.com for something in your area or local universities, which may have study groups that you can join. Keep a lookout for hackathons – they always need data specialists.
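As a small illustration of the point about text files in the answer above, the following Python sketch reads CSV and line-delimited JSON using only the standard library. The sample data and the helper names `load_csv` and `load_json_lines` are made up for this example and are not from the interview.

```python
import csv
import io
import json
from typing import Dict, List


def load_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of dicts (values stay strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def load_json_lines(text: str) -> List[dict]:
    """Parse line-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    csv_text = "id,city\n1,Oslo\n2,Lima\n"
    log_text = '{"level": "info", "msg": "started"}\n\n{"level": "error", "msg": "failed"}\n'

    print(load_csv(csv_text))
    errors = [entry for entry in load_json_lines(log_text) if entry["level"] == "error"]
    print(errors)
```

The same parsing works on real files by reading their contents with `open(path).read()` before calling the helpers.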
Data Engineer Salary for 2018: How much does a data engineer make?
As you might expect, Silicon Valley can be considered the center of well-paying IT jobs. According to PayScale, the median pay for a Data Scientist/Engineer in San Francisco as of January 2018 was $121,603 (over 30% above the national average). However, New York has pushed San Francisco out of first place with an average salary of $123,482.
Average Salary: $137,776 per year
Median Salary: $91,660 per year
Total Pay Range: $59,511 – $132,345
Senior Data Engineer
Average Salary: $172,604 per year
Big Data Engineer
Average Salary: $137,776
Note: Salary information from Glassdoor and PayScale was retrieved as of January 2018.
Jobs Similar to Data Engineer
As we’ve mentioned, a data engineer in a large company may work closely with a Data Architect. Like their counterparts in the physical world, architects are often immersed in the planning stages of infrastructure projects, while data engineers are deeply involved in the actual construction process.
It’s important to note that data engineers are typically not analysts. Their job is to make data available to others; they don’t use it to discover patterns and trends that affect business decisions. If you’re interested in that kind of role, you may wish to consider being a:
Having said that, some smaller companies may combine the roles of scientist and engineer into one.
Data Engineer Jobs
The growing need for data management and the transition to Cloud storage are increasing demand for data engineers to perform these tasks.
Job opportunities will be best for candidates who aren’t afraid of change! The evolution of Hadoop, which is increasingly being used as an enterprise data hub, advances in processing power for predictive analytics and a general shift towards the Cloud could make a data engineer’s life more complicated.
Even well-established methods of data management are mutating. For instance, instead of carefully designing a data model before entering data into a system, some engineers are now dumping their information into big data lakes (e.g. a Hadoop repository), where organizations can access and analyze data regardless of schema.
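The contrast described above is often called schema-on-read: raw records are stored as they arrive, and a schema is applied only when someone reads them. The Python sketch below illustrates the idea under simplifying assumptions. An in-memory list of JSON strings stands in for the data lake, and the field names and the `read_with_schema` helper are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Stand-in for a data lake: heterogeneous raw records stored without a fixed schema.
RAW_LAKE = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": "5", "country": "CA"}',
    '{"device": "sensor-1", "temp_c": 21.4}',
]


def read_with_schema(
    lines: Iterable[str], schema: Dict[str, Callable]
) -> Iterator[dict]:
    """Apply a schema at read time.

    `schema` maps each required field name to a casting function.
    Records missing any required field are skipped; extra fields are ignored.
    """
    for line in lines:
        record = json.loads(line)
        if not all(field in record for field in schema):
            continue
        yield {field: cast(record[field]) for field, cast in schema.items()}


if __name__ == "__main__":
    # Two different "views" over the same raw storage.
    clicks = list(read_with_schema(RAW_LAKE, {"user": str, "clicks": int}))
    temps = list(read_with_schema(RAW_LAKE, {"device": str, "temp_c": float}))
    print(clicks)
    print(temps)
```

The trade-off is that validation moves from write time to read time, so every consumer has to decide how to handle records that do not fit the schema it expects.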
Although these changes can sometimes make for a frustrating work day, it’s an incredibly exciting time to be a builder. If you love playing with new tools and can think outside the relational database box, you’ll be in a prime position to help companies adapt to the 21st century.
Professional Organizations for Data Engineers
- Data Management Association International (DAMA)
- Institute for the Certification of Computing Professionals (ICCP)
- The Data Warehousing Institute (TDWI)