Data engineers build massive reservoirs for big data. They develop, construct, test and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to – and from – these huge “pools” of filtered information, data scientists can pull relevant data sets for their analyses.
Data Engineer Responsibilities
In this hands-on builder’s role, a data engineer may be required to:
- Design, construct, install, test and maintain highly scalable data management systems
- Ensure systems meet business requirements and industry practices
- Build high-performance algorithms, prototypes, predictive models and proof of concepts
- Research opportunities for data acquisition and new uses for existing data
- Develop data set processes for data modeling, mining and production
- Integrate new data management technologies and software engineering tools into existing structures
- Create custom software components (e.g. specialized UDFs) and analytics applications
- Employ a variety of languages and tools (e.g. scripting languages) to marry systems together
- Install and update disaster recovery procedures
- Recommend ways to improve data reliability, efficiency and quality
- Collaborate with data architects, modelers and IT team members on project goals
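As one concrete illustration of the "custom software components (e.g. specialized UDFs)" bullet above, here is a minimal sketch of registering a user-defined function with a database from Python. It uses SQLite from the standard library as a stand-in for a production engine like Hive or Spark, and the table and function names are hypothetical:

```python
import sqlite3

def normalize_email(raw):
    """UDF: lowercase and strip whitespace so joins on email are reliable."""
    return raw.strip().lower() if raw else None

conn = sqlite3.connect(":memory:")
# Register the Python function as a SQL function named normalize_email
conn.create_function("normalize_email", 1, normalize_email)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("  Alice@Example.COM ",), ("bob@example.com",)],
)

# The UDF is now callable from plain SQL, just like a built-in function.
rows = conn.execute("SELECT normalize_email(email) FROM users").fetchall()
print(rows)  # [('alice@example.com',), ('bob@example.com',)]
```

In Hive or Spark the registration mechanics differ, but the idea is the same: push reusable cleanup logic into the query engine so every consumer gets consistent results.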
Data engineers may work closely with data architects (to determine what data management systems are appropriate) and data scientists (to determine which data are needed for analysis). They often wrestle with problems associated with database integration and messy, unstructured data sets. Their ultimate aim is to provide clean, usable data to whoever may require it.
An Interview with a Real Data Engineer
David Bianco has made a career of building geospatial data pipelines. He began his career with Esri’s ArcGIS Online, then moved to Booz Allen Hamilton, and now works at UrtheCast. In 2014, he completed a two-month fellowship with Insight Data Engineering. He is based in San Francisco, CA, and you can find him online at twitter.com/talldave.
In addition, as with most companies, we manage data beyond our core product. For example, we analyze log files and customer-use patterns. All companies benefit from this knowledge, which turns into useful business metrics.
With APIs, I really feel web languages are sufficiently robust, so it’s not as important which one you choose as long as it is embraced as a common language amongst your team. Our environment relies heavily on both PHP and Python. Almost all of my code relating to data ingestion from other providers is written in Python. It is uncomplicated and robust and can talk to any datastore whether it’s RDBMS or NoSQL. Lastly, for data analysis, I use Big Data technologies, such as Spark, to forecast and recommend improvements based on how the data is consumed.
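A minimal sketch of the kind of Python ingestion code described above. The provider payload, table, and field names here are hypothetical; a real pipeline would fetch the payload over HTTP and write to a production RDBMS or NoSQL store rather than an in-memory SQLite database:

```python
import json
import sqlite3

# Hypothetical provider response; a real pipeline would fetch this over HTTP.
payload = '[{"id": 1, "lat": 37.77, "lon": -122.42}, {"id": 2, "lat": 47.61, "lon": -122.33}]'

def ingest(records, conn):
    """Upsert provider records into a table, keyed on the provider's id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (id INTEGER PRIMARY KEY, lat REAL, lon REAL)"
    )
    # INSERT OR REPLACE makes re-running the ingest idempotent.
    conn.executemany(
        "INSERT OR REPLACE INTO points (id, lat, lon) VALUES (:id, :lat, :lon)",
        records,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
ingest(json.loads(payload), conn)
count = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
print(count)  # 2
```

Because the ingest step is idempotent, a failed run can simply be retried, which is one of the properties that keeps a pipeline robust against the "nothing works the first time" reality described later in this interview.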
As the tools and processes become more complex, and because raw data is always dirty, data engineers will always have a place in the workforce. I do think the tools we use will become more refined and more powerful, but I don’t see raw data ever arriving clean. Also, I think we will have new, as-yet-unknown to us, data models that will keep things fresh and keep data engineers always learning.
The communication between a data engineer and a data scientist is vital. Typically, data is not just thrown in a database awaiting consumption. It needs to be optimized to the use case of the data scientist. Having a clear understanding of how this handshake occurs is important in reducing the human error component of the data pipeline.
Personally, I’m a fan of providing data access via an API. This allows scientists to focus on what they can do with the data rather than how to access the data. Not everyone understands SQL, and not everyone writes good SQL. PDFs and spreadsheets have their place in the board room. With a well-written RESTful API, the data engineer is able to provide the data scientist with either exactly what they want or the means to access raw data and then build their final product.
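As a sketch of that API-first approach, the following exposes one small read-only endpoint using only Python’s standard library. The dataset and route are hypothetical, and a real service would typically use a framework such as Flask behind a production web server:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory dataset standing in for a curated, cleaned table.
DATASET = [{"sensor": "a1", "reading": 0.42}, {"sensor": "b7", "reading": 1.97}]

class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /readings returns the cleaned records as JSON, so consumers
        # never have to write SQL against the backing store themselves.
        if self.path == "/readings":
            body = json.dumps(DATASET).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # silence per-request logging in this demo

def serve(port=8000):
    """Serve the API (blocking); call serve() to run it locally."""
    HTTPServer(("127.0.0.1", port), DataAPI).serve_forever()
```

The point of the design is the contract: the data scientist requests `/readings` and gets clean JSON, while the engineer is free to swap out the storage layer behind the endpoint.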
Lastly, I’ll just say that it’s important for data scientists to be appreciative of an engineer’s work. Last year, the NY Times wrote that 50 to 80 percent of a data scientist’s job is cleaning data. That is not the case once you have a team of data engineers on board, allowing the data scientist to focus on analytics.
- Mechanical tendencies. A curiosity to know how things work and how to make them better.
- Patience. Nothing will work the first time; there are just too many moving parts.
- Humility. Data engineers are the wizards behind the curtain. Ultimately, you are in a support role: you help build the underlying infrastructure. Be proud of your work, and know that others may get more of the limelight because of your efforts.
- Focus. Designing data is one of my favorite aspects of my work, but it tends to be a smaller percentage of my day. A data engineer should want to be in the weeds, understanding the intricacies of how and why a data pipeline works as it does.
Also, be extremely comfortable at the command line. Text files still reign supreme, whether it’s your own code, CSV/JSON/XML data, or log files.
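To illustrate why comfort with plain text files pays off, here is a short Python sketch that tallies HTTP status codes from a few hypothetical web-server log lines (a real script would read the actual log file from disk):

```python
from collections import Counter
from io import StringIO

# Hypothetical log lines; in practice you'd use open("access.log") instead.
LOG = StringIO(
    "10.0.0.1 GET /api/v1/data 200\n"
    "10.0.0.2 GET /api/v1/data 500\n"
    "10.0.0.1 GET /api/v1/data 200\n"
)

# Tally HTTP status codes (the last whitespace-separated field on each line).
status_counts = Counter(line.split()[-1] for line in LOG if line.strip())
print(status_counts.most_common())  # [('200', 2), ('500', 1)]
```

A dozen lines like these, or their `awk`/`sort`/`uniq` equivalents at the shell, cover a surprising share of day-to-day debugging.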
Lastly, find a community and get involved! Check meetup.com for something in your area or local universities, which may have study groups that you can join. Keep a lookout for hackathons – they always need data specialists.
Data Engineer Salaries
As you might expect, Silicon Valley is the center of well-paying IT jobs. According to PayScale, the median pay in 2015 for a Data Scientist/Engineer in San Francisco was $117,388 (25% above the national average). Seattle was in second place, with a median pay of $105,340 (12% above).
Data Engineer
- Average Salary (2015): $95,936 per year
- Median Salary (2015): $91,782 per year
- Total Pay Range: $58,773 – $143,419

Senior Data Engineer
- Average Salary (2015): $124,338 per year

Big Data Engineer (Robert Half Technology 2015 Salary Guide)
- Average Salary (2014): $110,250 – $152,750
- Average Salary (2015): $119,250 – $168,250
Data Engineer Qualifications
What Kind of Degree Will I Need?
You will need a bachelor’s degree in computer science, software/computer engineering, applied math, physics, statistics or a related field and a lot of real-world skills to qualify for most entry-level positions.
Is a master’s required? It depends on the job. Some employers are more than willing to accept relevant work experience and proof of technical expertise in lieu of a higher degree.
What Kind of Skills Will I Need?
- Statistical analysis and modeling
- Database architectures
- Hadoop-based technologies (e.g. MapReduce, Hive and Pig)
- SQL-based technologies (e.g. PostgreSQL and MySQL)
- NoSQL technologies (e.g. Cassandra and MongoDB)
- Data modeling tools (e.g. ERWin, Enterprise Architect and Visio)
- Python, C/C++, Java, Perl
- MATLAB, SAS, R
- Data warehousing solutions
- Predictive modeling, NLP and text analysis
- Machine learning
- Data mining
- UNIX, Linux, Solaris and MS Windows
Since new data management technologies are appearing every day, this list is subject to change.
- Creative Problem-Solving: Approaching data organization challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
- Effective Collaboration: Carefully listening to management, data scientists and data architects to establish their needs.
- Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve data management problems.
- Industry Knowledge: Understanding the way your chosen industry functions and how data can be collected, analyzed and utilized; maintaining flexibility in the face of big data developments.
What About Certifications?
If you’re interested in buffing up specific skills, you’ll find a lot of vendor-specific certifications (e.g. Oracle, Microsoft, IBM and Cloudera). To determine which ones are worth your investment, ask your mentors for advice, examine recent job descriptions and browse articles like Tom’s IT Pro “Best Of” certification lists for ideas.
Developed by the Institute for Certification of Computing Professionals (ICCP), the CDMP is a solid, all-around credential for general database professionals. Many employers will recognize the acronym on your résumé.
The CDMP is offered at two levels – Practitioner or Mastery – and awarded to candidates who provide evidence of education, experience and passing results on the CDMP’s professional knowledge exam. Proof of continuing education and professional activity is required to re-certify.
Jobs Similar to Data Engineer
As we’ve mentioned, a data engineer in a large company may work closely with a Data Architect. Like their counterparts in the physical world, data architects are often immersed in the planning stages of infrastructure projects, while data engineers are deeply involved in the actual construction.
It’s important to note that data engineers are typically not analysts. Their job is to make data available to others; they don’t use it to discover patterns and trends that affect business decisions. If you’re interested in that kind of role, you may wish to consider becoming a data analyst or data scientist instead.
Having said that, some smaller companies may combine the roles of scientist and engineer into one.
Data Engineer Job Outlook
It looks like data engineers are finally getting their due. As Alex Woodie of Datanami points out in his 2014 article, “figures from job posting websites show much higher demand for data engineers than for data scientists.” Faced with a tsunami of big data, companies are eager for experts who can “ensure that data pipelines are scalable, repeatable, and secure, and can serve multiple constituents in the enterprise.”
Job opportunities will be best for candidates who aren’t afraid of change! The evolution of Hadoop (which is increasingly being used as an enterprise data hub), rapid advances in processing power for predictive analytics, and a general shift toward the cloud are all making a data engineer’s life more complicated than ever before.
Even well-established methods of data management are mutating. For instance, instead of carefully designing a data model before entering data into a system, some engineers are now dumping their information into big data lakes (e.g. a Hadoop repository), where organizations can access and analyze data regardless of schema.
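A toy sketch of the schema-on-read idea behind data lakes: raw, heterogeneous JSON records are stored as-is (here, as strings in a list standing in for files in HDFS or S3), and each consumer projects only the fields it needs at read time. All records and field names here are hypothetical:

```python
import json

# Raw, schemaless events as they'd land in a lake; note the two records
# don't even share the same set of fields.
lake = [
    '{"event": "click", "user": "u1", "page": "/home"}',
    '{"event": "purchase", "user": "u2", "amount": 9.99}',
]

def read_with_schema(raw_records, fields):
    """Schema-on-read: project the fields this analysis needs, tolerating gaps."""
    for raw in raw_records:
        record = json.loads(raw)
        yield {f: record.get(f) for f in fields}

# Two consumers impose two different schemas on the same raw data.
clicks = list(read_with_schema(lake, ["user", "page"]))
revenue = list(read_with_schema(lake, ["user", "amount"]))
print(clicks)  # [{'user': 'u1', 'page': '/home'}, {'user': 'u2', 'page': None}]
```

The trade-off is exactly the one the paragraph above hints at: ingestion becomes trivial, but the burden of interpreting the data shifts to read time, which is where careful data engineering still earns its keep.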
Although these changes can sometimes make for a frustrating work day, it’s an incredibly exciting time to be a builder. If you love playing with new tools and can think outside the relational database box, you’ll be in a prime position to help companies adapt to the 21st century.