Data engineers build massive reservoirs for big data. They develop, construct, test and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to – and from – these huge “pools” of filtered information, data scientists can pull relevant data sets for their analyses.
Data Engineer Responsibilities
In this hands-on builder's role, a data engineer may be required to:
- Design, construct, install, test and maintain highly scalable data management systems
- Ensure systems meet business requirements and industry practices
- Build high-performance algorithms, prototypes, predictive models and proof of concepts
- Research opportunities for data acquisition and new uses for existing data
- Develop data set processes for data modeling, mining and production
- Integrate new data management technologies and software engineering tools into existing structures
- Create custom software components (e.g. specialized UDFs) and analytics applications
- Employ a variety of languages and tools (e.g. scripting languages) to marry systems together
- Install and update disaster recovery procedures
- Recommend ways to improve data reliability, efficiency and quality
- Collaborate with data architects, modelers and IT team members on project goals
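The "specialized UDFs" item above can be made concrete with a toy example: SQLite, via Python's `sqlite3` module, lets you register a plain Python function as a SQL user-defined function. The function, table, and data below are invented for illustration; in a warehouse you would register the equivalent with Hive, Spark, or your RDBMS of choice.

```python
import sqlite3

def domain_of(email):
    """Hypothetical UDF: extract the lowercased domain from an email address."""
    return email.split("@")[-1].lower() if email and "@" in email else None

conn = sqlite3.connect(":memory:")
conn.create_function("domain_of", 1, domain_of)  # register the UDF with SQLite
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("ana@Example.com",), ("bob@data.io",), ("bad-address",)])

# The UDF is now callable from ordinary SQL.
domains = [row[0] for row in conn.execute("SELECT domain_of(email) FROM users")]
# domains -> ['example.com', 'data.io', None]
```

Once registered, the function behaves like any built-in: analysts can call it in their queries without knowing it is Python underneath.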
Data engineers may work closely with data architects (to determine what data management systems are appropriate) and data scientists (to determine which data are needed for analysis). They often wrestle with problems associated with database integration and messy, unstructured data sets. Their ultimate aim is to provide clean, usable data to whomever may require it.
An Interview with a Real Data Engineer
David Bianco has made a career of building geospatial data pipelines. He began his career with Esri’s ArcGIS Online, then moved to Booz Allen Hamilton, and now works at Urthecast. In 2014, he completed a two-month fellowship with Insight Data Engineering. He is based in San Francisco, CA, and you can find him online at twitter.com/talldave.
In addition, as with most companies, we manage data beyond our core product. For example, we analyze log files and customer-use patterns. All companies benefit from this knowledge, which turns into useful business metrics.
With APIs, I really feel web languages are sufficiently robust, so it’s not as important which one you choose as long as it is embraced as a common language amongst your team. Our environment relies heavily on both PHP and Python. Almost all of my code relating to data ingestion from other providers is written in Python. It is uncomplicated and robust and can talk to any datastore whether it’s RDBMS or NoSQL. Lastly, for data analysis, I use Big Data technologies, such as Spark, to forecast and recommend improvements based on how the data is consumed.
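A minimal sketch of the kind of Python ingestion code described here, assuming an invented provider payload and using an in-memory SQLite database as a stand-in for the production datastore:

```python
import json
import sqlite3

# Hypothetical raw payload from an upstream provider's API (shape is assumed).
raw = '[{"id": 1, "temp_c": "21.5"}, {"id": 2, "temp_c": "19.0"}]'

def ingest(payload, conn):
    """Parse provider JSON, coerce string fields to types, load into the store."""
    rows = [(r["id"], float(r["temp_c"])) for r in json.loads(payload)]
    conn.execute("CREATE TABLE IF NOT EXISTS readings (id INTEGER, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
count = ingest(raw, conn)  # count -> 2
```

The same parse-coerce-load pattern carries over whether the sink is an RDBMS or a NoSQL store; only the final write changes.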
As the tools and processes become more complex, and because raw data is always dirty, data engineers will always have a place in the workforce. I do think the tools we use will become more refined and more powerful, but I don’t see raw data ever arriving clean. Also, I think we will have new, as-yet-unknown to us, data models that will keep things fresh and keep data engineers always learning.
The communication between a data engineer and a data scientist is vital. Typically, data is not just thrown in a database awaiting consumption. It needs to be optimized to the use case of the data scientist. Having a clear understanding of how this handshake occurs is important in reducing the human error component of the data pipeline.
Personally, I’m a fan of providing data access via an API. This allows scientists to focus on what they can do with the data rather than how to access the data. Not everyone understands SQL, and not everyone writes good SQL. PDFs and spreadsheets have their place in the board room. With a well-written RESTful API, the data engineer is able to provide the data scientist with either exactly what they want or the means to access raw data and then build their final product.
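As a toy illustration of this API-first approach (not a production design; a real service would sit behind a web framework), a read-only GET handler might dispatch on path and query parameters like this, with the dataset and endpoint names invented:

```python
import json

# Toy in-memory "warehouse" standing in for the engineer's curated tables.
DATASET = [
    {"region": "west", "sales": 120},
    {"region": "east", "sales": 95},
]

def handle_get(path, params):
    """Hypothetical handler for GET requests on a read-only data API."""
    if path == "/v1/sales":
        region = params.get("region")
        rows = [r for r in DATASET if region is None or r["region"] == region]
        return 200, json.dumps(rows)
    return 404, json.dumps({"error": "unknown resource"})

status, body = handle_get("/v1/sales", {"region": "west"})
```

The data scientist asks for `/v1/sales?region=west` and gets JSON back; how the rows are stored and indexed stays the engineer's concern.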
Lastly, I’ll just say that it’s important for data scientists to be appreciative of an engineer’s work. Last year, the NY Times wrote that 50 to 80 percent of a data scientist’s job is cleaning data. That is not the case once you have a team of data engineers on board, allowing the data scientist to focus on analytics.
- Mechanical tendencies. A curiosity to know how things work and how to make them better.
- Patience. Nothing will work the first time; there are just too many moving parts.
- Humility. Data engineers are the wizards of Oz. Ultimately, you are in a support role; you help build the underlying infrastructure. Be proud of your work and know that others may get more of the limelight because of your efforts.
- Focus. Designing data is one of my favorite aspects of my work, but it tends to be a smaller percentage of my day. A data engineer should want to be in the weeds, understanding the intricacies of how and why a data pipeline works as it does.
Also, be extremely comfortable at the command line. Text files still reign supreme, whether it’s your own code, csv/json/xml data, or log files.
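Because so much data still arrives as plain text, comfort with quick filters pays off. A grep-style pass over CSV log data, for example, takes only a few lines of Python (the sample data here is invented):

```python
import csv
import io

# A small CSV payload standing in for a log file read from the command line.
text = "ts,level,msg\n1,INFO,start\n2,ERROR,disk full\n3,ERROR,retry\n"

# Keep only the messages from ERROR-level rows.
errors = [row["msg"] for row in csv.DictReader(io.StringIO(text))
          if row["level"] == "ERROR"]
# errors -> ['disk full', 'retry']
```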
Lastly, find a community and get involved! Check meetup.com for something in your area or local universities, which may have study groups that you can join. Keep a lookout for hackathons – they always need data specialists.
2018 Data Engineer Average Salaries
As you might expect, Silicon Valley can be considered the center of well-paying IT jobs. According to PayScale, the median pay as of January 2018 for a Data Scientist/Engineer in San Francisco was $121,603 (over 30% above the national average). However, New York has pushed San Francisco out of first place, with an average salary of $123,482.
Data Engineer
- Average Salary: $137,776 per year
- Median Salary: $91,660 per year
- Total Pay Range: $59,511 – $132,345

Senior Data Engineer
- Average Salary: $172,604 per year

Big Data Engineer
- Average Salary: $137,776 per year
Note: Salary information from Glassdoor and PayScale was retrieved as of January 2018.
Data Engineer Qualifications
What Kind of Degree Will I Need?
You will need a bachelor’s degree in computer science, software/computer engineering, applied math, physics, statistics or a related field and a lot of real-world skills to qualify for most entry-level positions.
Is a master’s required? It depends on the job. Some employers are more than willing to accept relevant work experience and proof of technical expertise in lieu of a higher degree.
What Kind of Skills Will I Need?
- Statistical analysis and modeling
- Database architectures
- Hadoop-based technologies (e.g. MapReduce, Hive and Pig)
- SQL-based technologies (e.g. PostgreSQL and MySQL)
- NoSQL technologies (e.g. Cassandra and MongoDB)
- Data modeling tools (e.g. ERWin, Enterprise Architect and Visio)
- Python, C/C++, Java, Perl
- MATLAB, SAS, R
- Data warehousing solutions
- Predictive modeling, NLP and text analysis
- Machine learning
- Data mining
- UNIX, Linux, Solaris and MS Windows
Since new data management technologies are appearing every day, this list is subject to change.
- Creative Problem-Solving: Approaching data organization challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
- Effective Collaboration: Carefully listening to management, data scientists and data architects to establish their needs.
- Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve data management problems.
- Industry Knowledge: Understanding the way your chosen industry functions and how data can be collected, analyzed and utilized; maintaining flexibility in the face of big data developments.
What About Certifications?
If you’re interested in buffing up specific skills, you’ll find a lot of vendor-specific certifications (e.g. Oracle, Microsoft, IBM, Cloudera etc.). To determine which ones are worth your investment, ask your mentors for advice, examine recent job descriptions and browse articles like Tom’s IT Pro “Best Of” certification lists for ideas.
Developed by the Data Management Association International (DAMA), the CDMP is a solid, all-round credential for general database professionals. Many employers will recognize the acronym on your résumé.
The CDMP is offered at four levels – associate, practitioner, master and fellow – and awarded to candidates who provide evidence of education, experience and passing results on the CDMP’s professional knowledge exam. Proof of continuing education and professional activity is required to re-certify.
Cloudera offers certification to candidates who pass a hands-on exam demonstrating their skills in data ingestion; transformation, staging and storage of data; analysis of data in formats such as Parquet, Avro and JSON; and the development of linear and branching workflows, among other abilities.
After a two-hour exam, successful candidates earn the Google Cloud data engineer credential. Google awards this certification to data engineers who can demonstrate their abilities in building and maintaining data structures, analyzing data, designing processing systems, and incorporating security, compliance and reliability into processes and structures.
Jobs Similar to Data Engineer
As we’ve mentioned, a data engineer in a large company may work closely with a Data Architect. Like their counterparts in the physical world, data architects are often immersed in the planning stages of infrastructure projects, while data engineers are deeply involved in the actual construction.
It’s important to note that data engineers are typically not analysts. Their job is to make data available to others; they don’t use it to discover patterns and trends that affect business decisions. If you’re interested in that kind of role, you may wish to consider a more analysis-oriented position.
Having said that, some smaller companies may combine the roles of scientist and engineer into one.
Data Engineer Job Outlook
Rising demand for data management, along with the ongoing transition to Cloud storage, translates into a growing need for data engineers to perform these tasks.
Job opportunities will be best for candidates who aren’t afraid of change! The evolution of Hadoop, which is increasingly being used as an enterprise data hub, advances in processing power for predictive analytics and a general shift towards the Cloud could make a data engineer’s life more complicated.
Even well-established methods of data management are mutating. For instance, instead of carefully designing a data model before entering data into a system, some engineers are now dumping their information into big data lakes (e.g. a Hadoop repository), where organizations can access and analyze data regardless of schema.
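The "schema-on-read" idea behind data lakes can be sketched in a few lines: records land in the lake with whatever fields they happen to have, and each query decides which fields it cares about. The newline-delimited JSON below is invented for illustration:

```python
import json

# Hypothetical "data lake" dump: newline-delimited JSON with no fixed schema.
lake = "\n".join([
    '{"user": "ana", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "country": "CA"}',
    '{"event": "heartbeat"}',
])

def total_clicks(raw):
    """Schema-on-read: tolerate records that lack the field entirely."""
    return sum(json.loads(line).get("clicks", 0) for line in raw.splitlines())

clicks = total_clicks(lake)
# clicks -> 10
```

Nothing was rejected at write time; the heartbeat record simply contributes zero when this particular query reads it.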
Although these changes can sometimes make for a frustrating work day, it’s an incredibly exciting time to be a builder. If you love playing with new tools and can think outside the relational database box, you’ll be in a prime position to help companies adapt to the 21st century.