Data architects create blueprints for data management systems. After assessing a company’s potential data sources (internal and external), architects design a plan to integrate, centralize, protect and maintain them. This allows employees to access critical information in the right place, at the right time.
Data Architect Responsibilities
A data architect may be required to:
- Collaborate with IT teams and management to devise a data strategy that addresses industry requirements
- Build an inventory of data needed to implement the architecture
- Research new opportunities for data acquisition
- Identify and evaluate current data management technologies
- Create a fluid, end-to-end vision for how data will flow through an organization
- Develop data models for database structures
- Design, document, construct and deploy database architectures and applications (e.g. large relational databases)
- Integrate technical functionality (e.g. scalability, security, performance, data recovery, reliability, etc.)
- Implement measures to ensure data accuracy and accessibility
- Constantly monitor, refine and report on the performance of data management systems
- Meld new systems with existing warehouse structures
- Produce and enforce database development standards
- Maintain a corporate repository of all data architecture artifacts and procedures
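Several of the responsibilities above – developing data models, designing and deploying database architectures, and producing development standards – come together in schema design. Below is a minimal, hypothetical sketch (table and column names are invented) of deploying a small relational schema with an index and a reporting view, using Python's built-in `sqlite3` module:

```python
import sqlite3

# Hypothetical illustration: a small relational schema (customers/orders)
# with an index and a reporting view -- the kind of artifact a data
# architect might design, document and deploy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    CREATE VIEW customer_totals AS
        SELECT c.name, SUM(o.total) AS lifetime_value
        FROM customers c JOIN orders o USING (customer_id)
        GROUP BY c.customer_id;
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
conn.executemany("INSERT INTO orders VALUES (?, 1, ?)",
                 [(1, 100.0), (2, 250.0)])
print(conn.execute("SELECT * FROM customer_totals").fetchall())
# [('Acme Corp', 350.0)]
```

The view is the architect's "contract" with downstream users: analysts query `customer_totals` without needing to know how the underlying tables are joined.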
You won’t be surprised to hear that this is a difficult job. Some companies need data architects who are ninjas in data modeling techniques; others want experts in data warehousing, ETL tools, SQL databases or data administration. Most data architects are senior-level employees with plenty of years in business intelligence under their belts.
An Interview with a Real Data Architect
We spoke with Craig Statchuk, Big Data Architect at IBM, to learn more about the responsibilities of data architects. Below, Craig discusses the pros and cons of his job, how the data architect position has changed over time, and his advice for students interested in becoming data architects.
A: The good part is you start most days in the new big data world. This includes everything within the role of a big data architect – someone who fulfills the needs of the entire enterprise beyond IT. In effect, this role is about taking care of more users in more places. Therefore, the pro is that most days you’ll start with a clean slate. You may not know what the day holds for you, but by lunchtime, you’ll have a long list of things to work on, to create, and hopefully to resolve in a short period of time. There’s a lot of value placed on immediate results. Instead of “let’s come up with a great design or a great long-term vision,” it’s more like “I want to see the answers now.”
The con is that it doesn’t quite work that way. You have to keep the long-term vision in place because you’ll see more and more of the same kinds of questions and the same requests. So the architecture that allows you to respond quickly to a wide variety of questions will serve you very well.
There’s a tendency to do things quick and easy, but the truth is you need time to think and really plan, and time is a scarce resource in a typical workday. So finding time to think about what you’ve done right and how to move forward is really the secret to doing the job well. But right now, it’s that push and pull that makes the job difficult.
A: That’s exactly it. Everything is short term. It’s the “I need it yesterday” mentality. The problem with this is that it leaves little room to focus on quality or other issues, such as governance or the idea that you need to serve not only your users but also the business. Those two are always in opposition because the users want things now and the business wants things done right. You have to balance the two.
A: The role is changing, and it’s growing quickly. A data scientist 10 years ago built the data warehouses and conducted the analysis in a constrained way. Nowadays, we’re seeing less demand for that, although it still exists in the office of finance, for example, since finance can easily categorize and quantify values according to accounting standards. The rest of the business wants the same kind of analysis so they can look at their data in a variety of ways, but they don’t necessarily have the rigidity or the structure necessary for that.
So what we’re seeing is the classic hype cycle with greater demand for new ways of looking at data. The truth is it’s not going to pan out as well as we hope. There’s going to be some disillusionment or dissatisfaction with the initial results. But at the end of the day, or let’s say at the end of two years, you will have an organization that is agile, able to answer more questions about the business and its customers, and more knowledgeable about what can be accomplished quicker and more accurately than they are today.
More data actually doesn’t make us smarter if we don’t have the ability to consume it. In some ways, it actually makes us less knowledgeable.
For instance, if you have more data, relatively speaking, but you lack the ability to process and understand it, then you know less than you used to. The way to combat that is to have systems that can adapt to the new data, understand and categorize it, and deliver it to more users quicker than before. That lets you turn the tide against big data. Without the proper resources in place, big data can result in significant confusion, but if it’s well-organized and well-provisioned, it can be the source of greater understanding.
So you have to balance that every day, which involves figuring out how to make the data more reusable.
A: Traditionally, we came from the world of Java and structured, traditional languages such as that. This was the language of the server, and we could use it with minimal modifications on the browser; so heritage played a big part in moving us in that direction. But nowadays, we’re seeing movement towards more flexible, more data- and statistics-oriented languages. My personal favorite is Python because it lets me be a computer scientist and have access to the statistics and other analytics that I need. Other people look at using both SPSS and languages such as R on a regular basis because they provide strong statistical packages, and the programming is often much easier and more accessible.
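Craig’s point about Python giving a computer scientist direct access to statistics can be illustrated with nothing but the standard library. The latency figures below are made up for the example:

```python
import statistics

# Illustrative only: a quick statistical summary without leaving a
# general-purpose language. The values are hypothetical latencies in ms.
response_ms = [120, 135, 128, 142, 500, 131, 125]

print(statistics.mean(response_ms))    # arithmetic mean, pulled up by the 500 ms outlier
print(statistics.median(response_ms))  # robust to that outlier
print(statistics.stdev(response_ms))   # sample standard deviation
```

In practice people reach for richer packages (pandas, SciPy, statsmodels), but the same one-liner ergonomics are what make the language attractive for this work.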
A: I graduated with a mathematics degree in 1983 from the University of Waterloo. At that time, we were very much into data structures, programming, and databases. We were taught things like first normal form and third normal form. This was standard for a good 20 years. We did structured programming and dealt with structured data. We worked with databases in a way that made users happy, and we answered certain types of questions very well. As we move forward, however, that data structure hasn’t been serving us as well.
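The normal forms Craig mentions are worth a concrete toy example (the table and column names here are invented). Third normal form keeps each fact in exactly one place – for instance, a department’s name lives in a `departments` table rather than being repeated on every employee row:

```python
import sqlite3

# Toy illustration of third normal form: the department name depends
# only on the department key, so it is stored once, not per employee.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES departments(dept_id)
    );
""")
conn.execute("INSERT INTO departments VALUES (10, 'Finance')")
conn.executemany("INSERT INTO employees VALUES (?, ?, 10)",
                 [(1, 'Ada'), (2, 'Grace')])

# Renaming the department is a single-row update, not a scan of
# every employee record.
conn.execute("UPDATE departments SET dept_name = 'FP&A' WHERE dept_id = 10")
rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e JOIN departments d USING (dept_id)
    ORDER BY e.emp_id
""").fetchall()
print(rows)  # [('Ada', 'FP&A'), ('Grace', 'FP&A')]
```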
For instance, we weren’t always able to understand just how quickly data was changing, and it took us a while before realizing that our way of processing could no longer keep up. So what we’ve arrived at is something that I can only call the new normal form. It represents data that’s just good enough and clean enough but not perfect. This is different from the past way of doing it, in which one piece of data points to other pieces of data, which then would lead us back to our understanding or something related to different pieces of data. Today, we have plenty of look-up tables and other things that help serve the business. The problem with those traditional data structures was that they didn’t allow us to answer enough questions.
The domain of questions that they could answer was really limited. But now we’ve come full circle with these giant spreadsheets. They represent rows and columns of data with lots of gaps, many inconsistencies, and lots and lots of columns. The new tools enable us to build good queries and build data that’s even more reliable than the stuff we used to ETL and put into those structured forms that I already mentioned.
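The “new normal form” Craig describes – data that is good enough and clean enough but not perfect – can be sketched with the standard library alone. The column names and defaults below are invented for illustration:

```python
import csv
import io

# Hypothetical messy extract: sparse columns, inconsistent casing, gaps.
raw = """region,channel,revenue
East,web,1200
west,,900
,retail,
East,retail,450
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# "Good enough, not perfect": normalize what we can, flag what we
# can't, and keep the row rather than discarding it.
cleaned = []
for r in rows:
    cleaned.append({
        "region":  (r["region"] or "UNKNOWN").title(),
        "channel": r["channel"] or "UNKNOWN",
        "revenue": float(r["revenue"]) if r["revenue"] else None,
    })

total = sum(r["revenue"] for r in cleaned if r["revenue"] is not None)
print(total)  # 2550.0
```

The key design choice is that gaps become explicit markers (`UNKNOWN`, `None`) instead of silently dropped rows, so downstream analysis can still use the data while knowing exactly where it is thin.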
The new solution is to take data and make it as reusable and as accurate as possible without sacrificing flexibility. That becomes a new role with a new focus. Now, I have to ask myself, “How do I produce data with maximum reusability within the company while also making it as accurate and as important as possible for the part of the business responsible for the systems of record that run the business?” You still have to service those, but now we have to service what we call “systems of engagement,” which is how we understand our customers, our employees, and even the products we build.
A: Nowadays, data architects come with many different skills and backgrounds. For instance, unlike 20 years ago, a pure data or computer science background may not be as helpful. The new skill is to understand the needs of the user so that you can build data and systems that will answer their problems now and in the future. We have to become proactive, and I liken the job to that of an inventor. In other words, we have to invent the solutions that users are going to ask for, not necessarily tomorrow but six months from now or even two years from now. That requires someone who is innovative and able to focus on the task today, but also someone who can look into the future and say, “Hey, here’s what they’re likely going to ask about in two years. How do I create my data? How do I create my business to serve me better down the road?”
We don’t want to be chasing wild geese, but we want to be able to predict and do the kind of processing that users are going to need down the road. That’s a difficult job to do. This opens up opportunities for skilled candidates with a variety of backgrounds. For instance, you could have an art or a business degree, and then you can come into the technical side of the business. In fact, that may be the best possible way to get the broad understanding of the business and then the ability to actually execute on it.
A: So I think the best piece of advice I can give is to become an expert. Become the absolute best you can be at a particular field of interest. I don’t care if that’s accounting, psychology, or data management. In five years, your job will be totally different than it is today. And you’re going to have to learn new skills all over again. The only way to survive is to anticipate that you’ll have to become an expert in a new set of skills in order to meet future demands. I need to do that every few years in my career. You have to get used to it, and you have to get good at it.
The ability to turn your career towards the next big thing, whether it’s Hadoop, Spark or maybe data preparation for a line of business, is essential. You’re going to have to understand that, and you’re going to have to help the people who need to do these functions by being the expert they can trust.
A: I think the data field is exploding, but our ability to process it is falling behind. The solution for the industry is scalability. However, I don’t think it’s in the way that we traditionally think about throwing more hardware at the problem. We need the ability to share and collaborate more so that everyone in a business participates. This allows us to divide the efforts and move in a common direction.
But I look at data science today, and I see we have individual people doing the entire analysis from start to finish – gathering the data, cleaning it, doing the analytics, creating the visualizations and presenting the results. The problem standing in the way of scalability is that, at each step along the way of that data science activity, all that knowledge is lost as soon as the task is complete. We need to move the industry towards sharing the work and sharing the role so that one person produces a reusable data set, the next person produces reusable analytics on top of it, and then we can all present the results more quickly and accurately.
I think this funnel is going to eventually constrict to the point that we will have to do something better. However, hiring more data scientists isn’t necessarily the answer. Creating more processing power, or even Spark 2, which gives us more processing power than we ever imagined, isn’t part of the solution, either. Rather, it involves getting teams of people to move each processing step forward, and then reusing that work so we don’t keep reinventing the wheel and doing everything from the ground up.
We used to say, in business analytics, you had a single version of the truth. Now we have to move towards the best-supported, most acceptable version of the truth, which allows us to get to the truth quicker and easier. As a team, we are providing the data and then sharing the results so that we can all reuse them and turn them into answers for the benefit of the business.
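The division of labor Craig describes – one person produces a reusable data set, the next produces reusable analytics on top of it – can be sketched as composable pipeline stages. The function names and record fields below are invented for illustration:

```python
# Each stage consumes the previous stage's output, so a prepared data
# set or an analytic can be reused by later teams instead of rebuilt.
def prepare(raw_records):
    """Data-engineering stage: produce a cleaned, reusable data set."""
    return [r for r in raw_records if r.get("amount") is not None]

def analyze(records):
    """Analytics stage: a reusable aggregate built on prepared data."""
    return sum(r["amount"] for r in records) / len(records)

def present(metric):
    """Presentation stage: format the shared result."""
    return f"average amount: {metric:.2f}"

raw = [{"amount": 10.0}, {"amount": None}, {"amount": 30.0}]
prepared = prepare(raw)          # reusable by any downstream analysis
report = present(analyze(prepared))
print(report)  # average amount: 20.00
```

Because each stage has a clear input/output contract, the knowledge embedded in `prepare` survives the task that created it – which is exactly the reuse the interview argues for.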
Data Architect Salaries
Home to Silicon Valley, San Francisco tops the list of best-paying cities for data architects. According to PayScale, the median pay in 2015 was $144,883 (29% above the national average). The runners-up? Boston – a hotbed of large universities and healthcare entities – with a median pay of $144,883 (23% above) and New York, with a median pay of $135,467 (20% above).
Average Salary (2015): $100,118 per year
Median Salary (2015): $106,769 per year
Total Pay Range: $70,470 – $158,551
Senior Data Architect
Average Salary (2015): $111,532 per year
Data Architect Qualifications
What Kind of Degree Will I Need?
To become a data architect, you should start with a bachelor’s degree in computer science, computer engineering or a related field. Coursework should include coverage of data management, programming, big data developments, systems analysis and technology architectures. For senior positions, a master’s degree is usually preferred.
The key aspect of your employment application will be experience. Top employers expect job candidates to have spent at least five years dealing with application architecture, network management and performance management, if not more.
What Kind of Skills Will I Need?
- Application server software (e.g. Oracle)
- Database management system software (e.g. Microsoft SQL Server)
- User interface and query software (e.g. IBM DB2)
- Enterprise application integration software (e.g. XML)
- Development environment software
- Backup/archival software
- Agile methodologies and ERP implementation
- Predictive modeling, NLP and text analysis
- Data modeling tools (e.g. ERWin, Enterprise Architect and Visio)
- Data mining
- ETL tools
- Python, C/C++, Java, Perl
- UNIX, Linux, Solaris and MS Windows
- Hadoop and NoSQL databases
- Machine learning
- Data visualization
As always, this list is subject to changes in technology.
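The ETL tools in the list above all share one basic shape: extract from a source, transform the records, load them into a target. A minimal sketch of that shape, with invented column names and a hypothetical 7% tax transform, using only the standard library:

```python
import csv
import io
import sqlite3

# Minimal extract-transform-load sketch. Real ETL tools layer
# scheduling, logging and error handling on top of this same shape.
source = "sku,price\nA1,9.99\nB2,19.50\n"            # extract (inline CSV)
records = [(r["sku"], float(r["price"]) * 1.07)      # transform: add 7% tax
           for r in csv.DictReader(io.StringIO(source))]

conn = sqlite3.connect(":memory:")                    # load
conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, gross REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", records)

print(conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0])  # 2
```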
- Analytical Problem-Solving: Approaching high-level data challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
- Effective Communication: Carefully listening to management, data analysts and relevant staff to come up with the best data design; explaining complex concepts to non-technical colleagues.
- Expert Management: Effectively directing and advising a team of data modelers, data engineers, database administrators and junior architects.
- Industry Knowledge: Understanding the way your chosen industry functions and how data are collected, analyzed and utilized; maintaining flexibility in the face of big data developments.
What About Certifications?
As of 2015, there was no expert certification explicitly dedicated to data architects. There are, however, plenty of skill-specific credentials from vendors with a stake in data management (e.g. Oracle, Microsoft, IBM, etc.). When in doubt, consult your mentors, examine recent job descriptions and check out articles such as Tom’s IT Pro’s Best Database Certifications for 2015 to decide which acronyms are worth your time and money.
Developed by the Institute for Certified Computing Professionals (ICCP), the CDMP is probably the most frequently listed certification on data architects’ résumés. Since it doesn’t focus on a particular platform or vendor, it’s a solid credential for general database professionals.
The CDMP is offered at two levels – Practitioner or Mastery – and awarded to candidates who provide evidence of education, experience and passing results on the CDMP’s professional knowledge exam. Proof of continuing education and professional activity is required to re-certify.
Jobs Similar to Data Architect
You can take a variety of paths to become a data architect. Folks often get their start working as Database Administrators (DBAs) or entry-level programmers. By concentrating on the day-to-day tasks involved with data management (e.g. installation, upgrades, back-up and recovery, etc.), DBAs gain an understanding of how data are stored and used.
Data architects are often compared with data engineers. Broadly speaking:
- Architects tend to be concerned with the 10,000-foot view – analyzing business needs, developing logical models and defining procedures.
- Engineers may be more involved in the construction phase – designing, implementing and maintaining data management systems from the ground up.
Data architects do not analyze data. Instead, they make it available to others. If you’re interested in playing in the analyst sandbox, you may want to consider an analyst role instead.
Data Architect Job Outlook
As with every other job related to data management, you can expect demand for this title to grow over the next 10 years. Although data architects have only been around since the 1990s, companies are continuing to seek them out. The reason? Big data.
In the past, building back-end data management systems was relatively straightforward. Architects might set up a warehouse, structure and consolidate information into an SQL database and make data available to individual departments. Job done.
Those days are gone. As information floods the market, analysts are demanding access to all kinds of unstructured data (e.g. audio, video, text) that could help in making business decisions. This leaves architects with the tricky task of mixing new technologies (e.g. Hadoop) with existing relational databases to create flexible infrastructures that are cost-effective and secure.
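One common way to mix unstructured sources with an existing relational store, as described above, is to land semi-structured documents as JSON text in a relational table and flatten the fields analysts need on read. The event fields below are hypothetical:

```python
import json
import sqlite3

# Sketch of one common compromise: store semi-structured documents as
# JSON text inside a relational table, then flatten on read.
doc = {"user": "u42", "event": "play", "meta": {"duration_s": 31}}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (raw TEXT)")
conn.execute("INSERT INTO events VALUES (?)", (json.dumps(doc),))

# Flatten on read: parse the stored JSON back into typed columns.
flattened = [
    (d["user"], d["event"], d["meta"]["duration_s"])
    for (raw,) in conn.execute("SELECT raw FROM events")
    for d in [json.loads(raw)]
]
print(flattened)  # [('u42', 'play', 31)]
```

Production systems would push the unstructured side to a dedicated store (e.g. Hadoop or a document database), but the trade-off is the same: keep the raw document intact while exposing a relational view of the fields the business asks about.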
As Dip Kharod notes, big data architects must now ask themselves:
“How do I build a platform that provides just enough information in the hands of the business to make timely decisions while processing a massive amount of data that allows advanced analytics to answer never-before-asked questions in a secure environment?”