Every modern organization that depends on data for its decision making is rethinking its data strategy. Compared to a few years back, there is now an abundance of new technologies and tools that promise to transform how businesses serve their customers and how they compete.
Rather than reacting to events, today’s organizations driven by a modern data architecture anticipate business requirements and optimize for better business outcomes. Companies that fail to act proactively and adopt a modern data architecture lose customers and market share, and are gradually driven out of business by the competition.
In this blog we are going to look at the ETL-based architecture, which has been prevalent for a couple of decades now, the challenges involved with ETL, and why the modern data architecture is evolving now after changing so little over that period.
From the above picture we can see how data typically flows in an organisation (top row). ETL in itself is just a means to an end: we do ETL to build dashboards and reports, and to use data in different ways across the organization.
Let's have a look at the typical journey of data in an organization. Considering the above image, we can see that data may be generated from multiple sources such as transactional databases, files, or even unstructured formats. Next, the data goes through the ETL process, is loaded into data warehouses or data marts, and then flows into a reporting layer that gives a visual representation of the data. The ultimate goal is to derive insights from the data.
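To make this journey concrete, here is a minimal sketch of one such pipeline in Python: extract from a transactional database, transform into a summary table, and load it into the warehouse that the reporting layer reads from. The connection strings, table names and the daily_sales_summary target are all hypothetical, and real pipelines usually run under an orchestrator such as Airflow.

```python
# A minimal extract-transform-load sketch; names and connections are illustrative.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("mysql+pymysql://user:pass@orders-db/shop")       # transactional source
warehouse = create_engine("postgresql://user:pass@warehouse/analytics")  # reporting warehouse

# Extract: pull yesterday's orders from the operational database
orders = pd.read_sql(
    "SELECT order_id, customer_id, amount, created_at FROM orders "
    "WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY",
    source,
    parse_dates=["created_at"],
)

# Transform: clean the data and aggregate it into a summary (fact-like) shape
orders = orders.dropna(subset=["customer_id"])
daily_summary = (
    orders.assign(order_date=orders["created_at"].dt.date)
          .groupby("order_date")
          .agg(total_revenue=("amount", "sum"), order_count=("order_id", "count"))
          .reset_index()
)

# Load: append the summary rows to the warehouse table the BI layer reads from
daily_summary.to_sql("daily_sales_summary", warehouse, if_exists="append", index=False)
```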
Now coming to the bottom row of the above image, we can see the different teams in an organization that work on the different layers of the data architecture. For example, if we have transactional data saved in a MySQL database or any other transactional database, it is owned by application engineers. The ETL pipelines are typically developed by data engineers, analysts or pipeline developers; their job typically consists of extracting data from the data sources, defining transformations, cleaning the data and creating fact tables, among other things. The analytics team maintains the warehouses or data marts, and in some organisations there are infra engineers or DevOps engineers who maintain the warehouses. Coming to the reporting side of things, analytics teams work on the data loaded into the data warehouse to build reports, generally using a BI engine. These reports are then consumed by the business teams, which could be business operations, product management or sales teams. They are the actual data consumers in the organization.
Therefore, we can see that data in an organization flows from left to right, whereas the requirements flow from right to left. Also, there are multiple systems and multiple teams through which the data flows. In short, the requirements and the data travel in opposite directions.
Now let's see how data architectures have evolved over time. Data warehousing started in the 1980s, and this model is still followed in some large companies. We have operational databases, with ETL tools like Informatica in the middle providing pure ETL infrastructure that pulls data from multiple data sources, builds transformations and pushes the data into a data warehouse or data marts. Earlier, organisations used to build individual data marts for individual teams; for example, there would be separate data marts for the marketing, human resources and sales teams. The BI tools would then talk to these data marts to build reports. This has been the most traditional ETL architecture in the data environment.
From around 2010, when big data and the consumer internet became more mainstream, organisations started generating lots of semi-structured and unstructured data. Previously, data volumes were not too high and the data was in a structured format. But with the emergence of big data, around late 2010 and 2011, the concept of the data lake came into existence. Data lakes helped organisations store large volumes of data at comparatively low cost. For example, in the cloud world we can think of AWS S3 as a data lake, where we started dumping all kinds of data (structured, semi-structured or unstructured) at low cost.
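As a tiny illustration of this "dump everything into the lake" idea, the sketch below lands structured, semi-structured and unstructured files in the same S3 bucket under different prefixes. The bucket name, prefixes and file names are made up for the example.

```python
# Hypothetical example of landing heterogeneous raw data in one S3 data lake.
import boto3

s3 = boto3.client("s3")
bucket = "acme-data-lake"  # illustrative bucket name

# Structured: nightly CSV export from a transactional database
s3.upload_file("orders_2024_01_15.csv", bucket, "raw/orders/orders_2024_01_15.csv")

# Semi-structured: clickstream events as newline-delimited JSON
s3.upload_file("events_2024_01_15.json", bucket, "raw/events/events_2024_01_15.json")

# Unstructured: support call recordings, images, documents, etc.
s3.upload_file("call_0451.mp3", bucket, "raw/call_recordings/call_0451.mp3")
```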
Along with this, we started using technologies like Hadoop and Spark, among others, to process and run ETL on this huge volume of data and ultimately push it into the data marts.
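A rough sketch of that Spark-based ETL step is shown below: read raw events from the lake, aggregate them into a pruned daily summary, and write it back in a columnar format for a downstream job to load into the mart. The paths and column names are illustrative, not from any specific system.

```python
# Illustrative PySpark job: raw lake data -> daily summary for the data mart.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-etl").getOrCreate()

# Raw, semi-structured events dumped into the lake (e.g. S3) as JSON
events = spark.read.json("s3a://acme-data-lake/raw/events/")

# Prune and aggregate into a daily summary the mart/BI layer can serve
daily = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("user_id").alias("unique_users"),
    )
)

# Write the summary back to the lake; a downstream step loads it into the mart
daily.write.mode("overwrite").parquet("s3a://acme-data-lake/summary/daily_events/")
```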
So when comparing the last two models, we can see that not much changed in the infrastructure. Only the volume of data increased, and we introduced a data lake in the middle to store this huge data at lower cost. Along with this, the ETL process started pruning the data and moving it into the data marts. Time series databases were also introduced around this time, with the capability of storing huge volumes of data at lower cost.
During the last 5 to 10 years we have seen a lot of progress in data science and machine learning. With a lot of data coming into these data lakes and the emergence of data prep pipelines, we started using the prepared data for machine learning through Python, PySpark and Spark.
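As a small illustration of that "prepared data to machine learning" step, the sketch below assumes a feature table already exists in the lake as Parquet and trains a simple model on it. The path, columns and churn label are hypothetical, and reading s3:// paths with pandas requires the s3fs package.

```python
# Hypothetical example: train a model on a prepared feature table from the lake.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Feature table produced by an upstream data prep pipeline (illustrative path)
features = pd.read_parquet("s3://acme-data-lake/prepared/customer_features.parquet")

X = features[["orders_last_30d", "avg_order_value", "days_since_signup"]]
y = features["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```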
In short, we can say that ETL has not completely gone away, but the architecture has picked up a lot of moving parts.
Referring to our earlier comment, data flows from left to right whereas the requirements flow from right to left. This requires a lot of coordination between teams and systems to make things happen in an organization. The end goal of the data is to provide insights to the business teams, but they don't have visibility into all the data; they only see the layer that is exposed to them, that is, the BI layer. As the data moves from left to right, a lot of decisions have already been made along the journey. For example, if a business user wants a particular kind of information from the data, they have to reach out to the analytics team, who in turn reach out to the pipeline developers, who again process the raw data and build a summary table, which is passed back to the business team through the analytics team. As we can see, this involves a lot of communication and synergy among teams and systems. There are also a lot of operational complexities that arise from the data moving through multiple systems, each with different limitations and capabilities. This leads to a lot of inefficiency, and any new insight demanded by the business teams may take months to be delivered.
Therefore, we can say there is a scale mismatch across the data journey. On the left-hand side we have the raw data, which gets pruned at every stage until we arrive at the summarised version of the data delivered by the BI dashboard.
Another problem across organisations is knowledge fragmented across different teams, which leads to long turnaround times. For example, pipeline developers may not know why certain KPIs are required or how they should optimise the core systems for them.
Since multiple teams are working on different tools, each settles for a local optimum. They don't have a holistic picture and tend to optimise for their own pipeline only. For example, the analytics team works to optimise only the data warehouse. What they fail to see is that adding or eliminating certain data may be an optimisation at their level but may not help at a holistic level. So there is a gap between the optimisation done at individual levels and the overall impact data can create for the organisation. Fixing this end to end requires a waterfall approach and takes months to implement.
Now, the challenge with this model is that businesses today are changing very fast compared to 15 or 20 years back, and this model was designed for large companies that work in a pre-defined, stable environment for a long time. For example, if a company is making cars, it can define its KPIs over 6 to 12 months and implement the complete ETL architecture over the next few months. In this case the KPIs will not change in the next month or two, so the waterfall approach may work, as the business model does not tend to change rapidly.
On the other hand, in the case of consumer internet companies, we are not sure what kind of KPIs will be required 2-3 months down the line. The business is changing very fast, hence the waterfall approach may not work here.
Under the modern data architecture, with the advancement of cloud technology, we are collapsing certain layers. For example, the ETL, data mart and reporting layers can be collapsed into a single cloud data lake layer.
This modern data architecture is relatively new; it was only in 2019 that people started writing on this topic and organisations started moving towards it. In this architecture, structured, semi-structured and unstructured data all come into the cloud data lake. These data lakes now provide scalable data storage. This is in contrast to our earlier assumption, where we considered the cloud data lake as just storage into which we dumped all the data and then processed it elsewhere. Data lakes have now become so powerful, scalable and cost effective that we can process the data there itself. For example, when we do ETL, we prune the raw data sets into summary tables. The raw data does not live in the data mart; it may lie somewhere else, like in S3. So if someone needs to do ad hoc analysis on the raw data, it is not possible in the ETL world, because the raw data is not exposed to them in the data warehouse. It lies in S3 or some other storage and requires a different set of tools to access and analyse.
Now, with the emergence of this modern architecture, our raw data and transformed data live in one place. Instead of the data flowing through different tools, we can look at it as different stages of the same data. All the data is in one place and all the processing is in one place: the first and second levels of summary sit together, and analysis can be done at any level of the data.
Let's say you are operating at a large scale: you have transactions on your mobile application, you are also tracking user activity on the app, and you are generating 10 million events per day. The data that you have covers the past 3 years, which adds up to billions of events. At the first layer, you can aggregate those events into certain KPIs on a daily, monthly and yearly basis, and 60-70% of your use cases are handled by the daily rollups. Now a new product launch requires a metric that is not available in the daily rollup. In the old ETL setup, you would need a lot of data movement to make that work. With this latest architecture, you can build the required metric without much effort, because all the data lies in one place and you can analyse it using a single tool, querying the data at a more granular level. All of this happens in a much more cost-effective way, at interactive query speeds. Earlier we didn't have interactive querying with ETL platforms like Hadoop; with the emergence of the cloud data lake we do. For example, on Google Cloud we have Google BigQuery, and on AWS we have Redshift and Athena, which can be coupled with S3.
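To make the rollup example concrete, here is a hedged sketch using the BigQuery client: dashboards keep reading the pre-aggregated daily rollup, while the new launch metric is answered by querying the raw events directly, with no new pipeline or data movement. The dataset and table names (analytics.daily_event_rollup, analytics.raw_events) and the event fields are hypothetical.

```python
# Illustrative only: routine rollup query vs. ad hoc query over raw events.
from google.cloud import bigquery

client = bigquery.Client()

# Routine case: dashboards read from the pre-aggregated daily rollup
rollup_sql = """
    SELECT event_date, SUM(event_count) AS events
    FROM analytics.daily_event_rollup
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY event_date
"""

# New metric for a product launch: query the raw events at full granularity
ad_hoc_sql = """
    SELECT user_id, COUNT(*) AS launch_screen_views
    FROM analytics.raw_events
    WHERE event_type = 'launch_screen_view'
      AND event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY user_id
"""

rollup = client.query(rollup_sql).to_dataframe()
ad_hoc = client.query(ad_hoc_sql).to_dataframe()
```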
Under this architecture, instead of fragmenting the data, we are keeping the data at one place organised at different levels of aggregation and summarisation and finally using that data in different tools.
The main advantage of this architecture is self-serve. Under self-serve, teams across the organisation work on a single copy of the truth. There is high agility in asking a new question, and all the data is accessible to all departments. This is unlike the ETL world, where teams are isolated from each other's data and don't know what data is available with other teams.
It's much easier to implement data governance across the organization. We have unified definitions and models across all data sets.
Business users can analyze the data on their own via a point-and-click visual interface.
Analysts can focus on multi-dimensional data modelling, rather than spending time on tactical reports asked by business teams.
The modern data architecture is easily scalable, as it is hosted on cloud platforms and designed for large volumes of data. The best part is that it is equally efficient when the volume of data is small, so organisations can start small and let this architecture facilitate their growth, right from a few hundred thousand records to billions of records and petabytes of data.
Sprinkle enables ingestion and enrichment into cloud data lakes. Sprinkle has connectors to hundreds of data sources from which data can be pulled into the cloud data lake, and enrichment can be done in one place. This is done without fragmenting data across many places, giving end users full visibility of the data.
The turnaround time for building a new KPI or a new report under this architecture is relatively low. For example, a KPI that took 3 weeks to develop under the ETL architecture can be done in a couple of hours here.
Most of these platforms have pay-as-you-scale models. They have lower data management overhead, as they do not require ETL and data movement, and the total cost of ownership is lower compared to the ETL architecture.
This architecture ensures high data quality as no data duplication happens across multiple systems.