A Complete guide for selecting the Right Data Warehouse - Snowflake vs Redshift vs BigQuery vs Hive vs Athena
A Data Warehouse is the basic platform required today for any data driven business. A Data Warehouse is a one stop repository for all your business data across verticals and channels. Data may be coming from your web or mobile application, website, CRM systems or several marketing channels. Now, getting all this data at one place is the key to driving business insights and future planning as well. With the emergence of Cloud services Data Warehouse has also evolved from being on premise, accessible to only a few members of the IT team and requiring purified, structured data to being on the cloud, and able to process semi structured as well as unstructured data. This is where a new term ‘Data Lake’ came into existence. A Data Lake is a modern, much more advanced version of the old Data Warehouse. Data Warehouse is a much traditional concept as compared to Data Lakes and has been around for a few decades now. Whereas, the term Data Lake has been coined around 4 to 5 years back, with the emergence of Big Data and varied use cases for businesses.
Data Lake vs Data Warehouse
Let's first understand the basic difference between a Data Lake and a Data Warehouse. Only then would we be in a position to evaluate the multiple Data Warehouses/Data Lakes available in the market and make an informed decision about the one that suits our requirements the best.
- Approach: As mentioned earlier, Data Warehouse has a much older and traditional methodology for storing data as compared to Data Lake which is much modern and gives the user a lot of flexibility, in terms of supported data forms, architecture and costs to the user.
- Data Structure: Data Warehouses of the past essentially required cleansed data. But, with the emergence of semi-structured and unstructured data types, the requirement for a modern and much more flexible data storing option became necessary and thus the emergence of Data Lakes in this ecosystem.
- Architecture: Data Warehouses generally follow the ETL architecture model wherein the data before loading into the data warehouse needs to be transformed. This is due to the fact that Warehouses need cleansed data. On the other hand, Data Lakes generally follow the ELT architecture, which is much more user friendly and modern. Under the ELT architecture all the data is loaded into the warehouse and then as per the requirements of the user, he can transform the data to build analytical models.
- In other words, Data Warehouses follow the waterfall model wherein the user starts with the requirement of the reports, designing the data model and then designing the transformation. Based on these requirements then the required data is loaded into the warehouse.
- Schema: Data Warehouses have a schema on write model whereas the Data Lakes follow a Schema on read, a much modern concept. This enables the user to transform the data as per his use case at a given point of time.
- Storage: With the traditional Data Warehouses, they can be scaled vertically or horizontally but the storage is co-located with the compute. Generally these are premium storages using SSD and are pretty expensive machines on which these are deployed.
- Under the Data Lakes, storage can be independently scaled from the compute and it integrates well with the cloud storages. Also, the cost is relatively low as compared to Data Warehouses.
- But with the emergence of the modern business practices the business requirements are very dynamic. In such a scenario, Data Lakes appear to be the preferred choice for businesses. Therefore, Data Warehouses are now adopting some of the features of the Data Lakes. Hence, the boundary between Data Warehouse and Data Lake is getting blurred day by day.
- Now that we have a clear understanding of the differences between Data Warehouses and Data Lakes, lets have a look at the criterias on which we should evaluate the solutions available in the market and which could be our preferred choice.
Factors to consider while evaluating Data Warehouse/Data Lakes
The Data Warehouses/Lakes that we would be comparing are
- We will be evaluating the above mentioned Data Warehouses/Lakes based on the below mentioned criterias.
- Data Loading
- Query Performance Scaling
In today’s cloud world, deployment is one of the most important characteristics of any software or tool that we choose for our business. For the DataWarehouse/Lake we will be comparing them on 3 deployments modes. They are SaaS, Server-Less, Cloud/On-Prem.
- SaaS: As we all understand SaaS products are those which do not require us to provision any space on our servers and are readily available on subscription. These services can be availed by just clicking a few buttons and signing up on the vendors website.
- Server-less: This is a relatively new architecture under which we as users don’t need to manage or provision any servers. The service itself scales up or down based on the usage.
- Cloud/On-Prem: Data Warehouses/Lakes which are available on Cloud requires us to provision the servers and maintain them as well. If we take the example of Redshift, since it is a product by Amazon, it is available exclusively on AWS as a service. But, we need to provision the servers and manage them as well.
- When it comes to Snowflake, it is a cloud agnostic platform and runs on all the three major clouds, i.e. AWS, GCP and Azure. Since it is a complete SaaS model we don't need to manage any servers and can start using it by subscribing on their website. While subscribing we can choose our preferred cloud on which we want it to run.
- Coming to Hive, it is an open source platform. There are multiple flavors of Hive for example AWS EMR, Azure HDInsight, Google Dataproc and independent companies like Qubole and Databricks offer Hive. Out of the 5 data warehouses that we are comparing, Hive is the only one which could be deployed on-prem by the user.
- BigQuery is similar to Snowflake and has a SaaS offering. But since it is a Google product it runs only on GCP.
- Athena is offered by AWS and is very similar to BigQuery. Although AWS has Redshift as an offering in the Data Warehouse space, Athena is fast emerging because of its SaaS and serverless models which provides modern businesses flexibility in terms of ease of usage and lesser effort consumption.
Next to Deployment, data loading is one of the most important criteria for comparing available Data Warehouse/Lake solutions in the market today. It is one of the most important aspect of data warehousing.
By doing this comparison we’ll come to know how easy or difficult it actually is to get data inside the Data Warehouse/Lake and also how easy or difficult it is querying the data inside the Data Warehouse/Lake.
A lot of data even starts appearing from the cloud storage. Our raw business data, from different data sources could be lying in places like My SQL (transaction data), Kafka and Kinesis (event data) and also from external services through REST API. All these data needs to be ingested into the data warehouse/lake. Different warehouses provide different mechanisms for data ingestion from these data sources.
In Redshift, the data has to be loaded into the warehouse using APIs wherein it organizes data in optimal form so it is performant on query. Recently Redshift has added support for external tables using Redshift spectrum. Spectrum is where we can point Redshift to S3 storage and define the external table enabling us to read the data lying there using SQL query. Adding Spectrum has enabled Redshift to offer services similar to a Data Lake.
Snowflake and Hive supports external table loading and has been very popular over time.
BigQuery doesn’t have an external table concept. Data needs to be loaded in BigQuery where it gets optimised enabling much better query performance.
Athena doesn’t require any data loading. With Athena, we can say that everything is an external table. All the data in S3 can be cataloged using the metastore, create a logical table in Athena and we can query all the data lying in S3 directly from Athena using SQL. Its clearly schema on read
All of the data warehouses/lakes support most of the popular file formats like CSV, JSON, ORC, Avro, Parquet.
Streaming inserts is something which is only provided by Big Query. It actually means inserting or updating data at row level. Although, some warehouses do provide updates or insertion in real time but they don’t scale at a level at which BIg Query can scale. These warehouses insert data in batches or micro batches and not streams like BigQuery. As soon as the record is inserted in Big Query, it is available for querying.
Hive has added support for real time updates using their ACID features. But, performance is still a concern using ACID. At lower scale it works but at a higher scale it has a lot of performance implications.
Considering these factors, we can say that loading data from different data sources into our preferred data warehouse is not straightforward as it seems. It requires multiple level of customization if we are loading data in Snowflake vs Redshift vs BigQuery for the performance to be optimum.
Data Loading into the Data Warehouse/Lake is just a means to the end. The end being able to run query on the data lying in the data warehouse.
While some of the query engines used by the above mentioned data warehouses/lakes are based on open source engines, some warehouses do have their proprietary search engine as well. For example, Redshift uses the query engine from a company called ParAccel.
Snowflake has its own proprietary query engine. They have separated the storage and compute. They have optimised the performance at different layers.
Hive uses an open source query engine which has been popular for quite a long time.
BigQuery has based its search engine on the Google Dremel algorithm. Although the algorithm is available publicly, the implementation is proprietary.
Athena is based on an open source query engine called Presto. It is a query engine developed by Facebook. Managing presto is a huge task like managing Hive. So, Amazon has created a SaaS service on top of Presto so users don’t have to manage the Presto cluster.
Redshift uses Postgresql, Snowflake, BIgQuery and Athena uses ANSI SQL whereas Hive uses its own HQL for querying.
Complex Data Types
Apart from Redshift, all other Data Warehouses (Snowflake, Hive, BigQuery, Athena) supports complex data structures like arrays, maps and structs as first class citizens. Redshift supports only primitive data types.
UDF is a function which operates at row and creates a single value. Whereas UDAF is a function which operates on multiple rows and then creates value.
Redshift and Athena support only UDF and not aggregated functions. However, with Snowflake, Hive, BigQuery we can define both UDF and UDAF with ease.
Query Performance Scaling
One of the major differences between the above mentioned warehouses/lakes is the ability to scale independently. The data can sit at one place and we can provision multiple data warehouses separately on the same data. This enables us to create multiple clusters on the same data for different teams based on their respective SLAs.
Redshift and BigQuery don't support this independent scaling as the data sits on the compute node and they are hard tied to the data. This requires a lot of performance planning to be done upfront.
Unlike these two, Snowflake, Hive and Athena enable independent scaling helping us in managing them with much ease.
There are different things that need to be tuned and different set of expertise is required to extract the performance through these systems.
Redshift has a concept of distribution keys, which is recommended to be used for optimum performance. Though they have support for json data, the performance suffers if we start using json data type.
Snowflake enables caching at compute nodes which have SSDs. No tuning is required and it figures out hot data based on query patterns.
Hive, BigQuery and Athena have a partitioning concept and this method is recommended in order to get better performance for queries.
Each of the above mentioned warehouses have very different pricing models.
Redshift pricing is based on provisioned servers capacity. They charge on an hourly basis and we can get some discounts if we provision for longer time periods.
Snowflake pricing is based on the time the warehouse is active. If there are queries running for a particular time, let's say for a few minutes every day, then we are charged only for those minutes. The pricing starts from $2 per hour and varies on the different tiers and features they have.
Hive is a free software. If we are managing our own at on prem, then there is no charge for Hive. It charges on the provisioned server capacities and is similar to Redshift.
BigQuery and Athena charges depend on the volume of the data scanned. In other words, we can say that they have query based charging. So, when we are shooting a query, the volume of data it scans is the basis for pricing for these two.
BIgQuery also has a flat rate pricing model.Through this model customers can have more control on the pricing and they can also decide the number of slots on the cluster. This is generally preferred by larger organisations with huge volumes of data.
Which One is Right For You?
Based on our years of experience in the data analytics space we always recommend considering the complexity (ease of use) and maintainability for the warehouse. As we can see that Snowflake, BigQuery and Athena ranks the highest when it comes to ease of use. This is because of the fact that they offer a SaaS based, Server less models.
If we have predictable use cases for all our data and analytics in the future then Redshift is something we should be considering. But, given the dynamic nature of today’s businesses that is highly unlikely.
Snowflake is a good choice if the query workload is different throughout the day and our use cases are evolving. Also, if we have mixed data structure and the scale is from medium to high in terms of data volume. It also gives us the freedom of not operating our own team for managing the clusters.
Hive has the advantage of being an open source tool and has some capabilities in the machine learning space. Also, if we have a very high scale of data, Hive could be cost effective than others. But then again we need to invest in some good resources who can tune the system to get the optimum performance. But with lower volumes, Hive could be more expensive in terms of total cost of ownership for the business.
BigQuery is suitable if the usage is low and we don’t want to manage the system on our own. Similar thing works for Athena as well.
Considering these factors we need to decide where and how we want to invest. Whether we would like to invest in building a team internally having expertise to tune the specific data warehouse; do we truly have a Big Data use case; or we want to keep things simple and would like to go with a pay per use kind of a model which is more cost effective considering the Total Cost of Ownership.
As far as the performance is concerned, there is not much difference in all the cloud based data lakes. Performance varies from use case to use case.
According to us, the main factor to consider would be the maintainability and agility at which we can scale and do change management.