A Complete guide for selecting the Right Data Warehouse - Snowflake vs Redshift vs BigQuery vs Hive vs Athena
A Data Warehouse is the basic platform required today for any data driven business. A Data Warehouse is a one stop repository for all your business data across verticals and channels. Data may be coming from your web or mobile application, website, CRM systems or several marketing channels. Now, getting all this data at one place is the key to driving business insights and future planning as well. With the emergence of Cloud services Data Warehouse has also evolved from being on premise, accessible to only a few members of the IT team and requiring purified, structured data to being on the cloud, and able to process semi structured as well as unstructured data. This is where a new term ‘Data Lake’ came into existence. A Data Lake is a modern, much more advanced version of the old Data Warehouse. Data Warehouse is a much traditional concept as compared to Data Lakes and has been around for a few decades now. Whereas, the term Data Lake has been coined around 4 to 5 years back, with the emergence of Big Data and varied use cases for businesses.
Let's first understand the basic difference between a Data Lake and a Data Warehouse. Only then would we be in a position to evaluate the multiple Data Warehouses/Data Lakes available in the market and make an informed decision about the one that suits our requirements the best.
The Data Warehouses/Lakes that we would be comparing are
In today’s cloud world, deployment is one of the most important characteristics of any software or tool that we choose for our business. For the DataWarehouse/Lake we will be comparing them on 3 deployments modes. They are SaaS, Server-Less, Cloud/On-Prem.
Next to Deployment, data loading is one of the most important criteria for comparing available Data Warehouse/Lake solutions in the market today. It is one of the most important aspect of data warehousing.
By doing this comparison we’ll come to know how easy or difficult it actually is to get data inside the Data Warehouse/Lake and also how easy or difficult it is querying the data inside the Data Warehouse/Lake.
A lot of data even starts appearing from the cloud storage. Our raw business data, from different data sources could be lying in places like My SQL (transaction data), Kafka and Kinesis (event data) and also from external services through REST API. All these data needs to be ingested into the data warehouse/lake. Different warehouses provide different mechanisms for data ingestion from these data sources.
In Redshift, the data has to be loaded into the warehouse using APIs wherein it organizes data in optimal form so it is performant on query. Recently Redshift has added support for external tables using Redshift spectrum. Spectrum is where we can point Redshift to S3 storage and define the external table enabling us to read the data lying there using SQL query. Adding Spectrum has enabled Redshift to offer services similar to a Data Lake.
Snowflake and Hive supports external table loading and has been very popular over time.
BigQuery doesn’t have an external table concept. Data needs to be loaded in BigQuery where it gets optimised enabling much better query performance.
Athena doesn’t require any data loading. With Athena, we can say that everything is an external table. All the data in S3 can be cataloged using the metastore, create a logical table in Athena and we can query all the data lying in S3 directly from Athena using SQL. Its clearly schema on read
All of the data warehouses/lakes support most of the popular file formats like CSV, JSON, ORC, Avro, Parquet.
Streaming inserts is something which is only provided by Big Query. It actually means inserting or updating data at row level. Although, some warehouses do provide updates or insertion in real time but they don’t scale at a level at which BIg Query can scale. These warehouses insert data in batches or micro batches and not streams like BigQuery. As soon as the record is inserted in Big Query, it is available for querying.
Hive has added support for real time updates using their ACID features. But, performance is still a concern using ACID. At lower scale it works but at a higher scale it has a lot of performance implications.
Considering these factors, we can say that loading data from different data sources into our preferred data warehouse is not straightforward as it seems. It requires multiple level of customization if we are loading data in Snowflake vs Redshift vs BigQuery for the performance to be optimum.
Data Loading into the Data Warehouse/Lake is just a means to the end. The end being able to run query on the data lying in the data warehouse.
While some of the query engines used by the above mentioned data warehouses/lakes are based on open source engines, some warehouses do have their proprietary search engine as well. For example, Redshift uses the query engine from a company called ParAccel.
Snowflake has its own proprietary query engine. They have separated the storage and compute. They have optimised the performance at different layers.
Hive uses an open source query engine which has been popular for quite a long time.
BigQuery has based its search engine on the Google Dremel algorithm. Although the algorithm is available publicly, the implementation is proprietary.
Athena is based on an open source query engine called Presto. It is a query engine developed by Facebook. Managing presto is a huge task like managing Hive. So, Amazon has created a SaaS service on top of Presto so users don’t have to manage the Presto cluster.
Redshift uses Postgresql, Snowflake, BIgQuery and Athena uses ANSI SQL whereas Hive uses its own HQL for querying.
Complex Data Types
Apart from Redshift, all other Data Warehouses (Snowflake, Hive, BigQuery, Athena) supports complex data structures like arrays, maps and structs as first class citizens. Redshift supports only primitive data types.
UDF is a function which operates at row and creates a single value. Whereas UDAF is a function which operates on multiple rows and then creates value.
Redshift and Athena support only UDF and not aggregated functions. However, with Snowflake, Hive, BigQuery we can define both UDF and UDAF with ease.
One of the major differences between the above mentioned warehouses/lakes is the ability to scale independently. The data can sit at one place and we can provision multiple data warehouses separately on the same data. This enables us to create multiple clusters on the same data for different teams based on their respective SLAs.
Redshift and BigQuery don't support this independent scaling as the data sits on the compute node and they are hard tied to the data. This requires a lot of performance planning to be done upfront.
Unlike these two, Snowflake, Hive and Athena enable independent scaling helping us in managing them with much ease.
There are different things that need to be tuned and different set of expertise is required to extract the performance through these systems.
Redshift has a concept of distribution keys, which is recommended to be used for optimum performance. Though they have support for json data, the performance suffers if we start using json data type.
Snowflake enables caching at compute nodes which have SSDs. No tuning is required and it figures out hot data based on query patterns.
Hive, BigQuery and Athena have a partitioning concept and this method is recommended in order to get better performance for queries.
Each of the above mentioned warehouses have very different pricing models.
Redshift pricing is based on provisioned servers capacity. They charge on an hourly basis and we can get some discounts if we provision for longer time periods.
Snowflake pricing is based on the time the warehouse is active. If there are queries running for a particular time, let's say for a few minutes every day, then we are charged only for those minutes. The pricing starts from $2 per hour and varies on the different tiers and features they have.
Hive is a free software. If we are managing our own at on prem, then there is no charge for Hive. It charges on the provisioned server capacities and is similar to Redshift.
BigQuery and Athena charges depend on the volume of the data scanned. In other words, we can say that they have query based charging. So, when we are shooting a query, the volume of data it scans is the basis for pricing for these two.
BIgQuery also has a flat rate pricing model.Through this model customers can have more control on the pricing and they can also decide the number of slots on the cluster. This is generally preferred by larger organisations with huge volumes of data.
Based on our years of experience in the data analytics space we always recommend considering the complexity (ease of use) and maintainability for the warehouse. As we can see that Snowflake, BigQuery and Athena ranks the highest when it comes to ease of use. This is because of the fact that they offer a SaaS based, Server less models.
If we have predictable use cases for all our data and analytics in the future then Redshift is something we should be considering. But, given the dynamic nature of today’s businesses that is highly unlikely.
Snowflake is a good choice if the query workload is different throughout the day and our use cases are evolving. Also, if we have mixed data structure and the scale is from medium to high in terms of data volume. It also gives us the freedom of not operating our own team for managing the clusters.
Hive has the advantage of being an open source tool and has some capabilities in the machine learning space. Also, if we have a very high scale of data, Hive could be cost effective than others. But then again we need to invest in some good resources who can tune the system to get the optimum performance. But with lower volumes, Hive could be more expensive in terms of total cost of ownership for the business.
BigQuery is suitable if the usage is low and we don’t want to manage the system on our own. Similar thing works for Athena as well.
Considering these factors we need to decide where and how we want to invest. Whether we would like to invest in building a team internally having expertise to tune the specific data warehouse; do we truly have a Big Data use case; or we want to keep things simple and would like to go with a pay per use kind of a model which is more cost effective considering the Total Cost of Ownership.
As far as the performance is concerned, there is not much difference in all the cloud based data lakes. Performance varies from use case to use case.
According to us, the main factor to consider would be the maintainability and agility at which we can scale and do change management.