Data Catalog - Build vs Buy

Data catalogs are becoming increasingly popular as a way to answer questions about data discovery and data trustworthiness. But before a business takes its first steps down this path, it needs to decide whether to buy or build a data catalog solution.

A few companies, such as Uber, Airbnb and LinkedIn, have successfully built their own data catalog solutions to fit their business models. But only a few companies possess the skills and clarity of thought to build their own catalog. Some companies spend weeks or months trying to build one, only to end up with no successful implementation or adoption across the business after spending hundreds of thousands of dollars.

Building vs Buying at a Glance


Understanding the Building Procedure

Building a data catalog takes considerable time and resources. Let's take a quick look at the process of building a data catalog solution.

People

Building a data catalog is labor-intensive. A new product requires a dedicated team to research best practices, develop the product and finally implement it.

Research suggests that at least five data engineers are required to look after a data catalog product. The number goes even higher while the data catalog is being developed and implemented.

Time

Even with open source data catalog solutions such as Amundsen and CKAN available, businesses find it very difficult to launch their own data catalogs. These open source data catalogs are free only on paper; the effort businesses have to put into them is costly.

A spokesperson for a financial institution in the US describes building a data catalog solution as a frustrating journey. He explains that it took more than eight months to deploy a new data catalog solution.

When considering building a data catalog solution, we always need to keep the competitive landscape in mind. Is it possible to go another year without a proper data catalog solution in place?

Planning and Designing 

Modern businesses are investing more and more resources in data management solutions. But allocating resources to every data project doesn't necessarily give the best results. Most homegrown data catalogs are built with today's problem statement in mind, and then take about a year to design and deploy. This means that on day one the data catalog is already using last year's technology, which is not acceptable for modern businesses.

Homegrown catalogs can also become obsolete. Here is how:

  1. Most probably they won't be compatible with some of the tools we might adopt a few years down the line.
  2. Keeping up with data catalog standards is difficult because they change so rapidly.

These problems can be avoided if we have an internal, dedicated data catalog team alongside our IT team.

Processes involved after Building

Building a data catalog is just the beginning. We need to look after it constantly. It is therefore important to have a development cycle, provide adequate support for queries and keep the catalog up to date.

Maintenance

The work for a team doesn't end with developing a data catalog solution. On the contrary, that is just the beginning. Research suggests that a minimum of six to seven data engineers are required to maintain a data catalog, which is far costlier than buying a data catalog solution built for the cloud.

There are a lot of hidden costs involved in the process. Once support and maintenance costs are included, we generally end up paying 2x to 3x the cost of buying a data catalog.
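To make the comparison concrete, here is a minimal back-of-the-envelope sketch in Python. Every figure in it is a hypothetical assumption chosen for illustration (the cost per engineer, the license price and the time horizon are not taken from any real vendor or survey); only the team size of six engineers comes from the estimate above. Plug in your own numbers before drawing any conclusion.

```python
def build_cost(engineers, annual_cost_per_engineer, years):
    """Total staffing cost of maintaining a homegrown catalog."""
    return engineers * annual_cost_per_engineer * years


def buy_cost(annual_license, years):
    """Total subscription cost of a purchased catalog."""
    return annual_license * years


# Hypothetical inputs, for illustration only.
engineers = 6                       # maintenance team size mentioned above
annual_cost_per_engineer = 120_000  # assumed fully loaded cost, in dollars
annual_license = 250_000            # assumed vendor price, in dollars
years = 3                           # assumed planning horizon

build = build_cost(engineers, annual_cost_per_engineer, years)
buy = buy_cost(annual_license, years)
ratio = build / buy

print(f"Build: ${build:,}  Buy: ${buy:,}  Ratio: {ratio:.2f}x")
```

With these assumed inputs the build option comes out at a little under three times the buy option. The result is driven entirely by the inputs, so the sketch is useful mainly as a template for your own estimate.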

Staying Competitive

Some of the best-known data portals are maintained by governments, for example Data.gov and data.gov.uk, which are built on CKAN. These portals function properly and cater to government requirements because they have dedicated resources behind them. But for modern businesses it's becoming increasingly difficult to allocate dedicated resources. Also, the way data is used today has changed considerably since these open source platforms were launched. This shows up in several ways:

Difficulty in Finding Data: A search can return a deprecated data set as a result.

Difficulty in Using Data: Most of the datasets are Excel or CSV files. Before using them we need to clean and normalize the data and then ingest it into another tool for analysis.

Difficulty in Understanding the Data: The documentation is very limited. Generally there is only a single contact email and the name of the data set.

Piling up of Cost: Maintaining a data catalog for the long haul is like spending on life support rather than on innovation. Companies like Airbnb, Uber and a few others have developed their own because they have a clear point of view.

A mature data-driven culture is a must for the successful design and implementation of a data catalog solution. For businesses that are still defining their data strategy, investing in a homegrown data catalog solution involves a lot of risk, as the direction of the data strategy can change multiple times.
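As an illustration of the "clean and normalize" step mentioned under difficulty in using data, here is a minimal sketch using only Python's standard library. The function name `normalize_rows` and the sample columns are hypothetical; a real portal export would likely need more handling (types, encodings, duplicate rows).

```python
import csv
import io


def normalize_rows(csv_text):
    """Trim whitespace, standardize column names and drop empty rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        record = {}
        for key, value in row.items():
            if key is None:
                # Extra cells beyond the header row are ignored.
                continue
            name = key.strip().lower().replace(" ", "_")
            record[name] = value.strip() if value is not None else ""
        # Keep the row only if at least one cell has content.
        if any(record.values()):
            cleaned.append(record)
    return cleaned


# Hypothetical sample resembling a messy CSV export.
raw = " Region ,Total Sales\n North , 1200 \n,\nSouth,950\n"
rows = normalize_rows(raw)
print(rows)
```

Only after a pass like this can the records be loaded into a separate analysis tool, which is the extra effort the point above refers to.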

Is Buying the Answer to All Problems?

User Experience

Modern data catalog teams understand the way modern businesses operate. They have dedicated UX teams that keep in mind that the design should be comfortable for everyone in the organisation. This results in better data governance and a better experience for the whole organization.

Sprinkle has a dedicated team that works on feedback from existing clients and continuously improves the UX for user convenience.

Service and Software Expertise

Most data catalog vendors offer a service component along with their solution. This adds a human touch to the solution, which is very important. The experts from the data catalog team reinforce best practices and enable proper usage of the data catalog. We need to work with a vendor who is willing to go the extra mile on training and sharing expertise when it comes to deploying the data catalog. For example, Sprinkle's dedicated solutions engineering team works as an extended arm of its clients and regularly holds training sessions for customers.

Finally, it's very important to consider whether a vendor is a leader in the industry. The vendor must be one step ahead of the curve. For example, the Sprinkle team comes with more than two decades of experience in the data analytics industry and always strives to improve the product and customer experience.

Conclusion 

Young data-driven organisations that are still deciding on their data journey and have limited resources should go for an existing data catalog solution on the market. It doesn't make sense for them to spend a lot of time and money building one in-house, only to have a different strategy in place by the time it's developed. Mature organizations can consider building a data catalog, provided they are willing to keep a dedicated team looking after the solution even after it's built.