Best practices for using Kinesis with Sprinkle

Introduction

As technology has scaled over the years, the amount of data gathered has also increased exponentially.

The data might be of any kind: customer, transactional or other operational details. It arrives as bytes, and its volume may grow to hundreds of terabytes per hour depending on the size of the business.

Managing data in such bulk can be a tedious process. Kinesis, however, can not only stream large volumes of data but also do so in real time. Let’s plunge into Kinesis and its functional architecture.

Kinesis is a web service that gathers and streams big data in real time. Streaming means continuously generating and sending thousands of data records every second from various data sources. This differs from traditional methods, where data is collected and processed in batches rather than in real time.

The architectural flow

  • Producers: Producers generate data from various sources and send it to Kinesis.
  • Kinesis: The data produced is grouped into “shards” within Kinesis. Each shard holds the data records generated by the producers, and any number of shards can be processing simultaneously. One or more shards make up a stream. Each shard supports a maximum read rate of 2 MB per second and a maximum write rate of 1 MB per second.
  • Consumers: Consumers get records from Amazon Kinesis Data Streams and process them. These consumers are known as Amazon Kinesis Data Streams applications.
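
The per-shard limits above lend themselves to a quick capacity estimate. The following is a minimal sketch, not an official sizing formula: the function name `estimate_shards` is hypothetical, and it considers only the 1 MB/sec write and 2 MB/sec read figures quoted in this article, ignoring any other quotas a real stream may have.

```python
import math

# Per-shard limits quoted in the article.
WRITE_MB_PER_SEC_PER_SHARD = 1.0
READ_MB_PER_SEC_PER_SHARD = 2.0


def estimate_shards(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Return the smallest shard count that covers both the write and read rates.

    Hypothetical helper for illustration; it only looks at MB/sec throughput.
    """
    if write_mb_per_sec < 0 or read_mb_per_sec < 0:
        raise ValueError("throughput must be non-negative")
    for_writes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    for_reads = math.ceil(read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, for_writes, for_reads)


if __name__ == "__main__":
    # 4.5 MB/sec of writes needs 5 shards; 6 MB/sec of reads needs 3.
    print(estimate_shards(write_mb_per_sec=4.5, read_mb_per_sec=6.0))  # prints 5
```

In this example the write rate is the bottleneck, so the stream would need five shards even though three would be enough for reads.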

What Kinesis does

Kinesis was built to provide fast, seamless data ingestion. The records it deals with are mostly small in size but large in number, and may come from operational systems, transactional systems and so on. Kinesis helps with:

  • Real-time data metrics and analytics
  • Real-time aggregation and data warehousing
  • Ingestion from multiple data streams

In Kinesis, consumers are the ones that process the data generated by the producers. These consumers are called “Amazon Kinesis Data Streams applications.”

A Kinesis Data Streams application can use two types of consumers:

  • Shared fan-out consumers
  • Enhanced fan-out consumers

Consumers process all the data from the Kinesis data stream. With shared fan-out, a fixed read throughput of 2 MB/sec per shard is shared by all the consumers reading from that shard. This total does not change: it stays at 2 MB/sec per shard no matter how many consumers read from the same shard.

Consumers registered with enhanced fan-out, however, each receive their own read throughput of up to 2 MB/sec per shard, independent of other consumers. This lets multiple consumers read from the same stream in parallel without competing for throughput.
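
The difference between the two consumer types comes down to simple arithmetic. The sketch below uses a hypothetical function name and assumes, for illustration only, that shared fan-out consumers split the 2 MB/sec evenly; in practice they compete for it and the split may be uneven.

```python
SHARD_READ_MB_PER_SEC = 2.0  # per-shard read limit quoted in the article


def read_mb_per_sec_per_consumer(consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput each consumer gets from a single shard.

    Hypothetical helper: assumes an even split among shared fan-out consumers.
    """
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        # Each registered consumer gets its own 2 MB/sec per shard.
        return SHARD_READ_MB_PER_SEC
    # Shared fan-out: all consumers draw from the same 2 MB/sec per shard.
    return SHARD_READ_MB_PER_SEC / consumers


if __name__ == "__main__":
    print(read_mb_per_sec_per_consumer(4, enhanced_fan_out=False))  # prints 0.5
    print(read_mb_per_sec_per_consumer(4, enhanced_fan_out=True))   # prints 2.0
```

With four consumers on one shard, shared fan-out leaves each with roughly 0.5 MB/sec, while enhanced fan-out keeps each at up to 2 MB/sec.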

How Sprinkle makes the most of the data generated by Kinesis

The data flows from producers to consumers, which typically run on Amazon EC2 instances, and is then stored in services such as Amazon S3, Amazon Redshift, Amazon DynamoDB, Amazon EMR, etc.

Sprinkle, on the other hand, lets you visualize the analytics on a much wider scale. It takes data from various databases and ingests it into a common structure called “Flows.” These flows help build “Cubes,” which expose the data through three facets: dimensions, date dimensions and measures.

This data can be used to give customers proper analytics, for instance knowledge graphs, growth curves, Google-like search for actionable insights, and infographics on the data warehoused from Kinesis.

Sprinkle with Kinesis and the steps involved

To build a data table in Sprinkle, new data needs to be ingested from a data source, which in this case is Kinesis.

Open the “Manage data” tab, which leads to the “Data source” tab, where new data can be created. Clicking “Add new data” opens a page listing the sources from which data can be ingested; in this case, select “Kinesis.” Give the data a new name, and the new data is created.

Creating the new data brings up a page with three tabs:

  • Configure
  • Add table
  • Run and schedule

Configure:

The Configure tab requires you to fill in the “Access key”, “Secret access key” and “Region”.

These are the credentials AWS provides for accessing its services. Once they are filled in, select “Create.”

Add table:

After configuration, select the name of the stream from a drop-down list of pre-configured stream names. The table is now created.
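
To see what the Configure and Add table steps correspond to on the AWS side, here is a minimal sketch using boto3. It is not Sprinkle's implementation: `KinesisSourceConfig` and `list_stream_names` are hypothetical names, and the assumption is that the three form fields map onto boto3's `aws_access_key_id`, `aws_secret_access_key` and `region_name` client arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KinesisSourceConfig:
    """Hypothetical container for the three Configure fields."""

    access_key: str
    secret_access_key: str
    region: str

    def client_kwargs(self) -> dict:
        """Map the form fields onto boto3's client keyword arguments."""
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")
        return {
            "aws_access_key_id": self.access_key,
            "aws_secret_access_key": self.secret_access_key,
            "region_name": self.region,
        }


def list_stream_names(config: KinesisSourceConfig) -> list:
    """Return the first page of stream names visible to these credentials.

    Needs boto3 installed, valid credentials and network access.
    """
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("kinesis", **config.client_kwargs())
    return client.list_streams()["StreamNames"]
```

A call such as `list_stream_names(KinesisSourceConfig("EXAMPLE_KEY_ID", "EXAMPLE_SECRET", "us-east-1"))` would return the kind of list a stream drop-down could be filled from, provided the placeholder credentials are replaced with real ones.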

Run and schedule:

Run the table and check whether the job succeeds or fails. The number of jobs that run in parallel is set with the help of the “Concurrency” tab. The data is ingested at this point. The next step is to open the “Explores” tab and carry on with the scripting that lets Sprinkle provide actionable insights according to the users’ needs.
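
A concurrency setting of this kind caps how many jobs run at the same time. The sketch below illustrates the idea with Python's standard thread pool; it says nothing about how Sprinkle schedules jobs internally, and `ingest_table` and `run_jobs` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest_table(table_name: str) -> str:
    """Hypothetical stand-in for one ingestion job; returns a status string."""
    return f"{table_name}: success"


def run_jobs(table_names: list, concurrency: int) -> list:
    """Run one job per table, with at most `concurrency` jobs at a time."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() returns results in the same order as the input tables.
        return list(pool.map(ingest_table, table_names))


if __name__ == "__main__":
    for status in run_jobs(["orders", "clicks", "payments"], concurrency=2):
        print(status)
```

With a concurrency of 2, the third table waits until one of the first two jobs finishes.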

Kinesis generates real-time data at a rapid pace, which helps Sprinkle build better visualizations. Sprinkle’s ETL, in turn, helps businesses understand the gathered data and simplifies the analytics, so they can find the best way to make use of their data.

Written by
Soham Dutta
