Best practices for using Kinesis with Sprinkle


Amazon Kinesis is a powerful service designed to handle large-scale data streaming in real-time. It allows you to collect, process, and analyze streaming data, enabling you to gain timely insights and react quickly to new information. This is particularly useful for applications that require real-time data processing, such as monitoring clickstream data, logs from applications and servers, and data from Internet of Things (IoT) devices.

As the volume of data generated by these sources continues to grow, the need for efficient streaming data solutions becomes more critical. Amazon Kinesis Data Streams can ingest real-time data at large scales, while Amazon Kinesis Data Analytics provides tools for transforming and analyzing this data in real time. Additionally, Amazon Kinesis Data Firehose reliably loads streaming data into data lakes, data stores, and analytics services, ensuring that your data is always available for analysis.

Introduction

As technology has scaled over the years, the amount of data gathered has also grown exponentially.

This data might be of any kind: customer, transactional, or other operational details. It is gathered as raw bytes, and its volume can reach hundreds of terabytes per hour depending on the scale of the user's business.

Managing such bulk data can be a tedious process. Kinesis, however, is capable not only of streaming large volumes of data but of streaming it in real time. Amazon Kinesis Data Firehose is a solution for reliably loading streaming data into various data stores and analytics services, including transforming the data using Lambda before loading it. Let’s plunge into Kinesis and its functional architecture.

Kinesis is a web service that gathers and streams big data in real time. Streaming here means generating and sending thousands of data records every second from various data sources. This differs from traditional approaches, where data is collected and processed in batches rather than streamed continuously.

The architectural flow

  • Producers: The producers are the users and applications that generate data from various sources.
  • Kinesis: The data produced is grouped into “shards” in Kinesis’s system; each shard holds an ordered sequence of the data records generated by the producers. Any number of shards can be processed simultaneously.

The partition key plays a crucial role in determining the shard placement for each data record within the stream.

A single shard or a number of shards make up a stream. Each shard supports a maximum read rate of 2 MB per second and a maximum write rate of 1 MB per second. Kinesis routes data records to different shards within a stream, allowing data ingestion and processing to be parallelized and improving stream performance (a minimal producer sketch follows the list below).

  • Consumers: Consumers get records from Amazon Kinesis Data Streams and process them. These consumers are known as Amazon Kinesis Data Streams applications.
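
To make the relationship between partition keys and shards concrete, here is a minimal producer sketch using boto3 (the AWS SDK for Python, assumed here for illustration). The stream name and event fields are hypothetical examples.

```python
# A minimal producer sketch: send one JSON event into a Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "action": "page_view", "page": "/pricing"}  # hypothetical event

# The partition key decides which shard receives the record; records that
# share a partition key go to the same shard, preserving their order.
response = kinesis.put_record(
    StreamName="clickstream-events",          # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),   # the data blob
    PartitionKey=event["user_id"],
)

print(response["ShardId"], response["SequenceNumber"])
```

Records that share a partition key always land on the same shard, so per-key ordering is preserved while different keys spread the load across shards.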

What Kinesis Data Streams does

Kinesis was built to provide fast, seamless data ingestion. The individual records Kinesis deals with are usually small, but they arrive in very large numbers, and they may come from operational systems, transactional systems, and so on. A data record in Kinesis consists of a partition key and a data blob, with the data blob holding the actual payload of the record. Kinesis helps in the ingestion of:

  • Real-time data metrics and analytics
  • Real-time aggregation and data warehousing
  • Ingestion from multiple data streams

In Kinesis, consumers are the ones who process the data generated by the producers. These consumers are called “Amazon Kinesis Data Streams applications.” Kinesis Data Analytics can also be used to transform and analyze streaming data in real time, offering an alternative way to process the stream.

An Amazon Kinesis Data Streams application can use two types of consumers:

  • Shared fan-out consumers
  • Enhanced fan-out consumers

Consumers process all the data from the Kinesis data stream. With shared fan-out, multiple consumers can read from the same stream in parallel, but they share a fixed read throughput of 2 MB/sec per shard. This limit does not change: the 2 MB/sec per shard is shared no matter how many consumers read from the same shard.

Consumers registered with enhanced fan-out, on the other hand, each receive their own dedicated read throughput of up to 2 MB/sec per shard, independently of one another.
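
For illustration, here is a hedged sketch of a shared fan-out consumer built with boto3; production consumers more commonly use the Kinesis Client Library, and the stream name below is a hypothetical example.

```python
# A shared fan-out consumer sketch: poll one shard with GetRecords.
# All consumers reading this way share the shard's 2 MB/sec read limit.
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "clickstream-events"  # hypothetical stream name

# Pick the first shard of the stream for this sketch.
shards = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"]
shard_id = shards[0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay well under the per-shard GetRecords limits
```

With enhanced fan-out, the application would instead register itself once (register_stream_consumer) and then receive its own dedicated 2 MB/sec per shard over SubscribeToShard, which is typically handled by the Kinesis Client Library 2.x rather than a hand-written polling loop.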

How Sprinkle makes the most of the data generated by Kinesis Data Analytics

From producers to consumers to processing on Amazon EC2 (Elastic Compute Cloud) instances, the data is then warehoused in destinations such as Amazon S3, Amazon Redshift, Amazon DynamoDB, Amazon EMR, and so on. Additionally, Kinesis Data Firehose can be used to reliably load streaming data into these data stores and to transform the data before it is loaded.

Sprinkle, on the other hand, lets you visualize the analytics on a much wider scale. It takes data from various databases and ingests it into a common structure called “Flows.” These flows help build “Cubes,” which expose the data along three facets: dimensions, date dimensions, and measures.

This data can then be used to give customers proper analytics, for instance knowledge graphs, growth curves, a Google-like search for actionable insights, and infographics on the data warehoused from Kinesis.

Sprinkle with Kinesis and the steps involved

To build a data table in Sprinkle, a new data source needs to be ingested from a database; in this case, that source is Kinesis.

The “Manage data” tab is accessed, which routes to the “Data source“ tab where a new data source can be created. On clicking “Add new data”, a new page displays the sources from which data can be ingested; in this case, “Kinesis” is selected. The data source is given a name, and the new data source is created.

On creating the new data source, a page comes up with three tabs:

  • Configure
  • Add table
  • Run and schedule

Configure:

The Configure tab requires you to fill in the “Access key”, “Secret access key” and “Region”.

These are the credentials provided by AWS to access its services. On completion, “Create” is selected.
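
As a quick sanity check, the same three values can be tested outside Sprinkle before they are entered in the Configure tab. This is a hedged sketch using boto3; the key values and region are placeholders.

```python
# Verify that the access key, secret key and region can reach Kinesis.
import boto3

kinesis = boto3.client(
    "kinesis",
    aws_access_key_id="AKIA...",        # "Access key" (placeholder)
    aws_secret_access_key="<secret>",   # "Secret access key" (placeholder)
    region_name="us-east-1",            # "Region" (placeholder)
)

# If the credentials are valid, this prints the stream names the IAM user
# can see -- the same streams Sprinkle will be able to read.
print(kinesis.list_streams()["StreamNames"])
```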

Add table:

After the configuration step, the stream name is selected from a drop-down list of the available Kinesis streams, and the table is created.
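
If the expected stream does not appear in the drop-down, a hedged way to confirm it exists and is active in the configured region is sketched below; “clickstream-events” is a hypothetical stream name.

```python
# Check that the stream Sprinkle should ingest from is ACTIVE.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

summary = kinesis.describe_stream_summary(StreamName="clickstream-events")
print(summary["StreamDescriptionSummary"]["StreamStatus"])    # expect "ACTIVE"
print(summary["StreamDescriptionSummary"]["OpenShardCount"])  # number of open shards
```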

Run and schedule data records:

The table is run and checked for success or failure, and the number of jobs that run in parallel is set through the “Concurrency” tab. The data is ingested at this point. Next, the “Explores” tab is accessed to carry on with scripting, which allows Sprinkle to provide actionable insights tailored to the user’s needs.

Kinesis streams real-time data at a rapid pace, which helps Sprinkle build better visualizations. Sprinkle’s ETL, in turn, allows businesses to understand the gathered data and simplifies the analytics, helping them find the best way to make use of that data.

Written by
Soham Dutta
