Best practices for using Apache Kafka with Sprinkle
Businesses today generate records and data in abundance, regardless of industry, and there is a growing need to stream this data in real time. Streaming large volumes of data might seem tedious and time-consuming, but Apache Kafka eases the process considerably. Let’s plunge into this.
Kafka is an open-source publish/subscribe messaging system originally developed at LinkedIn. Streaming large volumes of data in real time is a stern, time-consuming test, but with Kafka the data is streamed into the server as it comes.
Initially, Kafka was used at LinkedIn as an activity-tracking pipeline and was later adopted for monitoring operational data. Kafka is often presented as an alternative for log aggregation, but it offers a cleaner abstraction, and for high-throughput streaming it holds up better than most traditional message brokers.
Kafka’s messaging flow comes down to four parts, described below: producers, topics split into partitions, the broker cluster, and consumer groups.
What Apache Kafka does
Kafka is a data messaging system used to ingest real-time data at a rapid pace. Producers publish data messages to “Topics”, where a topic is a unique name given to a data set so that its stream can be easily identified.
As producers publish messages to topics on the Kafka server, consumers on the other end subscribe to those topics and receive the data as it comes in.
Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing.
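As a minimal sketch of such a pipeline (using the kafka-python client; the broker address and topic names below are made up for illustration), a stage can consume raw JSON records, enrich them, and publish the result to a new topic:

import json
from kafka import KafkaConsumer, KafkaProducer

BROKERS = "broker1:9092"  # hypothetical broker address

# Read raw records from one topic...
consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers=BROKERS,
    group_id="enrichment-pipeline",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
# ...and write transformed records to another.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    event["enriched"] = True  # stand-in for real enrichment logic
    producer.send("enriched-events", value=event)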
Kafka runs as multiple servers handling many processes at once, and it can distribute messages from multiple servers to multiple “Consumer groups” in real time. Because these data streams are too large to be handled by a single Kafka server, the records of a topic are split into partitions.
The number of partitions a topic needs is decided when the topic is set up, and producers decide which partition each record lands in. These partitions are stored on individual servers, and with some simple mathematics (typically hashing the record key and taking it modulo the partition count) records are routed to partitions in an orderly fashion. A large number of partitions can only be handled by dividing the work between several “Consumers”, i.e. a “Consumer group.”
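As a small sketch of that routing (the broker address, topic, and key below are made up for illustration), the default partitioner in the kafka-python client hashes the record key so that all records with the same key land in the same partition:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1:9092")  # hypothetical broker

# Records sharing a key hash to the same partition, so events for
# "user-42" stay in order within that partition.
producer.send("page-views", key=b"user-42", value=b'{"page": "/home"}')
producer.send("page-views", key=b"user-42", value=b'{"page": "/pricing"}')
producer.flush()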
Kafka, being a distributed system, depends on a set of computers that act as its servers. Such a group of computer systems acting together for a common purpose is termed a “Cluster.”
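To sketch how a consumer group spreads work across the cluster (again with made-up broker and topic names), running the same consumer script twice with the same group_id makes Kafka assign each instance its own share of the topic’s partitions:

from kafka import KafkaConsumer

# Start this script in two terminals: both instances join the same
# consumer group, and Kafka splits the partitions between them.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="broker1:9092",
    group_id="analytics-workers",
)
for record in consumer:
    print(record.partition, record.offset, record.value)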
Sprinkle with Kafka and the steps involved
Sprinkle supports ingesting data from many data sources, one of them being Kafka. The following section describes how you can add Kafka as a Data Source in Sprinkle.
Click the “Manage data” tab, which routes you to the “Data source” tab, where a new data source can be created. On clicking “Add new datasource”, a modal displays the types of data sources from which data can be ingested; in this case, the data source is “Kafka”. Give the data source a name, and the new data source is created.
On creating the new data source, a page opens with three tabs:
Configure:
The Configure tab requires you to select one of two radio buttons: “Zookeeper” or “Bootstrap server.”
The Zookeeper option requires you to fill in the “Zookeeper connection” field, whereas the Bootstrap option requires you to fill in both the “Bootstrap server” field and the “Zookeeper connection” field.
You can connect Kafka by providing a Zookeeper Address or a Bootstrap Server Address. Whichever one you choose, it should be accessible by the Compute Driver.
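To give a rough idea of the two address formats (the host names and ports below are placeholders), a Bootstrap Server address points directly at one or more brokers, while a Zookeeper connection string points at the Zookeeper ensemble. One quick way to check that the brokers are reachable from the Compute Driver host is a few lines with the kafka-python client:

from kafka import KafkaConsumer

BOOTSTRAP = "broker1.internal:9092,broker2.internal:9092"  # Bootstrap server format
ZOOKEEPER = "zk1.internal:2181,zk2.internal:2181/kafka"    # Zookeeper connection format

# If the brokers are reachable from this host, listing topics succeeds.
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
print(consumer.topics())
consumer.close()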
Add table:
After configuring the data source, you can add tables to it. Here you configure a topic to be ingested as a table in the warehouse. We assume the records in the topic are in JSON format. Since JSON is not rigidly structured, you also need to provide a “Schema” for the table in the warehouse. Depending on the warehouse, you may be able to use complex types (map/array/struct, etc.) as well.
Note that adding a table doesn’t immediately import it; you need to go to the “Run and Schedule” tab and run the ingestion job. After the configuration step, the “Topic name” is set and the “Hive schema” drop-down describes the type of each field in the record, i.e. “string”, “int”, etc., based on the client’s needs. This gives a wide range of options for how the table can be created.
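For instance, a hypothetical JSON record and a matching Hive-style schema for the warehouse table might look like this (the field names and types are illustrative only):

import json

# A record as it might appear in the Kafka topic:
record = {"user_id": 42, "page": "/pricing",
          "ts": "2021-06-01T10:15:00Z", "tags": ["campaign", "mobile"]}
print(json.dumps(record))

# A matching Hive-style schema for the warehouse table could be:
#   user_id INT, page STRING, ts STRING, tags ARRAY<STRING>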
Run and schedule:
This tab shows the settings for ingestion jobs. You can run an ingestion job by clicking the Run button, and the section below it shows a (limited) list of past jobs. An ingestion job consists of a combination of “Tasks”; generally, one table translates to one task. You can control the parallelism of these tasks on the left, as well as how many times a task should be retried if it fails.
Once you click “Run”, an ingestion job starts and moves to the top of the list of jobs. You can expand it to see the current and upcoming tasks; Sprinkle provides real-time updates on what exactly is happening in your ingestion job. In the jobs list, you can also see ingestion statistics such as the number of records and the number of bytes fetched.
If the job fails, you will be able to see the error in the message column, for example when the Kafka cluster isn’t reachable from the Compute Driver or when incoming records don’t match the configured schema.
The next step is to schedule the ingestion to run at various frequencies (the default being nightly).
The ingestion is run and checked for success or failure. The number of jobs allowed to run in parallel is set through the “Concurrency” control, and the data is ingested at this point. Since a job is not assured to succeed, the number of times a failed job should be retried can also be scheduled.
The next step is to open the “Explores” tab to carry on with querying and scripting, which allows Sprinkle to provide actionable insights tailored to the users’ needs.
Apache Kafka’s distributed streaming delivers real-time data at very short intervals, which helps Sprinkle build better visualizations. On top of that, Sprinkle’s ETL lets businesses understand the gathered data and simplifies the analytics, so they can work out the best way to make use of their data.