Blogs

Best practices to use Apache Kafka

Best practices to use Apache Kafka with Sprinkle

In this day and age, the collection of records and data are made in abundance irrespective of the business. There’s a growing need in the market to stream these large data in real-time. However, streaming large data might seem to be a tedious and time-consuming process but with Apache Kafka, the streaming process is eased up. Let’s plunge deep into this.

Kafka is an open-source publishing/subscribing messaging system developed by LinkedIn. Streaming large data on a real-time basis is a stern test and it’s a time-consuming process but with Kafka, the data is streamed into the server as it comes.

Initially, Kafka was used as an activity tracking pipeline but later on it was used as an operational data monitoring system. Kafka was generally claimed to be an alternative for log aggregation but it gives a cleaner abstraction and as a whole, Kafka is better than most message brokers.

Kafka’s data messaging system comes in 4 steps:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams. Stream processors can stream data as it comes in, they extract the log data that’s required in real-time and processes.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table. Connector modules are external plug-in modules for various databases which can publish or receive new messages.

What Apache Kafka does

Kafka is the data messaging system which is used to ingest real-time data at a rapid pace. Producers are the ones who publish the data messages in the form of “Topics”, a unique name that’s given for a data set. These kinds of unique names are provided for easy identification of the data stream.

As producers publish/stream messages through topics to the Kafka server, the consumers on the other end receive/subscribe the data as it comes in. These messages fall under the term “Topics.”

Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing.

Kafka has multiple servers running multiple processes all at once and can distribute messages from multiple servers to multiple “Consumer groups” in real-time. As these data records are large, they cannot be streamed into a single Kafka server, hence the data records are partitioned.

Producers are the ones who decide the number of partitions big data requires. These partitions are stored on individual systems, with some simple mathematics, the partitions can be calculated in an orderly fashion. However, a large number of partitions can only be handled by dividing the work between the “Consumers” i.e. “Consumer group.”

Kafka, a distributed system depends on a set of computers which act as the servers. These group of computer systems acting together for a common purpose is termed as “Clusters.”

Sprinkle with Kafka and the steps involved

Sprinkle supports ingesting data from many data sources, one of them being Kafka. The following section describes how you can add Kafka as a Data Source in Sprinkle.

Click through the “Manage data” tab which routes you to the “Data source” tab in which new datasource could be created. On clicking “Add new datasource”, a modal displays the types of data sources through which the data can be ingested, in this case, the data source is “Kafka". The data is given a new name, and thus new data is created.

On creating new data, a new page pops up with three tabs

  • Configure
  • Add table
  • Run and schedule

Configure:

The configure table requires you to select one of the two radio buttons “Zookeeper” and “Bootstrap server.”

  • Zookeeper: The configuration, synchronization services in Apache Kafka are acquired with the hierarchical key value called Zookeeper.
  • Bootstrap server: Bootstrapping is uniform authentication of the user tools and servers unknown to each other which exchange a secret session key.

The Zookeeper radio button requires you to fill in “Zookeeper connection” whereas the Bootstrap radio button requires you to fill in both “Bootstrap server” tab and also the “Zookeeper connection” tab.

You can connect Kafka by providing a Zookeeper Address or a Bootstrap Server Address. Whichever one you choose, it should be accessible by the Compute Driver.

Add table:

After configuring the data source, you can add tables to it. Here you can configure a topic to be ingested as a table in the warehouse. We assume the records in the topic are in “JSON” format. “JSON” being unstructured data, you need to also provide a “Schema” for the table in the warehouse. Depending on the warehouse, you may be able to use complex types (map/array/struct, etc) as well.

Note that adding tables doesn’t immediately import the table. You need to go to the “Run and Schedule Tab” and run the ingestion Job. After the configuration process, the “Topic name” is updated and “Hive schema” tab is dropped down where it represents the schema of the record i.e. “string”, “int”, etc based on the client’s needs. This brings a wide range of options in which the table can be created.

Run and schedule:

This tab shows settings for Ingestion Jobs. You can run an Ingestion Job by clicking on the Run Button. The section below shows a (limited) list of past jobs. An Ingestion job consists of a combination of “Tasks”. Generally, one table translates to one task. You can control the parallelism of these tasks on the left. You can also control how many times a task should be retried if it fails.

Once you click on “Run”, One Ingestion Job starts and it spears up to the top of the list of jobs. You can expand that to see the current and upcoming tasks. Sprinkle Provides you real-time updates on what exactly is happening in your Ingestion job. In the “Jobs List”, you can also see statistics about ingestion like the number of records, number of bytes fetched, etc.

If the job fails, you will be able to see the error in the message column. The job can fail due to the following reasons:

  • Incorrect/Unreachable endpoints provided in the “Configure Tab”. (Double-check the values you have provided and also check the connectivity from your side.)
  • Incorrect configuration in a table. (Delete the table and add again with correct values.)

The next step is to schedule the ingestion to run at various frequencies (the default being nightly).

The table is allowed to run and checked if it is a successful process or a failure, the number of jobs that are supposed to run in parallel is scheduled with the help of “Concurrency” tab. The data is ingested at this point. There is no assurance for a job to succeed after this step, so the number of times the failed jobs need to be re-run could also be scheduled.

The next step is where the “Explores” tab is accessed to carry on with the process of scripting which allows Sprinkle to provide actionable insights as per the need of the users.

Apache Kafka’s distributed streaming system helps in generating real time data at very short time intervals, this helps sprinkle build a better visualization. However, Sprinkle’s ETL allows businesses to understand the gathered data and simplify the analytics part that lets businesses understand the best way to make use of data.