The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
Sprinkle supports a wide range of data sources, and the full list is shown in the Create Datasource tab. On selecting the HDFS tab, the user is first asked to name the datasource. After naming, Sprinkle routes the user to the configuration page, where credentials such as the URL, WebHDFS URL, and SOCKS Host must be filled in before testing the connection and updating.
Optimising Incremental Ingestion in an HDFS Datasource
Users can also select Yes or No for Optimize Incremental Ingestion. If Optimize is set to Yes, all datasets undergo a periodic full ingestion (every Sunday or every night). If Optimize is set to No, data is always ingested incrementally and never undergoes a complete ingestion.
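The scheduling rule above can be expressed as a small decision function. This is a sketch of the described behaviour, not Sprinkle's implementation; the function and parameter names (and the choice of Sunday as the full-refresh day) are illustrative.

```python
from datetime import date

def ingestion_mode(optimize: bool, run_date: date, full_refresh_weekday: int = 6) -> str:
    """Pick 'full' or 'incremental' for a scheduled run.

    With Optimize on, a periodic full ingestion (here: Sundays,
    weekday 6) replaces the incremental run; with Optimize off,
    every run is incremental.
    """
    if optimize and run_date.weekday() == full_refresh_weekday:
        return "full"
    return "incremental"
```

For example, with Optimize on, a run dated on a Sunday returns `"full"`, while any weekday run, or any run with Optimize off, returns `"incremental"`.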
In the Dataset page, the user can add tables using either the Table or the Query method. In the Table method, the user provides a table name and can apply a filter clause whenever required.
Users give the table a name and choose the mode of ingestion, i.e. Complete or Incremental. For incremental ingestion, a time column name must also be specified; this is not required for complete ingestion.
In the Ingestion Jobs tab, the concurrency (the number of tables that can run in parallel, up to a maximum of 7) can be set before running the job. Once the job completes, its status is updated in the tab below. Jobs can also be set to run automatically by enabling Autorun; by default, the frequency is every night. The frequency can be changed by clicking on More --> Autorun --> Change Frequency.
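The concurrency setting maps naturally onto a bounded worker pool. A minimal sketch, assuming a hypothetical per-table `ingest` callable, shows how the documented cap of 7 parallel tables could be enforced:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 7  # documented maximum number of tables in parallel

def run_ingestion_jobs(tables, ingest, concurrency=MAX_CONCURRENCY):
    """Ingest tables in parallel, never exceeding the concurrency cap.

    `ingest` is a placeholder for whatever ingests a single table;
    results are returned in the same order as `tables`.
    """
    workers = max(1, min(concurrency, MAX_CONCURRENCY))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ingest, tables))
```

Requesting a concurrency above 7 would simply be clamped to the cap, matching the limit described above.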