Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data, and for making files available for distributed access. In Sprinkle, this storage can be used as a cloud storage data source.
Sprinkle supports a wide range of data sources. Clicking the “+” sign brings up a list of data sources. In this case, Azure Blob is selected from the list, and the new Azure Blob data source is then named and created.
After the data source is named, the Connection tab requires the user to provide the Access Key, Storage Account Name, and Container Name. The credentials can be validated by testing the connection before updating.
For the Storage Account Name, the name of the Azure storage account should be specified. For the Container Name, the container created inside that storage account should be specified.
For the Access Key, the user can log in to the Azure dashboard and copy the key from the Storage Accounts -> <The storage account> -> Access keys view.
The obtained key should be entered in the Access Key field in the Connection tab.
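Before pasting the three values into the Connection tab, it can help to check them against Azure's published naming rules. The following is a minimal sketch, not part of Sprinkle: the function name `check_connection_fields` is hypothetical, and it only catches obviously malformed values. It does not replace the connection test.

```python
import re

# Azure naming rules: storage account names are 3-24 lowercase letters or digits;
# container names are 3-63 lowercase letters, digits, and single hyphens,
# starting and ending with a letter or digit.
ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
CONTAINER_RE = re.compile(r"(?=.{3,63}$)[a-z0-9]+(-[a-z0-9]+)*")


def check_connection_fields(account_name, container_name, access_key):
    """Return a list of problems found in the three Connection tab values.

    Hypothetical helper for illustration; an empty list means the values
    look well-formed, not that the credentials are valid.
    """
    problems = []
    if not ACCOUNT_RE.fullmatch(account_name):
        problems.append("storage account name must be 3-24 lowercase letters or digits")
    if not CONTAINER_RE.fullmatch(container_name):
        problems.append("container name must be 3-63 lowercase letters, digits, or single hyphens")
    if not access_key.strip():
        problems.append("access key is empty")
    return problems


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    print(check_connection_fields("demoaccount", "demo-sprinkle", "abc123=="))
```

Values that pass this check can still be rejected by Azure (for example, a revoked key), so the connection should still be tested before updating.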
In Datasets, the user is required to specify a table name and select the type of ingestion: complete or incremental. Complete ingestion loads the entire data set at once, irrespective of pre-existing data. This can take significant time if the data is large. In incremental ingestion, only new and latest data is ingested.
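The difference between the two modes can be sketched as a file-selection rule. This is an illustration only: the text does not say how Sprinkle decides which data is new, so the sketch assumes a last-modified timestamp per file, and the names `select_files` and `last_ingested_at` are hypothetical.

```python
from datetime import datetime


def select_files(files, mode, last_ingested_at=None):
    """Pick which files to load for one ingestion run.

    files: list of (path, modified_at) tuples.
    mode: "complete" or "incremental".
    last_ingested_at: time of the previous run (assumed bookkeeping value).
    """
    if mode == "complete":
        # Complete ingestion ignores what was loaded before.
        return [path for path, _ in files]
    if mode == "incremental":
        if last_ingested_at is None:
            # Nothing ingested yet, so the first run loads everything.
            return [path for path, _ in files]
        # Later runs load only files changed since the previous run.
        return [path for path, modified in files if modified > last_ingested_at]
    raise ValueError(f"unknown ingestion mode: {mode!r}")


if __name__ == "__main__":
    files = [("a.csv", datetime(2024, 1, 1)), ("b.csv", datetime(2024, 1, 3))]
    print(select_files(files, "complete"))
    print(select_files(files, "incremental", datetime(2024, 1, 2)))
```

With this rule, the cost of an incremental run depends on how much has changed rather than on the total volume stored, which is why it suits large data sets.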
After selecting the ingestion mode, the File Type needs to be selected as ORC, JSON, CSV, or PARQUET. The user can then optionally define a directory path so that all the files in that path are pulled, e.g. wasbs://demo-sprinkle[email protected]/testhive/datasource//34a
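A directory path in the wasbs scheme generally has the form wasbs://<container>@<account>.blob.core.windows.net/<directory>. The sketch below assembles such a path and checks the file type against the four options listed above. The helper names and the `mycontainer`, `myaccount`, and `landing/orders` values are placeholders chosen for illustration, not values from Sprinkle.

```python
# File types named in the text.
FILE_TYPES = ("ORC", "JSON", "CSV", "PARQUET")


def wasbs_uri(container, account, directory=""):
    """Build a wasbs:// path for a directory inside a container."""
    base = f"wasbs://{container}@{account}.blob.core.windows.net/"
    # Strip leading/trailing slashes so the result has no doubled separators.
    return base + directory.strip("/")


def dataset_source(file_type, container, account, directory=""):
    """Bundle the file type and path, rejecting unsupported file types."""
    normalized = file_type.upper()
    if normalized not in FILE_TYPES:
        raise ValueError(f"file type must be one of {FILE_TYPES}, got {file_type!r}")
    return {"file_type": normalized, "path": wasbs_uri(container, account, directory)}


if __name__ == "__main__":
    print(dataset_source("csv", "mycontainer", "myaccount", "/landing/orders/"))
```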
In the Ingestion Jobs tab, the concurrency (the number of tables that can run in parallel, up to a maximum of 7) can optionally be set before running the job. The status of the job is updated in the tab below once the job is complete. Jobs can also be set to run automatically by enabling Autorun. By default, the frequency is set to every night. The frequency can be changed by clicking More -> Autorun -> Change Frequency.
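The concurrency setting can be pictured as a bounded worker pool: at most N tables are ingested at the same time, with N capped at 7. The sketch below shows that idea using Python's standard thread pool. It is not Sprinkle's scheduler, and `run_ingestion` and `ingest_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 7  # upper limit stated in the text


def run_ingestion(tables, ingest_one, concurrency=3):
    """Ingest tables in parallel, never running more than `concurrency` at once.

    ingest_one: callable taking a table name and returning its status.
    """
    if not 1 <= concurrency <= MAX_CONCURRENCY:
        raise ValueError(f"concurrency must be between 1 and {MAX_CONCURRENCY}")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() preserves input order, so statuses line up with table names.
        statuses = list(pool.map(ingest_one, tables))
    return dict(zip(tables, statuses))


if __name__ == "__main__":
    # Stand-in for a real load: report success for every table.
    print(run_ingestion(["orders", "customers", "payments"], lambda t: "SUCCESS", concurrency=2))
```

A higher setting finishes a job with many tables sooner but puts more simultaneous load on the storage account and the warehouse, so the right value depends on both.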
Sprinkle supports different types of delimiters for CSV ingestion through Azure Blob. When the user chooses CSV as the file type, drop-downs related to CSV files appear.
The delimiter drop-down offers comma, tab, pipe, dash, or other.
If the user chooses OTHER_CHARACTER as the CSV delimiter type, one more field appears where the user can enter the symbol for the delimiter.
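To preview how a file splits under a given delimiter choice before configuring the dataset, the same options can be mimicked with Python's csv module. In this sketch only OTHER_CHARACTER is a name taken from the text. The keys COMMA, TAB, PIPE, and DASH are assumed labels for the other drop-down entries, and the single-character restriction comes from the csv module, not necessarily from Sprinkle.

```python
import csv
import io

# Assumed labels for the drop-down entries; only OTHER_CHARACTER appears in the text.
DELIMITERS = {"COMMA": ",", "TAB": "\t", "PIPE": "|", "DASH": "-"}


def resolve_delimiter(choice, other_character=None):
    """Translate a drop-down choice into the actual delimiter character."""
    if choice == "OTHER_CHARACTER":
        # Python's csv module accepts only one-character delimiters.
        if not other_character or len(other_character) != 1:
            raise ValueError("OTHER_CHARACTER needs exactly one delimiter symbol")
        return other_character
    if choice not in DELIMITERS:
        raise ValueError(f"unknown delimiter choice: {choice!r}")
    return DELIMITERS[choice]


def parse_csv(text, choice, other_character=None):
    """Split CSV text into rows using the chosen delimiter."""
    delimiter = resolve_delimiter(choice, other_character)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    print(parse_csv("id|name\n1|Ada\n", "PIPE"))
    print(parse_csv("id;name\n1;Ada\n", "OTHER_CHARACTER", ";"))
```

If the preview yields a single column per row, the chosen delimiter most likely does not match the file.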