S3
The Amazon S3-based data lake solution uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability. In Sprinkle, this storage is used as a cloud storage data source.
Sprinkle supports a wide range of data sources. Clicking the “+” sign opens a list of data sources. Select S3 Datasource from the list, then name and create the new S3 data source.
After the data source is named, the Connection tab requires the Access Key, Secret Key, Region, and Bucket Name. Test the connection before updating to check that the credentials are valid.
If you are allowing access-key-based access to your storage, follow the documentation at https://amzn.to/2CS9OcK to generate an access key and secret key, and provide them in the connection details.
Region is the AWS region where the storage bucket was created, for example ap-south-1. Bucket Name is the name of the bucket created on AWS S3, e.g. Twx-Bucket
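The four connection values and the connection test can be sketched in Python as follows. This is only an illustration, not Sprinkle's implementation: the names S3ConnectionDetails, missing_fields and check_connection are hypothetical, and check_connection assumes the boto3 package is installed and uses its head_bucket call as one possible way to confirm that the credentials can reach the bucket.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class S3ConnectionDetails:
    """The four values requested on the connection tab (hypothetical model)."""
    access_key: str
    secret_key: str
    region: str       # e.g. "ap-south-1"
    bucket_name: str  # e.g. "Twx-Bucket"


def missing_fields(details: S3ConnectionDetails) -> List[str]:
    """Return the names of any fields that were left blank."""
    return [f.name for f in fields(details)
            if not getattr(details, f.name).strip()]


def check_connection(details: S3ConnectionDetails) -> bool:
    """Return True if the credentials can reach the bucket.

    Assumption: boto3 is available. It is imported lazily so that the
    validation above still works without it.
    """
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    client = boto3.client(
        "s3",
        aws_access_key_id=details.access_key,
        aws_secret_access_key=details.secret_key,
        region_name=details.region,
    )
    try:
        # head_bucket succeeds only if the bucket exists and is accessible.
        client.head_bucket(Bucket=details.bucket_name)
        return True
    except (BotoCoreError, ClientError):
        return False
```

Checking for blank fields first gives a clearer error than a failed network call, which is why the two steps are kept separate in this sketch.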
In Datasets, the user specifies a table name and selects the ingestion type: complete or incremental. Complete ingestion loads the entire data set at once, irrespective of pre-existing data, which takes significant time if the data is large. Incremental ingestion loads only the new, latest data.
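The difference between the two ingestion types can be illustrated with a small Python sketch. It assumes that "new" data is detected by comparing each file's last-modified time with the time of the previous run; the source does not state how Sprinkle detects new data, so this rule, along with the names SourceFile and files_to_load, is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceFile:
    key: str
    last_modified: int  # epoch seconds; assumed change marker


def files_to_load(files: List[SourceFile], mode: str,
                  last_run: Optional[int] = None) -> List[SourceFile]:
    """Pick the files an ingestion run would read.

    complete:    every file, regardless of what was loaded before.
    incremental: only files modified after the previous run
                 (everything on the first run, when last_run is None).
    """
    if mode == "complete":
        return list(files)
    if mode == "incremental":
        if last_run is None:
            return list(files)
        return [f for f in files if f.last_modified > last_run]
    raise ValueError(f"unknown ingestion mode: {mode!r}")
```

With three files modified at times 100, 200 and 300 and a previous run at time 200, a complete run reads all three files, while an incremental run reads only the last one.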
After the ingestion mode is selected, choose the File Type: ORC, JSON, CSV, or PARQUET. Optionally, define a directory path so that all files under that path are pulled, e.g. s3a://test-sprinkle-a/s3Ingest/s3Ingest13
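As a rough illustration of how such a directory path breaks down, the following Python sketch splits it into a bucket name and a key prefix, then selects the object keys that fall under that prefix. The helper names split_s3_path and keys_under are hypothetical, and the matching rule (a plain prefix match with a trailing slash) is an assumption rather than Sprinkle's documented behaviour.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def split_s3_path(path: str) -> Tuple[str, str]:
    """Split 's3a://bucket/dir/subdir' into (bucket, 'dir/subdir/')."""
    parts = urlsplit(path)
    if parts.scheme not in ("s3", "s3a") or not parts.netloc:
        raise ValueError(f"not an S3 path: {path!r}")
    prefix = parts.path.strip("/")
    # A trailing slash keeps 's3Ingest13' from also matching 's3Ingest130'.
    return parts.netloc, (prefix + "/" if prefix else "")


def keys_under(prefix: str, keys: Iterable[str]) -> List[str]:
    """Return the object keys located under the given directory prefix."""
    return [k for k in keys if k.startswith(prefix)]
```

For the example path above, split_s3_path returns the bucket test-sprinkle-a and the prefix s3Ingest/s3Ingest13/.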
In the Ingestion Jobs tab, the concurrency (the number of tables that can run in parallel, up to a maximum of 7) can be set as preferred before running the job. The job status is updated in the tab below once the job completes. Jobs can also be set to run automatically by enabling Autorun. By default, the frequency is set to every night; it can be changed by clicking More --> Autorun --> Change Frequency.
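The effect of the concurrency setting can be pictured with a short Python sketch that ingests several tables in parallel using a bounded thread pool. It is not how Sprinkle schedules jobs internally: run_ingestion and the ingest_one callback are hypothetical names, and only the upper limit of 7 comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

MAX_CONCURRENCY = 7  # upper limit stated for the Ingestion Jobs tab


def run_ingestion(tables: List[str],
                  ingest_one: Callable[[str], str],
                  concurrency: int) -> Dict[str, str]:
    """Ingest the tables, running at most `concurrency` of them at once."""
    if not 1 <= concurrency <= MAX_CONCURRENCY:
        raise ValueError(f"concurrency must be between 1 and {MAX_CONCURRENCY}")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map keeps results in the same order as the input tables.
        statuses = list(pool.map(ingest_one, tables))
    return dict(zip(tables, statuses))
```

A higher concurrency shortens the total run time when there are many tables, at the cost of more simultaneous load on the source and the warehouse.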
Sprinkle supports several delimiter types for CSV files ingested from cloud storage. When the user chooses CSV as the file type, drop-downs related to CSV files appear.
The delimiter drop-down offers comma, tab, pipe, dash, or other.
If the user chooses OTHER_CHARACTER as the CSV delimiter type, one more field appears where the user can enter the delimiter symbol.
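The delimiter choice can be sketched in Python using the standard csv module. Only OTHER_CHARACTER appears as an option name in the description above; the other option names (COMMA, TAB, PIPE, DASH) and the helpers resolve_delimiter and parse_csv are assumptions made for this illustration.

```python
import csv
import io
from typing import List, Optional

# Assumed option names for the fixed delimiter choices.
DELIMITERS = {"COMMA": ",", "TAB": "\t", "PIPE": "|", "DASH": "-"}


def resolve_delimiter(choice: str, custom: Optional[str] = None) -> str:
    """Map a drop-down choice to the actual delimiter character."""
    if choice == "OTHER_CHARACTER":
        # Python's csv module accepts only single-character delimiters;
        # this limit belongs to the sketch, not necessarily to Sprinkle.
        if custom is None or len(custom) != 1:
            raise ValueError("OTHER_CHARACTER needs a single-character delimiter")
        return custom
    if choice not in DELIMITERS:
        raise ValueError(f"unknown delimiter choice: {choice!r}")
    return DELIMITERS[choice]


def parse_csv(text: str, choice: str,
              custom: Optional[str] = None) -> List[List[str]]:
    """Parse CSV text into rows using the chosen delimiter."""
    reader = csv.reader(io.StringIO(text),
                        delimiter=resolve_delimiter(choice, custom))
    return [row for row in reader]
```

For example, parse_csv("id|name\n1|Asha\n", "PIPE") splits each line on the pipe character, while parse_csv("id;name\n1;Asha\n", "OTHER_CHARACTER", ";") uses the symbol supplied in the extra field.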