Ingestion Mode

Data ingestion can be done in two modes:

  • Complete ingestion mode
  • Incremental ingestion mode

Complete ingestion mode

Complete ingestion loads the entire dataset every time, irrespective of the pre-existing data. This takes significant time and also increases the load on the source database server.

(Screenshot: Select data source)

While adding a table to a data source, you can select the type of ingestion mode.

(Screenshot: Select data source)

Incremental ingestion mode

In incremental loading, only new and updated data is ingested into the data source. In this process three tables are created: ds_datasource_table, ds_datasource_table_base, and ds_datasource_table_delta.

(Screenshot: Select data source)

Updated Rows Incremental Ingestion

In updated rows incremental ingestion, users need to select a time column (for example updated_at, modified_at, or last_updated_on). Based on that time column, data is fetched incrementally. The first run performs a full fetch and the complete data is loaded into table_base. A checkpoint is also marked after every run (the checkpoint is the max(timestamp) of the previous ingestion). In subsequent runs, records where the time column is greater than or equal to the checkpoint are pulled, and that latest data is loaded into table_delta.
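
The checkpoint logic can be pictured with a short Python sketch. The table orders, the column updated_at, and the function name are illustrative assumptions for this example, not part of the product.

# Illustrative sketch of updated-rows incremental ingestion (not product code).
# Assumes a hypothetical source table "orders" with an "updated_at" time column.
import sqlite3

def updated_rows_fetch(conn, checkpoint):
    if checkpoint is None:
        # First run: full fetch, loaded into table_base
        rows = conn.execute("SELECT * FROM orders").fetchall()
    else:
        # Later runs: only rows updated since the checkpoint, loaded into table_delta
        rows = conn.execute(
            "SELECT * FROM orders WHERE updated_at >= ?", (checkpoint,)
        ).fetchall()
    # The new checkpoint is the max(updated_at) seen in this ingestion
    new_checkpoint = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    return rows, new_checkpoint

# Tiny demonstration with an in-memory SQLite table standing in for the source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'placed', '2024-01-01')")
rows, checkpoint = updated_rows_fetch(conn, None)        # first run -> table_base
rows, checkpoint = updated_rows_fetch(conn, checkpoint)  # later runs -> table_delta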

Users can opt for a complete ingestion every night or every week. During complete ingestion the complete table is dropped and recreated in the warehouse. With nightly ingestion, a complete load runs every night and table_base is recreated; during the day, data is pulled based on the checkpoint and table_delta is created. With weekly ingestion, a complete load runs every Sunday morning and table_base is recreated; on the other days an incremental fetch happens and table_delta is created.

The final table ds_datasource_table is a view of the union of table_base and table_delta. There might be duplicates in the final view. For example, consider an e-commerce orders table. If the order status of a particular order gets updated, then the UpdatedAt time column is also changed to the current timestamp. In this scenario a few records may be present both in table_base and in table_delta, so querying the final view might return duplicated data.

Users can opt to remove duplicate rows by selecting Yes. When this option is enabled, the tool removes the duplicate rows present in the table based on the unique key; ds_datasource_table is then created by merging the two tables on one or more unique keys.
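
As a rough illustration of this merge-and-deduplicate step, the pandas sketch below unions two small stand-ins for table_base and table_delta and keeps only the latest record per unique key; the column names and data are made up for the example.

# Illustrative sketch: union table_base and table_delta, then drop duplicates
# on the unique key, keeping the most recently updated record.
import pandas as pd

base = pd.DataFrame({'order_id': [1, 2], 'status': ['placed', 'placed'],
                     'UpdatedAt': ['2024-01-01', '2024-01-01']})
delta = pd.DataFrame({'order_id': [2], 'status': ['shipped'],
                      'UpdatedAt': ['2024-01-02']})

final = (pd.concat([base, delta])                      # union of base and delta
           .sort_values('UpdatedAt')                   # latest version last
           .drop_duplicates('order_id', keep='last'))  # deduplicate on the unique key
print(final)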

As mentioned above, data can be fully fetched on an “Every Sunday” or “Every Night” schedule, and if the user wants to apply a filter clause, conditions can be added here.
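
For instance, a user-supplied filter clause restricts which rows the scheduled full fetch pulls; the base query and condition below are purely illustrative, not the SQL the product generates.

# Hypothetical filter clause applied to the scheduled complete fetch.
base_query = "SELECT * FROM orders"
filter_clause = "order_date >= '2023-01-01'"   # user-supplied condition

full_fetch_query = f"{base_query} WHERE {filter_clause}"
print(full_fetch_query)   # SELECT * FROM orders WHERE order_date >= '2023-01-01'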

New Rows Incremental Ingestion

In new rows incremental ingestion, a complete load happens only in the first run and the complete data is loaded into table_base. From the next run onwards, data is pulled from the source based on the checkpoint (the checkpoint is the max(timestamp) of the previous ingestion), i.e. an incremental fetch happens and the latest data is loaded into table_delta.
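
Compared with the updated-rows sketch above, a new-rows fetch is append-only: after the first full load, each run only pulls rows created after the checkpoint. The events table and created_at column below are again illustrative assumptions.

# Illustrative sketch of new-rows incremental ingestion (append-only sources).
# Expects a DB-API connection, as in the earlier sketch.
def new_rows_fetch(conn, checkpoint):
    if checkpoint is None:
        # First run: full load into table_base
        return conn.execute("SELECT * FROM events").fetchall()
    # Later runs: only rows created since the checkpoint, loaded into table_delta
    return conn.execute(
        "SELECT * FROM events WHERE created_at >= ?", (checkpoint,)
    ).fetchall()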

Duplicate rows can be removed here as well by selecting Yes. When this option is enabled, the tool removes the duplicate rows present in the table based on the unique key; ds_datasource_table is then created by merging the two tables on one or more unique keys.

Best Practices in Incremental Ingestion

  1. All event- or log-based data can be ingested with New Rows Incremental ingestion, because there won’t be any change in the historical data.
  2. All transaction data with an updated-at timestamp can be ingested with Updated Rows Incremental ingestion.
  3. Opting to remove duplicates is not required for New Rows Incremental ingestion.
  4. Users can decrease the load on their sources by blocking complete ingestion in Updated Rows Incremental ingestion by selecting No for Optimize Incremental Ingestion in the connection details. By default it is set to Yes, so the table goes through complete ingestion periodically, either every night or every Sunday.

# Download Explore results via the streamresult API using basic authentication (Python)
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('<API_KEY>', '<API_SECRET>')
response = requests.get("https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE_ID>", auth=auth)

print(response.content)

# Download Explore results via the streamresult API using basic authentication (R)
library('httr')

username = '<API KEY>'
password = '<API SECRET>'

temp = GET("https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>",
           authenticate(username, password, type = "basic"))

# Parse the CSV response into a data frame
temp = content(temp, 'text')
temp = textConnection(temp)
temp = read.csv(temp)

/* Download the data (SAS) */

filename resp temp;
proc http
   url="https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>"
   method="GET"
   webusername="<API KEY>"
   webpassword="<API SECRET>"
   out=resp;
run;

/* Import the data into a CSV dataset */
proc import
   file=resp
   out=csvresp
   dbms=csv;
run;

/* Print the data */
proc print data=csvresp;
run;

# Create a CSV data source by uploading a file, using basic authentication
import requests
import json

url = 'http://hostname/api/v0.4/createCSV'

username = 'API_KEY'
password = 'API_SECRET'

files = {'file': open('FILE_PATH.csv', 'rb')}
values = {'projectname': 'PROJECT_NAME', 'name': 'CSV_DATASOURCE_NAME'}

r = requests.post(url, files=files, data=values, auth=(username, password))

res_json = json.loads(r.text)

print(res_json['success'])

# Update an existing CSV data source by uploading a new file, using basic authentication
import requests
import json

url = 'http://hostname/api/v0.4/updateCSV'

username = 'API_KEY'
password = 'API_SECRET'

files = {'file': open('FILE_PATH.csv', 'rb')}
values = {'projectname': 'PROJECT_NAME', 'name': 'CSV_DATASOURCE_NAME'}

r = requests.post(url, files=files, data=values, auth=(username, password))

res_json = json.loads(r.text)

print(res_json['success'])

# Download Explore results using basic authentication and print the raw CSV
import requests

url = 'https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>'

username = 'API_KEY'
password = 'API_SECRET'

r = requests.get(url, auth=(username, password))
print(r)
print(r.text)

# Download Explore results using the SprinkleUserKeys authorization header
# and load them into a pandas DataFrame
import requests
import pandas as pd
import io

url = 'https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>'
secret = 'API_SECRET'

r = requests.get(url, headers={'Authorization': 'SprinkleUserKeys ' + secret})
df = pd.read_csv(io.StringIO(r.text), sep=',')

# Download Segment results using the SprinkleUserKeys authorization header
# and load them into a pandas DataFrame
import requests
import pandas as pd
import io

url = 'https://<hostname>/api/v0.4/segment/streamresult/<SEGMENT ID>'
secret = 'API_SECRET'

r = requests.get(url, headers={'Authorization': 'SprinkleUserKeys ' + secret})
df = pd.read_csv(io.StringIO(r.text), sep=',')

# Create a CSV data source by uploading a file, using the SprinkleUserKeys header
import requests
import json

url = 'http://hostname/api/v0.4/createCSV'

files = {'file': open('path/file.csv', 'rb')}
values = {'projectname': 'PROJECT_NAME', 'name': 'csv_datasource_name/table_name'}

secret = 'API_SECRET'

r = requests.post(url, files=files, data=values, headers={'Authorization': 'SprinkleUserKeys ' + secret})

res_json = json.loads(r.text)

# Update an existing CSV data source by uploading a new file, using the SprinkleUserKeys header
import requests
import json

url = 'http://hostname/api/v0.4/updateCSV'

files = {'file': open('path/file.csv', 'rb')}
values = {'projectname': 'PROJECT_NAME', 'name': 'csv_datasource_name/table_name'}

secret = 'API_SECRET'

r = requests.post(url, files=files, data=values, headers={'Authorization': 'SprinkleUserKeys ' + secret})

res_json = json.loads(r.text)