
Ingestion Mode

In Datasource >> Datasets, while adding a dataset, the ingestion mode can be selected as either Complete or Incremental, depending on the need.

Complete Ingestion

Complete ingestion fetches the entire data during each run, irrespective of the pre-existing data. This takes a significant amount of time and also increases the load on the database server.

Incremental Ingestion

In incremental ingestion, only new and updated data is ingested into the dataset, based on checkpoints.

For incremental ingestion, the “Time column” (also known as the Updated_at Timestamp) needs to be specified; it forms the basis for the incremental ingestion.

First Fetch

In the first run, a full fetch, i.e. complete ingestion, is done. The complete data is loaded into the dataset, and the checkpoint for the fetch is updated. The checkpoint is the maximum value of the Time column seen in the previous ingestion and is used in the following fetches to pick up only new and updated rows.

Following Fetches

In subsequent fetches after the First Fetch, only incremental records are fetched: the table is updated with only those rows whose Time column entry is greater than or equal to the checkpoint.
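The checkpoint logic above can be sketched in Python. This is a hypothetical in-memory illustration, not the tool's internals; "updated_at" stands in for the configured Time column.

```python
from datetime import datetime

def incremental_fetch(source_rows, checkpoint):
    """Return (fetched rows, new checkpoint)."""
    if checkpoint is None:
        # First Fetch: complete ingestion of all rows.
        fetched = list(source_rows)
    else:
        # Following fetches: only rows at or after the checkpoint.
        fetched = [r for r in source_rows if r["updated_at"] >= checkpoint]
    # New checkpoint is the max Time column value seen in this ingestion.
    new_checkpoint = max((r["updated_at"] for r in source_rows), default=checkpoint)
    return fetched, new_checkpoint

rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 1, 2)},
]
first, cp = incremental_fetch(rows, None)   # First Fetch loads both rows
rows.append({"id": 3, "updated_at": datetime(2023, 1, 3)})
delta, cp = incremental_fetch(rows, cp)     # fetches rows 2 and 3 (>= checkpoint)
```

Because the comparison is "greater than or equal to", a row whose timestamp exactly equals the checkpoint is fetched again; this is one way duplicate rows can arise, which the Remove Duplicate Rows option addresses.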


Remove Duplicate Rows

This option can be enabled to remove duplicate row entries. It is generally required when rows get updated on the source, causing incremental ingestion to create duplicate entries.

For example, consider an e-commerce order table. If the status of a particular order gets updated, the Time column entry for that order ID is also updated with the current timestamp. In this scenario, the updated entry is appended to the table, so querying the final table may return duplicated data.

Users can opt to remove duplicate rows by selecting “Yes”. On enabling this, the tool removes the duplicate rows present in the table based on the unique key provided.
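Deduplication by unique key can be illustrated with a short sketch; the row layout and the names `order_id` (unique key) and `updated_at` (Time column) are assumptions for the example, not the tool's actual implementation.

```python
def remove_duplicates(rows, key="order_id", time_col="updated_at"):
    """Keep only the latest version of each row, per unique key."""
    latest = {}
    for row in rows:
        seen = latest.get(row[key])
        if seen is None or row[time_col] > seen[time_col]:
            latest[row[key]] = row
    return list(latest.values())

orders = [
    {"order_id": 101, "status": "placed",  "updated_at": 1},
    {"order_id": 101, "status": "shipped", "updated_at": 2},  # updated on source
    {"order_id": 102, "status": "placed",  "updated_at": 1},
]
deduped = remove_duplicates(orders)  # order 101 keeps only the "shipped" row
```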

Periodic Full Fetch

The Periodic Full Fetch option enables a periodic complete fetch of the data from the source. This keeps the dataset properly in sync with the source data. On clicking “Yes”, you get an option to set the periodicity in the “Full Fetch Interval” field. You would want a periodic full fetch if data on the source gets deleted and you want it deleted from the warehouse as well.

Users can opt for complete ingestion every night or every week. During complete ingestion, the entire table is dropped and recreated in the warehouse. Transactional data, where updates to row entries are expected, can be ingested with incremental settings and Periodic Full Fetch enabled.

By default, “Periodic Full Fetch” is set to No. Event or log-based data can be ingested without Periodic Full Fetch enabled, as no updates are expected on the row entries.
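The scheduling decision can be sketched as follows; the function name and interval logic are illustrative assumptions, not the tool's internals.

```python
from datetime import datetime, timedelta

def should_full_fetch(last_full_fetch, now, interval=timedelta(days=1)):
    """True when the Full Fetch Interval has elapsed since the last complete ingestion
    (a complete ingestion drops and recreates the table in the warehouse)."""
    return last_full_fetch is None or now - last_full_fetch >= interval

now = datetime(2023, 1, 10, 2, 0)
should_full_fetch(None, now)                                   # first run: full fetch
should_full_fetch(now - timedelta(hours=6), now)               # nightly interval not due yet
should_full_fetch(now - timedelta(days=7), now, timedelta(days=7))  # weekly full fetch due
```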


Best Practices

  1. All event or log-based data can be ingested in Incremental mode without Periodic Full Fetch, because there won’t be any change in the historical data.
  2. All transactional data with an updated_at timestamp can be ingested in Incremental mode with the option to remove duplicates.
  3. Removing duplicates is not required if only new rows get added on the source, e.g. history or audit data.
  4. Enable Periodic Full Fetch in Incremental mode if records get deleted on the source and the warehouse should stay in sync; disable it otherwise.

API

import requests
from requests.auth import HTTPBasicAuth

# Stream the result of an Explore using HTTP Basic Authentication.
auth = HTTPBasicAuth('<API_KEY>', '<API_SECRET>')
response = requests.get("https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE_ID>", auth=auth)

print(response.content)

library('httr')

username = '<API KEY>'
password = '<API SECRET>'

temp = GET("https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>",
           authenticate(username,password, type = "basic"))

temp = content(temp, 'text')
temp = textConnection(temp)
temp = read.csv(temp)

/*Download the Data*/

filename resp temp;
proc http
url="https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>"
   method= "GET"  
   WEBUSERNAME = "<API KEY>"
   WEBPASSWORD = "<API SECRET>"
   out=resp;
run;

/*Import the data into a CSV dataset*/
proc import
   file=resp
   out=csvresp
   dbms=csv;
run;

/*Print the data */
PROC PRINT DATA=csvresp;
RUN;

import requests
import json

# Create a CSV data source by uploading a file.
url='http://hostname/api/v0.4/createCSV'

username='API_KEY'
password='API_SECRET'

files={'file':open('FILE_PATH.csv','rb')}
values={'projectname':'PROJECT_NAME','name':'CSV_DATASOURCE_NAME'}

r=requests.post(url, files=files, data=values, auth=(username,password))

res_json=json.loads(r.text)

print(res_json['success'])

import requests
import json

# Update an existing CSV data source with a new file.
url='http://hostname/api/v0.4/updateCSV'

username='API_KEY'
password='API_SECRET'

files={'file':open('FILE_PATH.csv','rb')}
values={'projectname':'PROJECT_NAME','name':'CSV_DATASOURCE_NAME'}

r=requests.post(url, files=files, data=values, auth=(username,password))

res_json=json.loads(r.text)

print(res_json['success'])

import requests

url='https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>'

username='API_KEY'
password='API_SECRET'

r=requests.get(url,auth=(username,password))
print(r)
print(r.text)

import requests
import pandas as pd

url='https://<hostname>/api/v0.4/explores/infoByFolder/<SPACE_ID>'

username='API_KEY'
password='API_SECRET'

r=requests.get(url,auth=(username,password)).json()
df = pd.DataFrame(r)
print(df)

import requests
import pandas as pd

url='https://<hostname>/api/v0.4/folders/byOrgName/<ORG_NAME>'

username='API_KEY'
password='API_SECRET'

r=requests.get(url,auth=(username,password)).json()
df = pd.DataFrame(r)
print(df.loc[:,['name','id']])

import requests

# Fetch the SQL behind an Explore. The endpoint returns plain text, not JSON.
url='https://<host>/api/v0.4/explore/sql/<EXPLORE_ID>/<PROJECT_NAME>'

username='API_KEY'
password='API_SECRET'

r=requests.get(url,auth=(username,password))
print(r.text)

import requests

import pandas as pd

import io

url='https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>'

secret='API_SECRET'

r=requests.get(url,headers = {'Authorization': 'SprinkleUserKeys ' +secret } )

df = pd.read_csv(io.StringIO(r.text),sep=',')

import requests

import pandas as pd

import io

url='https://<hostname>/api/v0.4/segment/streamresult/<SEGMENT ID>'

secret='API_SECRET'

r=requests.get(url,headers = {'Authorization': 'SprinkleUserKeys ' +secret } )

df = pd.read_csv(io.StringIO(r.text),sep=',')

import requests

import json

# Create a CSV data source using SprinkleUserKeys authorization.
url='http://hostname/api/v0.4/createCSV'

files={'file':open('path/file.csv','rb')}

values={'projectname':'PROJECT_NAME','name':'csv_datasource_name/table_name'}

secret='API_SECRET'

r=requests.post(url, files=files, data=values, headers = {'Authorization': 'SprinkleUserKeys ' +secret } )

res_json=json.loads(r.text)

import requests

import json

# Update an existing CSV data source using SprinkleUserKeys authorization.
url='http://hostname/api/v0.4/updateCSV'

files={'file':open('path/file.csv','rb')}

values={'projectname':'PROJECT_NAME','name':'csv_datasource_name/table_name'}

secret='API_SECRET'

r=requests.post(url, files=files, data=values,headers = {'Authorization': 'SprinkleUserKeys ' +secret } )

res_json=json.loads(r.text)

import requests

import pandas as pd

import io

url='https://<host>/api/v0.4/explore/sql/<EXPLORE_ID>/<PROJECT_NAME>'

secret='API_SECRET'

r=requests.get(url,headers = {'Authorization': 'SprinkleUserKeys ' +secret } )

df = pd.read_csv(io.StringIO(r.text),sep=',')