Advanced Features in Flow

There are two advanced features in Flows.

  1. Managed Data Loading
  2. Incremental Processing

1. Managed Data Loading

If your flow script does a DROP TABLE somewhere, reports that depend on those tables end up failing. This is not a problem if you're only dropping temporary tables, but it becomes one when you start dropping tables that power reports. To get around that, you can let Sprinkle manage the loading of the data.

The way this works is that you create a Staging Table and specify both the Staging Table and the Final Table in the Settings section. First, make changes in your script: remove the code that loads data into the final table and load the data into a Staging Table instead. Save the script, head to Settings, enable Managed Data Loading, and provide values for the Staging and Final Tables. Save the settings here as well. If your script doesn't meet the criteria, saving the settings will fail.
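As a minimal sketch (the table and column names below are illustrative, not actual Sprinkle objects), the script builds only the staging table and leaves loading the final table to Sprinkle:

    -- Load only the staging table from the script; do not DROP or recreate the final table.
    drop table if exists orders_staging;
    create table orders_staging stored as orc as
    select order_id, amount, created_at
    from raw_orders;
    -- In Settings: Staging Table = orders_staging, Final Table = orders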

   

[Screenshot: Managed Data Loading 1]

In script, loading data into the staging table:

[Screenshot: Managed Data Loading 2]

Under the hood, we take care of loading data into the final table in a way that avoids DROP TABLE. One caveat: if your table has incompatible column changes (e.g. string to map), the table has to be dropped and recreated. But since scripts don't change very frequently, this approach is beneficial in most cases.

At the end of this, the final table looks exactly like the staging table. This is implemented by exposing the final table as a view rather than a regular table. The view is backed by a sprinkle-managed table which is not visible to you in the UI. This allows us to manage column additions, removals, renames, and type changes internally on the managed table. The columns available in the view (the final table) are the same as those in the staging table.
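Conceptually (the names below are illustrative; the real managed table is named and maintained internally by Sprinkle), the arrangement is roughly equivalent to:

    -- The final "table" is just a view over a hidden, Sprinkle-managed table.
    create view orders as select * from sprinkle_managed_orders;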

User-defined Variables:

Users can define variables with a value in the flow settings, and those variables can be used within that particular flow. For example, in incremental loading the user defines the number of window days. The incremental process is explained below.
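For instance (the variable name and value are illustrative), a variable window defined as 7 in the flow settings can be referenced in the script as ${window}:

    -- ${window} resolves to the value configured in the flow settings (e.g. 7)
    where day >= date_sub(current_date(), ${window})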

2. Incremental Processing (Partitioned Managed Data Loading)

Incremental processing reduces pipeline run time by reducing the amount of data crunched. It builds on managed data loading (see the paragraphs above) by adding a few more abstractions. You still have to provide staging and final table names and, just like above, the final table is actually a view over a sprinkle-managed table. The sprinkle-managed table is not visible to you, but the view is.

To enable partitioned managed data loading, you need to provide a date column, and you are responsible for coming up with an expression for this column. For daily partitioning, the column should be in yyyy-MM-dd format. If you already have such a column in your staging table, you can use it directly.

The sprinkle-managed table mentioned above is stored as a partitioned table, with the specified date column as its partitioning column; the other columns are regular columns. Note that from a query perspective, regular and partition columns are treated the same: you can still select partition columns, apply filters on them, and so on.

When the staging table is loaded into the managed table, the behaviour is to replace all partitions for which the staging table contains rows. Those rows are loaded into the sprinkle-managed table and become visible through the view (the final table). So take extra care to construct the staging table such that whenever it has data for a particular partition, it has the complete data for that partition. This can be taken care of by putting proper filter clauses in your script.
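For example (the source table and columns are illustrative), a yyyy-MM-dd date column for daily partitioning can be derived in the staging script like this:

    -- Derive the yyyy-MM-dd date (partition) column from a timestamp
    drop table if exists orders_staging;
    create table orders_staging stored as orc as
    select order_id,
           amount,
           date_format(created_at, 'yyyy-MM-dd') as day
    from raw_orders;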

   

[Screenshot: Incremental Processing]

The staging table should be created so that it holds the last x days of data. To help with that, put where clauses in your input tables' select statements so that you only process the data that needs to be loaded into the staging table. You have full freedom in constructing these where clauses. Just remember that whatever partition values the staging table has data for will be replaced in the sprinkle-managed table. So, if you don't put any where clauses on the input tables, your flow will process the complete data and load all partitions on every run.

To help with checkpointing, a pre-defined variable is available for use in your script: you can use ${sprinkle.latest.partition} to construct your where clauses. We also provide an option to add more variables, each of which can be accessed as ${name}, where name is the variable name. Variable names cannot contain spaces. You can define constant day windows for your input tables and use them as variables in the script. See the last section for examples.

What does DROP MANAGED TABLE do?

As explained above, managed data loading (both partitioned and non-partitioned) works with three tables: the staging table, the sprinkle-managed table, and the final view. In most cases, Sprinkle automatically takes care of updating both the sprinkle-managed table and the final view. In case of a bug, this button acts as a backdoor for getting Sprinkle to recreate the managed table. Views, on the other hand, are simple to update and unlikely to fail.

There is one scenario in which you would consciously need to press this button, and it is related to schema evolution. When a field is added upstream (i.e. to a flow's input tables), Sprinkle's default behaviour is to treat that field as applying only from that point onward: the field will be available at the table level, but its value will be null for older partitions. While this is enough for some cases, sometimes you do need to add the new field across your entire data. In those cases you'd press this button. It drops the managed table, and the next time the flow runs, the table is recreated from scratch.

Example

-- incremental processing
drop table if exists test_staging;
create table test_staging stored as orc as select * from
-- write your own table x here:
(select '2019-01-03' day, '33' col2 union all select '2019-01-02' day, '22' col2 union all select '2019-01-01', '11' union all select '2019-01-04', '44') x
-- use latest.partition and custom defined variables here.
where day >= date_sub('${sprinkle.latest.partition}', ${window});
-- If you are joining multiple tables, you will need to define windows for all of them.

Transaction statements are now allowed in transformations:

A transaction is a sequential group of statements, queries, or operations (such as select, insert, update, or delete) that are performed as a single unit of work and can be committed or rolled back.

We support transaction statements within begin ... end/commit blocks.
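As a hedged sketch (the table names are hypothetical and the exact syntax depends on your warehouse), a transaction block inside a transformation might look like:

    -- Replace one day's data as a single unit of work: either both statements apply, or neither does.
    begin;
      delete from daily_orders where day = '2019-01-04';
      insert into daily_orders select * from orders_staging where day = '2019-01-04';
    commit;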


# Download an explore's results via the API using basic auth
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('<API_KEY>', '<API_SECRET>')
response = requests.get("https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE_ID>", auth=auth)

print(response.content)

# Download an explore's results via the API and read the CSV response
library('httr')

username = '<API KEY>'
password = '<API SECRET>'

temp = GET("https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>",
           authenticate(username, password, type = "basic"))

temp = content(temp, 'text')
temp = textConnection(temp)
temp = read.csv(temp)

/*Download the Data*/

filename resp temp;
proc http
url="https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>"
   method= "GET"  
   WEBUSERNAME = "<API KEY>"
   WEBPASSWORD = "<API SECRET>"
   out=resp;
run;

/*Import the data into a CSV dataset*/
proc import
   file=resp
   out=csvresp
   dbms=csv;
run;

/*Print the data */
PROC PRINT DATA=csvresp;
RUN;

# Create a CSV data source by uploading a file (basic auth)
import requests
import json

url='http://<hostname>/api/v0.4/createCSV'

username='API_KEY'
password='API_SECRET'

files={'file': open('FILE_PATH.csv','rb')}
values={'projectname':'PROJECT_NAME','name':'CSV_DATASOURCE_NAME'}

r=requests.post(url, files=files, data=values, auth=(username,password))

res_json=json.loads(r.text)

print(res_json['success'])

# Update an existing CSV data source by uploading a new file (basic auth)
import requests
import json

url='http://<hostname>/api/v0.4/updateCSV'

username='API_KEY'
password='API_SECRET'

files={'file': open('FILE_PATH.csv','rb')}
values={'projectname':'PROJECT_NAME','name':'CSV_DATASOURCE_NAME'}

r=requests.post(url, files=files, data=values, auth=(username,password))

res_json=json.loads(r.text)

print(res_json['success'])

# Download an explore's results using the API key/secret as basic auth
import requests

url='https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>'

username='API_KEY'
password='API_SECRET'

r=requests.get(url, auth=(username,password))
print(r)
print(r.text)

# Download an explore's results using the SprinkleUserKeys header and load them into a pandas DataFrame
import requests
import pandas as pd
import io

url='https://<hostname>/api/v0.4/explore/streamresult/<EXPLORE ID>'

secret='API_SECRET'

r=requests.get(url, headers={'Authorization': 'SprinkleUserKeys ' + secret})

df = pd.read_csv(io.StringIO(r.text), sep=',')

# Download a segment's results using the SprinkleUserKeys header and load them into a pandas DataFrame
import requests
import pandas as pd
import io

url='https://<hostname>/api/v0.4/segment/streamresult/<SEGMENT ID>'

secret='API_SECRET'

r=requests.get(url, headers={'Authorization': 'SprinkleUserKeys ' + secret})

df = pd.read_csv(io.StringIO(r.text), sep=',')

# Create a CSV data source by uploading a file (SprinkleUserKeys header auth)
import requests
import json

url='http://<hostname>/api/v0.4/createCSV'

files={'file': open('path/file.csv','rb')}
values={'projectname':'PROJECT_NAME','name':'csv_datasource_name/table_name'}

secret='API_SECRET'

r=requests.post(url, files=files, data=values, headers={'Authorization': 'SprinkleUserKeys ' + secret})

res_json=json.loads(r.text)

# Update an existing CSV data source by uploading a new file (SprinkleUserKeys header auth)
import requests
import json

url='http://<hostname>/api/v0.4/updateCSV'

files={'file': open('path/file.csv','rb')}
values={'projectname':'PROJECT_NAME','name':'csv_datasource_name/table_name'}

secret='API_SECRET'

r=requests.post(url, files=files, data=values, headers={'Authorization': 'SprinkleUserKeys ' + secret})

res_json=json.loads(r.text)