Notebook Setup Guide

About this guide

This document helps users set up their own JupyterHub with different authenticators and spawners. It also covers what is required for a Jupyter notebook to work with Sprinkle data.

Topics

1. Jupyter Installation

Check whether Java is installed on the machine; if not, install Java 8.

  • sudo apt-get install openjdk-8-jdk

Installing anaconda3 with python3.7

  • curl -O https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh

  • bash Anaconda3-2019.10-Linux-x86_64.sh

  • source ~/.bashrc

For more information about the Anaconda3 installation, refer to this

Installing jupyter

  • conda install -c conda-forge jupyterhub
    # installs jupyterhub and proxy

  • conda install notebook
    # needed if running the notebook servers locally

For more information about the Jupyter installation, refer to this

2. Common configurations

These configurations are mandatory, irrespective of your authenticator and spawner, in order to make JupyterHub work with Sprinkle.

Change your directory to /anaconda3 and generate the jupyterhub_config.py file for Jupyter configuration:

  • jupyterhub --generate-config

Open jupyterhub_config.py and set the following parameters:

  • Search for c.JupyterHub.admin_access and set it to true: c.JupyterHub.admin_access = True

    This will enable external applications that have the admin token to access the Jupyter notebook with full admin permissions.

  • Set the IP address of your machine in the config file so that external applications can access your hub: c.JupyterHub.hub_ip = <ip address of your machine>
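
Put together, the two mandatory settings above look like this in jupyterhub_config.py (the IP address shown is a placeholder):

```python
# jupyterhub_config.py -- settings required for Sprinkle to reach the hub
c.JupyterHub.admin_access = True     # allow admin-token access to user servers
c.JupyterHub.hub_ip = '10.0.0.5'     # placeholder: use your machine's IP address
```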

It is preferred to create an environment for your notebook. Create a new conda environment:

  • If base is not active, activate conda first:

    • Go to the /anaconda3 directory and run the command source bin/activate
  • Create a python3 environment:

    • conda create -n example python=3.7

      You can replace example with any other name for your environment

  • Activate the environment:

    • conda activate example
  • Install the following packages in the created environment:

    conda install -c conda-forge -y \
    conda-pack \
    pandas \
    numpy \
    pyspark \
    requests \
    time \
    scikit-learn \
    matplotlib
    
  • Other useful packages you may want to install:

    datetime \
    tensorflow \
    matplotlib \
    pytorch \
    statsmodels \
    seaborn \
    findspark \
    gspread \
    oauth2client \
    urllib3 \
    statistics \
    pygsheets \
    pandas_gbq \
    openpyxl \
    selenium \
    sklearn \
    xgboost \
    geopy \
    gmplot 
    

Use conda-forge to install the above packages. If a package is not available on conda-forge, you can use pip, pip3, or any other preferred installer.
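
Once the packages are installed, a quick sketch like the one below can confirm that they import cleanly in the activated environment (the module list mirrors the install step above and is an assumption; note that the package scikit-learn imports as sklearn):

```python
import importlib

def check_imports(modules):
    """Return the subset of `modules` that fails to import."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Modules expected in the environment created above.
print(check_imports(["pandas", "numpy", "pyspark", "requests", "sklearn", "matplotlib"]))
```

An empty list means everything imported; any names printed are packages that still need to be installed.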

3. Authenticators

By default, JupyterHub uses jupyterhub.auth.PAMAuthenticator, which allows only local users of your machine to log in. If you want to create new users, you may need to change your authenticator. Some authenticators you may try:

Dummy Authenticator

If you are not concerned about security and want to allow anyone to log in, use this authenticator. It accepts any username and password unless a global password has been set; once set, any username is still accepted but the correct password must be provided.

Search for c.JupyterHub.authenticator_class and set it to jupyterhub.auth.DummyAuthenticator:

  • c.JupyterHub.authenticator_class = 'jupyterhub.auth.DummyAuthenticator'
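
For the global-password behaviour described above, DummyAuthenticator exposes a password setting. A minimal jupyterhub_config.py fragment (the password value is a placeholder):

```python
c.JupyterHub.authenticator_class = 'jupyterhub.auth.DummyAuthenticator'
c.DummyAuthenticator.password = 'a-shared-password'   # placeholder global password
```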

OAuthenticator

If you want to authenticate based on OAuth, follow the steps below:

  • First, create Google OAuth credentials. If you already have them, you can skip this section.

    • Log into the Google API console

    • To create OAuth credentials, you need an app created in the OAuth consent screen. Create one by entering an appropriate name, a user support email, and developer contact information.

    • Create OAuth credentials

      Click on CREATE CREDENTIALS -> OAuth client ID

      Choose Web application as the Application type

      Add a redirect URI. Its format should be http[s]://[your-host]/hub/oauth_callback, where [your-host] is where your server will be running, such as localhost:8000.

      Copy the generated Client ID and Client Secret

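
The redirect URI format can be made concrete with a small, purely illustrative helper (the function name is hypothetical):

```python
def oauth_callback_url(host: str, https: bool = True) -> str:
    """Build the JupyterHub OAuth redirect URI for a given host."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}/hub/oauth_callback"

print(oauth_callback_url("localhost:8000", https=False))
# → http://localhost:8000/hub/oauth_callback
```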
  • General Setup:

    • Activate your conda environment

    • Install oauthenticator using conda install -c conda-forge oauthenticator or pip3 install oauthenticator

    • Add a command to add users. This is mandatory; without it, external applications like Sprinkle won't be able to create notebook users.

      • For Linux-based machines, use

        c.Authenticator.add_user_cmd = ['adduser', '-q', '--gecos', '""', '--disabled-password', '--force-badname']

        It is important to add --force-badname because Linux considers a name containing . a bad name. You might have a Gmail account containing . (for example [email protected]); in that case you would be able to authenticate but not to create the user.

      • For macOS-based machines, use

        c.Authenticator.add_user_cmd = ['dscl', '.', '-create', '/Users/']

  • Now you can set up the OAuthenticator of your choice.

    GoogleOAuthenticator

    • Set the following in your jupyterhub_config.py file:

      from oauthenticator.google import GoogleOAuthenticator
      c.JupyterHub.authenticator_class = GoogleOAuthenticator
      c.GoogleOAuthenticator.create_system_users = True
      c.GoogleOAuthenticator.client_id = '<your client id>'
      c.GoogleOAuthenticator.client_secret = '<your client secret>'
      c.GoogleOAuthenticator.oauth_callback_url = '<your callback url>'
      

    LocalGoogleOAuthenticator

    • Set the following in your jupyterhub_config.py file:

      from oauthenticator.google import LocalGoogleOAuthenticator
      c.JupyterHub.authenticator_class = LocalGoogleOAuthenticator
      c.LocalGoogleOAuthenticator.create_system_users = True
      c.LocalGoogleOAuthenticator.client_id = '<your client id>'
      c.LocalGoogleOAuthenticator.client_secret = '<your client secret>'
      c.LocalGoogleOAuthenticator.oauth_callback_url = '<your callback url>'
      

    • Replace <your client id> and <your client secret> with the credentials you copied earlier.

    • Replace <your callback url> with the redirect URL you added while creating the client ID.

    • If you want to restrict your hub to a limited domain, you can add the following:

      c.GoogleOAuthenticator.hosted_domain = ['domain.name']

      Replace domain.name with your domain name, e.g. gmail.com.

    • If you want to restrict your hub to certain users, you can whitelist them. No users other than the whitelisted ones will be authenticated.

      c.Authenticator.whitelist = {'username1', 'username2'}

    Check this for other authentication services supported by OAuthenticator.

Other authenticators in jupyterhub

4. Spawners

By default, JupyterHub uses jupyterhub.spawner.LocalProcessSpawner, which may not be ideal with your chosen authenticator. Spawners you may want to try:

Simple Spawner

  • This is one of the simplest spawners. Open your jupyterhub_config.py and set c.JupyterHub.spawner_class = 'jupyterhub.spawner.SimpleLocalProcessSpawner' and you are done. This will launch your notebook servers on your machine.

Yarn Spawner

If you want to launch notebook servers on Apache Hadoop/YARN clusters, use yarnspawner.

Steps to be followed:

  • Activate the environment

  • Install yarnspawner: conda install -c conda-forge jupyterhub-yarnspawner

  • Install notebook: pip install notebook

    Pip is required to avoid a hardcoded path in the kernelspec (for now)

  • Package the environment into environment.tar.gz

    conda pack -o environment.tar.gz

  • Upload the environment to HDFS:

    • hdfs dfs -mkdir hdfs://<ip address of your machine>:8020/environments/

    • hdfs dfs -chown <username> hdfs://<ip address of your machine>:8020/environments/

    • hdfs dfs -chmod 744 hdfs://<ip address of your machine>:8020/environments/

    • hdfs dfs -put environment.tar.gz hdfs://<ip address of your machine>:8020/environments/

      Here the IP is that of the machine on which Hadoop is installed

  • If you want to create a separate queue for the notebook, follow these steps:

    • Go to ambari -> yarn queue manager -> add queue and name it notebook (you can name it anything you want)

    • Set capacity and user limit factor according to the environment

  • Set up jupyterhub_config.py for yarnspawner:

    c.JupyterHub.spawner_class = 'yarnspawner.YarnSpawner'
    c.YarnSpawner.localize_files = {
        'environment': {
            'source': 'hdfs://<ip address of your machine>:8020/environments/environment.tar.gz',
            #'visibility': 'public'
        }
    }
    c.YarnSpawner.prologue = 'source environment/bin/activate'
    # The memory limit for a notebook instance.
    c.YarnSpawner.mem_limit = '2 G'
    # The YARN queue to use; you can ignore this if you don't want a separate queue
    c.YarnSpawner.queue = 'notebook'
    
  • You might need to set up Ambari for JupyterHub to run with yarnspawner:

    • Go to HDFS -> configs -> advanced -> Custom core-site

    • Add the following properties

      hadoop.proxyuser.username.groups = *
      hadoop.proxyuser.username.hosts = *
      

      Here username is the user of the machine on which JupyterHub is running; you can find it by running whoami in a terminal

  • If you want to use Spark from the notebook, add the spark2 service if it is not added already:

    • Go to ambari -> Add new service -> spark2

Check yarnspawner in detail.

Check other spawners supported by JupyterHub

5. Start jupyterhub

  • Go to the location where Anaconda is installed

  • Activate your environment

    • source bin/activate

    • conda activate example, where example is the name of your environment

  • Run JupyterHub using the command jupyterhub

    At this point, your JupyterHub should be running on port 8000

6. Generate Admin Token

  • Make sure c.JupyterHub.admin_access is set to true

    c.JupyterHub.admin_access = True

  • Generate an admin token in either of the following two ways:

    Admin User

    • First, you need to create an admin user. Search for c.Authenticator.admin_users and provide the name of the user you want to be admin.

      c.Authenticator.admin_users = {'username'}

      If you are using a Google-based authenticator, then for user [email protected] the username will be nalinswarup

    • Log in to JupyterHub with the admin user and go to the control panel in the top right corner.

    • You should be able to see the Admin tab; otherwise you are not an admin. Tokens generated by non-admins won't work.

    • Click on the Token tab, just before the Admin tab

    • Request a new API token

      Copy the generated token. This is the token Sprinkle asks for in Sprinkle's Jupyter notebook driver

      If you provide a token generated by an admin user and then delete that user, the token won't be valid anymore

    Admin Service

    • You can also create a service for external apps. Open jupyterhub_config.py and add:

      c.JupyterHub.services = [
          {
          'name':'myservice',
          'admin': True,
          'api_token': '<random hex 32 digit token>'
          }
      ]
      

      You can create a random 32-byte hex token using openssl rand -hex 32 or any other known method. Make sure that admin is set to true

      If you delete the service, the token won't be valid anymore
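
If openssl is not at hand, an equivalent token can be generated with Python's standard library:

```python
import secrets

# 32 random bytes rendered as 64 hexadecimal characters,
# equivalent to `openssl rand -hex 32`
print(secrets.token_hex(32))
```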

7. Install new packages in your environment

  • Stop your jupyterhub

  • Activate your environment

  • Install the needed package: conda install -c conda-forge your_package_name

  • If you are using the YARN spawner, you will also have to do the following:

    • Do the above two steps

    • Remove environment.tar.gz if it is present in the anaconda3 directory

      rm -rf environment.tar.gz

    • Remove environment.tar.gz from the Hadoop machine

      hdfs dfs -rm -r hdfs://<ip of the machine where hadoop is running>:8020/environments/environment.tar.gz
      
    • Create a new environment.tar.gz file with the newly installed packages

      conda pack -o environment.tar.gz

    • Upload the tar file to the Hadoop machine

      hdfs dfs -put environment.tar.gz hdfs://<ip of the machine where hadoop is running>:8020/environments/
      
  • Rerun JupyterHub

8. Troubleshoot

  • Not able to open Ambari after conda installation

    • After conda installation, your base environment activates on its own every time you open a terminal, and your conda environment runs on Python 3. This sometimes causes problems opening Ambari if Ambari runs with a different Python version on the same machine where JupyterHub is installed. If you don't want conda to activate on its own, run

      conda config --set auto_activate_base false

    • If you want to deactivate conda just for the current session, use conda deactivate

  • Able to run a notebook through Sprinkle but other APIs are failing

    • This happens if c.JupyterHub.admin_access is not set to true. Setting this to true grants admin users permission to access single-user servers.

      c.JupyterHub.admin_access = True

  • 400 Bad Request

    • This generally occurs if the request sent to the web server was malformed, i.e., the request itself was somehow incorrect or corrupted and the server couldn't understand it. Check the Jupyter logs to see whether the request reached the hub. The cause might be malformed request syntax, invalid request message framing, or deceptive request routing. In most cases, the problem lies with the application sending the request.
  • 403 Forbidden

    • If you are trying to reach JupyterHub through an external app (e.g. launching a notebook through Sprinkle) and facing a 403 error, check the following:

      • Token has been revoked.

      • Token you are using may not be tied to an admin user.

      • You provided a token generated by an admin user and then deleted the user. In that case your token won't be valid anymore.

      • Token is tied to a service but the service may not have admin permission.

      • Token is tied to a service having admin permission but the service itself is no longer available.

    For the above cases, check how to generate an admin token
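
One way to narrow down a 403 is to call the hub's REST API directly with the token in question. The sketch below only constructs the request; the hub URL is a placeholder, and the commented-out call assumes the requests package and a running hub:

```python
HUB_URL = "http://localhost:8000"  # placeholder: wherever your hub runs

def admin_api_request(token: str, path: str = "/hub/api/users"):
    """Build the URL and headers for a JupyterHub admin API call."""
    return HUB_URL + path, {"Authorization": f"token {token}"}

url, headers = admin_api_request("<your admin token>")
print(url, headers)
# To actually exercise the token (needs `requests` and a running hub):
# import requests
# status = requests.get(url, headers=headers).status_code
# 200 means the token has admin access; 403 matches one of the cases above
```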

  • 500 Internal Server Error

    • Check the JupyterHub logs. If nothing is clear, reach out to our team for help. In most cases, the problem is with the app/website trying to access JupyterHub.
  • Connection refused

    • Your JupyterHub is not up. Check whether it is running.

    • Another reason could be that you are trying to connect to JupyterHub with the wrong host or port; for example, if you set the notebook driver in Sprinkle with the wrong host and port, you may see Connection refused. Check the configuration and correct it if needed.
