Notebook Setup Guide
About this guide
This document helps users set up their own JupyterHub with different authenticators and spawners. It also covers what is required for Jupyter notebooks to work with Sprinkle data.
Topics
1. Jupyter Installation
Check whether Java is installed on the machine. If it is not, install Java 8.
sudo apt-get install openjdk-8-jdk
Installing anaconda3 with python3.7
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh
source ~/.bashrc
For more information about the anaconda3 installation, refer to the Anaconda installation documentation.
Installing jupyter
conda install -c conda-forge jupyterhub
# installs jupyterhub and proxy
conda install notebook
# needed if running the notebook servers locally
For more information about the Jupyter installation, refer to the JupyterHub installation documentation.
2. Common configurations
These configurations apply regardless of which authenticator and spawner you choose.
Change your directory to /anaconda3 and generate the jupyterhub_config.py file, which holds the JupyterHub configuration:
jupyterhub --generate-config
Open jupyterhub_config.py, search for c.JupyterHub.admin_access, and set it to True:
c.JupyterHub.admin_access = True
This enables external applications that hold an admin token to access the notebook servers with full admin permissions.
Set the IP address of your machine in the config file so that external applications can reach your hub:
c.JupyterHub.hub_ip = '<ip address of your machine>'
It is recommended to create a dedicated environment for your notebooks. Create a new conda environment as follows.
If base is not active, activate conda first. Go to the /anaconda3 directory and run:
source bin/activate
Create a python3 environment:
conda create -n example python=3.7
You can replace example with any other name for your environment.
Activate the environment:
conda activate example
Install the following packages in the created environment:
conda install -c conda-forge -y \
    conda-pack \
    pandas \
    numpy \
    pyspark \
    requests \
    time \
    scikit-learn \
    matplotlib
Other useful packages you may want to install:
datetime, tensorflow, matplotlib, pytorch, statsmodels, seaborn, findspark, gspread, oauth2client, urllib3, statistics, pygsheets, pandas_gbq, openpyxl, selenium, sklearn, xgboost, geopy, gmplot
Use conda-forge to install the above packages. If a package is not available in conda-forge, you can use pip, pip3, or any other preferred installer.
3. Authenticators
By default, JupyterHub uses PAMAuthenticator.
Dummy Authenticator
If you are not concerned about security and want to allow anyone to log in, use this authenticator. It accepts any username and password unless a global password has been set. Once a global password is set, any username is still accepted, but the correct password must be provided.
Search for c.JupyterHub.authenticator_class in jupyterhub_config.py and set it as follows:
c.JupyterHub.authenticator_class = 'jupyterhub.auth.DummyAuthenticator'
OAuthenticator
If you want to authenticate with OAuth, follow the steps below.
First, create Google OAuth credentials. If you already have them, you can skip this part.
- Log in to the Google API console.
- To create OAuth credentials, you need an app created in the OAuth consent screen. Create an app by entering an appropriate name, user support email, and developer contact information.
- Create the OAuth credentials: click CREATE CREDENTIALS -> OAuth client ID.
- Select Web application as the Application type.
- Add a redirect URI. Its format should be http[s]://[your-host]/hub/oauth_callback, where [your-host] is where your server will be running, such as localhost:8000.
- Copy the generated Client ID and Client Secret.
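The redirect URI format described above can be sketched as a small Python helper. This is only an illustration: the function name oauth_callback_url and its parameters are hypothetical and are not part of JupyterHub or OAuthenticator.

```python
def oauth_callback_url(host: str, use_https: bool = False) -> str:
    """Build a redirect URI of the form http[s]://[your-host]/hub/oauth_callback.

    `host` is where the hub will be running, for example "localhost:8000".
    This helper is hypothetical and only illustrates the expected format.
    """
    scheme = "https" if use_https else "http"
    # Strip a trailing slash so the path separator is not doubled.
    return f"{scheme}://{host.rstrip('/')}/hub/oauth_callback"


print(oauth_callback_url("localhost:8000"))
```

The printed value is what you would enter as the redirect URI in the Google API console and later as the callback URL in jupyterhub_config.py.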
General Setup:
Activate your conda environment.
Install oauthenticator using
conda install oauthenticator
or
pip3 install oauthenticator
Add the command used to create users. This is mandatory: without it, external applications like Sprinkle won't be able to create a notebook user.
For a Linux-based machine use
c.Authenticator.add_user_cmd = ['adduser', '-q', '--gecos', '""', '--disabled-password', '--force-badname']
It is important to add --force-badname, because Linux considers a name containing . to be a bad name. Your Gmail account may contain a ., for example [email protected]. In that case you would be able to authenticate, but the system user could not be created.
For an OSX-based machine use
c.Authenticator.add_user_cmd = ['dscl', '.', '-', 'create', '/', 'User', '/']
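The following Python sketch illustrates why --force-badname matters on Linux. It rests on two assumptions that you should verify on your system: that adduser validates names against the Debian/Ubuntu default pattern ^[a-z][-a-z0-9_]*$, and that JupyterHub appends the username as the final argument of add_user_cmd. The helper names are hypothetical.

```python
import re

# Assumption: default NAME_REGEX used by Debian/Ubuntu adduser; your
# /etc/adduser.conf may define a different pattern.
DEFAULT_NAME_REGEX = re.compile(r"^[a-z][-a-z0-9_]*$")

# The Linux command from this guide.
ADD_USER_CMD = ['adduser', '-q', '--gecos', '""', '--disabled-password', '--force-badname']


def is_bad_name(username: str) -> bool:
    """Return True if adduser would reject the name without --force-badname."""
    return DEFAULT_NAME_REGEX.match(username) is None


def full_add_user_command(username: str) -> list:
    """Return the argument list that is assumed to be executed for a new user."""
    # Assumption: the username is appended as the last argument.
    return ADD_USER_CMD + [username]


# "first.last" is a hypothetical username containing a dot.
print(is_bad_name("first.last"))
print(full_add_user_command("first.last"))
```

A name with a dot fails the pattern, which is why the flag is needed for Google accounts whose names contain dots.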
Now you can set up the OAuthenticator of your choice.
GoogleOAuthenticator
Set the following in your jupyterhub_config.py file:
from oauthenticator.google import GoogleOAuthenticator
c.JupyterHub.authenticator_class = GoogleOAuthenticator
c.GoogleOAuthenticator.create_system_users = True
c.GoogleOAuthenticator.client_id = '<your client id>'
c.GoogleOAuthenticator.client_secret = '<your client secret>'
c.GoogleOAuthenticator.oauth_callback_url = '<your callback url>'
LocalGoogleOAuthenticator
Set the following in your jupyterhub_config.py file:
from oauthenticator.google import LocalGoogleOAuthenticator
c.JupyterHub.authenticator_class = LocalGoogleOAuthenticator
c.LocalGoogleOAuthenticator.create_system_users = True
c.LocalGoogleOAuthenticator.client_id = '<your client id>'
c.LocalGoogleOAuthenticator.client_secret = '<your client secret>'
c.LocalGoogleOAuthenticator.oauth_callback_url = '<your callback url>'
Replace <your client id> and <your client secret> with the credentials you copied earlier. Replace <your callback url> with the redirect URI you added while creating the client ID.
If you want to restrict your hub to a limited domain, add the following:
c.GoogleOAuthenticator.hosted_domain = ['domain.name']
Replace domain.name with your domain name, for example gmail.com.
If you want to restrict your hub to certain users, you can whitelist them. Only the whitelisted users will be authenticated.
c.Authenticator.whitelist = {'username1', 'username2'}
In JupyterHub 1.2 and later this option is named c.Authenticator.allowed_users.
See the OAuthenticator documentation for other supported authentication services.
See the JupyterHub documentation for other authenticators.
4. Spawners
By default, JupyterHub uses LocalProcessSpawner.
Simple Spawner
This is one of the simplest spawners. Open your jupyterhub_config.py and set
c.JupyterHub.spawner_class = 'jupyterhub.spawner.SimpleLocalProcessSpawner'
and you are done. This will launch your notebook servers on your machine.
Yarn Spawner
If you want to launch notebook servers on Apache Hadoop/YARN clusters, use yarnspawner.
Steps to be followed:
Activate the environment.
Install yarnspawner:
conda install -c conda-forge jupyterhub-yarnspawner
Install notebook with pip. Pip is required to avoid a hardcoded path in the kernelspec (for now):
pip install notebook
Package the environment into environment.tar.gz:
conda pack -o environment.tar.gz
Upload the environment to HDFS:
hdfs dfs -mkdir hdfs://<ip address of your machine>:8020/environments/
hdfs dfs -chown <username> hdfs://<ip address of your machine>:8020/environments/
hdfs dfs -chmod 744 hdfs://<ip address of your machine>:8020/environments/
hdfs dfs -put environment.tar.gz hdfs://<ip address of your machine>:8020/environments/
Here the IP address is that of the machine on which Hadoop is installed.
If you want to create a separate queue for notebooks, follow these steps:
Go to ambari -> yarn queue manager -> add queue and name it notebook (you can choose any name you want).
Set capacity and user limit factor according to the environment.
Set up jupyterhub_config.py for yarnspawner:
c.JupyterHub.spawner_class = 'yarnspawner.YarnSpawner'
c.YarnSpawner.localize_files = {
    'environment': {
        'source': 'hdfs://<ip address of your machine>:8020/environments/environment.tar.gz',
        #'visibility': 'public'
    }
}
c.YarnSpawner.prologue = 'source environment/bin/activate'
# The memory limit for a notebook instance.
c.YarnSpawner.mem_limit = '2 G'
# The YARN queue to use. You can omit this if you don't want a separate queue.
c.YarnSpawner.queue = 'notebook'
You might need to configure Ambari for JupyterHub to run with yarnspawner:
Go to HDFS -> configs -> advanced -> Custom core-site and add the following properties:
hadoop.proxyuser.username.groups = *
hadoop.proxyuser.username.hosts = *
Here username is the name of the user on the machine where JupyterHub is running. You can find it by running the whoami command in a terminal.
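As an illustration, the two proxy-user properties above can be generated from a username with a few lines of Python. The function name proxyuser_properties and the example user hubuser are hypothetical. In practice you would pass the name printed by whoami.

```python
def proxyuser_properties(username: str) -> dict:
    """Return the Custom core-site properties for the given hub user.

    Both properties are set to "*", as in the steps above.
    """
    return {
        f"hadoop.proxyuser.{username}.groups": "*",
        f"hadoop.proxyuser.{username}.hosts": "*",
    }


# "hubuser" is a placeholder for the output of `whoami`.
for key, value in proxyuser_properties("hubuser").items():
    print(f"{key} = {value}")
```

The printed lines are the property names and values to enter in Ambari.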
If you want to use Spark from the notebook, add the spark2 service if it has not been added already:
- Go to ambari -> Add new service -> spark2
See the yarnspawner documentation for details.
See the JupyterHub documentation for other supported spawners.
5. Start jupyterhub
Go to the location where anaconda is installed.
Activate your environment:
source bin/activate
conda activate example
where example is the name of your environment
Run jupyter using command
jupyterhub
At this point JupyterHub should be running on port 8000.
6. Generate Admin Token
Make sure c.JupyterHub.admin_access is set to True:
c.JupyterHub.admin_access = True
Generate an admin token in either of the following two ways:
Admin User
First, create an admin user. Search for c.Authenticator.admin_users and set it to the name of the user you want to make an admin:
c.Authenticator.admin_users = {'username'}
If you are using a Google-based authenticator, then for user [email protected] the user name will be nalinswarup.
Log in to JupyterHub as the admin user.
Go to the control panel in the top right corner. You should be able to see an admin tab. If you cannot, you are not an admin, and tokens generated by non-admins won't work.
Click on the token tab, just before the admin tab.
Click Request new API token.
Copy the generated token. This is the token Sprinkle asks for in Sprinkle's Jupyter notebook driver.
If you provide a token generated by an admin user and later delete that user, the token will no longer be valid.
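To show how an external application might use the copied token, here is a minimal Python sketch that prepares a request to the JupyterHub REST API. It assumes the hub is reachable at http://localhost:8000 and that the token is sent in an Authorization: token ... header. The helper name hub_api_request is hypothetical, and this is not how Sprinkle itself is implemented.

```python
import urllib.request


def hub_api_request(hub_url: str, token: str, path: str = "users") -> urllib.request.Request:
    """Prepare (but do not send) a request to the hub's REST API under /hub/api/."""
    url = f"{hub_url.rstrip('/')}/hub/api/{path.lstrip('/')}"
    # The API token is passed in the Authorization header.
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


# Placeholder values: replace the URL and token with your own.
request = hub_api_request("http://localhost:8000", "<your admin token>")
print(request.full_url)

# Sending the request needs a running hub, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

If the token does not belong to an admin, a request like this is expected to fail with 403 Forbidden.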
Admin Service
You can also create a service for external apps. Open jupyterhub_config.py and add
c.JupyterHub.services = [
    {
        'name': 'myservice',
        'admin': True,
        'api_token': '<random 32-byte hex token>'
    }
]
You can create a random 32-byte hex token (64 hex digits) using
openssl rand -hex 32
or any other known method. Make sure that admin is set to True.
If you delete the service, the token will no longer be valid.
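As one example of another known method, Python's standard secrets module can generate an equivalent token. This is a sketch and the function name generate_api_token is hypothetical. Like openssl rand -hex 32, it produces 32 random bytes written as 64 hexadecimal digits.

```python
import secrets


def generate_api_token(num_bytes: int = 32) -> str:
    """Return a random token of `num_bytes` bytes encoded as hexadecimal text."""
    return secrets.token_hex(num_bytes)


token = generate_api_token()
# Each byte becomes two hex digits, so 32 bytes give 64 characters.
print(len(token))
```

Paste the generated value into the api_token field of the service and keep it secret, because it carries admin permissions.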
7. Install new packages in your environment
Stop JupyterHub.
Activate your environment.
Install the needed package:
conda install -c conda-forge your_package_name
If you are using yarn spawner, you also have to do the following:
Do the steps above.
Remove environment.tar.gz if it is present in the anaconda3 directory:
rm -rf environment.tar.gz
Remove environment.tar.gz from the Hadoop machine:
hdfs dfs -rm -r hdfs://<ip of the machine where hadoop is running>:8020/environments/environment.tar.gz
Create a new environment.tar.gz file with the newly installed packages:
conda pack -o environment.tar.gz
Upload the tar file to the Hadoop machine:
hdfs dfs -put environment.tar.gz hdfs://<ip of the machine where hadoop is running>:8020/environments/
Restart JupyterHub.
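The yarn spawner repackaging steps above could be collected in a small Python helper. The sketch below only builds and prints the commands and does not run them. The function name repack_commands is hypothetical, and port 8020 is taken from the examples in this guide.

```python
ARCHIVE = "environment.tar.gz"


def repack_commands(hadoop_host: str, port: int = 8020) -> list:
    """Return the repackaging commands, in order, as argument lists."""
    hdfs_dir = f"hdfs://{hadoop_host}:{port}/environments/"
    return [
        ["rm", "-rf", ARCHIVE],                           # remove the old local archive
        ["hdfs", "dfs", "-rm", "-r", hdfs_dir + ARCHIVE],  # remove the old archive from HDFS
        ["conda", "pack", "-o", ARCHIVE],                 # package the updated environment
        ["hdfs", "dfs", "-put", ARCHIVE, hdfs_dir],       # upload the new archive
    ]


# The host below is a placeholder for the Hadoop machine's IP address.
for command in repack_commands("<ip of the hadoop machine>"):
    print(" ".join(command))
```

To execute the commands instead of printing them, you could pass each list to subprocess.run(command, check=True) from inside the activated environment.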
8. Troubleshoot
Not able to open ambari after conda installation
After conda is installed, the base environment is activated automatically every time you open a terminal, and your conda environment runs on python3. This can cause problems opening Ambari if Ambari runs with a different Python version on the same machine where JupyterHub is installed. If you don't want conda to activate automatically, run
conda config --set auto_activate_base false
If you want to deactivate conda only for the current session, use
conda deactivate
Able to run notebooks through Sprinkle but other APIs are failing
This happens if c.JupyterHub.admin_access is not set to True. Setting it to True grants admin users permission to access single-user servers.
c.JupyterHub.admin_access = True
400 Bad Request
This generally occurs when the request sent to the web server was malformed, that is, the request itself was incorrect or corrupted and the server could not understand it. Check the JupyterHub logs to see whether the request reached the hub. The cause might be malformed request syntax, invalid request message framing, or deceptive request routing. In most cases, the problem lies with the website or application that sent the request.
403 Forbidden
If you are trying to reach JupyterHub through an external app (for example, launching a notebook through Sprinkle) and you get a 403 error, check the following:
Token has been revoked.
Token you are using may not be tied to an admin user.
You provided a token generated by an admin user and then deleted the user. In that case your token won't be valid anymore.
Token is tied to a service but the service may not have admin permission.
Token is tied to a service having admin permission but the service itself is no longer available.
For the cases above, see section 6, Generate Admin Token.
500 Internal Server Error
Check the JupyterHub logs. If nothing is clear, reach out to our team for help. In most cases, the problem is with the app or website trying to access JupyterHub.
Connection refused
JupyterHub may not be up. Check whether it is running.
Another possible reason is that you are trying to connect to JupyterHub with the wrong host or port. For example, if you have set up the notebook driver in Sprinkle with the wrong host and port, you may get Connection refused. Check the configuration and correct it if needed.
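A quick way to tell these two causes apart is to test whether anything is listening on the host and port you configured. The Python sketch below does this with a plain TCP connection. The function name hub_is_reachable is hypothetical, and localhost and 8000 are example values.

```python
import socket


def hub_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example values: use the host and port set in your notebook driver.
print(hub_is_reachable("localhost", 8000))
```

If this prints False on the machine where JupyterHub should be running, the hub is most likely down. If it prints True there but the external application still fails, check the host and port configured in that application.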