by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)
The Azure Data Science Virtual Machine (DSVM) is a curated VM that comes with commonly used tools and software for data science and machine learning pre-installed. AzureDSVM is a new R package that enables seamless interaction with the DSVM from a local R session, by providing functions for the following tasks:
Deployment, deallocation, and deletion of one or more DSVMs;
Remote execution of local R scripts: compute contexts available in Microsoft R Server can be enabled for more efficient computation on either a single DSVM or a cluster of DSVMs;
Retrieval of resource consumption and the total expense incurred by using DSVMs.
AzureDSVM is built upon the AzureSMR package and depends on the same set of R packages, such as httr and jsonlite. It requires the same initial setup in Azure Active Directory (for authentication).
To install AzureDSVM with the devtools package:
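A minimal sketch of the installation step follows. It assumes the package is hosted under the Azure organisation on GitHub, as the reference at the end of this post suggests:

```r
# Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the repository path is Azure/AzureDSVM.
devtools::install_github("Azure/AzureDSVM")

library(AzureDSVM)
```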
When deploying a Data Science Virtual Machine, the machine name, size, OS type, etc. must be specified. AzureDSVM supports DSVMs on Ubuntu, CentOS, Windows, and Windows with the Deep Learning Toolkit (on GPU-class instances). For example, the following code fires up a D4 v2 Ubuntu DSVM located in Southeast Asia:
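A minimal sketch of such a deployment is given here. The credential placeholders, resource group, host name, user name, and key file are hypothetical, and the argument names passed to deployDSVM are assumptions written from memory of the package, so check them against ?deployDSVM before use:

```r
library(AzureSMR)
library(AzureDSVM)

# Placeholder credentials from the Azure Active Directory app registration.
context <- createAzureContext(tenantID = "<tenant-id>",
                              clientID = "<client-id>",
                              authKey  = "<authentication-key>")

# Hypothetical SSH public key used to log in to the Linux DSVM.
pubkey <- readLines("~/.ssh/id_rsa.pub")

# Assumed argument names; the size, OS, and region follow the text above.
# The resource group "my_resource_group" is assumed to exist already.
deployDSVM(context,
           resource.group = "my_resource_group",
           location       = "southeastasia",
           hostname       = "mydsvm",
           username       = "myname",
           size           = "Standard_D4_v2",
           os             = "Ubuntu",
           pubkey         = pubkey)
```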
where context is an azureActiveContext object, created by the AzureSMR::createAzureContext() function, that encapsulates the credentials (Tenant ID, Client ID, etc.) needed for Azure authentication.
In addition to launching a single DSVM, the AzureDSVM package makes it easy to launch a cluster with multiple virtual machines. Multi-deployment supports:
creating a collection of independent DSVMs which can be distributed to a group of data scientists for collaborative projects, as well as
clustering a set of connected DSVMs for high-performance computation.
To create a cluster of 5 Ubuntu DSVMs with the default VM size, use:
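A sketch of what that call might look like follows. It assumes that context is an authenticated azureActiveContext object and that pubkey holds an SSH public key, both created beforehand. The function name deployDSVMCluster and its argument names are recalled from the package rather than stated in this post, so treat them as assumptions:

```r
library(AzureDSVM)

# `context` and `pubkey` are assumed to exist already (see the lead-in).
# The `size` argument is left out on purpose so that the package default applies.
cluster <- deployDSVMCluster(context,
                             resource.group = "my_resource_group",
                             location       = "southeastasia",
                             hostname       = "mydsvm",  # assumed base name for the nodes
                             username       = "myname",
                             os             = "Ubuntu",
                             pubkey         = pubkey,
                             count          = 5)
```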
To execute a local script on a remote cluster of DSVMs with a specified Microsoft R Server compute context, use the executeScript function. (NOTE: only Linux-based DSVM instances are supported at the moment, because remote execution is implemented over SSH. Microsoft R Server 9.x allows remote interaction for both Linux and Windows, and more details can be found here.) Here, we use the RxForeachDoPar context (as indicated by the compute.context option):
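The following sketch illustrates the shape of such a call. The host names, domain names, and script path are hypothetical. The argument names, and the string passed to compute.context to select RxForeachDoPar, are assumptions that should be checked against ?executeScript:

```r
library(AzureDSVM)

# Hypothetical names for a five-node cluster deployed in Southeast Asia.
hostnames <- paste0("mydsvm", 1:5)
fqdns     <- paste0(hostnames, ".southeastasia.cloudapp.azure.com")

# `context` is assumed to be an authenticated azureActiveContext object.
executeScript(context,
              resource.group  = "my_resource_group",
              hostname        = hostnames,
              remote          = fqdns[1],    # node that receives the script over SSH
              username        = "myname",
              script          = "path/to/local_script.R",
              master          = fqdns[1],
              slaves          = fqdns[-1],
              compute.context = "clusterParallel")  # assumed value for RxForeachDoPar
```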
Resource consumption and the associated expense for DSVMs can be retrieved with:
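As a rough sketch, the retrieval might look like the following. The function name expenseCalculator, its argument names, the time stamps, and the offer ID placeholder are all assumptions for illustration and are not taken from this post:

```r
library(AzureDSVM)

# `context` is assumed to be an authenticated azureActiveContext object.
# All argument names and values below are hypothetical; consult the package
# documentation for the actual interface.
expense <- expenseCalculator(context,
                             hostname    = "mydsvm",
                             time.start  = "2017-05-01 00:00:00",
                             time.end    = "2017-05-02 00:00:00",
                             granularity = "Daily",
                             currency    = "USD",
                             locale      = "en-US",
                             offerId     = "<offer-id-of-subscription>",
                             region      = "US")

print(expense)
```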
Detailed introductions and tutorials can be found in the AzureDSVM GitHub repository, linked below.
GitHub (Azure): AzureDSVM