Implementing an Agile Methodology for Managing Your Institution's Data Science Processes

Microsoft’s Team Data Science Process (TDSP) is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently.

TDSP aims to improve team collaboration and learning. It is a distillation of best practices and structures from Microsoft and others in the industry that facilitate the successful implementation of data science initiatives, helping companies fully realize the benefits of their analytics programs.

The purpose of this blog is to share the best practices of the Team Data Science Process and its main components.

The process comprises the following key components:

  • A data science lifecycle definition
  • A standardized structure for projects
  • Analytics infrastructure management
  • Productivity tools and utilities

1. Data Science Lifecycle

Data science is a highly iterative discovery process, with emphasis on evaluating and validating each step. Each iteration refines the hypotheses and predictive models to arrive at a sound solution.

Some sobering facts

Only 27% of the big data projects are regarded as successful

Only 13% of organisations have achieved full-scale production for their

Only 8% of the big data projects are regarded as VERY successful

Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years

– Dataversity 2015 Survey

Doing data science therefore involves bringing together people with different skills and skill levels to build intelligent applications that leverage data, and coordinating a broad spectrum of activities.

How many of you have worked with teams that had both data scientists and software engineers? How did that work? Did they always communicate and coordinate well among themselves? Real-world data is messy. You need data engineering to collect, process, and store data efficiently. The scale of data that companies are now collecting is mind-boggling.

Big Data implementations

The data science lifecycle defines a systematic sequence of steps that starts with planning, where the business problem is framed, proceeds to the development of predictive analytics models, and ends with the consumption of their predictions by intelligent applications. If you are using an existing lifecycle such as CRISP-DM, KDD, or your own custom process that works well in your organization, you can still use TDSP in the context of those development lifecycles. If you don't have one in use already, TDSP provides a staged data science lifecycle for you to adopt. At a high level, the different methodologies have much in common, and you can easily map the steps in the TDSP lifecycle to those in these other popular methodologies.
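To illustrate what such a mapping could look like, the sketch below pairs the five TDSP lifecycle stages with CRISP-DM phases. The pairing is an approximate assumption made for this example, not an official correspondence, and the names `TDSP_TO_CRISP_DM` and `tdsp_stages_for` are hypothetical.

```python
# Approximate, illustrative mapping from TDSP lifecycle stages to CRISP-DM
# phases. The pairing is an assumption for this sketch, not an official one.
TDSP_TO_CRISP_DM = {
    "Business Understanding": ["Business Understanding"],
    "Data Acquisition and Understanding": ["Data Understanding", "Data Preparation"],
    "Modeling": ["Modeling", "Evaluation"],
    "Deployment": ["Deployment"],
    "Customer Acceptance": ["Deployment"],
}


def tdsp_stages_for(crisp_dm_phase):
    """Return the TDSP stages that cover the given CRISP-DM phase."""
    return [
        stage
        for stage, phases in TDSP_TO_CRISP_DM.items()
        if crisp_dm_phase in phases
    ]


if __name__ == "__main__":
    for phase in ["Data Preparation", "Evaluation", "Deployment"]:
        print(phase, "is covered by", tdsp_stages_for(phase))
```

A team already using CRISP-DM could keep a table like this in its project documentation so that work items filed under one lifecycle can be located in the other.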

Here is a depiction of the TDSP lifecycle.

TDSP_LIFECYCLE

Details of each stage of the lifecycle in TDSP are found here.

The following diagram provides a detailed task view for each of the roles working together on a data science initiative at each stage of the lifecycle.

TDSP_SWIMLANE

2. Standardized structure for projects

Having all projects share a directory structure and use templates for project documents makes it easy for the team members to find information about their projects. TDSP recommends creating a separate repository for each project on the version control system for versioning, information security, and collaboration. We provide templates for the folder structure and required documents. Examples include:

  • a project charter to document the business problem and scope of the project
  • data reports to document the structure and statistics of the raw data
  • model reports to document the derived features
  • model performance metrics such as ROC curves or MSE

TDSP_DIR_STRUCT

The directory structure can be cloned from GitHub.
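If you prefer to see the idea in code before cloning the template, the following is a minimal sketch of scaffolding a standardized project layout with document stubs. The folder names, file names, and the `scaffold_project` function are assumptions made for illustration; the actual template on GitHub may be organised differently.

```python
import tempfile
from pathlib import Path

# Hypothetical folder layout for illustration; the real template may differ.
FOLDERS = [
    "Code/Data_Acquisition_and_Understanding",
    "Code/Modeling",
    "Code/Deployment",
    "Docs/Project",
    "Docs/Data_Report",
    "Docs/Model",
    "Sample_Data",
]

# Hypothetical document stubs matching the document types listed above.
DOCUMENT_STUBS = {
    "Docs/Project/Charter.md": "# Project Charter\n\n## Business problem\n\n## Scope\n",
    "Docs/Data_Report/Data_Summary.md": "# Data Report\n\n## Structure\n\n## Statistics\n",
    "Docs/Model/Model_Report.md": "# Model Report\n\n## Derived features\n\n## Performance metrics\n",
}


def scaffold_project(root):
    """Create the folders and any missing document stubs under `root`.

    Returns the relative paths of the documents that were newly created.
    Existing documents are left untouched, so the call is safe to repeat.
    """
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    created = []
    for rel_path, content in DOCUMENT_STUBS.items():
        doc = root / rel_path
        if not doc.exists():
            doc.write_text(content, encoding="utf-8")
            created.append(rel_path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than the working directory.
    with tempfile.TemporaryDirectory() as tmp:
        print(scaffold_project(Path(tmp) / "my_project"))
```

Because existing documents are never overwritten, the same function can be rerun on an older project to add any stubs it is missing.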

3. Management of Analytics Infrastructure

TDSP provides recommendations for managing shared analytics and storage infrastructure such as cloud file systems for storing datasets, databases, Big Data (Hadoop, Spark) clusters, and machine learning services. The analytics and storage infrastructure can be in the cloud or on-premises.

Here is how we implement TDSP at Microsoft. We hope you find it useful and can adopt it to make your own data science work more productive.

At Microsoft, we practice TDSP in the following four ways:

1. We use Data Science Virtual Machines (DSVMs) as our basic computational platform.

Currently, DSVMs are available for both Linux and Windows. A DSVM comes with a rich set of data science languages and tools, including Anaconda Python, open source R, and Microsoft R Services (Developer version). It also includes many useful tools for managing your resources in Azure. DSVMs also give managers better control over their team's computational resources: a manager can easily switch ownership of a DSVM between data scientists, or have multiple data scientists work on the same DSVM, which saves both time and cost.

2. We use Visual Studio Team Services (VSTS) to track work items and do scrum planning, and we use the Git repositories that come with VSTS for versioning, information security, and a standardized project execution workflow.

3. We share common utilities within a group or team to make the different steps of the data science process more efficient. These utilities factor out scripts that recur across projects, so that teams do not reinvent the wheel on every project.

4. Azure provides a variety of data storage and analysis platforms as managed services to tackle different business challenges. We provision these platforms as resources shared within a team, which reduces the cost of our data science work.

TDSP_INFRA

4. Productivity tools and utilities

TDSP provides an initial set of tools and scripts to jump-start its adoption within a team and to automate some of the common tasks in the data science lifecycle, such as data exploration and baseline modelling. It also provides a well-defined structure for individuals to contribute shared tools and utilities to their team's shared code repository.
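To give a feel for what a small shared utility covering these two tasks might look like, here is a minimal sketch using only the Python standard library. It is not taken from the TDSP utilities repository; the function names `summarize_column` and `mean_baseline_mse` and the sample values are hypothetical.

```python
import statistics


def summarize_column(values):
    """Basic exploration statistics for one numeric column.

    Missing entries are represented as None and are excluded from the
    statistics. Assumes at least one non-missing value is present.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
    }


def mean_baseline_mse(train_targets, test_targets):
    """Mean squared error of a baseline that always predicts the training mean."""
    prediction = statistics.mean(train_targets)
    squared_errors = [(y - prediction) ** 2 for y in test_targets]
    return statistics.mean(squared_errors)


if __name__ == "__main__":
    # Hypothetical sample values, used only to exercise the functions.
    print(summarize_column([1, 2, None, 3]))
    print(mean_baseline_mse([1, 2, 3], [2, 4]))
```

A baseline of this kind gives the team a reference score that any more elaborate model has to beat, which makes it a natural candidate for a shared repository.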

Benefits

TDSP is aimed primarily at data science teams developing the analytic components of a predictive analytics solution. We specifically target teams utilising cloud-based assets for compute and storage, as is increasingly the norm for elastic resources.

As a manager, the benefits for your team are:

Organization: A single place to go to find code, documentation, and all artifacts for all your team's projects.

Standardisation: Code, data, and documents are organised the same way, so when a second pair of eyes is required or a new member joins the team, files are where you expect them to be and documents share the same naming and structure.

Knowledge: One of the biggest challenges for data science teams in which individuals or small groups work largely independently on a variety of projects is how to accumulate and share learnings, tools, and best practices. TDSP provides a central shared utilities repository and a methodology for individual projects to share and contribute to this team-wide resource. In addition, simply by virtue of having a documented process, improvements can be recorded and passed on to those executing subsequent projects.

Security: VSTS provides for detailed role-based access control on repositories and work items.

As a data scientist, your benefits are:

Productivity: Out of the box, we provide various utilities for data exploration and baseline model construction. As the team customizes and builds up domain-specific scripts and utilities over time, these are easily shared. In addition, document templates serve as a guide to the information required and already provide basic formatting, just as IDEs speed up code creation.

Collaboration: Teams can seamlessly collaborate on distributed compute (VMs), with a simple way to share data assets and run large analytics jobs without contending for resources, all the while contributing to a single repository so that the product appears to be the work of a single scientist.

Resources

These resources can then be leveraged by other projects within the team or the organisation.

Find all our instructions on GitHub: https://aka.ms/tdsp

Productivity utilities and tools repository: https://github.com/Azure/Azure-TDSP-Utilities

Sign up for a free VSTS account: https://www.visualstudio.com

Sign up and try Azure Data Science Virtual Machines