Demystifying decision forests

By Michele Usuelli, Lead Data Scientist
This article doesn’t require a data science background, only a basic understanding of predictive analytics. All other concepts are explained from scratch, including a popular algorithm called the “decision forest”. Throughout the article you won’t see any fancy or advanced machine learning algorithms, but by the…

What is the role of a data scientist?

By Michele Usuelli, Lead Data Scientist
Data Science has been around for decades, but it has recently grown in popularity among companies. Although the tools and techniques already existed, some things have changed. Digital technologies generate more data that can drive new advanced analytics use cases. Also, there are more success stories showcasing the value in data, making…

Scaling up Scikit-Learn's Random Projection using Apache Spark

By Sashi Dareddy, Lead Data Scientist
What is Random Projection (RP)? Random Projection is a mathematical technique for reducing the dimensionality of a problem, much like Singular Value Decomposition (SVD) or Principal Component Analysis (PCA), but simpler and computationally faster. [Throughout this article, I will use Random Projection and Sparse Random Projection interchangeably.] It…

Scaling a recommender system across large data volumes

By Michele Usuelli, Data Scientist Consultant
Building a recommendation engine in the presence of large data volumes: E-commerce businesses can suggest new products to their customers. How do they choose the products to recommend? The companies collect data about their customers' purchases. Starting from the purchase history, they can identify items that have been…

Analysing data in SQL Server 2016, combining R and SQL

By Michele Usuelli, Data Scientist Consultant
Overview: R is one of the most popular programming languages for statistics and machine learning, and SQL is the lingua franca for data manipulation. In an advanced analytics scenario, we need to pre-process the data and build machine learning models. A good solution consists of using each tool…

Validating a model in R Services using the k-fold

By Michele Usuelli, Data Scientist Consultant
Why k-fold: Predictive modelling consists of predicting a future outcome based on data. Starting from data whose outcome is already known, predictive models detect patterns that had an impact on the outcome. Then, given data whose outcome is unknown, the model looks for the same…

Using R Services in SQL Server 2016 Release Candidate 2 (RC2)

Author: Benjamin Wright-Jones
Contributors: Sander Timmer, Derek Norton
Reviewers: Anderson Chan
The results of the recent IEEE Survey (2015) clearly show the rising interest in R (the lingua franca of data scientists). In SQL Server 2016, R Services will be available, leveraging the highly scalable and parallel algorithms from the Revolution Analytics engine. SQL Server…

Evaluating Machine Learning models when dealing with imbalanced classes

By Sander Timmer, PhD
In real-world Machine Learning scenarios, especially those driven by IoT devices that are constantly generating data, a common problem is having an imbalanced dataset. This means we have far more data representing one outcome class than the other. For example, when doing predictive maintenance, there is (far) more data available about the healthy…

PowerShell Script To Invoke ML Scoring Part I

By Earle Sinnatamby, Consultant
Objective: The purpose of this blog post is to provide a PowerShell alternative to using Azure Data Factory to perform Machine Learning (ML) scoring. The pilot engagement required daily on-premises data to be uploaded into Azure Blob Storage. Each uploaded data file required daily rescoring with an ML script provided by the client…

Why Public Cloud beats Private Cloud for Analytics: A Data Warrior’s Perspective

By Bill Eldredge, Associate Architect
As the former head of the Big Data Management and Governance team at Nokia, I was responsible for managing our internal business customers’ needs and expectations for the use of the private Hadoop cloud and related Big Data assets that we spent five years building and maintaining. Unfortunately, several of those years amounted…