Updated: 8 February 2018
It's been over a year since I wrote the original version of this article - and much has changed in the world of Data Science. I've decided to update the information from time to time, since it's the most popular article I've written - there is clearly a lot of demand and need out there for these topics. I've changed the format to have a table of the areas that a Data Scientist should know, and an example of a way to learn each one.
I've also been asked for an article on actually getting a job in Data Science. I thought about putting that information here, but I will create a second article for that alone, as I think it's important to focus on your learning journey, even while searching for a job in a Data Science role.
With a format for learning to be an Amateur Data Scientist established and a firm understanding of how you learn, it’s time to focus on what to learn.
There is no shortage of Internet posts, magazine articles, or college syllabi describing what a “Data Scientist” should know. I originally thought the term was still up for debate – but there are “real” Data Scientists who have formal degrees and years of experience with that official title (my team is full of them, save yours truly). In my case, though, I’m building this knowledge outside of a formal degree. Since I have to start somewhere, I’ll draw on these other references to lay out the knowledge path I need to follow. Feel free to modify it to your liking.
NOTE: There’s an absolutely wonderful visual representation of what a Data Scientist should know that you can find here: http://nirvacana.com/thoughts/becoming-a-data-scientist/ by Swami Chandrasekaran, and I would encourage you to look over his work. What I show here is independent of that grouping, but similar. Of course, he’s using several tools from IBM and I’m using the ones at Microsoft. Pick your stack and learn it well. Want to use Open Source only? Knock yourself out.
ALSO NOTE: I have never liked a “tools approach” to learning. Yes, you’ll need to learn several tools, and yes, I often use a tool to learn a thing (like using R to learn Statistics), but I focus on the concepts, not just how the work is done. First learn why you do something, and then learn how to do it. So learning concepts first and then choosing a tool is the route I’ll follow here.
Or you can simply follow a complete course online. There are several really good ones:
- Complete Certificate Course in Data Science from Microsoft on Edx: https://www.edx.org/microsoft-professional-program-certficate-data-science
- Coursera: https://www.coursera.org/course/datasci
- Udacity: https://www.class-central.com/mooc/1480/udacity-intro-to-data-science
- Caltech: http://work.caltech.edu/telecourse
- The Open Source Society: https://github.com/open-source-society/data-science
Among many, many others. See the comments below as well for even more.
Of course, if you want to “assemble your own”….
Note: I have biased this list towards things that we've published at Microsoft, although I've included some resources that we didn't publish. Keep in mind the "Asset" column is simply one of the many places you can go to learn these topics, and I would caution you against treating this list as your only stop for this information. There are lots of fine resources out there, and more being created every day, so I encourage you to do a web search on the "Technology/Concept" items as well as the "Topic" items. In any case, this list will serve you well in researching and learning more about the craft of Data Science.
|1||Math - Linear Algebra and College-level Statistics|
|Linear and Matrix Algebra||Linear Algebra with Matrix Transforms||Course|
|Core Statistics||Statistics and Probability; Essential Statistics for Data Analysis using Excel|
|Bayesian Methodologies in Modeling||Bayesian Networks||Book|
|General AI Mathematics||Essential Mathematics for AI||Course|
|2||Team Software Development|
|Agile||Agile Methods and Practices||Self-Guided|
|The Team Data Science Process||Primary Documentation|
|Source Control||Version Control||Self-Guided|
|Visual Studio Code||Visual Studio Code Site||Self-Guided|
|PyCharm||PyCharm getting Started||Course|
|4||Data Constructs and Data Programming (SQL, Graphs)|
|Algorithms and Data Structures||Algorithms and Data Structures||Course|
|Data Modeling||Introduction to Data Modeling||Course|
|Data programming with SQL||Learn SQL; Querying Data with Transact-SQL|
|Graph Database Programming||Graph programming with the Gremlin API||Self-Guided|
|NoSQL Systems||Introduction to NoSQL Data Solutions||Course|
|5||Exploratory Data Analytics|
|EDA Methods||Exploratory Data Analysis||Book|
|6||Advanced Analytics and Business Analytics|
|Data Analytics and Business Intelligence||MCSE in Business Intelligence||Course|
|7||Programming Languages used in Data Science (R, Python)|
|R Programming||Introduction to R for Data Science||Course|
|Python||Introduction to Python; Introduction to Python for Data Science|
|8||Big Data Processing Technologies|
|Hadoop/Spark||Introduction to Big Data; HDInsight Developer Guide; Processing Big Data on Azure; Spark on HDInsight; Implementing Real-Time Analytics with Hadoop; Implementing Predictive Analytics with Spark|
|9||Research methods (including hypothesis definition and testing)|
|Research Methods Overview||Research Methods||Overview|
|Hypothesis Testing||Hypothesis Testing: Methodology and Limitations||Book|
|10||Data Science Algorithms and Data Analysis Techniques|
|Data Science||Data Science Essentials||Course|
|Machine Learning||Introduction to Machine Learning||Overview|
|Algorithms||Choosing the right Estimator; How to choose algorithms for Microsoft Azure Machine Learning; Reinforcement Learning Explained|
|Deep Learning||Deep Learning Explained||Course|
|Artificial Intelligence||Introduction to Artificial Intelligence||Course|
|11||AI Model Management and Operationalization|
|Operationalization||Building Your Azure Skills Toolkit; Developing Intelligent Apps and Bots; Operationalize analytics with Machine Learning Server|
|Model Management||Machine Learning Model Management||Resource|
|12||Domain Expertise||(Various Industry Verticals)||(Various Industry Verticals)|
Here are a few other views on what a Data Scientist should know:
- Job interview questions for data scientists: http://www.datasciencecentral.com/profiles/blogs/66-job-interview-questions-for-data-scientists?goback=.mid_I207394501*45_*1
- So you Want to be a Data Scientist? http://www.jeffheaton.com/2014/02/so-you-want-to-be-a-data-scientist/
- 9 Must-Have Skills To Land Top Big Data Jobs in 2015: http://allabttech.com/2015/07/02/9-must-have-skills-to-land-top-big-data-jobs-in-2015/
- My data science journey: http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey
- More free data science books: http://www.learndatasci.com/free-books/
- More than 100 data science, analytics, big data, visualization books: http://www.datasciencecentral.com/profiles/blogs/more-than-100-data-science-analytics-big-data-visualization-books