A year ago, I was a numbers geek with no coding background. After trying an online programming course, I was so inspired that I enrolled in one of the best computer science programs in Canada.
Two weeks later, I realized that I could learn everything I needed through edX, Coursera, and Udacity instead. So I dropped out.
The decision was not difficult. I could learn the content I wanted faster, more efficiently, and for a fraction of the cost.
I already had a university degree and, perhaps more importantly, I already had the university experience. Paying $30K+ to go back to school seemed irresponsible.
I started creating my own data science master’s degree using online courses shortly afterwards, after realizing it was a better fit for me than computer science. I scoured the introduction to programming landscape. For the first article in this series, I recommended a few coding classes for the beginner data scientist.
Now onto statistics and probability.
I have taken a few courses, and audited portions of many. I know the options out there, and what skills are needed for learners preparing for a data analyst or data scientist role.
For this guide, I spent 15+ hours trying to identify every online intro to statistics and probability course offered as of November 2016, extracting key bits of information from their syllabi and reviews, and compiling their ratings. For this task, I turned to none other than the open source Class Central community and its database of thousands of course ratings and reviews.
Since 2011, Class Central founder Dhawal Shah has kept a closer eye on online courses than arguably anyone else in the world. Dhawal personally helped me assemble this list of resources.
How we picked courses to consider
Each course must fit four criteria:
- It must be an introductory course with little to no statistics or probability experience required.
- It must be on-demand or offered every few months.
- It must be of decent length: at least ten hours in total for estimated completion.
- It must be an interactive online course, so no books or read-only tutorials. Though these are viable ways to learn statistics and probability, this guide focuses on courses.
We believe we covered every notable course that fits the above criteria. Since there are seemingly hundreds of courses on Udemy, we chose to consider the most-reviewed and highest-rated ones only. There’s always a chance that we missed something, though. So please let us know in the comments section if we left a good course out.
How we evaluated courses
We compiled average rating and number of reviews from Class Central and other review sites. We calculated a weighted average rating for each course. If a series had multiple courses (like the University of Texas at Austin’s two-part “Foundations of Data Analysis” series), we calculated the weighted average rating across all courses. We read text reviews and used this feedback to supplement the numerical ratings.
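As a rough illustration of the arithmetic, here is a minimal sketch of a review-weighted average. It assumes each source's average rating is weighted by its number of reviews, which is one common reading of "weighted average" and not necessarily the exact procedure we used. The function name `weighted_average_rating` and the sample numbers are hypothetical.

```python
def weighted_average_rating(ratings):
    """Combine (average_rating, review_count) pairs into one rating.

    Assumption: each average is weighted by its number of reviews.
    """
    total_reviews = sum(count for _, count in ratings)
    if total_reviews == 0:
        return None  # no reviews, so there is no rating to report
    weighted_sum = sum(avg * count for avg, count in ratings)
    return weighted_sum / total_reviews


# Hypothetical numbers for a two-part series, not taken from any real course.
series = [(4.0, 10), (5.0, 30)]
print(weighted_average_rating(series))  # 4.75
```

Weighting by review count keeps a course with a single 5-star review from outranking one with dozens of strong reviews, which is also why small sample sizes are flagged throughout this guide.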
We made subjective syllabus judgment calls based on three factors:
- The degree to which each course teaches statistics through coding up examples – preferably in R or Python.
- Coverage of the fundamentals of probability and statistics. Covering descriptive statistics, inferential statistics, and probability theory is ideal.
- How much of the syllabus is relevant to data science? Does the syllabus have specialized content like genomics, as several biostatistics courses do? Does the syllabus cover advanced concepts not often used in data science?
Why Target Coding?
William Chen, a data scientist at Quora who has a master’s in Applied Mathematics from Harvard, wrote the following in this popular Quora answer to the question: “How do I learn statistics for data science?”
For any aspiring data scientist, I would highly recommend learning statistics with a heavy focus on coding up examples, preferably in Python or R.
Since a lot of a data scientist’s statistical work is carried out with code, getting familiar with the most popular tools is beneficial.
Statistics AND Probability
Probability is not statistics and vice versa. My favorite explanation of their differences is from Stony Brook University:
Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.
They explain that “probability is primarily a theoretical branch of mathematics, which studies the consequences of mathematical definitions,” while “statistics is primarily an applied branch of mathematics, which tries to make sense of observations in the real world.”
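To make the distinction concrete, here is a small sketch using a coin flip. It is only an illustration under simple assumptions (independent flips, a binomial model), and the function names `prob_heads` and `estimate_p` are made up for this example. The first function works in the probability direction: the coin's bias is known and we ask how likely an outcome is. The second works in the statistics direction: the flips are observed and we estimate the unknown bias.

```python
import random
from math import comb


def prob_heads(k, n, p):
    """Probability: with a known bias p, chance of exactly k heads in n flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_p(flips):
    """Statistics: from observed flips (1 = heads), estimate the unknown bias."""
    return sum(flips) / len(flips)


# Probability direction: a fair coin, 10 flips, exactly 5 heads.
print(prob_heads(5, 10, 0.5))  # 0.24609375

# Statistics direction: simulate 1,000 flips, then estimate the bias from the data.
rng = random.Random(0)  # fixed seed so the simulation is repeatable
flips = [1 if rng.random() < 0.5 else 0 for _ in range(1000)]
print(estimate_p(flips))
```

The estimate from the simulated data will land near 0.5 but rarely exactly on it, which is the gap that inferential statistics is designed to quantify.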
Statistics is generally regarded as one of the pillars of data science. Probability, though it generates less attention, is also an important part of a data science curriculum.
Joe Blitzstein, a Professor in the Harvard Statistics Department, stated in this popular Quora answer that aspiring data scientists should have a good foundation in probability theory as well.
Justin Rising, a data scientist with a Ph.D. in statistics from Wharton, clarified that this “good foundation” means being comfortable with undergraduate level probability.
Our picks for the best statistics and probability courses for data scientists are:
“Foundations of Data Analysis” includes two of the top-reviewed statistics courses available, with a weighted average rating of 4.48 out of 5 stars over 20 reviews. The series is one of the few in the upper echelon of ratings to teach statistics with a focus on coding up examples. Though not mentioned in either course title, the syllabi contain sufficient probability content to satisfy our testing criteria. These courses together have a great mix of fundamentals coverage and scope for the beginner data scientist.
Michael J. Mahometa, Lecturer and Senior Statistical Consultant at the University of Texas at Austin, is the “Foundations of Data Analysis” series instructor. Both courses in the series are free. The estimated timeline is 6 weeks at 3-6 hours per week for each course. One prominent reviewer said:
Excellent course! I took part 1 and enjoyed it a lot, so it was very easy to decide to go on with part 2. Dr. Mahometa and team are very good teachers and their material is of a very high quality. The exercises are interesting and the materials (videos, labs and problems) are appropriate and well chosen. I recommend this course to anyone interested in statistical analysis (as an introduction to machine learning, big data, data science, etc.). On a scale from 1 to 10, I give 50!
Please note each course’s description and syllabus are accessible via the links provided above.
A stellar specialization
Update (December 5, 2016): Our original second recommendation, UC Berkeley’s “Stat2x: Introduction to Statistics” series, closed their enrollment a few weeks after the release of this article. We promoted our top recommendation in “The Competition” section accordingly.
- Statistics with R Specialization by Duke University on Coursera
…which contains the following five courses:
This five-course specialization is based on Duke’s excellent Data Analysis and Statistical Inference course, which had a 4.82-star weighted average rating over 55 reviews. The specialization is taught by the same professor, plus a few additional faculty members. The early reviews on the new individual courses, which have a 3.6-star weighted average rating over 5 reviews, should be taken with a grain of salt due to the small sample size. The syllabi are comprehensive and have full sections dedicated to probability.
Dr. Mine Çetinkaya-Rundel is the main instructor for the specialization. The individual courses can be audited for free, though you don’t have access to grading. Reviews suggest that the specialization is “well worth the money.” Each course has an estimated timeline of 4-5 weeks at 5-7 hours per week. One prominent reviewer said the following about the original course that the specialization was based upon:
One of the greatest courses I’ve taken so far. [Dr. Mine Cetinkaya-Rundel is] a great teacher, very much involved in exchanges with her students. A large variety of teaching approaches and tools. Lots of practice through short tests, R-programming labs, and an in-depth project. A very lively forum with lots of help to cope with difficulties. The course is not too difficult, but the variety of the proposed material requires that students get involved quite substantially. A very nice book available for free with plenty of practice exercises.
Want more probability?
- Introduction to Probability - The Science of Uncertainty by the Massachusetts Institute of Technology (MIT)
Consider the above MIT course if you want a deeper dive into the world of probability. It is a masterpiece with a weighted average rating of 4.91 out of 5 stars over 34 reviews. Be warned: it is a challenge and much longer than most MOOCs. The level at which the course covers probability is also not necessary for the data science beginner.
John Tsitsiklis and Patrick Jaillet, both of whom are professors in the Department of Electrical Engineering and Computer Science at MIT, teach the course. The contents of this course are essentially the same as those of the corresponding MIT class (Probabilistic Systems Analysis and Applied Probability), a course that has been offered and continuously refined over more than 50 years. The estimated timeline is 16 weeks at 12 hours per week. One prominent reviewer said:
Many online courses are watered down in some way, but this one feels like a proper rigorous exercise-driven course similar to what you’d get in-person at a top school like MIT. The professors present concepts in lectures that have obviously been honed to a laser focus through years of pedagogical experience; there is not a single wasted second in the presentations and they go exactly at the right pace and detail for you to understand the concepts. The exercises will make you work for your knowledge and are critical for really internalizing the concepts. This is the best online course I have taken in any subject.
I encourage you to visit Class Central’s page for this course to read the rest of the reviews.
Our #1 pick had a weighted average rating of 4.48 out of 5 stars over 20 reviews. Let’s look at the other alternatives.
- MedStats: Statistics in Medicine (Stanford University/Stanford OpenEdx): Great syllabus where the examples have a medical focus. Covers a bit of R programming at the end, though not as much as UT Austin’s series. A worthy option for anyone, even those not targeting medicine. It has a 4.58-star weighted average rating over 32 reviews.
- SOC120x: I “Heart” Stats: Learning to Love Statistics (University of Notre Dame/edX): Targets a non-technical audience, though likely would be good for anyone. No coding. Good production value. Course and instructors look really fun. It has a 4.54-star weighted average rating over 12 reviews.
- QM101x: Statistics for Business (Indian Institute of Management Bangalore/edX): Part of a 4-course series. Business focus. Good syllabus that uses coding. The last two courses in the series are unreleased as of November 2016 so can’t make a judgment yet. It has a 4.43-star weighted average rating over 27 reviews.
- Workshop in Probability and Statistics (Udemy): Taught by Dr. George Ingersoll, Associate Dean of Executive MBA Programs at the UCLA Anderson School of Management. Costs money. Uses Excel. It has a 4.4-star weighted average rating over 452 reviews.
- Intro to Descriptive Statistics (San Jose State University/Udacity): Part of a 2-course series. Bite-sized videos. No coding. It has a 3.88-star weighted average rating over 8 reviews.
- Intro to Inferential Statistics (San Jose State University/Udacity): Part of a 2-course series. I took both courses as refreshers for my undergrad statistics classes and came away with a deeper understanding. Really enjoyed Katie Kormanik’s teaching style. Bite-sized videos. No coding. It has a 4.4-star weighted average rating over 5 reviews.
- 6.008.1x: Computational Probability and Inference (Massachusetts Institute of Technology/edX): One of two courses/series to teach statistics with a focus on coding up examples in Python. Reviews suggest prior stats experience is needed and that the course is a bit unorganized. It has a 4-star weighted average rating over 12 reviews.
- Basic Statistics (University of Amsterdam/Coursera): One of two statistics courses in the University of Amsterdam’s Methods and Statistics in Social Sciences Specialization. One exceedingly positive review on the series and its instructors. No coding. It has a 4.06-star weighted average rating over 8 reviews.
- Inferential Statistics (University of Amsterdam/Coursera): One of two statistics courses in the University of Amsterdam’s Methods and Statistics in Social Sciences Specialization. One exceedingly positive review on the series and its instructors. No coding. It has a 4-star weighted average rating over 3 reviews.
- PH525.1x: Statistics and R (Harvard University/edX): Part of a 7-course series on edX. Life sciences focus. Uses R programming, but the reviews suggest UT Austin’s series is better. It has a 3.96-star weighted average rating over 26 reviews.
- PH525.3x: Statistical Inference and Modeling for High-throughput Experiments (Harvard University/edX): Part of a 7-course series on edX. Life sciences focus. Uses R programming, but the reviews suggest UT Austin’s series is better. It has a 4.63-star weighted average rating over 4 reviews.
- Mathematical Biostatistics Boot Camp 1 (Johns Hopkins University/Coursera): Part of a 2-course series. Biostatistics focus. It has a 3.13-star weighted average rating over 23 reviews.
- Mathematical Biostatistics Boot Camp 2 (Johns Hopkins University/Coursera): Part of a 2-course series. Biostatistics focus. It has a 3.83-star weighted average rating over 3 reviews.
- KIexploRx: Explore Statistics with R (Karolinska Institutet/edX): More of a data exploration course than a statistics course. Uses coding. It has a 3.77-star weighted average rating over 22 reviews.
- Statistical Inference (Johns Hopkins University/Coursera): One of two statistics courses in JHU’s data science specialization. Bad reviews. It has a 2.9-star weighted average rating over 29 reviews.
- Regression Models (Johns Hopkins University/Coursera): One of two statistics courses in JHU’s data science specialization. Bad reviews. It has a 2.73-star weighted average rating over 30 reviews.
- DS101X: Statistical Thinking for Data Science and Analytics (Columbia University/edX): Part of the Microsoft Professional Program Certificate in Data Science. Short syllabus. Bad reviews. It has a 2.77-star weighted average rating over 24 reviews.
- Understanding Clinical Research: Behind the Statistics (University of Cape Town/Coursera): “This isn’t a comprehensive statistics course, but it offers a practical orientation to the field of medical research and commonly used statistical analysis.” Health care focus. It has a 5-star weighted average rating over 15 reviews.
- MED101x: Introduction to Applied Biostatistics: Statistics for Medical Research (Osaka University/edX): Biostatistics focus. Uses coding. It has a 4.5-star weighted average rating over 3 reviews.
- Probability and Statistics (Stanford University/Stanford OpenEdx): Curriculum looks great. The one review is really positive. No coding. It has a 4.5-star weighted average rating over 1 review.
- Inferential and Predictive Statistics for Business (University of Illinois at Urbana-Champaign/Coursera): Part of a 7-course Managerial Economics and Business Analysis Specialization. Uses Excel. It has a 5-star weighted average rating over 1 review.
- Exploring and Producing Data for Business Decision Making (University of Illinois at Urbana-Champaign/Coursera): Part of a 7-course Managerial Economics and Business Analysis Specialization. Uses Excel. It has a 5-star weighted average rating over 1 review.
- Introduction to Probability, Statistics, and Random Processes (University of Massachusetts Amherst/Independent): Videos not available for the whole course. It has a 2.5-star weighted average rating over 2 reviews.
- 005x: Introduction to Statistical Methods for Gene Mapping (Kyoto University/edX): Genetics focus. Need prior statistics and R knowledge. It has a 2.5-star weighted average rating over 1 review.
- Statistics for Genomic Data Science (Johns Hopkins University/Coursera): Genomic focus. Not a good introductory course: “A fair class for someone with an interest in this field who also happens to have a decent background in R programming.” It has a 2-star weighted average rating over 2 reviews.
The following courses had no reviews as of November 2016.
- Statistical Thinking in Python (Part 1) and Statistical Thinking in Python (Part 2) (DataCamp): Uses coding and Python specifically, making it one of few worthy courses or series that use that language. Seven hours of video and 120+ exercises. DataCamp is a popular option.
- A Hands-on Introduction to Statistics with R (DataCamp): Uses coding. 26 hours of video and 150+ exercises. Again, DataCamp is a popular option.
- Statistical Computing with R - a gentle introduction (University College London/Independent): Uses coding.
- Probability & Statistics (Carnegie Mellon): Uses R. Primarily text-based instruction. Designed to be equivalent to one semester of a college statistics course.
- Introduction to Probability and Statistics (Massachusetts Institute of Technology/MIT OCW): Traditional lecture format (video-taped).
- Fundamentals of Engineering Statistical Analysis (The University of Oklahoma/Janux): Engineering focus.
- Elementary Business Statistics (The University of Oklahoma/Janux): Business focus.
- STAT101x: Biostatistics for Big Data Applications (The University of Texas Medical Branch/edX): Biostatistics focus.
- 416.1x: Probability: Basic Concepts & Discrete Random Variables (Purdue University/edX): Part of a 2-course series.
- 416.2x: Probability: Distribution Models & Continuous Random Variables (Purdue University/edX): Part of a 2-course series.
- Business Statistics and Analysis Specialization (Rice University/Coursera): Uses Excel.
- Statistics 110: Probability (Harvard University): Traditional lecture format (video-taped). Often recommended on Quora.
- Statistics (Dataquest): A multi-course series with about 12 hours of content. Subscription required. One of two courses/series to teach statistics with a focus on coding up examples in Python. A note from Dataquest: “the statistics courses are being entirely re-written at the moment, due for release around the end of November.”
Wrapping it Up
This is the second of a six-piece series that covers the best MOOCs for launching yourself into the data science field. We covered programming in the first article, and the remainder of the series will cover several other data science core competencies: the data science process, data visualization, and machine learning.
The final piece will be a summary of those courses, and the best MOOCs for other key topics such as data wrangling, databases, and even software engineering.
Courtesy: David Venturi, FreeCodeCamp