Course 3

A statistical approach to machine learning

Presenters: Andreas Ziegler and Marvin Wright
Room: Esquimalt
(Sunday, 10 July 2016, 9:00 am – 5:00 pm)

Summary: Big data from high-throughput genetic studies or image analyses allow the extraction of new knowledge. The term machine learning covers methods for automated knowledge extraction using computers, while statistical learning refers specifically to statistical methods for this purpose. Recently, statistical properties such as consistency and asymptotic normality of the estimators have been derived for some learning machines, which provides a better understanding of how these machines behave. Furthermore, several machines have been extended to operate beyond the standard classification problem for dichotomous endpoints. This short course provides an introduction to some of the most important machine learning approaches currently in use. The theoretical sessions focus on non-technical but intuitive explanations of the algorithms, and the hands-on laptop sessions focus on applying the machines to real data using R. The combination of simple descriptions in a language familiar to biostatisticians with the use of standard statistical software should help to demystify machine learning.

The aims of the course are to:

  1. introduce the key concepts in statistical learning and corresponding R software;
  2. discuss the obstacles for the application of statistical learning in practice;
  3. describe approaches for assessing and addressing these obstacles; and
  4. present and illustrate some extensions that are the subject of ongoing research.

At the end of the day, participants should:

  1. know the fundamental ideas of
    1. boosting (Adaboost, gradient boosting),
    2. classification, probability estimation, regression and survival trees,
    3. nearest neighbors (k nearest neighbors, bagged nearest neighbors),
    4. random forests with extensions, and
    5. support vector machines;
  2. know the basic algorithms of the machine learning approaches and the tuning parameters;
  3. know the most important statistical properties of the machine learning approaches;
  4. know the potential pitfalls when applying the machines;
  5. know similarities and differences between standard statistical regression approaches and machine learning methods;
  6. be able to perform machine learning analyses using these machines for various endpoints using R.

Topics covered

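boosting (Adaboost, gradient boosting); classification, probability estimation, regression and survival trees; nearest neighbors (k nearest neighbors, bagged nearest neighbors); random forests with extensions; support vector machines; tuning parameters; statistical properties of learning machines; pitfalls in application; comparison with standard statistical regression approaches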

Learning strategy

Topics will be introduced by a presentation, followed by hands‐on application using publicly available R packages.

Preparation

To benefit fully, participants should bring a laptop with the latest version of R installed. Participants will be asked to download course material about two weeks prior to the course.

Pre‐requisites

Participants will need a Bachelor-level qualification in statistics to benefit fully from the more advanced parts of the course.

About the instructors

Andreas Ziegler is Professor of Medical Biometry and Statistics at the University of Lübeck, Germany, and Honorary Professor at the School of Mathematics, Statistics and Computer Science at the University of KwaZulu-Natal, Pietermaritzburg, South Africa.

Marvin N. Wright is a research computer scientist at the Institute of Medical Biometry and Statistics, Lübeck, Germany. He has published several R packages, including ranger for random forest analysis.