lse-my474.github.io

LSE

MY474 - Applied Machine Learning for Social Science

Lent Term 2023

Instructors

Course Information

No lectures or classes will take place during School Reading Week 6.

Week | Topic | Instructor
1 | What is Machine Learning? | BM
2 | Generalization, Inference, Prediction, and Causality | BM
3 | Linear Discriminant Analysis, Logistic Regression | BM
4 | Gradient Descent, Bootstrap, Cross-Validation, Hyperparameters | BM
5 | Regularization, Decision Trees | BM
6 | Reading Week | -
7 | Support Vector Machines, Active Learning | BM
8 | Bias, Fairness, Accountability, and Transparency in ML | BM
9 | Ensembles, Bagging, Boosting | FG
10 | Neural Networks | FG
11 | Unsupervised Learning | FG

Course Description

The amount of data available to social science researchers has exploded in recent years. Advances in machine learning have given social scientists new tools to make sense of these data and to answer big, important questions that were out of reach just a few years ago. For example, some researchers have used machine learning to explore patterns in large databases of text in an effort to identify how the racial and gender biases encoded in language have changed over time. Others have used machine learning methods to detect and analyze the behavior of state-sponsored bots and trolls and the content they produce. In this class we will introduce a wide range of machine learning methods and their applications to social science.

The course surveys machine learning methods and concepts with applications to the social sciences in mind. We begin by discussing how machine learning methods can be adapted from the predictive purposes for which they were originally designed to the ultimate goal of inference. We then move on to a survey of common machine learning models for classification and regression, frameworks for out-of-sample validation, and finally a discussion of unsupervised learning and an introduction to neural networks.

Objectives

Prerequisites

Applied Regression Analysis (MY452) or equivalent is required. In short, you should 1) understand linear regression and logistic regression and 2) know the basics of a programming language.

Students in this course will strongly benefit from prior experience with the R programming language. Class assignments will be completed in R. If you do not know R, you will have to invest extra time on your own to learn the language.

Textbooks and Course Materials

All course materials are available for free download. The R books are for your own reference and will not be assigned as readings for the class.

Required Software

The ISL textbook uses the R programming language for labs that you are required to complete at home before class. R and R Markdown (via RStudio) will also be used in summative assessments, in-class exercises, and your final assessment.

Assessments

Problem Sets for master’s students enrolled in MY474 and PhD students enrolled in MY574 (40% of total mark)

Problem sets will be assigned at the beginning of each lab session. They involve computer exercises applied to data supplied by the instructor, must be submitted via GitHub Classroom by their due date, and will be marked to provide 40% of the course grade.

Problem sets consist of two parts:

  1. Written responses
  2. Coding questions

Both parts should be included in the Rmd and PDF files to be uploaded to Moodle.

One formative problem set will not count toward your final mark, but will give you an idea of how problem sets are structured:

There will be four summative problem sets that will count toward the final grade:

Quizzes for master’s students enrolled in MY474 and PhD students enrolled in MY574 (40% of total mark)

There will be five timed quizzes administered on Qualtrics and linked from Moodle. The quizzes will be based on lectures, seminars, and readings. Each quiz will be released on Moodle at 8:00am on the Thursdays of weeks 2, 4, 7, 9, and 11. You can start at any time between 8:00am and 10:00pm, but once you open the quiz you will have 15 minutes to complete it. The quiz must be submitted by 10:00pm.

Take Home Assessment for PhD students enrolled in MY574 (Due Tuesday, May 23, by 2pm, 30% of total mark)

PhD students are expected to submit a 3000-word paper in which they identify and contextualize a relevant social data science problem related to their dissertation research, find suitable data to address it, plan and conduct extensive machine learning analysis on the data, and present the findings. Marking of these assessments will be at a level appropriate for PhD students.

Project proposal (10% of take home assessment, due Week 7 Seminar):

The project proposal should be 1.5 to 2 pages, double spaced. I will provide feedback on the proposals to make sure you are on the right track for the final take home assessment. The proposal will contribute 3% toward your final grade, so take it seriously.

Guidelines:

Project Report (Due Tuesday, May 23, by 2pm, 30% of total mark)

PhD students will submit a project report: a 3000-word paper in which they identify and contextualize a relevant social data science problem related to their dissertation research, find suitable data to address it, plan and conduct extensive machine learning analysis on the data, and present the findings. This report should be thought of as a first draft of a paper to submit to a scholarly journal. The analysis need not be ready for publication, but should suggest a roadmap for eventual publication (through discussion of additional analyses to be conducted, additional data to be collected, etc.). Format your report as you would an academic paper. In addition to the paper, you should include an R Markdown file with the code and figures used for the analysis. These components should be submitted on Moodle.

Marking Criteria

Written Assignments

Written assignments will be marked using the following criteria:

Short Answer and Coding Assignments

Some of the assignments will involve shorter questions or coding assignments, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly. If code output is incorrect but it is clear that some of the implementation has been correctly executed, partial credit will be given as far as feasible.

Short answer and coding assignments will be marked using the following criteria:

Schedule

Week 1: What is Machine Learning?

This week we focus on the big picture: what is machine learning? We explore the main varieties of what we consider machine learning and discuss the problems involved in prediction and generalization from training data.

Reading:

Further Reading:

Week 2: Generalization, Inference, Prediction, and Causality

Reading:

Further Reading:

Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.

GitHub Reference Materials: Click here for reference guides to GitHub.

Week 3: Linear Discriminant Analysis, Logistic Regression

Reading:

Week 4: Gradient Descent, Bootstrap, Cross-Validation, Hyperparameters

Reading:

Further Reading:

Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.

Week 5: Regularization, Decision Trees

Reading:

Week 7: Support Vector Machines, Active Learning

Reading:

Further Reading:

Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.

Week 8: Bias, Fairness, Accountability, and Transparency in ML

Reading:

Further Reading:

Week 9: Ensembles, Bagging, Boosting

Reading:

Further Reading:

Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.

Week 10: Neural Networks

Reading:

Further Reading:

Week 11: Unsupervised Learning

Reading:

Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.