No lectures or classes will take place during School Reading Week 6.
Week | Topic | Instructor |
---|---|---|
1 | What is Machine Learning? | BM |
2 | Generalization, Inference, Prediction, and Causality | BM |
3 | Linear Discriminant Analysis, Logistic Regression | BM |
4 | Gradient Descent, Bootstrap, Cross-Validation, Hyperparameters | BM |
5 | Regularization, Decision Trees | BM |
6 | Reading Week | - |
7 | Support Vector Machines, Active Learning | BM |
8 | Bias, Fairness, Accountability, and Transparency in ML | BM |
9 | Ensembles, Bagging, Boosting | FG |
10 | Neural Networks | FG |
11 | Unsupervised Learning | FG |
The amount of data available to social science researchers has exploded in recent years. Advances in machine learning have given social scientists new tools to make sense of these data and to answer big, important questions that we could not answer just a few years ago. For example, some researchers have used machine learning to explore patterns in large databases of text in an effort to identify how the racial and gender biases encoded in language have changed over time. Others have used machine learning methods to detect and analyze the behavior of state-sponsored bots and trolls and the content they produce. In this class we will introduce a wide range of machine learning methods and their applications to social science.
The course surveys machine learning methods and concepts with their application to the social sciences in mind. We begin by discussing how machine learning methods can be adapted from the predictive purposes for which they were originally designed to the ultimate goal of inference. We then survey common machine learning models for classification and regression and frameworks for out-of-sample validation, and finish with a discussion of unsupervised learning and an introduction to neural networks.
Applied Regression Analysis (MY452) or equivalent is required. In short, you should 1) understand linear regression and logistic regression and 2) know the basics of a programming language.
Students in this course will strongly benefit from prior experience with the R programming language. Class assignments will be completed in R. If you do not know R, you will have to invest extra time on your own to learn the language.
All course materials are available for free download. The R books are for your own reference and will not be assigned as readings for the class.
The ISL textbook uses the R programming language for labs that you are required to complete at home before class. R and R Markdown (via RStudio) will also be used in summative assessments, in-class exercises, and your final assessment.
Problem sets will be assigned at the beginning of each lab session. They involve computer exercises applied to data supplied by the instructor, must be submitted via GitHub Classroom by their due date, and together account for 40% of the course grade.
Problem sets consist of two parts:
Both parts should be included in the Rmd and PDF files to be uploaded to Moodle.
One formative problem set will not count toward your final mark, but will give you an idea of how problem sets are structured:
There will be four summative problem sets that will count toward the final grade:
There will be five timed quizzes administered on Qualtrics and linked from Moodle. The quizzes will be based on lectures, seminars, and readings. Each quiz will be released on Moodle at 8:00am on the Thursday of weeks 2, 4, 7, 9, and 11. You can start any time between 8:00am and 10:00pm, but once you open the quiz you will have 15 minutes to complete it, and it must be submitted by 10:00pm.
PhD students are expected to submit a 3000-word paper in which they identify and contextualize a relevant social data science problem related to their dissertation research, find suitable data to address it, plan and conduct extensive machine learning analysis on the data, and present the findings. Marking of these assessments will be at a level appropriate for PhD students.
The project proposal should be 1.5 to 2 pages, double spaced. I will provide feedback on the proposals to make sure you are on the right track for the final take-home assessment. The proposal will contribute 3% toward your final grade, so take it seriously.
Guidelines:
PhD students will submit a project report, which is a 3000-word paper in which they identify and contextualize a relevant social data science problem related to their dissertation research, find suitable data to address it, plan and conduct extensive machine learning analysis on the data, and present the findings. This report should be thought of as a first draft of a paper to submit to a scholarly journal. The analysis need not be ready for publication, but should suggest a roadmap for eventual publication (through discussion of additional analyses to be conducted, additional data to be collected, etc.). Format your report as you would an academic paper. In addition to the paper, you should also include an R Markdown file with the code and figures used for the analysis. These components should be submitted on Moodle.
Written assignments will be marked using the following criteria:
70–100: Very Good to Excellent (Distinction). Perceptive, focused use of a good depth of material with a critical edge. Coherent well-read and researched answers. Original ideas or structure of argument.
60–69: Good (Merit). Perceptive understanding of the material. Coherent well-read and researched answers.
50–59: Satisfactory (Pass). A “correct” answer based largely on lecture material. Little detail or originality but presented in adequate framework. Small factual errors allowed.
30–49: Unsatisfactory (Fail) and 0–29: Unsatisfactory (Bad fail). Based entirely on lecture material but unstructured and with increasing error component. Concepts are disordered or flawed. Poor presentation. Errors of concept and scope or poor in knowledge, structure and expression.
Some of the assignments will involve shorter questions or coding assignments, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly. If code output is incorrect but it is clear that some of the implementation has been correctly executed, partial credit will be given as far as feasible.
Short answer and coding assignments will be marked using the following criteria:
This week we focus on the big picture: what is machine learning? We explore the main varieties of what we consider machine learning and discuss the problems involved in prediction and generalization from training data.
Reading:
Further Reading:
Reading:
Further Reading:
Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.
GitHub Reference Materials: Click here for reference guides to GitHub.
Reading:
Reading:
Further Reading:
Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.
Reading:
Reading:
Further Reading:
Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.
Reading:
Further Reading:
Reading:
Further Reading:
Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.
Reading:
Further Reading:
Reading:
Seminar Materials: You can find the GitHub Classroom link for this seminar on Moodle.