The course introduces the basics of computational data collection and computational data management for social science students. By the end of the course, you will have worked with data collection strategies and with both supervised and unsupervised machine learning, and applied them to real research problems.

Course practices

Prerequisite! You should be familiar with quantitative and qualitative research and Python programming. Check this list to understand the level we expect you to have.

The course will consist of studying the methods for data extraction (reading assignments and exercises) and examining how those can be integrated into social science (case studies and final project).

Each class consists of three stages: addressing the method literature, moving from the described method to the Python programming language (including exercises), and discussing the practical uses of the described methods. For maximum efficiency, always read the given materials before each lecture. The course will use Python and the scikit-learn package. [*] Examples of the techniques covered on the course are documented on GitHub.

The final project is a report on a real empirical question that you explore using these methods. It must follow the traditional academic paper style: introduction, related work, description of data and methods, description of findings, and discussion. To guide this work, choose a research question. The related work must demonstrate existing knowledge about the research question and related phenomena; the data and methods must be described in sufficient detail; and there must be clear answers (findings) to the research question. The maximum length of the work is 6000 words including references. Each picture and table counts as 200 words. For style etc., check Social Science Computer Review.

UPDATE: We discussed the final project last week and decided to re-scope it. The aim is not to produce a ~6,000-word research paper but a piece of course work without specific length requirements. The content should, however, address “theory” or related work (as you’ve seen, it is important for me), describe the data and methods, and give some summary of the findings.


Assignments column...
Homework means that you need to do some actual work leading to an outcome before the lecture. Check the formatting and other guidelines in detail. Mostly these structure the work for the final project.
Python material includes links to relevant Python documentation and tutorials.
Reading materials marked with * are shared with course participants in person only. They are various works submitted or currently in review and thus not shared directly with everyone.
Additional reading shows various other good (?) works around the topics and can be used to bring more depth to the topic.

"Limited reading can, and often does, induce a risky sense of competence. I sustain that it is our scholarly responsibility to understand the methodology well. Such understanding gives us better access to a powerful research methodology that offers much freedom to operate within its framework in a creative manner." (Walsh et al., 2015)

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. New York, NY: Springer New York. Available online, e.g. on Hastie's website. Warning: this is a math-heavy book (read: has a lot of equations) and may be rather hard for those not familiar with that background.

Date | Lecture focus | Assignments (for this lecture)
18.3. Introduction to course practices
Working with files
Working with natural language

Grimmer, J. (2015). We Are All Social Scientists Now: How Big Data, Machine Learning, and Causal Inference Work Together. PS: Political Science & Politics, 48(01), 80–83.

Additional reading

King, G. (2014). Restructuring the Social Sciences: Reflections from Harvard’s Institute for Quantitative Social Science. PS: Political Science & Politics, 47(01), 165–172.

Python material

Working with files

Natural language toolkit: tagging words

Python recap

25.3. Easter holiday - no class
1.4. Working with application programming interfaces (APIs)
Working with web scraping

McKelvey, K., DiGrazia, J., & Rojas, F. (2014). Twitter publics: how online political communities signaled electoral outcomes in the 2010 US house election. Information, Communication & Society, 17(4), 436–450.

Geiger, R. S., & Ribes, D. (2011). Trace Ethnography: Following Coordination through Documentary Practices. In 2011 44th Hawaii International Conference on System Sciences (pp. 1–10).

Additional reading

Jungherr, A., Schoen, H., & Jürgens, P. (2016). The Mediation of Politics through Twitter: An Analysis of Messages posted during the Campaign for the German Federal Election 2013. Journal of Computer-Mediated Communication, 21(1), 50–68.

Note: class starts at 8.15 and ends at 10.15 sharp!
Summary of machine learning methods in this class
Lab: Data management and preprocessing
Lab: Initially formulating research questions

Homework: Think of a research theme that you're interested in and prepare to share it with the class.

Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297.

Hindman, M. (2015). Building Better Models: Prediction, Replication, and Machine Learning in the Social Sciences. The ANNALS of the American Academy of Political and Social Science, 659(1), 48–62.

Additional reading

Nelimarkka: Laskennallinen yhteiskuntatiede (Bachelor's thesis) - in Finnish only and for comments.

15.4. Classifying: support vector machines
Classifying: decision trees (optional)
Homework: Return core related work to Matti by Friday 15.4. noon.

Nelimarkka & Ahonen: Automatically detecting deliberation

Hannak et al.: Tweetin' in the Rain.

Methods primer

Support vector machines, focus on understanding the linear version only; the non-linear variants are "extensions of the same idea"

Decision trees, read 3.1 - 3.3 and skim 3.4
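
To make the linear version concrete, here is a minimal sketch of a linear support vector machine classifying short texts with scikit-learn. The toy documents, the labels and the variable names are invented for illustration and are not part of the course material; real data would need proper preprocessing and a held-out test set.

```python
# Toy sketch: a linear SVM on bag-of-words features with scikit-learn.
# The documents and labels below are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "tax cuts and the state budget",
    "budget deficit and tax policy",
    "school reform and teacher salaries",
    "university funding and school teachers",
]
train_labels = ["economy", "economy", "education", "education"]

# CountVectorizer turns each text into word counts;
# LinearSVC then fits a separating hyperplane in that word-count space.
model = make_pipeline(CountVectorizer(), LinearSVC(random_state=0))
model.fit(train_texts, train_labels)

new_texts = ["new tax on the budget", "teachers in school"]
predictions = [str(label) for label in model.predict(new_texts)]
print(predictions)

# In the linear case each word gets one weight. With two classes, positive
# weights push towards the alphabetically second class ("education" here).
vectorizer = model.named_steps["countvectorizer"]
svm = model.named_steps["linearsvc"]
weights = {term: svm.coef_[0][idx] for term, idx in vectorizer.vocabulary_.items()}
for term in sorted(weights, key=weights.get):
    print(f"{term:12s} {weights[term]: .3f}")
```

Reading the per-word weights is one reason to start from the linear version: the model can be inspected directly, which is harder with the non-linear variants.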

22.4. Finding groups: k-means (optional)
Searching patterns: Association rules 
Nelimarkka et al. Social learning strategies

Jurek, S. J., & Scime, A. (2014). Achieving Democratic Leadership: A Data-Mined Prescription. Social Science Quarterly, 95(1), 97–110.

Methods primer

k-means (read algorithm and discussions)

Association rules (read definition, concepts and process)
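
The core concepts behind association rules (support, confidence and lift) can be sketched in a few lines of plain Python. The transactions below are invented sets of topics mentioned in hypothetical messages, and the helper function names are illustrative only; a real analysis would use a dedicated algorithm such as Apriori to search over many candidate rules.

```python
# Toy sketch: support, confidence and lift for one association rule.
# Each "transaction" is the set of topics mentioned in one invented message.
transactions = [
    {"economy", "taxes"},
    {"economy", "taxes", "jobs"},
    {"economy", "jobs"},
    {"education", "taxes"},
    {"economy", "taxes", "education"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Share of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions with `antecedent`, the share that also has `consequent`."""
    return count(antecedent | consequent) / count(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to how common `consequent` is overall (1 = independent)."""
    return confidence(antecedent, consequent) / support(consequent)


rule = ({"economy"}, {"taxes"})
print("support:   ", round(support(rule[0] | rule[1]), 3))
print("confidence:", round(confidence(*rule), 3))
print("lift:      ", round(lift(*rule), 3))
```

A lift below 1 means the two topics co-occur less often than they would if they were independent, so a rule with high confidence is not automatically an interesting one.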

29.4. Searching patterns: Bayes networks (optional)
Finding groups: topic models
Lab: Framing an exact research question
Homework: Return an initial research plan (max 2 pages, double line spacing) to Matti by Tuesday 26.4. noon.

Nokelainen & Tirri: Role of motivation in the moral and religious judgment of mathematically gifted adolescents

Nelimarkka et al. Agenda normalisation work

Methods primer

Topic models (you can also check the page on LDA; if you already understand the model, that's enough)

Bayesian network (read the example)

6.5. Summary: validity and reliability challenges
Lab: Choosing an analysis method and planning analysis strategy
Laaksonen et al. Big data augmented ethnography. *

Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying Party Affiliation from Political Speech. Journal of Information Technology & Politics, 5(1), 33–48.

Additional reading

Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297.

Hindman, M. (2015). Building Better Models: Prediction, Replication, and Machine Learning in the Social Sciences. The ANNALS of the American Academy of Political and Social Science, 659(1), 48–62.

Project deadlines

  • Final project report deadline: 31.5. 12:00
  • Reviews deadline: 12.6. 12:00
  • Final submissions: 31.6. (or later, discuss with Matti)


More info

Please contact Matti Nelimarkka (matti.nelimarkka@helsinki.fi).

For personal tutoring, see the online calendar and choose a time and date.

[*] I use R for my everyday practices and I'm happy to provide R examples for all the methods we address in this class. However, for simplicity I want to keep the same tools for data management and analysis throughout the course (to avoid extra steps). I strongly recommend taking a course on R somewhere if you're interested in this stuff.