Qualitative Content Analysis meets SBERT

Author

Nico Blokker & André Blessing (MARDY)

Published

February 9, 2023

Introduction

Qualitative content analysis (QCA) is like programming – easy, until you start coding. It is also a powerful tool that lets you annotate and categorize (code) text for in-depth analyses (Mayring 2010). By carefully reading text passages, trained annotators are able to assign pre-defined categories or labels.

However, assigning categories to thousands of sentences is tedious work. Luckily, recent advances in natural language processing (NLP) allow us to speed up this task (see also Haunss et al. 2020). In this post, we present an efficient way to assist (human) annotators during the labeling process. For that we predict a label for a given sentence based on its semantic similarity to already labeled sentences.

This classification task can be separated into the following sub-tasks:

  • identify meaningful sentences using a pre-trained model (Task 1)
  • use a fine-tuned language model and make predictions (Task 2)

The setup

Let’s consider a few examples from the DEbateNet2.0-corpus (Blokker et al. 2023). This data set on the German migration debate in 2015 contains demands and propositions made by political actors as reported in newspaper articles. You can explore the data online and download it from GitHub.

Code
library(mardyr2)
library(tidyr)
library(dplyr)

# load debatenet2.0 from mardyR package
# remotes::install_github("nicoblokker/mardyr2")

lre <- mardyr2:::LRE %>%
          separate_rows(claimvalues, sep = ",") %>%
          mutate(claimvalues = gsub("\\D", "", claimvalues)) %>%
          filter(!grepl("[1-9]00|999", claimvalues)) %>%
          mutate(label = suppressWarnings(mardyr2:::lookup_codes(claimvalues))) %>%
          as.data.frame() %>% select(quote, claimvalues, label) %>% 
          mutate(quote = trimws(gsub("\\s+", " ", quote)))

# collapse quotes with different labels
lre_compressed <- lre %>% 
          group_by(quote) %>% 
          summarise(claimvalues = paste(claimvalues, collapse = "; "),
                    labels = paste(label, collapse = "; "))

# collapse quotes with different polarity
lre_compressed$claimvalues <- sapply(1:nrow(lre_compressed), 
                                      function(x) unique(stringr::str_extract_all(lre_compressed$claimvalues[x], "\\d+")[[1]]))

# remove multi-label claims
lre_single <- lre_compressed %>% 
          filter(!grepl(",", claimvalues)) %>%
          mutate(claimvalues = unlist(claimvalues))

# save
readr::write_csv(lre_single, "lre_single.csv")

# examples for demonstration purposes
display_examples <- lre_single %>% slice(c(372, 861, 471, 1715))
display_examples$quote_translated <- c("CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline", "The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward", "The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview", "Chancellor Angela Merkel relies on European Union and Turkey to set quotas for refugees in refugee crisis")
Example Sentences from DEbateNet2.0 (German original)
rowid quote
1 CDU CSU Horst Seehofer droht der Kanzlerin mit Konsequenzen sollte die Zahl der Flüchtlinge nicht sinken
2 Die Europäische Union sucht die Verständigung mit der Türkei und findet keine Lösung Bundeskanzlerin Angela Merkel prescht vor
3 Das Grundrecht auf Asyl für politisch Verfolgte kennt keine Obergrenze verkündete Merkel ebenfalls per Interview
4 Kanzlerin Angela Merkel setzt in der Flüchtlingskrise auf die Festlegung der Europäischen Union und der Türkei auf Kontingente für Flüchtlinge
Example Sentences from DEbateNet2.0 (English translation)
rowid quote_translated
1 CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline
2 The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward
3 The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview
4 Chancellor Angela Merkel relies on European Union and Turkey to set quotas for refugees in refugee crisis

Annotation was carried out at the sentence level. In this example, we are mostly interested in the codes – the claim-categories – and their corresponding labels. However, we also add information on the name of the actor or proponent and their stance towards the claim (support (\(1\)) or opposition (\(-1\))).

Example Annotations from DEbateNet2.0
rowid name labels_en codes weight
1 Horst Seehofer ceiling/upper limit 102 1
2 EU Cooperation with transit countries 507 1
3 Angela Merkel ceiling/upper limit 102 -1
4 Angela Merkel Cooperation with transit countries 507 1

Use-case: Discourse Networks

When combined with network analysis, this setup allows us to capture and represent complex relations from texts in an informative way. Here, it condenses sequential information given as text (e.g., sentences in newspaper articles) into a network representation (Discourse Network Analysis, DNA; Leifeld 2016). This network can subsequently be visualized:

Bipartite network

In this example, only three actors (blue circles) and two claim-categories (red squares) are displayed. As more and more text is considered, the number of different categories grows. Depending on the desired level of granularity, their number easily exceeds 100 distinct categories.
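
For illustration, the example annotations above can be turned into exactly this kind of bipartite actor-claim network with a few lines of code. The following is a minimal sketch (not part of the original post) using Python and networkx; actors, categories, and weights are taken from the example table.

Code
import networkx as nx

# example annotations: (actor, claim category, stance)
annotations = [("Horst Seehofer", "102 ceiling/upper limit", 1),
               ("EU", "507 cooperation with transit countries", 1),
               ("Angela Merkel", "102 ceiling/upper limit", -1),
               ("Angela Merkel", "507 cooperation with transit countries", 1)]

# build the bipartite actor-claim network; stances become signed edge weights
G = nx.Graph()
G.add_nodes_from({actor for actor, _, _ in annotations}, bipartite="actor")
G.add_nodes_from({claim for _, claim, _ in annotations}, bipartite="claim")
G.add_weighted_edges_from(annotations)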

These categories are part of a so-called Codebook, which contains guidelines describing the various categories. Codebooks are typically developed by domain experts and subsequently manually applied to the sentences in question.

In the above example, the following sentence

Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline.

arguably contains the same demand as this sentence

The basic right to asylum for politically persecuted persons knows no upper limit […].

Therefore, they are assigned the same category (with opposing weights). The question then is, can we automate this process?

Sentence-BERT1

By comparing sentences against each other, one can quickly find similar ones. If sentences are semantically similar, they are more likely to belong to the same category. Humans can carry out this comparison without much explanation. Language models can be trained for this task (Reimers and Gurevych 2019).

Code
# Code from here: https://www.sbert.net/docs/usage/semantic_textual_similarity.html
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# access the example sentences prepared in R via reticulate's `r` object
sentences = r.display_examples.quote_translated[[0,1,2]]

embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = np.round(util.cos_sim(embeddings, embeddings),2)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'from': sentences[i], 'to': sentences[j], 'score': float(cosine_scores[i][j])})
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
edgelist = pd.DataFrame(pairs) 

For starters, we use a (multi-lingual) pre-trained model to encode our sentences and return sentence embeddings. These embeddings are subsequently compared with each other: a cosine-similarity score (bounded between \(-1\) and \(1\), and typically between \(0\) and \(1\) in practice for these sentence embeddings) quantifies the similarity between two sentences.
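
As a reminder, the cosine similarity of two sentence embeddings \(u\) and \(v\) is their normalized dot product,

\[
\operatorname{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\]

so embeddings pointing in the same direction score \(1\) and orthogonal embeddings score \(0\).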

Sentence similarity (selection)
from to score
CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview 0.47
The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview 0.40
CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward 0.32

The results are not bad at all. Sentences of the same category score higher than sentences with different labels. To check whether the multi-lingual model actually works, we compare the German sentence to its English translation:

Code
# Code from here: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

# add German translation
sentences = pd.concat([sentences[[0]], pd.Series("CDU CSU Horst Seehofer droht der Kanzlerin mit Konsequenzen sollte die Zahl der Flüchtlinge nicht sinken",  index = [1])])
embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = np.round(util.cos_sim(embeddings, embeddings),2)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'English': sentences[i], 'German': sentences[j], 'score': float(cosine_scores[i][j])})
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
edgelist = pd.DataFrame(pairs) 
Sentence similarity between translated and original sentence
English German score
CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline CDU CSU Horst Seehofer droht der Kanzlerin mit Konsequenzen sollte die Zahl der Flüchtlinge nicht sinken 0.94

Let’s say we did not have any labels for the above sentences. Then we could cluster the sentences – based on their embeddings or similarity scores – and assign labels in an unsupervised fashion. Conversely, in a supervised approach, we could start with a few labeled sentences, find the most similar ones from the unlabeled set, and assign the corresponding label.
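
The naive supervised variant – copying the label of the single most similar labeled sentence – fits in a few lines. The snippet below is only a sketch: labeled_sentences and labels are placeholder variables, not part of the original post.

Code
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# placeholder data: a handful of already coded sentences and their claim categories
labeled_sentences = ["sentence already coded as 102", "sentence already coded as 507"]
labels = ["102", "507"]

# encode the labeled pool and a new, unlabeled sentence
emb_labeled = model.encode(labeled_sentences, convert_to_tensor=True)
emb_new = model.encode("new, unlabeled sentence", convert_to_tensor=True)

# copy the label of the most similar labeled sentence
scores = util.cos_sim(emb_new, emb_labeled)[0]
predicted_label = labels[int(scores.argmax())]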

Since we have a good amount of labeled sentences and a comprehensive Codebook, we will do the latter. Instead of simply assigning the label of the single most similar categorized sentence, we use a similar yet slightly more robust approach. First, we create a median embedding for each category. Only then do we compare the sentence in question to each median embedding and choose the closest one as its label.2

Task 1: Identify Self-contained vs. context-dependent Claims

The current setup of comparing sentences without further context information (e.g., in the form of adjacent sentences) imposes a strong restriction: we only consider sentences that contain all the information needed to classify them. We call these sentences self-contained. In fact, none of the sentences introduced so far require context information. This sentence, on the other hand, does:

The CDU opposes this notion.

Without context we can only guess its label – it is therefore context-dependent. Although removing such sentences might seem Procrustean, it is not unfounded: most claims – especially important ones – occur more than once and in self-contained form.

So the first task is to weed out those sentences that a) require context, b) are potentially malformed, or c) are outliers. This also improves the quality of our training data for later fine-tuning, because the classes become more homogeneous.3 We start by loading the data set and removing multi-label claims. The data set contains 2454 individual sentences, each categorized with one out of 109 labels.

Code
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util , SentencesDataset, losses, evaluation

# set model 
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# load data set
df = pd.read_csv("lre_single.csv") # only contains sentences with a single unambiguous label
classes = np.unique(df['claimvalues'])   # the unique categories; there should be 109 of them

# transform claims from integers into strings
classes = [str(x) for x in classes]
df['claimvalues'] = [str(claim) for claim in df['claimvalues']]

# create one class embedding per category and store each sentence's similarity to it
df['sim_to_median_class_emb'] = [[] for i in range(len(df))]
for c in classes:
          df_subset = df[df['claimvalues'] == c]              # all sentences of this category
          corpus = df_subset['quote'].tolist()
          embeddings = model.encode(corpus, convert_to_tensor=True)
          med_emb = torch.median(embeddings, dim=0)[0]        # element-wise median of the class embeddings

          # calculate similarity for later pruning
          cosine_scores = util.cos_sim(med_emb, embeddings).tolist()[0]
          idx = df_subset.index.tolist()
          for j in range(len(idx)):
              df.sim_to_median_class_emb[idx[j]].append(cosine_scores[j])

# sort to match order of predicted classes
df['sim_to_median_class_emb'] = [sorted(np.round(df.sim_to_median_class_emb[i], 3), reverse = True) for i in range(len(df))]

# export
df.to_csv("lre_single_sim.csv")

The next step is to create a median-class-embedding (MCE) for each label. The idea is to approximate one average embedding that is representative of each category. This allows us to compare the embedding of each individual sentence against the MCE of its category and thereby estimate how similar the sentence is to the representative embedding.
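
Concretely, the MCE of a category \(c\) with sentence embeddings \(e_1, \dots, e_n\) is taken element-wise: for every embedding dimension \(d\),

\[
\mathrm{MCE}_c[d] = \operatorname{median}\bigl(e_1[d], \dots, e_n[d]\bigr),
\]

which is what torch.median(embeddings, dim=0) computes in the code above.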

Similarity to the respective Median-Class-Embedding
quote codes labels_en sim_to_median_class
Die CSU lehnt eine Erweiterung ab 309 Care (medical, financial, …) 0.305
Die Staats und Regierungschefs waren strikt dagegen 501 EU solution (quotas for refugees) 0.34
Zudem treibt de Maizière seine Forderung Asylzentren in Nordafrika einzurichten weiter voran 505 Asylum procedure in countries of origin 0.802
Und sie verspricht Ländern und Kommunen dass der Bund finanziell mehr tun werde 805 additional financing 0.791

If a sentence is dissimilar to its MCE, we assume that it is context-dependent and remove it. This means we have to decide on a threshold to distinguish between context-dependent and self-contained claims. In practice, we observed that a similarity score of 0.6–0.65 yields good results for this particular data set (for later fine-tuning). After applying a threshold of \(\geq 0.6\) we end up with 1884 presumably self-contained sentences (roughly 77 % of the full data set).4
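
Applied to the data frame from the code block above, the pruning itself is a simple filter. This is only a sketch – the output file name lre_single_pruned.csv is made up here – and relies on the single-label setting, in which each sim_to_median_class_emb entry holds exactly one score.

Code
# keep only sentences whose similarity to their own class's MCE reaches the threshold
threshold = 0.6
df['max_sim'] = df['sim_to_median_class_emb'].apply(max)
df_pruned = df[df['max_sim'] >= threshold]

print(len(df_pruned), "of", len(df), "sentences kept")   # roughly 1884 of 2454, see text
df_pruned.to_csv("lre_single_pruned.csv", index=False)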

This pruned data set is then used to fine-tune a model for the classification of claim-categories within the migration domain.5

Task 2: Use the fine-tuned model to predict categories

The fine-tuned model can then be used to a) embed new sentences and b) create more accurate median-class-embeddings (as described above). Together they can be used to predict categories for new sentences.
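
The fine-tuning script is not part of this post (see footnote 5), but recomputing the MCEs with the fine-tuned model might look roughly as follows. This is a sketch that assumes df_pruned and classes as prepared in Task 1; the resulting list of tensors corresponds to the tuned_emb file loaded in the interactive example further below.

Code
import torch
from sentence_transformers import SentenceTransformer

# load the fine-tuned model and recompute one median-class-embedding per category
model_fine = SentenceTransformer('nblokker/debatenet-2-cat')

pool = []
for c in classes:
    corpus = df_pruned.loc[df_pruned['claimvalues'] == c, 'quote'].tolist()
    embeddings = model_fine.encode(corpus, convert_to_tensor=True)
    pool.append(torch.median(embeddings, dim=0)[0])

# save for later use (loaded as tuned_emb in the interactive example)
torch.save(pool, "tuned_emb")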

To see how well this approach distinguishes between different categories, we visualize the encoded sentences using a dimensionality reduction technique, mapping the 768-dimensional embeddings onto two dimensions.
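
The projection can be recreated along the following lines. This is a sketch using scikit-learn's t-SNE for the fine-tuned embeddings (the pre-trained model is handled analogously); the original figure may have been produced with a different technique.

Code
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# encode the pruned sentences with the fine-tuned model (numpy array of shape (n, 768))
embeddings_fine = model_fine.encode(df_pruned['quote'].tolist())

# map the 768-dimensional embeddings onto 2 dimensions
xy = TSNE(n_components=2, random_state=42).fit_transform(embeddings_fine)

# color points by claim category to inspect how well the clusters separate
category_codes = df_pruned['claimvalues'].astype('category').cat.codes
plt.scatter(xy[:, 0], xy[:, 1], c=category_codes, s=5, cmap='tab20')
plt.show()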

Just by looking at the embeddings in two dimensions, one can see that the fine-tuned model yields more clear-cut and well-separated clusters. Still, they are not perfect, as indicated by overlapping points of different colors. This means that a fully automated approach will mislabel sentences and is thus not appropriate without human intervention.

New sentences can be encoded analogously. With the above-mentioned limitation in mind, we treat the predictions as educated guesses: the categories of the \(k\) closest MCEs are presented to the human annotator during the annotation process. Instead of over 100 categories, they only have to select from a reduced number.

Prediction Examples at k = 3
quote codes labels_en codes_pred
Seine Tat wolle er als Kritik an der Flüchtlingspolitik von Henriette Reker verstanden wissen 190 Current migration policy [190, 401, 199]
Die Flüchtlinge sollten zu Hause bleiben ihr Land wieder aufbauen 104 isolation/immigration stop [209, 202, 211]
Gleichzeitig erinnerte Ponta daran dass Rumänien die Aufnahme von Flüchtlingen mit dem Stock so wie sie von Ungarn betrieben werde aufs Schärfste verurteile 104 isolation/immigration stop [104, 101, 199]
Das kann nur Europa gemeinsam lösen 501 EU solution (quotas for refugees) [501, 502, 899]
Das Einwanderungsgesetz das die SPD und Teile der CDU fordern nennt sie im Augenblick nicht vordringlich 108 immigration law [108, 315, 114]

Accuracy for different \(k\) on the test data set is reported below. As can be seen, the pre-trained model already gives good results. However, at lower \(k\), accuracy can be increased by using the fine-tuned model: the first prediction is correct in roughly 60% of the cases, and in 75% of the cases the correct result is among the top 3 predictions.

Accuracy at k
k pre-trained fine-tuned
1 0.51 0.62
3 0.69 0.75
5 0.76 0.83
10 0.84 0.87
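
For completeness, accuracy at \(k\) is simply the fraction of test sentences whose gold category appears among the \(k\) closest MCEs. The helper below is a sketch, not the actual evaluation script; it assumes model_fine, pool, and classes as defined elsewhere in this post, and test_quotes/test_codes as hypothetical held-out data.

Code
import numpy as np
from sentence_transformers import util

# fraction of sentences whose gold category is among the k closest MCEs
def accuracy_at_k(quotes, gold_codes, k):
    hits = 0
    for quote, code in zip(quotes, gold_codes):
        query = model_fine.encode(quote, convert_to_tensor=True)
        sims = [float(util.cos_sim(query, mce)) for mce in pool]   # similarity to every MCE
        top_k = [classes[i] for i in np.argsort(sims)[::-1][:k]]   # k closest categories
        hits += code in top_k
    return hits / len(quotes)

# e.g. accuracy_at_k(test_quotes, test_codes, k=3) would reproduce one cell of the table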

Although it is not perfect, this setup has the potential to change the role of the annotator in two ways:

  • they either take the role of a “super-annotator”, curating the predictions shown to them, or,
  • in the case of double annotation, one of the annotators is substituted by the predictions of the classifier

Interactive usage

The model can be used by loading it from the Hugging Face repository. One can apply it to (newspaper) content different from the training and test material. For that, we encode new sentences using the fine-tuned model debatenet-2-cat. Additionally, we load the list of categories contained in the training data set (classes_train.npy) with corresponding labels (codebook_migr.csv), and the tuned MCEs (tuned_emb).6

Code
import pandas as pd
import numpy as np
import torch
import json
from sentence_transformers import SentenceTransformer, util

# get categories
classes = np.load("classes_train.npy")

# get labels
labels = pd.read_csv("codebook_migr.csv")
labels_dict = labels.set_index('sub').to_dict(orient='dict')['description']

# define and load models
model_fine = SentenceTransformer('nblokker/debatenet-2-cat')
pool_fine = torch.load("tuned_emb", map_location=torch.device('cpu'))

# define the classification function (only the fine-tuned model is wired up in this snippet)
def classify_claim(query_string, hits = 1, verbose = True, wmodel = "fine_tuned"):
    if wmodel == "fine_tuned":
        model = model_fine
        pool = pool_fine
    query = model.encode(query_string, convert_to_tensor = True)
    sim = []
    for i in pool:
        sim.extend(util.cos_sim(query, i).tolist()[0])      # similarity of the query to each MCE
    idx = np.argsort(sim).tolist()[::-1][:hits]             # indices of the `hits` closest MCEs
    categories = [classes[i] for i in idx]
    score = [sim[i] for i in idx]
    label = [labels_dict[i] for i in categories]
    if verbose:
        return {'quote': query_string, 'codes_pred': categories, 'labels_en': label, 'score': np.round(score, 2)}
    else:
        return categories

examples = ['Sunak ‘plans to stop deportation appeals’ for people who reach UK in small boats',
           'Home Office reportedly proposed two options to try to prevent those crossing Channel from claiming asylum',
           'Rishi Sunak has restated his promise to cut overall migration to the UK, but suggested he would delay a cap on refugee numbers that was promised in his leadership campaign.',
           'The immigration minister, Robert Jenrick, has clashed with business bosses over access to overseas workers, saying companies should train UK staff to fill vacancies rather than relying on people from other countries.']

examples_labeled = pd.DataFrame([classify_claim(i, hits = 3, verbose = True) for i in examples])          
Example queries (Guardian UK)
quote codes_pred labels_en score
Sunak ‘plans to stop deportation appeals’ for people who reach UK in small boats 104, 101, 207 isolation/immigration stop, controlled migration , deportations 0.59, 0.57, 0.55
Home Office reportedly proposed two options to try to prevent those crossing Channel from claiming asylum 104, 101, 110 isolation/immigration stop, controlled migration , asylum law 0.65, 0.65, 0.58
Rishi Sunak has restated his promise to cut overall migration to the UK, but suggested he would delay a cap on refugee numbers that was promised in his leadership campaign. 102, 104, 101 ceiling/upper limit , isolation/immigration stop, controlled migration 0.77, 0.56, 0.47
The immigration minister, Robert Jenrick, has clashed with business bosses over access to overseas workers, saying companies should train UK staff to fill vacancies rather than relying on people from other countries. 804, 602, 603 staff increase , combating shortage of skilled labour, easier/faster access 0.73, 0.60, 0.57

Summary

This post shows how one can use SBERT language models to speed up sentence-classification, e.g., for qualitative content analysis. Building on semantic textual similarity, we are able to first identify meaningful sentences from our data set and subsequently label them according to pre-defined categories.

Pro

  • does not require context
  • it is fast
  • works with > 100 categories
  • multi-lingual
  • agnostic to the domain

Contra

  • does not include context
  • assumes ‘self-sufficient’ sentences
  • requires human intervention

References

Blokker, Nico, André Blessing, Erenay Dayanik, Jonas Kuhn, Sebastian Padó, and Gabriella Lapesa. 2023. “Between Welcome Culture and Border Fence.” Language Resources and Evaluation, February. https://doi.org/10.1007/s10579-023-09641-8.
Haunss, Sebastian, Jonas Kuhn, Sebastian Padó, Andre Blessing, Nico Blokker, Erenay Dayanik, and Gabriella Lapesa. 2020. “Integrating Manual and Automatic Annotation for the Creation of Discourse Network Data Sets.” Politics and Governance 8 (2): 326–39. https://doi.org/10.17645/pag.v8i2.2591.
Leifeld, Philip. 2016. Policy Debates as Dynamic Networks: German Pension Politics and Privatization Discourse. Frankfurt/New York: Campus Verlag.
Mayring, Philipp. 2010. “Qualitative Inhaltsanalyse.” In Handbuch Qualitative Forschung in der Psychologie, edited by Günter Mey and Katja Mruck, 601–13. Wiesbaden: VS Verlag für Sozialwissenschaften. https://doi.org/10.1007/978-3-531-92052-8_42.
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” arXiv. https://doi.org/10.48550/arXiv.1908.10084.

Footnotes

  1. https://www.sbert.net/docs/usage/semantic_textual_similarity.html

  2. An alternative would be to train a classifier on top of these embeddings.

  3. But also prone to overfitting.

  4. Of course this is not a fool-proof way to distinguish between the two types of sentences and should rather be seen as a rough approximation.

  5. The script for fine-tuning is adapted from the SBERT documentation and not included in this post.

  6. These are generated during training, which is not covered here.