Qualitative content analysis (QCA) is like programming – easy, until you start coding. It is also a powerful tool that lets you annotate and categorize (code) text for in-depth analyses (Mayring 2010). By carefully reading text passages, trained annotators are able to assign pre-defined categories or labels.
However, assigning categories to thousands of sentences is tedious work. Luckily, recent advances in natural language processing (NLP) allow us to speed up this task (see also Haunss et al. 2020). In this post, we present an efficient way to assist (human) annotators during the labeling process: we predict a label for a given sentence based on its semantic similarity to already labeled sentences.
This classification task can be separated into the following sub-tasks:
identify meaningful sentences using a pre-trained model (Task 1)
use a fine-tuned language model to make predictions (Task 2)
The setup
Let’s consider a few examples from the DEbateNet2.0 corpus (Blokker et al. 2023). This data set on the 2015 German migration debate contains demands and propositions made by political actors, as reported in newspaper articles. You can explore the data here and download it from GitHub.
Code
library(mardyr2)
library(tidyr)
library(dplyr)

# load debatenet2.0 from the mardyr2 package
# remotes::install_github("nicoblokker/mardyr2")
lre <- mardyr2:::LRE %>%
  separate_rows(claimvalues, sep = ",") %>%
  mutate(claimvalues = gsub("\\D", "", claimvalues)) %>%
  filter(!grepl("[1-9]00|999", claimvalues)) %>%
  mutate(label = suppressWarnings(mardyr2:::lookup_codes(claimvalues))) %>%
  as.data.frame() %>%
  select(quote, claimvalues, label) %>%
  mutate(quote = trimws(gsub("\\s+", " ", quote)))

# collapse quotes with different labels
lre_compressed <- lre %>%
  group_by(quote) %>%
  summarise(claimvalues = paste(claimvalues, collapse = "; "),
            labels = paste(label, collapse = "; "))

# collapse quotes with different polarity
lre_compressed$claimvalues <- sapply(1:nrow(lre_compressed), function(x)
  unique(stringr::str_extract_all(lre_compressed$claimvalues[x], "\\d+")[[1]]))

# remove multi-label claims
lre_single <- lre_compressed %>%
  filter(!grepl(",", claimvalues)) %>%
  mutate(claimvalues = unlist(claimvalues))

# save
readr::write_csv(lre_single, "lre_single.csv")

# examples for demonstration purposes
display_examples <- lre_single %>% slice(c(372, 861, 471, 1715))
display_examples$quote_translated <- c(
  "CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline",
  "The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward",
  "The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview",
  "Chancellor Angela Merkel relies on European Union and Turkey to set quotas for refugees in refugee crisis")
Example Sentences from DEbateNet2.0

| rowid | quote | quote_translated |
|---|---|---|
| 1 | CDU CSU Horst Seehofer droht der Kanzlerin mit Konsequenzen sollte die Zahl der Flüchtlinge nicht sinken | CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline |
| 2 | Die Europäische Union sucht die Verständigung mit der Türkei und findet keine Lösung Bundeskanzlerin Angela Merkel prescht vor | The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward |
| 3 | Das Grundrecht auf Asyl für politisch Verfolgte kennt keine Obergrenze verkündete Merkel ebenfalls per Interview | The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview |
| 4 | Kanzlerin Angela Merkel setzt in der Flüchtlingskrise auf die Festlegung der Europäischen Union und der Türkei auf Kontingente für Flüchtlinge | Chancellor Angela Merkel relies on European Union and Turkey to set quotas for refugees in refugee crisis |
Annotation was carried out at the sentence level. In this example, we are mostly interested in the codes – the claim-categories – and their corresponding labels. However, we also add the name of the actor or proponent and their stance towards the claim: support (\(1\)) or opposition (\(-1\)).
Example Annotations from DEbateNet2.0

| rowid | name | labels_en | codes | weight |
|---|---|---|---|---|
| 1 | Horst Seehofer | ceiling/upper limit | 102 | 1 |
| 2 | EU | Cooperation with transit countries | 507 | 1 |
| 3 | Angela Merkel | ceiling/upper limit | 102 | -1 |
| 4 | Angela Merkel | Cooperation with transit countries | 507 | 1 |
Use-case: Discourse Networks
When combined with network analysis, this setup allows us to capture and represent complex relations from texts in an informative way: it condenses sequential information given as text (e.g., sentences in newspaper articles) into a network representation (Discourse Network Analysis (DNA); Leifeld 2016). This network can subsequently be visualized:
Bipartite network
In this example, only three actors (blue circles) and two claim-categories (red squares) are displayed. As more and more text is considered, the number of different categories increases; depending on the desired level of granularity, it easily reaches 100 or more distinct categories.
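For illustration, this small example network can be reconstructed directly from the annotation table above. The following is a minimal sketch, assuming networkx as the graph library (the original analysis presumably uses the DNA toolchain instead):

Code
import networkx as nx

# actor -> claim-category edges; weight encodes support (1) vs. opposition (-1)
edges = [("Horst Seehofer", "ceiling/upper limit", 1),
         ("EU", "Cooperation with transit countries", 1),
         ("Angela Merkel", "ceiling/upper limit", -1),
         ("Angela Merkel", "Cooperation with transit countries", 1)]

G = nx.Graph()
G.add_nodes_from({a for a, _, _ in edges}, bipartite="actor")   # blue circles
G.add_nodes_from({c for _, c, _ in edges}, bipartite="claim")   # red squares
G.add_weighted_edges_from(edges)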
These categories are part of a so-called Codebook, which contains guidelines describing the various categories. Codebooks are typically developed by domain experts and subsequently manually applied to the sentences in question.
In the above example, the following sentence
Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline.
arguably contains the same demand as this sentence
The basic right to asylum for politically persecuted persons knows no upper limit […].
Therefore, they are assigned the same category (with opposing weights). The question then is, can we automate this process?
By comparing sentences against each other, one can quickly find similar ones. If sentences are semantically similar, they are more likely to belong to the same category. Humans can carry out this comparison without much explanation; language models can be trained for this task (Reimers and Gurevych 2019).
For starters, we use a (multi-lingual) pre-trained model to encode our sentences and return sentence embeddings. These embeddings are then compared with each other: a cosine-similarity score (in theory ranging from \(-1\) to \(1\), in practice falling between \(0\) and \(1\) here) quantifies the similarity between sentences.
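As a minimal sketch, the encoding and scoring step behind the table below could look like this, reusing the paraphrase-multilingual-mpnet-base-v2 model that is introduced later in this post (the selection of three translated example sentences is an assumption):

Code
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util

# encode the translated example sentences with a multilingual model
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
sentences = pd.Series([
    "CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline",
    "The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward",
    "The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview"
])
embeddings = model.encode(sentences.tolist(), convert_to_tensor=True)

# pairwise cosine similarity between all sentence embeddings
cosine_scores = np.round(util.cos_sim(embeddings, embeddings), 2)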
Sentence similarity (selection)

| from | to | score |
|---|---|---|
| CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline | The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview | 0.47 |
| The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward | The basic right to asylum for politically persecuted persons knows no upper limit, Merkel also announced in an interview | 0.40 |
| CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline | The European Union seeks understanding with Turkey and finds no solution German Chancellor Angela Merkel moves forward | 0.32 |
The results are not bad at all: sentences of the same category score higher than sentences with different labels. To check whether the multi-lingual model actually works across languages, we compare the German sentence to its English translation:
Code
# Code from here: https://www.sbert.net/docs/usage/semantic_textual_similarity.html
# add German translation
sentences = pd.concat([sentences[[0]],
                       pd.Series("CDU CSU Horst Seehofer droht der Kanzlerin mit Konsequenzen sollte die Zahl der Flüchtlinge nicht sinken", index=[1])])
embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = np.round(util.cos_sim(embeddings, embeddings), 2)

pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):
        pairs.append({'English': sentences[i], 'German': sentences[j], 'score': float(cosine_scores[i][j])})

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
edgelist = pd.DataFrame(pairs)
Sentence similarity between translated and original sentence

| English | German | score |
|---|---|---|
| CDU CSU Horst Seehofer threatens the Chancellor with consequences if the number of refugees does not decline | CDU CSU Horst Seehofer droht der Kanzlerin mit Konsequenzen sollte die Zahl der Flüchtlinge nicht sinken | 0.94 |
Let’s say we did not have any labels for the above sentences. Then we could easily cluster the sentences – given their embeddings or similarity scores – and assign labels in an unsupervised fashion. Conversely, in a supervised approach, we could start with a few labeled sentences, find the most similar ones from the unlabeled set, and assign the corresponding label.
Since we have a good number of labeled sentences and a comprehensive Codebook, we will do the latter. Instead of simply finding the single most similar already-categorized sentence, we use a similar yet slightly different and more robust approach: first, we create a median embedding for each category; only then do we compare the sentence in question to each median embedding and choose the closest one as its label.2
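In notation (a restatement of this procedure; the symbols are not from the original post): let \(\mathbf{e}_i\) denote the embedding of labeled sentence \(i\) with category \(y_i\). For each category \(c\) we take the element-wise median of its members’ embeddings, and a new sentence with embedding \(\mathbf{e}\) receives the category of the most similar median:

\[
\mathbf{m}_c = \operatorname{median}\{\mathbf{e}_i \mid y_i = c\}, \qquad \hat{y} = \underset{c}{\arg\max}\;\cos(\mathbf{e}, \mathbf{m}_c)
\]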
Task 1: Identify self-contained vs. context-dependent claims
The current setup of comparing sentences without further context information (e.g., in the form of adjacent sentences) imposes a strong restriction: we only consider sentences that contain all the information needed to classify them. We call these sentences self-contained. In fact, none of the sentences introduced so far requires context information. This sentence, on the other hand, does:
The CDU opposes this notion.
Without context we can only guess its label – it is therefore context-dependent. Although removing such sentences might seem Procrustean, it is not unfounded: most claims – especially important ones – occur more than once and in self-contained form.
So the first task is to weed out those sentences that a) require context, b) are potentially malformed, or c) are outliers. This also improves the quality of our training data for later fine-tuning, because the classes become more homogeneous.3 We start by loading the data set and removing multi-label claims. The data set contains 2454 individual sentences, each categorized with one of 109 labels.
Code
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util, SentencesDataset, losses, evaluation

# set model
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# load data set
df = pd.read_csv("lre_single.csv")  # only contains sentences with a single unambiguous label
classes = np.unique(df['claimvalues'])  # number of unique categories; should equal 109

# transform claims from integers into strings
classes = [str(x) for x in classes]
df['claimvalues'] = [str(claim) for claim in df['claimvalues']]

# create class embeddings and get similarity
df['sim_to_median_class_emb'] = [[] for i in range(len(df))]
for c in classes:
    df_subset = df[df['claimvalues'].apply(lambda x: str(c) in x)]
    corpus = df_subset['quote'].tolist()
    embeddings = model.encode(corpus, convert_to_tensor=True)
    med_emb = torch.median(embeddings, dim=0)[0]  # element-wise median of all class embeddings
    # calculate similarity for later pruning
    cosine_scores = util.cos_sim(med_emb, embeddings).tolist()[0]
    idx = df_subset.index.tolist()
    for j in range(len(idx)):
        df.sim_to_median_class_emb[idx[j]].append(cosine_scores[j])

# sort to match order of predicted classes
df['sim_to_median_class_emb'] = [sorted(np.round(df.sim_to_median_class_emb[i], 3), reverse=True)
                                 for i in range(len(df))]

# export
df.to_csv("lre_single_sim.csv")
The next step is to create a median-class-embedding (MCE) for each label. The idea is to approximate one aggregate embedding that is representative of each category. This allows us to compare each individual sentence’s embedding against the MCE of its assigned category and thus estimate how similar the sentence is to that representative embedding.
Similarity to the respective Median-Class-Embedding

| quote | quote_translated | codes | labels_en | sim_to_median_class |
|---|---|---|---|---|
| Die CSU lehnt eine Erweiterung ab | The CSU rejects an expansion | 309 | Care (medical, financial, …) | 0.305 |
| Die Staats und Regierungschefs waren strikt dagegen | The heads of state and government were strictly against it | 501 | EU solution (quotas for refugees) | 0.34 |
| Zudem treibt de Maizière seine Forderung Asylzentren in Nordafrika einzurichten weiter voran | In addition, de Maizière continues to push his demand to set up asylum centres in North Africa | 505 | Asylum procedure in countries of origin | 0.802 |
| Und sie verspricht Ländern und Kommunen dass der Bund finanziell mehr tun werde | And she promises the federal states and municipalities that the federal government will do more financially | 805 | additional financing | 0.791 |
If a sentence is dissimilar to its MCE, we assume that it is context-dependent and remove it. This means we have to decide on a threshold to distinguish between context-dependent and self-contained claims. In practice, we observed that a similarity score of 0.6–0.65 yields good results for this particular data set (with the later fine-tuning in mind). After applying a threshold of \(\geq 0.6\), we end up with 1884 presumably self-contained sentences (77% of the full data set).4
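Building on the data frame from the code block above, the pruning step itself is short (a sketch; the column holds one similarity value per assigned label, so its maximum is the relevant score):

Code
# keep presumably self-contained sentences: similarity to their MCE >= 0.6
df_pruned = df[df['sim_to_median_class_emb'].apply(lambda s: max(s) >= 0.6)]
len(df_pruned)  # 1884 of 2454 sentences remain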
This pruned data set is then used to fine-tune a model for the classification of claim-categories within the migration domain.5
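The fine-tuning code itself is not shown here. A minimal sketch with sentence-transformers could use a label-aware batch triplet loss, which pulls sentences of the same claim-category closer together in embedding space; the loss choice and hyperparameters below are assumptions, not the settings used for the published model:

Code
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# one InputExample per pruned sentence, labeled with its (integer) claim-category
train_examples = [InputExample(texts=[row['quote']], label=int(row['claimvalues']))
                  for _, row in df_pruned.iterrows()]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.BatchAllTripletLoss(model=model)  # assumed loss; forms triplets within each batch

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=4, warmup_steps=100, output_path='debatenet-2-cat')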
Task 2: Use the fine-tuned model to predict categories
The fine-tuned model can then be used to a) embed new sentences and b) create more accurate median-class-embeddings (as described above). Together they can be used to predict categories for new sentences.
To see how well this approach distinguishes between different categories, we visualize the encoded sentences using a dimensionality reduction technique, mapping the 768-dimensional embeddings onto 2 dimensions.
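The projection code is not shown here; a sketch, assuming UMAP as the dimensionality reduction technique (t-SNE would work analogously):

Code
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

# encode sentences with either the pre-trained or the fine-tuned model to compare the two projections
emb = model.encode(df_pruned['quote'].tolist())  # shape: (n_sentences, 768)
emb_2d = umap.UMAP(n_components=2, metric='cosine').fit_transform(emb)

# color points by their annotated claim-category
colors = df_pruned['claimvalues'].astype('category').cat.codes
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=colors, s=5, cmap='tab20')
plt.show()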
Just by looking at the different embeddings in two dimensions, one can see that the fine-tuned model yields more clear-cut and well-spaced clusters. Still, they are not perfect as indicated by the overlapping points of different color. This means that a fully automated approach will mislabel sentences and is thus not appropriate without human intervention.
New sentences can be encoded analogously. With the above-mentioned limitation in mind, we treat the predictions as educated guesses: the categories of the \(k\) closest MCEs are presented to the human annotator during the annotation process (the classify_claim function in the Interactive usage section below implements exactly this). Instead of all 109 categories, they only have to select from a reduced number.
Prediction Examples at k = 3

| quote | quote_translated | codes | labels_en | codes_pred |
|---|---|---|---|---|
| Seine Tat wolle er als Kritik an der Flüchtlingspolitik von Henriette Reker verstanden wissen | He wants his act to be understood as criticism of Henriette Reker’s refugee policy | 190 | Current migration policy | [190, 401, 199] |
| Die Flüchtlinge sollten zu Hause bleiben ihr Land wieder aufbauen | The refugees should stay at home and rebuild their country | 104 | isolation/immigration stop | [209, 202, 211] |
| Gleichzeitig erinnerte Ponta daran dass Rumänien die Aufnahme von Flüchtlingen mit dem Stock so wie sie von Ungarn betrieben werde aufs Schärfste verurteile | At the same time, Ponta recalled that Romania strongly condemns the stick-wielding admission of refugees as practised by Hungary | 104 | isolation/immigration stop | [104, 101, 199] |
| Das kann nur Europa gemeinsam lösen | Only Europe can solve this together | 501 | EU solution (quotas for refugees) | [501, 502, 899] |
| Das Einwanderungsgesetz das die SPD und Teile der CDU fordern nennt sie im Augenblick nicht vordringlich | The immigration law demanded by the SPD and parts of the CDU is, she says, not a priority at the moment | 108 | immigration law | [108, 315, 114] |
Accuracy for different \(k\) on the test data set is reported below. The pre-trained model already gives good results; however, at lower \(k\), accuracy can be increased using the fine-tuned model: the first prediction is correct in roughly 60% of cases, and in 75% of cases the correct category is among the top 3 predictions.
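The evaluation code is not shown here; accuracy at \(k\) can be sketched as follows, reusing the classify_claim function defined in the Interactive usage section below (df_test, a held-out labeled split, is an assumption):

Code
def accuracy_at_k(df_test, k):
    # a prediction counts as a hit if the true code is among the top-k predicted categories
    hits = [str(row['claimvalues']) in classify_claim(row['quote'], hits=k, verbose=False)
            for _, row in df_test.iterrows()]
    return sum(hits) / len(hits)

for k in [1, 3, 5, 10]:
    print(k, round(accuracy_at_k(df_test, k), 2))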
Accuracy at k

| k | pre-trained | fine-tuned |
|---|---|---|
| 1 | 0.51 | 0.62 |
| 3 | 0.69 | 0.75 |
| 5 | 0.76 | 0.83 |
| 10 | 0.84 | 0.87 |
Although it is not perfect, this setup has the potential to change the role of the annotator in two ways:
they either take on the role of a “super-annotator”, curating the predictions shown to them, or,
in the case of double annotation, one of the annotators is substituted with predictions from the classifier
Interactive usage
The model can be used by loading it from the Hugging Face repository, and one can apply it to (newspaper) content different from the training and test material. For that, we encode new sentences using the fine-tuned model debatenet-2-cat. Additionally, we load the list of categories contained in the training data set (classes_train.npy) with corresponding labels (codebook_migr.csv), and the tuned MCEs (tuned_emb).6
Code
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util

# get categories
classes = np.load("classes_train.npy")

# get labels
labels = pd.read_csv("codebook_migr.csv")
labels_dict = labels.set_index('sub').to_dict(orient='dict')['description']

# define and load models
model_fine = SentenceTransformer('nblokker/debatenet-2-cat')
pool_fine = torch.load("tuned_emb", map_location=torch.device('cpu'))

# define classification function
def classify_claim(query_string, hits=1, verbose=True, wmodel="fine_tuned"):
    if wmodel == "fine_tuned":
        model = model_fine
        pool = pool_fine
    query = model.encode(query_string, convert_to_tensor=True)
    sim = []
    for i in pool:
        sim.extend(util.cos_sim(query, i).tolist()[0])
    idx = np.argsort(sim).tolist()[::-1][:hits]
    categories = [classes[i] for i in idx]
    score = [sim[i] for i in idx]
    label = [labels_dict[i] for i in categories]
    if verbose:
        return {'quote': query_string, 'codes_pred': categories,
                'labels_en': label, 'score': np.round(score, 2)}
    else:
        return categories

examples = ['Sunak ‘plans to stop deportation appeals’ for people who reach UK in small boats',
            'Home Office reportedly proposed two options to try to prevent those crossing Channel from claiming asylum',
            'Rishi Sunak has restated his promise to cut overall migration to the UK, but suggested he would delay a cap on refugee numbers that was promised in his leadership campaign.',
            'The immigration minister, Robert Jenrick, has clashed with business bosses over access to overseas workers, saying companies should train UK staff to fill vacancies rather than relying on people from other countries.']

examples_labeled = pd.DataFrame([classify_claim(i, hits=3, verbose=True) for i in examples])
Example queries (Guardian UK)

| quote | codes_pred | labels_en | score |
|---|---|---|---|
| Sunak ‘plans to stop deportation appeals’ for people who reach UK in small boats | | | |
| Home Office reportedly proposed two options to try to prevent those crossing Channel from claiming asylum | 104, 101, 110 | isolation/immigration stop, controlled migration, asylum law | 0.65, 0.65, 0.58 |
| Rishi Sunak has restated his promise to cut overall migration to the UK, but suggested he would delay a cap on refugee numbers that was promised in his leadership campaign. | | | |
| The immigration minister, Robert Jenrick, has clashed with business bosses over access to overseas workers, saying companies should train UK staff to fill vacancies rather than relying on people from other countries. | 804, 602, 603 | staff increase, combating shortage of skilled labour, easier/faster access | 0.73, 0.60, 0.57 |
Summary
This post shows how one can use SBERT language models to speed up sentence classification, e.g., for qualitative content analysis. Building on semantic textual similarity, we first identify meaningful sentences in our data set and subsequently label them according to pre-defined categories.
Pro
does not require context
it is fast
works with > 100 categories
multi-lingual
agnostic to the domain
Contra
does not include context
assumes ‘self-contained’ sentences
requires human intervention
References
Blokker, Nico, André Blessing, Erenay Dayanik, Jonas Kuhn, Sebastian Padó, and Gabriella Lapesa. 2023. “Between Welcome Culture and Border Fence.” Language Resources and Evaluation, February. https://doi.org/10.1007/s10579-023-09641-8.
Haunss, Sebastian, Jonas Kuhn, Sebastian Padó, Andre Blessing, Nico Blokker, Erenay Dayanik, and Gabriella Lapesa. 2020. “Integrating Manual and Automatic Annotation for the Creation of Discourse Network Data Sets.” Politics and Governance 8 (2): 326–39. https://doi.org/10.17645/pag.v8i2.2591.
Leifeld, Philip. 2016. Policy Debates as Dynamic Networks: German Pension Politics and Privatization Discourse. Frankfurt/New York: Campus Verlag.
Mayring, Philipp. 2010. “Qualitative Inhaltsanalyse.” In Handbuch Qualitative Forschung in der Psychologie, edited by Günter Mey and Katja Mruck, 601–13. Wiesbaden: VS Verlag für Sozialwissenschaften. https://doi.org/10.1007/978-3-531-92052-8_42.
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–92. Hong Kong: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.