Claim classification in news reports using R

Author

Nico Blokker

Published

May 5, 2023

Headline classification

In this post I demonstrate how to use the models developed in MARDY to identify and classify political claims in the headlines of news reports (from the German Tagesschau and the British Guardian) using R (see footnote 1).

The goal is to a) identify political claims in the headers of news reports and b) label them according to the guidelines developed within MARDY.

Accessing the data - Tagesschau

The first step is to download and prepare the data. In this example, I scrape the first news page on the topic of migration. I am interested in both the header and the additional short-text provided there.

Code
# load packages
library(rvest)
library(dplyr)

# scrape one page of the Tagesschau migration topic feed
scrape_ts_ticker <- function(page_index) {
  Sys.sleep(5)  # pause before each request to be polite to the server
  url <- paste0("https://www.tagesschau.de/thema/migration/?pageIndex=", page_index)
  page <- read_html(url)
  header <- page %>% html_nodes(".teaser__headline") %>% html_text(trim = TRUE)
  date <- page %>% html_nodes(".teaser__date") %>% html_text(trim = TRUE)
  shorttext <- page %>% html_nodes(".teaser__shorttext") %>% html_text(trim = TRUE)
  # clean up: drop "+" characters, keep only the text before the first
  # whitespace of the date, and strip the byline and the "mehr" link text
  data.frame(header = gsub("\\s?\\+\\s?", "", header),
             date = gsub("\\s.*", "", date),
             shorttext = trimws(gsub("Von\\s+[A-Z].*mehr|\\n\\s+mehr", "", shorttext)))
}
# .x = 1 requests a single page (pageIndex = 1)
ts_migr <- purrr::map_df(.x = 1, .f = scrape_ts_ticker)
saveRDS(ts_migr, "ts_migr.rds")

Let’s take a look at a sample of the data:

| date | header | shorttext |
|---|---|---|
| 30.04.2023 | Franziskus wirbt in Ungarn für “offene Türen” | Papst Franziskus hat bei einer Freiluftmesse in Budapest für Toleranz und die Aufnahme von Migranten geworben. Damit distanzierte er sich von der Politik des ungarischen Ministerpräsidenten Orban - der saß im Publikum. |
| 27.04.2023 | Unterhaus beschließt Gesetz gegen illegale Migration | Großbritannien will Migranten abschrecken: Wer illegal über den Ärmelkanal kommt, muss mit Internierung, einer Abschiebung nach Ruanda und einer lebenslangen Einreisesperre rechnen. Das Unterhaus beschloss nun das Gesetz. |
| 26.04.2023 | Hunderte Migranten erreichen Lampedusa | Noch immer wagen viele Menschen die gefährliche Flucht über das Mittelmeer. Allein seit Mitternacht kamen Hunderte Migranten auf der italienischen Insel Lampedusa an. Rettungskräfte fanden auch zwei Leichen. |

In total the first page contains the headlines of 21 reports.
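
The scraper above builds its data frame from three independently extracted vectors, so `data.frame()` will fail (or silently misalign rows) if a teaser lacks a date or short text. A minimal sketch of a more defensive variant follows. It extracts the fields per teaser container, so missing fields become `NA`. The `.teaser` container class and the mock markup are assumptions for illustration and have not been checked against the live page.

```r
library(rvest)

# Hypothetical markup mimicking the teaser structure; the real page may differ.
html <- minimal_html('
  <div class="teaser">
    <span class="teaser__date">30.04.2023 - 12:00 Uhr</span>
    <span class="teaser__headline">Headline A</span>
    <p class="teaser__shorttext">Short text A</p>
  </div>
  <div class="teaser">
    <span class="teaser__date">27.04.2023 - 09:30 Uhr</span>
    <span class="teaser__headline">Headline B</span>
  </div>
')

# One node per teaser; html_element() then returns exactly one result
# (possibly missing) per teaser, which keeps the columns aligned.
teasers <- html_elements(html, ".teaser")
df <- data.frame(
  header    = html_text2(html_element(teasers, ".teaser__headline")),
  date      = gsub("\\s.*", "", html_text2(html_element(teasers, ".teaser__date"))),
  shorttext = html_text2(html_element(teasers, ".teaser__shorttext"))
)
df
```

The second teaser has no short text, so its `shorttext` entry is `NA` instead of shifting the remaining rows.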

Accessing the data - Guardian

After downloading the German data, I turn to the Guardian. Conveniently, the Guardian provides an API to query and filter its database (note: an API key is required).

Code
library(glue)
library(httr)
tag <- "migration OR refugee"
page_size <- 21
api_key <- "YOUR API KEY"
date <- "from-date=2023-01-01"
# base query URL; the page number is appended per request
url <- glue('https://content.guardianapis.com/search?q={tag}&{date}&api-key={api_key}&page-size={page_size}&show-fields=headline,trailText')

# request one page of results and return date, headline, and trail text
get_guardian <- function(page) {
  # URLencode() escapes the spaces in the search term
  request <- GET(URLencode(glue("{url}&page={page}")))
  response <- content(request)$response$results
  guardian_migr <- purrr::map_chr(response, ~ .x$fields$headline)
  guardian_migr_short <- purrr::map_chr(response, ~ .x$fields$trailText)
  guardian_migr_date <- purrr::map_chr(response, ~ .x$webPublicationDate)
  data.frame(date = guardian_migr_date, header = guardian_migr, shorttext = guardian_migr_short)
}

guardian <- purrr::map_df(1, get_guardian)
saveRDS(guardian, "guardian-world-migration-sample.rds")
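
As a hedged variation on the request code above, the sketch below builds the same query with `httr::modify_url()`, which percent-encodes the parameter values, and extracts fields with a default so that a result without `trailText` yields `NA` instead of an error. The environment variable name `GUARDIAN_API_KEY` and the mock `results` list are assumptions for illustration. The mock only imitates the shape of `content(request)$response$results` as used above.

```r
library(httr)
library(purrr)

# Read the key from an environment variable (hypothetical name) rather than
# hard-coding it; fall back to the placeholder used in this post.
api_key <- Sys.getenv("GUARDIAN_API_KEY", unset = "YOUR API KEY")
page_size <- 21

# Build the request URL; query values are percent-encoded automatically.
url <- modify_url(
  "https://content.guardianapis.com/search",
  query = list(
    q             = "migration OR refugee",
    `from-date`   = "2023-01-01",
    `page-size`   = as.character(page_size),
    `show-fields` = "headline,trailText",
    `api-key`     = api_key
  )
)

# Mock results with one missing trailText (illustrative values only).
results <- list(
  list(webPublicationDate = "2023-03-29T20:26:37Z",
       fields = list(headline = "Headline A", trailText = "Trail text A")),
  list(webPublicationDate = "2023-03-07T20:10:09Z",
       fields = list(headline = "Headline B"))
)

# A list passed as .f indexes into nested elements; .default fills gaps.
guardian_mock <- data.frame(
  date      = map_chr(results, "webPublicationDate"),
  header    = map_chr(results, list("fields", "headline"), .default = NA_character_),
  shorttext = map_chr(results, list("fields", "trailText"), .default = NA_character_)
)
guardian_mock
```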

Again, let’s take a look at a sample of the data (N = 21):

| date | header | shorttext |
|---|---|---|
| 2023-03-29T20:26:37Z | Illegal migration bill could topple world refugee system, lawmakers told | UN refugee agency representative warns that legislation could have ‘domino effect’ on other countries |
| 2023-03-07T20:10:09Z | UN refugee agency ‘profoundly concerned’ by UK’s illegal migration bill saying it amounts to an asylum ban – as it happened | UNHCR says bill extinguishes the right to seek refugee protection in the UK for those who arrive irregularly |
| 2023-03-07T18:01:12Z | ‘A revenge plan’: refugees and Dover residents react to illegal migration bill | Home secretary’s latest plan to curb small boats crossings across the Channel has been met with scepticism |
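
The two samples store dates in different formats: day.month.year for Tagesschau and ISO 8601 timestamps for the Guardian. If the sources are to be combined or compared over time, the dates can be harmonized first. A small sketch using the sample values shown above (for the Guardian it simply keeps the calendar date of the UTC timestamp):

```r
# Sample values copied from the two tables above
ts_dates <- c("30.04.2023", "27.04.2023", "26.04.2023")
guardian_dates <- c("2023-03-29T20:26:37Z", "2023-03-07T20:10:09Z", "2023-03-07T18:01:12Z")

# Tagesschau: parse day.month.year
ts_parsed <- as.Date(ts_dates, format = "%d.%m.%Y")

# Guardian: the first ten characters of the timestamp are the date (YYYY-MM-DD)
guardian_parsed <- as.Date(substr(guardian_dates, 1, 10), format = "%Y-%m-%d")

# Both are now Date vectors and can be sorted or compared directly
range(c(ts_parsed, guardian_parsed))
```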

Classification

Now that I have the data I want to classify, I start by setting up the claimIdent package (see footnote 2), which contains the files needed for classification.

Code
library(claimIdent) # Unpublished at the time of publication of this post
configure()
Initializing - this may take a while...

This command needs to be run once per session; it sets up the Python environment, libraries, and code (see footnote 3). Afterwards, the sentences can be classified as follows:

Classify the short-text provided and assign one label.

Code
library(purrr)
ts_res <- map_df(ts_migr$shorttext, 
                 ~classify(.x, hits= 1, threshold = .9), 
                 .id = "sentence"
                 )
knitr::kable(ts_res[1:3,])

| sentence | cat | cat_sim | label | claim_prob | query |
|---|---|---|---|---|---|
| 1 | 715 | 0.49 | open society | 0.873 | Papst Franziskus hat bei einer Freiluftmesse in Budapest für Toleranz und die Aufnahme von Migranten geworben. Damit distanzierte er sich von der Politik des ungarischen Ministerpräsidenten Orban - der saß im Publikum. |
| 2 | 104 | 0.56 | isolation/immigration stop | 0.965 | Großbritannien will Migranten abschrecken: Wer illegal über den Ärmelkanal kommt, muss mit Internierung, einer Abschiebung nach Ruanda und einer lebenslangen Einreisesperre rechnen. Das Unterhaus beschloss nun das Gesetz. |
| 3 | NaN | 0.00 | no_claim | 0.000 | Noch immer wagen viele Menschen die gefährliche Flucht über das Mittelmeer. Allein seit Mitternacht kamen Hunderte Migranten auf der italienischen Insel Lampedusa an. Rettungskräfte fanden auch zwei Leichen. |

Classify the short-text provided and assign five labels.

Code
library(purrr)
library(tidyr)
ts_res_mult <- map_df(ts_migr$shorttext, 
                 ~classify(.x, hits= 5, threshold = .9), 
                 .id = "sentence"
                 )
ts_res_wide <- ts_res_mult %>%
          nest(data = c(cat, cat_sim, label)) %>%
          unnest_wider(data) 
knitr::kable(ts_res_wide[1:3,])

| sentence | claim_prob | query | cat | cat_sim | label |
|---|---|---|---|---|---|
| 1 | 0.873 | Papst Franziskus hat bei einer Freiluftmesse in Budapest für Toleranz und die Aufnahme von Migranten geworben. Damit distanzierte er sich von der Politik des ungarischen Ministerpräsidenten Orban - der saß im Publikum. | 715, 104, 712, 799, 706 | 0.49, 0.49, 0.44, 0.44, 0.43 | open society, isolation/immigration stop, public debate, General, Recognition of fundamental rights |
| 2 | 0.965 | Großbritannien will Migranten abschrecken: Wer illegal über den Ärmelkanal kommt, muss mit Internierung, einer Abschiebung nach Ruanda und einer lebenslangen Einreisesperre rechnen. Das Unterhaus beschloss nun das Gesetz. | 104, 207, 211, 408, 209 | 0.56, 0.54, 0.52, 0.49, 0.48 | isolation/immigration stop, deportations, right of abode, deprivation of liberty, residence obligation |
| 3 | 0.000 | Noch immer wagen viele Menschen die gefährliche Flucht über das Mittelmeer. Allein seit Mitternacht kamen Hunderte Migranten auf der italienischen Insel Lampedusa an. Rettungskräfte fanden auch zwei Leichen. | NaN | 0 | no_claim |
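
If plain character columns are preferred over list-columns (for example, for export to CSV), the multi-label result can also be collapsed with `dplyr`. The sketch below is an alternative to the `nest()`/`unnest_wider()` step, not the code used in this post. It assumes that `classify()` returns one row per sentence and hit, and it uses a toy data frame with the first two hits of sentence 1 and the single row of sentence 3 from the table above.

```r
library(dplyr)

# Toy long-format result: one row per sentence and hit (assumed shape)
long <- data.frame(
  sentence   = c("1", "1", "3"),
  claim_prob = c(0.873, 0.873, 0.000),
  cat        = c(715, 104, NaN),
  cat_sim    = c(0.49, 0.49, 0),
  label      = c("open society", "isolation/immigration stop", "no_claim")
)

# One row per sentence; the hits are pasted into comma-separated strings
wide <- long %>%
  group_by(sentence, claim_prob) %>%
  summarise(across(c(cat, cat_sim, label), ~ paste(.x, collapse = ", ")),
            .groups = "drop")
wide
```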

Classify the headline provided and assign one label.

Code
library(purrr)
guardian_res <- map_df(guardian$header, 
                 ~classify(.x, hits= 1, threshold = .9), 
                 .id = "sentence"
                 )
knitr::kable(guardian_res[1:3,])

| sentence | cat | cat_sim | label | claim_prob | query |
|---|---|---|---|---|---|
| 1 | 108 | 0.49 | immigration law | 0.144 | Illegal migration bill could topple world refugee system, lawmakers told |
| 2 | 104 | 0.53 | isolation/immigration stop | 0.350 | UN refugee agency ‘profoundly concerned’ by UK’s illegal migration bill saying it amounts to an asylum ban – as it happened |
| 3 | 108 | 0.68 | immigration law | 0.720 | ‘A revenge plan’: refugees and Dover residents react to illegal migration bill |

Classify the headline provided and assign five labels.

Code
library(purrr)
library(tidyr)
guardian_res_mult <- map_df(guardian$header, 
                 ~classify(.x, hits= 5, threshold = .9), 
                 .id = "sentence"
                 )
guardian_res_wide <- guardian_res_mult %>%
          nest(data = c(cat, cat_sim, label)) %>%
          unnest_wider(data) 
knitr::kable(guardian_res_wide[1:3,])

| sentence | claim_prob | query | cat | cat_sim | label |
|---|---|---|---|---|---|
| 1 | 0.144 | Illegal migration bill could topple world refugee system, lawmakers told | 108, 509, 101, 114, 104 | 0.49, 0.37, 0.36, 0.36, 0.36 | immigration law, Dublin regulation, controlled migration, (Canadian) points system, isolation/immigration stop |
| 2 | 0.350 | UN refugee agency ‘profoundly concerned’ by UK’s illegal migration bill saying it amounts to an asylum ban – as it happened | 104, 207, 406, 509, 101 | 0.53, 0.49, 0.48, 0.44, 0.37 | isolation/immigration stop, deportations, ban mile, Dublin regulation, controlled migration |
| 3 | 0.720 | ‘A revenge plan’: refugees and Dover residents react to illegal migration bill | 108, 114, 104, 101, 509 | 0.68, 0.45, 0.44, 0.43, 0.42 | immigration law, (Canadian) points system, isolation/immigration stop, controlled migration, Dublin regulation |

The columns of the result tables are:

  • sentence: index of the queried sentence
  • cat: assigned code-category from the codebook
  • cat_sim: similarity of the queried sentence to the median-embedding of the assigned code-category
  • label: corresponding label to the code-category
  • claim_prob: probability of query containing a claim
  • query: the queried sentence
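
To make the cat_sim and hits columns more concrete, here is a toy sketch of ranking categories by their similarity to a per-category median embedding. It assumes cosine similarity and uses made-up three-dimensional vectors. The embedding model and the similarity measure actually used by the classifier are not specified in this post, so this illustrates the idea and is not the package's implementation.

```r
# Cosine similarity between two numeric vectors (assumed measure)
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy embeddings of already annotated sentences (rows) and their categories
emb <- rbind(
  c(1.0, 0.0, 0.0),
  c(0.8, 0.2, 0.0),
  c(1.0, 0.1, 0.1),
  c(0.0, 1.0, 0.0),
  c(0.0, 0.9, 0.3),
  c(0.1, 1.0, 0.1)
)
cats <- c("104", "104", "104", "108", "108", "108")

# Median embedding per category: dimension-wise median over its sentences
median_emb <- t(sapply(
  split(seq_len(nrow(emb)), cats),
  function(rows) apply(emb[rows, , drop = FALSE], 2, median)
))

# Similarity of a toy query embedding to each category's median embedding
query <- c(0.9, 0.3, 0.0)
sims <- apply(median_emb, 1, cosine, b = query)

# Keep the `hits` most similar categories
hits <- 1
top <- head(sort(sims, decreasing = TRUE), hits)
round(top, 2)
```

In this toy setup the query is closest to category 104, which would be reported as cat, with its similarity as cat_sim. Setting `hits <- 2` would return both categories in descending order of similarity.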

Conclusion

The results are not perfect but provide a good first impression from which the annotation can proceed semi-automatically.
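
One way such a semi-automatic workflow could start is by sorting the classified sentences into queues according to claim_prob. The sketch below uses the six results shown above. The cutoff of 0.5 and the queue names are hypothetical choices for illustration and are not part of the MARDY guidelines.

```r
library(dplyr)

# Results copied from the single-label tables above
res <- data.frame(
  source     = c(rep("tagesschau", 3), rep("guardian", 3)),
  sentence   = rep(1:3, 2),
  label      = c("open society", "isolation/immigration stop", "no_claim",
                 "immigration law", "isolation/immigration stop", "immigration law"),
  claim_prob = c(0.873, 0.965, 0.000, 0.144, 0.350, 0.720)
)

review_cutoff <- 0.5  # hypothetical threshold, to be tuned on annotated data

triage <- res %>%
  mutate(queue = case_when(
    label == "no_claim"         ~ "skip",
    claim_prob >= review_cutoff ~ "pre-labelled",
    TRUE                        ~ "manual check"
  )) %>%
  arrange(desc(claim_prob))
triage
```

Annotators could then confirm or correct the pre-labelled items first and inspect the low-probability items by hand.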

References

The data sets used to train the models are part of the following publications:

Lapesa, G., Blessing, A., Blokker, N., Dayanik, E., Haunss, S., Kuhn, J., & Padó, S. (2020). DEbateNet-mig15: Tracing the 2015 immigration debate in Germany over time. Proceedings of LREC, 919–927. https://www.aclweb.org/anthology/2020.lrec-1.115

Blokker, N., Blessing, A., Dayanik, E., Kuhn, J., Padó, S., & Lapesa, G. (2023). Between welcome culture and border fence. A dataset on the European refugee crisis in German newspaper reports. Language Resources and Evaluation, 121–153. https://link.springer.com/article/10.1007/s10579-023-09641-8

Footnotes

  1. This makes use of the reticulate package to run Python code from R.

  2. Unpublished at the time of publication of this post; check https://github.com/nicoblokker/claim-classification for updates.

  3. Python and the required libraries need to be installed properly first; see the GitHub repository.