Text Reuse in Finnish Newspapers and Journals, 1771–1920
The database Text Reuse in Finnish Newspapers and Journals, 1771–1920 has been created by the consortium Computational History and the Transformation of Public Discourse in Finland, 1640–1910 (funded by the Academy of Finland programme on Digital Humanities, 2016–2019). The consortium is based on the cooperation of four partners: The Faculty of Humanities at the University of Helsinki, the Departments of Cultural History and Future Technologies at the University of Turku and the Centre for Preservation and Digitisation of the National Library of Finland. This brings together relevant complementary expertise on the research subject (eighteenth- and nineteenth-century history), methodology (computational sciences and language technology) and data (the preservation and enhancement of digital resources). The objective of the consortium is to reassess the scope, nature, development, and transnational connections of public discourse in Finland, 1640–1910.
The database Text Reuse in Finnish Newspapers and Journals, 1771–1920 has been built by researchers of the Department of Future Technologies and the Department of Cultural History, University of Turku. It is an outcome of WP2 of the consortium, “Viral Texts and Social Networks of Finnish Public Discourse in Newspapers and Journals 1771–1910”. The main idea has been to identify repeated texts, or passages of texts, in a corpus of scanned and OCR-recognized Finnish newspapers and journals. For various reasons, such as the Fraktur typeface, the OCR quality is at times poor and the text barely legible. For this purpose, software was created for retrieving and exploring duplicated text passages in low-quality OCR historical text corpora. The system uses NCBI BLAST, software originally created for comparing and aligning biological sequences, to find the duplicated passages. The web interface uses Solr as the back-end to perform search queries and a modified version of Blacklight as the front-end.
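As an illustration of the task only, not the project's method (the actual system relies on NCBI BLAST, which tolerates OCR noise far better), the following toy sketch uses Python's standard-library difflib to locate the longest passage shared by two texts:

```python
from difflib import SequenceMatcher

def longest_shared_passage(text_a, text_b):
    """Toy stand-in for alignment-based reuse detection: return the
    longest contiguous passage that appears in both texts."""
    matcher = SequenceMatcher(None, text_a, text_b)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    return text_a[match.a:match.a + match.size]
```

Unlike BLAST, an exact longest-common-substring match like this breaks down as soon as the two copies differ in OCR errors, which is precisely why a noise-tolerant aligner was needed.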
The original OCR data is not perfectly segmented into articles, so elements such as page breaks and pictures in the original image can split a reused text into multiple clusters of similar passages. The absolute number of passages found therefore does not necessarily reflect the actual level of reuse. Passages are also required to be at least 300 characters long, as shorter matches are often boilerplate.
For each cluster, we have also calculated a virality score. This score takes three factors into account: the number of unique locations, the number of unique titles, and the time it took for the passage to circulate. We also exclude outliers from the calculation, so that one or a few stray repetitions cannot skew the results: looking at the dates, we discard any repetitions that appear outside the Tukey fences.
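The outlier step above can be sketched as follows; this is a minimal illustration assuming dates are encoded as ordinal day numbers and the standard fence constant k = 1.5, not the project's exact implementation:

```python
from statistics import quantiles

def tukey_filter(dates, k=1.5):
    """Keep only dates inside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR].

    `dates` is a list of numbers (e.g. ordinal day numbers); stray
    repetitions far outside the cluster's date range are dropped.
    """
    q1, _, q3 = quantiles(dates, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [d for d in dates if lo <= d <= hi]
```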
The formula used is locations × titles × (1 / timespan), normalized to the range 0–100.
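A minimal sketch of the score, assuming the 0–100 normalization is done relative to the maximum raw score over all clusters (the exact normalization scheme is not specified here):

```python
def virality_score(locations, titles, timespan_days, max_raw_score):
    """Raw virality = locations * titles * (1 / timespan),
    scaled to 0-100 against a corpus-wide maximum (assumption)."""
    raw = locations * titles / max(timespan_days, 1)  # avoid division by zero
    return 100.0 * raw / max_raw_score
```

A short timespan with many locations and titles thus yields a high score, matching the intuition that a truly viral text spread widely and quickly.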
Further reading on the method and preliminary results:
Vesanto, Aleksi, Asko Nivala, Heli Rantala, Tapio Salakoski, Hannu Salmi & Filip Ginter, ’Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771–1910’, Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), 54–58, http://www.ep.liu.se/ecp/133/010/ecp17133010.pdf
Vesanto, Aleksi, Asko Nivala, Tapio Salakoski, Hannu Salmi & Filip Ginter, ’A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora’, Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), 330–333, http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf
Further information on the COMHIS consortium: