Large-scale computational approaches to evolution and change: prospects and pitfalls

Abstract

A first workshop on Large-scale computational approaches to evolution and change: prospects and pitfalls will be held at Evolang XV in Madison, US. We aim to bring together language evolution, cutting-edge NLP, and LLM-driven approaches, and to critically discuss the novel opportunities offered by large-scale empirical approaches to language evolution and change.

Date
May 18, 2024, 9:00 AM – 4:00 PM
Event
Evolang Workshop
Location
Evolang XV, Madison, US

To understand how and why a complex system like language works, understanding how it changes is key. Among the multitude of possible approaches to studying language change dynamics, this workshop focuses on the empirical study of large collections of linguistic data, such as corpora and lexical databases. Scale necessitates computation: no human alone could read through billions of words fast enough. But machines can.

There has never been a better time to apply machine learning to language. Advances in the NLP fields of semantic shift and lexical semantic change detection yield increasingly accurate automated inferences (Schlechtweg et al. 2020; Montanelli & Periti 2023; Tahmasebi et al. 2021), primarily driven by various (large) language models. Machine-readable diachronic data at both short and long time scales has become abundant thanks to corpus-building and digitization efforts, and gold-standard test sets are available for objectively evaluating different approaches (e.g. Schlechtweg et al. 2021). Developments in generative LLMs have also reached a point where previously complex NLP pipelines and costly supervised learning architectures can often simply be replaced with zero-shot LLM queries without loss in performance (Ziems et al. 2023; Karjus 2023).
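As an illustration of the zero-shot idea, consider the pairwise usage judgments that human annotators produced for resources like the DWUG graphs (Schlechtweg et al. 2021): given two attestations of a word, decide whether it expresses the same sense. A minimal sketch of delegating this judgment to an LLM follows; the client library, model name, and prompt wording are illustrative assumptions, not the setup of any cited study.

```python
# Zero-shot semantic change probe: ask an LLM whether a target word is used
# in the same sense in two attestations from different periods.
# Assumes the OpenAI Python client (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

def same_sense(word: str, usage_old: str, usage_new: str) -> str:
    prompt = (
        f"Does the word '{word}' express the same sense in both sentences? "
        "Answer with one word, 'same' or 'different'.\n"
        f"1. {usage_old}\n"
        f"2. {usage_new}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep annotation-style output as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()

print(same_sense(
    "awful",
    "The cathedral filled him with an awful reverence.",  # older 'awe-inspiring' sense
    "The traffic this morning was awful.",                # modern 'very bad' sense
))
```

Aggregating such judgments over sampled usage pairs per time period is, in essence, how the human-annotated gold standards were built; the open question is when, and for which languages and periods, a model can stand in for the annotator.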

These approaches, however, do not come without limitations. While the availability of pretrained LLMs opens up new research avenues and simplifies previously complex ones, it is important to take into account and mitigate their biases, which are inherent to any pretrained model (Dubossarsky et al. 2017). Research rooted in NLP often focuses on the what (has changed), but the what can inform the how and why. It is therefore crucial to embed these methods in theoretically meaningful frameworks and, furthermore, to delineate which detected changes may be driven by inherent linguistic mechanisms like selection and drift (Montero et al. 2023) or by general cognitive factors like optimal encoding strategies in our mental categories (Dubossarsky et al. 2015), in contrast to those reflecting changes in the socio-cultural circumstances of language communities and their communicative needs (Kemp et al. 2018; Karjus et al. 2020; De Pascale & Marzo 2023).
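To make the what concrete: much large-scale change detection reduces to comparing a word's distributional representations across time slices. A minimal sketch of one classic recipe follows (a per-period word2vec space aligned via orthogonal Procrustes, of the family surveyed in Tahmasebi et al. 2021); the corpus file names and hyperparameters are placeholders.

```python
# Rank words by semantic change between two time slices: train one embedding
# space per period, align the spaces, and measure how far each word moved.
# Assumes gensim 4.x and numpy; "corpus_old.txt" and "corpus_new.txt" are
# placeholder files with one whitespace-tokenized sentence per line.
import numpy as np
from gensim.models import Word2Vec

corpus_old = [line.split() for line in open("corpus_old.txt", encoding="utf-8")]
corpus_new = [line.split() for line in open("corpus_new.txt", encoding="utf-8")]

m1 = Word2Vec(corpus_old, vector_size=100, min_count=5, seed=1)
m2 = Word2Vec(corpus_new, vector_size=100, min_count=5, seed=1)

shared = [w for w in m1.wv.index_to_key if w in m2.wv.key_to_index]
A = np.array([m1.wv[w] for w in shared])  # period-1 vectors
B = np.array([m2.wv[w] for w in shared])  # period-2 vectors

# Orthogonal Procrustes: rotate the period-2 space onto the period-1 space
# so that vector comparisons across periods become meaningful.
u, _, vt = np.linalg.svd(B.T @ A)
B_aligned = B @ (u @ vt)

def cosine_change(i: int) -> float:
    a, b = A[i], B_aligned[i]
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

ranked = sorted(((cosine_change(i), w) for i, w in enumerate(shared)), reverse=True)
print(ranked[:10])  # top candidates for semantic change

# Caveat (Dubossarsky et al. 2017): change scores like these correlate with
# word frequency, so control for frequency before interpreting the ranking.
```

A ranking like this answers only the what; attributing a high-scoring word to selection, drift, cognitive optimization, or shifting communicative needs requires exactly the inference frameworks discussed above.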

This workshop seeks to bring together language evolution, cutting-edge NLP, and LLM-driven approaches, to critically discuss the novel opportunities of large-scale empirical approaches to language evolution and change (cf. Hartmann 2020), as well as the issues outlined above. Submissions will be short abstracts, assessed for rigor and relevance to the following questions and themes:

  • How to combine large-scale computational language change detection with meaningful inference frameworks, and cognitively plausible theories of language evolution mechanisms?
  • To what extent are LLMs as zero-shot classifiers and inference engines applicable to the study of language change?
  • How to tease apart linguistic and socio-cultural drivers of change at scale?
  • Can applied NLP benefit from evolutionary thinking, and if so, how?
  • How to evaluate and mitigate bias in pretrained language models? This includes issues of applying models trained on modern data to historical material, as well as potentially harmful social and other biases that a model may propagate from (at times unknown) training data.
  • How can experimental methods from cognitive science, psychology, or the social sciences be used to study machine bias or “behavior”? What are possible pitfalls of such method transfer? (A minimal probing sketch follows this list.)
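As one concrete instance of such method transfer, a minimal-pair stimulus of the kind used in psycholinguistic experiments can be pointed at a masked language model. The sketch below uses the Hugging Face transformers fill-mask pipeline; the model choice and sentence templates are illustrative placeholders.

```python
# Probe association bias in a pretrained masked language model with a
# minimal pair of prompts, in the spirit of experimental stimulus design.
# Assumes the Hugging Face transformers library; the model and the sentence
# templates are illustrative placeholders.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for sentence in [
    "The doctor said that [MASK] would arrive soon.",
    "The nurse said that [MASK] would arrive soon.",
]:
    print(sentence)
    for candidate in unmasker(sentence, top_k=3):
        # token_str is the predicted filler; score is its model probability.
        print(f"  {candidate['token_str']!r}: {candidate['score']:.3f}")
```

Such probes inherit the very pitfalls the question raises: template wording, tokenization, and the choice of candidate fillers can all swing the outcome, so results need the same controls an experiment on human participants would.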

Schedule

Time | Event | Authors
09:15-09:30 | Introduction | Andres Karjus & Nina Tahmasebi
09:30-10:15 | Plenary talk: What's the "language" in Large Language Models? | Claire Bowern
10:15-10:30 | Coffee break
10:30-11:00 | Modeling semasiological mechanisms of change by means of onomasiological comparisons | Stefano De Pascale & Nina Tahmasebi
11:00-11:30 | Evolution in morphological complexity and word order rigidity | Julie Nijs, Freek Van de Velde & Hubert Cuyckens
11:30-12:00 | Using large-scale computational approaches to reconstruct the evolutionary dynamics of lexical meaning and gender | Gerd Carling, Noor Efrat-Kowalsky, Marc Allassonnière-Tang, Lev Michael, Filip Larsson & Niklas Erben Johansson
12:00-13:15 | Lunch break
13:15-14:30 | Plenary talk: Computational Modeling of Linguistic Leadership | Sandeep Soni
14:30-15:00 | Instructable LLMs for scaling data-driven language and culture research | Andres Karjus
15:00-15:30 | Information-theoretic measures to study change in language use: modeling socio-cultural up to local linguistic context | Stefania Degaetano-Ortlieb
15:30-16:00 | General discussion & closing

References

  • De Pascale, S., Marzo, S., 2023. Lexical coherence in contemporary Italian: a lectometric analysis. Sociolinguistica 37, 145–166. https://doi.org/10.1515/soci-2022-0027
  • Dubossarsky, H., Tsvetkov, Y., Dyer, C., Grossman, E., 2015. A bottom up approach to category mapping and meaning change. In: Proceedings of NetWordS, pp. 66–70.
  • Dubossarsky, H., Weinshall, D., Grossman, E., 2017. Outta control: Laws of semantic change and inherent biases in word representation models. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
  • Hartmann, S., 2020. Language change and language evolution: Cousins, siblings, twins? Glottotheory 11, 15–39. https://doi.org/10.1515/glot-2020-2003
  • Karjus, A., 2023. Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence. arXiv:2309.14379
  • Karjus, A., Blythe, R.A., Kirby, S., Smith, K., 2020. Quantifying the dynamics of topical fluctuations in language. Language Dynamics and Change 10, 86–125. https://doi.org/10.1163/22105832-01001200
  • Kemp, C., Xu, Y., Regier, T., 2018. Semantic Typology and Efficient Communication. Annual Review of Linguistics 4, 109–128. https://doi.org/10.1146/annurev-linguistics-011817-045406
  • Montanelli, S., Periti, F., 2023. A Survey on Contextualised Semantic Shift Detection. arXiv:2304.01666
  • Montero, J.G., Karjus, A., Smith, K., Blythe, R.A., 2023. Reliable Detection and Quantification of Selective Forces in Language Change. arXiv:2305.15914
  • Periti, F., Picascia, S., Montanelli, S., Ferrara, A., Tahmasebi, N., 2023. Studying Word Meaning Evolution through Incremental Semantic Shift Detection: A Case Study of Italian Parliamentary Speeches. TechRxiv.
  • Periti, F., Tahmasebi, N., 2024. A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change. arXiv:2402.12011
  • Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H., Tahmasebi, N., 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online), pp. 1–23. https://doi.org/10.18653/v1/2020.semeval-1.1
  • Schlechtweg, D., Tahmasebi, N., Hengchen, S., Dubossarsky, H., McGillivray, B., 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp. 7079–7091. https://doi.org/10.18653/v1/2021.emnlp-main.567
  • Tahmasebi, N., Borin, L., Jatowt, A., Xu, Y., Hengchen, S., 2021. Computational approaches to semantic change. Language Science Press, Berlin. https://doi.org/10.5281/zenodo.5040241
  • Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., Yang, D., 2023. Can Large Language Models Transform Computational Social Science? arXiv:2305.03514