1. Introduction
Although generative artificial intelligence (AI) and large language models (LLMs) have only recently become available to the general public, they have quickly been recognised as tools with immense potential in the field of lexicography. Despite their acknowledged limitations and errors (Lew, 2023; Rees & Lew, 2023; Martin, 2024; McKean & Fitzgerald, 2024; Harnad, 2025; Kalaš, 2025, etc.), they demonstrate remarkable abilities in defining dictionary entries compared to previous software tools (see Yin & Skiena, 2023; Lew et al., 2024; Klosa-Kückelhaus & Tiberius, 2025). The findings on how well AI models perform when providing definitions, examples, common phrases, and related forms, and when completing other dictionary-compiling tasks, are relatively consistent, although assessments of AI prowess vary with the task at hand (for overviews, see De Schryver, 2023; Rundell, 2024). This study is motivated by the relative novelty of research into lexicographic applications of AI, and by the fact that its capabilities in providing definitions and examples for foreign words verifiably employed in the recipient language have yet to be tested. Moreover, exploring this lexicographic potential becomes even more challenging when linguistic interference is considered (cf. Chow et al., 2024; Li & Tarp, 2024; Merx et al., 2024, etc.), which seems to be why research on this topic is rather scarce, and why we delve into the issue of loanword meaning identification in the L2 context.
As far as the mechanism by which they operate is concerned, the AI tools such as ChatGPT, DeepSeek and Gemini are best described as deep neural networks specialised in natural language processing (NLP) tasks, the functionality of which is based on modelling linguistic patterns and performing statistical analysis on vast amounts of textual data (see Carlini et al., 2021; Xu et al., 2022; Floridi, 2023; Min et al., 2023, etc.). More specifically, these models are trained on extensive datasets primarily through unsupervised learning (drawing from sources such as internet content, books, and articles), which is then followed by fine-tuning, a process of discriminative adjustment and optimisation, aimed at enhancing their ability to »understand« linguistic patterns, respond to queries, and generate contextually relevant text (Radford et al., 2018). Their architecture enables them to process input data, interpret contextual meaning, and generate responses or text that are both coherent and relevant to users. Apart from being able to generate text in various structural and stylistic formats, the ability of ChatGPT to refine its performance on similar tasks is what makes it impressive, although it is important to emphasise that this does not equate to reasoning or emotion-based feedback; rather, »it compares existing data to draw the most likely (e.g., the most frequent and relevant) responses« (Hong, 2023, p. 38; also see Jiang et al., 2020). On the other hand, some researchers argue that generating conclusions and responses based on statistical analysis is, in fact, at the core of intelligent-like systems, with these models representing the first instances of something that illustrates how language comprehension and intelligence can be decoupled from the physical and emotional characteristics traditionally associated with humans and animals (Aguera-Arcas, 2022).
2. Previous Research on the Application of AI Models
The application of AI models in educational settings has been increasing rapidly since the emergence of highly evolved LLMs such as ChatGPT, with several key advantages being identified in the academic context, including personalised learning, lesson planning, language learning, etc. (see Kasneci et al., 2023). More specifically, in the context of language-related tasks and applications, the numerous capabilities include: emphasising key phrases, generating summaries and translations, explaining grammar and vocabulary, suggesting improvements in grammar or style, assisting with conversational practice, providing feedback to students, identifying and correcting typographical errors, and recognising opportunities to enhance writing styles tailored to specific topics (ibid., p. 3). However, when reviewing AI-generated texts, educators often notice superficial content, occasional inaccuracies, incorrect paraphrasing of existing research, and missing citations in the text and in the bibliography or, what is even more concerning, entirely fictitious sources (see King & ChatGPT, 2023; Rudolph, 2023). Problematic and often fabricated information is particularly relevant in lexicography, where accuracy and the summation of relevant information play a crucial role. Nevertheless, two recent studies, one on 166 and the other on 223 university students, found that ChatGPT significantly outperforms the web version of the Longman Dictionary of Contemporary English (LDOCE) in both language reception and production tasks (see Lew et al., 2024; Ptasznik et al., 2024).
The results suggest that AI-based chatbots like ChatGPT constitute strong competitors to traditional dictionaries in supporting advanced language learners, particularly in English, with their greatest advantage being in production tasks, where they can generate expected phrasal verbs but also help students convey meaning that is less deterministic in terms of employing specific lexical items. In the context of our study, what matters is the ability to produce a definition that is not anchored in existing repositories or dictionaries such as LDOCE, which requires a multi-layered understanding of the prompt at hand and, more importantly, a multi-level operation on the part of the chatbot. It is rather atypical to engage in a lexicographic endeavour by analysing the meaning of loanwords (especially those retaining a form that has not been fully adopted and adapted in the target language), and this atypicality is one of the ways to test the AI's adeptness at providing definitions that are more contextually anchored and language specific.
In this sense, it is worth noting that restricting lexicographic research to dictionary-compiling studies that observe only the AI's feedback on the definition of the targeted entries might not be enough: it is not sufficient to summarise the available recorded data on language use; it is also necessary to determine its relevance regarding presence, context, potential productivity, etc., which requires a »sensorimotor grounding« of sorts (see Harnad, 2025). To test the requisite multi-level approach to meaning, newer research has also focused on tasks going beyond dictionary compiling and definition writing. For example, when it comes to addressing phrases not rooted solely in written or oral language presence, the findings indicate relatively high agreement between human and ChatGPT evaluation of neologisms such as blends or derivatives (Georgiou, 2025). In other words, the findings suggest that AI primarily captures the most common or dominant interpretation and excels in processing form-based linguistic cues, whereas when it comes to the requisite extralinguistic knowledge and multi-level approach to linguistic phenomena, it tends to struggle with meanings that require broader contextual or world knowledge (ibid.). In our research, we primarily seek to challenge its capacity to properly ascertain meanings of loanwords in the L2 language context, and more importantly, to demonstrate its ability to delineate different word senses concerning prompt-design differences. This entails that we expect it not only to provide accurate feedback on the presented question about the use of the target loanword but also to understand what exactly is being asked. Though it may seem that one cannot be realised without the other (i.e. accurate feedback without prompt understanding), our data will demonstrate that answers vary not depending on the prompt (for it stays the same), but on the loanword at hand, thus exhibiting a nondeterministic character that goes beyond the mere outline and structure of the response, and directly influences the content itself.
Our study aims to assess the capacity of artificial intelligence tools to distinguish meanings of English loanwords in the Croatian language compared to their usage in English and to evaluate the quality of responses based on the frequency of English loanwords in the Croatian language.1 More specifically, it aims to assess the potential of AI in addressing differences between specific senses in which L1 words can be used in L2, and how this compares to the meanings in which they are used in L1.2 The targeted loanwords and the selection criteria are based on the work of Bogunović & Kučić (2022) and Bogunović (2023), who compiled a list of English words used in Croatian (ENGRI corpus), focusing on those that have largely retained their original orthographic and phonetic features.
3. Methodology
3.1. Research goals and questions
As stated, our research aims to assess the potential of AI in addressing differences between specific senses in which English words can be used in Croatian, and to answer the following research questions:
Is there a difference in AI responses when the definition of the same terms is required in Croatian and English?
Does AI recognise differences in the use of the same expression in Croatian and English?3
Does the frequency of a particular English loanword in Croatian affect the AI's ability to evaluate its use adequately in Croatian and provide accurate feedback?
3.2. Method
In order to address the aforementioned research questions, we designed 6 different prompts, asking the free version of ChatGPT to provide definitions of the targeted lexical items (see Section 3.3 for the sample description). The prompts were designed as 3 questions in Croatian and 3 in English, mirroring the same requirements stated in Croatian. In other words, the first 3 prompts were designed to elicit relevant information on the definition of an English word in terms of its use in the Croatian language, while the other 3 prompts, which were the English translations of the first 3, required ChatGPT to provide feedback on the use of the same lexical items, but exclusively in English (see Table 1).
Table 1. List of prompts designed to elicit information on the target English loanword where »X« is replaced by a new loanword each time / Tablica 1. Popis upita osmišljenih za dobivanje informacija o ciljanoj posuđenici iz engleskog jezika, pri čemu se »X« svaki put zamjenjuje novom posuđenicom
An important aspect of data collection entailed engaging in a new conversation each time one of the 6 prompts was given to ChatGPT, to avoid interference with previous feedback provided by the AI tool. Since ChatGPT considers the interaction established thus far (especially the previous 3 exchanges) in a particular conversation, starting a new conversation when the prompt was given ensured that the information retrieved was always devoid of the impact the previous interaction may have caused.
In order to capture the evolving nature of AI algorithms and compare the potential improvements in targeted interactions with the chatbot, we analysed responses retrieved during two different periods (January 2024 vs February and March 2025). More precisely, we repeated the 3 prompts addressing the Croatian context and the corresponding treatment of loanwords to see whether the observed idiosyncrasies of ChatGPT's feedback changed or remained the same. These interactions were carried out with the free version of ChatGPT, or more precisely ChatGPT 3.5 and ChatGPT 4o, which we hereafter refer to as ChatGPT 2024 and ChatGPT 2025 to highlight the temporal gap between the two data sampling points.4
3.3. Sample
For the initial selection of English loanwords based on their frequency of use in the contemporary Croatian language, we utilised the study by Bogunović (2023), who extracted 9,452 »unadapted« English loanwords from the ENGRI corpus. The ENGRI corpus, as described by Bogunović and Kučić (2022), consists of 2,395,735 texts collected from the 12 most popular Croatian news portals (Reuters Institute for the Study of Journalism 2021), with publications ranging from 2014 to 2020. This corpus provides the advantage of newer data, although it is smaller in size compared to hrWaC or CLASSLA.5 The texts in ENGRI primarily derive from informal and journalistic styles, which are reflective of contemporary usage trends in the Croatian language.
The data was collected during January 2024 and February and March 2025, with the retrieval procedure covering 81 different English words in the Croatian language. This resulted in a database of 486 units of information for the 2024 period (81 loanwords across 6 prompts) and 243 for the 2025 period (81 loanwords across the 3 Croatian-context prompts), which was then evaluated for accuracy, the number of meanings provided, and the soundness and plausibility of examples.
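The database sizes follow directly from the design described in Section 3.2 and can be checked with a short sketch. The request-type and language labels below are placeholders inferred from the three categories discussed in Section 4.2; they are not the actual prompt wording, which is given in Table 1.

```python
from itertools import product

N_LOANWORDS = 81  # number of English loanwords in the sample

# Placeholder labels (assumed), not the actual prompt wording from Table 1.
REQUEST_TYPES = ("meaning", "senses", "word_class")
PROMPT_LANGUAGES = ("croatian", "english")


def count_units(n_words, request_types, languages):
    """Return the number of responses: one per loanword per prompt."""
    prompts = list(product(request_types, languages))
    return n_words * len(prompts)


# 2024: all 6 prompts; 2025: only the 3 Croatian-context prompts were repeated.
units_2024 = count_units(N_LOANWORDS, REQUEST_TYPES, PROMPT_LANGUAGES)
units_2025 = count_units(N_LOANWORDS, REQUEST_TYPES, ("croatian",))
print(units_2024, units_2025)
```

Each unit corresponds to one response obtained in a fresh conversation, so the counts also equal the number of separate conversations per sampling period.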
The loanwords were categorised based on their frequency of occurrence in the ENGRI corpus:
Highly Frequent Terms (>1,000 occurrences) including show, rock, break, mail, party, reality, press, gay, summit, post, face, brand, cool, style, blues, punk, tablet, craft, monitor, stage, fair, resort, cloud, hot, cast, light, and story.
Relatively Frequent Terms (400–1,000 occurrences) including rank, pride, joint, screen, teaser, take, like, shake, space, share, position, school, insider, follow, round, deep, site, dog, force, way, card, cross, touch, name, capital, trick, and slow.
Less Frequent Terms (100–400 occurrences) including next, Bluetooth, index, resident, net, bad, fish, case, trip, extra, block, fax, showman, win, marker, unplug, special, input, grind, plank, budget, escort, fun, contact, tutorial, target, and relax.
The selection of these terms was informed by their frequency within the ENGRI corpus, providing a representative sample of loanwords as used in Croatian media.6 This approach ensures that the analysis is grounded in actual language use, capturing a range of terms from highly frequent to less frequent occurrences. These loanwords were then subject to further analysis using ChatGPT, which provided feedback on their definitions, senses, and word classes. This methodology enabled a comprehensive examination of the integration and adaptation of English loanwords in the contemporary Croatian language.
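The three-way categorisation can be expressed as a simple binning rule. The sketch below is illustrative only: the function name is hypothetical, and because the stated ranges share the boundary value of 400, assigning exactly 400 occurrences to the relatively frequent category is an assumption rather than something the source specifies.

```python
def frequency_tier(occurrences: int) -> str:
    """Map a corpus frequency to one of the three frequency categories.

    Assumption: a count of exactly 400 falls into the relatively frequent
    category; the source ranges overlap at that value.
    """
    if occurrences > 1000:
        return "highly frequent"
    if occurrences >= 400:
        return "relatively frequent"
    if occurrences >= 100:
        return "less frequent"
    return "below sampling threshold"


# Made-up counts for illustration; these are not ENGRI frequencies.
for count in (2500, 1000, 400, 150, 20):
    print(count, frequency_tier(count))
```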
4. Results
4.1. Individual response analysis
ChatGPT’s feedback is most interesting when it comes to the prompt requiring the word class of a particular English expression in Croatian. Although there are a number of useful and remarkably detailed replies, the tool at times seems prone to using certain English loanwords in the Croatian language in idiosyncratic, if not completely implausible, contexts. Some rather unconvincing segments of ChatGPT’s responses (from both the 2024 and 2025 sampling points) are shown in Table 2 (see the text in bold), clearly indicating that the AI overgeneralises the meaning that has established itself in English by extending it into Croatian regardless of whether such application has been observed in everyday speech.7 Note that the plausible and legitimate examples that constitute the majority of ChatGPT’s responses are removed from the table for the sake of brevity.
Table 2. Segments from ChatGPT’s responses on Word class of »X« in Cro prompt
Tablica 2. Isječci iz odgovora ChatGPT-a o vrsti riječi »X« u upitu na hrvatskom jeziku
As with the prompt asking for the word class that the target loanword can occupy in the Croatian language, the responses to the prompt asking for different senses contain implausible suggestions about the loanwords' use (Table 3).
Table 3. Segments from ChatGPT’s responses on Senses of »X« in Cro prompt
Tablica 3. Isječci iz odgovora ChatGPT-a o značenjima riječi »X« u upitu na hrvatskom jeziku
Table 4 demonstrates that ChatGPT sometimes provides completely inconsistent or unrelated definitions for the term in question. In the case of the loanword fair, it gives examples both for the Croatian term that is its literal translation and for the English expression within the same response. In the case of the loanword post, it makes morphologically rooted errors, providing examples for words that are lexically unrelated to the English loanword in question. In some instances, it also makes part-of-speech errors: for the loanword mail, it claims that the loanword takes on the role of a verb whilst providing an example in which it functions as a noun in the position of a direct object. An interesting aspect of the 2025 data is that, unlike before, ChatGPT regularly states that the dubious examples it provides occur only rarely in Croatian, thus demonstrating a greater degree of data awareness (because of the content it exemplifies) and prompt understanding (it recognises which information it should focus on based on the request of its interlocutor).
Table 4. Segments from ChatGPT’s responses indicating types of errors made
Tablica 4. Isječci iz odgovora ChatGPT-a koji upućuju na vrste pogrešaka
It is important to note that these are not the only implausible examples from ChatGPT’s responses, but only a selection of those that belong to the highly frequent terms employed in Croatian everyday internet jargon, which makes the dubious examples all the more discouraging.
Table 5. Nondeterministic nature of ChatGPT’s feedback to Senses of »X« in Cro prompt / Tablica 5. Nedeterministička priroda odgovora ChatGPT-a na upite o značenjima riječi »X« u hrvatskom jeziku
It is also worth noting that the stochastic nature of ChatGPT’s feedback persisted in the data retrieved during 2025: we find variability in style and content not only across prompts that differ slightly in their design, but also when the exact phrasing of the prompt is repeated with a different loanword. While it is obvious that the content differs from one loanword to another, the responses also differ in the type of data provided (see Table 5), with the possible loanword senses in the L2 context differing considerably with respect to the word forms being exemplified.
4.2. The number of meanings per given prompt
As expected, the highest average number of meanings provided by ChatGPT is for prompts requesting the senses, and the lowest for those requesting the possible word classes (Table 6). The median and mode values are generally close to the mean, indicating a roughly symmetric distribution of the AI’s responses. For prompts asking for the meaning without specifying a language context, the AI provides an average of 3.5 meanings when the prompt is formulated in Croatian and 3.9 when it is formulated in English. This suggests only a slightly higher number of meanings in English responses, thus indicating that the language of formulation, if left without further specification of the language in which the target word is to be used, is of little consequence to the way in which ChatGPT responds.8
As expected, the AI provides more senses of the observed loanwords in English than in Croatian, indicating that it tends to generate a richer set of senses when the prompts specify the language context in which the target word’s definition is required. ChatGPT also suggests fewer word classes overall, with a mean of 2.1 and 1.9 for Croatian (2024 and 2025 data respectively) and 2.9 for English. This is the lowest among the three categories for both languages, indicating that prompts asking for the word class yield fewer results than those asking for meanings and senses. The standard deviation (SD) values are higher for the English counterparts for meanings and senses (2.1 and 1.9, respectively), suggesting greater variability in the number of responses the AI provides for these prompts. Overall, the AI seems to provide a more extensive range of senses in English compared to Croatian, while the number of word classes remains relatively low and similar across both languages.
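The descriptive measures reported for each prompt (mean, median, mode, and SD) can be computed as in the following minimal sketch. The counts are made-up illustrative values, not the study's data, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import statistics

# Hypothetical numbers of meanings returned for eight loanwords under one
# prompt; illustrative values only, not taken from the study.
meaning_counts = [3, 4, 4, 5, 2, 4, 6, 3]

summary = {
    "mean": statistics.mean(meaning_counts),
    "median": statistics.median(meaning_counts),
    "mode": statistics.mode(meaning_counts),
    # Sample standard deviation (n - 1 in the denominator): an assumption.
    "sd": statistics.stdev(meaning_counts),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

When the median and mode lie close to the mean, as in this toy sample, the distribution of counts is roughly symmetric, which is the reading applied to Table 6 above.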
Table 6. General descriptive statistics for the number of meanings provided by ChatGPT per given prompt / Tablica 6. Opća deskriptivna statistika o broju značenja koje je ChatGPT dao po pojedinom upitu
Furthermore, the analysis of meanings provided by ChatGPT 2024 depending on the frequency of the English loanword in the ENGRI corpus (i.e. everyday Croatian internet jargon) provides some interesting insights when it comes to the average number of meanings in prompts addressing senses, especially in highly frequent terms shared between English and Croatian (see Figure 1). The most striking discrepancy is in the average number of senses provided for highly frequent terms, with prompts addressing the English language context yielding a much higher average (7.6) compared to those addressing the use of loanwords in the Croatian context (5.2). In accordance with previously stated results, the difference in the number of word classes between Croatian and English is minor (less than one on average, except in the category of relatively frequent terms), indicating that the AI’s part-of-speech assessment for the given terms does not vary much across languages. Interestingly, the frequency of terms does seem to affect the number of meanings and senses more prominently in English than in Croatian, especially in highly frequent terms.
The AI’s feedback to only one of the 6 prompts correlates with the frequency of the term in the ENGRI corpus: the one asking for different senses of the word in the English language. That is, the more frequent the term is in Croatian, the more senses it seems to have in English (r = 0.33). Furthermore, the highest correlation is between the numbers of meanings the AI provides in response to the definition prompts formulated in English and in Croatian, where the prompts themselves do not specify the language context (r = 0.66). This indicates that the AI tends to generate consistent definitions across both languages (Table 7).

Figure 1. The average number of meanings provided by ChatGPT 2024 per given prompt depending on the frequency of the English loanword in the ENGRI corpus
Slika 1. Prosječan broj značenja koje je ChatGPT 2024 dao po pojedinom upitu, ovisno o učestalosti engleske posuđenice u korpusu ENGRI
Additionally, there is a moderate to relatively high correlation between the AI’s feedback on prompts asking for senses in both Croatian and English (r = 0.48). This suggests that the more senses a loanword has in English (according to ChatGPT), the more senses it also exhibits in Croatian (according to ChatGPT). This consistency implies that the AI’s understanding of the breadth of meanings and senses of loanwords is similarly comprehensive across both languages, but also agrees with the suggestion concerning the overgeneralisation-type mistakes where the meaning that has established itself in English is extended into Croatian regardless of whether it is appropriate.
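For readers who wish to reproduce this type of analysis, the sketch below computes a correlation coefficient between two lists of per-loanword counts. It assumes that the reported r values are Pearson coefficients, which the text does not state explicitly, and the input values are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariation / sqrt(ss_x * ss_y)


# Hypothetical numbers of senses per loanword (not the study's data).
senses_croatian = [3, 5, 4, 6, 2, 5]
senses_english = [4, 7, 5, 9, 3, 6]
r = pearson_r(senses_croatian, senses_english)
print(f"r = {r:.2f}")
```

A positive coefficient here means only that loanwords given more senses in one language context also tend to be given more in the other; it says nothing about whether those senses are attested.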
Table 7. Correlation matrix: prompts and frequency
Tablica 7. Korelacijska matrica: upiti i učestalost
**p<0.01
For the 2024 data, ANOVA revealed no significant difference across the frequency categories in the average number of meanings provided by ChatGPT in response to the prompt asking for the possible senses of targeted English loanwords in the Croatian language (Figure 2). Although one might expect that with an increase in the frequency of a term in the Croatian language (as determined by the ENGRI corpus; Bogunović, 2023) the number of different meanings of loanwords would also increase (i.e. the more a specific term is used, the more likely it is to acquire senses attached to the situations in which it is applied), the picture is quite heterogeneous. Based on ChatGPT’s responses, there is no clear relationship between the frequency of loanwords in the Croatian language corpus and the number of meanings they can have in Croatian. This heterogeneity aligns with the findings discussed earlier, where the AI demonstrated variability in the number of meanings and senses it provides based on prompts formulated in Croatian and English. Specifically, while the AI tends to generate a richer set of meanings and senses in English, this richness does not necessarily translate to the Croatian language in a straightforward manner. At first glance, the relatively high correlation between the AI’s feedback on senses in both languages suggests that the AI’s comprehension is broad but not directly influenced by the frequency of terms in Croatian.
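The comparison across frequency categories can be illustrated with a minimal sketch of the F statistic, assuming a one-way ANOVA with the three frequency categories as groups (the exact ANOVA design is not specified in the text). The group values are invented, and obtaining a p-value would additionally require the F distribution, for instance from a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical numbers of senses per loanword in each frequency category.
example_groups = [
    [4, 5, 3, 4, 5],  # highly frequent
    [6, 5, 7, 6, 5],  # relatively frequent
    [6, 7, 5, 6, 6],  # less frequent
]
f_stat, df_b, df_w = one_way_anova_f(example_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```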

Figure 2. Difference between the average number of meanings provided by ChatGPT 2024 in response to the prompt asking for the possible senses of targeted English loanwords in the Croatian language according to their frequency in the corpus
Slika 2. Razlika u prosječnom broju značenja koje je ChatGPT 2024 dao kao odgovor na upit o mogućim značenjima ciljanih posuđenica, prema njihovoj učestalosti u korpusima
On the other hand, the ANOVA conducted on the answers provided by ChatGPT in 2025 in response to the same prompt revealed a significant difference in the number of senses (Figure 3). More specifically, ChatGPT discerned significantly fewer senses in the highly frequent category than in the other two (approx. 1.5 fewer on average). This points to several potential explanations, all of which might hold at the same time: (1) the AI model behind the chatbot changed in the period between the two data retrieval points, (2) the nondeterministic nature inherent to the model is evident not only across and within feedback to the same prompt, but also across the model's evolution over time, (3) there is a language-related reason why the loanwords in the three categories differ in the number of senses they manifest, which the newer model of ChatGPT has managed to differentiate. As already stated, the intuitive supposition might be that an increase in the frequency of a term in the Croatian language would result in a greater number of different senses in which the loanword can be used, and yet the data retrieved from the 2025 model shows the opposite. One reason why the number of senses could be lower in the highly frequent category is that ChatGPT was able to narrow these senses down accurately precisely because sufficient data on their use was available, whereas in the less frequent categories it failed to do so and thus resorted to overgeneralisation, applying L1 context senses (senses in English) to L2 ones (senses in Croatian). This tendency has already been observed and discussed in the qualitative analysis (see Section 4.1), and it may suggest that ChatGPT’s flawed overgeneralised answers are often motivated by the fact that the chatbot »feels obliged« to provide some feedback, whether accurate or not. Finally, this would suggest that the model behind the chatbot has evolved in recent years but still falls short when it comes to expressions that are less attested in recorded language use.

Figure 3. Difference between the average number of meanings provided by ChatGPT 2025 in response to the prompt asking for the possible senses of targeted loanwords in the Croatian language according to their frequency in the corpus
Slika 3. Razlika u prosječnom broju značenja koje je ChatGPT 2025 dao kao odgovor na upit o mogućim značenjima ciljanih posuđenica u hrvatskom jeziku, prema njihovoj učestalosti u korpusima
The data on whether the frequency of the term plays a role in ChatGPT’s evaluation remains inconclusive when comparing the 2024 and 2025 results. The 2024 results showed that, while frequency might play a role in the AI’s responses, it does not significantly affect the variety of meanings given for the Croatian use of English loanwords. In the 2025 data, however, a difference with respect to frequency was ascertained, suggesting that improvements in the algorithm resulted in more constrained and precise feedback for highly frequent terms than for less frequent ones (corroborated by the qualitative analysis of its responses, which indicated more flaws as frequency declines). Finally, this indicates a complex interaction between language prompts and AI interpretations, which may be significantly affected by the AI’s overextension of the loanword's L1 senses into the L2 language context.
5. Final discussion
Our findings remain consistent with the claims that, at least at this stage, a degree of expert human oversight is still needed when it comes to AI-provided feedback. In fact, there is already a lot of data suggesting that, when going beyond the simple production of definitions and engaging in phrase-level or sentence-level operations, the AI tends to underperform on certain occasions, requiring heavy human oversight owing to its significant tendency to invent facts, overgeneralise, or misrepresent them (cf. McKean & Fitzgerald, 2024). In this context, one of the observations that immediately becomes obvious in the qualitative analysis of ChatGPT’s responses is the stochastic nature of the feedback to prompts. This confirms previous research (De Schryver, 2023), with our results showing answers that vary with respect to the type of information provided in them (sometimes referring to word types, and sometimes exclusively to the number of meanings without mentioning word types), even though the question was asked identically (this is especially pronounced with less detailed questions). Finally, while it may appear that accurate feedback is inherently dependent on understanding the prompt, our data illustrates that responses fluctuate not because of changes in the prompt itself, which remains constant, but rather as a result of the specific loanword being analysed. This variability underscores the system’s nondeterministic nature, which extends beyond mere structural and stylistic elements and directly influences the content of its responses.
Again, as is evident from a number of different studies, working with chatbots like ChatGPT has both advantages and disadvantages. On the positive side, these studies note that such tools can enhance productivity, lower lexicographic costs in terms of both time and money, and facilitate access to data that may otherwise be difficult to obtain, but the well-documented hallucinations in AI-generated responses feature just as prominently at this stage of research (Fuertes-Olivera, 2024). The qualitative analysis of the data retrieved in our research reaffirms the claim that a noticeable number of responses appear to be hallucinations, with examples of English loanwords provided by ChatGPT sounding completely implausible in the Croatian language context. There is also a more observable difference in the 2025 data between more frequent and less frequent loanwords, as evidenced by a greater number of flawed responses with the decline in frequency (cf. Merx et al., 2024), but this is not to be interpreted in favour of the 2024 ChatGPT data; in the data retrieved in 2024, we found an equal distribution of flawed answers across loanword frequencies. Again, note that we corroborated our suspicions regarding flawed responses on loanword senses and definitions by searching Google for various phrase combinations in exact or related context formulations, and found no uses resembling the ones provided by ChatGPT. This was confirmed for both the 2024 and 2025 versions of ChatGPT, the latter of which has the more advanced GPT-4o model integrated. Interestingly, we observe flawed answers both in prompts requiring the outlining of the possible senses of the loanword in the L2 context and in the one requiring the outlining of possible part-of-speech categories of said loanword (e.g. wrong part-of-speech identification, target language-related inconsistency in examples, inconsistent and lexically unrelated examples, etc.). On the other hand, when restricted to solely defining the term in its L1 context, there are few to no issues worth mentioning.
To be fair, in our paper, we mostly focus on the examples that, according to our standards, are either flawed or misrepresent the crucial information required by the related prompts. However, it is important to emphasise that ChatGPT was mostly rather good at providing definitions, especially when they concerned the use of loanwords in their source language (where they constitute just »words«), and would almost always provide the expected or dominant interpretation of the loanword in context (cf. Georgiou, 2025). Additionally, when the chatbot is pressed for further information in cases where the answer seems flawed, it tends to quickly mend the answer in terms of factual or use-verifiable accuracy. In this context, Trap-Jensen (2025) has claimed that the results of lexicographers’ experiments may have been inconsistent due to individuals’ varying attitudes toward the technology involved, arguing that those who are enthusiastic about new technology may be impressed when a chatbot achieves 75% accuracy, while sceptics might focus on the remaining 25% of errors. At this stage of AI development, the accurate view seems to be that probabilistic models produce outputs that are neither entirely correct nor entirely incorrect, but instead fall somewhere in between. There is also the issue of English bias, both linguistically and culturally, with English holding a significant advantage that makes a direct comparison impossible (see Trap-Jensen, 2025).
Some researchers now propose building monolingual LLMs, including ones dealing with Croatian-specific tasks; their argument is that multilingual LLMs may not perform optimally across all languages, especially lower-resourced ones, due to imbalanced training data that favours high-resource languages, and that without standardised evaluation tools for assessing multilingual performance, it remains unclear how other languages in a multilingual LLM affect its capabilities in Croatian (Štefanec et al., 2024; Thakkar et al., 2024). In the context of our research, it is interesting to observe that the definition of the loanword is primarily addressed through the lens of English, which is the source language in this case, even when the prompt is not formulated in English; i.e. the language in which the prompt is formulated plays little to no role in defining a specific term (in this case, an English loanword). It is not enough to formulate a prompt in the language of the context for which the definition is required; it seems necessary to further specify the language in which the term is used. This is further evidenced by the fact that, regardless of whether the prompt concerning the definition of the targeted expression is formulated in English or Croatian, there seems to be little difference in the average number of meanings in ChatGPT’s feedback. More precisely, the only evident difference in the number of meanings provided by ChatGPT exists in the category of highly frequent English loanwords in Croatian, where one additional meaning is provided on average when the question is formulated in English.
While it may seem that the main conclusion of our study regarding AI application is that the training data for the respective languages is insufficient and needs to be improved, the results address a problem slightly more complex than the one seemingly at hand. Specifically, understanding prompt intention is sometimes a key factor in shaping the chatbot’s response. This is as much an extralinguistic issue as it is a linguistic one, and it will likely only be fully resolved with the integration of more advanced semantic networks and contextual learning mechanisms, i.e. with the development of artificial general intelligence (AGI). The types of tasks designed in our research and the data retrieved are in line with the claims that ChatGPT does not truly »understand« (at least not yet) and that it merely reproduces or mirrors language production and comprehension (cf. Harnad, 2025). For example, as predicted, the number of senses (of use) of the examined loanwords has been generally somewhat lower in the target language (Croatian) than in the source language (English). In-depth analysis has also revealed a considerable number of flawed responses, suggesting that the difference should have been greater than observed. Additionally, although one might expect that, with the increase in frequency, the number of different meanings of English loanwords in Croatian would also increase, the situation is quite heterogeneous, as there is no clear correlation between the frequency of loanwords in the Croatian language corpus and the number of meanings they can have in Croatian when considering the responses of artificial intelligence retrieved in 2024. However, data from the 2025 model indicates a trend opposite to this expectation. A possible explanation for the reduced number of senses in the high-frequency category is that ChatGPT has been able to accurately constrain these meanings due to the availability of sufficient usage data.
In contrast, for less frequent terms, the model appears to struggle with this distinction, leading to overgeneralisation and the transfer of L1 contextual meanings onto L2 contexts, which has already been observed in the qualitative analysis. The chatbot model has improved over the year, but it still demonstrates limitations when handling less commonly attested expressions in language use. It would be interesting to observe whether these trends would occur with another set of English loanwords in Croatian that is not necessarily based on a domain-specific corpus (i.e., ours was determined according to the ENGRI corpus; Bogunović, 2023), since the compiling methodology can play a role in the frequencies, skewing the results and affecting data interpretation.
Future research should examine more thoroughly the idiosyncratic examples provided by AI and compare them with recorded corpus or other text data, i.e. corroborate on a greater scale whether each of the suggested uses truly constitutes a »mistake«. Furthermore, it would be beneficial to cross-compare responses from different AI tools (e.g. Gemini, ChatGPT, DeepSeek, Grok, Perplexity, etc.) and evaluate their accuracy and style in view of their understanding of, and ability to define, loanwords in L2 context, thus providing further insight into the difference in the quality of different AI models for language-related tasks.
6. Conclusion
While it should be noted that the language of the response always matches the language of the question, regardless of the specific nuances these tools provide in their answers, the non-deterministic nature of the feedback is most apparent when responding to questions that are not sufficiently well-formulated or explicitly defined. For instance, when the prompt requires the word’s meaning and is formulated in Croatian, but does not provide additional information on the language context in which the word is used, the feedback can vary significantly on a case-by-case basis. Given that the inquiry pertains to the definition of an English loanword, the responses vary depending on whether the word is defined primarily through the lens of its meaning in the source language or whether artificial intelligence accounts for the fact that the question is asked in Croatian—presumably because the user seeks information about the word’s meaning in the target language (i.e. Croatian rather than English). Additionally, in some instances, the response focuses on the grammatical category that the lexical entry can assume in a given context. In others, it provides different meanings without referencing the word’s grammatical classification.
Although artificial intelligence is a practical tool in lexicography, a more detailed comparison reveals that it can also serve as its counterpoint. The fundamental purpose of a dictionary is to provide a relatively stable description of the language (even though the language itself is inherently unstable), whereas the operational model of artificial intelligence follows a completely different approach. While AI does provide accurate word meanings, the number of meanings it generates often varies; more often than not, a single meaning is divided into multiple, subtly distinct senses. This variability stems from the fact that artificial intelligence is designed to generate relatively new and original text, but this artificial fluidity seems rather forced, especially when there is no explainable reason why the answer would need to vary. In other words, the algorithm’s tendency to always provide an answer, regardless of whether that answer is grounded in truth or language use, affects the feedback to the point where it often becomes significantly flawed, which is certainly more evident in language contact lexicography than in monolingual dictionary compilation. Some of the findings in our data can be summarised as follows:
The results demonstrate the AI’s proficiency in providing accurate definitions and distinguishing between senses in which the examined expressions are used, although this is not consistently demonstrated when it comes to possible loanword senses in the L2 context. The less attested the loanword is in everyday use, the greater the chances the AI will provide a flawed response in some respect (often in view of L1 sense overgeneralisation onto L2).
As for the definition of the targeted English expression, there seems to be little difference in the average number of meanings in ChatGPT’s feedback regardless of whether the prompt is formulated in English or Croatian. We interpret this result as a further confirmation that the definition of the loanword is primarily addressed from the English standpoint even when the prompt is formulated in L2.
As predicted, the number of senses (of use) of the examined loanwords is generally somewhat lower in the target language (Croatian) than in the source language (English). However, qualitative analysis reveals that artificial intelligence often makes errors and provides flawed responses with examples not attested in the available repositories of recorded language performance, suggesting ways of use that do not align with their everyday use in Croatian, both semantically and morphologically. This is true for the data retrieved in both early 2024 and early 2025, although with qualitatively observable improvements in the 2025 model, which at times does address the debatable nature of the dubious examples it provides.
Although one might expect that, with the increase in language use frequency, the number of different meanings of loanwords in Croatian would also increase, the situation is quite heterogeneous, as there is no clear correlation between the frequency of loanwords in the Croatian language corpus and the number of meanings they can have in Croatian when considering the responses of ChatGPT from 2024. On the other hand, some difference has been observed in the responses of ChatGPT from 2025, with counter-intuitive findings suggesting a lower number of senses in the highly frequent category. Again, we interpret this as a by-product of ChatGPT’s tendency to prioritise generating a response regardless of accuracy, with overgeneralisation and the transfer of L1 contextual meanings onto L2 contexts being responsible for the (potentially) unjustified increase in the number of overall senses for less frequent categories of targeted expressions.
In the realm of lexicography, the opaque nature of AI training texts, colloquially described as a »black box« (Steurs et al., 2020, p. 12), poses challenges, but despite occasional inexplicable outputs, it appears inevitable that prompt engineering will play a crucial role in the evolving landscape of lexicographers’ work as the technology progresses and becomes more transparent. Nevertheless, the main conclusion arising from our lexicographic experiments indirectly reflects the view, consistent with the cognitivist tradition, that language is not an entity isolated from the contextual elements of the environment in which it manifests itself. Extralinguistic factors are integral even within lexicography, a discipline primarily focused on providing a factual, concise, yet detailed view of the lexicon.
REFERENCES
Central European Conference on Information and Intelligent Systems - CECIS 2024, 225–229. Varaždin: University of Zagreb Faculty of Organization and Informatics.
