The UF Corpus Linguistics Lab is located in the basement of Turlington Hall at the University of Florida and is one of several labs of the Linguistics Department. In the Corpus Linguistics Lab, we investigate language data using corpora. Corpora are large-scale digital collections of language. The lab offers access to various corpora of English, German, Spanish, and other languages; corpora of written and transcribed spoken language; and specialized corpora such as corpora of academic speech and writing, learners of English as a second language, and the like. Access to these corpora is provided using various software tools such as AntConc, MonoConc Pro, WordSmith Tools, and R. The lab also provides access to E-Prime for experiments.
In the UF Corpus Linguistics Lab, we see corpus linguistics as a method, not a theory. All faculty and students affiliated with the Corpus Linguistics Lab are united by their commitment to rigorous, empirical analyses of language data. Correspondingly, the researchers affiliated with our lab conduct research in various theoretical frameworks and on a wide range of topics, including language processing, second language acquisition, and the synchronic and diachronic description of languages such as Dutch, English, Spanish, and many others. A list of some of our ongoing research projects appears below.
If you are a student interested in studying with us, we want to speak with you. Please contact the lab director, Stefanie Wulff (swulff@ufl.edu).
Lab Members
Ryan Cheves (graduate student)
Jamie Garner (faculty member)
Edith Kaan (faculty member)
Shengyu Liao (Ph.D. student)
Zoey Liu (faculty member)
Yigit Savuran (postdoctoral visiting scholar)
Haiyin Yang (Ph.D. student)
Affiliate Lab Members
Laurence Anthony (Waseda University)
Noa Attali (UC Irvine)
Ryan K. Boettger (University of North Texas)
Jorge González Alonso (UiT The Arctic University of Norway)
Stefan Th. Gries (UC Santa Barbara)
Ethan Kutlu (University of Iowa)
Nicholas A. Lester (University of Zurich)
Magali Paquot (Université catholique de Louvain)
Michelle Perdomo (Vanderbilt University)
Manuel Pulido (Penn State University)
Mike Putnam (Penn State University)
Ute Römer (Georgia State University)
Jason L. Rothman (UiT The Arctic University of Norway)
Debra Titone (McGill University)
Former Lab Members
Anna Bjorklund, B.A. Student/Lab Volunteer
James Blackeagle (undergraduate student)
Samantha Creel (Ph.D. student)
Steven Critelli, B.A. Student/Lab Volunteer
Erica Drayer, B.A. Student/Lab Volunteer
Dylan Attal, B.A. Student
Corinne Futch, B.A. Student/Research Assistant
Chad Hammock (Ph.D. student)
Isa Hendrikx, Visiting Scholar
Eva Harvey, B.A. Student/Lab Volunteer
Martha Hinrichs, B.A. Student
Alexandra Lavrentovich, Ph.D. student
Hali Lindsay, B.A. Student/Lab Volunteer
Marc Matthews, Ph.D. Student
José Molina, B.A. Student
Rebecca Morris, B.A. Student
Meckenzie Powell, B.A. student
Holly Redman, Lab Volunteer
Noah Rucker, B.A. Student/Lab Volunteer
Chen Si, Ph.D. Student/Research Assistant
Beatrice Villanueva, Lab volunteer
Alexander Webber, M.A. Student/Lab Volunteer
Current Projects
Meckenzie Powell: The development of prepositional collocations in learner English
(2020/2021 University Scholars Program Fellowship undergraduate study project; advisor: Jamie Garner)
In this project, Meckenzie studies the use of prepositional collocations across proficiency levels by L1 Korean learners of English. Previous research on second language (L2) learners’ use of collocations has shown that L2 learners often have difficulty using collocations (Garner, Crossley, & Kyle, 2019). However, most of this research has focused on verb-noun and adjective-noun collocations. There has been a lack of research into the acquisition of prepositional collocations, specifically noun-preposition and adjective-preposition collocations (e.g. angry at, interested in, influence on, amount of). To that end, this project will examine the development of noun-preposition and adjective-preposition collocation use across multiple levels of L2 writing proficiency. The data for this study come from the Yonsei English Learner Corpus (YELC; Rhee & Jung, 2014) and consist of 1,350 essays (351,762 words) written by L1 Korean learners of English. These essays are evenly divided into high-beginner, low-intermediate, and high-intermediate proficiency levels. She will extract all noun-preposition and adjective-preposition combinations that contain one of the 10 most frequent prepositions in English (e.g. at, on, with, to). She will then calculate token frequency and type frequency for both categories of collocations in order to examine how many prepositional collocations in total, and how many different prepositional collocations, each group of learners uses. This will be followed by the calculation of association strength scores for all combinations using frequency information from the Corpus of Contemporary American English (COCA; Davies, 2009), a larger corpus of native speech and writing, in order to assess how native-like the learners’ use of prepositional collocations is.
These variables will be compared across the three proficiency levels in order to examine how the productive knowledge of these types of collocations develops from beginner to intermediate level for L1 Korean learners of English.
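The frequency measures described above can be illustrated with a minimal sketch. It assumes that the extracted combinations are available as (word, preposition) pairs; the function names and all counts are invented for illustration. Pointwise mutual information is used here only as one possible association measure, since the project description does not name a specific one.

```python
from collections import Counter
from math import log2


def collocation_counts(pairs):
    """Return (token frequency, type frequency) for a list of (word, preposition) pairs.

    Token frequency counts every occurrence; type frequency counts distinct combinations.
    """
    counts = Counter(pairs)
    return sum(counts.values()), len(counts)


def mutual_information(pair_freq, word_freq, prep_freq, corpus_size):
    """Pointwise mutual information: log2 of observed over expected co-occurrence frequency.

    All four arguments are frequencies from a reference corpus (assumed inputs).
    """
    expected = word_freq * prep_freq / corpus_size
    return log2(pair_freq / expected)


# Hypothetical learner data: combinations extracted from one proficiency level.
learner_pairs = [
    ("interested", "in"),
    ("angry", "at"),
    ("interested", "in"),
    ("influence", "on"),
]
tokens, types = collocation_counts(learner_pairs)
print(tokens, types)

# Hypothetical reference-corpus frequencies for one combination.
print(mutual_information(pair_freq=8, word_freq=16, prep_freq=1000, corpus_size=4000))
```

Comparing the token and type counts and the association scores across the three proficiency groups would then correspond to the comparison outlined above.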
Noa Attali: The role of emphasis in scope ambiguity resolution
(NSF PIRE project)
In this research project, we’re investigating mechanisms of ambiguity resolution. We’re looking at cases of ambiguity arising when utterances have multiple modifiers (e.g., Everyone didn’t go, which has the quantifier every and negation n’t). Understanding these utterances involves interpreting which modifier takes scope over the other: in line with their surface word order, the quantifier could take scope over negation (e.g., meaning that no one went) or, according to their inverse order, negation could take scope over the quantifier (e.g., meaning that not everyone went). Previous research reports conflicting interpretation preferences, but a host of factors, including listener age, pragmatic expectations, and syntactic priming, seem to matter for interpretation. Researchers have also long claimed that intonation, as a marker of information structure, can determine interpretation, so we are especially interested in the role of emphasis expressed through prosodic prominence in speech and through bolding or capitalizing in text. We propose a computational model of interpretation based on the quantifier (every, some, or no), the use of emphasis, and the question under discussion in the immediate discourse context. As part of determining values for parameters in our model and assessing its predictions, we seek to understand instances of this kind of scope ambiguity attested in corpora of natural written and spoken language. How often are these quantifiers used? When quantifiers appear in these ambiguous constructions, how often is the intended scope in fact clear, what is the question under discussion, and when is emphasis used?
Stefanie Wulff and Ryan K. Boettger (with the help of research assistant Chad Hammock): Collaborative Research: Evaluating a data-driven approach to teaching technical writing to STEM majors
(research project; funded by NSF #1708360/#1708362)
Overview. This research project seeks to improve the quality of writing instruction for undergraduates majoring in science, technology, engineering, and mathematics (STEM). Understanding disciplinary differences in writing has become increasingly relevant as instruction moves from literature-based composition courses in English departments to include technical writing and content-based courses taught by scholars in different disciplines. One effect of these changes is that students need to write in a way that conforms to the practices of a discipline they may not (yet) be familiar with. However, STEM undergraduates have little access to customized, discipline-specific writing instruction. A solution to this problem is engaging students in a form of data-driven learning (DDL) that teaches them how to write in their discipline rather than apply generalist writing principles that contradict how professionals actually communicate. An interdisciplinary team of researchers will develop a series of DDL instructional units for STEM undergraduates in both multi-major writing-intensive courses and STEM-focused content courses in physiology and ecology. Unit content and students’ application of the instruction will be validated through peer review and revised via a control-group quasi-experimental design. Results and instructional materials will be disseminated through publications, workshops, and publicly available web tutorials.
Intellectual Merit. Introductory technical writing courses provide a great service to STEM departments, but it is not uncommon for instructors to have 20 different majors represented in their classroom. This project includes an innovative combination of characteristics designed to help writing and discipline-specific instructors customize their curriculum to meet the needs of all students: (i) It introduces modern corpus-linguistic methods that make large-scale studies possible, covering more text types and more language features, rather than case studies of a small number of individuals, classes, or texts. (ii) The DDL environment provides STEM students an accessible forum for applying these techniques and learning to overcome writing deficiencies that are prevalent in their disciplines. (iii) The project’s personnel encompass content, language, and methodological expertise and represent three disciplines: biology, linguistics, and technical communication. (iv) The effects of DDL will be assessed in four diverse populations at a major public institution that reflects the global demographics and instructional challenges for teaching technical writing. The inclusion of multiple instructional settings will address how DDL transfers to diverse STEM settings and influences how students learn technical writing.
Broader Impacts. The proposed project advances discovery and understanding of how STEM students learn to write in their disciplines. Additionally, the project fosters new interdisciplinary collaborations focused on a fundamental component of STEM education—technical writing. STEM undergraduates need customized writing instruction and enhanced communication skills to prepare for the workforce. To help these students and their instructors, the team will disseminate the following for public use: (i) the Technical Writing Project (TWP), an online corpus of student technical writing previously compiled by the lead researchers; (ii) materials for the instructional units; and (iii) a series of web tutorials for audiences engaged in STEM writing practices on how to use the TWP and the instructional materials for individual and classroom learning purposes. The team will also disseminate the research findings through conference presentations, workshops, and peer-reviewed research within linguistics, technical communication, and STEM education. These venues attract academics and practitioners as well as national and international audiences.
Chad Hammock: Implementing data-driven learning in L2 Korean language classroom: a first foray
(Ph.D. dissertation project; advisor: Steffi Wulff)
Research on the applications and effectiveness of DDL has focused on English as a Second Language classrooms. As such, the overwhelming majority of research thus far has been on English as an L2, with fewer studies on other languages such as German, French, and Russian. All of these are Indo-European languages and, as a matter of fact, a perusal of the available research shows only one study on a non-Indo-European language, Chinese (Smith et al., 2008). The proposed research will focus on DDL applications and effectiveness in a Korean language classroom. Korean is not an Indo-European language and is, in fact, a language isolate. The proposed study would be the first to consider applications of DDL in Korean language learning. The Korean language itself is particularly interesting because of its overtly patterned usage, and current Korean language pedagogy exploits these patterns when introducing new concepts to students. The proposed research aims to determine what effect this has when it comes to implementing DDL in the Korean language classroom. Korean language students, especially those studying at a high level in Korea, will already be accustomed to being introduced to new words and grammar structures in their relevant patterns and may well be “primed” for DDL in a way that English language learners or other Indo-European language learners are not.
Ethan Kutlu: Factors impacting native speakers’ FAS judgments
(Ph.D. dissertation project; advisor: Steffi Wulff)
In this dissertation, we aim to understand and identify factors that can affect a rater’s judgments while hearing foreign-accented speech. Many L2 learners face daily discrimination as their speech may be accented and thus considered incomprehensible. In comparison to a regional accent, which is generally found to be more acceptable, foreign-accented speech (FAS) is often regarded as problematic (Gluszek & Dovidio, 2010). Since the early 1970s, FAS has been examined in the (related) fields of linguistics, second language acquisition, and more recently, social psychology (Munro & Derwing, 1995; Ferguson et al., 2010; Van Engen & Peelle, 2014). Meanwhile, linguistic studies in accentedness and speech perception agree that speech perception is variable, and that humans can identify sounds even with minimal acoustic cues (Hillenbrand, Clark & Baer, 2011). This raises two questions: What makes FAS different from other kinds of speech variation? Why is FAS judged so negatively by so many native speakers?
Past Projects
Alexandra Lavrentovich: Using classifier features to determine cross-linguistic influence on the developmental trajectory of English morphemes
(Ph.D. dissertation project; advisor: Steffi Wulff)
One prevailing position in second language acquisition (SLA) research is that learners of another language (L2) follow a predictable, fixed path in the acquisition of morphosyntactic structures (Goldschneider & DeKeyser, 2001; VanPatten & Williams, 2007), regardless of their dominant language (L1) background (Ellis, 1994; Ortega, 2009). For example, grammatical morpheme studies propose a so-called natural order for English learners (Krashen, 1987). However, recent literature reviews, experimental studies, and corpus approaches have cast doubt on the fixed nature of developmental sequences (Hulstijn et al., 2015; Weitze et al., 2011; Murakami & Alexopoulou, 2016). For example, Luk and Shirai (2009) find that Japanese and Spanish learners of English show different hierarchies of accurate use of three morphemes, which may be explained by the presence or absence of the equivalent morpheme in the L1. In a longitudinal corpus study, Murakami (2016) shows individual variation and non-linearity in trends for accurate use across proficiency levels. Hence, L1 background and proficiency can reorganize the predicted order of morpheme acquisition. Aligning with the current research, this dissertation investigates cross-linguistic influence in the developmental trajectory of English grammatical morphemes. The research aims to model the dynamic and emergent nature of morpheme production by using a longitudinal learner corpus and computational methodology. The research has the following goals: 1. Quantitatively detect the under/overuse of grammatical morphemes between learners with different L1 backgrounds and qualitatively examine what underlies these patterns to determine cross-linguistic influence. 2. Model the absence and presence of grammatical morphemes for individual learners across different proficiency levels to determine the extent of individual variation in morpheme accuracy development.
The data will come from the EF-Cambridge Open Language Database (EFCamDat), a 33-million-word longitudinal corpus of English learner scripts written by students enrolled in a virtual learning environment (Geertzen et al., 2014). From the data, I include Chinese, Spanish, Portuguese, Arabic, Russian, and German learner groups, as they are the most represented in the corpus, accounting for over 70% of the data (Alexopoulou et al., 2015; Nisioi, 2015). The learners pass through 16 proficiency levels in the online curriculum that correspond to the language proficiency guidelines A1 through C2 set forth by the Common European Framework of Reference. The main goal is to determine how cross-linguistic influence (CLI) might reorganize the predicted morpheme order at different proficiency levels of a learner’s developmental trajectory. To demonstrate L1 influence, I will follow criteria from a detection-based approach (Jarvis & Crossley, 2012) which uses frequency differences between English writing patterns and the selected L1 backgrounds. The criteria for determining CLI are as follows: (1) intragroup homogeneity: where learners with the same L1 show similar morpheme developmental trajectories; (2) intergroup heterogeneity: where learners with different L1 backgrounds show different trajectories; (3) cross-language congruity: where learners use an English pattern that is similar to one they have in their L1; and (4) intralingual contrasts: where learners differentially use an English feature depending on how congruent that feature is in their L1. One way to meet the criteria is to carry out a Native Language Identification (NLI) task where a machine classifier identifies a learner’s L1 based solely on the learner’s English writing. An NLI analysis identifies the specific English features most likely to be affected by the L1 which we may not detect from more subjective, manual, surface-level analyses (Crossley, 2012).
A computational approach to CLI has the advantage of being able to deal with a large quantity of very similar data points (e.g., the distribution of function words across all learner essays) and estimating the probability of a learner’s L1 given subtle patterns in the data (e.g., the overuse or underuse of function words). I will use a support vector machine classifier with features such as part-of-speech n-grams and function words. The findings from the classification task will be used to determine patterns of the presence or absence of specific linguistic features between L1 groups and how these patterns may change across proficiency levels. To further explore longitudinal factors and individual variation, I will use generalized additive mixed models on individuals in the corpus. The intellectual merit of this research will be in its triangulation of learner corpora, computational methods, and qualitative analysis to show how differences between learners can be approached in a data-driven way. The study looks at the emergence and distributional frequencies of grammatical morphemes for English learners with different L1 backgrounds across increasing proficiency levels. The NLI approach improves on manual comparisons or learner case-studies because we can use larger data sets, make more objective decisions for where L1-specific language transfer effects may occur, and perform more semi-automatic analyses on other available corpora. There’s also evidence that classifiers outperform human experts in detecting L1 background (Malmasi et al., 2015). The broader impact of this research is to exploit the findings on cross-linguistic transfer and individual variation in hypothesis-making in SLA and pedagogy. For example, the NLI task contributes to SLA research by adding quantitative data to known transfer effects that an otherwise manual inspection could miss and may help with hypothesis-making as to why these transfer effects exist.
For direct applications in language teaching and learning, L1-specific transfer effects can be used informatively to tailor instruction, feedback, and methods in the classroom and curriculum as well as be applied to language teaching technology.
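As a rough illustration of the kind of features mentioned above, the following sketch turns one learner text into function-word and part-of-speech n-gram features. It covers only feature extraction; the classifier itself is not shown. The function-word list, the feature names, and the example text and tags are hypothetical and much smaller than anything a real study would use, and the part-of-speech tags are assumed to come from an external tagger.

```python
from collections import Counter

# Hypothetical, deliberately short function-word list for illustration only.
FUNCTION_WORDS = ("the", "a", "of", "in", "on", "to", "and")


def function_word_features(tokens):
    """Relative frequency of each listed function word in one learner text."""
    counts = Counter(token.lower() for token in tokens)
    n_tokens = len(tokens) or 1  # avoid division by zero for empty texts
    return {f"fw={word}": counts[word] / n_tokens for word in FUNCTION_WORDS}


def pos_ngram_features(tags, n=2):
    """Relative frequency of each part-of-speech n-gram in one learner text."""
    ngrams = Counter(zip(*(tags[i:] for i in range(n))))
    total = sum(ngrams.values()) or 1
    return {"pos=" + "_".join(gram): count / total for gram, count in ngrams.items()}


# Hypothetical learner sentence with part-of-speech tags supplied by hand.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]

features = {**function_word_features(tokens), **pos_ngram_features(tags)}
for name, value in sorted(features.items()):
    print(name, round(value, 3))
```

Feature dictionaries of this kind, one per text and labeled with the writer's L1, could then be passed to a classifier such as a support vector machine.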
Sasha graduated in 2019 and currently works for Amazon Alexa.
Stefanie Wulff and Stefan Th. Gries (with the help of lab volunteers Anna Bjorklund, Steven Critelli, Erica Drayer, Corinne Futch, Hali Lindsay, and Noah Rucker): Cognitive determinants of oral and written blend formation
(research project)
In this research project, we aim to take a first step towards clarifying how the cognitive determinants of blending interact in the online production of blends. To this end, we are conducting an experimental study in which native speakers of English are asked to blend source words together. The source word stimuli will be systematically controlled for these cognitive determinants (described under Dylan Attal’s project below). In a crucial extension of our previous research with Dylan Attal, we will elicit blends both orally and in writing from our participants. The results will be statistically evaluated both monofactorially (means, interquartile ranges, and exact tests) and multifactorially by means of a linear model that identifies which factors contribute to an increasing distance of the chosen cut-off points to the ideal ones as determined by the predictors (Gries 2006). The findings of this study stand to make a valuable contribution to our understanding of subtractive word formation processes by providing us with first clues regarding an online production and comprehension model of blending and by informing our understanding of the differences between creative and conscious word formation processes such as intentional blending compared to involuntary and unconscious word formation processes such as speech errors.
Stefanie Wulff and Stefan Th. Gries (with the help of research assistant Corinne Futch): Particle placement in L2 learner language
(research project; funded by a Language Learning Small Research Grant)
In this project, we are carrying out the first large-scale, corpus-based analysis of particle placement in learner language. Particle placement is a word-order alternation that involves the variable position of the particle in English transitive phrasal verbs (The squirrel picked up the nut vs. The squirrel picked the nut up). While particle placement has been researched intensively in native language, it has not been examined on this scale in learner language. Our study includes data from three L1 backgrounds (Chinese, German, and Spanish) as well as native English speakers; data from the spoken and written modes; and a statistical model integrating a large number of variation parameters known to influence alternations in general, especially under-researched phonological constraints.
Marc Matthews: Need to, have to, and must: a collostructional analysis
(2015/2016 graduate advanced study project; advisor: Steffi Wulff)
Modal verbs are a challenge even for intermediate-advanced learners of English. In this study, Marc examines three near-synonymous modal verbs in English, have to, need to, and must, in order to identify semantic nuances that distinguish these three verbs in authentic language use. The ultimate goal of the study is to present a number of teaching suggestions to improve learners’ understanding of how to use these modal verbs. To this end, Marc retrieved >5,000 tokens of the three modal verbs from the 2012 spoken sub-section of the Corpus of Contemporary American English. He is now in the process of annotating that data for several variables that we believe to impact native speakers’ choice of modal, including the subject (pronominal vs. lexical nouns), the animacy of the subject, the degree of association between the modal and the matrix clause verb, and the absence or presence of negation. We will subject the data to a series of collostructional analyses (Gries & Stefanowitsch 2004), and, ultimately, a multinomial regression analysis, to determine which factors play a role in the choice of modal, if and how these factors interact, and how important they are relative to each other.
Martha Hinrichs: The role of surprisal in L2 syntactic priming
(2014/2015 University Scholars Program Fellowship undergraduate study project; advisors: Edith Kaan and Steffi Wulff)
In this project, Martha investigates the double object alternation in L1 Korean L2 English written production data. In Korean, all ditransitive verbs can be used in the prepositional object construction, while only some verbs such as cwu- (give) can also be used in the Korean double object construction (Jung & Miyagawa 2004). This raises the question whether advanced Korean learners of English would exhibit the same kind of constructional priming effects observed for other L2 English learners at advanced levels of proficiency, and if so, to what extent these priming effects are modulated by verb-specific knowledge that is aligned with the constructional verb preferences of native speakers (Gries & Wulff 2005, 2009). This study uses the Yonsei English Learner Corpus (YELC), a compilation of essays written by students in South Korea. Distinctive Collexeme Analysis (DCA; Gries & Stefanowitsch 2004) will be employed to measure each verb’s associative bias towards either construction in the learner data. These statistics will then be compared to native English speakers’ preferences documented in previous research.
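To illustrate the idea behind a distinctive collexeme analysis, the sketch below compares one verb's frequency in two constructions using a one-tailed Fisher exact test computed from the hypergeometric distribution. The function names and all counts are hypothetical. Reporting the negative base-10 logarithm of the p-value as the strength of the preference is one common convention and not necessarily the exact procedure used in this project.

```python
from math import comb, log10


def hypergeom_pmf(k, row1, col1, total):
    """P(X = k) for the top-left cell of a 2x2 table with fixed margins.

    row1: all tokens of the verb; col1: all tokens of construction A;
    total: all tokens of both constructions.
    """
    return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)


def distinctive_collexeme(verb_in_a, verb_in_b, total_a, total_b):
    """Return (preferred construction, -log10 of the one-tailed Fisher exact p-value)."""
    row1 = verb_in_a + verb_in_b
    total = total_a + total_b
    expected_a = row1 * total_a / total
    lowest = max(0, row1 - total_b)   # smallest possible count in construction A
    highest = min(row1, total_a)      # largest possible count in construction A
    if verb_in_a >= expected_a:
        preferred = "A"
        outcomes = range(verb_in_a, highest + 1)   # as extreme or more towards A
    else:
        preferred = "B"
        outcomes = range(lowest, verb_in_a + 1)    # as extreme or more towards B
    p_value = sum(hypergeom_pmf(k, row1, total_a, total) for k in outcomes)
    strength = float("inf") if p_value == 0 else -log10(p_value)
    return preferred, strength


# Hypothetical counts: a verb occurs 30 times in construction A and 10 times in
# construction B, with 100 tokens of each construction overall.
print(distinctive_collexeme(30, 10, 100, 100))
```

Running the function once per verb and ranking the verbs by strength within each construction yields the kind of verb-specific preferences that can be compared between learners and native speakers.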
Rebecca Morris: L1 vs. L2 idiom processing: investigating the role of morpho-syntactic variation
(2014 undergraduate individual study project; advisor: Steffi Wulff)
In this project, Rebecca will test native and non-native speakers’ sensitivity to different variants of V NP idioms. Adopting a usage-based perspective, the hypothesis under investigation is that native speakers’ speed and accuracy in determining whether a given phrase carries an idiomatic or a literal meaning depend on the surface form that the phrase is presented in. More specifically, native speakers are expected to identify idiomatic and literal senses faster when the phrase is presented in its most typical, i.e. frequent, surface form. A second hypothesis to be tested is that non-native speakers should exhibit the same qualitative behavior, yet less pronounced than native speakers. The underlying assumption is that since non-native speakers have had less input, and therefore have weaker mental representations of what constitutes the most typical variant forms, they will be less able to rely on their knowledge of these more or less fixed assemblies as they make judgments and/or give reaction times. In order to test these two hypotheses, Rebecca will first identify the most frequent as well as less frequent variant forms of a set of 60 V NP idioms (which are available as a data sample from previous and ongoing research of Dr. Wulff) and use these as stimuli in a combined judgment and RT task.
Rebecca graduated from the University of Florida in 2015 and is currently a Ph.D. student at Indiana University.
Dylan Attal: Cognitive determinants of blend formation: an experimental approach
(2013/2014 University Scholars Program Fellowship and honors thesis project; advisor: Steffi Wulff)
In a television commercial broadcast at the 2013 Super Bowl, the sandwich company Subway let its customers know that throughout the month of February, any sandwich would cost only 5 dollars. In order to make this promotion more memorable, they referred to it as Februany, a blend of February and any [sandwich]. Blending is an extremely popular word creation process, especially for advertisement campaigns and newspaper headlines – both genres in which space is limited and publishers compete for consumers’ attention. Blends fit the bill because they are compressed language, and they are catchy.
From a cognitive-linguistic research perspective, blends raise one major question: what factors impact the way in which a speaker blends two words together? For example, what makes brunch, a blend of breakfast and lunch, a better blend than breakfunch? Previous research on the basis of large collections of blends suggests that speakers take a variety of cognitive determinants into consideration in order to achieve the ideal balance between economy (the bigger the overlap of words, the better) and recognizability of the source words (the more material of both source words remains intact, the better). These cognitive determinants include various characteristics of the source words, such as their phonetic, phonemic, graphemic, segmental, and semantic similarity as well as their frequency in language. How exactly these characteristics interact in the online production and comprehension of blends, however, remains largely unclear to date. Gries (2012: 166) correspondingly points towards the dire need to leave behind purely descriptive linguistic accounts and turn to psycholinguistic concepts, notions and methods instead… With regard to experimental approaches, it would be interesting to have speakers coin blends of source words while controlling for many of the factors known to influence blending.
In this research project, we aim to take a first step towards addressing this gap by conducting an experimental study in which native speakers of English are asked to blend two source words together. The source word stimuli will be systematically controlled for the different cognitive determinants mentioned above. The results will be statistically evaluated both monofactorially (means, interquartile ranges, and exact tests) and multifactorially by means of a linear model that identifies which factors contribute to an increasing distance of the chosen cut-off points to the ideal ones as determined by the predictors (Gries 2006). The findings of this study stand to make a valuable contribution to our understanding of subtractive word formation processes by providing us with first clues regarding an online production and comprehension model of blending and by informing our understanding of the differences between creative and conscious word formation processes such as intentional blending compared to involuntary and unconscious word formation processes such as speech errors.
Dylan completed his USP Fellowship project and his honors thesis in April 2014. His honors thesis earned highest honors.
José Molina: Constructional priming as a function of L2 proficiency and L1 background
(2013 Honors thesis project; advisor: Steffi Wulff)
In this project, José elaborated on a previous study by Gries & Wulff (2005) that tested advanced German L2 English learners’ knowledge of verb argument structure constructions such as the ditransitive construction (José gave Steffi the paper) and the prepositional dative construction (José gave the paper to Steffi). José replicated two experiments, a syntactic priming experiment and a semantic sorting experiment. Rather than investigating only advanced learners of English from one L1 background as in Gries & Wulff (2005), José elicited data from L2 learners at low-intermediate levels of proficiency, and from different L1 backgrounds. He found that the main contributor to priming was the verb provided in the sentence fragment to be completed (as opposed to the verb presented in the prime sentence or the construction provided in the prime). Furthermore, he observed pronounced verb-specific effects such that certain verbs primed either construction significantly more often than others.
José completed his honors thesis project in December 2013 and was awarded highest honors. He graduated from the University of Florida in 2014 and then earned an M.A. degree in computational linguistics at Brandeis University.