NLP

Semantic word embeddings and conceptual change: A case study in creativity (With Sam Franklin, Postdoctoral Fellow in History at Stanford University)

This project adapts semantic word embeddings (vector models) developed by Stanford computer scientists (HistWords) to explore the conceptual history of creativity in the Corpus of Historical American English (COHA). Building on n-grams from the Google Books corpus, HistWords provides a more sophisticated means of measuring lexical change in target words over time, such as the move from creativity::productive to creativity::imaginative (or innovative or artistic). Semantic word embeddings have myriad advantages over simpler n-gram models: rather than deducing patterns strictly from collocation (e.g. words which occur in the neighborhood of ‘creative’/‘creativity’), they also track the distribution of those collocates independent of the target word. This has been shown (Hamilton, Jurafsky, et al. 2016) to provide a richer picture of the conceptual landscape of the target word in a given time interval than its immediate linguistic context alone. Having implemented initial historical SWE models, we have several directions for future work which address questions raised in our respective dissertation research:

  • What can we learn about the historical influences on the lexical/conceptual trajectory of creativity?
    Who were the innovators of our modern conception of creativity? One planned subproject addresses this using genre-classification tools to reflect usage across different genres (e.g. from the early uptake of ‘creativity’ in the mid-century psychological literature, to its later explosion into popular media).
  • What can we learn about grammatical factors which influence or constrain lexical change?
    Why do we talk about creativity now when we used to talk about creativeness? As Franklin’s earlier work has shown, ‘creativeness’ and ‘creativity’ were both in use in the early 20th century, before the latter became the established standard. In the context of our case study, we plan to evaluate proposed linguistic generalizations about the semantic nature of two productive nominalizers in English: -ness and -ity. How do the lexical semantics of -ness and -ity nominalizations differ in general? Do these differences help us understand why ‘creativity’ came to dominate the discourse about this concept?
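
The kind of shift described above (creativity::productive giving way to creativity::imaginative) can be illustrated by comparing a target word's nearest neighbors in embeddings trained on different decades. The following is a minimal sketch, not the HistWords implementation: the three-dimensional vectors, the decades, and the helper names are all invented for illustration, and real models use hundreds of dimensions learned from the corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(target, vectors, k=1):
    """Return the k words whose vectors lie closest to the target's vector."""
    scores = {
        word: cosine(vectors[target], vec)
        for word, vec in vectors.items()
        if word != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy vectors per decade (assumption: made up for this sketch,
# not values from any trained model).
toy_embeddings = {
    1900: {
        "creativity": [0.9, 0.1, 0.0],
        "productive": [0.8, 0.2, 0.1],
        "imaginative": [0.1, 0.9, 0.2],
        "artistic": [0.0, 0.8, 0.5],
    },
    1990: {
        "creativity": [0.1, 0.9, 0.3],
        "productive": [0.8, 0.2, 0.1],
        "imaginative": [0.1, 0.9, 0.2],
        "artistic": [0.0, 0.8, 0.5],
    },
}

# Neighbors are computed within each decade's own vector space.
neighbors_by_decade = {
    decade: nearest_neighbors("creativity", vectors, k=1)
    for decade, vectors in toy_embeddings.items()
}
print(neighbors_by_decade)
```

Because neighbors are ranked separately inside each decade's space, this comparison does not require the decade models to be aligned with one another; comparing a word's vector directly across decades would.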

Detecting sound-symbolism in an understudied language: An NLP-assisted approach (With Arthur Hjorth, Postdoctoral Researcher at Northwestern University)

In many languages, sound symbolism is conventionalized in the lexicon in the form of ideophones: collections of marked words which dramatically convey sensory experiences. Ideophones are often defined as a phonosemantic class for which the form-meaning mapping is not totally arbitrary, and they make up a considerable subset of the lexicon in languages as diverse as Japanese, Zulu, Siwu, and Mayan (Dingemanse 2012). Although it is well known that lexical subclasses can have different phonotactic profiles, there has been little study of how predictably ideophones can be identified by phonotactic cues in a given language. Understanding the systematicity of ideophonic sound patterns is a precondition for exploring other questions.

To explore this question from a computational angle, we developed a classifier to distinguish ideophones from verbs in Wolof (Niger-Congo, Atlantic Branch; Eth:Wo), a language rich in both ideophones (Ka 1986) and large texts in standard phonetic orthography (Dione 2012). Ideophones are vivid depictions of meanings otherwise conveyed descriptively by verbs (Dingemanse 2012), yet they commonly constitute a separate lexical class. This is the case in Wolof, where ideophones have a morphosyntactically unique profile and can be readily identified on these grounds. Nevertheless, initial qualitative analysis suggested several phonetic features which commonly characterize ideophones: 1) Many ideophones are (non-productively) reduplicated, sometimes with a special emphatic suffix on the second reduplicant. 2) While productively reduplicated forms often host affixes ending in ‘u’, no ideophones do (to our knowledge). 3) Among non-reduplicated ideophones, most conform to a CVCC pattern with a geminate consonant coda and -ATR nucleus (/a/ or //).

To test how well these sound patterns distinguish ideophones from verbs, independent of syntactic information, we constructed a dataset of 247 ideophones from Wolof dictionaries (Munroe 1997, Diouf 2003) plus 210 common verbs identified in a Peace Corps grammar manual. To this, we added 939 verbs automatically extracted from the Wolof New Testament based on syntactic environment, after eliminating duplicates. Using the phonetic feature-based model described above and NLTK’s Naive Bayes classifier, we trained our classifier on a random selection of half of our ideophones (IDs) (123) and half of the non-ideophonic verbs (NIVs) (547), and calculated accuracy by classifying the remaining words in our corpus. Randomizing the training and test sets 1000 times, we achieved an accuracy of 0.631 for IDs and 0.940 for NIVs.

Ideophonic words are invariant in form, but every Wolof verb lemma also serves as a stem for derivational suffixes. Since our initial run included derivationally related verbs, we used a simple stemmer to replace the verb forms from the first run with lemmas (n=942). Following the same training and testing procedure, stemming increased the accuracy of the classifier slightly for IDs, to 0.703, while accuracy for NIVs was 0.934.
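
The procedure above (boolean sound-pattern features, a Naive Bayes classifier, and per-class accuracy averaged over repeated random half splits) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it swaps in a small hand-rolled Naive Bayes so that it runs with the standard library alone, whereas the study used NLTK's classifier. The word lists are invented placeholder strings rather than items from the dataset, the vowel inventory and the single -ATR nucleus are assumptions, and the reduplication check ignores the emphatic suffix mentioned in the text.

```python
import math
import random
from collections import Counter, defaultdict

VOWELS = set("aeiouëéóà")  # assumed vowel letters; adjust to the orthography
LAX_NUCLEI = {"a"}  # assumption: only /a/; the source's second vowel is not recoverable


def phonotactic_features(word):
    """Map a word to three boolean cues modeled on the patterns in the text."""
    half = len(word) // 2
    reduplicated = (
        len(word) >= 4 and len(word) % 2 == 0 and word[:half] == word[half:]
    )
    cvcc_geminate = (
        len(word) == 4
        and word[0] not in VOWELS
        and word[1] in LAX_NUCLEI
        and word[2] not in VOWELS
        and word[2] == word[3]
    )
    return {
        "reduplicated": reduplicated,
        "ends_in_u": word.endswith("u"),
        "cvcc_geminate": cvcc_geminate,
    }


def train_naive_bayes(labeled):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter(label for _, label in labeled)
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    for feats, label in labeled:
        for name, value in feats.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts


def classify(model, feats):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in feats.items():
            # Add-one smoothing; each feature is boolean, hence the +2.
            seen = value_counts[(label, name)][value]
            score += math.log((seen + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def per_class_accuracy(words_by_label, runs=100, seed=0):
    """Average per-label accuracy over repeated random half/half splits."""
    rng = random.Random(seed)
    correct, seen = Counter(), Counter()
    for _ in range(runs):
        train, test = [], []
        for label, words in words_by_label.items():
            shuffled = words[:]
            rng.shuffle(shuffled)
            cut = len(shuffled) // 2
            train += [(phonotactic_features(w), label) for w in shuffled[:cut]]
            test += [(w, label) for w in shuffled[cut:]]
        model = train_naive_bayes(train)
        for word, label in test:
            seen[label] += 1
            correct[label] += classify(model, phonotactic_features(word)) == label
    return {label: correct[label] / seen[label] for label in seen}


# Invented placeholder strings (assumption: not taken from the study's dataset).
toy_words = {
    "ID": ["patpat", "fatt", "rakk", "kapp", "dagdag", "mabmab"],
    "NIV": ["belu", "domal", "binu", "sogal", "tiru", "nofe"],
}
print(per_class_accuracy(toy_words, runs=100))
```

Reporting accuracy separately for each class, as the text does, matters here because the two classes are unequal in size: a single pooled figure would be dominated by the larger class of verbs.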

This work represents an early NLP-assisted approach to substantiating a feature-based model of ideophonic sound patterns. Future work will involve expanding the dataset’s inventory of ideophones and their English dictionary definitions. This will allow us to take a similar NLP-assisted approach to semantic subclasses of ideophones, as preliminary analysis suggests that certain sound patterns are more strongly correlated with certain types of sensory depictions, such as sounds or physical sensations.

 

The textual history of nostalgia (With Jonathan Schroeder, Lecturer at University of Warwick)

Nostalgia possesses a well-demarcated history: the word was coined in 1688 to name a medical pathology (an extreme form of homesickness), but today holds a radically different meaning. The concept has undergone several dramatic changes:

• from a spatial to a temporal concept;
• from a desire to an emotion;
• from a medical to a popular/literary term.

Yet scholars have not explained when and how this transformation happened, owing to the difficulty of tracking fine-grained changes in the concept. To address this problem, we acquired a data set of 360k files from HathiTrust, which we used to build our targeted corpora of 3.5k “nostalgia” files, and later scaled up to a larger data set of 2.8 million files (also available from HathiTrust). We then designed algorithms to measure the word frequency of “nostalgia,” the word frequency of its alleged synonym, “homesickness,” and the changing frequencies of nostalgia’s top co-occurring words.
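
The two measurements described above, a term's relative frequency per period and its most frequent co-occurring words, can be sketched as follows. This is a minimal illustration rather than the project's algorithms: the function names, the window size, the stopword list, and the two tiny sets of example sentences are all assumptions invented for the sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency(docs_by_year, term):
    """Occurrences of `term` divided by total tokens, for each year."""
    freqs = {}
    for year, docs in docs_by_year.items():
        tokens = [tok for doc in docs for tok in tokenize(doc)]
        freqs[year] = tokens.count(term) / len(tokens) if tokens else 0.0
    return freqs


def top_cooccurring(docs_by_year, term, window=2, k=3, stopwords=frozenset()):
    """For each year, the k words most often within `window` tokens of `term`."""
    result = {}
    for year, docs in docs_by_year.items():
        counts = Counter()
        for doc in docs:
            tokens = tokenize(doc)
            for i, tok in enumerate(tokens):
                if tok != term:
                    continue
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(
                    w for w in context if w not in stopwords and w != term
                )
        result[year] = [word for word, _ in counts.most_common(k)]
    return result


# Invented example sentences (assumption: not drawn from the HathiTrust data).
toy_docs = {
    1860: [
        "the surgeon reported nostalgia among the soldiers",
        "nostalgia and homesickness afflicted the regiment",
    ],
    1960: [
        "a wave of nostalgia for the past",
        "nostalgia for childhood memories",
    ],
}
STOPWORDS = frozenset({"the", "a", "of", "for", "and", "among"})

freqs = relative_frequency(toy_docs, "nostalgia")
collocates = top_cooccurring(toy_docs, "nostalgia", window=2, k=3, stopwords=STOPWORDS)
print(freqs)
print(collocates)
```

Dividing by the total token count for each period keeps the frequencies comparable even when far more text survives from some periods than from others.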

Using this method, we were able to pinpoint specific periods of meaning shift as well as transferral between genres. We used supporting historical scholarship to formulate hypotheses as to the catalysts for these shifts. For example, genre analysis allowed us to pinpoint the mid-1860s as the period when nostalgia began to bleed over from the medical literature to the popular press, likely due to journalists reporting on the frequent diagnosis of “nostalgia” among Civil War soldiers.