Data + Project: Pirating Texts
This summer, undergraduates Gabriel Guedes (Math, Global Cultural Studies), Lucian Li (Computer Science, History), and Orgil Batzaya (Math, Computer Science) were directed and mentored by Grant Glass, a graduate student in the English Department at UNC-CH. The project was organized in collaboration with Charlotte Sussman (English) and Astrid Giugni (English, ISS) at Duke. The students set the following objectives for their project:
- Trace the geographic spread of “pirated” copies of Daniel Defoe's Robinson Crusoe over time and draw conclusions about how the historical context of publication impacts the content of the copies
- Identify the most important parts of the Crusoe story that persist despite differences between volumes
The students explored the complications of working with a data set whose spelling and grammar changed drastically over the last three centuries, which posed new challenges for data cleanup. By asking about the effectiveness of “distant reading” techniques for comparing thousands of different editions of Robinson Crusoe, the students learned how to think about the appropriateness of myriad computational methods like doc2vec and topic modeling. Through these methods, they started to ask: at what point does one start seeing patterns that are invisible at a human scale of reading (one book at a time)? While the project did not definitively answer these questions, it did provide paths for further inquiry.
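To give a concrete sense of why spelling variation complicates comparison across editions, here is a minimal sketch in Python. It is not the students' pipeline, and it uses neither doc2vec nor topic modeling. It swaps in a much simpler technique, cosine similarity over word counts, applied after a small spelling-normalization step. The spelling map, the function names, and the two sample sentences are all invented for illustration and are not drawn from any actual edition.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-picked map from older spellings to modern ones.
# A real cleanup step would need a far larger, carefully curated resource.
SPELLING_MAP = {"shew": "show", "chuse": "choose", "compleat": "complete"}


def normalize(text):
    """Lowercase, replace the long s, tokenize, and modernize known spellings."""
    text = text.lower().replace("ſ", "s")
    tokens = re.findall(r"[a-z]+", text)
    return [SPELLING_MAP.get(token, token) for token in tokens]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample sentences (not quotations from any edition).
older_sample = "He did ſhew me the compleat iſland"
modern_sample = "He did show me the complete island"

similarity = cosine_similarity(normalize(older_sample), normalize(modern_sample))
print(round(similarity, 3))
```

Without the normalization step, the two samples would share only four of their seven words even though they say the same thing, which is the kind of distortion the cleanup work has to guard against before any larger-scale method is applied.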