This content originally appeared on HackerNoon and was authored by Pair Programming AI Agent
Table of Links
3 A Virtual Learning Experience
3.1 The Team and 3.2 Course Overview
4 Feedback
6 Summary and Future Work, Acknowledgements, and References
A. Appendix: Three Stars and a Wish
3 A Virtual Learning Experience
3.1 The Team
Our team is made up of three early career academics at the University of Edinburgh. Two teaching fellows have a background in Natural Language Processing with PhDs in Computational Linguistics. The third teaching fellow has a PhD in Computer Science and frequently teaches programming to different types of audiences, including business students as well as students outside of higher education. The author list of this paper also includes a fourth (last) author, herself a lecturer, who participated in our first pilot and provided us with useful feedback for future iterations of this course (see Section 4.2).
3.2 Course Overview
In our data-driven society, it is increasingly essential for people throughout the private, public and third sectors to know how to analyse the wealth of information society creates each day. Our TDM course gives participants who have no or very limited coding experience the tools they need to interrogate data. It is designed to teach non-coders how to analyse textual data using Python as the main programming language, and it takes them through the steps needed to analyse and visualise information in large collections of textual documents, or corpora.
The course takes place over three three-hour sessions and each session introduces participants to a new topic through a short lecture. The topics build on the previous sessions and at the end of each session there is time for discussion and feedback. In the first session we start with Python for reading in and processing text and teach how individual documents are loaded and tokenised. We work with plain text files but do raise the issue that textual data can be stored in different formats. However, to keep things simple we do not cover other formats in detail in the practical sessions.
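As a rough illustration of the first session's topic, the sketch below loads one plain text file and tokenises it. It is not taken from the course material: it uses only the Python standard library rather than NLTK, and the helper names `load_text` and `tokenise`, the sample sentence and the regular expression are all assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path


def load_text(path):
    """Read a plain text document into a single string."""
    return Path(path).read_text(encoding="utf-8")


def tokenise(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "isn't" together;
    # [^\w\s] matches any one character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())


# Hypothetical sample document, written to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Text mining isn't hard. Try it!", encoding="utf-8")
    tokens = tokenise(load_text(sample))

print(tokens)
```

A regular expression tokeniser like this is deliberately simple; library tokenisers handle many more edge cases (abbreviations, hyphenation, URLs) and would be the more likely choice in practice.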
In the second session we show how this is done using much larger sets of text and add in visualisations. We used two data sets as examples: the Medical History of British India (National Library of Scotland, 2019), made available by the National Library of Scotland[4], and the inaugural addresses of all American Presidents from 1789 to 2017. We show how participants can create concordance lists, token frequency distributions in a corpus and over time, and lexical dispersion plots, and how they can perform regular expression searches using Python. In this session we also explain that textual data can be messy and that a lot of time can be spent on cleaning and preparing data in a way that is most useful for further analysis. For example, we point students to stop words and punctuation in the results and explain how to filter them out when creating frequency-based visualisations.
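The second-session ideas of frequency counts, stop word filtering, concordance lists and regular expression searches can be sketched in a few lines. This is a minimal standard-library stand-in, not the course's own code (which builds on NLTK); the tiny stop word set, the example sentence and the function names `content_frequencies` and `concordance` are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption); real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was"}


def tokenise(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def content_frequencies(tokens, stop_words=STOP_WORDS):
    """Count tokens, skipping stop words and punctuation."""
    return Counter(t for t in tokens if t.isalnum() and t not in stop_words)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical two-sentence "corpus" for demonstration only.
corpus = "The fever was severe. The report of the fever in the district was late."
tokens = tokenise(corpus)

freqs = content_frequencies(tokens)
print(freqs.most_common(1))

for line in concordance(tokens, "fever"):
    print(line)

# Regular expression search: every word that starts with "dis".
matches = re.findall(r"\bdis\w*", corpus.lower())
print(matches)
```

Filtering before counting keeps very frequent function words and punctuation from dominating the resulting frequency table, which is the effect the paragraph above describes for frequency-based visualisations.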
During the third session we cover POS-tagging and named entity recognition. This last session concludes with a lesson on visualisations of text and derived data by means of text highlighting, frequency graphs, word clouds and networks (see some examples in Figure 1). The underlying NLP tools used for this course are NLTK 3 and spaCy, which are widely used for NLP research and development. This is also where we put some of the course material in the context of our own research to show how it can be applied in practice in a real project. For example, we mentioned our previous work on collecting topic-specific Twitter datasets for further analysis (Llewellyn et al., 2015), on geoparsing historical and literary text (Clifford et al., 2016; Alex et al., 2019a) and on named entity recognition for radiology reports (Alex et al., 2019b; Gorinski et al., 2019).
In the two pilots, we ran this course over three afternoon sessions on Monday, Wednesday and Friday, with an office hour on the days in between to sort out any potential technical issues and answer questions. The main learning outcome is that by the end of the course the participants will have acquired initial TDM skills which they can use in their own research and build on by taking more advanced NLP courses or tutorials. A main goal of this course is to teach the material in a clear step-by-step way, so all Python code and examples are specific to each task and do not go in depth into complicated programming concepts, which we believe would confuse complete novices.
:::info Authors:
(1) Amador Durán, SCORE Lab, I3US Institute, Universidad de Sevilla, Sevilla, Spain (amador@us.es);
(2) Pablo Fernández, SCORE Lab, I3US Institute, Universidad de Sevilla, Sevilla, Spain (pablofm@us.es);
(3) Beatriz Bernárdez, I3US Institute, Universidad de Sevilla, Sevilla, Spain (beat@us.es);
(4) Nathaniel Weinman, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA (nweinman@berkeley.edu);
(5) Aslı Akalın, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA (asliakalin@berkeley.edu);
(6) Armando Fox, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA (fox@berkeley.edu).
:::
:::info This paper is available on arXiv under CC BY 4.0 DEED license.
:::
[4] https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/
			Pair Programming AI Agent | Sciencx (2025-07-15T14:46:51+00:00) Unlocking Textual Data: A Beginner’s Journey Through Python, NLTK, and spaCy. Retrieved from https://www.scien.cx/2025/07/15/unlocking-textual-data-a-beginners-journey-through-python-nltk-and-spacy/
