Margaret Swift RMTP Page
— site design by Margaret Swift

Auto-Tagging Project

The Russian Movie Theater project is an effort by Russian professors and students at William & Mary to record an oral history of Russian movie-going in St. Petersburg. Our primary work involves student researchers (myself included) collecting interviews with native Petersburgers while studying abroad in Russia. Interviews are conducted in Russian and recorded on video camera. Once back in the States, a team of students and faculty transcribe and translate these interviews, tag them in XML, and use textual data (word frequency, mentions of names, places, movies, etc.) to analyze these documents.

In working on this project, I realized that the most tedious portion, tagging the interviews with XML tags, could be done by a computer, freeing hours of time previously spent on this endeavor. My partner, John Hoskins, and I have created an application for our research group that can be used to automatically tag common words, while allowing researchers the flexibility to change which tags are attributed to which words. This webpage lays out our application, and explains how we built it and how it is to be used.

Goals and Tools

Our goal was to create an interactive application to efficiently and smoothly aid researchers in tagging Russian interviews in proper XML fashion. To this end, we had five main functionality goals:

(1) Automatically tag and hide tags for pronouns.

(2) Automatically generate best-guess tags for non-pronouns in the tagging dictionary.

(3) Allow the user to choose the best tag for each selected word.

(4) Provide information about each tag to the user.

(5) Allow the user to customize the application.

These goals gave us a framework in which we constructed our application. With guidance from our mentors Sasha and Elena Prokhorov, we were able to realize these goals in a simple and understandable application.
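As a rough illustration of goal (1), the sketch below wraps every pronoun it recognizes in an XML tag. It is not the application's actual code: the `PRONOUNS` set, the `<pronoun>` tag name, and the `tag_pronouns` function are all hypothetical, chosen only to show the idea.

```python
import re

# Hypothetical, deliberately incomplete list of Russian pronouns.
PRONOUNS = {"я", "ты", "он", "она", "мы", "вы", "они"}

def tag_pronouns(text):
    """Wrap each recognized pronoun in a (hypothetical) <pronoun> tag."""
    def wrap(match):
        word = match.group(0)
        # Compare in lowercase so sentence-initial pronouns are caught too.
        if word.lower() in PRONOUNS:
            return f"<pronoun>{word}</pronoun>"
        return word
    # \w+ matches Cyrillic words as well, since Python 3 patterns are Unicode-aware.
    return re.sub(r"\w+", wrap, text)

tagged = tag_pronouns("Я видел, как она пришла.")
print(tagged)
```

Because pronouns are unambiguous, a pass like this needs no input from the researcher, which is why their tags can be hidden from view.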

We wrote this project in Python, as it is a widespread, open-source language, and our mentors were already familiar with its syntax. Within Python, we used Tkinter extensively to build the application's interface.

Obstacles

One of the biggest problems of this project was the Russian language itself. It is easy enough in English to tell the computer to wrap the word "Moscow" in tags every time it appears in the text, but Russian is a bit more complex. Russian has six cases, meaning that depending on its place in a sentence, a noun or an adjective will have different endings. For example, "Moscow" takes five distinct forms across the six cases: "Москва", "Москвы", "Москве", "Москву", "Москвой". Further, because the case endings differ for masculine, feminine, neuter, and plural nouns, adjectives can have over a dozen different forms. This makes textual frequency analysis difficult in Russian, because a computer sees the same word in two cases, like "Москва" and "Москву", as two separate words, when in reality they are the same.
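To make the counting problem concrete, here is a minimal sketch of one way to group inflected forms under a single dictionary form before counting. The `LEMMAS` table and `count_lemmas` function are hypothetical and cover only the forms of "Москва"; a real index would need an entry for every form of every keyword.

```python
from collections import Counter
import re

# Hypothetical lookup table: each inflected form maps to its dictionary form.
LEMMAS = {
    "москва": "москва",
    "москвы": "москва",
    "москве": "москва",
    "москву": "москва",
    "москвой": "москва",
}

def count_lemmas(text):
    """Count words after mapping known inflected forms to one lemma."""
    words = re.findall(r"\w+", text.lower())
    # Words missing from the table are counted under their own spelling.
    return Counter(LEMMAS.get(word, word) for word in words)

sample = "Я люблю Москву. Москва большая. Я живу в Москве."
counts = count_lemmas(sample)
print(counts["москва"])  # 3
```

Without the lookup table, the same sample would yield three separate entries with a count of one each.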

Functionality

The app opens with a welcome message prompting the user to load an interview file. The main bin is then filled with the text of the selected interview.

Once the text is loaded, the user can click through the interview (or navigate with the arrow keys) to select the highlighted keywords that the program has flagged as important. The program flags these keywords by referencing a separate XML file of keywords and their proper tags, which can be edited independently of the application.
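The layout of the keyword file is not shown on this page, so the following is only a sketch of how such a file could be read with Python's standard `xml.etree.ElementTree` module. The element names (`keywords`, `keyword`, `tag`), the `text` attribute, the sample entries, and the `load_keywords` function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword file; the project's real schema may differ.
KEYWORDS_XML = """
<keywords>
  <keyword text="Москва">
    <tag>placeName</tag>
    <tag>theaterName</tag>
  </keyword>
  <keyword text="Аврора">
    <tag>theaterName</tag>
  </keyword>
</keywords>
"""

def load_keywords(xml_text):
    """Build a dict mapping each lowercase keyword to its candidate tags."""
    root = ET.fromstring(xml_text.strip())
    index = {}
    for keyword in root.findall("keyword"):
        tags = [tag.text for tag in keyword.findall("tag")]
        index[keyword.get("text").lower()] = tags
    return index

index = load_keywords(KEYWORDS_XML)
print(index["москва"])  # ['placeName', 'theaterName']
```

Keeping the keywords in a separate file like this means researchers can add or correct entries without touching the program itself.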

The left sidebar holds a legend corresponding to the various colors seen on the page. In this example, a word or phrase highlighted in purple has only one recognized tag associated with it, while a phrase in yellow has multiple possible correct tags. Think of a movie theater named "Moscow" versus the city of "Moscow"; if both of these words are in the index, the program cannot distinguish which one the interviewee is referencing. It is up to the researchers to decide which is the proper tag for the selected word.
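The color rule described above comes down to counting how many tags the index lists for a word. A minimal sketch, assuming the index is a plain dictionary from keyword to a list of candidate tags (the `INDEX` contents and the `highlight_color` function are hypothetical):

```python
# Hypothetical index: each keyword maps to the tags it could take.
INDEX = {
    "москва": ["placeName", "theaterName"],  # city or movie theater
    "аврора": ["theaterName"],
}

def highlight_color(word, index):
    """Return the legend color for a word, or None if it is not a keyword."""
    candidates = index.get(word.lower(), [])
    if not candidates:
        return None          # not in the index: no highlight
    if len(candidates) == 1:
        return "purple"      # exactly one recognized tag
    return "yellow"          # several possible tags: researcher must choose

print(highlight_color("Москва", INDEX))  # yellow
print(highlight_color("Аврора", INDEX))  # purple
```

The program only reports the ambiguity; choosing between the candidate tags remains the researcher's decision.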

Researchers are also able to change the theme of our application. Themes were inspired by our mentors and others within the Russian department.

Nuts and Bolts

How it actually works.

Auto-Tagging

Manual Tagging

Referencing

Thanks for visiting!