Harnessing Folksonomies for Resource Classification

PhD Thesis by Arkaitz Zubiaga

Advisors: Víctor Fresno and Raquel Martínez

Reviewing Committee

I successfully defended my PhD thesis on July 12th, 2011 in Madrid. The reviewing committee included the following members:

Manuel Palomar (Universitat d'Alacant), president
Paul D. Clough (University of Sheffield)
Lourdes Araujo (UNED)
Pablo Castells (Universidad Autónoma de Madrid)
Julio Gonzalo (UNED), secretary

In April 2012, my PhD thesis was granted an extraordinary PhD award by the university.

Download

The slides I used for the defense, as well as the dissertation, can be downloaded from the following links:

Abstract of the thesis

In our daily lives, organizing resources into a set of categories is a common task. Categorization makes searching through those resources easier by limiting the focus to a specific category, which significantly reduces the amount of information one must search. It becomes more useful as a collection grows, since resources become increasingly difficult to manage if they are not organized appropriately. Large collections of books, movies, and web pages, for instance, are usually cataloged in libraries, organized in databases, and classified in directories, respectively. However, the sheer size of these collections makes organizing them manually a vast and very costly endeavor.

Recent research is moving towards developing automated classifiers that reduce the increasing cost and effort of the task. Most of the research in this field has focused on self-content, where the publisher is the only author, as a data source to discover the aboutness of the resource. Self-content presents the problem that it is not always representative enough, and depending on the type of resource it is sometimes difficult to access. Little work has been done on analyzing the appropriateness of the annotations provided by users on social tagging systems as a data source, or on exploring how to harness them. Users on these systems save resources as bookmarks in a social environment by attaching annotations in the form of tags. It has been shown that these tags facilitate retrieval of resources not only for the annotators themselves but also for the whole community. Likewise, these tags provide meaningful metadata that refers to the content of the resources.

In this thesis, we deal with the use of these user-provided tags in search of the most accurate classification of resources as compared to expert-driven categorizations. After performing a set of experiments to choose a suitable classifier for this kind of task, we explore social annotations looking for the best way to use them. For this purpose, we have created three large-scale datasets including tagging data for resources from well-known social tagging systems: Delicious, LibraryThing, and GoodReads. Those resources are accompanied by categorization data from sound and consolidated expert-driven taxonomies. With these datasets, the appropriateness of social tags for predicting categories can be evaluated.

Specifically, we first study several ways of representing the massive number of social tags by amalgamating the contributions of large communities of users. We analyze their suitability for the classification task, for both broader top-level categories and narrower deep-level categories. Then, we explore the nature, characteristics, and distributions of tags in folksonomies, in order to determine how the settings of each system affect tagging behavior and the usefulness of tags for the classification task. We go deeper into tag distributions by analyzing the usefulness of weighting schemes based on inverse frequency values. Finally, using state-of-the-art user behavior detection processes, we identify users on social tagging systems who better fit the classification task.
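As a rough illustration of what an inverse-frequency weighting of aggregated tags can look like, the sketch below merges the tags of several users per resource and then scales each tag by an inverse resource frequency, by analogy with TF-IDF. It is not the implementation used in the thesis: the function names, the choice of counting each user once per tag, the logarithmic formula, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def aggregate_tags(bookmarks):
    """Merge per-user tag lists for one resource into tag -> number of users."""
    counts = Counter()
    for user_tags in bookmarks:
        counts.update(set(user_tags))  # count each user at most once per tag
    return counts


def inverse_resource_frequency(tag_counts_by_resource):
    """Assumed scheme: IRF(tag) = log(N / resources annotated with the tag)."""
    n_resources = len(tag_counts_by_resource)
    resource_freq = Counter()
    for counts in tag_counts_by_resource.values():
        resource_freq.update(counts.keys())
    return {tag: math.log(n_resources / rf) for tag, rf in resource_freq.items()}


def weight_tags(tag_counts_by_resource):
    """Scale each tag's user count by its inverse resource frequency."""
    irf = inverse_resource_frequency(tag_counts_by_resource)
    return {
        resource: {tag: count * irf[tag] for tag, count in counts.items()}
        for resource, counts in tag_counts_by_resource.items()
    }


# Toy data invented for illustration: resource -> one tag list per user.
bookmarks = {
    "r1": [["python", "programming"], ["python", "tutorial"]],
    "r2": [["cooking", "recipes"], ["cooking"]],
    "r3": [["python", "cooking"]],
}

counts = {resource: aggregate_tags(b) for resource, b in bookmarks.items()}
weights = weight_tags(counts)

for resource, tag_weights in weights.items():
    print(resource, {tag: round(w, 3) for tag, w in tag_weights.items()})
```

In this toy setting, a tag that appears on few resources, such as "programming", receives a higher per-occurrence weight than a tag spread across many resources, such as "python". The resulting weighted vectors could then be fed to a classifier of one's choice.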

To the best of our knowledge, this is the first research work performing actual classification experiments using social tags. By exploring the characteristics and nature of these systems and the underlying folksonomies, this thesis sheds new light on how to get the most out of social tags for automated resource classification. We therefore believe that the contributions of this work are of great interest to future researchers in the field, and to the wider scientific community, for better understanding these systems and further exploiting the knowledge garnered from social tags.

Harnessing Folksonomies for Resource Classification by Arkaitz Zubiaga is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.