1. Introduction
Text retrieval is the task of matching a user-generated query against a set of text records. These records can be of any type: unstructured text, documentation, textual reports, newspaper articles, paragraphs in a manual, etc. User queries can range from full multi-sentence descriptions to just a few words. Text retrieval is a branch of information retrieval. Search engines such as Google and Bing are examples of such systems.
2. Prerequisite
Python and the scikit-learn and SciPy packages are required to follow along with this article, as well as a corpus of text documents. The code can be adapted to work with any other set of documents we collect.
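Before going further, it can help to confirm that the required packages are importable. The snippet below is a minimal sketch: `check_packages` is a hypothetical helper introduced only for illustration, and it assumes the packages are importable under the module names `sklearn` and `scipy`.

```python
import importlib


def check_packages(names=("sklearn", "scipy")):
    """Map each module name to its version string, or None if it cannot be imported.

    Hypothetical helper for illustration; not part of any library.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # package is missing
        else:
            # Fall back to "unknown" if the module exposes no __version__
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in check_packages().items():
        print(name, "->", version if version is not None else "NOT INSTALLED")
```

Any package reported as missing can usually be installed with pip (for example, `pip install scikit-learn scipy`).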
3. Dataset
We will use the well-known Reuters-21578 dataset. It includes 12,902 documents in 90 classes, with a fixed split between training and test data (9,603 and 3,299 documents, respectively). The dataset can be downloaded from http://disi.unitn.it/moschitti/corpora.htm, under the “Reuters21578-Apte-90Cat” entry.
Once we have downloaded and unzipped the dataset, we can take a look inside the folder. It is split into two folders, “training” and “test”. Each of those contains 91 subfolders, corresponding to pre-labeled categories, which will be useful for us later when we want to try classifying the category of an unknown message. In this article, we are not worried about training a classifier, so we’ll end up using both sets together.
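The folder layout described above (split folder, then category subfolder, then one file per message) can be walked with a few lines of Python. This is a sketch under assumptions, not the only way to load the data: `load_corpus` and `count_by_category` are hypothetical helper names, the root folder name depends on where the archive was unzipped, and the `latin-1` encoding is assumed because it decodes any byte sequence without raising an error.

```python
from collections import Counter
from pathlib import Path


def load_corpus(root, splits=("training", "test"), encoding="latin-1"):
    """Return a list of (split, category, filename, text) tuples.

    Assumes the layout <root>/<split>/<category>/<file>.
    Split folders that do not exist are skipped silently.
    """
    docs = []
    for split in splits:
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            continue
        # Each subfolder of a split is treated as one category
        for category_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            for doc_path in sorted(p for p in category_dir.iterdir() if p.is_file()):
                text = doc_path.read_text(encoding=encoding)
                docs.append((split, category_dir.name, doc_path.name, text))
    return docs


def count_by_category(docs):
    """Count how many loaded documents fall under each category folder."""
    return Counter(category for _, category, _, _ in docs)
```

For example, `docs = load_corpus("Reuters21578-Apte-90Cat")` followed by `count_by_category(docs).most_common(5)` would list the five largest categories across both splits, assuming the archive was unzipped into a folder with that name. Note that a message filed under several categories appears once per category folder in this list.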
4. Methodology
In this section we will walk through data exploration, data pre-processing, text vectorization, and text analysis.
4.1 Data Exploration
Let’s open up a single message and look at the contents. This is the very first message in the training folder, inside of the “acq” folder, which is a category apparently containing news of corporate acquisitions.
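One way to open a single message from a category folder is sketched here. `read_first_message` is a hypothetical helper, "first" is taken to mean first in sorted filename order (which may differ from the order shown by a file browser), and the `latin-1` encoding is an assumption chosen because it never fails to decode.

```python
from pathlib import Path


def read_first_message(category_dir, encoding="latin-1"):
    """Return (filename, text) for the first file in a category folder.

    Files are ordered by sorted filename; subfolders are ignored.
    """
    files = sorted(p for p in Path(category_dir).iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"no files found in {category_dir}")
    first = files[0]
    return first.name, first.read_text(encoding=encoding)


# Example usage (the path is an assumption about where the archive was unzipped):
# name, text = read_first_message("Reuters21578-Apte-90Cat/training/acq")
# print(name)
# print(text)
```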