LSI EXAMPLE - INDEXING A DOCUMENT
Putting Theory into Practice
While a discussion of the mathematics behind singular value decomposition is beyond the scope of our article, it's worthwhile to follow the process of creating a term-document matrix in some detail, to get a feel for what goes on behind the scenes. Here we will process a sample wire story to demonstrate how real-life texts get converted into the numerical representation we use as input for our SVD algorithm.
The first step in the chain is obtaining a set of documents in electronic form. This can be the hardest thing about LSI - there are all too many interesting collections not yet available online. In our experimental database, we download wire stories from an online newspaper with an AP news feed. A script downloads each day's news stories to a local disk, where they are stored as text files.
Let's imagine we have downloaded the following sample wire story, and want to incorporate it in our collection:
O'Neill Criticizes Europe on Grants|
Treasury Secretary Paul O'Neill expressed
irritation Wednesday that European countries have refused to go along
with a U.S. proposal to boost the amount of direct grants rich nations
offer poor countries.
The first thing we do is strip all formatting from the article, including capitalization, punctuation, and extraneous markup (like the dateline). LSI pays no attention to word order, formatting, or capitalization, so can safely discard that information. Our cleaned-up wire story looks like this:
o'neill criticizes europe on grants treasury secretary paul o'neill expressed irritation wednesday that european countries have refused to go along with a us proposal to boost the amount of direct grants rich nations offer poor countries the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations
The next thing we want to do is pick out the content words in our
article. These are the words we consider semantically significant -
everything else is clutter. We do this by applying a stop list
of commonly used English words that don't carry semantic meaning. Using
a stop list greatly reduces the amount of noise in our collection, as
well as eliminating a large number of words that would make the
computation more difficult. Creating a stop list is something of an art
- they depend very much on the nature of the data collection. You can
see our full wire stories stop list here.
Here is our sample story with stop-list words highlighted:
o'neill criticizes europe on grants treasury secretary paul o'neill expressed irritation wednesday that european countries have refused to go along with a US proposal to boost the amount of direct grants rich nations offer poor countries the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations
Removing these stop words leaves us with an abbreviated version of the article containing content words only:
o'neill criticizes europe grants treasury secretary paul o'neill expressed irritation european countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations
However, one more important step remains before our document is ready for indexing. Notice how many of our content words are plural nouns (grants, nations) and inflected verbs (pushing, refused). It doesn't seem very useful to have each inflected form of a content word be listed seperately in our master word list - with all the possible variants, the list would soon grow unwieldy. More troubling is that LSI might not recognize that the different variant forms were actually the same word in disguise. We solve this problem by using a stemmer.
While LSI itself knows nothing about language (we saw how it deals exclusively with a mathematical vector space), some of the preparatory work needed to get documents ready for indexing is very language-specific. We have already seen the need for a stop list, which will vary entirely from language to language and to a lesser extent from document collection to document collection. Stemming is similarly language-specific, derived from the morphology of the language. For English documents, we use an algorithm called the Porter stemmer to remove common endings from words, leaving behind an invariant root form. Here are some examples of words before and after stemming:
information -> inform presidency -> presid presiding -> presid happiness -> happi happily -> happi discouragement -> discourag battles -> battl
And here is our sample story as it appears to the stemmer:
o'neill criticizes europe grants
treasury secretary paul o'neill expressed irritation european countries refused US proposal boost direct grants rich nations poor countries
Note that at this point we have reduced the original natural-language news story to a series of word stems. All of the information carried by punctuation, grammar, and style is gone - all that remains is word order, and we will be doing away with even that by transforming our text into a word list. It is striking that so much of the meaning of text passages inheres in the number and choice of content words, and relatively little in the way they are arranged. This is very counterintuitive, considering how important grammar and writing style are to human perceptions of writing.
Having stripped, pruned, and stemmed our text, we are left with a flat list of words:
administrat amount assist bank boost bush countri (2) direct europ express grant (2) increas irritat loan nation (3) o'neill paul plan poor (2) propos push refus rich secretar treasuri US world
This is the information we will use to generate our term-document matrix, along with a similar word list for every document in our collection.
This work is licensed under a Creative Commons License. 2002 National Institute for Technology in Liberal Education. For more info, contact the author.
Gain a Competitive Advantage Today
Your top competitors have been investing into their marketing strategy for years.
Now you can know exactly where they rank, pick off their best keywords, and track new opportunities as they emerge.
Explore the ranking profile of your competitors in Google and Bing today using SEMrush.
Enter a competing URL below to quickly gain access to their organic & paid search performance history - for free.
See where they rank & beat them!