The MD-C Summarizer uses extraction-based summarization to produce summaries that best reflect the
article at the given URL.
The first step in summarizing the article is finding the article. The summarizer extracts the
article from the given webpage using text density analysis and HTML class frequency. This method has
proven very effective and rarely lets unwanted data through.
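To give a rough feel for what text density analysis means, here is a minimal sketch. It is not the extractor MD-C uses: it only measures, line by line, how much of the HTML is visible text rather than markup, and it ignores HTML class frequency entirely. The function names and the `min_density` and `min_chars` thresholds are assumptions made for illustration.

```python
import re
from html import unescape

# Matches any HTML tag, e.g. <p>, </div>, <a href="/">.
TAG = re.compile(r"<[^>]+>")

def text_density(line):
    """Fraction of a line's characters that remain once markup is stripped."""
    line = line.strip()
    if not line:
        return 0.0
    visible = TAG.sub("", line).strip()
    return len(visible) / len(line)

def extract_article(html, min_density=0.5, min_chars=25):
    """Keep lines that are mostly text and long enough to be article prose.

    Both thresholds are illustrative guesses, not tuned values.
    """
    kept = []
    for line in html.splitlines():
        visible = unescape(TAG.sub("", line)).strip()
        if len(visible) >= min_chars and text_density(line) >= min_density:
            kept.append(visible)
    return "\n".join(kept)
```

Navigation bars and footers tend to be short and tag-heavy, so they fall below both thresholds, while article paragraphs are long and mostly text.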
The most complicated part of the summarization process is the summarizer itself. The algorithm,
created entirely by MD-C, factors in the common phrases (n-grams, not just individual words) it finds
in the text, entity recurrence, entity reference, length, density, and location within the text, as well
as a few other features, to extract the most important and relevant "modules" of text.
The algorithm starts by extracting common phrases (n-grams) with cutting-edge NLP techniques.
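As a simple illustration of what extracting common phrases can look like, the sketch below counts every two- and three-word sequence and keeps those that occur more than once. This is plain frequency counting, not the technique MD-C uses; the function name, the `n_values` and `min_count` parameters, and the word pattern are all assumptions.

```python
import re
from collections import Counter

def common_ngrams(text, n_values=(2, 3), min_count=2):
    """Return {ngram_tuple: count} for word n-grams seen at least min_count times.

    Simplification: words are lowercased and n-grams may span sentence
    boundaries, which a real extractor would likely avoid.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in n_values:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

For example, in "The transit plan passed. Critics of the transit plan objected." the phrases "the transit", "transit plan", and "the transit plan" each occur twice and would be kept.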
Once the n-grams have been extracted from the text, the article is tokenized into "modules." For most
articles, a module ends up being a sentence, but there are some exceptions, the most frequent being
quotes that contain multiple sentences.
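A minimal sketch of this kind of tokenization, assuming straight double quotes and treating ".", "!", and "?" as sentence terminators, might look like the following. The function name is hypothetical, and the rules are deliberately naive: abbreviations such as "U.S." and curly quotes are not handled.

```python
def split_modules(text):
    """Split text into modules: sentences, with quoted passages kept whole."""
    modules, start, in_quote = [], 0, False
    n = len(text)
    for i, ch in enumerate(text):
        # A boundary can only fall before whitespace or at the end of the text.
        at_boundary = i + 1 == n or text[i + 1].isspace()
        if ch == '"':
            in_quote = not in_quote
            # A closing quote directly after a terminator ends the module.
            ends_here = (not in_quote) and i > 0 and text[i - 1] in ".!?" and at_boundary
        else:
            # Terminators inside a quote do not end the module.
            ends_here = ch in ".!?" and not in_quote and at_boundary
        if ends_here:
            modules.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        modules.append(tail)
    return modules
```

With this rule, a quote such as "It was close. We are relieved," the mayor said. stays in one module even though it contains two sentences.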
Calculating Module Relevance
Once the article has been tokenized into modules, a relevance score is calculated for each module. The
algorithm factors in how many n-grams the module contains, the frequency of those n-grams, how many
entities are named within the module, how frequently those entities are referred to, the location of the
module within the text, the length of the module, as well as other "secret" factors.
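To make the idea concrete, here is a hedged sketch that combines four of the listed factors into a single weighted sum. The actual formula and the "secret" factors are not published, so everything here is an assumption: the function name, the weights, the choice to favor earlier modules, the 30-word length cap, and the naive substring match used for entities.

```python
import re

def score_module(module, index, total, ngram_counts, entity_counts,
                 weights=(1.0, 1.0, 0.5, 0.25)):
    """Illustrative relevance score for one module.

    ngram_counts:  {tuple_of_words: document_frequency}
    entity_counts: {lowercase_entity_name: document_frequency}
    weights:       (n-gram, entity, position, length), all hypothetical
    """
    w_ngram, w_entity, w_position, w_length = weights
    lowered = module.lower()
    words = re.findall(r"[a-z0-9']+", lowered)

    # N-gram term: add the document frequency of each common n-gram present.
    ngram_term = 0
    for gram, count in ngram_counts.items():
        n = len(gram)
        if any(tuple(words[i:i + n]) == gram for i in range(len(words) - n + 1)):
            ngram_term += count

    # Entity term: add the document frequency of each entity mentioned.
    entity_term = sum(c for name, c in entity_counts.items() if name in lowered)

    # Position term (assumption): earlier modules score higher.
    position_term = 1.0 - index / total

    # Length term (assumption): grows with word count, capped at 30 words.
    length_term = min(len(words), 30) / 30

    return (w_ngram * ngram_term + w_entity * entity_term
            + w_position * position_term + w_length * length_term)
```

A module that repeats the article's common phrases and names its recurring entities ends up with a much higher score than an off-topic sentence of similar length.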
Finding What's Important
Once a relevance score has been calculated for each module, the average of all the modules' scores is
found. Then, a cutoff threshold is calculated by multiplying the average module score by a user-given
factor. Every module whose relevance score is above the cutoff threshold is then returned to the user as
the summary.
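The selection step can be sketched in a few lines. The function name and the parameter name `factor` for the user-given value are assumptions; the logic follows the description above: average the scores, scale the average, and keep every module that scores above the result.

```python
def select_summary(modules, scores, factor):
    """Return the modules whose score is above average_score * factor."""
    if not modules:
        return []
    average = sum(scores) / len(scores)
    cutoff = average * factor
    # Modules are returned in their original order in the article.
    return [m for m, s in zip(modules, scores) if s > cutoff]
```

With scores of 1.0, 2.0, and 6.0, the average is 3.0. A factor of 1.0 keeps only the third module, while a factor of 0.5 lowers the cutoff to 1.5 and keeps the second and third, so a smaller value yields a longer summary.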
A lot of research has gone into the best way to read text on the web. Most studies have found that
text on a computer is best read in small chunks, broken into lines roughly 75 characters long. So, we
designed our summarizer interface with that in mind.
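As a small illustration of that layout, the sketch below wraps each module of a summary to lines of at most 75 characters using Python's standard `textwrap` module. It is not how our interface is built; the function name and the blank line between modules are assumptions.

```python
import textwrap

def wrap_summary(modules, width=75):
    """Wrap each module to lines of at most `width` characters.

    Modules are separated by a blank line so each reads as a small chunk.
    """
    return "\n\n".join(textwrap.fill(m, width=width) for m in modules)
```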