Semantic relatedness, or similarity, between documents plays an important role in many textual applications such as Information Retrieval, Document Classification, Question Answering, and more. Text understanding starts with the challenge of finding a machine-understandable representation that captures the semantics of a text.
Measurement of semantic relatedness comprises two constituents:
- An effective representation of documents
- A similarity measure between documents in terms of their respective representations
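As a minimal sketch of these two constituents, the snippet below uses scikit-learn (an illustrative choice; the toy documents and variable names are not from the study) to represent two documents as Tf-Idf vectors and compare them with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy documents; in practice these would be full news articles.
docs = [
    "The central bank raised interest rates to curb inflation.",
    "Interest rates were increased by the central bank to fight rising prices.",
]

# Constituent 1: an effective representation (here, Tf-Idf vectors).
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(docs)

# Constituent 2: a similarity measure over those representations (cosine).
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Tf-Idf cosine similarity: {score:.3f}")
```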
We explore and benchmark the problem of document similarity by experimenting with various existing language models and examining their performance at computing document similarity in order to rank different News Summary Bots. In particular, we chose three semantic models (Word2Vec, Doc2Vec, and Doc2VecC) and one frequency-based model (Tf-Idf) for extracting and representing document features. We rank eight bots in all:
- Luhn
- Edmundson
- Latent Semantic Analysis (LSA)
- LexRank
- TextRank
- SumBasic
- KL-Sum
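The summarizers listed above are all available in the open-source sumy library; the sketch below is an illustrative assumption rather than the pipeline used here (the text does not name a particular implementation), showing how one bot, LexRank, would produce a summary of an input document:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

document_text = "..."  # full text of a news article (placeholder)

# Parse the raw text into sumy's document structure.
parser = PlaintextParser.from_string(document_text, Tokenizer("english"))

# LexRank is one of the listed bots; most of the others follow the same
# summarizer(parser.document, n_sentences) calling pattern.
summarizer = LexRankSummarizer()
summary_sentences = summarizer(parser.document, 5)  # 5-sentence summary
summary = " ".join(str(sentence) for sentence in summary_sentences)
print(summary)
```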
We use documents from the CommonCrawl news dataset as input to the Summary Bots, and all eight summaries are rated according to their semantic relatedness to the input document. For the semantic models, relatedness is computed by converting the input document and each summary into vector embeddings and then taking the cosine similarity of each summary with its parent document.
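For a rough sketch of this ranking step, the snippet below assumes a gensim Doc2Vec model has already been trained on the news corpus; the model path, the summary texts, and all variable names are placeholders, not artifacts of the study:

```python
import numpy as np
from numpy.linalg import norm
from gensim.models.doc2vec import Doc2Vec

# Hypothetical artifacts: a trained Doc2Vec model, one input article, and the
# summaries produced by the bots for that article.
model = Doc2Vec.load("news_doc2vec.model")
article_text = "..."  # full text of the parent document (placeholder)
bot_summaries = {"luhn": "...", "lexrank": "...", "textrank": "..."}  # etc.

def embed(text):
    """Infer an embedding for an unseen document with the trained model."""
    return model.infer_vector(text.lower().split())

def cosine(a, b):
    return float(np.dot(a, b) / (norm(a) * norm(b)))

doc_vec = embed(article_text)
scores = {bot: cosine(doc_vec, embed(text)) for bot, text in bot_summaries.items()}

# Rank bots by the relatedness of their summary to the parent document.
for bot, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{bot}: {score:.3f}")
```

The same ranking logic applies to the other representations: only the `embed` function changes (averaged Word2Vec vectors, Doc2VecC embeddings, or Tf-Idf vectors), while cosine similarity remains the comparison measure.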