
Topic Modelling - Unsupervised ML Technique

Topic modeling is a machine learning technique that automatically analyzes text data to discover clusters of words, or topics, in a set of documents. This is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.


Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.


Since topic classification models require training, they’re known as ‘supervised’ machine learning techniques. What does that mean? Well, as opposed to topic modeling, topic classification needs to know the topics of a set of texts before analyzing them. Using these topics, data is tagged manually so that a topic classifier can learn and later make predictions by itself.



Topic modeling is an ‘unsupervised’ machine learning technique, in other words, one that doesn’t require training.


Topic classification is a ‘supervised’ machine learning technique, one that needs training before being able to automatically analyze texts.

The most widely used topic modeling techniques are:

  • Latent Semantic Analysis (LSA)

  • Probabilistic Latent Semantic Analysis (pLSA)

  • Latent Dirichlet Allocation (LDA)

  • Correlated Topic Model (CTM)


Latent Semantic Analysis (LSA)

LSA is a mathematical method for computer modeling and simulation of the meaning of words and passages by analysis of representative corpora of natural text. LSA closely approximates many aspects of human language learning and understanding. It supports a variety of applications in information retrieval, educational technology and other pattern recognition problems where complex wholes can be treated as additive functions of component parts.

Latent Semantic Analysis (also called LSI, for Latent Semantic Indexing) models the contribution to natural language attributable to combination of words into coherent passages. It uses a long-known matrix-algebra method, Singular Value Decomposition (SVD), which became practical for application to such complex phenomena only after the advent of powerful digital computing machines and algorithms to exploit them in the late 1980s. To construct a semantic space for a language, LSA first casts a large representative text corpus into a rectangular matrix of words by coherent passages, each cell containing a transform of the number of times that a given word appears in a given passage. The matrix is then decomposed in such a way that every passage is represented as a vector whose value is the sum of vectors standing for its component words. Similarities between words and words, passages and words, and of passages to passages are then computed as dot products, cosines or other vector-algebraic metrics. (For word and passage in the above, any objects that can be considered parts that add to form a larger object may be substituted.)
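
To make this concrete, below is a minimal sketch of the pipeline just described, assuming a tiny invented corpus: cast the passages into a word-by-passage count matrix, decompose it with SVD, and compare passages with cosine similarity in the reduced space. The corpus, variable names and the choice of two latent dimensions are illustrative assumptions, not part of the method itself.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: four short "passages" (two about pets, two about markets).
passages = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
]

# Step 1: cast the corpus into a rectangular word-by-passage matrix of counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(passages).toarray()   # passages x words
A = counts.T                                            # words x passages

# Step 2: decompose the matrix with SVD and keep the top-k singular dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
passage_vectors = (np.diag(s[:k]) @ Vt[:k]).T           # passages x k

# Step 3: compare passages (or words) with cosine similarity.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(passage_vectors[0], passage_vectors[1]))   # related passages: typically high
print(cosine(passage_vectors[0], passage_vectors[2]))   # unrelated passages: typically lower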

Significance of Term Frequency / Inverse Document Frequency (TF-IDF) in LSA

  • Term Frequency (TF) is defined as the number of times a term appears in a single document divided by the total number of words in that document.

Because document lengths differ, dividing by the total word count normalizes the raw count so that term frequencies are comparable across documents.

  • Inverse Document Frequency (IDF) signifies how important a term is within the collection of documents. IDF gives higher weight to terms that are rare across the collection. The formula of IDF is

IDF(t) = log(N / df(t))

where N is the total number of documents in the collection and df(t) is the number of documents that contain the term t. A small worked example follows below.
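
The sketch below computes TF and IDF directly from these definitions on a toy corpus; the documents and variable names are illustrative assumptions only.

import math

documents = [
    "data science uses data",
    "machine learning learns from data",
    "topic models cluster documents",
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)                    # total number of documents

def tf(term, doc_tokens):
    # times the term appears in the document / total words in that document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # log(total documents / documents containing the term)
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df)

print(tf("data", tokenized[0]))   # 2 / 4 = 0.5
print(idf("data"))                # log(3 / 2): common term, lower weight
print(idf("cluster"))             # log(3 / 1): rare term, higher weight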



Applications of LSA

LSA is considered the forerunner of Latent Semantic Indexing (LSI) and of many dimensionality reduction algorithms.

  1. LSA can be used for dimensionality reduction. We can reduce the vector size drastically, from millions to thousands, with very little loss of context or information. As a result, it reduces the computation power and the time taken to perform the computation.

  2. LSA can be used in search engines. Latent Semantic Indexing (LSI) is an algorithm developed on top of LSA: documents matching a given search query are found with the help of the vectors produced by LSA.

  3. LSA can also be used for document clustering. Because LSA assigns a topic distribution to each document, documents can be clustered by topic; a short sketch of applications 1 and 3 follows this list.
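
Here is a minimal sketch of applications 1 and 3, assuming a toy corpus: TF-IDF vectors are reduced with truncated SVD (the usual scikit-learn form of LSA) and the documents are then clustered in the reduced space. The corpus, the two components and the two clusters are illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the striker scored a late goal in the match",
    "the team won the league after a tense final game",
    "the central bank raised interest rates again",
    "inflation and interest rates worry investors",
]

# Dimensionality reduction: sparse TF-IDF vectors -> dense low-rank LSA vectors.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_reduced = lsa.fit_transform(X)                    # documents x 2

# Document clustering in the reduced semantic space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)                                       # e.g. [0 0 1 1]: sport vs finance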

Advantages of LSA

  1. It is efficient and easy to implement.

  2. It also gives decent results that are much better than the plain vector space model.

  3. It is faster compared to other available topic modeling algorithms, as it involves document-term matrix decomposition only.


Disadvantages of LSA

  1. Since it is a linear model, it might not do well on datasets with non-linear dependencies.

  2. LSA assumes a Gaussian distribution of the terms in the documents, which may not be true for all problems.

  3. LSA involves SVD, which is computationally intensive and hard to update as new data comes up.

  4. Lack of interpretable embeddings (we don’t know what the topics are, and the components may be arbitrarily positive/negative).

  5. Need for a really large set of documents and vocabulary to get accurate results.

  6. It provides a less efficient representation compared to other topic modeling techniques.

Probabilistic Latent Semantic Analysis (pLSA)

Probabilistic Latent Semantic Analysis (pLSA) uses a probabilistic method instead of SVD to tackle the problem. The core idea is to find a probabilistic model with latent topics that can generate the data we observe in our document-term matrix. In particular, we want a model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.

Recall the basic assumption of topic models: each document consists of a mixture of topics, and each topic consists of a collection of words. pLSA adds a probabilistic spin to these assumptions:

  • given a document d, topic z is present in that document with probability P(z|d)

  • given a topic z, word w is drawn from z with probability P(w|z)


Formally, the joint probability of seeing a given document and word together is:

P(d, w) = P(d) Σ_z P(z|d) P(w|z)

where the sum runs over all topics z. Intuitively, the right-hand side of this equation tells us how likely it is to see some document, and then, based upon the distribution of topics of that document, how likely it is to find a certain word within that document.


In this case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be determined directly from our corpus. P(Z|D) and P(W|Z) are modeled as multinomial distributions, and can be trained using the expectation-maximization algorithm (EM). Without going into a full mathematical treatment of the algorithm, EM is a method of finding the likeliest parameter estimates for a model which depends on unobserved, latent variables (in our case, the topics).


Interestingly, P(D, W) can be equivalently parameterized using a different set of three parameters:

P(d, w) = Σ_z P(z) P(d|z) P(w|z)

We can understand this equivalency by looking at the model as a generative process. In our first parameterization, we were starting with the document with P(d), and then generating the topic with P(z|d), and then generating the word with P(w|z). In this parameterization, we are starting with the topic with P(z), and then independently generating the document with P(d|z) and the word with P(w|z).


The reason this new parameterization is so interesting is that we can see a direct parallel between our pLSA model and our LSA model. Writing the document-term matrix factorization as A ≈ U S Vᵀ, the probability of our topic P(Z) corresponds to the diagonal matrix S of singular values, the probability of our document given the topic P(D|Z) corresponds to our document-topic matrix U, and the probability of our word given the topic P(W|Z) corresponds to our term-topic matrix V.


So what does that tell us? Although it looks quite different and approaches the problem in a very different way, pLSA really just adds a probabilistic treatment of topics and words on top of LSA. It is a far more flexible model, but still has a few problems. In particular:

  • Because we have no parameters to model P(D), we don’t know how to assign probabilities to new documents

  • The number of parameters for pLSA grows linearly with the number of documents we have, so it is prone to overfitting

We will not look at any code for pLSA because it is rarely used on its own. In general, when people are looking for a topic model beyond the baseline performance LSA gives, they turn to LDA. LDA, the most common type of topic model, extends pLSA to address these issues.


Advantages:

  • Models word-document co-occurrences as a mixture of conditionally independent multinomial distributions

  • A mixture model, not a clustering model

  • Results have a clear probabilistic interpretation

  • Allows for model combination

  • Problem of polysemy is better addressed


Disadvantages:

  • Potentially higher computational complexity

  • EM algorithm gives local maximum

  • Prone to overfitting. Solution: Tempered EM

  • Not a well-defined generative model for new documents. Solution: Latent Dirichlet Allocation



Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation is one of the most popular methods for performing topic modeling. Each document consists of various words, and each topic can be associated with some words. The aim of LDA is to find the topics a document belongs to, based on the words it contains. It assumes that documents with similar topics use a similar group of words. Each document is thus mapped to a probability distribution over latent topics, and each topic is itself a probability distribution over words.

The number of topics k is fixed in advance, and it determines how many topic probabilities are assigned to each document; these probabilities sum to 1. For example:

0.6 + 0.4 = 1 (k=2)

0.3 + 0.5 + 0.2 = 1 (k=3)

0.4 + 0.2 + 0.3 + 0.1 = 1 (k=4)

The LDA makes two key assumptions:

  1. Documents are a mixture of topics, and

  2. Topics are a mixture of tokens (or words)

These topics then generate words according to their probability distributions. In statistical language, documents are probability distributions over topics, and topics are probability distributions over words.
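
As a minimal illustration of these two assumptions, the sketch below fits LDA with scikit-learn on a toy corpus and prints each document's topic mixture and each topic's top words. The corpus, k = 2 and the other parameter choices are illustrative assumptions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the patient was given a new drug to lower blood pressure",
    "doctors tested the drug in a clinical trial",
    "the team scored twice and won the football match",
    "fans celebrated the match victory in the stadium",
]

# LDA works on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)      # documents x k; each row sums to 1

# Documents as mixtures of topics.
print(doc_topic.round(2))

# Topics as distributions over words: show the top terms per topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {i}:", [terms[j] for j in top])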


Applications of LDA

  1. Scalable implementations in tools such as Gensim, Vowpal Wabbit (VW) and MALLET give good results even on massive datasets.

  2. Finding patterns that relate or distinguish documents; in general, it helps with pattern recognition between two documents.

  3. Much of the research in topic modeling builds on the Dirichlet distribution, which has also helped in developing new algorithms.

  4. Its applications also include network analysis, such as network pattern analysis and assortative network mixing analysis.


Correlated Topic Model (CTM)

The Correlated Topic Model (CTM) is a hierarchical model that explicitly models the correlation of latent topics, allowing for a deeper understanding of relationships among topics (Blei and Lafferty 2007). It was created as an extension of LDA, which allows topic proportions to be correlated via the logistic normal distribution (Blei and Lafferty 2007). The CTM extends the LDA model by relaxing the independence assumption of LDA. As in the LDA model, CTM is a mixture model and documents belong to a mixture of topics. CTM uses the same methodological approach as LDA, but it creates a more flexible modeling approach than LDA by replacing the Dirichlet distribution with a logistic normal distribution and explicitly incorporating a covariance structure among topics (Blei and Lafferty 2007). While this method creates a more computationally expensive topic modeling approach, it allows for more realistic modeling by allowing topics to be correlated. Additionally, Blei and Lafferty (2007) show that the CTM model outperforms LDA.
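
The sketch below is not a full CTM implementation; it only illustrates, under assumed parameter values, the key modeling change CTM makes: drawing topic proportions from a logistic normal distribution (which can encode correlations between topics) instead of a Dirichlet (which cannot).

import numpy as np

rng = np.random.default_rng(0)
k = 3  # number of topics

# LDA-style draw: topic proportions from a Dirichlet; components are uncorrelated.
theta_lda = rng.dirichlet(alpha=np.ones(k))

# CTM-style draw: sample from a multivariate normal with an explicit covariance
# structure, then map to the simplex with a softmax (the logistic normal construction).
mu = np.zeros(k)
sigma = np.array([[1.0, 0.8, 0.0],    # topics 0 and 1 positively correlated
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
eta = rng.multivariate_normal(mu, sigma)
theta_ctm = np.exp(eta) / np.exp(eta).sum()

print(theta_lda)   # sums to 1; no correlation structure between topics
print(theta_ctm)   # sums to 1; weight on topic 0 tends to move with topic 1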



Resources: KDnuggets, Wikipedia, Analytics Vidhya


The Tech Platform
