Data Science Article Recommendation System — Data Collection, Topic Modelling and Deployment
End-to-end data science article recommendation system — from data collection, topic modelling to deployment on AWS Ubuntu instance
Introduction
Recommendation systems are at the heart of every successful retail interface, from Amazon and Netflix to LinkedIn and Coursera. They are considered among the most successful applications of machine learning technologies to a wide array of businesses. Major areas of application are:
- eCommerce — recommending products to users
- Media — recommending music, videos, news, TV serials
- Job boards — matching skills, careers and aspirations to jobs
- Travel — matching flights, trip deals, events, car rentals, hotels and restaurants
- Real Estate — recommending properties for rental and sale
- Education — personalized course recommendations
The primary aim of a recommendation system is to predict a user's affinity towards a given item and maximise the likelihood of the user interacting with it, and therefore buying, subscribing or applying, producing a near-perfect match and an unforgettable (hence repeatable) buying or subscription experience!
Though there are only a few basic algorithm families in recommendation systems, viz. content-based, collaborative filtering and combinations thereof, the implementation can be very complex. In large eCommerce businesses, the predictive power of these algorithms is even used to optimize complex business operations like proactive provisioning, logistics and warehouse management.
In this article I describe how to implement a basic content-based recommendation system that suggests data science articles from Medium publications based on a reference article or user preferences.
Note: This implementation is not an end-use system by itself, but a demonstration of a recommendation system that could be part of a larger business use case.
Main Steps
1. Data source and pre-processing
2. Topic modelling using different algorithms
3. Content based recommendation
4. Exposing through web service
5. Deployment
Implementation Overview
1. Source: Medium articles
2. Topic modelling algorithms:
a. TFIDF-SVD (term frequency-inverse document frequency, followed by singular value decomposition)
b. NMF (non-negative matrix factorization)
c. LDA (latent dirichlet allocation)
3. Recommendation engine: Cosine similarity in Python
4. Web exposure: Streamlit
5. Deployed on: AWS Ubuntu instance
Source and data pre-processing
The Medium articles dataset from Kaggle has been used (see reference 1). The data was collected from Medium archive pages, which are categorized based on tags.
The following pre-processing steps have been used; a minimal sketch follows the list:
- Raw data analysis and re-shaping
- Missing data analysis
- Duplicate records processing
- Type conversion
- Data analysis for language and claps
- Filtering on language and claps
- Text cleaning to remove links, non-alphanumeric characters and punctuation, returning lower-case text
- Stop-words elimination
- Stemming
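This is a minimal sketch of the cleaning pipeline, assuming an NLTK-based setup; the file name, column names and claps threshold are illustrative assumptions, not the exact values used.

```python
# Minimal pre-processing sketch: load, deduplicate, filter and clean.
import re

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    """Strip links, non-alphanumeric characters and punctuation,
    lower-case, remove stop-words and stem the remaining tokens."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove links
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)     # keep alphanumerics only
    tokens = text.lower().split()
    tokens = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

df = pd.read_csv("medium_articles.csv")                     # hypothetical file name
df = df.drop_duplicates(subset="title").dropna(subset=["text"])
df["claps"] = pd.to_numeric(df["claps"], errors="coerce")   # type conversion
df = df[df["claps"] >= 25]                                  # illustrative claps filter
df["clean_text"] = df["text"].apply(clean_text)
```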
Topic Modelling Algorithms
TFIDF-SVD (term frequency, inverse document frequency — singular value decomposition)
Tf-idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf-idf is one of the most popular term-weighting schemes today; a survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf-idf. Applying truncated SVD to the tf-idf document-term matrix then compresses it into a small number of latent topics, a combination also known as latent semantic analysis.
(Source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf )
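As a concrete sketch with scikit-learn, assuming the df["clean_text"] column produced above; the vocabulary size and component count are illustrative choices:

```python
# TF-IDF vectorization followed by truncated SVD into 10 latent topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df["clean_text"])        # documents x terms

svd = TruncatedSVD(n_components=10, random_state=42)
doc_topics_svd = svd.fit_transform(X)            # documents x 10 latent topics

# Inspect the top words of each latent component.
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[-8:][::-1]
    print(f"Topic {i}: {', '.join(terms[t] for t in top)}")
```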
NMF (non-negative matrix factorization)
NMF was first introduced by Paatero and Tapper in 1994, and popularized in an article by Lee and Seung in 1999. Since then, the number of publications referencing the technique has grown rapidly.
Non-negative matrix factorization (NMF) has become a widely used tool for the analysis of high-dimensional data, as it automatically extracts sparse and meaningful features from a set of non-negative data vectors; this ability to produce sparse and easily interpretable factors is exactly why it has become so popular.
(Source: https://blog.acolyer.org/2019/02/18/the-why-and-how-of-nonnegative-matrix-factorization/ )
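A minimal sketch of the NMF step on the same tf-idf matrix X and terms vocabulary from above; 10 components anticipates the topic count chosen later:

```python
# Factorize the TF-IDF matrix into non-negative W (documents x topics)
# and H (topics x terms) matrices.
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init="nndsvd", random_state=42)
doc_topics = nmf.fit_transform(X)      # W: documents x topics
topic_terms = nmf.components_          # H: topics x terms

for i, comp in enumerate(topic_terms):
    top = comp.argsort()[-8:][::-1]
    print(f"Topic {i}: {', '.join(terms[t] for t in top)}")
```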
LDA (latent dirichlet allocation)
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.
(Source: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation )
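A minimal LDA sketch; note that LDA models raw term counts, so a CountVectorizer is used here instead of the tf-idf matrix (an assumption about the setup, not the author's exact code):

```python
# LDA over raw term counts; each document becomes a mixture of topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

counts = CountVectorizer(max_features=5000)
X_counts = counts.fit_transform(df["clean_text"])

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics_lda = lda.fit_transform(X_counts)   # documents x topic proportions

count_terms = counts.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = comp.argsort()[-8:][::-1]
    print(f"Topic {i}: {', '.join(count_terms[t] for t in top)}")
```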
Topic Modelling Algorithm Conclusion
Topics from LDA were fine, but those obtained with NMF were slightly more distinct. Distinctness is an important consideration for a content-based recommender, as it allows the recommender to better match an article to a user's tastes.
The final selection was the NMF model, as it produced the clearest, most cohesive and most differentiated topic groups.
Experimentation on Ideal number of Topics
The topic modelling exercise with the above three algorithms was repeated for various numbers of topics: 9, 10, 11 and 12.
The conclusions here were:
1. Fewer topics tended to merge distinct groups
2. More topics produced irrelevant, non-coherent or repeating topics
So the optimal topic number was found to be 10. Of course this is a subjective decision; there is no absolute right or wrong here.
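The comparison can be reproduced with a short loop; this is a minimal sketch, assuming the tf-idf matrix X and terms vocabulary from the TFIDF-SVD step above, which simply prints the top words per topic for manual inspection:

```python
# Refit NMF for 9-12 topics and eyeball the top words of each topic
# for coherence and overlap; purely a manual-inspection aid.
from sklearn.decomposition import NMF

for n in (9, 10, 11, 12):
    model = NMF(n_components=n, init="nndsvd", random_state=42)
    model.fit(X)
    print(f"\n=== {n} topics ===")
    for i, comp in enumerate(model.components_):
        top = comp.argsort()[-6:][::-1]
        print(f"Topic {i}: {', '.join(terms[t] for t in top)}")
```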
Example of 10 Topics from NMF
Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. The name derives from the term “direction cosine”: in this case, unit vectors are maximally “similar” if they’re parallel and maximally “dissimilar” if they’re orthogonal (perpendicular). This is analogous to the cosine, which is unity (maximum value) when the segments subtend a zero angle and zero (uncorrelated) when the segments are perpendicular.
These bounds apply for any number of dimensions, and the cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval and text mining, each term is notionally assigned a different dimension and a document is characterized by a vector where the value in each dimension corresponds to the number of times the term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.
(Source: https://en.wikipedia.org/wiki/Cosine_similarity )
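Applied to this project, each article is represented by its 10-dimensional NMF topic vector, and recommendations are the articles whose vectors have the highest cosine similarity to a query vector. A minimal sketch, assuming the df and doc_topics objects from the steps above (the "url" column name is an assumption):

```python
# Rank all articles by cosine similarity to a 10-d query topic vector
# and return the five closest matches.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top5_similar(query_vec: np.ndarray):
    sims = cosine_similarity(query_vec.reshape(1, -1), doc_topics)[0]
    best = np.argsort(sims)[-5:][::-1]          # indices of the 5 highest scores
    return df.iloc[best][["title", "url"]]      # "url" column name is assumed
```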
Web exposure: Why Streamlit
Streamlit is a completely free and open source app framework for ML engineers.
Streamlit works like this:
1. The entire script is run from scratch for each user interaction.
2. Streamlit assigns each variable an up-to-date value given widget states.
3. Caching allows Streamlit to skip redundant data fetches and computation.
Some Streamlit features that make it incredibly easy and effective:
1. Streamlit apps are pure Python files. So you can use your favorite editor and debugger with Streamlit.
2. Pure Python scripts work seamlessly with Git and other source control software, including commits, pull requests, issues, and comments. Because Streamlit's underlying language is pure Python, you get all the benefits of these amazing collaboration tools for free.
3. Streamlit provides an immediate-mode live coding environment. Just click Always rerun when Streamlit detects a source file change.
4. Caching simplifies setting up computation pipelines. Amazingly, chaining cached functions automatically creates efficient computation pipelines!
5. Streamlit is built for GPUs. Streamlit allows direct access to machine-level primitives like TensorFlow and PyTorch and complements these libraries.
6. Streamlit is a free and open-source library rather than a proprietary web app. You can serve Streamlit apps on-prem without contacting us. You can even run Streamlit locally on a laptop without an Internet connection! Furthermore, existing projects can adopt Streamlit incrementally.
(Source: https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace )
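To make the rerun-and-cache model concrete, here is a minimal sketch of the app shell; the topic names, file name and scoring are illustrative assumptions, not the exact app code:

```python
# Minimal Streamlit shell: the whole script reruns top-to-bottom on every
# widget interaction, while the cached loader is skipped on reruns.
import numpy as np
import streamlit as st

TOPICS = ["Statistics", "Career and Learning", "NLP"]   # illustrative subset of the 10

@st.cache_data                        # cached across reruns unless inputs change
def load_doc_topics() -> np.ndarray:
    return np.load("doc_topics.npy")  # hypothetical pre-computed NMF W matrix

st.title("Data Science Article Recommender")

# Each slider simply reports its current state on every rerun.
weights = np.array([st.slider(topic, 0.0, 1.0, 0.0) for topic in TOPICS])

if st.button("Recommend"):
    doc_topics = load_doc_topics()
    sims = doc_topics[:, : len(TOPICS)] @ weights   # toy scoring for the sketch
    st.write("Top 5 article indices:", np.argsort(sims)[-5:][::-1])
```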
Functional design
Top 5 Medium article recommendations can be obtained in two ways:
1. Select weightages for one or more of the 10 topics: the 5 most relevant articles, with maximum similarity to the 10-topic weight vector, will be returned.
2. Enter the URL of a Medium article: the 5 most relevant articles, with maximum similarity to the topic content of the input article, will be returned (a sketch of both lookups follows this list).
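This is a minimal sketch of the two lookups, assuming the clean_text, tfidf, nmf and top5_similar objects defined in the earlier sections:

```python
import numpy as np

# Approach 1: user-supplied weightages for the 10 topics.
weights = np.zeros(10)
weights[[2, 7]] = 1.0                 # hypothetical indices for two chosen topics
print(top5_similar(weights))

# Approach 2: infer a topic vector from the text of an input article.
def topic_vector(article_text: str) -> np.ndarray:
    cleaned = clean_text(article_text)                    # same cleaning as training
    return nmf.transform(tfidf.transform([cleaned]))[0]   # project into topic space

article_text = "..."                  # text scraped from the input Medium URL
print(top5_similar(topic_vector(article_text)))
```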
Now let us see how this actually works!
Approach 1:
Let us select two topics, “Statistics” and “Career and Learning”, and see the recommendations.
Approach 2:
Let us select an article on the Random Forest algorithm and see the recommendations.
We can see that in both approaches above, the most relevant top 5 recommendations are returned.
Summary
I have built a simulation of a content-based recommendation system for data science articles from Medium.
Topics are derived from topic modelling of the entire corpus of thousands of articles. Based on the user-rated topic vector, the top 5 best-matching data science articles are returned.
Source: Medium articles
Topic modelling algorithms: TFIDF-SVD, NMF and LDA
Recommendation engine: Content based, using cosine similarity
Web exposure: Streamlit
Deployed on: AWS Ubuntu instance
I experimented with topic modelling for 9, 10, 11 and 12 topics. The optimal number of coherent and non-overlapping topics was 10, and NMF provided the most coherent topics.
Hope you have liked this humble implementation!
References
1. Medium articles data: https://www.kaggle.com/aiswaryaramachandran/medium-articles-with-content
2. Using Streamlit on AWS: https://towardsdatascience.com/how-to-deploy-a-streamlit-app-using-an-amazon-free-ec2-instance-416a41f69dc3
3. Using topic modelling for medium articles: https://towardsdatascience.com/building-a-content-based-recommender-for-data-science-articles-728e5ec7d63d
4. Final dataset with Topic modelling: https://www.kaggle.com/sudarshanvaidya/medium-data-science-articles-topic-modelling
5. Demo video — Approach 1: https://youtu.be/SAX4IW4npI0
6. Demo video — Approach 2: https://youtu.be/fiIYk_oK_2Y