KSV Muralidhar

Projects

MRAG: No-Code RAG Platform

MRAG is a no-code platform to build RAG pipelines and chat with your documents. Its features include:
- Provides an interactive QA chatbot with voice support where a user can also see the retrieved context to verify the response.
- Provides options for query enrichment, such as splitting a user query into sub-queries, rewriting the query for better retrieval, cleaning the query, and self-querying to pre-filter documents using their metadata.
- Displays evaluation metrics, such as an answer relevance score and a response hallucination score, for each response to the user's query.
- Enables context enrichment by generating Hypothetical Prompt Embeddings (HyPE): hypothetical queries are generated from documents using an LLM, and the embeddings of these questions are stored in the vector index, so that context retrieval becomes a query-to-query comparison.
- Provides multiple document splitters like Token Splitter, Sentence Splitter, Regex Splitter, PDF Font Splitter and Dummy Splitter.
- Provides a metadata extractor that lets the user define a metadata schema and extract metadata from documents using regular expressions.
- Lets the user customize QA chatbot settings such as the LLM, temperature, retriever settings, hybrid search settings and re-ranking settings.
- Provides a document and chunk viewer to preview the raw document text and the chunks generated by the splitters.
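
The HyPE feature above can be illustrated with a minimal sketch. It is not MRAG's implementation: the bag-of-words `embed` function stands in for a real embedding model, and the hypothetical questions are hard-coded where an LLM would normally generate them. All names (`embed`, `retrieve`, the sample chunks) are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Document chunks (illustrative data).
chunks = {
    "c1": "Refunds are issued within 14 days of a return being received.",
    "c2": "Standard shipping takes 3 to 5 business days.",
}

# Hypothetical questions per chunk. In HyPE an LLM generates these at indexing
# time; they are hard-coded here so the sketch runs offline.
hypothetical_questions = {
    "c1": ["How long does a refund take?", "When will I get my money back?"],
    "c2": ["How long does shipping take?", "When will my order arrive?"],
}

# The index stores question embeddings, each pointing back to its source chunk.
index = [(embed(q), cid) for cid, qs in hypothetical_questions.items() for q in qs]


def retrieve(query: str) -> str:
    """Return the chunk whose stored question is most similar to the query."""
    q = embed(query)
    _, best_chunk = max(index, key=lambda item: cosine(q, item[0]))
    return chunks[best_chunk]
```

Because the user's query is compared against stored questions rather than against raw chunk text, both sides of the comparison have the same form, which is the point of the query-to-query setup.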

Python
GenAI
RAG
LlamaIndex
FastAPI
Celery
MongoDB
Redis
React
Docker

Product Website Documentation

News Aggregator

News aggregator is an AI-powered app that aggregates news from a few selected RSS news feeds. The details of the backend services are given below:
- News indexing service: Extracts news items from a list of RSS feeds, computes sentence embeddings using Sentence Transformers, and stores them in the Milvus vector database for content-based recommendations using semantic search.
- Embedding deletion service: Deletes articles older than a month, along with their embeddings, from the Milvus vector database.
- ETL service: Extracts news items from a list of RSS feeds in parallel using multiprocessing. The items are processed and their categories are predicted by a DistilBERT model that was fully fine-tuned on a news headline classification task and then quantized using TFLite. The top 5 similar news articles are retrieved from the Milvus vector database using cosine similarity and then reranked using cross-encoders. Articles are summarized using a BART model fine-tuned on CNN and Daily Mail news articles. The transformed data is loaded into MongoDB Atlas (MongoDB-as-a-service).

The backend services are periodically triggered using a CRON job scheduled using GitHub Actions. The front end service reads the data from MongoDB Atlas and renders it in HTML when a user invokes it.
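
The retrieve-then-rerank step of the ETL service can be sketched as follows. This is a simplified illustration, not the project's code: plain Python lists stand in for Milvus, the tiny 3-dimensional vectors stand in for sentence embeddings, and `overlap_score` is a hypothetical placeholder for a cross-encoder.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec, articles, k=5):
    """First stage: the k nearest articles by cosine similarity of embeddings."""
    return heapq.nlargest(k, articles, key=lambda a: cosine(query_vec, a["embedding"]))


def overlap_score(query_title, candidate_title):
    """Placeholder for a cross-encoder: Jaccard overlap of title words."""
    q, c = set(query_title.lower().split()), set(candidate_title.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0


def rerank(query_title, candidates, score=overlap_score):
    """Second stage: re-order the shortlist with a pairwise scorer."""
    return sorted(candidates, key=lambda a: score(query_title, a["title"]), reverse=True)


# Illustrative data only.
articles = [
    {"title": "Central bank raises interest rates", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Interest rates unchanged, bank says", "embedding": [0.8, 0.2, 0.1]},
    {"title": "Local team wins cup final", "embedding": [0.0, 0.1, 0.9]},
]
shortlist = top_k_similar([1.0, 0.0, 0.0], articles, k=2)
ranked = rerank("Bank raises interest rates again", shortlist)
```

The cheap vector search narrows the candidates, and the more expensive pairwise scorer only has to look at the shortlist.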

Python
NLP
Deep Learning
ETL
Text Classification
Text Summarization
Content-based Recommendation
BeautifulSoup
Hugging Face
DistilBERT
BART
MongoDB
Milvus Vector DB
Redis
Quart
Docker
CRON Job
GitHub Actions
Multiprocessing

Front-end code Architecture Model Monitor

Income Range Predictor

This project demonstrates the development of a production-grade end-to-end binary classification application. The process of developing the application is as follows:
* Data is ingested from a database, validated and cleaned.
* Data is preprocessed by removing ID variables and low-variance variables, winsorizing outliers, imputing missing values, removing correlated features, marking rare categories, one-hot encoding, ordinal encoding and scaling. The hyperparameters of the preprocessing steps are tuned using Hyperopt during training.
* New features are constructed in the feature construction step.
* Data is clustered and the cluster labels are added to the data as features. The best 'k' is found using the silhouette score.
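
One of the preprocessing steps above, winsorizing, can be sketched in a few lines. The percentile bounds (5 and 95) and the function names are assumptions for illustration; in the project such bounds would be among the hyperparameters tuned during training.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the [lower_pct, upper_pct] percentile range.

    Outliers are capped at the bounds rather than dropped, so the number
    of rows stays the same.
    """
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


clipped = winsorize(list(range(101)))  # values 0..100
```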

Python
Machine Learning
FastAPI
Flask
MLflow Tracking
Hyperopt
Bitbucket
Unit Testing
Pytest
CI/CD
Docker

API code Front-end code Data Drift Monitor

News Article Classifier

This project demonstrates the development of a production-grade end-to-end multi-class text classification application using a small and imbalanced dataset. News Article Classifier classifies a news article into one of five classes: sport, entertainment, politics, tech and business. The model is trained on 1,500 BBC news articles. The data is augmented (using nlpaug) to increase the size of the dataset by 35x. The text is vectorized using Google Word2Vec and its dimensionality is reduced using PCA. Multiple machine learning models are automatically tuned using Hyperopt to find the best-performing model. The ML model is served as an API developed using FastAPI. The front-end application makes an API call upon a user request. Both the API and the front-end application are dockerized and deployed as individual web services to render.com.

Python
NLP
Word2Vec
Gensim
nlpaug
FastAPI
Flask
Hyperopt
Bitbucket
Unit Testing
MLflow Tracking
Pytest
CI/CD
Docker

API code Front-end code

Fine-Tuning HuggingFace Models

The project uses TensorFlow to fine-tune HuggingFace models for various tasks like text classification, text summarization and text translation. The models are fine-tuned for the following tasks:

- Text classification using DistilBERT.
- Text summarization using T5.
- Translation from English to French using T5.
- Translation from English to Hindi using ByT5.
- Text embedding extraction from DistilBERT and RoBERTa.
- Text classification using RoBERTa.
- Text classification using ALBERT, with explainability (XAI) using LIME.
- Hindi text summarization using mT5.
- Text summarization using BART.
- Named entity recognition using DistilBERT and DeBERTa.

Python
NLP
TensorFlow
HuggingFace
Fine-Tuning

View Project View HF Spaces

Fine-Tuning YOLO For Object Detection

The project fine-tunes YOLO for object detection tasks like:

- Vehicle Number Plate Recognition
- Credit Card Detection & OCR
- Face Detection

Python
Computer Vision
Object Detection
YOLO
EasyOCR
Fine-Tuning

Data Analyzer

Data Analyzer is an auto EDA package for basic exploratory data analysis. It enables a user to derive a data structure summary, summarize categorical and numeric attributes, plot a correlation matrix, derive chi-square test results, generate plots between different attribute types, and identify mutual information and multicollinearity.

Python
Pandas
Matplotlib
Seaborn
Auto EDA

View Code Example View in pypi.org

Machine Learning From Scratch

This project implements the following from scratch in Python:

- ML algorithms
- Data preprocessing functions
- Cross-validation functions

Python
Machine Learning
Cross-validation
Data Preprocessing

Incremental Machine Learning

This project demonstrates preprocessing and incrementally training an SGD classifier on a dataset with ~33.5M samples and 23 features (~9.5 GB in size). The EDA and data preprocessing tasks are parallelized using the multiprocessing package, and the dimensionality is reduced using IncrementalPCA.
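
The idea of incremental training can be shown with a small pure-Python sketch: a logistic regression updated by SGD one chunk at a time, so the full dataset never has to be in memory. This is an illustration under assumptions, not the project's code. The synthetic chunk generator stands in for reading a large file in pieces, and the class `IncrementalLogistic` is hypothetical; its `partial_fit` method name simply mirrors scikit-learn's convention for incremental estimators.

```python
import math
import random


def iter_chunks(n_samples, chunk_size, seed=0):
    """Yield (X, y) chunks of a synthetic dataset, one chunk at a time."""
    rng = random.Random(seed)
    for start in range(0, n_samples, chunk_size):
        size = min(chunk_size, n_samples - start)
        X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(size)]
        y = [1 if x[0] + x[1] > 0 else 0 for x in X]
        yield X, y


class IncrementalLogistic:
    """Logistic regression trained by SGD, one chunk per partial_fit call."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        """One SGD pass over a single chunk; weights carry over between calls."""
        for x, target in zip(X, y):
            err = self.predict_proba(x) - target
            self.w = [w - self.lr * err * v for w, v in zip(self.w, x)]
            self.b -= self.lr * err
        return self

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0


model = IncrementalLogistic(n_features=2)
for X_chunk, y_chunk in iter_chunks(n_samples=5000, chunk_size=500):
    model.partial_fit(X_chunk, y_chunk)

# Evaluate on a fresh chunk generated with a different seed.
X_test, y_test = next(iter_chunks(n_samples=500, chunk_size=500, seed=1))
accuracy = sum(model.predict(x) == t for x, t in zip(X_test, y_test)) / len(y_test)
```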

Python
Machine Learning
Multiprocessing
Pandas
EDA
IncrementalPCA

View Project

Vegetable Image Classifier

This project classifies uploaded color images of vegetables into 15 classes. The model is trained using transfer learning (MobileNet-v2) and quantized using TensorFlow Lite, which resulted in a 5x reduction in model size. The training dataset contained 15,000 vegetable images (1,000 per class) captured by a mobile phone camera.

Python
Deep Learning
Image Classification
CNN
TensorFlow
TensorFlow Lite
Distributed Training
Transfer Learning
MobileNet-v2

View Project

Time Series Forecasting

This project forecasts hourly electricity consumption using an LSTM model.

Python
Deep Learning
Time Series Forecasting
LSTM
TensorFlow

View Project

Export Excel/CSV to MySQL

This package enables a user to export Excel or CSV files to MySQL. It automatically parses the data types, creates a new database with the specified name (or uses the existing one), creates a new table with the specified name (or uses the existing one), and inserts all the records from the specified CSV/Excel file into the table.
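
A rough sketch of how automatic data type parsing could work is shown here. It is not the package's API: the function names, the chosen MySQL types (`BIGINT`, `DOUBLE`, `DATE`, `TEXT`) and the single supported date format are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def infer_sql_type(values):
    """Pick the narrowest type that fits every non-empty value in a column."""
    values = [v for v in values if v != ""]
    if not values:
        return "TEXT"

    def all_parse(parser):
        # True only if the parser accepts every value in the column.
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "BIGINT"
    if all_parse(float):
        return "DOUBLE"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "DATE"
    return "TEXT"


def create_table_sql(table, csv_text):
    """Build a CREATE TABLE statement from the header and rows of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    defs = ", ".join(f"`{c}` {infer_sql_type([r[c] for r in rows])}" for c in columns)
    return f"CREATE TABLE IF NOT EXISTS `{table}` ({defs})"


sample = "id,price,sold_on,name\n1,9.99,2024-01-05,apple\n2,12.5,2024-02-11,pear\n"
statement = create_table_sql("sales", sample)
```

The `IF NOT EXISTS` clause reflects the "creates a new table or uses the existing one" behavior described above.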

Python
Pandas
MySQL

View Code View in pypi.org

Sentiment Analysis

Sentiment Classifier predicts the sentiment of a customer review. An LSTM model is trained on 750K+ Amazon and Yelp reviews. The model is served as an API developed using FastAPI. The front-end application (developed using Flask) makes an API call upon a user request. Both the API and the front-end application are deployed as Docker images to Docker Hub.

Python
Deep Learning
NLP
Text Classification
TensorFlow
LSTM
ONNX
FastAPI
Flask
Docker

View Project

Export Excel/CSV to MongoDB

This project enables a user to export Excel or CSV files to MongoDB. It automatically parses the data types, creates a new database with the specified name (or uses the existing one), creates a new collection with the specified name (or uses the existing one), and inserts all the records from the specified CSV/Excel file as documents into the collection.

Python
Pandas
MongoDB

View Code

Digit Recognizer

Digit Recognizer lets a user draw a digit (from 0 to 9) on an HTML canvas. It then recognizes the digit using a Convolutional Neural Network model trained on the MNIST dataset. The application is dockerized and deployed to render.com.

Python
Deep Learning
Image Classification
TensorFlow
CNN
Flask
Docker

View Project

Linear Sequence Predictor

Linear Sequence Predictor trains a Batch Gradient Descent Regressor model (built from scratch) on-the-fly using the input sequence. The regressor is deliberately overfit to the training data: training continues while the training loss is > 0.009. Once the training loss is < 0.009, the next number is returned as a prediction. If the number of epochs reaches 100,000 and the training loss is still > 0.009, the training is aborted and a "cannot predict" message is returned. This happens when the input sequence is non-linear. The application is dockerized and deployed to render.com.
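
The training loop described above can be sketched as follows. The loss threshold (0.009) and the epoch limit (100,000) come from the description; everything else is an assumption for illustration, including the function name `predict_next`, the learning rate, the mean squared error loss, and the scaling of positions to [0, 1] (used here to keep one learning rate stable for any sequence length).

```python
def predict_next(sequence, lr=0.1, tol=0.009, max_epochs=100_000):
    """Fit y = w*x + b by batch gradient descent and extrapolate one step.

    Returns the predicted next number, or None when the loss never drops
    below `tol` (the "cannot predict" case for non-linear input).
    """
    n = len(sequence)
    if n < 2:
        raise ValueError("need at least two numbers")
    # Positions 0..n-1 scaled to [0, 1]; the next position is n / (n - 1).
    xs = [i / (n - 1) for i in range(n)]
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        errors = [w * x + b - y for x, y in zip(xs, sequence)]
        loss = sum(e * e for e in errors) / n  # mean squared error over the full batch
        if loss < tol:
            return w * (n / (n - 1)) + b
        # Gradients of the MSE with respect to w and b, using every sample.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return None


linear_prediction = predict_next([2, 4, 6, 8, 10])
nonlinear_prediction = predict_next([1, 4, 9, 16, 25])
```

For the linear input the result lands close to 12 (training stops as soon as the loss falls under the threshold, so it is not exact), while the quadratic input exhausts the epoch budget and yields `None`.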

Python
Machine Learning
Gradient Descent
Flask
Docker

View Project

Central Limit Theorem Visualizer

This application enables a user to visualize how the sampling distribution changes when sample and population characteristics, such as population size, sample size and number of samples, are varied. The application is built with Streamlit. It is dockerized and deployed to render.com.
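
The simulation behind such a visualizer can be sketched with the standard library alone. This is an illustrative sketch, not the application's code: the exponential population, the parameter values and the function name `sampling_distribution` are assumptions.

```python
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, seed=0):
    """Means of n_samples random samples (drawn with replacement) of the given size."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# A right-skewed (exponential) population, clearly not normal.
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(20_000)]

means = sampling_distribution(population, sample_size=50, n_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
expected_se = pop_sd / 50 ** 0.5  # standard error predicted by the theorem

print("population mean vs mean of sample means:", pop_mean, statistics.fmean(means))
print("predicted vs observed standard error:", expected_se, statistics.stdev(means))
```

Plotting a histogram of `means` for different values of `sample_size` shows the distribution tightening around the population mean and looking more bell-shaped as the sample size grows.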

Python
Statistics
Streamlit
Docker

View Project

Export Excel/CSV to PostgreSQL

This package enables a user to export Excel or CSV files to PostgreSQL. It automatically parses the data types, creates a new database with the specified name (or uses the existing one), creates a new table with the specified name (or uses the existing one), and inserts all the records from the specified CSV/Excel file into the table.

Python
Pandas
PostgreSQL

View Code View in pypi.org

Heart Failure Prediction

In this project, a Random Forest Classifier model is trained on an imbalanced dataset, with the class imbalance addressed using SMOTE. The model predicts whether a patient is suffering from heart failure based on certain medical indicators given as inputs.
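
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration and not the library implementation used in the project; the function name, the value of `k` and the sample points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Illustrative minority-class points with two features.
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority_points, n_new=6)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.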

Python
Machine Learning
EDA
Scikit-Learn

View Project

Association Rule Mining of Kaggle Survey

This project applies the Apriori algorithm to the ‘2020 Kaggle Machine Learning & Data Science Survey’ data to find associations among the technologies used by the respondents. The survey had 39+ questions asking the respondents about their demographics, the technologies (programming languages, IDEs, algorithms, libraries and cloud products) they use for data science and machine learning, the technologies they plan to learn in future, etc.
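
A compact sketch of the Apriori idea is given below: frequent itemsets are built level by level, and a candidate is kept only if all of its subsets are already frequent. The transactions are made-up examples in the spirit of the survey (sets of tools per respondent), not actual survey data, and the function name and support threshold are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        level = {s: support(s) for s in current}
        level = {s: sup for s, sup in level.items() if sup >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Illustrative data: tools reported by four hypothetical respondents.
responses = [
    {"Python", "SQL", "Jupyter"},
    {"Python", "Jupyter"},
    {"Python", "SQL"},
    {"R", "SQL"},
]
frequent = apriori(responses, min_support=0.5)

# Confidence of the rule Jupyter -> Python: support(both) / support(Jupyter).
confidence = frequent[frozenset({"Jupyter", "Python"})] / frequent[frozenset({"Jupyter"})]
```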

Python
Unsupervised Learning
Apriori Algorithm
Kaggle Survey

View Project

Data Analysis of Space Missions

This project performs exploratory data analysis of space mission data from 1957 using Pandas, Matplotlib and Seaborn.

Python
EDA
Pandas
Matplotlib
Seaborn

View Project

Data Scientist Role EDA: Kaggle Survey 2020

This project analyzes the education, age, gender, experience, IDEs and technologies used by data scientists and ML engineers using the Kaggle 2020 survey data.

Python
EDA
Pandas
Matplotlib
Seaborn
Kaggle Survey

View Project

Web Scraping of cars-data.com

This project scrapes the data related to various cars listed on cars-data.com using Python and BeautifulSoup.

Python
Pandas
Web Scraping
BeautifulSoup

View Project

Welcome. I'm

KSV
Muralidhar

Data Scientist &

Azure 3x Certified

