Portfolio

ALGOS: LeetCode problems

Solutions to various LeetCode problems, optimized for speed or memory, or simply taking an original approach (see README).

DS: Comparison of Supervised algorithms

DS: Comparison of Random Optimization algorithms

DS: Comparison of Unsupervised algorithms

DS: Importance of Development Set and Outliers using 1D Regression

DS: Function Approximation with regression

DS: Classification of (non)linearly separable datasets

ML: Building a Classification system for per-customer sales prediction

This project shows the full process of building an ML system: defining the problem as a classification task; loading, examining and pre-processing the data; establishing a baseline; trying linear models, then non-linear ones (kNN, decision tree, random forest, SVM, MLP); and choosing the metrics used to select the final model, including the precision-recall curve. Every step is covered by pytest tests. Finally, two sanity checks make sure the model is not flawed: deliberately overfitting a small sample of the training set, and feature ablations that check that performance degrades gradually while remaining consistent.
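
To make the model-selection step concrete, here is a minimal sketch of how the points of a precision-recall curve can be computed. It is not the project's code: the function name, the labels and the scores are made up for illustration, and a real project would more likely rely on a library implementation.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    total_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        # Predict positive whenever the score reaches the threshold.
        preds = [s >= thr for s in scores]
        tp = sum(1 for p, y in zip(preds, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(preds, y_true) if p and y == 0)
        precision = tp / (tp + fp)  # thr is one of the scores, so tp + fp >= 1
        recall = tp / total_pos if total_pos else 0.0
        points.append((thr, precision, recall))
    return points


# Hypothetical labels (1 = positive class) and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
pr_points = precision_recall_points(y_true, scores)
```

Lowering the threshold can only keep or raise recall, while precision may move either way, which is why the whole curve is inspected rather than a single operating point.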

ML: Multi-class classification on MNIST dataset with noisy labels

This project compares linear algorithms (logistic regression, Perceptron) to non-linear ones (Decision Trees, Random Forest, SVM, multi-layer perceptron) on the MNIST dataset, augmented to handle image rotations. It reaches an accuracy of 85% while using a training set of only 5000 images due to Colab limitations. It also analyses the influence of noise on the training labels, which reveals the good performance of SVM, with no degradation at up to 10-15% random or non-random label noise.

RL: Frozen Lake with Q-Learning

DL: Vectorized convolutions

This article explains how to use NumPy meshgrids and broadcasting to vectorize the convolution between a matrix and a kernel, as in a CNN, i.e. to parallelize it without a GPU.
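
The idea can be sketched as follows. This is an illustrative version written for this README, not necessarily the article's exact code: the function name and the example arrays are assumptions, and it computes a 'valid' cross-correlation (kernel not flipped), which is the usual CNN convention.

```python
import numpy as np


def conv2d_vectorized(image, kernel):
    """'Valid' cross-correlation of a 2D image with a 2D kernel, without Python loops."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1

    # Top-left corner of every output window, and offsets inside one window.
    oi, oj = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    ki, kj = np.meshgrid(np.arange(kh), np.arange(kw), indexing="ij")

    # Broadcasting (out_h, out_w, 1, 1) + (kh, kw) -> (out_h, out_w, kh, kw):
    # the row and column index of every pixel of every window.
    rows = oi[:, :, None, None] + ki
    cols = oj[:, :, None, None] + kj

    # Fancy indexing gathers all windows at once; the kernel broadcasts
    # over the last two axes, which are then summed out.
    patches = image[rows, cols]
    return (patches * kernel).sum(axis=(2, 3))


image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
result = conv2d_vectorized(image, kernel)  # shape (3, 3)
```

The trade-off is memory: the `patches` array holds `kh * kw` values per output pixel, so the speed-up over nested loops is paid for with a larger intermediate array.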

Basic EDA on a Kaggle car database

Data processing with Pandas, followed by univariate, bivariate and multivariate analysis, with explanations of skewness and kurtosis.
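
For reference, skewness and kurtosis can be computed from central moments as in the sketch below. The function names and the sample values are made up for illustration. The sketch uses the population (biased) formulas, so the numbers can differ slightly from the bias-corrected estimates that Pandas reports.

```python
from statistics import fmean


def central_moment(values, order):
    """Mean of (x - mean) ** order over the sample."""
    m = fmean(values)
    return fmean([(x - m) ** order for x in values])


def skewness(values):
    """Third standardized moment: > 0 means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical values with one large observation pulling the right tail.
sample = [1, 2, 3, 4, 10]
sample_skew = skewness(sample)
sample_kurt = excess_kurtosis(sample)
```

A symmetric sample such as `[1, 2, 3, 4, 5]` has zero skewness, while the single large value in `sample` makes the skewness positive.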

Analytics Engineering with DBT

Performance analysis of a website (sales funnel, conversion, etc.) and deployment.

Main technologies used:
- DBT (models, macros, Jinja, exposures)
- Snowflake
- SigmaComputing (dashboard, deployment)

GEN AI: Podcast transcription and info extraction with GPT3.5

This project shows how to download a podcast from a given RSS link, transcribe it into text, then process it with GPT3.5 to extract various data such as the podcast title, guest name/title/company, highlights and summary. It demonstrates the steps of a typical ‘GPT/LLM’ application, with different input data.

Main technologies used:
- OpenAI (GPT3.5-turbo)
- Whisper
- Streamlit
- Modal (GPU cloud)
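
As an illustration of the first step only (finding the audio file in an RSS feed), here is a minimal sketch using the standard library. It is not the project's code: the sample feed, the URLs and the function name are hypothetical, and the download, transcription and GPT steps are left out because they need network access and API keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, inlined so the sketch runs offline.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="456"/>
    </item>
  </channel>
</rss>"""


def first_episode(feed_xml):
    """Return the title and audio URL of the first item in an RSS 2.0 feed."""
    channel = ET.fromstring(feed_xml).find("channel")
    item = channel.find("item")  # assumption: the feed lists the newest episode first
    return {
        "podcast": channel.findtext("title"),
        "episode": item.findtext("title"),
        "audio_url": item.find("enclosure").get("url"),
    }


episode = first_episode(SAMPLE_FEED)
```

The returned `audio_url` is what would then be downloaded and passed to the transcription step.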

NLP: sentiment analysis: fine-tuning a BERT model vs training Word2vec

This project uses PyTorch and PyTorch Lightning to compare the accuracy and methodology of fine-tuning a pre-trained BERT model with those of training Word2vec for sentiment analysis.

It is a typical NLP text classification task, i.e. predicting whether a sentence has positive or negative sentiment.

Main technologies used:
- BERT
- Word2vec
- PyTorch
- PyTorch Lightning

NLP: comparison of encoder, decoder, GPT3.5 and fine-tuned GPT3.5 for sentiment analysis

This project compares various encoder and decoder models with OpenAI's base GPT3.5-turbo and its fine-tuned version for sentiment analysis on financial data.

Main technologies used:
- few-shot learning
- cardiffnlp/twitter-roberta-base-sentiment-latest
- ProsusAI/finbert
- microsoft/phi-1_5
- GPT3.5 Turbo
- GPT3.5 Turbo fine-tuning
- sentiment analysis
- Financial Phrasebank (Hugging Face)

LLM: LoRA fine-tuning of Phi-1.5 and Llama 2

This project uses parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA) to fine-tune the microsoft/phi-1_5 and meta-llama/Llama-2-7b-chat-hf models for summarizing scientific papers.

Main technologies used:
- transformers
- PEFT (LoRA)
- TRL
- microsoft/phi-1_5
- meta-llama/Llama-2-7b-chat-hf
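
To show what low-rank adaptation does to a single weight matrix, here is a small NumPy sketch. It is a conceptual illustration, not the project's training code (which uses the libraries listed above): the dimensions, rank and scaling factor are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8  # illustrative sizes, not the models' real ones

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: the update starts at 0


def lora_forward(x):
    """Frozen layer output plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
full_params = W.size           # parameters touched by full fine-tuning
lora_params = A.size + B.size  # parameters trained with LoRA
```

With these sizes, LoRA trains 512 parameters instead of 4096 for this layer, and because `B` starts at zero the adapted layer initially reproduces the frozen one exactly.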

MLOps: model evaluation, data quality testing and behavioral testing

This project conducts detailed model evaluation, data quality testing and behavioral testing for a news classification model.

MLOps: drift & model performance

This project focuses on model performance monitoring for a news classification model.

More details: It monitors system health (traffic volume, latency, SLA violations) and computes data and label drift for the inference traffic using different techniques (Chi-square, KS, classifier-based drift detection).
It measures performance over time for the inference traffic and relates it to any detected drift.
It experiments with outlier detection techniques to understand the impact of outliers on model performance.
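
The two statistical tests mentioned above can be sketched as follows. This is a simplified illustration, not the project's code: the function names and the data are made up, and only the test statistics are computed, without p-values, which a statistics library would normally provide.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for v in set(ref) | set(cur):
        f_ref = bisect_right(ref, v) / len(ref)
        f_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def chi_square_statistic(reference_labels, current_labels):
    """Pearson chi-square of current label counts against reference proportions."""
    ref_counts = Counter(reference_labels)
    cur_counts = Counter(current_labels)
    n_ref, n_cur = len(reference_labels), len(current_labels)
    stat = 0.0
    # Simplification: labels that never occur in the reference are ignored.
    for label, ref_count in ref_counts.items():
        expected = n_cur * ref_count / n_ref
        stat += (cur_counts.get(label, 0) - expected) ** 2 / expected
    return stat


# Hypothetical numeric feature: training data vs. shifted inference traffic.
train_feature = [0.1, 0.2, 0.3, 0.4, 0.5]
live_feature = [1.1, 1.2, 1.3, 1.4, 1.5]
feature_drift = ks_statistic(train_feature, live_feature)

# Hypothetical label distribution: balanced in training, skewed in production.
train_labels = ["sports"] * 50 + ["tech"] * 50
live_labels = ["sports"] * 80 + ["tech"] * 20
label_drift = chi_square_statistic(train_labels, live_labels)
```

A KS statistic close to 0 means the two samples have similar distributions, while a value of 1 means they do not overlap at all. The KS test suits continuous features and the chi-square test suits categorical ones such as predicted labels.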

DevOps: Site reliability engineering

This project shows how to manage alerts for a simple website deployed in a container with Kubernetes. Several alerts, such as LowMemory and KubePodNotReady, are declared with Prometheus, and the project shows how to send them to email and Slack channels. Several strategies to manage toil are also listed.

Main technologies used:
- Prometheus
- Kubernetes

GEN AI: RAG over Impact Theory episodes - (app)

A sidebar provides manual settings and metrics so that users can test various parameters and get familiar with all the steps of a typical RAG system, which are usually hidden. Warning: the free vector db sandbox may have been removed by Weaviate. Also, the fine-tuning option is not active when the system runs online since it requires Modal credits.

Main technologies used:
- sentence transformers (LlamaIndex)
- embeddings finetuning
- Weaviate
- re-ranking with CrossEncoder
- OpenAI, Llama2
- Streamlit (on HuggingFace)
- Modal (GPU backend)

GEN AI: Hybrid Multi-agent / RAG advanced trip recommender system in a React chatbot

This project's code is protected by an NDA, but I can share the general architecture (below). It uses an advanced multi-agent system and a graphRAG to create detailed trip recommendations. It is deployed in a chatbot on a React website.

Main technologies used:
- React frontend
- transformers
- LlamaIndex
- OpenAI, Llama3, Claude, Mixtral, Groq
- CrewAI
- Mongo (RAG)
- PostgreSQL
- graphRAG
- FastAPI
- PyTorch
- fine-tuning
- Docker
- HuggingFace (deployment)

GEN AI: RAG financial analyst

This project retrieves information from financial reports in PDF format, chunks and embeds it using Sentence Transformers, and stores the embeddings in a vector database, which is queried by an LLM. It is accessed from notebooks (no UI), but it may no longer work since the Weaviate sandbox expires after 2 weeks.

Main technologies used:
- PyPDFLoader, LlamaParse for parsing the PDFs (see parsers comparison in the 'pdf_readers' notebook)
- sentence transformers (LlamaIndex)
- Weaviate
- OpenAI
- Docker
- HuggingFace (deployment)
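
The chunk, embed and retrieve steps of this project can be illustrated with a toy, dependency-free sketch. It is not the project's code: a bag-of-words counter stands in for the sentence-transformer embeddings, a plain list stands in for the vector database, and the report text, function names and chunk sizes are made up.

```python
import math
from collections import Counter


def chunk_text(text, size=8, overlap=2):
    """Split text into overlapping chunks of `size` words (the last ones may be shorter)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text):
    """Toy stand-in for an embedding model: lower-cased bag-of-words counts."""
    return Counter(w.strip(".,?").lower() for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical report excerpt.
report = (
    "Revenue grew 12 percent in 2023. Operating costs fell slightly. "
    "The board approved a new dividend policy for shareholders."
)
chunks = chunk_text(report)
top = retrieve("What is the dividend policy?", chunks)
```

In the real pipeline, the retrieved chunks are then passed to the LLM as context for answering the question.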