Recent projects:
I. Machine Learning
Interest rates predictions
This project aims to predict the base interest rates in Australia, the US, and the UK over the coming months. It demonstrates a complete cycle of a Machine Learning and Data Engineering project, including an ETL process that collects and transforms data from different sources. The data is stored in Azure Storage and used for model development. The model is then deployed and served as a web service, which pushes the prediction results to a GitHub page for visualization.
Data pipelines and architecture
Modelling baseball players’ performance
This work was supported by a Google Summer of Code project, which adds support for Multi-output Gaussian processes in PyMC.
We model the performance of different sports players using Multi-output Gaussian processes (MOGPs), which can simultaneously learn and infer many outputs that share the same source of uncertainty. The following picture shows the estimated spin rates of three top pitchers on different game dates. Please check the PyMC example for further details.
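To illustrate how several outputs can share one source of uncertainty, here is a minimal sketch of one common MOGP construction, the intrinsic coregionalization model (ICM). It uses only the standard library and is not the PyMC implementation. The function names, the mixing weights `W`, the per-output variances `kappa`, and the input values are all hypothetical and chosen only for illustration.

```python
import math

def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def icm_covariance(xs, W, kappa, lengthscale=1.0):
    """Joint covariance over all (output, input) pairs.

    The result is the Kronecker product of B and K_x, where
    B = W W^T + diag(kappa) couples the outputs and K_x is an RBF
    kernel shared by every output. Rows and columns are ordered
    output-major: all inputs for output 0, then output 1, and so on.
    """
    n_out, n = len(W), len(xs)
    # Coregionalization matrix B: how strongly each pair of outputs co-varies.
    B = [[sum(a * b for a, b in zip(W[i], W[j])) + (kappa[i] if i == j else 0.0)
          for j in range(n_out)] for i in range(n_out)]
    # Shared input kernel K_x.
    Kx = [[rbf(a, b, lengthscale) for b in xs] for a in xs]
    return [[B[i][j] * Kx[p][q] for j in range(n_out) for q in range(n)]
            for i in range(n_out) for p in range(n)]

# Hypothetical setup: 3 outputs (e.g. three pitchers) observed at 2 inputs.
xs = [0.0, 1.0]
W = [[1.0], [0.8], [0.5]]   # assumed mixing weights onto one latent process
kappa = [0.1, 0.1, 0.1]     # assumed output-specific variances
K = icm_covariance(xs, W, kappa)
print(len(K), len(K[0]))    # size of the joint covariance matrix
```

Because every output reuses the same input kernel and differs only through `B`, observing one output also informs the predictions for the others.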
Aussie Social Sentiment Analysis
This project collects data from Twitter’s APIs, then cleans it and stores it in a SQL database. The stored data is used to train a sentiment analysis model, which then predicts the sentiment of other tweets. The web app uses Dash visualisations and is deployed on Heroku.
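The clean-and-store step can be sketched roughly as follows. This is only an illustration under assumptions: the cleaning rules, the `tweets` table schema, the in-memory SQLite database, and the sample texts are all hypothetical and are not taken from the project.

```python
import re
import sqlite3

def clean_tweet(text):
    """Strip URLs and @mentions, then collapse repeated whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    return re.sub(r"\s+", " ", text).strip()

# In-memory database used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")

# Made-up sample records standing in for API results.
raw = [
    (1, "Loving the weather in Melbourne today! https://example.com @someone"),
    (2, "  Traffic is   terrible again  "),
]
conn.executemany(
    "INSERT INTO tweets (id, text) VALUES (?, ?)",
    [(tweet_id, clean_tweet(text)) for tweet_id, text in raw],
)
conn.commit()

rows = conn.execute("SELECT id, text FROM tweets ORDER BY id").fetchall()
for row in rows:
    print(row)
```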
II. MLOps & Data Engineering
Accelerating testing CI pipelines
Continuous Integration and Continuous Delivery (CI/CD) plays an essential role in MLOps and DataOps. One popular CI/CD tool is GitHub Actions, whose matrix strategy lets a single job definition run as many parallel jobs.
For example, I recently used the strategy.matrix feature to cut the running time of the testing CI pipelines for PyTensor from 75 minutes to around 26 minutes (a 65% reduction). Please check this pull request for further details.
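A matrix strategy creates one job per combination of the variables it lists, and those jobs can run in parallel. The sketch below mimics that expansion in plain Python and checks the quoted reduction. The matrix keys and values are hypothetical and are not the actual PyTensor configuration.

```python
from itertools import product

def expand_matrix(matrix):
    """Return one job configuration per combination of matrix values."""
    keys = list(matrix)
    return [dict(zip(keys, combo))
            for combo in product(*(matrix[key] for key in keys))]

# Hypothetical matrix: names and values are illustrative only.
matrix = {
    "python-version": ["3.9", "3.11"],
    "test-subset": ["tests/scalar", "tests/tensor", "tests/link"],
}
jobs = expand_matrix(matrix)
for job in jobs:
    print(job)

# Relative reduction in wall-clock time, using the figures quoted above.
before_min, after_min = 75, 26
reduction = (before_min - after_min) / before_min
print(f"{reduction:.0%}")
```

Splitting the test suite into subsets means the wall-clock time is bounded by the slowest subset rather than by the sum of all of them, which is where the speed-up comes from.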
Building Docker images for PyMC
I have built a Dockerfile for PyMC v4 that supports both GPU and CPU versions. The Dockerfile has been merged into the PyMC project’s code base in this pull request.
The docker image is then published on Docker Hub, so users can easily pull the image and set up their environment.
III. Data Science
Data Visualisation: Melbournian Daily Activities
This project visualises the daily activities of Melbournians in different areas.
Domain Scenery Views - Telegram Bot
A Telegram bot written in Python that automatically pushes the top listings to the Domain channel on Telegram each day. The top listings show properties in Australian cities with the most beautiful views, such as beaches, lakes, and city skylines. Customers can join the Domain Scenery Views channel on Telegram to receive the updates.
For more projects, please check my GitHub: @danhphan.