Recent projects:

I. Data Engineering & DataOps

Market Risk Monitoring

This project aims to monitor financial market risk using various data sources. I demonstrate a complete Data Engineering and DataOps cycle, including data ingestion, data transformation, and a data platform, all built using infrastructure as code.

View code on GitHub

Data pipelines and architecture


Interest rates ETLs and predictions

This project aims to predict the base interest rates in Australia, the US, and the UK in the coming months. I demonstrate a complete Machine Learning and Data Engineering project cycle, including an ETL process that collects and transforms data from different data sources. The data is then stored in Azure Storage and used for model development. The model is deployed and served as a web service, which pushes the prediction results to a GitHub page for visualization.

View code on GitHub

Project dashboard

Data pipelines and architecture


Accelerating testing CI/CD pipelines for PyTensor

Continuous Integration and Continuous Delivery (CI/CD) play an essential role in MLOps and DataOps. One of the popular CI/CD tools is GitHub Actions. To speed up CI/CD pipelines, we can use the matrix strategy in GitHub Actions to run tasks in parallel.

For example, I recently used the strategy.matrix feature to cut the running time of the CI testing pipelines for PyTensor from 75 minutes to around 26 minutes (a 65% reduction). Please check this pull request for further details.
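
As a rough illustration of the idea (not the actual PyTensor workflow), the sketch below mimics how a matrix expands into one job per combination of its axes, and checks the reported reduction. The axis names `python-version` and `test-part` and their values are assumptions made up for this example.

```python
from itertools import product


def expand_matrix(matrix):
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


def percent_reduction(before, after):
    """Relative reduction in running time, as a percentage."""
    return 100.0 * (before - after) / before


# Hypothetical axes: two Python versions, test suite split into three parts.
matrix = {
    "python-version": ["3.9", "3.11"],
    "test-part": ["part-1", "part-2", "part-3"],
}

jobs = expand_matrix(matrix)
for job in jobs:
    print(job)  # each combination would run as its own parallel job

print(len(jobs))                          # 6 jobs
print(round(percent_reduction(75, 26)))   # 65
```

Because the jobs run in parallel, the wall-clock time is bounded by the slowest job rather than by the sum of all of them, which is where the speed-up comes from.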

View code on GitHub


Building Docker images for PyMC on Docker Hub

I have built a Dockerfile for PyMC v4 that supports both GPU and CPU versions. The Dockerfile has been merged into the PyMC project’s code base in this pull request.

The docker image is then published on Docker Hub, so users can easily pull the image and set up their environment.

View code on GitHub


II. Machine Learning & MLOps

Modelling baseball players’ performance using Probabilistic Programming with PyMC

This work was supported by a Google Summer of Code project, which added support for multi-output Gaussian processes in PyMC.

We model the performance of different sports players by leveraging multi-output Gaussian processes (MOGPs), which can simultaneously learn and infer many outputs that share the same source of uncertainty. The following picture shows the estimated spin rates of three top pitchers on different game dates. Please check the PyMC example for further details.
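
To make the "shared source of uncertainty" concrete, here is a minimal pure-Python sketch of one common MOGP construction, the intrinsic coregionalization model, where the covariance between output i at input x and output j at input x' is B[i][j] * k(x, x'). This is an illustration under stated assumptions, not the PyMC implementation: the function names, the weights `W`, and the `kappa` values are all hypothetical.

```python
import math


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)


def coregion_matrix(W, kappa):
    """B = W W^T + diag(kappa): covariance between the outputs."""
    n = len(W)
    return [
        [
            sum(a * b for a, b in zip(W[i], W[j])) + (kappa[i] if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def icm_kernel(points, B, lengthscale=1.0):
    """Covariance over (input, output_index) pairs: B[i][j] * k(x, x')."""
    return [
        [B[i][j] * rbf(x, y, lengthscale) for (y, j) in points]
        for (x, i) in points
    ]


# Hypothetical setup: three outputs (e.g. three pitchers), one shared latent factor.
W = [[1.0], [0.8], [-0.5]]
kappa = [0.1, 0.1, 0.1]
B = coregion_matrix(W, kappa)

# Each point is (input value, output index), e.g. (game date, pitcher id).
points = [(0.0, 0), (0.0, 1), (1.0, 2)]
K = icm_kernel(points, B)
```

The off-diagonal entries of `B` are what let an observation for one output inform the predictions for another; with `B` diagonal, the model would reduce to independent Gaussian processes.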

View on GitHub View on PyMC


Aussie Social Sentiment Analysis with Twitter’s APIs

This project collects data from Twitter’s APIs, then cleans it and stores it in a SQL database. The data is then used to train a sentiment analysis model, which predicts the sentiment of other tweets. The web app uses Dash visualisations and is deployed on Heroku.
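
The clean-and-store step could look roughly like the sketch below. It is only an illustration: it uses an in-memory SQLite database in place of the project's database, and the cleaning rules, the `tweets` table layout, and the function names are assumptions rather than the project's actual code.

```python
import re
import sqlite3


def clean_tweet(text):
    """Remove URLs, @mentions and '#' signs, then normalise whitespace and case."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = text.replace("#", "")
    return re.sub(r"\s+", " ", text).strip().lower()


def store_tweets(conn, tweets):
    """Store each raw tweet next to its cleaned version (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, raw TEXT, clean TEXT)"
    )
    conn.executemany(
        "INSERT INTO tweets (raw, clean) VALUES (?, ?)",
        [(t, clean_tweet(t)) for t in tweets],
    )
    conn.commit()


# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
store_tweets(conn, ["Loving the #Melbourne weather! https://example.com @someone"])
rows = conn.execute("SELECT clean FROM tweets").fetchall()
print(rows)
```

Keeping the raw text alongside the cleaned text means the cleaning rules can be changed later without collecting the tweets again.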

View code on GitHub View Demo


Data Visualisation: Mapping Melbournian Daily Activities

This project visualises the daily activities of Melbournians in different areas.

View code on GitHub View Demo


Domain Scenery Views - Telegram Bot

A Telegram bot written in Python that automatically pushes the top listings to the Domain channel on Telegram each day. The top listings show properties with the most beautiful views, such as beach, lake, and city views, in Australian cities. Customers can join the Domain Scenery Views channel on Telegram to receive the updates.
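
A daily push of this kind can be sketched as below. The sketch only prepares the request for the Telegram Bot API `sendMessage` method and does not send anything. The listing fields (`title`, `view_score`), the ranking rule, the token, and the channel name are all hypothetical placeholders, not taken from the project.

```python
def top_listings(listings, n=3):
    """Pick the n listings with the highest (hypothetical) view score."""
    return sorted(listings, key=lambda item: item["view_score"], reverse=True)[:n]


def format_message(listings):
    """Render the chosen listings as a numbered plain-text message."""
    lines = [f"{rank}. {item['title']}" for rank, item in enumerate(listings, start=1)]
    return "Today's top scenery views:\n" + "\n".join(lines)


def build_send_message_request(token, chat_id, text):
    """Return the URL and payload for a Telegram Bot API sendMessage call."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = {"chat_id": chat_id, "text": text}
    return url, payload


# Made-up sample data, for illustration only.
listings = [
    {"title": "Beachfront apartment", "view_score": 0.92},
    {"title": "City skyline studio", "view_score": 0.81},
    {"title": "Lakeside house", "view_score": 0.88},
]

best = top_listings(listings, n=2)
url, payload = build_send_message_request(
    "<BOT_TOKEN>", "@example_channel", format_message(best)
)
print(url)
print(payload["text"])
```

Posting `payload` to `url` with any HTTP client, on a daily schedule, would complete the push; that step is left out so the example runs without network access or a real bot token.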

View code on GitHub


For more projects, please check my GitHub @danhphan.