Multi-output Gaussian Processes in PyMC [GSoC Final Report]
This post summarises the project of incorporating Multi-output Gaussian Processes (MOGPs) into PyMC, including the work that has been done and the work that still needs to be done.
This work is supported by GSoC, NumFOCUS, and the PyMC team.
1. Work that has been done
In the last 12 weeks, I focused on implementing the Intrinsic Coregionalization Model (ICM) and the Linear Coregionalization Model (LCM) in PyMC. All the experimental code is published in this GitHub repository.
- In Weeks 01-03, I started with a small goal: running an Intrinsic Coregionalization Model (ICM) in PyMC. Most of the code had already been developed in PyMC v3 by Bill Engels (one of my mentors), so I only needed to convert the PyMC v3 notebook into a PyMC v4 notebook.
- In Weeks 04-06, I implemented the Linear Coregionalization Model (LCM) in PyMC. I then proposed several API options for ICM and LCM.
- In Weeks 07-09, I focused on implementing ICM and LCM using the Hadamard (element-wise) product. The Hadamard product approach works whether the outputs share the same input data or have different input data.
- In Weeks 10-12, I focused on implementing ICM and LCM using the Kronecker product. Note that the Kronecker product approach can ONLY work when all outputs share the same input data. In addition, the kernels for the input data need to be stationary.
- Finally, I created a PR on the pymc-experimental GitHub repo. This is a work-in-progress (WIP) PR, as I still need to try and test different API options for different kinds of input data.
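To make the difference between the two products more concrete, here is a minimal NumPy sketch of how an ICM covariance matrix could be built either way. It is not the code from the PR and does not use the PyMC API: the helper names (exp_quad, coregion_matrix, icm_hadamard, icm_kronecker, lcm_hadamard), the squared-exponential kernel, and the parameterisation B = W W^T + diag(kappa) are assumptions chosen for illustration only.

```python
import numpy as np


def exp_quad(x1, x2, ls):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)


def coregion_matrix(W, kappa):
    """Coregionalization matrix B = W W^T + diag(kappa) (assumed form)."""
    return W @ W.T + np.diag(kappa)


def icm_hadamard(x, out_idx, B, ls):
    """ICM covariance as an element-wise (Hadamard) product.

    x and out_idx hold one entry per observation, so every output
    may be observed at its own set of inputs.
    """
    return B[np.ix_(out_idx, out_idx)] * exp_quad(x, x, ls)


def icm_kronecker(x, B, ls):
    """ICM covariance as a Kronecker product; all outputs share the inputs x."""
    return np.kron(B, exp_quad(x, x, ls))


def lcm_hadamard(x, out_idx, Bs, lss):
    """LCM covariance: a sum of ICM terms, each with its own B and kernel."""
    return sum(icm_hadamard(x, out_idx, B, ls) for B, ls in zip(Bs, lss))


rng = np.random.default_rng(0)
n_outputs, n_points = 3, 5
x = np.linspace(0.0, 1.0, n_points)
B = coregion_matrix(rng.normal(size=(n_outputs, 2)), np.full(n_outputs, 0.1))

# Same inputs for every output: stack x once per output and tag each row
# with the index of the output it belongs to.
x_stacked = np.tile(x, n_outputs)
out_idx = np.repeat(np.arange(n_outputs), n_points)

K_had = icm_hadamard(x_stacked, out_idx, B, ls=0.3)
K_kron = icm_kronecker(x, B, ls=0.3)
same = np.allclose(K_had, K_kron)  # both constructions give the same matrix

# Different inputs per output: only the Hadamard form applies here.
x_mixed = np.concatenate([np.linspace(0.0, 1.0, 4), np.array([0.25, 0.75])])
idx_mixed = np.array([0, 0, 0, 0, 1, 1])
K_mixed = icm_hadamard(x_mixed, idx_mixed, B, ls=0.3)

# LCM with two ICM terms that use different lengthscales.
B2 = coregion_matrix(rng.normal(size=(n_outputs, 1)), np.full(n_outputs, 0.1))
K_lcm = lcm_hadamard(x_stacked, out_idx, [B, B2], [0.3, 1.0])
```

When every output shares the same inputs, the two constructions agree, and the Kronecker form additionally exposes a structure that can be exploited for faster linear algebra. The Hadamard form needs only an output index per observation, which is why it also covers the case of different inputs per output.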
2. Work that needs to be done
- Finish the draft PR, which includes the implementation of MultiOutputGP for both the Hadamard product and the Kronecker product.
- Write tests and documentation for these functions and classes.
- Write two example notebooks: one for the Hadamard product using a baseball dataset (thanks to Chris for this data), and one for the Kronecker product using the data sets here, with 4 outputs: GOLD, OIL, NASDAQ, and USD.
The project allowed me to learn more about Gaussian Processes (GPs), their advantages, and also their limitations. I think GPs have huge potential for spatial and temporal (time-series) data sets.
Besides, implementing GPs helped me better understand the Multivariate Normal distribution :) There is still a lot to learn and do, though. I'm especially interested in learning more about other methods for time-series data, and in comparing the performance of these models.
Finally, I would like to thank the PyMC dev team, especially my mentors Chris Fonnesbeck and Bill Engels, for their great guidance and support. I would definitely not have been able to carry out the project well without their insightful suggestions. I would love to stay involved and contribute more to the PyMC community after this project. Thank you also to NumFOCUS and the GSoC program for giving me the opportunity to work on the Multi-output Gaussian Processes in PyMC project.