Using Deep Learning to Explore Ultra-Large Scale Astronomical Datasets

Smith, Michael J.

Abstract

In every field that deep learning has infiltrated we have seen a reduction in the use of specialist knowledge, to be replaced with knowledge automatically derived from data. We have already seen this process play out in many ‘applied deep learning’ fields such as computer Go, protein folding, natural language processing, and computer vision. This thesis argues that astronomy is no different to these applied deep learning fields. To this end, this thesis’ introduction serves as a historical background on astronomy’s ‘three waves’ of increasingly automated connectionism: initial work on multilayer perceptrons within astronomy required manually selected emergent properties as input; the second wave coincided with the dissemination of convolutional neural networks and recurrent neural networks, models where the multilayer perceptron’s manually selected inputs are replaced with raw data ingestion; and in the current third wave we are seeing the removal of human supervision altogether with deep learning methods inferring labels and knowledge directly from the data. §2, §3, and §4 of this thesis explore these waves through application. In §2 I show that a convolutional/recurrent encoder/decoder network is capable of emulating a complicated semi-manual galaxy processing pipeline. I find that this ‘Pix2Prof’ neural network can satisfactorily carry out this task over 100x faster than the method it emulates. §3 and §4 explore the application of deep generative models to astronomical simulation. §3 uses a generative adversarial network to generate mock deep field surveys, and finds it capable of generating mock images that are statistically indistinguishable from the real thing. Likewise, §4 demonstrates that a diffusion model is capable of generating galaxy images that are both qualitatively and quantitatively indistinguishable from the training set.
The main benefit of these deep learning based simulations is that they do not rely on a possibly flawed (or incomplete) physical knowledge of their subjects and observation processes. Also, once trained, they are capable of rapidly generating a very large amount of mock data. §5 looks to the future and predicts that we will soon enter a fourth wave of astronomical connectionism. If astronomy follows in the footsteps of other applied deep learning fields we will see the removal of expertly crafted deep learning models, to be replaced with finetuned versions of an all-encompassing ‘foundation’ model. As part of this fourth wave I argue for a symbiosis between astronomy and connectionism. This symbiosis is predicated on astronomy’s relative data wealth, and contemporary deep learning’s enormous data appetite; many ultra-large datasets in machine learning are proprietary or of poor quality, and so astronomy as a whole could develop and provide a high quality multimodal public dataset. In turn, this dataset could be used to train an astronomical foundation model that can be used for state-of-the-art downstream tasks. Due to foundation models’ hunger for data and compute, a single astronomical research group could not bring about such a model alone. Therefore, I conclude that astronomy as a whole has a slim chance of keeping up with a research pace set by the Big Tech goliaths, unless we follow the examples of EleutherAI and HuggingFace and pool our resources in a grassroots open source fashion.