John and Marcia Price College of Engineering
17 Exploring the Embedding Methods in Genomic Language Models
Anisa Habib; Hari Sundar; and LeAnn Lindsey
Faculty Mentor: Hari Sundar (School of Computing, University of Utah)
Language models have gained considerable popularity in recent years, owing to their capacity to be trained on unlabeled data and to extract meaningful insights from human language. Recent models exploit the language-like structure of DNA to gain valuable insights from genomic data. However, these models are trained with different tokenization methods, on different types and amounts of data, and are fine-tuned on different tasks. In other words, each model is built on a different data representation, so the amount of information captured per token varies, as do the computing power and time required for processing. We aim to investigate different encoding schemes for genomic data, with the goal of capturing more information per token and improving genomic analyses.
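To make the idea of "information per token" concrete, the following minimal sketch compares three common ways of splitting a DNA sequence into tokens: single nucleotides, overlapping k-mers, and non-overlapping k-mers. It is an illustration only, not the encoding used by any particular model discussed here. The function names, the example sequence, and the choice of k = 3 are all assumptions made for this example, and "bases per token" is used as a crude stand-in for how much sequence each token covers.

```python
# Illustrative sketch only: hypothetical helper names, toy sequence, k = 3 assumed.

def char_tokens(seq):
    """One token per nucleotide."""
    return list(seq)

def overlapping_kmers(seq, k):
    """Sliding window of width k with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def non_overlapping_kmers(seq, k):
    """Consecutive chunks of width k (the last chunk may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def bases_per_token(seq, tokens):
    """Sequence length divided by token count: a rough proxy for how much
    of the sequence each token in the model's input accounts for."""
    return len(seq) / len(tokens)

seq = "ATGCGTACGTTA"  # toy 12-base sequence, not real data

schemes = {
    "single nucleotide": char_tokens(seq),
    "overlapping 3-mers": overlapping_kmers(seq, 3),
    "non-overlapping 3-mers": non_overlapping_kmers(seq, 3),
}

for name, tokens in schemes.items():
    print(f"{name}: {len(tokens)} tokens, "
          f"{bases_per_token(seq, tokens):.2f} bases per token")
```

Under these assumptions, the same 12-base sequence becomes 12, 10, or 4 tokens depending on the scheme, which is one reason models built on different tokenizers can differ in input length and processing cost for identical data.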