John and Marcia Price College of Engineering

17 Exploring the Embedding Methods in Genomic Language Models

Anisa Habib; Hari Sundar; and LeAnn Lindsey

Faculty Mentor: Hari Sundar (School of Computing, University of Utah)

 

Language models have gained considerable popularity in recent years, owing to their capacity to be trained on unlabeled data and to extract meaningful insights from human language. Recent models exploit the language-like structure of DNA to extract similar insights from genomic data. However, these models are trained with different tokenization methods, on different types and amounts of data, and are fine-tuned on different tasks. In other words, each model is built on a different data representation, so the amount of information captured per token varies, as do the computing power and time required for processing. We aim to investigate different encoding schemes for genomic data, with the goal of capturing more information per token and improving genomic analyses.
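To make the idea of "information per token" concrete, the following is a minimal sketch that compares how many nucleotides each token covers under a few simple tokenization schemes. The schemes shown (single-nucleotide tokens and fixed-length k-mers with a configurable stride) are assumptions chosen for illustration; the abstract does not specify which encodings are studied. The function names and the example sequence are likewise hypothetical.

```python
# Illustrative sketch only: these schemes are assumptions, not the
# encodings evaluated by the authors.

def char_tokens(seq):
    """One token per nucleotide."""
    return list(seq)


def kmer_tokens(seq, k, stride):
    """Fixed-length k-mers taken every `stride` bases.

    stride == 1 gives overlapping k-mers; stride == k gives
    non-overlapping k-mers. Trailing bases that do not fill a
    complete k-mer are dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def bases_per_token(seq, tokens):
    """Average number of input bases represented by each token."""
    return len(seq) / len(tokens)


if __name__ == "__main__":
    seq = "ACGTACGTAC"  # hypothetical example sequence
    schemes = {
        "single nucleotide": char_tokens(seq),
        "overlapping 3-mers": kmer_tokens(seq, k=3, stride=1),
        "non-overlapping 3-mers": kmer_tokens(seq, k=3, stride=3),
    }
    for name, tokens in schemes.items():
        print(f"{name}: {len(tokens)} tokens, "
              f"{bases_per_token(seq, tokens):.2f} bases/token")
```

Under these assumptions, fewer tokens for the same sequence means each token carries more of the input, which shortens the model's input length at the cost of a larger vocabulary (up to 4^k distinct k-mers for a four-letter alphabet).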


License

RANGE: Journal of Undergraduate Research (2024) Copyright © 2024 by Anisa Habib; Hari Sundar; and LeAnn Lindsey is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
