Quanta Magazine

Science

The Poetry Fan Who Taught an LLM to Read and Write DNA | Quanta Magazine

Summary
According to Brian Hie, Evo is a genomic large language model (LLM) that has been trained on large volumes of DNA.

The model picks up patterns that humans can’t see in DNA.

It uses those patterns to predict how changes to DNA affect the function of its downstream products, RNA and proteins.

Hie became interested in using language models for biology during graduate school.

Evo was trained on a “novel” consisting of many genomes — the E. coli genome alone is 2 million to 4 million base pairs.

Its training data set was also important: the model was exposed to 2.7 million genomes from bacteria, archaea and viruses.

That breadth shows the model evolutionary alternatives for life, which amount to different ways of expressing the same idea.

Evo is trained only on genomes from the simplest organisms, prokaryotes.

Hie says the team wants to expand it to eukaryotes: organisms such as animals, plants and fungi whose cells have a nucleus.

The model generated a million tokens freely from scratch — essentially, an entire bacterial genome.