GGRaSP (Gaussian Genome Representative Selector with Prioritization) is an R package that reduces a set of genomes to a representative subset, prioritizing genomes of interest to the user while minimizing the loss of genetic variation. GGRaSP also supports unsupervised clustering: it models the genomic relationships with a Gaussian mixture model to select an appropriate cluster threshold, which makes it suitable for both generalizable high-throughput use and more dataset-specific use.
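As a rough illustration of the idea of choosing a cluster threshold from a Gaussian mixture model, the sketch below fits a two-component, one-dimensional mixture to a list of pairwise genome distances and takes the point where the two components are equally likely as the cut-off. It is written in plain Python rather than R and is not GGRaSP's implementation: the function names, the fixed choice of two components, the crossover rule, and the toy distances are all assumptions made for illustration.

```python
import math

def fit_two_gaussians(values, iterations=100):
    """Fit a 2-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    lo, hi = min(values), max(values)
    spread = (hi - lo) or 1.0
    means = [lo, hi]                         # start one component at each extreme
    variances = [(spread / 4.0) ** 2] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            p = [
                weights[k]
                * math.exp(-(x - means[k]) ** 2 / (2.0 * variances[k]))
                / math.sqrt(2.0 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-8)    # floor keeps the variance positive
    return weights, means, variances

def log_weighted_density(x, weight, mean, var):
    """Log of weight * Normal(x | mean, var); logs avoid floating-point underflow."""
    return (
        math.log(max(weight, 1e-300))
        - 0.5 * math.log(2.0 * math.pi * var)
        - (x - mean) ** 2 / (2.0 * var)
    )

def crossover_threshold(weights, means, variances, steps=1000):
    """Scan from the lower mean to the higher mean for the first crossover point."""
    low, high = (0, 1) if means[0] <= means[1] else (1, 0)
    for i in range(steps + 1):
        x = means[low] + (means[high] - means[low]) * i / steps
        if log_weighted_density(x, weights[high], means[high], variances[high]) >= \
           log_weighted_density(x, weights[low], means[low], variances[low]):
            return x
    return means[high]

# Hypothetical toy data: small within-cluster and large between-cluster distances.
low_distances = [0.008, 0.010, 0.012, 0.009, 0.011]
high_distances = [0.19, 0.21, 0.20, 0.22, 0.18]
weights, means, variances = fit_two_gaussians(low_distances + high_distances)
threshold = crossover_threshold(weights, means, variances)
print(round(threshold, 3))
```

On this toy input the cut-off falls between the two groups of distances, so genomes closer than the threshold would be treated as members of the same cluster.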

Key Features

  • Rapidly simplify large datasets containing up to several thousand genomes.
  • Optionally run without any a priori knowledge of the shape of the data.
  • Generate images, tables, and annotation files that enable detailed analysis of the phylogeny and GGRaSP clusters.

Sample Output

The capabilities of GGRaSP are demonstrated by generating a reduced list of 315 genomes from a genomic dataset of 4,600 Escherichia coli genomes, prioritizing selection by type strain and by genome completeness. The original 4,600-genome set (A) is clustered using a cut-off (B) and reduced to 315 representative genomes (C).
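The reduction described above (cluster at a cut-off, then keep one prioritized representative per cluster) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GGRaSP's algorithm or API: it uses single-linkage grouping via union-find, treats a lower rank number as higher priority, breaks ties by the smallest total distance to the other cluster members, and runs on a hypothetical five-genome dataset.

```python
from collections import defaultdict

def cluster_by_threshold(names, dist, threshold):
    """Group genomes whose pairwise distance is <= threshold (single linkage)."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[frozenset((a, b))] <= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for n in names:
        clusters[find(n)].append(n)
    return list(clusters.values())

def pick_representative(cluster, dist, rank):
    """Prefer the best-ranked genome; break ties by smallest total distance."""
    def total_distance(g):
        return sum(dist[frozenset((g, o))] for o in cluster if o != g)

    # Genomes without a rank sort last (assumed convention: lower rank wins).
    return min(cluster, key=lambda g: (rank.get(g, float("inf")), total_distance(g)))

# Hypothetical data: genomes A-C form one tight group, D-E another.
names = ["A", "B", "C", "D", "E"]
group = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
dist = {
    frozenset((a, b)): (0.01 if group[a] == group[b] else 0.2)
    for a in names for b in names if a != b
}
rank = {"B": 1}  # e.g. B is a type strain, so it is prioritized

clusters = cluster_by_threshold(names, dist, threshold=0.05)
representatives = [pick_representative(c, dist, rank) for c in clusters]
print(representatives)
```

Keeping the ranking separate from the clustering means the priority list (type strain, genome completeness, or any other user criterion) can change without recomputing the clusters.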


Bioinformatics (Oxford, England). 2018-09-01; 34.17: 3032-3034.
GGRaSP: a R-package for selecting representative genomes using Gaussian mixture models
Clarke TH, Brinkac LM, Sutton G, Fouts DE
PMID: 29668840


This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under Award Number U19AI110819.

Principal Investigator

Key Staff

  • Toby Clarke, MS

