Examples/Data

Influenza Hemagglutinin proteins

Following originally released on Figshare.com as separate files and used in publication Roca et al F1000Research (2013)

multiple sequence alignment
meta data

Bacterial RecA homologs

A ProfileGrid of 300 bacterial RecA homologs is below. The multiple sequence alignment (MSA) is an update of an earlier, smaller alignment (Roca & Cox 1997). The ProfileGrid is broken into four tiers of approximately 100 residues each. At the top of each tier is the template sequence which in this case is the E. coli RecA protein. Its length is 352 residues; and, the template sequence determines the position numbering of the ProfileGrid ruler also at the top of each tier.

The 21 rows under the template sequence represent the frequency of the amino acid and gap characters at the corresponding column position in the MSA. To facilitate inspection at this size, the frequency values are not shown in each ProfileGrid cell. However, each cell is colored according to its value within the following mutually exclusive bins: <10% (white), >=10% (gray), >=25% (yellow), >=50% (orange), >=70% (green), and >=90% (red). This is the default Frequency Colors Legend but the number, size, and color of bins are user-defined.

ProfileGrids have other useful features. First, in this example the 20 amino acid code characters are in alphabetical sort order and the gap character is the last row. Other residue chemical values (such as volume) can be used to sort the rows to allow a search for structural patterns. Second, the location of conserved regions (called similarity boxes) is determined by similarity plot calculations. Here, parameters used were a window size of 9 and the BLOSUM62 scoring matrix. A threshold value of 80% similarity marks the box endpoints. Third, the JProfileGrid graphical user interface allows one to identify the sequences in each ProfileGrid cell. Finally, JProfileGrid exports a spreadsheet file for final formatting to produce figures. For more details, see the JProfileGrid documentation.

In summary, a ProfileGrid concisely depicts all of the character information from a MSA. Other representations such as consensus sequences, Sequence Logos, and similarity plots only summarize alignment content.

ProfileGrid of bacterial RecA protein family (2007)

B. subtilis RecA protein specific features

This example demonstrates how to use ProfileGrids to detect and to represent unique features of one sequence with respect to a multiple sequence alignment (MSA). Such a feature may indicate specialization with respect to function or activity. The ProfileGrid below has the B. subtilis RecA homolog selected as the highlight sequence. The entire Bsubt RecA sequence is listed immediately below the template sequence (here E. coli RecA). Furthermore, a pairwise comparison is made such that the corresponding residue is boxed (cell with blue borders) if the highlight sequence is different from the template sequence. For clarity, the ProfileGrid frequency colors are not shown in this region of the first 100 residues of the MSA.

Comparison between *E. coli* versus *B. subtilis* RecA proteins. Blue boxes show differences.

Next, to determine which residues are unique to the Bsubt RecA homolog, the frequency colors are displayed in the ProfileGrid below (same color scheme as first example above). Any position where the highlight cell has a white frequency color means that fewer than 10% of the alignment sequences share the same residue as the B. subtilis RecA homolog.

Comparison between *E. coli* versus *B. subtilis* RecA proteins. Color shading of filled boxes show corresponding residue frequency.

Finally, a closer inspection below of the region around residue 87 reveals a potentially unique feature of the B. subtilis RecA homolog. Note that the amino acid frequency values are shown as integers; and, the rows are sorted according to amino acid volume with the largest residue (tryptophan) listed at top. The ProfileGrid clearly shows that the characters gln-85, gap-87, arg-88, and ser-90 are rarely found between the highly conserved positions 84 and 91. Arg-88 is significantly larger than the more frequently observed glycine. Cool image, huh?

Comparison between *E. coli* versus *B. subtilis* RecA proteins. Zoomed view of positions 78 to 98.

References

A.I.Roca, A.C.Abajian, D.J. Vigerust. ProfileGrids solve the large alignment visualization problem: influenza hemeagglutinin example.
F1000Research 2: 2 (2013)
A.I.Roca, A.E.Almada, and A.C.Abajian, ProfileGrids as a new visual representation of large multiple sequence alignments: a case study of the RecA protein family,
BMC Bioinformatics 9: 554 (2008)
Roca and Cox, RecA protein: structure, function, and role in recombinational DNA repair, Progress in Nucleic Acid Research and Molecular Biology 56: 129-223 (1997)

Last updated: 23-Dec-2019