Index


  1. Description of the algorithm
  2. Comparison options
  3. Input format
  4. Parameters
  5. Datasets
  6. Output

Description of the algorithm


The BEAGLE (BEar Alignment Global and Local) web server performs pairwise alignments of RNA secondary structures. The method exploits a new encoding for RNA secondary structure (BEAR) and a substitution matrix for RNA structural elements (MBR) (Mattei et al., 2014). The BEAR encoding makes it possible to include structural information within a string of characters, where each character stores the type and length of the secondary structure element to which the nucleotide belongs (Fig. 1).

Figure 1. The BEAR encoding. (A) The BEAR structural alphabet. Different sets of characters are associated with the different RNA basic structures (loop, internal loop, stem and bulge on the right side of a stem, and bulge on the left side, denoted here as L, I, S, BL and BR, respectively), with different characters used for basic structures of different length. (B) RNA hairpin with the constituent substructures (loop, stem, bulges and internal loop) highlighted in different colors. On the top right, the BEAR characters corresponding to each substructure, shown with the same colors. On the bottom right, the hairpin RNA sequence is shown associated with its dot-bracket and its BEAR secondary structure descriptions. (C) Conversion into BEAR of an RNA secondary structure. An RNA secondary structure, extracted from Rfam, is shown, containing four non-branching structures depicted in boxes. The resulting BEAR conversion of the non-branching and branching structures is shown in blue below the secondary structure. A ‘:’ character is assigned to the remaining nucleotides that do not belong to non-branching or branching structures (reported in black).

Transition rates between secondary structure elements were computed on a set of evolutionarily related BEAR-encoded RNAs (Fig. 2).

Figure 2. Graphical representation of the MBR. This figure shows a subset of rows/columns of the MBR matrix, using a color-coding to show substitution rate patterns: color scale represents log-odds scores from lower (blue) to higher (red). Rows and columns are elements of RNA secondary structure of different length and every cell stores the log-odds value for the substitution of one element with another element. The cells in the principal diagonal always have the highest value in the respective row and column. Substitutions between elements belonging to the same group (i.e. stems, loops and interior loops) display higher log-odds values than substitutions between elements belonging to different groups. The ‘. . . ’ notation indicates that some rows/columns were omitted from the graphical matrix representation.

The BEAR encoding uses an alphabet of 83 characters, so the size of the MBR is 83x83. The total number of possible pairs is 3486 (83*(83+1)/2), among which 221 have a score higher than 1, corresponding to pairs occurring more often than expected. The distribution of the scores in the matrix is shown in Fig. 3.
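The pair count quoted above follows from counting unordered pairs of alphabet characters, self-pairs included, which assumes the matrix is symmetric. A quick check (variable names are illustrative):

```python
# Unordered pairs (self-pairs included) over an alphabet of n characters:
# n * (n + 1) / 2. This presumes a symmetric substitution matrix.
alphabet_size = 83

matrix_cells = alphabet_size * alphabet_size              # full 83 x 83 matrix
unordered_pairs = alphabet_size * (alphabet_size + 1) // 2

print(matrix_cells)      # 6889
print(unordered_pairs)   # 3486
```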


The BEAGLE method implements a modified version of the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment, for the comparison of BEAR-encoded RNA secondary structures, using the MBR (or any other user-provided substitution matrix for BEAR characters) to guide the alignment.
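As a rough illustration of the global alignment step, the sketch below implements the textbook Needleman-Wunsch recurrence over two character strings with a pluggable substitution function. It is not the modified version used by BEAGLE: it uses a single linear gap penalty, and `toy_score` with its `TOY_SCORES` entries is a hypothetical stand-in for the MBR, not real matrix values.

```python
from typing import Callable, List, Tuple

# Hypothetical scores for a few character pairs; NOT taken from the MBR.
TOY_SCORES = {("d", "c"): 1.0, ("s", "v"): 1.0}


def toy_score(x: str, y: str) -> float:
    """Toy substitution function: identical characters score highest."""
    if x == y:
        return 2.0
    return TOY_SCORES.get((x, y), TOY_SCORES.get((y, x), -1.0))


def needleman_wunsch(a: str, b: str,
                     score: Callable[[str, str], float],
                     gap: float = -2.0) -> Tuple[float, str, str]:
    """Global alignment of a and b with a linear gap penalty."""
    n, m = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1])):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = needleman_wunsch("ddddssssdddd", "cccvvvvvccc", toy_score)
print(total)
print(top)
print(bottom)
```

A local (Smith-Waterman style) variant would additionally clamp each cell at zero and start the traceback from the highest-scoring cell.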


Comparison options

BEAGLE offers three kinds of comparisons:
  1. One set
     This option accepts one set of RNAs as input, and the server performs all the possible pairwise alignments among those RNAs. A maximum of 300 RNAs is accepted as input. The input sequences must be supplied using the textarea.

  2. Two sets
     This option accepts two sets of RNAs, namely the query set and the target set. The sequences in the query set are aligned to the sequences in the target set. For more information about the available comparison modes, see the Parameters section. The input sequences must be supplied using the two textareas.

  3. Search
     Using this option, an input RNA is compared to one of the five pre-compiled RNA datasets, namely human lncRNAs, mouse lncRNAs, human 3' UTR, mouse 3' UTR and structured Rfam. For more information about the available datasets, see the Datasets section.


Input format


All the comparison options require the input sequences to be supplied using the textarea on the home page.

The input sequences are accepted in FASTA format:
-The line containing the name and/or the description of the sequence starts with a ">";
-The words following the ">" are interpreted as the RNA id;
-The following line reports the RNA nucleotide sequence;
-The subsequent line characters are interpreted as secondary structure information (optional).

or

FASTB format:
-The line containing the name and/or the description of the sequence starts with a ">";
-The words following the ">" are interpreted as the RNA id;
-The following line reports the RNA nucleotide sequence;
-The subsequent line characters are interpreted as secondary structure information in the BEAR alphabet.

The IUPAC notation is accepted for nucleotides (case-insensitive).
The secondary structure must be supplied in dot-bracket notation; only the characters '(', '.' and ')' are accepted by the program.

Example of a well-formatted input file:
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
In this case the secondary structure for the sequence will be computed on the fly using RNAfold (Vienna package), with the minimum free energy prediction method.

or
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
(((((((..((((..........)))).(((((.......))))).....(((((.......)))))))))))).
or
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
(((((((..((((..........)))).(((((.......))))).....(((((.......)))))))))))).
GGGGGGG::ddddssssssssssdddd:eeeeepppppppeeeee:::::eeeeepppppppeeeeeGGGGGGG:
The input may contain many sequences, e.g.:
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
(((((((..((((..........)))).(((((.......))))).....(((((.......)))))))))))).
GGGGGGG::ddddssssssssssdddd:eeeeepppppppeeeee:::::eeeeepppppppeeeeeGGGGGGG:
>AP000063.1/5917959095
GCGGGGGUGCCCGAGCCUGGCCAAAGGGGUCGGGCUCAGGACCCGAUGGCGUAGGCCUGCGUGGGUUCAAAUCCCACCCCCCGCA
(((((((..(((.............))).(((((.......)))))..............(((((.......)))))))))))).
>AP000989.1/7327973354
GCGGCCGUCGUCUAGUCUGGAUUAGGACGCUGGCCUUCCAAGCCAGUAAUCCCGGGUUCAAAUCCCGGCGGCCGCA
(((((((..((((...........)))).(((((.......))))).....(((((.......)))))))))))).
>AE006696.1/291218
GCCGCCGUAGCUCAGCCCGGGAGAGCGCCCGGCUGAAGACCGGGUUGUCCGGGGUUCAAGUCCCCGCGGCGGCA
(((((((..((((.........)))).(((((.......))))).....(((((.......)))))))))))).
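The format rules above can be checked before submission with a small parser. The sketch below is an assumption about how such a check might look, not the server's own code: `Record`, `parse_records` and `is_balanced` are hypothetical names, any extra line made only of '(', '.' and ')' is treated as dot-bracket and any other extra line as BEAR, and it assumes that no sequence or structure line begins with ">".

```python
from dataclasses import dataclass
from typing import List, Optional

DOT_BRACKET = set("(.)")
IUPAC = set("ACGUTRYSWKMBDHVN")  # IUPAC nucleotide codes, upper case


@dataclass
class Record:
    rna_id: str
    sequence: str
    structure: Optional[str] = None  # dot-bracket line, if supplied
    bear: Optional[str] = None       # BEAR line, if supplied


def is_balanced(structure: str) -> bool:
    """True if every '(' has a matching ')' and none closes too early."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def build_record(lines: List[str]) -> Record:
    if len(lines) < 2:
        raise ValueError(f"record '{lines[0]}' has no sequence line")
    rec = Record(rna_id=lines[0], sequence=lines[1])
    if not set(rec.sequence.upper()) <= IUPAC:
        raise ValueError(f"non-IUPAC character in sequence of '{rec.rna_id}'")
    for extra in lines[2:]:
        if len(extra) != len(rec.sequence):
            raise ValueError(f"line length mismatch in '{rec.rna_id}'")
        if set(extra) <= DOT_BRACKET:
            rec.structure = extra
        else:
            rec.bear = extra
    if rec.structure is not None and not is_balanced(rec.structure):
        raise ValueError(f"unbalanced dot-bracket in '{rec.rna_id}'")
    return rec


def parse_records(text: str) -> List[Record]:
    """Split the input into records at each header line starting with '>'."""
    blocks: List[List[str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            blocks.append([line[1:].strip()])
        elif not blocks:
            raise ValueError("text found before the first '>' header")
        else:
            blocks[-1].append(line)
    return [build_record(block) for block in blocks]


example = ">hairpin_a\nGGGAAACCC\n>hairpin_b\nGGGAAACCC\n(((...)))\n"
for record in parse_records(example):
    print(record.rna_id, record.structure)
```

A record without a structure line comes back with `structure` set to `None`, which corresponds to the case where the server folds the sequence itself.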

Parameters


GAP INSERTION
Cost of starting a gap in the alignment.

GAP EXTENSION
Cost of extending an alignment gap.

SEQUENCE BONUS
Extra score for aligning two identical nucleotides.

GLOBAL/LOCAL
Allows the user to choose between global and local alignment.
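To make the roles of the gap and bonus parameters concrete, the sketch below scores an already gapped alignment. It is illustrative only: the function name `score_alignment`, the toy substitution function, and the convention that the first position of a gap is charged the insertion cost and each further position the extension cost are assumptions, not taken from the BEAGLE implementation.

```python
from typing import Callable


def score_alignment(bear_a: str, bear_b: str, seq_a: str, seq_b: str,
                    sub: Callable[[str, str], float],
                    gap_insertion: float, gap_extension: float,
                    sequence_bonus: float) -> float:
    """Score two aligned rows; '-' marks a gap. Costs are positive numbers."""
    total = 0.0
    gap_in_a = gap_in_b = False  # are we currently inside a gap in row a / b?
    for ca, cb, na, nb in zip(bear_a, bear_b, seq_a, seq_b):
        if ca == "-":
            total -= gap_extension if gap_in_a else gap_insertion
            gap_in_a, gap_in_b = True, False
        elif cb == "-":
            total -= gap_extension if gap_in_b else gap_insertion
            gap_in_a, gap_in_b = False, True
        else:
            total += sub(ca, cb)             # structural substitution score
            if na.upper() == nb.upper():     # identical nucleotides
                total += sequence_bonus
            gap_in_a = gap_in_b = False
    return total


def toy_sub(x: str, y: str) -> float:
    """Hypothetical substitution scores, not MBR values."""
    return 2.0 if x == y else -1.0


print(score_alignment("aa-a", "aaaa", "GC-A", "GCUU",
                      toy_sub, gap_insertion=5.0, gap_extension=1.0,
                      sequence_bonus=0.5))
```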

DATASETS (only for "Search" comparison option)
Allows the user to choose one of the pre-compiled datasets. Graphical output will not be available for this option.

COMPARISON METHOD (only for "Two sets" comparison option)
In the "All vs. All" mode, each RNA in the query set is aligned with each RNA in the target set, producing n x m alignments, where n is the cardinality of the query set and m is the cardinality of the target set.
In the "One to One" mode, the first RNA in the query set is aligned with the first RNA in the target set, the second RNA in the query set with the second RNA in the target set, and so on. The cardinalities of the two sets must be equal. In both cases a maximum of 10,000 alignments is allowed.
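The two comparison methods differ only in how query and target RNAs are paired. A minimal sketch of the pairing logic, with hypothetical names (`two_set_pairs`, the `mode` strings) and the 10,000-alignment limit applied before any pairs are built:

```python
from itertools import product
from typing import List, Sequence, Tuple

MAX_ALIGNMENTS = 10_000  # limit stated for the "Two sets" option


def two_set_pairs(query: Sequence[str], target: Sequence[str],
                  mode: str = "all_vs_all") -> List[Tuple[str, str]]:
    """Return the (query, target) pairs that would be aligned."""
    if mode == "all_vs_all":
        n_pairs = len(query) * len(target)      # n x m alignments
    elif mode == "one_to_one":
        if len(query) != len(target):
            raise ValueError("One to One requires sets of equal size")
        n_pairs = len(query)
    else:
        raise ValueError(f"unknown mode: {mode}")
    if n_pairs > MAX_ALIGNMENTS:
        raise ValueError(f"{n_pairs} alignments exceed the limit")
    if mode == "all_vs_all":
        return list(product(query, target))
    return list(zip(query, target))


print(two_set_pairs(["q1", "q2"], ["t1", "t2"], mode="all_vs_all"))
print(two_set_pairs(["q1", "q2"], ["t1", "t2"], mode="one_to_one"))
```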

Datasets

  1. Human lncRNAs
     This dataset consists of all the human lncRNAs shorter than 10,000 nucleotides, folded using the RNAfold program (minimum free energy method) from the Vienna package. The lncRNAs were retrieved from the GENCODE website, release 22 (GRCh38.p2).

  2. Mouse lncRNAs
     This dataset consists of all the mouse lncRNAs shorter than 10,000 nucleotides, folded using the RNAfold program (minimum free energy method) from the Vienna package. The lncRNAs were retrieved from the GENCODE website, release M4 (GRCm38.p3).

  3. Human 3' UTR
     This dataset consists of all the human 3' UTRs. The sequences, along with their secondary structures, were downloaded using the "Table Browser" tool from UCSC. Assembly: GRCh38/hg38; Track: UCSC Genes; Table: foldUTR3.

  4. Mouse 3' UTR
     This dataset consists of all the mouse 3' UTRs. The sequences, along with their secondary structures, were downloaded using the "Table Browser" tool from UCSC. Assembly: GRCm38/mm10; Track: UCSC Genes; Table: foldUTR3.

  5. Structured Rfam
     This dataset consists of all the RNAs from Rfam (v.11) belonging to a family annotated with a consensus secondary structure. We used RNAfold (minimum free energy method) to fold all the Rfam RNAs. To improve the prediction accuracy, the consensus secondary structure of the family each RNA belongs to was used as a structural constraint for the folding, as described previously (Mattei et al., 2014).



Output


The results page reports a table containing all the computed pairwise alignments. Each row of the table contains the two input RNA ids and alignment statistics such as the sequence and structural identity percentages and the structural similarity percentage. Two measures of the statistical significance of the alignments are also reported: p-value and z-score. Results can be sorted by any of these values by clicking the corresponding column header (one of the parameters inside the red box in the figure below).

By clicking on the "plus image" (blue boxes in the figure above), the color-coded sequence and structure alignment for that specific pair of RNAs appears, along with a color-coded graphical representation of the RNA secondary structures (see figure below).

The purpose of the colors is to help the user identify common ungapped substructures between the two RNAs. Different colors are associated with different substructures. These colors are not intended as conservation measures.

Description of the alignment scores:

The Str Id (structural identity percentage) is computed as the fraction of paired bases encoded with an identical BEAR character.
The Str Sim (structural similarity percentage) is computed as the fraction of paired bases encoded with two different BEAR characters belonging to the same RNA structural element (e.g. two different characters encoding stems of different lengths).
The p-value and z-score are computed using as background the distribution of the scores obtained by aligning unrelated RNA sequences (as detailed in the Supplementary Material of Mattei et al., submitted). We suggest using the z-score as the reference statistical measure and considering as significant all the alignments with a z-score higher than 3.
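For readers unfamiliar with the statistic, a z-score expresses how many standard deviations an alignment score lies above the mean of the background scores. The sketch below uses the standard definition; the background values are made up for illustration, and the choice of the population standard deviation (`pstdev`) is an assumption, since the exact procedure is described only in the cited Supplementary Material.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(score: float, background: Sequence[float]) -> float:
    """(score - mean) / standard deviation of the background scores."""
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mean(background)) / sigma


def is_significant(z: float, threshold: float = 3.0) -> bool:
    """Apply the suggested cut-off: z-score higher than 3."""
    return z > threshold


# Made-up background scores from alignments of unrelated RNAs.
background_scores = [10.0, 12.0, 9.0, 11.0, 13.0, 8.0, 10.5, 11.5]
z = z_score(25.0, background_scores)
print(round(z, 2), is_significant(z))
```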

By clicking on export results, it is possible to download all the pairwise alignments. The exported file is in a FASTA-like format, as follows:

-The first line starts with a ">" followed by the name of the first sequence, the name of the second and the alignment scores, divided by '|'
-Next line represents the aligned nucleotide sequence of the first RNA
-Next line represents the aligned secondary structure of the first RNA
-Next line represents the aligned BEAR characters of the first RNA
-Next line represents the aligned nucleotide sequence of the second RNA
-Next line represents the aligned secondary structure of the second RNA
-Next line represents the aligned BEAR characters of the second RNA

Example:

Input1:
>X06054
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAA
(((((((..((((..........)))).(((((.......)))))..
Input2:
>AP000063
GCGGGGGUGCCCGAGCCUGGCCAAAGGGGUCGGGCUCAGGACCCGAU
(((((((..(((.............))).(((((.......))))).
Output:
>X06054|AP000063|NW:83.56|SeqIdentity:44.71|StrIdentity:67.06|StrSimilarity:84.71|P-value:0.007|Z-score:2.80855
# the name of the first sequence, the name of the second, the alignment score, the sequence identity percentage, the structural identity percentage, the structural similarity percentage, P-value, Z-score, divided by '|'
GGGCCCGUCGUCUAGCCUGG-UUAGGACGCUGCCCUGACGCGGCAGAA #primary sequence first sequence
(((((((..((((.......-...)))).(((((.......))))).. #secondary structure first sequence
GGGGGGG::ddddsssssss-sssdddd:eeeeepppppppeeeee:: #BEAR encoding first sequence
GCGGGGGUGCCCGAGCCUGGCCAAAGGGGUCGGGCUCAGGACCCGAU  #primary sequence second sequence
(((((((..(((.............))).(((((.......))))).  #secondary structure second sequence
GGGGGGG::cccvvvvvvvvvvvvvccc:eeeeepppppppeeeee:  #BEAR encoding second sequence
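The header line of each exported record can be split back into its fields with a few lines of code. This is a sketch under the assumption that sequence names contain no '|' character and that every score field has the form `label:value`, as in the example above; `parse_export_header` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def parse_export_header(line: str) -> Tuple[str, str, Dict[str, float]]:
    """Split an exported header into the two RNA names and a score table."""
    if not line.startswith(">"):
        raise ValueError("header line must start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) < 2:
        raise ValueError("header must contain two sequence names")
    first_name, second_name = fields[0], fields[1]
    scores: Dict[str, float] = {}
    for field in fields[2:]:
        label, _, value = field.partition(":")
        scores[label] = float(value)
    return first_name, second_name, scores


header = (">X06054|AP000063|NW:83.56|SeqIdentity:44.71|StrIdentity:67.06"
          "|StrSimilarity:84.71|P-value:0.007|Z-score:2.80855")
first, second, scores = parse_export_header(header)
print(first, second)
print(scores["Z-score"] > 3)   # False: below the suggested cut-off
```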