Available here is a tool that generates a list of features for use in machine learning applications. The features are split into two major categories: Primary and Secondary. The Primary features are all sequence-based, while the Secondary features are based on structural attributes. It is recommended to get the Primary features for the premature microRNA sequence, the mature sequence, and the seed region, and to get the Secondary features for only the premature sequence.
Incl. | Primary Features | # | Description | Reference |
---|---|---|---|---|
Single Nucleotide Frequency | 4 | The frequency of a single nucleotide (A, U, G, C) with respect to the length of the sequence. | Link | |
Dinucleotide Frequency | 16 | The frequency of nucleotide pairs (AA, AU, AG, AC, UA...) with respect to the length of the sequence. | Link Link | |
Trinucleotide Frequency | 64 | The frequency of nucleotide triplets (AAA, AAU, AAG, AAC, AUA...) with respect to the length of the sequence. | Link | |
Quadnucleotide Frequency | 256 | The frequency of nucleotide quadruplets (AAAA, AAAU, AAAG, AAAC, AAUA...) with respect to the length of the sequence. | Link | |
Two Nucleotide Frequency | 2 | The frequency of A plus the frequency of U, and the frequency of G plus the frequency of C. | Link Link | |
Pair Composition | 1 | Percentage of sequence composed of pairs. | ||
Pairs Frequency | 3 | Frequency of AU, GC, and GU pairs with respect to all pairs. | ||
Pair Occurrences | 4 | Number of times AU, GC, and GU pairs occur, and the total number of pairs. | ||
Pair Exclusion | 4 | The number of times each nucleotide occurs outside of pairs. | ||
Number of Palindromes | 1 | The frequency of palindromes with a length greater than 3 occurring in the sequence. | ||
Length | 1 | The length of the sequence. | Link | |

Incl. | Secondary Features | # | Description | Reference |
---|---|---|---|---|
RNAfold Structure | 32 | The frequency of each nucleotide to each triplet fold combination. Command: `RNAfold -p < inputSequences` | Link | |
Minimum Free Energy | 3 | The minimum free energy, normalized minimum free energy, and frequency of minimum fold energy structures of the sequence. | See RNAfold | |
Ensemble Free Energy | 2 | The ensemble free energy and the normalized ensemble free energy. | See RNAfold | |
Stem Statistics | 6 | Number of stems, average and maximum stem length, and occurrences of AU, GC, and GU pairs in stems. | See RNAfold | |
Minimum Free Energy Statistics | 6 | See RNAfold | ||
RNAshape Shapes | 5 | Probability of folding into 5 different shapes as provided by RNAshapes (maximum sequence length = 220). Command: `RNAshapes -t [1, 2, 3, 4, 5] < inputSequences` | Link | |
Stoat Statistics | 4 | Shannon Entropy, Frobenius Norm, Mean Stem Length, and Base Pairing Propensity. Command: `stoat -x 31 -v -i [inputFile]` | Link |
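
The nucleotide frequency features in the Primary table can be illustrated with a short Python sketch. This is not the tool's own implementation. It assumes overlapping windows, division by the full sequence length (following the "with respect to the length of the sequence" wording above), and conversion of T to U for DNA-style input. The function names `kmer_frequencies` and `two_nucleotide_frequency` are hypothetical.

```python
from itertools import product

ALPHABET = "AUGC"  # nucleotide order used in the table above


def kmer_frequencies(sequence, k):
    """Return {k-mer: count / len(sequence)} for all 4**k k-mers.

    Assumptions (not confirmed by the tool): windows overlap, counts are
    divided by the full sequence length, and T is read as U.
    """
    seq = sequence.upper().replace("T", "U")
    # Start every possible k-mer at zero so the feature vector has a fixed size.
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with other symbols (e.g. N) are skipped
            counts[kmer] += 1
    length = len(seq)
    return {kmer: (n / length if length else 0.0) for kmer, n in counts.items()}


def two_nucleotide_frequency(sequence):
    """Return (freq(A) + freq(U), freq(G) + freq(C))."""
    single = kmer_frequencies(sequence, 1)
    return single["A"] + single["U"], single["G"] + single["C"]
```

Calling `kmer_frequencies(seq, k)` with `k` set to 1, 2, 3, and 4 yields 4, 16, 64, and 256 values respectively, matching the feature counts listed for the single, di-, tri-, and quadnucleotide rows. Dividing by the number of windows (`len(seq) - k + 1`) instead of the sequence length is an equally plausible normalization if the tool's output differs.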