Feature Generation

This tool generates a list of features for use in machine learning applications. The features are split into two major categories: Primary and Secondary. The Primary features are all sequence-based, while the Secondary features are based on structural attributes. It is recommended to generate the Primary features for the premature microRNA sequence, the mature sequence, and the seed region, and to generate the Secondary features for the premature sequence only.

Primary Features | # | Description | Reference
Single Nucleotide Frequency | 4 | The frequency of a single nucleotide (A, U, G, C) with respect to the length of the sequence. | Link
Dinucleotide Frequency | 16 | The frequency of nucleotide pairs (AA, AU, AG, AC, UA...) with respect to the length of the sequence. | Link
Trinucleotide Frequency | 64 | The frequency of nucleotide triplets (AAA, AAU, AAG, AAC, AUA...) with respect to the length of the sequence. | Link
Quadnucleotide Frequency | 256 | The frequency of nucleotide quadruplets (AAAA, AAAU, AAAG, AAAC, AAUA...) with respect to the length of the sequence. | Link
Two Nucleotide Frequency | 2 | The frequency of A plus the frequency of U, and the frequency of G plus the frequency of C. | Link
Pair Composition | 1 | Percentage of the sequence composed of pairs. |
Pairs Frequency | 3 | Frequency of AU, GC, and GU pairs with respect to all pairs. |
Pair Occurrences | 4 | Number of times AU, GC, and GU pairs occur, and the total number of pairs. |
Pair Exclusion | 4 | The number of times each nucleotide occurs outside of pairs. |
Number of Palindromes | 1 | The frequency of palindromes with a length greater than 3 occurring in the sequence. |
Length | 1 | The length of the sequence. | Link
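
The nucleotide frequency features above can be illustrated with a minimal Python sketch. It is not the tool's own code: the function names kmer_frequencies and two_nucleotide_frequency are hypothetical, and it assumes that k-mers are counted in overlapping windows and divided by the sequence length, as the descriptions in the table suggest.

```python
from itertools import product

ALPHABET = "AUGC"


def kmer_frequencies(seq, k):
    """Return {kmer: count / len(seq)} for every k-mer over A, U, G, C.

    Assumption: overlapping windows, normalized by the sequence length.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with other symbols (e.g. N) are skipped
            counts[window] += 1
    n = len(seq)
    return {kmer: (c / n if n else 0.0) for kmer, c in counts.items()}


def two_nucleotide_frequency(seq):
    """Return (freq(A) + freq(U), freq(G) + freq(C))."""
    single = kmer_frequencies(seq, 1)
    return single["A"] + single["U"], single["G"] + single["C"]


seq = "AUGCAUGC"
# Feature counts for k = 1..4 match the "#" column: 4, 16, 64, 256
print([len(kmer_frequencies(seq, k)) for k in (1, 2, 3, 4)])
print(kmer_frequencies(seq, 1)["A"])   # 2 of 8 positions -> 0.25
print(kmer_frequencies(seq, 2)["AU"])  # 2 occurrences / length 8 -> 0.25
print(two_nucleotide_frequency(seq))   # (0.5, 0.5)
```

Dividing by the sequence length rather than by the number of windows follows the wording of the table; dividing by the window count instead would make each group of frequencies sum to 1.
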
Secondary Features | # | Description | Reference
RNAfold Structure | 32 | The frequency of each nucleotide in combination with each triplet fold pattern. | Command: RNAfold -p < inputSequences
Minimum Free Energy | 3 | The minimum free energy, the normalized minimum free energy, and the frequency of the minimum free energy structure in the ensemble. | See RNAfold
Ensemble Free Energy | 2 | The ensemble free energy and the normalized ensemble free energy. | See RNAfold
Stem Statistics | 6 | Number of stems, average and maximum stem length, and occurrences of AU, GC, and GU pairs in stems. | See RNAfold
Minimum Free Energy Statistics | 6 | | See RNAfold
RNAshape Shapes | 5 | Probability of folding into 5 different shapes as provided by RNAshapes (maximum sequence length: 220). | Command: RNAshapes -t [1, 2, 3, 4, 5] < inputSequences
Stoat Statistics | 4 | Shannon entropy, Frobenius norm, mean stem length, and base pairing propensity. | Command: stoat -x 31 -v -i [inputFile]
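
As an illustration of how the 32 RNAfold Structure features might be derived, the sketch below combines each of the 4 nucleotides with each of the 8 paired/unpaired patterns of three adjacent positions (4 x 8 = 32). This is an assumption about the scheme, not the tool's confirmed implementation: the function name triplet_features is hypothetical, both bracket types are treated as "paired", and the counts are normalized by the number of triplets. The input structure is a dot-bracket string of the kind RNAfold prints.

```python
from itertools import product


def triplet_features(seq, structure):
    """Frequencies of (middle nucleotide, 3-position fold pattern) combinations.

    Assumption: '(' and ')' both count as paired, '.' as unpaired, and counts
    are divided by the number of triplets found.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    # Collapse both bracket types into a single "paired" symbol.
    state = "".join("(" if c in "()" else "." for c in structure)
    keys = [n + "".join(p) for n in "AUGC" for p in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    # Slide over every position that has a neighbor on both sides.
    for i in range(1, len(seq) - 1):
        key = seq[i] + state[i - 1:i + 2]
        if key in counts:  # skips nucleotides outside A, U, G, C
            counts[key] += 1
            total += 1
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


# Hand-written toy hairpin, not actual RNAfold output
features = triplet_features("GGGAAACCC", "(((...)))")
print(len(features))     # 32
print(features["A..."])  # one of seven triplets
```
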