LIRBase: a comprehensive collection of long inverted repeats in eukaryotes

LIRBase houses a total of 6,619,473 long inverted repeats (LIR, longer than 800 nt) in 424 eukaryotic genomes and provides various functionalities to facilitate the annotation and functional studies of LIRs and their derived small RNAs.

Statistics

6,619,473 long inverted repeats

424 eukaryotic genomes

374 species

LIRs in 77 metazoa genomes

LIRs in 139 plant genomes

LIRs in 208 vertebrate genomes

Functionalities of LIRBase

Browse LIRBase by species/genomes

Search LIRBase by genomic locations

Search LIRBase by the identifiers of LIRs

Search LIRBase by sequence similarity using BLAST

Detect and annotate long inverted repeats in user-uploaded DNA sequences

Identify candidate LIRs encoding long hpRNAs by aligning sRNA sequencing data to LIRs

Differential expression analysis of LIRs or small RNAs between different biological samples/tissues

Identify protein-coding genes targeted by the small RNAs derived from a LIR

Predict and visualize the secondary structure of potential long hpRNA encoded by a LIR


Why LIR and LIRBase?

Long inverted repeat, long hpRNA and siRNA

An inverted repeat is a single-stranded nucleotide sequence followed downstream by its reverse complement. The intervening sequence between the initial sequence and its reverse complement can be of any length, including zero. When transcribed, a long inverted repeat can form a long hairpin RNA (hpRNA), which is much longer than typical animal or plant pre-miRNAs.
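
The definition above can be illustrated with a minimal Python sketch. It only recognizes perfect inverted repeats with known arm and loop lengths, whereas real LIRs usually contain mismatches and indels; the function names are illustrative and not part of LIRBase.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def is_perfect_inverted_repeat(seq: str, arm_len: int, loop_len: int) -> bool:
    """Check whether seq is: left arm + loop + reverse complement of the left arm."""
    if len(seq) != 2 * arm_len + loop_len:
        return False
    left_arm = seq[:arm_len]
    right_arm = seq[arm_len + loop_len:]
    return right_arm.upper() == reverse_complement(left_arm)


# A toy inverted repeat with a 7-nt arm and a 3-nt loop.
example = "GATTACA" + "TTT" + reverse_complement("GATTACA")
print(example)
print(is_perfect_inverted_repeat(example, arm_len=7, loop_len=3))
```

Setting loop_len to zero covers the case where the reverse complement follows the initial sequence directly.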

Henderson et al. first reported the biogenesis of small interfering RNAs (siRNAs) from a long inverted repeat in Arabidopsis thaliana (Henderson et al. 2006 Nature Genetics). This siRNA biogenesis pathway was soon verified in Drosophila (Czech et al. 2008 Nature). In 2008, Okamura et al. systematically characterized the genes and mechanisms underlying the biogenesis of 21-22-nucleotide siRNAs from long hpRNAs encoded by LIRs in Drosophila (Okamura et al. 2008 Nature). They found that Dicer-2, Hen1 and Argonaute 2 play vital roles in this siRNA biogenesis pathway. The pathway was soon further characterized in Arabidopsis (Dunoyer et al. 2010 EMBO J).

LIRs can act as functional genomic elements in eukaryotic genomes.
A typical long inverted repeat and the small RNAs originating from it, as analyzed with LIRBase, are shown in the following image.

siRNAs derived from long inverted repeats play important biological roles

In 2018, Lin et al. identified two long hpRNAs in Drosophila simulans that can be processed into 21-nt siRNAs (Tao et al. 2007a PLOS Biology; Tao et al. 2007b PLOS Biology; Lin et al. 2018 Developmental Cell). These siRNAs repress the expression of the Dox and MDox genes, which promote X chromosome transmission by suppressing Y-bearing sperm. As a result, the two long hpRNAs and their derived siRNAs are critical for maintaining a balanced sex ratio in the offspring of Drosophila simulans.

The biological functions of siRNAs derived from long inverted repeats in plants and animals were also reported in recent years.

In mouse, siRNAs derived from LIRs were reported to regulate gene expression in oocytes (Tam et al. 2008 Nature; Watanabe et al. 2008 Nature).

In Drosophila, another hpRNA and the derived siRNAs were reported to regulate testis gene expression and control male fertility (Wen et al. 2015 Molecular Cell).

In apple, a long hpRNA and the generated siRNAs contributed to the resistance of apple to leaf spot disease (Zhang et al. 2018 Plant Cell).

In soybean, a long hpRNA and the derived 22-nt siRNAs regulate the seed coat color of soybean (Tuteja et al. 2009 Plant Cell; Cho et al. 2013 PLOS ONE; Jia et al. 2020 Plant Cell).

In rice, we previously found that several LIRs were present in one parental genome of an elite hybrid but absent from the other parental genome (Yao et al. 2020 Computational and Structural Biotechnology Journal). As a result, siRNAs derived from these LIRs were expressed in only one of the two parents. The association between the LIRs and the siRNAs was further verified in an F2 population derived from a self-cross of the elite hybrid.

Comprehensive genome-wide identification of LIRs and long hpRNAs in eukaryotic genomes is urgently needed

In 2013, Axtell called for the comprehensive genome-wide identification and annotation of long inverted repeats and long hpRNAs (Axtell et al. 2013 Annual Review of Plant Biology). However, genome-wide identification and annotation of long inverted repeats have been conducted in only a few organisms. To date, no database or web server exists for the annotation and analysis of long inverted repeats and long hpRNAs.

Using Inverted Repeats Finder (IRF) (Warburton et al. 2004 Genome Research), we identified a total of 6,619,473 long inverted repeats in the whole genomes of 424 eukaryotes, including 297,317 LIRs in 77 metazoa genomes, 1,731,978 LIRs in 139 plant genomes and 4,590,178 LIRs in 208 vertebrate genomes. We required a minimum length of 400 nt for both arms of each long inverted repeat identified by IRF, in order to remove potential miniature inverted-repeat transposable elements (MITEs) and Alu elements from the IRF results.
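
The arm-length requirement described above can be sketched in Python as follows. The record layout (a named tuple of 1-based inclusive arm coordinates) is an assumption made for illustration; it is not the output format of IRF, and the function names are hypothetical.

```python
from typing import NamedTuple

MIN_ARM_LEN = 400  # minimum arm length (nt) stated in the text


class InvertedRepeat(NamedTuple):
    """Hypothetical record; coordinates are assumed 1-based and inclusive."""
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def arm_lengths(ir: InvertedRepeat) -> tuple:
    """Return the lengths of the left and right arms."""
    return (ir.left_end - ir.left_start + 1, ir.right_end - ir.right_start + 1)


def keep_long_inverted_repeats(records, min_arm_len=MIN_ARM_LEN):
    """Keep only records in which BOTH arms reach the minimum length."""
    return [ir for ir in records if min(arm_lengths(ir)) >= min_arm_len]


records = [
    InvertedRepeat(1, 400, 501, 900),    # both arms are 400 nt: kept
    InvertedRepeat(1, 399, 501, 900),    # left arm is 399 nt: removed
    InvertedRepeat(1000, 1800, 1801, 2150),  # right arm is 350 nt: removed
]
print(keep_long_inverted_repeats(records))
```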

Nomenclature of a long inverted repeat in LIRBase

Each long inverted repeat has a unique identifier in LIRBase determined by the species name and several features of the LIR including the chromosome ID, the start coordinate of the left arm, the end coordinate of the left arm, the start coordinate of the right arm, the end coordinate of the right arm.

Please note that the sequence of a LIR in LIRBase is composed of the left arm sequence, the loop sequence and the right arm sequence, as well as two 200-bp sequences flanking the LIR (the left flanking sequence and the right flanking sequence). The genomic coordinates of both arms of the LIR are reflected in the identifier of the LIR, while the flanking sequences are not denoted in the identifier.
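
As a rough illustration of how the stored sequence relates to the arm coordinates in the identifier, the following Python sketch splits a chromosome sequence into the five parts named above. It assumes 1-based inclusive coordinates and assumes that a flank is simply truncated at a chromosome end; neither detail is stated by LIRBase, and the function name is hypothetical.

```python
FLANK_LEN = 200  # length of each flanking sequence stated in the text (bp)


def lir_sequence_parts(chrom_seq, left_start, left_end, right_start, right_end,
                       flank_len=FLANK_LEN):
    """Split a chromosome into left flank, left arm, loop, right arm, right flank.

    Coordinates are assumed to be 1-based and inclusive.
    """
    return {
        "left_flank": chrom_seq[max(0, left_start - 1 - flank_len):left_start - 1],
        "left_arm": chrom_seq[left_start - 1:left_end],
        "loop": chrom_seq[left_end:right_start - 1],
        "right_arm": chrom_seq[right_start - 1:right_end],
        "right_flank": chrom_seq[right_end:right_end + flank_len],
    }


# Toy chromosome: 250 N, a 10-nt left arm, a 5-nt loop, a 10-nt right arm, 300 G.
chrom = "N" * 250 + "A" * 10 + "C" * 5 + "T" * 10 + "G" * 300
parts = lir_sequence_parts(chrom, 251, 260, 266, 275)
print({name: len(seq) for name, seq in parts.items()})
```

Concatenating the five parts in order gives the full sequence as described in the text, while only the four arm coordinates appear in the identifier.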




Tutorial of LIRBase

  LIRBase is a database with a comprehensive collection of long inverted repeats in 424 eukaryotic genomes.

  Using IRF (https://tandem.bu.edu/irf/irf.download.html), we identified a total of 6,619,473 long inverted repeats in the whole genomes of 424 eukaryotes, including 297,317 LIRs in 77 metazoa genomes, 1,731,978 LIRs in 139 plant genomes and 4,590,178 LIRs in 208 vertebrate genomes. LIRBase is deployed at https://venyao.xyz/lirbase/ for online use.

  The homepage of LIRBase displays the main functionalities of LIRBase (Figure 1).

  1. Browse long inverted repeats (LIR) identified in 424 eukaryotic genomes for the sequences, structures of LIRs, and the overlaps between LIRs and genes.
  2. Search LIRBase for long inverted repeats in a specific genome by genomic locations.
  3. Search LIRBase for long inverted repeats in a specific genome by the identifiers of long inverted repeats.
  4. Search LIRBase by sequence similarity using BLAST.
  5. Detect and annotate long inverted repeats in user-uploaded DNA sequences.
  6. Align small RNA sequencing data to long inverted repeats of a specific genome to detect the origination of small RNAs from long inverted repeats and quantify the expression level of long inverted repeats.
  7. Perform differential expression analysis of long inverted repeats or small RNAs between different biological samples/tissues.
  8. Identify protein-coding genes targeted by the small RNAs derived from a LIR through detecting the complementary matches between small RNAs and the cDNA sequence of protein-coding genes.
  9. Predict and visualize the secondary structure of potential hpRNA encoded by a LIR using RNAfold.
Figure 1. The homepage of LIRBase.

1. Browse LIRBase for long inverted repeats identified in 424 eukaryotic genomes

  The images and species names of the 77 metazoa genomes are listed in the Species panel of the Metazoa submenu under the Browse menu (Figure 2). The images and species names of the 139 plant genomes are listed in the Species panel of the Plant submenu under the Browse menu. The images and species names of the 208 vertebrate genomes are listed in the Species panel of the Vertebrate submenu under the Browse menu.

Figure 2. Species name and images of 77 metazoa genomes listed in the Species panel of the Metazoa submenu under the Browse menu.

  Clicking the image or the species name of any genome takes you to the LIRs of Metazoa panel of the Metazoa submenu under the Browse menu, which displays all the LIRs identified in the selected genome (Figure 3). The LIRs of Metazoa panel shows a brief summary of all the LIRs identified in the selected genome and a table giving the structure of each LIR. Three buttons are displayed below the table when any row of the table is clicked (Figure 3).

Figure 3. List of all the LIRs identified for a selected genome.

  The three buttons below the table can be clicked to display the sequence of the selected LIR, its structure, and the overlaps between the selected LIR and genes, respectively (Figure 4).

Figure 4. Detailed information of a selected LIR.

2. Search LIRBase by genomic locations

  LIRBase allows searching for LIRs identified in any of the 424 eukaryotic genomes by genomic locations (Figure 5).

Figure 5. The Search by genomic location submenu under the Search menu.

  The detailed steps to search LIRBase by genomic locations are shown in Figure 6. The search results are displayed as a data table (Figure 6). Each row of the data table represents a LIR in the search result. Three buttons are displayed below the table when any row of the table is selected. The detailed information of a LIR, including the sequence and structure of the selected LIR and the overlaps between the selected LIR and genes, can be viewed by clicking the three buttons (Figure 6).
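
  A search by genomic location can be thought of as an interval query. The Python sketch below returns every LIR that overlaps a query region; whether LIRBase reports overlapping LIRs or only LIRs fully contained in the region is not stated, so the overlap rule, the record fields and the identifiers used here are all assumptions for illustration.

```python
def overlaps(lir_start, lir_end, query_start, query_end):
    """True if two closed intervals on the same chromosome share at least one base."""
    return lir_start <= query_end and query_start <= lir_end


def search_by_location(lirs, chrom, query_start, query_end):
    """Return the LIR records on `chrom` that overlap the query region."""
    return [
        lir for lir in lirs
        if lir["chrom"] == chrom
        and overlaps(lir["start"], lir["end"], query_start, query_end)
    ]


# Hypothetical records; "id" values are placeholders, not real LIRBase identifiers.
lirs = [
    {"id": "lir_a", "chrom": "Chr1", "start": 1000, "end": 2500},
    {"id": "lir_b", "chrom": "Chr1", "start": 9000, "end": 10500},
    {"id": "lir_c", "chrom": "Chr2", "start": 1200, "end": 2600},
]
hits = search_by_location(lirs, "Chr1", 2000, 9000)
print([lir["id"] for lir in hits])
```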

Figure 6. Steps to search LIRBase by genomic location.

3. Search LIRBase by the identifiers of LIRs

  LIRBase allows searching for LIRs identified in any of the 424 eukaryotic genomes by the identifiers (IDs) of long inverted repeats (Figure 7).

Figure 7. The Search by LIR identifier submenu under the Search menu.

  The detailed steps to search LIRBase by the identifiers of LIRs are shown in Figure 8.

Figure 8. Steps to search LIRBase by LIR identifiers.

  After clicking the Search button in the Input panel shown in Figure 8, the results are displayed as a data table in the Output panel (Figure 9). The results can also be downloaded by clicking the download buttons at the top of the data table. Each row of the data table represents a LIR in the search result. Three buttons are displayed below the table when any row of the table is clicked. The detailed information of a LIR, including the sequence and structure of the selected LIR and the overlaps between the selected LIR and genes, can be viewed by clicking the three buttons (Figure 9).

Figure 9. The Output panel of the Search by LIR identifier submenu.

4. Search LIRBase using BLAST

  Users can also search LIRBase by sequence similarity using BLAST (Figure 10). A graphical interface was implemented in LIRBase for users to perform BLAST alignment through the NCBI BLAST+ program. BLASTN databases were constructed for all the LIRs identified in each of the 424 eukaryotic genomes. Users can choose to BLAST against one or more BLASTN databases. The detailed steps to perform BLAST in LIRBase are shown in Figure 10.
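
  For users who run BLAST+ locally against downloaded databases, the equivalent command-line call might look like the Python sketch below. It only builds and prints the blastn command; the database names, file names and the E-value cutoff are placeholders, and this is not necessarily how LIRBase invokes BLAST+ internally.

```python
import shlex


def build_blastn_command(query_fasta, databases, evalue=1e-5,
                         out_path="blast_result.tsv"):
    """Build a blastn command that searches one or more BLASTN databases.

    Several databases are passed to -db as one space-separated argument.
    Output format 6 is the tabular format of BLAST+.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", " ".join(databases),
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_path,
    ]


# Placeholder database names; use the paths of the databases you downloaded.
command = build_blastn_command("query.fasta", ["genome_A_LIR", "genome_B_LIR"])
print(shlex.join(command))
# To actually run it (requires blastn on the system PATH):
#   import subprocess; subprocess.run(command, check=True)
```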

Figure 10. Steps to BLAST in LIRBase.

  Once the BLAST alignment is finished, you are taken to the Output panel of the Blast menu, which displays the BLAST result in a data table (Figure 11). The complete BLAST results can be downloaded by clicking the download buttons at the top of the data table. Each row of the data table represents a BLAST hit. Clicking a row of this table displays the detailed information of the selected BLAST hit, including the alignment between the query sequence and the subject LIR sequence in the BLAST database. Three buttons are also displayed below the table when any row of the table is clicked. The structure and sequence of the LIR in this BLAST hit, and the overlaps between this LIR and genes in the corresponding genome, can be viewed by clicking the three buttons (Figure 11).

Figure 11. The Output panel of the Blast submenu.

5. Annotate long inverted repeats in user-uploaded DNA sequences

  The software IRF (https://tandem.bu.edu/irf/irf.download.html) was utilized to identify long inverted repeats in the 424 eukaryotic genomes collected in LIRBase. IRF can only be used from the command line. We implemented a graphical interface for users to annotate long inverted repeats in user-uploaded DNA sequences with IRF (Figure 12). The detailed steps to annotate LIRs in user-uploaded DNA sequences are shown in Figure 12. The input DNA sequences for IRF can be pasted into a text area or uploaded from a local text file. The input data must be DNA sequences in FASTA format. Each sequence should have a unique ID on a header line starting with “>”.
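
  Before uploading, the input can be checked against these requirements with a short Python sketch such as the one below. The function name is illustrative, the ID is taken to be the first word of the header line, and restricting the alphabet to A, C, G, T and N is an assumption; LIRBase may accept other characters.

```python
def parse_fasta(text):
    """Parse FASTA text into {sequence ID: sequence}, enforcing unique IDs."""
    records = {}
    current_id = None
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {line_number}: header without an ID")
            current_id = fields[0]
            if current_id in records:
                raise ValueError(f"line {line_number}: duplicated ID {current_id!r}")
            records[current_id] = []
        else:
            if current_id is None:
                raise ValueError(f"line {line_number}: sequence before any header")
            if set(line.upper()) - set("ACGTN"):
                raise ValueError(f"line {line_number}: non-DNA character found")
            records[current_id].append(line.upper())
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


fasta_text = ">seq1 first example\nACGTAC\nGGTT\n>seq2\nnnacgt\n"
print(parse_fasta(fasta_text))
```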

Figure 12. The Annotate menu of LIRBase to annotate LIRs in user-uploaded DNA sequences.

  The sequences and structures of LIRs identified by IRF can be downloaded as text files (Figure 12). The results of IRF are listed in a data table (Figure 12). Each row shows the structure of an identified long inverted repeat. Two buttons are displayed below the table when any row of the table is clicked. The detailed information of the selected LIR, including its sequence and structure, can be viewed by clicking the two buttons (Figure 12).

6. Identify candidate LIRs encoding long hpRNAs by aligning sRNA sequencing data to LIRs

  When transcribed, a long inverted repeat can form a long hairpin RNA (hpRNA), which is much longer than typical animal or plant pre-miRNAs. Henderson et al. (2006) first reported the biogenesis of small interfering RNAs (siRNAs) from a long inverted repeat in Arabidopsis thaliana. This siRNA biogenesis pathway was soon reported and verified in other animals and plants.

  To facilitate the annotation of small RNAs derived from LIRs, we implemented a functionality in LIRBase that aligns user-uploaded small RNA sequencing data to all the identified LIRs of a genome using Bowtie (Figure 13). The input data should be the read counts of small RNAs rather than the raw small RNA sequencing data, as shown in Figure 13. The small RNA read count data can be pasted into the text area provided or uploaded from a local text file.
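
  Read counts can be obtained by collapsing identical reads. The Python sketch below shows one possible way to do this for adapter-trimmed reads; the exact input layout expected by LIRBase is the one shown in its example data, so the tab-separated "sequence, count" layout printed here is only an assumption.

```python
from collections import Counter


def collapse_reads(reads):
    """Collapse identical reads into (sequence, read count) pairs.

    Pairs are sorted by decreasing count, then alphabetically by sequence.
    """
    counts = Counter(read.strip().upper() for read in reads if read.strip())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy adapter-trimmed reads; real data would be read from a sequencing file.
reads = ["TGACAGAAGAGAGTGAGCAC", "tgacagaagagagtgagcac", "TTTGGATTGAAGGGAGCTCTA", ""]
for sequence, count in collapse_reads(reads):
    print(f"{sequence}\t{count}")
```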

Figure 13. The Quantify submenu under the Expression menu of LIRBase to align small RNA sequencing data to a LIR database.

  After clicking the Align! button, the alignment is performed by Bowtie. The alignment results are displayed in the Output panel of the Quantify submenu (Figure 14). The detailed alignment result, the summary of the alignment and the sRNA read counts of aligned LIRs can be downloaded. In addition, the alignment summary for all aligned LIRs can be viewed as a data table. Clicking a single row of the data table plots the size distribution of sRNAs and the alignment of sRNAs to the LIR. The detailed information of the chosen LIR is displayed by clicking the four buttons below the table of sRNA alignment summary.

Figure 14. The Output panel of the Quantify submenu of LIRBase.

  For each LIR in the data table of sRNA alignment summary, the following information is displayed in separate columns.

  1. The number of sRNAs aligned to the LIR.
  2. The number of sRNA sequencing reads aligned to the LIR.
  3. The percentage of 21-nt and 22-nt sRNAs among all sRNAs aligned to the LIR.
  4. The percentage of 24-nt sRNAs among all sRNAs aligned to the LIR.
  5. The percentage of sRNAs aligned to the arms of the LIR among all sRNAs aligned to the LIR.
  6. The percentage of sRNAs aligned to the loop of the LIR among all sRNAs aligned to the LIR.
  7. The percentage of sRNAs aligned to the flanking sequences of the LIR among all sRNAs aligned to the LIR.

  At the top of this table, we can set the values of different columns to identify candidate LIRs encoding long hpRNAs. For example, we can identify LIRs encoding candidate long hpRNAs in the genome of Minghui 63 with the following criteria: (1) a minimum of 90 sRNAs aligned to the LIR, (2) a minimum of 80% of sRNAs aligned to the arms of the LIR, and (3) a minimum of 50% of sRNAs that are 21 or 22 nt long (Figure 15).
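
  The summary columns and the example criteria above can be expressed as a small Python sketch. The input layout (one tuple of length, region and read count per distinct sRNA) and the function names are assumptions for illustration and do not reproduce the LIRBase implementation; percentages are computed over distinct sRNAs, as the column descriptions suggest.

```python
def summarize_lir(srnas):
    """Summarize the distinct sRNAs aligned to one LIR.

    srnas: list of (length_nt, region, read_count) tuples, where region is
    "arm", "loop" or "flank".
    """
    n_srnas = len(srnas)
    n_reads = sum(count for _, _, count in srnas)

    def percent(predicate):
        if n_srnas == 0:
            return 0.0
        return 100.0 * sum(1 for srna in srnas if predicate(srna)) / n_srnas

    return {
        "n_srnas": n_srnas,
        "n_reads": n_reads,
        "pct_21_22": percent(lambda s: s[0] in (21, 22)),
        "pct_24": percent(lambda s: s[0] == 24),
        "pct_arm": percent(lambda s: s[1] == "arm"),
        "pct_loop": percent(lambda s: s[1] == "loop"),
        "pct_flank": percent(lambda s: s[1] == "flank"),
    }


def is_candidate(summary, min_srnas=90, min_arm_pct=80.0, min_21_22_pct=50.0):
    """Apply the three example criteria given in the text."""
    return (summary["n_srnas"] >= min_srnas
            and summary["pct_arm"] >= min_arm_pct
            and summary["pct_21_22"] >= min_21_22_pct)


# Toy LIR with 100 distinct sRNAs.
toy = [(21, "arm", 2)] * 60 + [(24, "arm", 1)] * 25 + [(22, "loop", 1)] * 15
summary = summarize_lir(toy)
print(summary, is_candidate(summary))
```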

Figure 15. Set the values of different columns of the table of sRNA alignment summary to identify LIRs encoding candidate long hpRNAs.

7. Differential expression analysis of long inverted repeats and small RNAs

  By aligning small RNA sequencing data to LIRBase, we can obtain the small RNA read count for each LIR in a genome. With multiple biological samples/tissues, we can perform differential expression analysis of long inverted repeats between different biological samples/tissues (Figure 16). The R package DESeq2 (http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) was utilized to perform differential expression analysis. A read count matrix and a sample information table are required as input data for the differential expression analysis. The sample names in the count matrix and the sample names in the information table must be in the same order. Check the example data provided by LIRBase for the format of a sample information table.
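
  The ordering requirement for sample names can be verified before submitting the data. The Python sketch below compares the column names of the count matrix with the sample names of the information table; the function and the sample names are illustrative only.

```python
def check_sample_order(count_matrix_samples, sample_table_samples):
    """Raise ValueError unless both name lists are identical and in the same order."""
    counts = list(count_matrix_samples)
    table = list(sample_table_samples)
    if counts == table:
        return
    if sorted(counts) == sorted(table):
        raise ValueError(
            f"same samples but different order: {counts} versus {table}"
        )
    raise ValueError(f"sample names differ: {counts} versus {table}")


# Column names of the count matrix and row names of the sample table (toy values).
count_columns = ["leaf_rep1", "leaf_rep2", "root_rep1", "root_rep2"]
sample_rows = ["leaf_rep1", "leaf_rep2", "root_rep1", "root_rep2"]
check_sample_order(count_columns, sample_rows)
print("sample names are consistent")
```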

Figure 16. The DESeq submenu under the Expression menu of LIRBase to perform differential expression analysis of LIRs/sRNAs.

  The results of DESeq2 can be downloaded as a plain text file or can be viewed in a data table (Figure 16). In addition, the MA-plot and the volcano plot showing the identified differentially expressed LIRs/sRNAs are also generated. A heatmap displaying the sample-to-sample distances can be viewed by clicking the button at the bottom of the sidebar panel under the DESeq submenu.

8. Predict mRNA targets of small RNAs encoded by a LIR

  An analysis module was implemented to predict the mRNA targets of small RNAs encoded by a LIR by detecting complementary matches between the small RNAs and the cDNA sequences of protein-coding genes. The input data should be all the small RNAs encoded by a LIR, either in FASTA format or as sequences only (Figure 17). The small RNA sequences are then aligned to the cDNA sequences of a specific genome by Bowtie, and the alignments are processed to identify complementary matches between the small RNAs and the cDNA sequences. An example output is shown in Figure 17.
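
  The idea of a complementary match can be illustrated with the Python sketch below, which looks for sites in a cDNA that are perfectly complementary to a small RNA. It uses plain string search instead of Bowtie and allows no mismatches, so it is a simplified stand-in for the analysis performed by LIRBase; the function names are illustrative.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_complementary_sites(srna, cdna):
    """Return 1-based start positions in `cdna` that pair perfectly with `srna`.

    A small RNA pairs with the site whose sequence is the reverse complement
    of the small RNA, so that site is what we search for in the cDNA.
    """
    site = reverse_complement(srna.upper().replace("U", "T"))
    cdna = cdna.upper()
    positions = []
    start = cdna.find(site)
    while start != -1:
        positions.append(start + 1)
        start = cdna.find(site, start + 1)
    return positions


print(find_complementary_sites("ACGUA", "GGTACGTCC"))
```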

Figure 17. The Target menu of LIRBase.

9. Predict and visualize the secondary structure of the potential hpRNA encoded by a LIR

  We utilized the RNAfold software to predict and visualize the secondary structure of the potential hpRNA encoded by a LIR (Figure 18). The input data should be the DNA sequence of a single LIR. The secondary structure in dot-bracket notation is displayed in the main panel. The secondary structure can be viewed as a PNG image by clicking the button at the top of the main panel. A high-resolution image of the predicted secondary structure can be downloaded as a PDF file.
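
  In dot-bracket notation, each matching pair of parentheses marks two paired bases and each dot marks an unpaired base. The Python sketch below converts such a string into a list of base pairs; it handles only the plain "(", ")" and "." symbols, and the function name is illustrative.

```python
def base_pairs(dot_bracket):
    """Return the sorted list of (i, j) base pairs (1-based) in a dot-bracket string."""
    open_positions = []
    pairs = []
    for position, symbol in enumerate(dot_bracket, start=1):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# A toy hairpin: a 3-bp stem closing a 3-nt loop.
print(base_pairs("(((...)))"))
```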

Figure 18. The Visualize menu of LIRBase.

10. Information of 424 genomes collected in LIRBase

  The information of 424 genomes collected in LIRBase is displayed in the Download menu of LIRBase (Figure 19).

Figure 19. The Download menu of LIRBase.

11. Download LIRs identified in 424 eukaryotic genomes, and the corresponding BLAST/Bowtie index database

  In addition to being used online at https://venyao.xyz/lirbase/, LIRBase can be deployed on a personal local or web Linux server. Deployment of LIRBase is platform independent, i.e., LIRBase can be deployed on any platform with the R environment available. The detailed steps are described in the Installation submenu under the Help menu of LIRBase (Figure 20).

Figure 20. The Installation submenu under the Help menu of LIRBase.

The source code of LIRBase is deposited in GitHub (https://github.com/venyao/LIRBase). Because the files of identified LIRs and the corresponding BLAST/Bowtie databases of the 424 eukaryotic genomes are too large, these datasets were not deposited in GitHub. Instead, they can be downloaded from https://venyao.xyz/lirbase/ through the Download menu (Figure 21).

Figure 21. The Download menu of LIRBase.

12. About LIR and LIRBase

  The definition of a long inverted repeat, the biogenesis pathway of siRNAs from long inverted repeats and the biological roles of siRNAs generated in this pathway are elaborated in the About submenu under the Help menu of LIRBase (Figure 22). These findings imply that a platform for comprehensive annotation and analysis of siRNAs derived from long inverted repeats is urgently needed.

Figure 22. The About submenu under the Help menu of LIRBase.


LIRBase

A total of 424 eukaryote genomes were collected and the long inverted repeats (LIR, longer than 800 nt) in these genomes were systematically identified. The following functionalities are implemented in LIRBase.

  1. Browse LIRs identified in 424 eukaryotic genomes for the sequences, structures of LIRs, and the overlaps between LIRs and genes.
  2. Search LIRBase for LIRs in a specific genome by genomic locations.
  3. Search LIRBase for LIRs in a specific genome by the identifiers of LIRs.
  4. Search LIRBase by sequence similarity using BLAST.
  5. Detect and annotate LIRs in user-uploaded DNA sequences.
  6. Align small RNA sequencing data to LIRs of a specific genome to detect the origination of small RNAs from LIRs and quantify the expression level of small RNAs and LIRs.
  7. Perform differential expression analysis of LIRs or small RNAs between different biological samples/tissues.
  8. Identify protein-coding genes targeted by the small RNAs derived from a LIR through detecting the complementary matches between small RNAs and the cDNA sequence of protein-coding genes.
  9. Predict and visualize the secondary structure of potential hpRNA encoded by a LIR using RNAfold.

Use LIRBase online

LIRBase is deployed at venyao.xyz/lirbase/ for online use.


Deploy LIRBase on local or web Linux server

Step 1: Install R

Please check CRAN (cran.r-project.org) for the installation of R.

Step 2: Install the R Shiny package and other packages required by LIRBase

Start an R session and run these lines in R:

# try an http CRAN mirror if https CRAN mirror doesn't work  
install.packages("data.table")
install.packages("DT")
install.packages("ggplot2")
# grid ships with base R and does not need to be installed from CRAN
install.packages("gridExtra")
install.packages("htmlwidgets")
install.packages("pheatmap")
install.packages("RColorBrewer")
install.packages("shiny")
install.packages("shinyBS")
install.packages("shinycssloaders")
install.packages("shinydashboard")
install.packages("shinydisconnect")
install.packages("shinyjqui")
install.packages("shinyWidgets")
install.packages("stringr")
install.packages("tidyr")
install.packages("dplyr")
install.packages("XML")

install.packages("BiocManager")
BiocManager::install("apeglm")
BiocManager::install("Biostrings")
BiocManager::install("DESeq2")
BiocManager::install("GenomicRanges")

# install shinysky
install.packages("devtools")
devtools::install_github("venyao/ShinySky", force=TRUE)

For more information, please check the following pages:
cran.r-project.org/web/packages/shiny/index.html
github.com/rstudio/shiny
shiny.rstudio.com

Step 3: Install Shiny-Server

Please check the following pages for the installation of shiny-server.
rstudio.com/products/shiny/download-server/
github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source

Step 4: Install BLAST+

Download and install BLAST+ on your system PATH. Check opensource.com/article/17/6/set-path-linux for the setting of system PATH in Linux.
Please check blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download for the download and installation of BLAST+.

Step 5: Install Bowtie

Download and install Bowtie on your system PATH. Check opensource.com/article/17/6/set-path-linux for the setting of system PATH in Linux.
Please check bowtie-bio.sourceforge.net/index.shtml and github.com/BenLangmead/bowtie for the download and installation of Bowtie.

Step 6: Upload source files of LIRBase

Put the directory containing the code and data of LIRBase in /srv/shiny-server.
The BLASTN database files downloaded from the Data menu of LIRBase should be placed in the LIRBase_blastdb directory under the www directory of LIRBase.
The Bowtie index files downloaded from the Data menu of LIRBase should be placed in the LIRBase_bowtiedb directory under the www directory of LIRBase.
The Inverted_repeat_structure files downloaded from the Data menu of LIRBase should be placed in the Table directory under the www directory of LIRBase.
The Inverted_repeat_sequence files downloaded from the Data menu of LIRBase should be placed in the Fasta directory under the www directory of LIRBase.
The IRF_stem_alignment files downloaded from the Data menu of LIRBase should be placed in the HTML directory under the www directory of LIRBase.
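
The directory layout described in this step can be checked with a short script before the server is started. The Python sketch below is only a convenience and not part of LIRBase; it assumes the application directory is /srv/shiny-server/LIRBase, as used in the configuration of Step 7, and the function name is illustrative.

```python
from pathlib import Path

# Data directories expected under the www directory, as listed in Step 6.
REQUIRED_DATA_DIRS = ["LIRBase_blastdb", "LIRBase_bowtiedb", "Table", "Fasta", "HTML"]


def missing_data_dirs(app_dir):
    """Return the names of required data directories missing under <app_dir>/www."""
    www_dir = Path(app_dir) / "www"
    return [name for name in REQUIRED_DATA_DIRS if not (www_dir / name).is_dir()]


missing = missing_data_dirs("/srv/shiny-server/LIRBase")
if missing:
    print("Missing directories under www:", ", ".join(missing))
else:
    print("All data directories are in place.")
```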

Step 7: Configure shiny server (/etc/shiny-server/shiny-server.conf)

# Define the user to spawn R Shiny processes
run_as shiny;

# Define a top-level server which will listen on a port
server {  
  # Use port 3838  
  listen 3838;  
  # Define the location available at the base URL  
  location /lirbase {  
    # Directory containing the code and data of LIRBase  
    app_dir /srv/shiny-server/LIRBase;  
    # Directory to store the log files  
    log_dir /var/log/shiny-server;  
  }  
}  

Step 8: Change the owner of the LIRBase directory

$ chown -R shiny /srv/shiny-server/LIRBase  

Step 9: Start Shiny-Server

$ start shiny-server  
# On systemd-based distributions, use instead: sudo systemctl start shiny-server

Now, the LIRBase app is available at http://IPAddressOfTheServer:3838/lirbase/, matching the location defined in Step 7 (remember to replace IPAddressOfTheServer with the actual IP address of your Linux server).

Wen Yao, PhD, Professor

College of Life Sciences
Henan Agricultural University
Zhengzhou 450002, China

yaowen (AT) henau.edu.cn
venyao (AT) qq.com

Agricultural Road No. 63 (450002), Zhengzhou, Henan, China


Zhang Zhang, PhD, Professor

Associate Director of National Genomics Data Center (NGDC)
China National Center for Bioinformation (CNCB)
Beijing Institute of Genomics (BIG)
Chinese Academy of Sciences (CAS)

zhangzhang (AT) big.ac.cn

No.1 Beichen West Road, Chaoyang District, Beijing 100101, China