GENOME LAB

Welcome to the Galaxy Genome Lab. Get quick access to tools, workflows and tutorials for genome assembly and annotation.

Data import and preparation

If you are new to Galaxy, uploading your data is a good place to start!
Check out the Tools and Workflows tabs for different approaches to uploading data.

You can upload your data to Galaxy using the Upload tool from anywhere in Galaxy. Just look for the "Upload data" button at the top of the tool panel.

We recommend subsampling large data sets to test tools and workflows. A useful tool for this is seqtk_seq: set its "Sample fraction of sequences" parameter to the fraction of reads you want to keep.
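To make the idea of sampling a fraction of sequences concrete, here is a minimal Python sketch. It is not how seqtk is implemented; the function names, the seed value and the keep-each-read-independently strategy are all assumptions chosen for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, ...]  # (header, sequence, separator, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records; loads all lines in memory (fine for a sketch)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 non-empty lines")
    for i in range(0, len(stripped), 4):
        yield tuple(stripped[i:i + 4])


def subsample(records: Iterable[FastqRecord], fraction: float,
              seed: int = 42) -> List[FastqRecord]:
    """Keep each record independently with probability `fraction`.

    The seed is arbitrary (hypothetical default); fixing it makes the
    subsample reproducible between runs.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]
```

With a fraction of 0.1, roughly one read in ten is kept, which is usually enough to check that a tool or workflow runs before committing the full data set.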

BioPlatforms Australia allows data downloads via URL. Once you have generated one of these URLs in the BPA portal, you can import it into Galaxy using the "Fetch data" feature of the Upload tool.

No, do not upload personal or sensitive data, such as human health or clinical data. Please see our Data Privacy page for definitions of sensitive and health-related information.

Please also make sure you have read our Terms of Service, which covers hosting and analysis of research data.

Please read our Privacy Policy for information on your personal data and any data that you upload.

Please submit a quota request if your Galaxy Australia account reaches its data storage limit. Requests are usually provisioned quickly if you provide a reasonable use case for your request.

Quality control and data cleaning are an essential first step in any NGS analysis. This tutorial will show you how to use and interpret results from FastQC, NanoPlot and PycoQC.

This practical aims to familiarize you with the Galaxy user interface. It will teach you how to perform basic tasks such as importing data, running tools, working with histories, creating workflows, and sharing your work.

Any user of Galaxy Australia can request support through an online form.

Common tools are listed here, or search for more in the full tool panel to the left.

Standard upload of data to Galaxy, from your computer or from the web.

Before using your sequencing data, it's important to ensure that the data quality is sufficient for your analysis.

Input data:

fastq
bam
sam

Faster to run than FastQC. This tool can also trim reads and filter by quality.

Input data:

fastq

A plotting suite for Oxford Nanopore sequencing data and alignments.

Input data:

fastq
fasta
vcf_bgzip

A set of metrics and graphs to visualize genome size and complexity prior to assembly.

Input data:

tabular Output from Meryl or Jellyfish histo

Prepare kmer count histogram for input to GenomeScope.

Input data:

fastq
fasta
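As a rough illustration of what a kmer count histogram contains, the Python sketch below counts every kmer in a set of sequences and then tallies how many distinct kmers occur once, twice, and so on. It is a toy version only: the function names are hypothetical, reverse complements are not merged, and the two-column (multiplicity, number of distinct kmers) layout is an assumption about the histogram format rather than a specification of it.

```python
from collections import Counter
from typing import Dict, Iterable, List


def kmer_histogram(sequences: Iterable[str], k: int) -> Dict[int, int]:
    """Map each multiplicity to the number of distinct kmers seen that many times."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    counts: Counter = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if "N" not in kmer:  # skip kmers that span ambiguous bases
                counts[kmer] += 1
    histogram = Counter(counts.values())
    return dict(sorted(histogram.items()))


def to_histo_lines(histogram: Dict[int, int]) -> List[str]:
    """Format the histogram as tab-separated 'multiplicity<TAB>count' lines."""
    return [f"{multiplicity}\t{n_kmers}" for multiplicity, n_kmers in histogram.items()]
```

For real data sets, dedicated counters are far more memory-efficient than a Python dictionary, so this is only useful for understanding the shape of the output.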

A workflow is a series of Galaxy tools that have been linked together to perform a specific analysis. You can use and customize the example workflows below. Learn more.

Report statistics from sequencing reads.

Tools: nanoplot fastqc multiqc

Estimates genome size and heterozygosity based on counts of kmers.

Tools: meryl genomescope
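The intuition behind estimating genome size from kmer counts can be sketched in a few lines of Python: divide the total number of kmers by the multiplicity at the main coverage peak. This is a back-of-envelope approximation only, not the model that GenomeScope fits, and the function name and the `min_multiplicity` cut-off (used to ignore low-count kmers that are mostly sequencing errors) are assumptions for illustration.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_multiplicity: int = 5) -> float:
    """Estimate genome size as (total kmers) / (multiplicity at the tallest peak).

    `histogram` maps multiplicity -> number of distinct kmers.
    Entries below `min_multiplicity` are ignored as likely errors
    (hypothetical default; choose it by looking at the histogram).
    """
    usable = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not usable:
        raise ValueError("no histogram entries at or above min_multiplicity")
    peak = max(usable, key=usable.get)  # multiplicity with the most distinct kmers
    total_kmers = sum(m * n for m, n in usable.items())
    return total_kmers / peak
```

For heterozygous diploid genomes the tallest peak may be the heterozygous peak rather than the homozygous one, which is one reason a fitted model is preferable to this shortcut.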

Trims and filters raw sequence reads according to specified settings.

Tools: fastp
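To show what filtering reads by quality means in practice, here is a small Python sketch that decodes FASTQ quality characters and keeps a read only if it is long enough and its mean quality is high enough. The thresholds and function names are hypothetical, Phred+33 encoding is assumed, and fastp applies its own (different and more detailed) criteria.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_mean_quality: float = 20.0, min_length: int = 15) -> bool:
    """Return True if the read meets both (hypothetical) thresholds."""
    return len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality
```

For example, a 20-base read whose quality string is all `I` characters (Phred 40) passes, while the same read with all `#` characters (Phred 2) is discarded.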

Genome assembly

Common tools are listed here, or search for more in the full tool panel to the left.

A haplotype-resolved assembler for PacBio HiFi reads.

Input data:

fasta

De novo assembly of single-molecule sequencing reads, designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies.

Input data:

fasta
fastq

Hybrid assembly pipeline for bacterial genomes that uses both Illumina reads and long reads (PacBio or Nanopore).

Input data:

fastq

YAHS is a scaffolding tool based on a computational method that exploits the genomic proximity information in Hi-C data sets for long-range scaffolding of de novo genome assemblies. Inputs are the primary assembly (or haplotype 1), and HiC reads mapped to the assembly. See this tutorial to learn how to create a suitable BAM file.

Input data:

fasta Primary assembly or Haplotype 1 genome.fasta
bam HiC reads mapped to assembly mapped_reads.bam

QUAST = QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics. If you have one or multiple genome assemblies, you can assess their quality with QUAST. It works with or without a reference genome.

Input data:

fasta
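One widely used assembly metric of this kind is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The Python sketch below computes it from a list of contig lengths. The function names and the choice of summary fields are assumptions for illustration, not the tool's own code or report layout.

```python
from typing import Dict, Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for completeness


def assembly_summary(contig_lengths: Sequence[int]) -> Dict[str, int]:
    """A few basic statistics for one assembly (hypothetical field names)."""
    return {
        "contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "largest_contig": max(contig_lengths),
        "n50": n50(contig_lengths),
    }
```

Comparing these numbers between assemblies of the same genome gives a quick first impression of contiguity, although a higher N50 does not by itself mean a more correct assembly.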

BUSCO: assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs. The tool attempts to provide a quantitative assessment of the completeness in terms of the expected gene content of a genome assembly, transcriptome, or annotated gene set.

Input data:

fasta

Assemble mitochondrial genomes from PacBio HiFi reads. Run the tool first to find a related mitogenome, then run it again to assemble the genome. Inputs are PacBio HiFi reads in fasta or fastq format, and a related mitogenome in both fasta and genbank formats.

Input data:

fasta
fastq
genbank

A workflow is a series of Galaxy tools that have been linked together to perform a specific analysis. You can use and customize the example workflows below. Learn more.

TSI assembly workflows - PacBio HiFi or Nanopore data

This How-to-Guide will describe the steps required to assemble your genome on the Galaxy Australia platform, using multiple workflows. There is also a guide about the Genome Assessment workflow, and the HiC Scaffolding workflow.

Convert a BAM file to FASTQ format to perform QC analysis (required if your data is in BAM format).

Input data:

bam PacBio subreads.bam

Assemble a genome using PacBio HiFi reads.

Input data:

fastqsanger HiFi reads

Optional workflow to purge duplicates from the contig assembly.

Input data:

fastqsanger HiFi reads
fasta Primary assembly contigs

Assemble a genome using Nanopore reads.

Input data:

fastqsanger or fastq Nanopore reads

Evaluate the quality of your genome assembly with a comprehensive report including FASTA stats, BUSCO, QUAST, Meryl and Merqury.

Input data:

fasta Primary assembly contigs

If you have HiC data, scaffold your assembly using YAHS.

Input data:

fasta Primary or Hap1 assembly
fastqsanger.gz HiC forward reads, HiC reverse reads

General assembly workflows - Nanopore and Illumina data

This tutorial describes the steps required to assemble a genome on Galaxy with Nanopore and Illumina data.

Assemble Nanopore long reads. This workflow can be run alone or as part of a combined workflow for large genome assembly.

Input data:

fastqsanger Long reads (may be raw, filtered and/or corrected)

Polishes (corrects) an assembly, using long reads (Racon and Medaka) and short reads (Racon).

Input data:

fasta Assembly to polish
fastq Long reads (those used in assembly)
fastq Short reads to be used for polishing (R1 only)

Assesses the quality of the genome assembly. Generates statistics, determines if expected genes are present and aligns contigs to a reference genome.

Input data:

fasta Polished assembly
fasta Reference genome assembly (e.g. related species)

VGP assembly workflows - PacBio HiFi and (optional) HiC data

These workflows have been developed as part of the global Vertebrate Genome Project (VGP). A guide to using these in Galaxy Australia can be found here. A complete guide to the individual workflows and sample results can be found here. There are many different ways that these workflows can be used in practice - for a comprehensive example, check out this Galaxy tutorial.

This workflow produces a Meryl database and GenomeScope outputs that will be used to determine parameters for the following workflows and to assess the quality of genome assemblies. Specifically, it provides information about genomic complexity, such as the genome size and levels of heterozygosity and repeat content, as well as about the data quality.

Input data:

fastq PacBio HiFi reads

This workflow uses hifiasm (HiC mode) to generate HiC-phased haplotypes (hap1 and hap2). This is in contrast to its default mode, which generates primary and alternate pseudohaplotype assemblies. This workflow includes three tools for evaluating assembly quality: gfastats, BUSCO and Merqury.

Note: if you have multiple input files for each HiC set, they need to be concatenated. The forward set needs to be concatenated in the same order as the reverse set.
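If you prepare the files outside Galaxy, the ordering requirement can be sketched in Python as follows. The helper names and file paths are hypothetical. The sketch relies on the fact that plain-text files, and gzip files made of several members, can be joined by simple byte concatenation.

```python
import shutil
from pathlib import Path
from typing import Sequence


def concatenate(parts: Sequence[Path], destination: Path) -> None:
    """Append the bytes of each part, in the order given, to one output file."""
    with destination.open("wb") as out:
        for part in parts:
            with part.open("rb") as source:
                shutil.copyfileobj(source, out)


def concatenate_pairs(forward: Sequence[Path], reverse: Sequence[Path],
                      forward_out: Path, reverse_out: Path) -> None:
    """Concatenate forward and reverse files using the same file order.

    forward[i] and reverse[i] must come from the same sequencing run,
    so that read pairs stay in step in the two outputs.
    """
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse lists must have the same length")
    concatenate(forward, forward_out)
    concatenate(reverse, reverse_out)
```

Passing both lists through a single function makes it harder to accidentally sort or glob the two sets differently, which is the usual way the order gets out of step.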

Input data:

fasta PacBio HiFi reads
fastq HiC reads (forward)
fastq HiC reads (reverse)
meryldb Meryl kmer database
txt GenomeScope genome profile summary

This workflow uses hifiasm to generate primary and alternate pseudohaplotype assemblies. This workflow includes three tools for evaluating assembly quality: gfastats, BUSCO and Merqury.

Input data:

fasta PacBio HiFi reads
meryldb Meryl kmer database
txt GenomeScope genome profile summary

This workflow scaffolds the assembly contigs using information from HiC data.

Input data:

gfa Assembly of haplotype 1
fastq HiC forward reads
fastq HiC reverse reads

This workflow identifies and removes contaminants from the assembly.

Input data:

fasta Assembly

Yes. Galaxy Australia has assembly tools for small prokaryote genomes as well as larger eukaryote genomes. We are continually adding new tools and optimising them for large genome assemblies - this means adding enough computer processing power to run data-intensive tools, as well as configuring aspects such as parallelisation.

Please contact us if:

  • you need to increase your data storage limit
  • there is a tool you wish to request
  • a tool appears to be broken or running slowly

  • See the tutorials in this Help section. They cover different approaches to genome assembly.
  • Read the methods in scientific papers about genome assembly, particularly those about genomes with similar characteristics to those in your project
  • See the Workflows section for examples of different approaches to genome assembly - these cover different sequencing data types, and a variety of tools.

Genome assembly can be a very involved process. A typical genome assembly procedure might look like:

  • Data QC - check the quality and characteristics of your sequencing reads.
  • Kmer counting - to determine genome characteristics such as ploidy and size.
  • Data preparation - trimming and filtering sequencing reads if required.
  • Assembly - for large genomes, this is usually done with long sequencing reads from PacBio or Nanopore.
  • Polishing - the assembly may be polished (corrected) with long and/or short (Illumina) reads.
  • Scaffolding - the assembly contigs may be joined together with other sequencing data such as HiC.
  • Assessment - at any stage, the assembly can be assessed for number of contigs, number of base pairs, whether expected genes are present, and many other metrics.
  • Annotation - identify features on the genome assembly such as gene names and locations.

Figure: Genome assembly flowchart (a graphical representation of genome assembly)

There is no best set of tools to recommend - new tools are developed constantly, sequencing technology improves rapidly, and many genomes have never been sequenced before and thus their characteristics and quirks are unknown. The "Tools" tab in this section includes a list of commonly-used tools that could be a good starting point. You will find other tools in recent publications or used in workflows.

You can also search for tools in Galaxy's tool panel. If they aren't installed on Galaxy Australia, you can request installation of a tool.

We recommend testing a tool on a small data set first and seeing if the results make sense, before running on your full data set.

Once a genome has been assembled, it is important to assess the quality of the assembly, and in the first instance, this quality control (QC) can be achieved using the workflow described here.

Any user of Galaxy Australia can request support through an online form.

Genome annotation

Common tools are listed here, or search for more in the full tool panel to the left.

MAKER is able to annotate both prokaryotes and eukaryotes. It works by aligning as much evidence as possible along the genome sequence, and then reconciling all these signals to determine probable gene structures.

The evidence can be transcript or protein sequences from the same (or a closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (a transcriptome assembly from an RNASeq experiment, for example). MAKER is also able to take into account repeated elements.

Input data:

fasta Genome assembly
fasta Protein evidence (optional)

Funannotate predict performs comprehensive whole-genome gene prediction. It uses AUGUSTUS, GeneMark, SNAP, GlimmerHMM, BUSCO, EVidenceModeler, tbl2asn, tRNAscan-SE, Exonerate and minimap2. This approach differs from MAKER as it does not need to train ab initio predictors.

Input data:

fasta Genome assembly (soft-masked)
bam Mapped RNA evidence (optional)
fasta Protein evidence (optional)

RepeatMasker is a program that screens DNA for repeated elements such as tandem repeats, transposons, SINEs and LINEs. Galaxy Australia has installed the full and curated Dfam screening databases, or a custom database can be provided in fasta format. Additional reference data can be downloaded from Repbase.

Input data:

fasta Genome assembly

InterProScan is a batch tool to query the InterPro database. It provides annotations based on multiple searches of profile and other functional databases.

Input data:

fasta Genome assembly

Funannotate compare compares several annotations and outputs a GFF3 file with the best gene models. It can be used to compare the results of different gene predictors, or to compare the results of a gene predictor with a reference annotation.

Input data:

fasta Genome assemblies to compare

Input data:

fasta Genome assembly
gff, gff3 or bed Annotations
bam Mapped RNAseq data (optional)

Input data:

fasta Genome assembly

Annotate an assembled genome and output a GFF3 file. There are several modules that do different things - search for FGENESH in the tool panel to see them.

Note: you must apply for access to this tool before use.

Input data:

fasta Genome assembly
fasta Repeat-masked (hard) genome assembly
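Several inputs in this section distinguish soft-masked from hard-masked assemblies. By common convention, soft-masking writes repeat regions in lowercase, while hard-masking replaces them with N. The Python sketch below converts one to the other under that assumption. The function names are hypothetical, and in practice the masking tool itself would normally produce whichever form you need.

```python
from typing import Iterable, Iterator


def hard_mask(sequence: str) -> str:
    """Replace lowercase (soft-masked) bases with N; leave other characters as-is."""
    return "".join("N" if base.islower() else base for base in sequence)


def hard_mask_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Hard-mask sequence lines of a FASTA file; header lines pass through unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else hard_mask(line)
```

Soft-masking keeps the underlying sequence available to downstream tools, whereas hard-masking discards it, so check which form each tool expects before converting.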

A workflow is a series of Galaxy tools that have been linked together to perform a specific analysis. You can use and customize the example workflows below. Learn more.

General use

Annotates a genome using multiple rounds of Maker, including gene prediction using SNAP and Augustus.

Tools: maker snap augustus busco jbrowse

Input data:

fasta Genome assembly
fastq RNAseq Illumina reads
fasta Proteins

Annotates a genome using Funannotate, including RNAseq data aligned with RNAstar and protein predictions from EggNOG.

Tools: RNAstar funannotate eggnog busco jbrowse aegean parseval

Input data:

fasta Genome assembly (soft-masked)
fastq RNAseq Illumina reads
gff3 Alternative annotation
gbk Alternative annotation

Transcript alignment

This How-to-Guide will describe the steps required to align transcript data to your genome on the Galaxy Australia platform, using multiple workflows. The outputs from these workflows can then be used as inputs into the next annotation workflow using FgenesH++.

Mask repeats in the genome.

Input data:

fasta Assembled genome genome.fasta

Trim and merge RNAseq reads.

Input data:

fastqsanger.gz For each tissue: RNAseq R1 files in a collection R1.fastqsanger.gz; RNAseq R2 files in a collection R2.fastqsanger.gz

Align RNAseq to genome to find transcripts.

Input data:

fasta For each tissue: Trimmed and merged RNAseq R1 files R1.fastqsanger.gz; Trimmed and merged RNAseq R2 files R2.fastqsanger.gz

Merge transcriptomes from different tissues, and filter out non-coding sequences.

Input data:

fasta Coding and non-coding sequences from NCBI coding_seqs.fna.gz non-coding_seqs.fna.gz

Extract longest transcripts.

Input data:

fasta Merged transcriptomes merged_transcriptomes.fasta

Convert formats for FgenesH++

Input data:

fasta Transdecoder peptides transdecoder_peptides.fasta

Annotation with FgenesH++

This How-to-Guide will describe the steps required to annotate your genome on the Galaxy Australia platform, using multiple workflows.

Annotate the genome using outputs from the TSI transcriptome workflows.

Note: you must apply for access to this tool before use.

Input data:

fasta Assembled genome
fasta Masked genome
fasta Outputs from TSI convert formats workflow (file.cdna file.pro file.dat)

These slides from the Galaxy Training Network explain the process of genome annotation in detail. You can use the left and right arrow keys to navigate through the slides.

The flowchart below shows how you might use your input data (in green) with different Galaxy tools (in blue) to annotate a genome assembly. For example, one pathway would be to take an assembled genome, plus information about repeats and data from RNA-seq, and run them through the Maker pipeline. The annotations can then be viewed in JBrowse.

Figure: Genome annotation flowchart (a graphical representation of genome annotation)

Fgenesh++ is a bioinformatics pipeline for automatic prediction of genes in eukaryotic genomes. It is now installed in Galaxy Australia. Australian researchers can apply for access through the Australian BioCommons.

Apollo is a web-browser-accessible system that lets you conduct real-time collaborative curation and editing of genome annotations.

The Australian BioCommons and our partners at QCIF and Pawsey provide a hosted Apollo Portal service where your genome assembly and supporting evidence files can be hosted. All system administration is taken care of, so you and your team can focus on the annotation curation itself.

This Galaxy tutorial provides a complete walkthrough of the process of refining eukaryotic genome annotations with Apollo.

Genome annotation with Maker

Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotic genomes, with more genes. The sequences determining the beginning and the end of a gene are generally less conserved than in prokaryotes. Many genes also contain introns, and the limits of these introns (acceptor and donor sites) are not highly conserved. This Galaxy tutorial uses MAKER to annotate the genome of a small eukaryote: Schizosaccharomyces pombe (a yeast).


Genome annotation with Funannotate

This Galaxy tutorial provides a complete walkthrough of the process of annotation with Funannotate, including the preparation of RNAseq data, structural annotation, functional annotation, visualisation, and comparing annotations.

Any user of Galaxy Australia can request support through an online form.

Contributors

AnnaSyme
neoformit