Genome Lab

Welcome to the Galaxy Genome Lab. Get quick access to tools, workflows and tutorials for genome assembly and annotation.

Data import and preparation

If you are new to Galaxy, uploading your data is a good place to start!
Check out the Tools and Workflows tabs for different approaches to uploading data.

You can upload your data to Galaxy using the Upload tool from anywhere in Galaxy. Just look for the "Upload data" button at the top of the tool panel.

More info

We recommend subsampling large data sets to test tools and workflows. A useful tool for this is seqtk_seq: set the "Sample fraction of sequences" parameter to the fraction of reads you want to keep.
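
As a rough illustration of what sampling a fraction of sequences means, the sketch below keeps each read with a fixed probability. It is not how seqtk_seq is implemented; the function names, the default seed and the in-memory approach are assumptions made for this example, and real data sets are better handled by the Galaxy tool itself.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for start in range(0, len(stripped), 4):
        yield tuple(stripped[start:start + 4])


def subsample(records: Iterable[FastqRecord], fraction: float,
              seed: int = 11) -> List[FastqRecord]:
    """Keep each record with probability `fraction` (between 0 and 1)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the subsample repeatable
    return [record for record in records if rng.random() < fraction]
```

In this sketch, running both files of a paired-end data set with the same seed keeps the pairs together, provided the two files list their reads in the same order.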

BioPlatforms Australia allows data downloads via URL. Once you have generated one of these URLs in the BPA portal, you can import it into Galaxy using the "Fetch data" feature of the Upload tool.

More info

No, do not upload personal or sensitive data, such as human health or clinical data. Please see our Data Privacy page for definitions of sensitive and health-related information.

Please also make sure you have read our Terms of Service, which covers hosting and analysis of research data.

Please read our Privacy Policy for information on your personal data and any data that you upload.

Please submit a quota request if your Galaxy account reaches its data storage limit. Requests are usually provisioned quickly if you provide a reasonable use case for your request.

Request

Quality control and data cleaning are essential first steps in any NGS analysis. This tutorial will show you how to use and interpret results from FastQC, NanoPlot and PycoQC.

Tutorial

This practical aims to familiarize you with the Galaxy user interface. It will teach you how to perform basic tasks such as importing data, running tools, working with histories, creating workflows, and sharing your work.

Tutorial

Any user of Galaxy can request support through an online form.

Request support

Common tools are listed here, or search for more in the full tool panel to the left.

Standard upload of data to Galaxy, from your computer or from the web.

Before using your sequencing data, it's important to ensure that the data quality is sufficient for your analysis.

Input data:

fastq
bam
sam

Faster run than FastQC, this tool can also trim reads and filter by quality.

Input data:

fastq

A plotting suite for Oxford Nanopore sequencing data and alignments.

Input data:

fastq
fasta
vcf_bgzip

A set of metrics and graphs to visualize genome size and complexity prior to assembly.

Input data:

tabular Output from Meryl or Jellyfish histo

Prepare kmer count histogram for input to GenomeScope.

Input data:

fastq
fasta
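
To make the idea of a kmer count histogram concrete, the sketch below counts kmers in a handful of reads and then tallies how many distinct kmers occur once, twice, and so on. It is a simplified illustration only: Meryl is far more efficient, and this version does not merge a kmer with its reverse complement, which kmer counters commonly do. The function names are invented for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every kmer of length k across the reads, skipping kmers with N."""
    counts: Counter = Counter()
    for read in reads:
        sequence = read.upper()
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_histogram(counts: Counter) -> Dict[int, int]:
    """Map multiplicity -> number of distinct kmers seen that many times."""
    histogram = Counter(counts.values())
    return dict(sorted(histogram.items()))
```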

A workflow is a series of Galaxy tools that have been linked together to perform a specific analysis. You can use and customize the example workflows below. Learn more.

Report statistics from sequencing reads.

Tools: nanoplot fastqc multiqc

Estimates genome size and heterozygosity based on counts of kmers.

Tools: meryl genomescope
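
A back-of-the-envelope version of this estimate divides the total number of counted kmers by the depth at the main peak of the histogram. GenomeScope fits a statistical model to the whole profile and also reports heterozygosity, so the sketch below only shows the basic intuition. The function name and the error cutoff (min_multiplicity) are assumptions for this example.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int],
                         min_multiplicity: int = 5) -> int:
    """Roughly estimate genome size from a kmer histogram.

    `histogram` maps multiplicity -> number of distinct kmers seen that often.
    Multiplicities below `min_multiplicity` are treated as sequencing errors.
    """
    trusted = {m: c for m, c in histogram.items() if m >= min_multiplicity}
    if not trusted:
        raise ValueError("no kmers at or above the error cutoff")
    # The multiplicity with the most distinct kmers approximates read depth.
    peak = max(trusted, key=lambda m: trusted[m])
    total_kmers = sum(m * c for m, c in trusted.items())
    return round(total_kmers / peak)
```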

Trims and filters raw sequence reads according to specified settings.

Tools: fastp
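
Filtering by quality usually comes down to decoding the FASTQ quality string and applying thresholds. The sketch below assumes the common Phred+33 encoding and uses made-up thresholds and function names; it is not fastp's algorithm, which offers more options.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a FASTQ quality string (Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_mean_quality: float = 20.0,
                  min_length: int = 15) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    return (len(sequence) >= min_length
            and mean_phred(quality) >= min_mean_quality)
```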

Genome assembly

Common tools are listed here, or search for more in the full tool panel to the left.

A haplotype-resolved assembler for PacBio HiFi and/or Oxford Nanopore reads.

Input data:

fasta
fastq

De novo assembly of single-molecule sequencing reads, designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies.

Input data:

fasta
fastq

Uses both HiFi and, optionally, corrected Nanopore reads for assembly.

Input data:

fastq
fastqsanger

Hybrid assembly pipeline for bacterial genomes, uses both Illumina reads and long reads (PacBio or Nanopore).

Input data:

fastq

YAHS is a scaffolding tool based on a computational method that exploits the genomic proximity information in Hi-C data sets for long-range scaffolding of de novo genome assemblies. Inputs are the primary assembly (or haplotype 1), and HiC reads mapped to the assembly. See this tutorial to learn how to create a suitable BAM file.

Input data:

`fasta`	Primary assembly or Haplotype 1 `genome.fasta`
`bam`	HiC reads mapped to assembly `mapped_reads.bam`

QUAST = QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics. If you have one or multiple genome assemblies, you can assess their quality with QUAST. It works with or without a reference genome.

Input data:

fasta

BUSCO: assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs. The tool attempts to provide a quantitative assessment of the completeness in terms of the expected gene content of a genome assembly, transcriptome, or annotated gene set.

Input data:

fasta

Assemble mitochondrial genomes from PacBio HiFi reads. Run the tool first to find a related mitogenome, then run it again to assemble the genome. Inputs are PacBio HiFi reads in fasta or fastq format, and a related mitogenome in both fasta and genbank formats.

Input data:

fasta
fastq
genbank

Uses the HERRO algorithm to correct raw Nanopore reads.

Input data:

fastq
paf

A workflow is a series of Galaxy tools that have been linked together to perform a specific analysis. You can use and customize the example workflows below. Learn more.

TSI assembly workflows - PacBio HiFi or Nanopore data

This How-to-Guide will describe the steps required to assemble your genome on the Galaxy Australia platform, using multiple workflows. There is also a guide about the Genome Assessment workflow, and the HiC Scaffolding workflow.

Convert a BAM file to FASTQ format to perform QC analysis (required if your data is in BAM format).

Input data:

bam PacBio subreads.bam

Assemble a genome using PacBio HiFi reads.

Input data:

fastqsanger HiFi reads

Optional workflow to purge duplicates from the contig assembly.

Input data:

`fastqsanger`	HiFi reads
`fasta`	Primary assembly contigs

Assemble a genome using Nanopore reads.

Input data:

fastqsanger
fastq Nanopore reads

Evaluate the quality of your genome assembly with a comprehensive report including FASTA stats, BUSCO, QUAST, Meryl and Merqury.

Input data:

fasta Primary assembly contigs

If you have HiC data, scaffold your assembly using YAHS.

Input data:

`fasta`	Primary or Hap1 assembly
`fastqsanger.gz`	HiC forward reads, HiC reverse reads

General assembly workflows - Nanopore and Illumina data

This tutorial describes the steps required to assemble a genome on Galaxy with Nanopore and Illumina data.

Assemble Nanopore long reads. This workflow can be run alone or as part of a combined workflow for large genome assembly.

Input data:

fastqsanger Long reads (may be raw, filtered and/or corrected)

Polishes (corrects) an assembly, using long reads (Racon and Medaka) and short reads (Racon).

Input data:

`fasta`	Assembly to polish
`fastq`	Long reads (those used in assembly)
`fastq`	Short reads to be used for polishing (R1 only)

Assesses the quality of the genome assembly. Generates statistics, determines if expected genes are present and aligns contigs to a reference genome.

Input data:

`fasta`	Polished assembly
`fasta`	Reference genome assembly (e.g. related species)

VGP assembly workflows - PacBio HiFi and (optional) HiC data

These workflows have been developed as part of the global Vertebrate Genome Project (VGP). A guide to using these in Galaxy Australia can be found here. A complete guide to the individual workflows and sample results can be found here. There are many different ways that these workflows can be used in practice - for a comprehensive example, check out this Galaxy tutorial.

This workflow produces a Meryl database and GenomeScope outputs that will be used to determine parameters for the following workflows, and to assess the quality of genome assemblies. Specifically, it provides information about the genomic complexity, such as the genome size and levels of heterozygosity and repeat content, as well as about the data quality.

Input data:

fastq PacBio HiFi reads

This workflow uses hifiasm (HiC mode) to generate HiC-phased haplotypes (hap1 and hap2). This is in contrast to its default mode, which generates primary and alternate pseudohaplotype assemblies. This workflow includes three tools for evaluating assembly quality: gfastats, BUSCO and Merqury.

Note: if you have multiple input files for each HiC set, they need to be concatenated. The forward set needs to be concatenated in the same order as the reverse set.

Input data:

`fasta`	PacBio HiFi reads
`fastq`	PacBio HiC reads (forward)
`fastq`	PacBio HiC reads (reverse)
`meryldb`	`Meryl` kmer database
`txt`	`GenomeScope` genome profile summary
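
The note above about concatenation order can be checked before merging files. The sketch below sorts the forward files and looks up the matching reverse file for each, so that both sets are listed in the same order. The _R1 and _R2 tags and the file names used with it are hypothetical; adjust them to your own naming scheme.

```python
from typing import List, Tuple


def paired_concat_order(forward: List[str], reverse: List[str],
                        fwd_tag: str = "_R1",
                        rev_tag: str = "_R2") -> Tuple[List[str], List[str]]:
    """Return forward and reverse file names in matching order."""
    # Index each reverse file by the forward name it should pair with.
    reverse_by_forward = {
        name.replace(rev_tag, fwd_tag, 1): name for name in reverse
    }
    ordered_forward = sorted(forward)
    unmatched = [n for n in ordered_forward if n not in reverse_by_forward]
    if unmatched or len(reverse_by_forward) != len(ordered_forward):
        raise ValueError(f"forward and reverse files do not pair up: {unmatched}")
    ordered_reverse = [reverse_by_forward[name] for name in ordered_forward]
    return ordered_forward, ordered_reverse
```

Concatenating the files in the two returned orders keeps read pairs aligned between the forward and reverse sets.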

This workflow uses hifiasm to generate primary and alternate pseudohaplotype assemblies. This workflow includes three tools for evaluating assembly quality: gfastats, BUSCO and Merqury.

Input data:

`fasta`	PacBio HiFi reads
`meryldb`	`Meryl` kmer database
`txt`	`GenomeScope` genome profile summary

This workflow scaffolds the assembly contigs using information from HiC data.

Input data:

`gfa`	Assembly of haplotype 1
`fastq`	HiC forward reads
`fastq`	HiC reverse reads

This workflow identifies and removes contaminants from the assembly.

Input data:

fasta Assembly

Yes. Galaxy has assembly tools for small prokaryote genomes as well as larger eukaryote genomes. We are continually adding new tools and optimising them for large genome assemblies - this means adding enough computer processing power to run data-intensive tools, as well as configuring aspects such as parallelisation.

Please contact us if:

you need to increase your data storage limit
there is a tool you wish to request
a tool appears to be broken or running slowly

Request support

See the tutorials in this Help section. They cover different approaches to genome assembly.
Read the methods in scientific papers about genome assembly, particularly those about genomes with similar characteristics to those in your project.
See the Workflows section for examples of different approaches to genome assembly - these cover different sequencing data types, and a variety of tools.

Genome assembly can be a very involved process. A typical genome assembly procedure might look like:

Data QC - check the quality and characteristics of your sequencing reads.
Kmer counting - to determine genome characteristics such as ploidy and size.
Data preparation - trimming and filtering sequencing reads if required.
Assembly - for large genomes, this is usually done with long sequencing reads from PacBio or Nanopore.
Polishing - the assembly may be polished (corrected) with long and/or short (Illumina) reads.
Scaffolding - the assembly contigs may be joined together with other sequencing data such as HiC.
Assessment - at any stage, the assembly can be assessed for number of contigs, number of base pairs, whether expected genes are present, and many other metrics.
Annotation - identify features on the genome assembly such as gene names and locations.
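
Two of the assessment metrics mentioned above, number of contigs and number of base pairs, are simple to compute, along with the widely used N50 statistic. The sketch below is an illustration only (tools such as QUAST report these and many more); the function names are invented for this example.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the contig being read, if any
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length of the contig at which half of the total bases are reached,
    taking contigs from longest to shortest."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # only reached for an empty list


def assembly_summary(lengths: List[int]) -> Dict[str, int]:
    """Collect contig count, total base pairs and N50 in one place."""
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}
```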

A graphical representation of genome assembly

There is no best set of tools to recommend - new tools are developed constantly, sequencing technology improves rapidly, and many genomes have never been sequenced before and thus their characteristics and quirks are unknown. The "Tools" tab in this section includes a list of commonly-used tools that could be a good starting point. You will find other tools in recent publications or used in workflows.

You can also search for tools in Galaxy's tool panel. If they aren't installed on Galaxy, you can request installation of a tool.

We recommend testing a tool on a small data set first and seeing if the results make sense, before running on your full data set.

Find 15+ Galaxy training tutorials here.

Introduction to genome assembly and annotation (slides)

Vertebrate genome assembly pipeline (tutorial)

Nanopore and Illumina genome assembly (tutorial)

Share workflows and results with workflow reports (tutorial)

Once a genome has been assembled, it is important to assess the quality of the assembly, and in the first instance, this quality control (QC) can be achieved using the workflow described here.

Workflow tutorial

Any user of Galaxy can request support through an online form.

Request support

Genome annotation

Common tools are listed here, or search for more in the full tool panel to the left.

MAKER is able to annotate both prokaryotes and eukaryotes. It works by aligning as much evidence as possible along the genome sequence, and then reconciling all these signals to determine probable gene structures.

The evidence can be transcript or protein sequences from the same (or a closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (for example, a transcriptome assembly from an RNASeq experiment). MAKER is also able to take repeated elements into account.

Input data:

`fasta`	Genome assembly
`fasta`	Protein evidence (optional)

Funannotate predict performs comprehensive whole-genome gene prediction. It uses AUGUSTUS, GeneMark, SNAP, GlimmerHMM, BUSCO, EVidenceModeler, tbl2asn, tRNAscan-SE, Exonerate and minimap2. This approach differs from MAKER as it does not need to train ab initio predictors.

Input data:

`fasta`	Genome assembly (soft-masked)
`bam`	Mapped RNA evidence (optional)
`fasta`	Protein evidence (optional)

RepeatMasker is a program that screens DNA for repeated elements such as tandem repeats, transposons, SINEs and LINEs. Galaxy AU has installed the full and curated DFam screening databases, or a custom database can be provided in fasta format. Additional reference data can be downloaded from RepBase.

Input data:

fasta Genome assembly

Red detects and masks repeats. It needs no additional input data.

Input data:

fasta Genome assembly

InterProScan is a batch tool to query the InterPro database. It provides annotations based on multiple searches of profile and other functional databases.

Input data:

fasta Genome assembly

Funannotate compare compares several annotations and outputs a GFF3 file with the best gene models. It can be used to compare the results of different gene predictors, or to compare the results of a gene predictor with a reference annotation.

Input data:

fasta Genome assemblies to compare

Gene calling based on deep learning. Four models are currently available: land plant, fungi, vertebrate and invertebrate.

Input data:

fasta
fasta.gz Genome assembly

Input data:

`fasta`	Genome assembly
`gff` `gff3` `bed`	Annotations
`bam`	Mapped RNAseq data (optional)

Input data:

fasta Genome assembly

Input data:

fasta
fasta.gz Genome assembly

Annotate an assembled genome and output a GFF3 file. There are several modules that do different things - search for FGENESH in the tool panel to see them.

Note: you must apply for access to this tool before use.

Input data:

`fasta`	Genome assembly
`fasta`	Repeat-masked (hard) genome assembly
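
The tool inputs in this section distinguish soft-masked and hard-masked assemblies. By common convention, soft masking writes repeat regions in lowercase, while hard masking replaces them with N. The sketch below converts one to the other; the function names are invented for this example, and it is not part of any of the tools listed here.

```python
def soft_to_hard_mask(sequence: str) -> str:
    """Replace lowercase (soft-masked) bases with N (hard-masked)."""
    return "".join("N" if base.islower() else base for base in sequence)


def masked_fraction(sequence: str) -> float:
    """Fraction of bases that are masked, either lowercase or N."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)
```

Hard masking discards the underlying bases, so keep the soft-masked file if another tool needs it later.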

A workflow is a series of Galaxy tools that have been linked together to perform a specific analysis. You can use and customize the example workflows below. Learn more.

General use

Annotates a genome using multiple rounds of Maker, including gene prediction using SNAP and Augustus.

Tools: maker snap augustus busco jbrowse

Input data:

`fasta`	Genome assembly
`fastq`	RNAseq Illumina reads
`fasta`	Proteins

Annotates a genome using Funannotate, includes RNAseq data with RNAstar, and protein predictions from EggNOG.

Tools: RNAstar funannotate eggnog busco jbrowse aegean parseval

Input data:

`fasta`	Genome assembly (soft-masked)
`fastq`	RNAseq Illumina reads
`gff3`	Alternative annotation
`gbk`	Alternative annotation

Transcript alignment

This How-to-Guide will describe the steps required to align transcript data to your genome on the Galaxy platform, using multiple workflows. The outputs from these workflows can then be used as inputs into the next annotation workflow using FgenesH++.

Mask repeats in the genome.

Input data:

fasta Assembled genome genome.fasta

Trim and merge RNAseq reads.

Input data:

fastqsanger.gz For each tissue: RNAseq R1 files in a collection R1.fastqsanger.gz; RNAseq R2 files in a collection R2.fastqsanger.gz

Align RNAseq to genome to find transcripts.

Input data:

`fasta`	Masked genome `masked_genome.fasta`
`fastqsanger.gz`	For each tissue: Trimmed and merged RNAseq R1 files `R1.fastqsanger.gz`; Trimmed and merged RNAseq R2 files `R2.fastqsanger.gz`

Merge transcriptomes from different tissues, and filter out non-coding sequences.

Input data:

`fasta`	Masked genome
`gtf`	Multiple transcriptomes in a collection
`fasta.gz`	Coding and non-coding sequences from NCBI

Extract longest transcripts.

Input data:

fasta Merged transcriptomes

Convert formats for FgenesH++.

Input data:

fasta Transdecoder nucleotides/peptides

Annotation with FgenesH++

This How-to-Guide will describe the steps required to annotate your genome on the Galaxy platform, using multiple workflows.

Annotate the genome using outputs from the TSI transcriptome workflows.

Note: you must apply for access to this tool before use.

Input data:

`fasta`	Assembled genome
`fasta`	Masked genome
`fasta`	Outputs from TSI convert formats workflow (`file.cdna` `file.pro` `file.dat`)

These slides from the Galaxy training network explain the process of genome annotation in detail. You can use the ← and → keys to navigate through the slides.

The flowchart below shows how you might use your input data (in green) with different Galaxy tools (in blue) to annotate a genome assembly. For example, one pathway would be taking an assembled genome, plus information about repeats, and data from RNA-seq, to run in the Maker pipeline. The annotations can then be viewed in JBrowse.

A graphical representation of genome annotation

Fgenesh++ is a bioinformatics pipeline for automatic prediction of genes in eukaryotic genomes. It is now installed in Galaxy. Australian researchers can apply for access through the Australian BioCommons.

Apply

Apollo is a web-browser-accessible system that lets you conduct real-time collaborative curation and editing of genome annotations.

The Australian BioCommons and our partners at QCIF and Pawsey provide a hosted Apollo Portal service where your genome assembly and supporting evidence files can be hosted. All system administration is taken care of, so you and your team can focus on the annotation curation itself.

This Galaxy tutorial provides a complete walkthrough of the process of refining eukaryotic genome annotations with Apollo.

More info

Genome annotation with Maker

Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotic genomes, with more genes. The sequences determining the beginning and the end of a gene are generally less conserved than the prokaryotic ones. Many genes also contain introns, and the limits of these introns (acceptor and donor sites) are not highly conserved. This Galaxy tutorial uses MAKER to annotate the genome of a small eukaryote: Schizosaccharomyces pombe (a yeast).

Genome annotation with Funannotate

This Galaxy tutorial provides a complete walkthrough of the process of annotation with Funannotate, including the preparation of RNAseq data, structural annotation, functional annotation, visualisation, and comparing annotations.

Any user of Galaxy can request support through an online form.

Request support

Galaxy Training Network

Galaxy

Data import and preparation

How can I import my genomics data?

How can I subsample my data?

How can I import data from the BPA portal?

Can I upload sensitive data?

Is my data private?

How can I increase my storage quota?

Tutorial: Quality Control

Tutorial: Introduction to Genomics and Galaxy

Galaxy support

Import data to Galaxy

FastQC - sequence quality reports

FastP - sequence quality reports, trimming & filtering

NanoPlot - visualize Oxford Nanopore data

GenomeScope - estimate genome size

Meryl - count kmers

Data QC

Kmer counting to estimate genome size

Trim and filter reads

Genome assembly

Hifiasm - assembly with PacBio HiFi and/or Nanopore data

Flye - assembly with PacBio or Nanopore data

Verkko - assembly with PacBio and Nanopore data

Unicycler - assembly with Illumina, PacBio or Nanopore data - bacteria only

YAHS - scaffold assembly with HiC data

Quast - assess genome assembly quality

Busco - assess genome assembly quality

MitoHiFi - assemble mitochondrial genomes

Dorado correct - correct Nanopore reads

About these workflows

BAM to FASTQ + QC v1.0

PacBio HiFi genome assembly using hifiasm v2.1

Purge duplicates from hifiasm assembly v1.0

Nanopore genome assembly using Flye

Genome assessment post-assembly

Optional HiC scaffolding workflow

About these workflows

Flye assembly with Nanopore data

Assembly polishing

Assess genome quality

About these workflows

Kmer profiling

HiFi assembly and HiC phasing

HiFi assembly without HiC data

HiC scaffolding

Decontamination

Can I use Galaxy to assemble a large genome?

How can I learn about genome assembly?

Genome assembly overview

Which tools should I use?

Tutorials

How can I assess the quality of my genome assembly?

Galaxy support

Genome annotation

MAKER - genome annotation pipeline

Funannotate predict - predicted gene annotations

RepeatMasker - screen DNA sequences for interspersed repeats and low complexity regions

Red - fast de-novo repeat masking

InterProScan - Scans InterPro database and assigns functional annotations

Funannotate compare - compare several annotations

Helixer - structural genome annotation

JBrowse - Genome browser to visualize annotations

Prokka - Genome annotation, prokaryotes only

Bakta - Genome annotation for bacteria, MAGs and plasmids

FGenesH - Genome annotation

Annotation with Maker

Annotation with Funannotate

About these workflows

Repeat masking

QC and trimming of RNAseq

Find transcripts

Combine transcripts

Extract transcripts

Convert formats

About these workflows

Annotation with FgenesH++

What is genome annotation?

Genome annotation overview