![ONT_logo](/ONT_logo.png)

-----------------------------

pinfish
=======

Pinfish is a collection of tools for making sense of long-read transcriptomics data (long cDNA reads, direct RNA reads). The toolchain consists of the following tools:

- `spliced_bam2gff` - converts sorted BAM files containing spliced alignments (generated by [minimap2](https://github.com/lh3/minimap2) or [GMAP](http://research-pub.gene.com/gmap/src/README)) into GFF2 format. Each read is represented as a distinct transcript. The tool comes in handy for visualizing spliced reads at particular loci and provides input to the rest of the toolchain.
- `cluster_gff` - takes a sorted GFF2 file as input and clusters together reads with similar exon/intron structure. It then creates a rough consensus for each cluster by taking the median of the exon boundaries from all transcripts in the cluster.
- `polish_clusters` - takes the cluster definitions generated by `cluster_gff` and creates an error-corrected read for each cluster by mapping all reads in the cluster onto the read with the median length (using `minimap2`) and polishing it using `racon`. The polished reads can be mapped to the genome using `minimap2` or `GMAP`.
- `collapse_partials` - takes GFFs generated by either `cluster_gff` or `polish_clusters` and filters out transcripts that are likely to be based on RNA degradation products from the 5' end. The tool clusters the input transcripts into "loci" by their 3' ends and discards any transcript for which the same locus contains a compatible transcript with more exons.

Pinfish is largely inspired by the [Mandalorion](https://www.nature.com/articles/ncomms16027) pipeline. It is meant to provide a quick way to generate annotations from long reads only; it is not meant to provide the same functionality as pipelines using a broader annotation strategy (such as [LoReAn](https://www.biorxiv.org/content/early/2017/12/08/230359)).

The pinfish tools can be run via a [Snakemake](https://snakemake.readthedocs.io/en/stable/) [pipeline](https://github.com/nanoporetech/pipeline-pinfish-analysis) which handles the alignment tasks using `minimap2`.

Getting Started
===============

## Installation

Static Linux binaries for the x86_64 platform are included in the respective subdirectories of the source tree. To install them, simply copy them to a directory in your `PATH`.

The `polish_clusters` tool depends on the following software:

- [minimap2](https://github.com/lh3/minimap2)
- [samtools](https://github.com/samtools/samtools)
- [racon](https://github.com/isovic/racon) - please install from source!

## Dependencies and compiling from source

Compiling the tools from source requires a working Go compiler [installation](https://golang.org/doc/install) and the following packages installed via `go get`:

- [bíogo](https://github.com/biogo/biogo)
- [bíogo/hts](https://github.com/biogo/hts)
- [gonum](https://github.com/gonum/gonum)
- [google/uuid](https://github.com/google/uuid)

After installing the dependencies, simply run `make` in the respective subdirectory.

## Usage

### spliced_bam2gff

```
Usage of spliced_bam2gff:
  -M    Input is from minimap2.
  -V    Print out version.
  -g    Use strand tag as feature orientation then read strand if not available.
  -h    Print out help message.
  -s    Use read strand (from BAM flag) as feature orientation.
  -t int
        Number of cores to use. (default 4)
```

By default, the tool looks for the `XS` tag to determine transcript orientation. If the `-M` flag is specified, the input is assumed to come from `minimap2` and the `ts` tag is used instead (with different rules to determine the final orientation).

If no orientation tag is found, then the orientation is set to `.`, unless the `-g` flag is provided, in which case the read orientation from the BAM flag is used.

If the `-s` flag is specified, all the rules above are ignored and the orientation is set to the read strand from the BAM flag (appropriate for stranded protocols).
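
The orientation rules above can be summarised in a short Python sketch. It is not the tool's implementation: the function name and its arguments are hypothetical, and `tag_strand` stands for the strand already derived from the `XS` or `ts` tag (the `minimap2`-specific `ts` rules are not modelled here).

```python
from typing import Optional


def resolve_orientation(
    tag_strand: Optional[str],       # "+", "-" or None when no XS/ts tag is present
    read_strand: str,                # "+" or "-" taken from the BAM flag
    fallback_to_read: bool = False,  # models the -g flag
    force_read: bool = False,        # models the -s flag
) -> str:
    """Return the feature orientation: "+", "-" or "." (unknown)."""
    if force_read:
        # -s: ignore the tags entirely (stranded protocols).
        return read_strand
    if tag_strand in ("+", "-"):
        return tag_strand
    # No usable tag: "." unless -g asks for the read strand.
    return read_strand if fallback_to_read else "."
```

For example, `resolve_orientation(None, "-")` gives `.`, while the same call with `fallback_to_read=True` gives `-`.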

Example run with `minimap2` input:

```bash
spliced_bam2gff -M minimap_sorted.bam > raw_transcripts.gff
```

Example run with `minimap2` input, stranded mode:

```bash
spliced_bam2gff -s minimap_sorted.bam > raw_transcripts.gff
```

Example run with `GMAP` input:

```bash
spliced_bam2gff gmap_sorted.bam > raw_transcripts.gff
```

### cluster_gff

```
Usage of ./cluster_gff:
  -V    Print out version.
  -a string
        Write clusters in tabular format in this file.
  -c int
        Minimum cluster size. (default 10)
  -d int
        Exon boundary tolerance. (default 10)
  -e int
        Terminal exons boundary tolerance. (default 30)
  -h    Print out help message.
  -p float
        Minimum isoform percentage. (default 1)
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)
```

The `-e` parameter is the maximum distance tolerated at the start of the first exon and the end of the last exon, while `-d` is the tolerance for all other exon boundaries.
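
To illustrate how the two tolerances apply to different boundaries, here is a minimal Python sketch. It assumes that transcripts are lists of `(start, end)` exon tuples sorted by position and that two transcripts must have the same number of exons to match. Both the representation and the function name are illustrative and are not taken from the `cluster_gff` source.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end)


def boundaries_match(a: List[Exon], b: List[Exon], d: int = 10, e: int = 30) -> bool:
    """True if every exon boundary of `a` is within tolerance of `b`."""
    if len(a) != len(b):
        return False
    last = len(a) - 1
    for i, ((a_start, a_end), (b_start, b_end)) in enumerate(zip(a, b)):
        # The start of the first exon and the end of the last exon use the
        # terminal tolerance e; every other boundary uses d.
        start_tol = e if i == 0 else d
        end_tol = e if i == last else d
        if abs(a_start - b_start) > start_tol or abs(a_end - b_end) > end_tol:
            return False
    return True
```

With the defaults, a transcript whose first exon starts 20 positions away from another's still matches, whereas a 15-position shift at an internal splice site does not.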

*Transcript clusters smaller than the value of the `-c` parameter are discarded. This parameter has the largest effect on the sensitivity and specificity of transcript reconstruction. Larger values usually lead to higher specificity at the expense of sensitivity.*
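
A rough Python illustration of the cluster size filter, combined with the median-based consensus mentioned in the overview. It assumes each cluster is a list of transcripts with identical exon counts, each transcript being a list of `(start, end)` tuples. The names `consensus` and `filter_and_summarise` are hypothetical, and `statistics.median_low` is just one possible choice of median, not necessarily the one `cluster_gff` uses.

```python
from statistics import median_low
from typing import List, Tuple

Exon = Tuple[int, int]
Transcript = List[Exon]


def consensus(cluster: List[Transcript]) -> Transcript:
    """Median start and end of each exon across the cluster."""
    result = []
    for exons in zip(*cluster):  # the i-th exon of every transcript
        starts = [start for start, _ in exons]
        ends = [end for _, end in exons]
        result.append((median_low(starts), median_low(ends)))
    return result


def filter_and_summarise(clusters: List[List[Transcript]],
                         min_size: int = 10) -> List[Transcript]:
    """Drop clusters smaller than min_size (-c), then build a consensus for the rest."""
    return [consensus(c) for c in clusters if len(c) >= min_size]
```

Raising `min_size` removes more low-support clusters, which is the specificity versus sensitivity trade-off described above.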

Example run with default minimum cluster size and tolerance values:

```bash
cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff
```

Example run with custom parameters:

```bash
cluster_gff -c 5 -e 50 -d 5 -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff
```

### polish_clusters

```
Usage of ./polish_clusters:
  -V    Print out version.
  -a string
        Read cluster memberships in tabular format.
  -c int
        Minimum cluster size. (default 1)
  -d string
        Location of temporary directory.
  -h    Print out help message.
  -m    Do not load all reads in memory (slower).
  -o string
        Output fasta file.
  -t int
        Number of cores to use. (default 4)
  -x string
        Arguments passed to minimap2.
  -y string
        Arguments passed to racon.
```

Example run:

```bash
polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 40 sorted.bam
```

The resulting consensus transcripts can be mapped to the genome using `minimap2`.
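
As noted in the overview, each cluster is polished against its median-length read. The following Python sketch shows one way to pick such a read. It assumes the reads of a cluster are available as a name-to-sequence dictionary; the function name and the tie-breaking and even-count behaviour are choices made for this sketch, not a description of the `polish_clusters` code.

```python
from typing import Dict


def median_length_read(reads: Dict[str, str]) -> str:
    """Return the name of the read whose length is the median of the cluster."""
    if not reads:
        raise ValueError("empty cluster")
    # Sort by length, breaking ties by name so the result is deterministic.
    ordered = sorted(reads, key=lambda name: (len(reads[name]), name))
    # For an even number of reads this takes the upper of the two middle reads
    # (an arbitrary choice made for this sketch).
    return ordered[len(ordered) // 2]
```

The selected read would then serve as the target onto which the remaining reads are mapped before polishing.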

### collapse_partials

```
Usage of ./collapse_partials:
  -M    Discard monoexonic transcripts.
  -U    Discard transcripts which are not oriented.
  -V    Print out version.
  -d int
        Internal exon boundary tolerance. (default 5)
  -e int
        Three prime exons boundary tolerance. (default 30)
  -f int
        Five prime exons boundary tolerance. (default 5000)
  -h    Print out help message.
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)
```

The `-d` parameter is the exon boundary difference tolerated at internal splice sites, while `-e` and `-f` are the tolerance values at the 3' and 5' ends, respectively. Transcripts which are not oriented are all assigned to distinct "loci" and left untouched by default (but see the `-U` flag).
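
The following Python sketch shows how the three tolerances could interact when deciding whether a transcript is a 5'-truncated version of another one in the same locus. It is a simplification under stated assumptions: only plus-strand transcripts are handled, exons are `(start, end)` tuples sorted by position, and `-f` is interpreted as the allowed distance between the partial's 5' end and the start of the matching exon. The function names are hypothetical and the actual `collapse_partials` logic may differ.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), plus strand, sorted by position


def is_partial_of(short: List[Exon], full: List[Exon],
                  d: int = 5, e: int = 30, f: int = 5000) -> bool:
    """True if `short` looks like a 5'-truncated version of `full`."""
    k, n = len(short), len(full)
    if k == 0 or k >= n:
        return False  # only transcripts with fewer exons can be partials
    tail = full[n - k:]  # the exons of `full` nearest the 3' end
    for i, ((s_start, s_end), (t_start, t_end)) in enumerate(zip(short, tail)):
        start_tol = f if i == 0 else d    # 5' end vs internal splice site
        end_tol = e if i == k - 1 else d  # 3' end vs internal splice site
        if abs(s_start - t_start) > start_tol or abs(s_end - t_end) > end_tol:
            return False
    return True


def collapse(locus: List[List[Exon]], **tol: int) -> List[List[Exon]]:
    """Keep transcripts that are not partials of any other transcript in the locus."""
    return [t for t in locus
            if not any(is_partial_of(t, other, **tol) for other in locus)]
```

In this sketch, lowering `f` makes the filter more conservative, because a partial whose 5' end lies far from the start of the matching exon is then kept rather than discarded.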

Example run:

```bash
collapse_partials -d 10 -e 35 -f 1000 input.gff > collapsed_output.gff
```

Running tests
============

Running the tests requires the following dependencies:

- [minimap2](https://github.com/lh3/minimap2)
- [gffcompare](https://github.com/gpertea/gffcompare)

Both are easy to install using [bioconda](https://bioconda.github.io).
See the Makefiles for targets that test the tools on simulated and real data.

Help
====

## Licence and Copyright

(c) 2018 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.

## FAQs and tips

- The [GFF2](https://www.ensembl.org/info/website/upload/gff.html) files can be visualised using [IGV](http://software.broadinstitute.org/software/igv).
- The GFF2 files can be converted to GFF3 or GTF using the [gffread](https://bioconda.github.io/recipes/gffread/README.html) utility.

## References and Supporting Information

See the post announcing the tool on the Oxford Nanopore Technologies community site [here](https://community.nanoporetech.com/posts/new-transcriptomics-analys).
