SPALN information




Overview

Spaln (space-efficient spliced alignment) is a stand-alone program that maps and aligns a set of cDNA or protein sequences onto a whole genomic sequence in a single job. Spaln also performs spliced or ordinary alignment after a rapid similarity search against a protein sequence database, if a genomic segment or an amino acid sequence is given as a query. From version 1.4, spaln supports a combination of a protein sequence database and a given genomic segment. From version 2.2, spaln also performs rapid similarity search and (semi-)global alignment of a set of protein sequence queries against a protein sequence database. Spaln adopts multi-phase heuristics that make it possible to perform the job on a conventional personal computer running under Unix/Linux with limited memory.

The program is written in C++ and distributed as source code and also as executables for a few platforms. Unless binaries are provided for their platform, users must compile the program on their own system. Although the program has been tested only on a Linux operating system, it is likely to be portable to most Unix systems with little or no modification.

The accessory program sortgrcd sorts the gene loci found by spaln in the order of chromosomal position and orientation. From version 2.3.2, spaln and sortgrcd can handle some gzipped files without prior expansion if USE_ZLIB mode is activated upon compilation. From version 2.3.2a, compressed query sequence file(s) are also accepted. From version 2.4.0, multiple files corresponding to different output forms can be generated in a single run.

References

[1] Gotoh, O. "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence" Nucleic Acids Research 36 (8) 2630-2638 (2008).
[2] Gotoh, O. "Direct mapping and alignment of protein sequences onto genomic sequence" Bioinformatics 24 (21) 2438-2444 (2008).
[3] Iwata, H. and Gotoh, O. "Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features" Nucleic Acids Research 40 (20) e161 (2012).
[4] Gotoh, O. "Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps" Bioinformatics 16 (3) 190-202 (2000).
[5] Nagasaki, H., Arita, M., Nishizawa, T., Suwa, M., Gotoh, O. "Automated classification of alternative splicing and transcriptional initiation and construction of a visual database of the classified patterns" Bioinformatics 22 (10) 1211-1216 (2006).
[6] Gotoh, O. "Cooperation of Spaln and Prrn5 for construction of gene-structure-aware multiple sequence alignment" Methods in Molecular Biology, in press.

Present Version: 2.4.1, Last update: 2020-10-09

Install

From source

To compile the source code with the default settings, follow the instructions below. If you download the source file (spaln2.4.0) into the directory download, five directories will be generated under download/spalnXX/ after installation, where XX is a version code. We assume work is your workspace, which may or may not be identical to download.

To modify the location of executables and/or other settings, run './configure --help' at step 6 below. (Warning: a full path name rather than a relative path name must be given for executables or other directories in the arguments of the configure command.) These locations are hard-coded in spaln. The locations of the 'seqdb' and 'table' directories are denoted by seqdb and table, respectively, below. Hence, seqdb=download/spalnXX/seqdb and table=download/spalnXX/table in the default settings.

  1. % mkdir download
  2. % cd download
  3. Download spalnXX.tar.gz
  4. % tar xfz spalnXX.tar.gz
  5. % cd ./spalnXX/src
  6. % ./configure [--help]
  7. % make
  8. % make install
  9. % make clearall
  10. Add download/spalnXX/bin to your PATH. Preferably, add this setting to your start-up rc file (e.g. ~/.bashrc).
    Alternatively, move or copy download/spalnXX/bin/* to a directory on your PATH, if you have not specified the location of executables at step 6 above.
  11. If you have changed the location of the table and/or seqdb directory after installation, set the environment variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
  12. Proceed to Sequence data formation.

From binaries

Binaries for 32-bit (spaln2.0.4.linux32) and 64-bit (spaln2.4.0.linux64) Linux machines are available. The executable will run in the 64-bit Windows 10 WSL environment without any modification. To use the binaries, follow the instructions below.

Case I: Assume the directory work is your workspace where all materials are stored. In this case, seqdb=work.

  1. % mkdir work
  2. % cd work
  3. Download spalnXX.PC.tar.gz, where PC is a platform code
  4. % tar xfz spalnXX.PC.tar.gz
  5. Add work/bin to your PATH
    Or move or copy work/bin/* to a directory on your PATH
  6. % mv ./table/* .; rmdir ./table
  7. % mv ./seqdb/* .; rmdir ./seqdb
  8. Now proceed to Sequence data formation.

Case II: Assume your workspace work is distinct from seqdb

  1. % mkdir download
  2. % cd download
  3. Download spalnXX.PC.tar.gz, where PC is a platform code
  4. % tar xfz spalnXX.PC.tar.gz
  5. Add download/bin to your PATH
    Or move or copy download/bin/* to a directory on your PATH
  6. % setenv ALN_TAB download/table (csh/tcsh)
    $ export ALN_TAB=download/table (sh/bash)
  7. % setenv ALN_DBS download/seqdb (csh/tcsh)
    $ export ALN_DBS=download/seqdb (sh/bash)
  8. Add the above lines to your rc file, so that you do not have to repeat the commands at every login.
  9. Now proceed to Sequence data formation.
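
For sh/bash users, the settings of Case II can be collected in a start-up file. The following is only a minimal sketch: SPALN_HOME is a hypothetical convenience variable (spaln itself does not read it), and its default value is a placeholder for the directory in which the archive was unpacked. Only ALN_TAB and ALN_DBS are the variables named in the steps above.

```shell
# Sketch of lines for ~/.bashrc (sh/bash syntax).
# SPALN_HOME is a hypothetical helper variable, not read by spaln itself;
# replace the default with the full path of your own download directory.
SPALN_HOME="${SPALN_HOME:-$HOME/download}"

# Prepend the bin directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$SPALN_HOME/bin:"*) ;;
  *) PATH="$SPALN_HOME/bin:$PATH" ;;
esac
export PATH

# Locations of the table and seqdb directories (steps 6 and 7).
export ALN_TAB="$SPALN_HOME/table"
export ALN_DBS="$SPALN_HOME/seqdb"
```

Full path names are used here so that the settings do not depend on the directory from which spaln is started.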

Sequence data formation

If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format.
To perform genome mapping, the genomic sequence must be formatted before use. Formatting is optional for amino acid sequence database search.
  1. % cd seqdb
  2. Download or copy genomic sequences or protein database sequences in multi-fasta format. If spaln was compiled with USE_ZLIB activated, gzipped files need not be uncompressed (the file name must be X.gz).
  3. To use the 'makeidx.pl' command, chromosomal sequences must be concatenated into a single file. The extension of the genomic sequence file must be '.mfa' or '.gf', and that of the protein database sequence file must be '.faa', for the 'make' command to take effect. With the 'spaln -W' command, these restrictions do not apply. Hereafter, the file name is assumed to be xxxgnm.gf or prosdb.faa. The total number of residues in a file must be less than 2**32.
  4. To format xxxgnm.gf(.gz), run either the 'spaln -W' command or the 'makeidx.pl' command. The two are equivalent except that the former is faster, accepts multiple input files, and does not need a Makefile. The same two commands are used to format a protein database sequence. As the 'spaln -W' command accepts multiple input files and generates all necessary files in a single operation, you can skip the following instructions if you use it.

    The makeidx.pl command performs the following series of operations 5-6 if the input is a single sequence file.
  5. % make xxxgnm.idx (for genomic sequence) or
    % make prosdb.idx (for protein database sequence)
  6. % make xxxgnm.bkn (for cDNA queries) or
    % make xxxgnm.bkp (for protein queries) or
    % make prosdb.bka (for protein database)
  7. It is possible to generate xxxgnm.idx and the other three files directly from the input files without concatenation:
    This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
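
Before formatting, it may be worth checking the size limit stated in step 3 (fewer than 2**32 residues per file). The sketch below is not part of the spaln distribution; count_residues and fits_spaln_limit are hypothetical helper names, and the count simply assumes that every non-header, non-whitespace character is a residue.

```shell
# count_residues FILE
# Prints the number of sequence characters in a (multi-)fasta file,
# ignoring header lines (those starting with '>') and all whitespace.
# Files named *.gz are read through 'gzip -dc'.
count_residues() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | grep -v '^>' | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# fits_spaln_limit FILE
# Succeeds (exit status 0) if the residue count is below 2**32 = 4294967296.
fits_spaln_limit() {
  n=$(count_residues "$1")
  [ "$n" -lt 4294967296 ]
}
```

For example, `fits_spaln_limit xxxgnm.gf || echo "xxxgnm.gf is too large"` prints a warning when the concatenated file exceeds the limit.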

Execution

  1. Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta or extended (multi-)fasta (see -O6 option) format (denoted by query below). From version 2.3.2a, gzipped fasta file(s) may be used as the query without prior expansion. Note, however, that a compressed query can considerably slow down execution.
  2. Store query in work.
  3. % cd work
  4. Run spaln in one of the following four modes. Spaln does not support comparison between two genomic segments.
    (A) % spaln -Q[0|1|2|3] [-ON] [other options] genome_segment query
    (B) % spaln -Q[4|5|6|7] [-ON] [other options] -[d|D] xxxgnm query
    (C) % spaln -Q[4|5|6|7] [-ON] [other options] -[a|A] prosdb query
    (D) % spaln -Q[4|5|6|7] [-ON] [other options] prosdb.faa query
    In the last case, prosdb.faa will be internally formatted, and the formatted results will be discarded after the end of execution.

    To examine only a subset of the queries, replace query with 'query (from to)', where 'from' and 'to' are the first and last entry numbers in query to be examined.
    To run spaln on multiple CPUs, for example, the following commands may be used and the results may be summarized with sortgrcd, as explained later.
    (a) % spaln -Q7 -O12 -oxxxO1 -dxxxgnm 'query (1 1000)'
    (b) % spaln -Q7 -O12 -oxxxO2 -dxxxgnm 'query (1001 2000)'
    (c) % spaln -Q7 ...
    The procedure is simpler, however, if multi-thread operation is used as follows:
    (d) % spaln -Q7 -ON -oxxx -dxxxgnm -t[N] query

    Options: (default value)

  5. Sortgrcd is used to recover the output of spaln produced with the -O12 option, to apply some filtering, and also to rearrange the output of multiple spaln runs. It is invoked by:
    % sortgrcd [options] X.grd [Y.grd ...] or
    % sortgrcd [options] X.grd.gz [Y.grd.gz ...]
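
The block-wise commands (a) and (b) of step 4 can be generated rather than typed by hand. The sketch below only prints command lines and never runs spaln; count_queries and spaln_chunk_cmds are hypothetical helper names, and the options are copied from examples (a) and (b) without adding any others.

```shell
# count_queries FILE
# Prints the number of entries (header lines) in a (multi-)fasta file.
count_queries() {
  grep -c '^>' "$1"
}

# spaln_chunk_cmds PREFIX GENOME QUERY TOTAL CHUNK
# Prints one spaln command per block of CHUNK queries, numbering the
# output prefixes PREFIXO1, PREFIXO2, ... as in examples (a) and (b).
spaln_chunk_cmds() {
  prefix=$1 genome=$2 query=$3 total=$4 chunk=$5
  from=1 i=1
  while [ "$from" -le "$total" ]; do
    to=$((from + chunk - 1))
    [ "$to" -gt "$total" ] && to=$total    # last block may be shorter
    printf "spaln -Q7 -O12 -o%sO%d -d%s '%s (%d %d)'\n" \
      "$prefix" "$i" "$genome" "$query" "$from" "$to"
    from=$((to + 1))
    i=$((i + 1))
  done
}
```

For instance, `spaln_chunk_cmds xxx xxxgnm query "$(count_queries query)" 1000` prints one line per block of 1000 queries. Assuming each run writes its -O12 output to a .grd file named after its -o prefix, the results could then be combined with `sortgrcd xxxO1.grd xxxO2.grd`.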

Example

Changes from previous versions

Added/modified in Ver. 2.4.1 (2020-10-09):

  1. The algorithm for delimiting a genic region has been modified to find remote terminal coding exon(s) separated by long (up to 99.6% quantile) intron(s) from the main body of the gene.
  2. The -yx0 option now tries to search for missing internal micro exons and terminal very short coding exons.
  3. Selenocysteine (denoted by U) is now regarded as the 21st amino acid, which favorably matches an in-frame TGA termination codon (U in the Tron code) in DNA vs amino acid sequence alignment.
  4. Gene candidates are now sorted according to the final alignment score rather than the intermediate chained HSP score. This modification has improved the chance of obtaining true orthologous hits rather than paralogous hits at the expense of a slight increase in computational load.
  5. Compared with the previous versions, a larger number of species-specific parameter sets (247 <- 102) are provided to support more species (1495 <- 688). Note that some parameter-set identifiers have changed. Please use eight-character species identifiers (e.g. zea_mays) rather than the former parameter-set identifiers (e.g. Magnolio) as the argument of the -T option.

Added/modified in Ver. 2.4.0 (2019-11-18):

  1. Spaln can now directly format genomic sequences without relying on 'make' command. See Sequence data formation.
  2. The internal format of index files is slightly modified. Although previously-formatted files can be used by the new version, the opposite is not true. Note that use of older files with the new version can lead to a slight loss in sensitivity.
  3. The above change was made to facilitate multi-thread operation at format time.
  4. Multiple output forms can be produced in a single run. See -O and -o options.
  5. The traditional bidirectional Hirschberg algorithm is changed to the unidirectional variant.
  6. Also, the bidirectional 'sandwich' or 'attack by both sides' spliced alignment algorithm has been changed to unidirectional 'skipped' spliced alignment algorithm. This and the preceding changes have considerably reduced code complexity.
  7. A local lookup table (xxxgnm.lun or xxxgnm.lup) is generated and used with the -E option. Use this option with caution, as a large amount of disk space is required to store the generated file, and a large amount of memory is required at runtime.
  8. Many small bugs have been fixed.

Copyright (c) 2007-2020 Osamu Gotoh all rights reserved