Community analysis¶
alpha diversity¶
Phylogenetic Diversity (PD)¶
For each environment (i.e. sample), calculates the amount of branch length in a phylogenetic tree that lead to its sequences.
First we will load in a Newick formatted tree.
>>> from cogent.parse.tree import DndParser
>>> from cogent.maths.unifrac.fast_tree import UniFracTreeNode
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> tree = DndParser(tree_in, UniFracTreeNode)
Next we will load information on which sequences in the tree come from which environment.
>>> from cogent.maths.unifrac.fast_unifrac import count_envs
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> envs = count_envs(envs_in)
Finally, we can calculate the PD values for each environment in the tree
>>> from cogent.maths.unifrac.fast_unifrac import PD_whole_tree
>>> envs, PD_values = PD_whole_tree(tree, envs)
>>> print envs
['E_FL', 'E_PA', 'O_FL', 'O_UN', 'R_FL', 'R_PA']
>>> print PD_values
[ 5.85389 7.60352 2.19215 2.81821 3.93728 3.7534 ]
PD_vals
is a numpy
array
with the values representing each environment in envs.
beta diversity¶
Unifrac¶
The Fast UniFrac implementation in PyCogent is the source code for the Fast UniFrac web tool and the QIIME pipeline for Microbial community analysis.
Calculate a UniFrac Distance Matrix and apply PCoA and UPGMA¶
The UniFrac analysis is run on open tree and environment file objects. The resulting dictionary has a distance matrix of pairwise UniFrac values (‘distance_matrix’), a Newick string representing the results of performing UPGMA clustering on this distance matrix (‘cluster_envs’) and the results of running Principal Coordinates Analysis on the distance matrix (‘pcoa’). One can specify weighted UniFrac with weighted=True
. Here we run an unweighted analysis.
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_file(tree_in, envs_in, weighted=False)
>>> print result['cluster_envs']
((((('E_FL':0.339607103063,'R_FL':0.339607103063):0.0279991540511,'R_PA':0.367606257114):0.0103026524101,'E_PA':0.377908909524):0.0223322024492,'O_UN':0.400241111973):0.00976759866402,'O_FL':0.410008710637);
>>> print result['pcoa']
=================================================================================================
Type Label vec_num-0 vec_num-1 vec_num-2 vec_num-3 vec_num-4 vec_num-5
-------------------------------------------------------------------------------------------------
Eigenvectors E_FL 0.05 0.22 -0.09 -0.26 -0.29 -0.00
Eigenvectors E_PA -0.36 0.24 0.21 -0.08 0.18 -0.00
Eigenvectors O_FL 0.32 -0.26 0.30 -0.13 0.05 -0.00
Eigenvectors O_UN -0.28 -0.40 -0.24 -0.04 0.01 -0.00
Eigenvectors R_FL 0.29 0.18 -0.28 0.09 0.22 -0.00
Eigenvectors R_PA -0.02 0.02 0.11 0.42 -0.17 -0.00
Eigenvalues eigenvalues 0.40 0.36 0.29 0.27 0.19 -0.00
Eigenvalues var explained (%) 26.34 23.84 19.06 18.02 12.74 -0.00
-------------------------------------------------------------------------------------------------
Perform pairwise tests of whether samples are significantly different with UniFrac¶
The analysis is run on open tree and environment file objects. In this example, we use unweighted unifrac (weighted=False
), we permute the environment assignments on the tree 50 times (num_iters=50
) and we perform UniFrac on all pairs of environments (test_on="Pairwise"
). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=False, num_iters=50, test_on="Pairwise")
>>> print result[0]
('E_FL', 'E_PA', 0.17999999999999999, 1.0)
Perform a single UniFrac significance test on the whole tree¶
The analysis is run on open tree and environment file objects. In this example, we use weighted unifrac (weighted=True
), we permute the environment assignments on the tree 50 times (num_iters=50
) and we perform a unifrac significance test on the whole tree (test_on="Tree"
). The resulting list has only one item since a single test was performed. It is a 3 item tuple where the second and third values are the p-value.
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=True, num_iters=50, test_on="Tree")
>>> print result
[('whole tree', 0.56000000000000005, 0.56000000000000005)]
P-test¶
Perform pairwise tests of whether samples are significantly different with the P-test (Martin, 2002)¶
The analysis is run on open tree and environment file objects. In this example, we permute the environment assignments on the tree 50 times (num_iters=50
) and perform the p test for all pairs of environments (test_on="Pairwise"
). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.
>>> from cogent.maths.unifrac.fast_unifrac import fast_p_test_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_p_test_file(tree_in, envs_in, num_iters=50, test_on="Pairwise")
>>> print result[0]
('E_FL', 'E_PA', 0.040000000000000001, 0.59999999999999998)
Taxon-based¶
Computing a distance matrix between samples¶
PyCogent provides many different ways to compute pairwise distances between objects. cogent/maths/distance_transform.py
provides a set of functions to calculate dissimilarities/distances between samples, given an abundance matrix. Here is one example:
>>> from cogent.maths.distance_transform import dist_euclidean
>>> from numpy import array
>>> abundance_data = array([[1, 3],
... [5, 2],
... [0.1, 22]],'float')
Note
see distance_transform.py
for other metrics than euclidean
We now have 3 samples, and the abundance of each column (e.g.: species) in that sample. The first sample has 1 individual of species 1, 3 individuals of species 2. We now compute the relatedness between these samples, using euclidean distance between the rows:
>>> dists = dist_euclidean(abundance_data)
>>> print str(dists.round(2))
[[ 0. , 4.12, 19.02]
[ 4.12, 0. , 20.59 ]
[ 19.02, 20.59 , 0. ]]
this distance matrix can be visualized via multivariate reduction techniques such as Multivariate data analysis.