Title: | Phylogenetic Clustering (Phyloclustering) |
---|---|
Description: | Phylogenetic clustering (phyloclustering) is an evolutionary Continuous Time Markov Chain model-based approach to identify population structure from molecular data without assuming linkage equilibrium. The package phyclust (Chen 2011) provides a convenient implementation of phyloclustering for DNA and SNP data, capable of clustering individuals into subpopulations and identifying molecular sequences representative of those subpopulations. It is designed in C for performance, interfaced with R for visualization, and incorporates other popular open source programs including ms (Hudson 2002) <doi:10.1093/bioinformatics/18.2.337>, seq-gen (Rambaut and Grassly 1997) <doi:10.1093/bioinformatics/13.3.235>, Hap-Clustering (Tzeng 2005) <doi:10.1002/gepi.20063> and PAML baseml (Yang 1997, 2007) <doi:10.1093/bioinformatics/13.5.555>, <doi:10.1093/molbev/msm088>, for simulating data, additional analyses, and searching the best tree. See the phyclust website for more information, documentations and examples. |
Authors: | Wei-Chen Chen [aut, cre], Karin Dorman [aut], Yan-Han Chen [ctb] |
Maintainer: | Wei-Chen Chen <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1-34 |
Built: | 2024-11-06 04:26:51 UTC |
Source: | https://github.com/snoweye/phyclust |
This package phyclust (Chen 2011) implements an novel approach combining model-based clusterings and phylogenetics to classify DNA sequences and SNP sequences. Based on evolution models, sequences are assumed to follow a mutation process/distribution clouding around an unknown center ancestor. Based on Continuous Time Markov Chain Theory, mixture distributions are established to model/classify subpopulations or population structures.
The kernel part of the package are implemented in C. EM algorithms are performed to find the maximum likelihood estimators. Initialization methods for EM algorithms are also established. Several evolution models are also developed.
ms
(Hudson 2002) and seq-gen
(Rambaut and Grassly 1997)
are two useful programs to generate coalescent trees and sequences, and
both are merged into phyclust. baseml
of PAML (Yang 1997, 2007)
is also ported into phyclust and it is a program to find a phylogenetic
tree by maximizing likelihood. Hap-Clustering method (Tzeng 2005) for
haplotype grouping is also incorporated into phyclust.
Type help(package = phyclust)
to see a list of major
functions for which further documentations are available. The on-line
detail instructions are also available and the link is given below in the
‘References’ section.
Some C and R functions and R classes of the ape package are also required and modified in phyclust.
The main function is phyclust
controlled by an object
.EMC
generated by a function .EMControl
,
and find.best
can find the best solution by repeating
phyclust
with different initializations.
ms
and seqgen
can generate trees and sequences
based on varied conditions, and they can jointly perform simulations.
paml.baseml
can estimate trees based on sequences.
haplo.post.prob
is a modified version of Tzeng's method
for haplotype grouping which uses a evolution approach to group SNP
sequences.
Some tool functions of the ape package are utilized in this package to perform trees in plots, check object types, and read sequence data.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Chen, W.-C. (2011) “Overlapping codon model, phylogenetic clustering, and alternative partial expectation conditional maximization algorithm”, Ph.D. Diss., Iowa Stat University.
Hudson, R.R. (2002) “Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation”, Bioinformatics, 18, 337-338. http://home.uchicago.edu/~rhudson1/source.html
Rambaut, A. and Grassly, N.C. (1997) “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees”, Computer Applications In The Biosciences, 13:3, 235-238. http://tree.bio.ed.ac.uk/software/seqgen/
Yang, Z. (1997) “PAML: a program package for phylogenetic analysis by maximum likelihood”, Computer Applications in BioSciences, 13, 555-556. http://abacus.gene.ucl.ac.uk/software/paml.html
Yang, Z. (2007) “PAML 4: a program package for phylogenetic analysis by maximum likelihood”, Molecular Biology and Evolution, 24, 1586-1591.
Tzeng, J.Y. (2005) “Evolutionary-Based Grouping of Haplotypes in Association Analysis”, Genetics Epidemiology, 28, 220-231. https://www4.stat.ncsu.edu/~jytzeng/software.php
Paradis E., Claude J., and Strimmer K. (2004) “APE: analyses of phylogenetics and evolution in R language”, Bioinformatics, 20, 289-290. http://ape-package.ird.fr/
phyclust
,
.EMC
,
.EMControl
,
find.best
.
## Not run: library(phyclust, quiet = TRUE) demo(package = "phyclust") demo("ex_trees", package = "phyclust") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) demo(package = "phyclust") demo("ex_trees", package = "phyclust") ## End(Not run)
Methods used in EM Algorithms to deal with boundary problems of
population proportions, .
The first element is the default value.
This is a read-only object and the elemental order is followed in C.
.boundary.method
.boundary.method
A character vector contains implemented boundary methods in C.
The boundary value 0 of the population proportions makes the log
likelihood as -Inf. Since degeneracy of subpopulations can affect
the maximizing processes in EM steps. This problem is usually caused
by bad initializations, and may suggest that number of cluster
may be too large.
Two methods have been implemented when any less than
the lower bound (
or
).
The
ADJUST
(default) will adjust the to the
lower bound, and the
IGNORE
will stop the iterations and return
errors.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
.init.procedure
,
.init.method
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .boundary.method ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .boundary.method ## End(Not run)
Indicate the types of codes for datasets and functions. The first element is the default value. This is a read-only object and the elemental order is followed in C.
.code.type
.code.type
A character vector contains implemented code types in C.
Two possible types are implemented, "NUCLEOTIDE" (default) and "SNP", used in data transfers and indicating substitution models.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
.substitution.model
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .code.type ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .code.type ## End(Not run)
Color themes as used in the package lattice.
.Color
.Color
A character vector contains colors used in plots to identify clusters.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) .Color ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .Color ## End(Not run)
Evolution distances based on certain evolution models as in ape.
The implemented model is used in phyclust.edist
and initializations
of EM algorithms. The first element is the default value.
This is a read-only object and the elemental order is followed in C.
.edist.model
.edist.model
A character vector contains implemented evolution distances in C.
This vector stores possible evolution distances implemented in phyclust.
The default value is D_JC69
computed form the probability of
JC69 model.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) .edist.model ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .edist.model ## End(Not run)
The varied EM algorithms are implemented in C. The first element is the default value. This is a read-only object and the elemental order is followed in C.
.em.method
.em.method
A character vector contains implemented EM algorithms in C.
EM
(default) stands for the standard EM algorithm, ECM
stands for Expectation/Conditional Maximization algorithm, and AECM
stands for Alternating ECM algorithm.
The performance is roughly about AECM
> EM
~ ECM
which
are dependent on the separations of data set.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Dempster, A. and Laird, N. and Rubin, D. (1977) “Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society Series B, 39:3, 1-38.
Meng, X.-L. and Rubin, D. (1993) “Maximum likelihood estimation via the ECM algorithm: A general framework”, Biometrika, 80:2, 511-567.
Meng, X.-L. and van Dyk, D. (1997) “The EM Algorithm — an Old Folk-song Sung to a Fast New Tune (with discussion)”, Journal of the Royal Statistical Society Series B, 59, 511-567.
.show.option
,
.EMC
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .em.method ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .em.method ## End(Not run)
An default template object stores controlling options for phyclust
to perform EM algorithms. This object combines all other read-only
objects and more required options for EM algorithms. This is essential
for phyclust
and other related functions.
.EMC
.EMC
A list contains all controlling options
A list created by .EMControl
contains all controlling options for
EM algorithms. This list will be directly passed to C codes and control
the every things of EM algorithms.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .EMC ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .EMC ## End(Not run)
Generate an EM control (.EMC
) controlling the options,
methods, conditions and models of EM algorithms.
As .EMC
, this function generate a default template.
One can either modify .EMC
or employ this function to
control EM algorithms.
.EMControl(exhaust.iter = 1, fixed.iter = 5, short.iter = 100, EM.iter = 1000, short.eps = 1e-2, EM.eps = 1e-6, cm.reltol = 1e-8, cm.maxit = 5000, nm.abstol.Mu.given.QA = 1e-8, nm.reltol.Mu.given.QA = 1e-8, nm.maxit.Mu.given.QA = 500, nm.abstol.QA.given.Mu = 1e-8, nm.reltol.QA.given.Mu = 1e-8, nm.maxit.QA.given.Mu = 5000, est.non.seg.site = FALSE, max.init.iter = 50, init.procedure = .init.procedure[1], init.method = .init.method[1], substitution.model = .substitution.model$model[1], edist.model = .edist.model[1], identifier = .identifier[1], code.type = .code.type[1], em.method = .em.method[1], boundary.method = .boundary.method[1], min.n.class = 1, se.type = FALSE, se.model = .se.model[1], se.constant = 1e-2)
.EMControl(exhaust.iter = 1, fixed.iter = 5, short.iter = 100, EM.iter = 1000, short.eps = 1e-2, EM.eps = 1e-6, cm.reltol = 1e-8, cm.maxit = 5000, nm.abstol.Mu.given.QA = 1e-8, nm.reltol.Mu.given.QA = 1e-8, nm.maxit.Mu.given.QA = 500, nm.abstol.QA.given.Mu = 1e-8, nm.reltol.QA.given.Mu = 1e-8, nm.maxit.QA.given.Mu = 5000, est.non.seg.site = FALSE, max.init.iter = 50, init.procedure = .init.procedure[1], init.method = .init.method[1], substitution.model = .substitution.model$model[1], edist.model = .edist.model[1], identifier = .identifier[1], code.type = .code.type[1], em.method = .em.method[1], boundary.method = .boundary.method[1], min.n.class = 1, se.type = FALSE, se.model = .se.model[1], se.constant = 1e-2)
exhaust.iter |
number of iterations for "exhaustEM", default = 1. |
fixed.iter |
number of iterations for "RndpEM", default = 5. |
short.iter |
number of short-EM steps, default = 100. |
EM.iter |
number of long-EM steps, default = 1000. |
short.eps |
tolerance of short-EM steps, default = 1e-2. |
EM.eps |
tolerance of long-EM steps, default = 1e-6. |
cm.reltol |
relative tolerance for a CM step, default = 1e-8 |
cm.maxit |
maximum number iteration for a CM step, default = 5000. |
nm.abstol.Mu.given.QA |
see ‘Details’, default = 1e-8 |
nm.reltol.Mu.given.QA |
see ‘Details’, default = 1e-8 |
nm.maxit.Mu.given.QA |
see ‘Details’, default = 500. |
nm.abstol.QA.given.Mu |
see ‘Details’, default = 1e-8 |
nm.reltol.QA.given.Mu |
see ‘Details’, default = 1e-8 |
nm.maxit.QA.given.Mu |
see ‘Details’, default = 5000. |
est.non.seg.site |
estimate non-segregation sites, default = FALSE. |
max.init.iter |
maximum number of initialization iteration, default = 50. |
init.procedure |
initialization procedure, default = "exhaustEM". |
init.method |
initialization method, default = "randomMu". |
substitution.model |
substitution model, default = "JC69". |
edist.model |
evolution distance, default = |
identifier |
identifier, default = "EE". |
code.type |
code type, default = "NUCLEOTIDE". |
em.method |
EM method, default = "EM". |
boundary.method |
boundary method, default = |
min.n.class |
minimum number of sequences in a cluster, default = 1. |
se.type |
sequencing error type, default = FALSE. |
se.model |
sequencing error model, default = "CONVOLUTION". |
se.constant |
constrained constant, default = 1e-2. |
exhaust.iter
, fixed.iter
, short.iter
, and
short.eps
are used to control the iterations of initialization
procedures and methods.
EM.iter
and EM.eps
are used to control the EM iterations.
cm.reltol
and cm.maxit
are used to control the ECM
iterations.
Arguments starting with nm.
are options for the Nelder-Mead
method as in optim
. The C codes of Nelder-Mead are modified
from the R math library and the options are all followed.
abstol
and reltol
are for absolute and relative tolerances.
Mu.given.QA
is for maximizing the profile function of
given
, and
QA.given.Mu
is for maximizing the profile function of
given
.
est.non.seg.site
indicates whether to estimate the states of
center sequences. If FALSE, the states will be fixed as the
non segregating sites. Usually, there is no need to estimate.
max.init.iter
is for certain initialization methods, e.g.
randomNJ
and K-Medoids
need few tries to obtain an
appropriate initial state.
init.procedure
and init.method
are for initializations.
min.n.class
is the minimum number of sequences in a cluster to
avoid bad initialization state and degenerated clusters.
se.type
, se.model
, and se.constant
which are used only
for sequencing error models and only for nucleotide data without labels.
This function returns a list as .EMC
.
The sequencing error controls are stored in
se.type
, se.model
, and se.constant
, for
sequencing error type, model, and constrained constant of errors,
respectively.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
.EMC
,
.boundary.method
,
.code.type
,
.edist.model
,
.em.method
,
.identifier
,
.init.method
,
.init.procedure
,
.substitution.model
,
optim
,
phyclust
,
phyclust.se
.
## Not run: library(phyclust, quiet = TRUE) # The same as .EMC .EMControl() # Except code.type, all others are the same as .EMC .EMControl(code.type = "SNP") .EMControl(code.type = .code.type[2]) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # The same as .EMC .EMControl() # Except code.type, all others are the same as .EMC .EMControl(code.type = "SNP") .EMControl(code.type = .code.type[2]) ## End(Not run)
Identifiers for evolution models identify the matrix and
evolution time
for subpopulations.
The first element is the default value.
This is a read-only object and the elemental order is followed in C.
.identifier
.identifier
A character vector contains implemented identifiers in C.
Four major identifiers are implemented in C, EE
, EV
,
VE
, and VV
. The first letter indicates the structure
for matrix, and the second letter indicates the
evolution time
for subpopulations.
E
and
V
indicate equal and varied for all subpopulations.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
.EMC
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .identifier ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .identifier ## End(Not run)
The varied initialization methods are implemented in C. The first element is the default value. This is a read-only object and the elemental order is followed in C.
.init.method
.init.method
A character vector contains implemented initialization methods in C.
randomMu
, NJ
, randomNJ
, PAM
,
K-Medoids
and manualMu
are implemented where
the codes for the NJ are modified from ape, and the codes
for the PAM method are modified from cluster.
These methods are only provide initializations for EM algorithms.
'randomMu'randomly picks centers and assigns all sequences near by the center according an evolution distance.
'NJ'bases on a neighbor-joining tree and partitions the tree by the long branches into subtrees to form clusters.
'randomNJ'randomly partition a neighbor-joining tree into subtrees to form clusters.
'PAM'uses the partition around medoids algorithm to locate the centers of dataset.
'K-Medoids'performs K-Means types algorithms to randomly and roughly locate centers and form clusters.
'manualMu'requires a vector containing class ids for all sequences.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Saitou, N. and Nei, M. (1987), “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees”, Molecular Biology and Evolution, 4:4, 406-425.
Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, Wiley.
Theodoridis, S. and Koutroumbas, K. (2006), Pattern Recognition, 3rd ed., pp. 635.
.show.option
,
.init.procedure
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .init.method ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .init.method ## End(Not run)
The varied initialization procedures are implemented in C. The first element is the default value. This is a read-only object and the elemental order is followed in C.
.init.procedure
.init.procedure
A character vector contains implemented initialization procedures in C.
exhaustEM
, emEM
, RndEM
, and RndpEM
are
implemented. Based on initialization states given by a initialization
method, see .init.method
for more information. These procedures
will search a better starting states for final EM steps.
'exhaustEM'runs each initialization with EM steps until convergent, and pick the best one of the convergence as the return result.
'emEM'uses few short EM steps to improve initialization, then pick the best of initialization state for long EM steps, and returns the final result.
'RandEM'bases on initialization methods to generate initialization states, the number is equal to short EM steps, then pick the best of initialization state for long EM steps, and returns the final result.
'RandEM'bases on initialization methods to generate initialization states and run a fixed number of EM steps, until total steps exhaust short EM steps, then pick the best of initialization state for long EM steps, and returns the final result.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Biernacki, C. and Celeux, G. and Govaert, G. (2003) “Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models”, Computational Statistics and Data Analysis, 41:3, 561-575.
Maitra, R. (2009) “Initializing partition-optimization algorithms”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6:1, 114-157.
.show.option
,
.init.method
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .init.procedure ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .init.procedure ## End(Not run)
An object stores label method for un-, semi-, and general semi- supervised clustering.. This is a read-only object and the elemental order is followed in C.
.label.method
.label.method
A character vector contains implemented evolution distances in C.
This vector stores possible label methods implemented in phyclust.
The default value is NONE
for unsupervised clustering.
SEMI
is for semi-supervised clustering, and
GENERAL
is for general semi-supervised clustering.
Only un- and semi-supervised clustering are implemented.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) .label.method ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .label.method ## End(Not run)
An object stores sequencing error models.
.se.model
.se.model
A character vector contains all possible sequencing models.
Currently, only a CONVOLUTION
model is implemented.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) .se.model ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .se.model ## End(Not run)
This function show available options for functions in phyclust.
.show.option()
.show.option()
This function show some available options for functions in phyclust.
They are used in .EMControl
, phyclust
, ... etc, and
options are stored in several objects separately. They will be passed
into C, so the elemental order are important. Basically, they are all
read-only objects.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.boundary.method
,
.code.type
,
.edist.model
,
.em.method
,
.EMC
,
.EMControl
,
.identifier
,
.init.method
,
.init.procedure
,
.nucleotide
,
.snp
,
.substitution.model
,
## Not run: library(phyclust, quiet = TRUE) .show.option() ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .show.option() ## End(Not run)
An object stores substitution models for mutation processes for Continuous Time Markov Chain theory. This is a read-only object and the elemental order is followed in C.
.substitution.model
.substitution.model
A data frame contains two character vectors, mode
and code.type
.
This data frame indicates substitution models implemented in C.
'model': names of substitution models for mutations.
'code.type': code types of substitution models, either NUCLEOTIDE or SNP.
The major models are:
Model | Author and Publication | Parameter |
JC69 | Jukes and Cantor 1969. | |
K80 | Kimura 1980. | |
F81 | Felsenstein 1981. | |
HKY85 | Hasegawa, Kishino, and Yano 1985. | |
Other models starting with E_
use empirical frequencies for
equilibrium probabilities.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Jukes, T. H. and Cantor, C. R. (1969) “Evolution of Protein Molecules”, Mammalian Protein Metabolism, 3, 21-132
Kimura, M. (1980) “A Simple Method for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences”, Journal of Molecular Evolution, 16, 111-120
Felsenstein, J. (1981) “Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach”, Journal of Molecular Evolution, 17, 368-376
Hasegawa, M. and Kishino, H. and Yano, T. (1985) “Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA”, Journal of Molecular Evolution, 22:2, 160-174
.show.option
,
.code.type
,
.identifier
,
.EMControl
,
phyclust
.
## Not run: library(phyclust, quiet = TRUE) .substitution.model ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .substitution.model ## End(Not run)
Coerce a rooted tree generating by ms
to a star tree and
maintain a bifurcation structure.
as.star.tree(rooted.tree, keep.bifurcation = TRUE)
as.star.tree(rooted.tree, keep.bifurcation = TRUE)
rooted.tree |
a rooted tree in |
keep.bifurcation |
keep a bifurcation structure. |
A tree with a star shape means that all internal branches are 0 and all leaf branches are equal.
The rooted.tree
should be in a phylo
class of ape,
and may be created by ms
.
Basically, it is a list with an attribute that the class is phylo, and the other elements are:
'edge'edge ids.
'Nnode'number of internal nodes.
'tip.lab'number of tips (leaves).
'edge.length'length of edges.
If keep.bifurcation
is TRUE, then internal branches are set to be 0
and leaves branches are set to the original tree height. Otherwise,
the internal branches will be dropped from rooted.tree.
Return a rooted tree in Class phylo
with a star shape.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
ms
,
read.tree
,
as.phylo
,
plot.phylo
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(5, 1, opts = paste("-T", sep = " ")) tree.ms <- read.tree(text = ret.ms[3]) str(tree.ms) (tree.star <- as.star.tree(tree.ms)) # Plot results par(mfrow = c(1, 2)) plot(tree.ms, type = "u", main = "original tree") plot(tree.star, type = "u", main = "as star tree") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(5, 1, opts = paste("-T", sep = " ")) tree.ms <- read.tree(text = ret.ms[3]) str(tree.ms) (tree.star <- as.star.tree(tree.ms)) # Plot results par(mfrow = c(1, 2)) plot(tree.ms, type = "u", main = "original tree") plot(tree.star, type = "u", main = "as star tree") ## End(Not run)
This function bootstraps sequences from a model fitted by phyclust
and star trees generated by bootstrap.star.trees
.
The fitted model can be varied in .identifier
.
bootstrap.seq(ret.phyclust, star.trees)
bootstrap.seq(ret.phyclust, star.trees)
ret.phyclust |
a phyclust object in |
star.trees |
star trees might be generated by |
ret.phyclust
is a phyclust object in Class phyclust
which is usually
fitted by phyclust
, or returned by phyclust.m.step
.
star.trees
should be corresponding to the ret.phyclust
which might
be directly bootstrapped from the function bootstrap.star.trees
.
Return a list containing sequences in clusters.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
bootstrap.star.trees
,
bootstrap.star.trees.seq
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) ret.tree <- bootstrap.star.trees(ret.1) ret.seq <- bootstrap.seq(ret.1, ret.tree) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) ret.tree <- bootstrap.star.trees(ret.1) ret.seq <- bootstrap.seq(ret.1, ret.tree) ## End(Not run)
This function simplifies the bootstrap function
bootstrap.star.trees.seq()
, and only
return a list object with class seq.data
.
bootstrap.seq.data(ret.phyclust, min.n.class = 1)
bootstrap.seq.data(ret.phyclust, min.n.class = 1)
ret.phyclust |
a phyclust object in |
min.n.class |
minimum number of sequences for a cluster. |
ret.phyclust
is a phyclust object in Class phyclust
which is usually
fitted by phyclust
, or returned by phyclust.m.step
.
min.n.class
is a boundary condition to avoid degenerate clusters
when some population proportions, , are small in the
fitted model.
Return an object in Class seq.data
as
the result from read.*()
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
bootstrap.star.trees
,
Class seq.data
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) (ret.all <- bootstrap.seq.data(ret.1)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) (ret.all <- bootstrap.seq.data(ret.1)) ## End(Not run)
This function bootstraps a star tree from a model fitted by phyclust
.
Each cluster corresponds to a star tree and a center sequence where
sequences will evolve from. This function is called by
bootstrap.star.trees.seq
to generate sequences.
The fitted model can be varied in .identifier
.
bootstrap.star.trees(ret.phyclust, min.n.class = 1)
bootstrap.star.trees(ret.phyclust, min.n.class = 1)
ret.phyclust |
a phyclust object in |
min.n.class |
minimum number of sequences for a cluster. |
ret.phyclust
is a phyclust object in Class phyclust
which is usually
fitted by phyclust
, or returned by phyclust.m.step
.
min.n.class
is a boundary condition to avoid degenerate clusters
when some population proportions, , are small in the
fitted model.
Return a list containing star trees according to
ret.phyclust
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
bootstrap.seq
,
bootstrap.star.trees.seq
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) ret.trees <- bootstrap.star.trees(ret.1) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) ret.trees <- bootstrap.star.trees(ret.1) ## End(Not run)
This function bootstraps sequences from a model fitted by phyclust
by combining two functions bootstrap.star.trees
and
bootstrap.seq
.
The fitted model can be varied in .identifier
.
bootstrap.star.trees.seq(ret.phyclust, min.n.class = 1)
bootstrap.star.trees.seq(ret.phyclust, min.n.class = 1)
ret.phyclust |
a phyclust object in |
min.n.class |
minimum number of sequences for a cluster. |
ret.phyclust
is a phyclust object in Class phyclust
which is usually
fitted by phyclust
, or returned by phyclust.m.step
.
min.n.class
is a boundary condition to avoid degenerate clusters
when some population proportions, , are small in the
fitted model.
Return a list containing two elements, and both are corresponding to the
model of ret.phyclust
, including:
trees |
a list, |
seq |
a list, sequences in |
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
bootstrap.star.trees
,
bootstrap.seq
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) ret.all <- bootstrap.star.trees.seq(ret.1) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) ret.1 <- phyclust(seq.data.toy$org, 2, EMC = EMC.1) ret.all <- bootstrap.star.trees.seq(ret.1) ## End(Not run)
Transfer nucleotide codes (A, G, C, T, -) and nucleotide ids (0, 1, 2, 3, 4).
### S3 methods for a list, vector or matrix (default). code2nid(codeseq) nid2code(nidseq, lower.case = TRUE)
### S3 methods for a list, vector or matrix (default). code2nid(codeseq) nid2code(nidseq, lower.case = TRUE)
codeseq |
a character vector contains nucleotide codes, A, G, C, T, or -. |
nidseq |
a numerical vector contains nucleotide ids, 0, 1, 2, 3, or 4. |
lower.case |
transfer in lower cases. |
These functions are based on the internal object .nucleotide
to
transfer codes and nids.
code2nid
returns a numerical vector containing nucleotide ids, and
nid2code
returns a character vector containing nucleotide codes.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.nucleotide
,
snp2sid
,
sid2snp
,
code2snp
,
snp2code
.
## Not run: library(phyclust, quiet = TRUE) a <- c("A", "C", "G", "-", "T") code2nid(a) nid2code(code2nid(a)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) a <- c("A", "C", "G", "-", "T") code2nid(a) nid2code(code2nid(a)) ## End(Not run)
Transfer nucleotide codes (A, G, C, T, -) and SNPs (1, 2, -). Transfer nucleotide ids (0, 1, 2, 3, 4) and SNP ids (0, 1, 2).
### S3 methods for a list, vector or matrix (default). code2snp(codeseq) snp2code(snpseq, half = TRUE) nid2sid(nidseq) sid2nid(sidseq, half = TRUE)
### S3 methods for a list, vector or matrix (default). code2snp(codeseq) snp2code(snpseq, half = TRUE) nid2sid(nidseq) sid2nid(sidseq, half = TRUE)
codeseq |
a character vector contains nucleotide codes, A, G, C, T, or -. |
snpseq |
a character vector contains SNPs, 1, 2, or -. |
half |
nucleotide codes will be half assigned, see the ‘Details’ for more information. |
nidseq |
a numerical vector contains nucleotide ids, 0, 1, 2, 3, or 4. |
sidseq |
a numerical vector contains SNP ids, 0, 1, or 2. |
These functions are based on the internal object .nucleotide
and
.snp
to transfer nucleotide codes and SNPs.
For code2snp
, A, G are transfered to 1, and C, T are transfered to 2.
For snp2code
, 1 is transfered half to A and G, and 2 is transfered
half to C and T if half = TRUE
. Otherwise, 1 is all transfered to A,
and 2 is all transfered to C.
code2nid
returns a character vector containing nucleotide ids, and
nid2code
returns a character vector containing nucleotide codes.
nid2sid
returns a numerical vector containing SNP ids, and
sid2nid
returns a numerical vector containing nucleotide ids.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.nucleotide
,
.snp
,
code2nid
,
nid2code
,
snp2sid
,
sid2snp
.
## Not run: library(phyclust, quiet = TRUE) # For codes a.vector <- c("A", "C", "G", "-", "T") code2snp(a.vector) snp2code(code2snp(a.vector)) snp2code(code2snp(a.vector), half = FALSE) # For ids a.sid.vector <- c(0, 2, 1, 4, 3) nid2sid(a.sid.vector) sid2nid(nid2sid(a.sid.vector)) sid2nid(nid2sid(a.sid.vector), half = FALSE) # Test list a.list <- list(a, a) code2snp(a.list) snp2code(code2snp(a.list)) snp2code(code2snp(a.list), half = FALSE) # Test matrix a.matrix <- rbind(a, a) code2snp(a.matrix) snp2code(code2snp(a.matrix)) snp2code(code2snp(a.matrix), half = FALSE) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # For codes a.vector <- c("A", "C", "G", "-", "T") code2snp(a.vector) snp2code(code2snp(a.vector)) snp2code(code2snp(a.vector), half = FALSE) # For ids a.sid.vector <- c(0, 2, 1, 4, 3) nid2sid(a.sid.vector) sid2nid(nid2sid(a.sid.vector)) sid2nid(nid2sid(a.sid.vector), half = FALSE) # Test list a.list <- list(a, a) code2snp(a.list) snp2code(code2snp(a.list)) snp2code(code2snp(a.list), half = FALSE) # Test matrix a.matrix <- rbind(a, a) code2snp(a.matrix) snp2code(code2snp(a.matrix)) snp2code(code2snp(a.matrix), half = FALSE) ## End(Not run)
Great pony 625 EIAV dataset is published by Baccam, P., et al. (2003),
and they are also available on NCBI database.
This is a follow-up study of Data Pony 618
.
A text file in fasta format is stored in the data subdirectory.
EIAV rev dataset contains 62 nucleotide sequences and 406 sites.
Baccam, P., et al. (2003).
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Baccam, P., et al. (2003) “Subpopulations of Equine Infectious Anemia Virus Rev Coexist In Vivo and Differ in Phenotype”, Journal of Virology, 77, 12122-12131.
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/pony625.fas", sep = "") # edit(file = data.path) my.pony.625 <- read.fasta(data.path) str(my.pony.625) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/pony625.fas", sep = "") # edit(file = data.path) my.pony.625 <- read.fasta(data.path) str(my.pony.625) ## End(Not run)
Crohn's disease dataset is published by Hugot, et al. (2001).
A text file in phylip format is stored in the data subdirectory.
Crohn's disease dataset is used to perform haplotype grouping used in Tzeng's paper (2005).
Totally, 1102 haplotypes/SNP sequences and 8 sites.
Hugot, J.P., et al. (2001).
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Hugot, J.P., et al. (2001) “Association of NOD2 Leucine-Rich Repeat Variants with Susceptibility to Crohn's Disease”, Nature, 411, 599-603.
Tzeng, J.Y. (2005) “Evolutionary-Based Grouping of Haplotypes in Association Analysis”, Genetics Epidemiology, 28, 220-231. https://www4.stat.ncsu.edu/~jytzeng/software.php
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") # edit(file = data.path) my.snp <- read.phylip(data.path, code.type = "SNP") str(my.snp) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") # edit(file = data.path) my.snp <- read.phylip(data.path, code.type = "SNP") str(my.snp) ## End(Not run)
Great pony 524 EIAV dataset is published by Baccam, P., et al. (2003),
and they are also available on NCBI database.
There is a follow-up study, Data Pony 625
.
A text file in phylip format is stored in the data subdirectory.
EIAV rev dataset contains 146 nucleotide sequences and 405 sites.
Belshan, M., et al. (2001).
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Belshan, M., et al. (2001) “Genetic and Biological Variation in Equine Infectious Anemia Virus Rev Correlates with Variable Stages of Clinical Disease in an Experimentally Infected Pony”, Virology, 279, 185-200.
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/pony524.phy", sep = "") # edit(file = data.path) my.pony.524 <- read.phylip(data.path) str(my.pony.524) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/pony524.phy", sep = "") # edit(file = data.path) my.pony.524 <- read.phylip(data.path) str(my.pony.524) ## End(Not run)
Two major file formats are supported in phyclust,
Format phylip
and Format fasta
.
These functions only read files in basic syntax, and
return an object in Class seq.data
.
read.fasta(filename, byrow = TRUE, code.type = .code.type[1], aligned = TRUE, sep = "") read.fasta.format(filename, byrow = TRUE, aligned = TRUE, sep = "") read.phylip(filename, byrow = TRUE, code.type = .code.type[1], sep = "") read.phylip.format(filename, byrow = TRUE, sep = "")
read.fasta(filename, byrow = TRUE, code.type = .code.type[1], aligned = TRUE, sep = "") read.fasta.format(filename, byrow = TRUE, aligned = TRUE, sep = "") read.phylip(filename, byrow = TRUE, code.type = .code.type[1], sep = "") read.phylip.format(filename, byrow = TRUE, sep = "")
filename |
a file name where data is read from. |
byrow |
advanced option, default = TRUE. |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
aligned |
indicate aligned data. |
sep |
use to split sites, "" (default) and "," for "CODON". |
For unaligned sequences, read.fasta
returns a list storing data.
read.phylip
is only for aligned data and returns a matrix.
read.fasta.format
and read.phylip.format
will read in
original coding without any transformation as code.type = NULL
in write.fasta
and write.phylip
. Suppose these functions
return an object ret
, one can write other functions ret2aa()
to post transform the coding and replace ret$org
by the results of
ret2aa(ret$org.code)
.
byrow
indicates the data will be store by row or not. Usually,
the default is TRUE. The FALSE is only for advance users
with careful manipulations and for speeding up the bootstraps.
sep
can specify a character which is used to split sites in file.
By default, "" denote no character between sites. Only "CODON" id requires
a character to avoid ambiguity
Return an object in Class seq.data
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) # PHYLIP data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") (my.snp <- read.phylip(data.path, code.type = "SNP")) # FASTA data.path <- paste(.libPaths()[1], "/phyclust/data/pony625.fas", sep = "") (my.pony <- read.fasta(data.path)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # PHYLIP data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") (my.snp <- read.phylip(data.path, code.type = "SNP")) # FASTA data.path <- paste(.libPaths()[1], "/phyclust/data/pony625.fas", sep = "") (my.pony <- read.fasta(data.path)) ## End(Not run)
Two major file formats are supported in phyclust,
Format phylip
and Format fasta
.
These functions only write files in basic syntax.
write.fasta(seqdata, filename, classid = NULL, seqname = NULL, width.line = 60, lower.case = FALSE, code.type = .code.type[1], sep = "") write.fasta.format(seqdata, filename, classid = NULL, seqname = NULL, width.line = 60, sep = "") write.phylip(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, lower.case = FALSE, code.type = .code.type[1], sep = "") write.phylip.format(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, sep = "") write.paml(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, lower.case = FALSE, code.type = .code.type[1], sep = "") write.paml.format(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, sep = "")
write.fasta(seqdata, filename, classid = NULL, seqname = NULL, width.line = 60, lower.case = FALSE, code.type = .code.type[1], sep = "") write.fasta.format(seqdata, filename, classid = NULL, seqname = NULL, width.line = 60, sep = "") write.phylip(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, lower.case = FALSE, code.type = .code.type[1], sep = "") write.phylip.format(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, sep = "") write.paml(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, lower.case = FALSE, code.type = .code.type[1], sep = "") write.paml.format(seqdata, filename, classid = NULL, seqname = NULL, width.seqname = 10, width.line = 60, sep = "")
seqdata |
a matrix contains sequence ids as |
filename |
a file name where data is written to. |
classid |
class id of sequences. |
seqname |
sequence names. |
width.seqname |
number of characters of sequence names to be stored. |
width.line |
width of lines for breaking lines. |
lower.case |
use lower case of letters to write |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
sep |
a character to split sites, "" (default) and "," for "CODON". |
write.fasta
, write.phylip
, and write.paml
are general
functions call write.fasta.format
, write.phylip.format
and
write.paml.format
.
write.fasta.format
, write.phylip.format
, and
wirte.paml.format
will not do any transformation for input sequences,
but directly write them into the file as code.type = NULL
in
write.fasta
, write.phylip
and write.paml
.
Note that PAML uses one of PHYLIP format to deal with sequence files, so
write.paml.format
is to write files in a different format of
write.phylip.format
. The main purpose of write.paml
and
write.paml.format
is to generate files for pamle.baseml
.
sep
can specify a character which is used to split sites in file.
By default, "" denote no character between sites. Only "CODON" id requires
a character to avoid ambiguity.
Save a text file.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) # PHYLIP data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") my.snp <- read.phylip(data.path, code.type = "SNP") write.phylip(my.snp$org, "new.crohn.phy", code.type = "SNP") # FASTA data.path <- paste(.libPaths()[1], "/phyclust/data/pony625.fas", sep = "") (my.pony <- read.fasta(data.path)) write.fasta(my.pony$org, "new.pony.fas") # PAML write.paml(my.pony$org, "new.pony.pam") # Amino acid in PHYLIP aa.aid <- nid2aid(my.pony$org) aa.acode <- aid2acode(aa.aid) write.phylip(aa.aid, "new.pony.aa.phy", code.type = "AMINO_ACID") write.phylip.format(aa.aid, "new.pony.aa.aid.phy", sep = ",") write.phylip.format(aa.acode, "new.pony.aa.acode.phy") # Amino acid in FASTA write.fasta(aa.aid, "new.pony.aa.phy", code.type = "AMINO_ACID") write.fasta.format(aa.aid, "new.pony.aa.aid.phy", sep = ",") write.fasta.format(aa.acode, "new.pony.aa.acode.fas") # Amino acid in PAML write.paml(aa.aid, "new.pony.aa.pam", code.type = "AMINO_ACID") write.paml.format(aa.aid, "new.pony.aa.aid.pam", sep = ",") write.paml.format(aa.acode, "new.pony.aa.acode.pam") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # PHYLIP data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") my.snp <- read.phylip(data.path, code.type = "SNP") write.phylip(my.snp$org, "new.crohn.phy", code.type = "SNP") # FASTA data.path <- paste(.libPaths()[1], "/phyclust/data/pony625.fas", sep = "") (my.pony <- read.fasta(data.path)) write.fasta(my.pony$org, "new.pony.fas") # PAML write.paml(my.pony$org, "new.pony.pam") # Amino acid in PHYLIP aa.aid <- nid2aid(my.pony$org) aa.acode <- aid2acode(aa.aid) write.phylip(aa.aid, "new.pony.aa.phy", code.type = "AMINO_ACID") write.phylip.format(aa.aid, "new.pony.aa.aid.phy", sep = ",") write.phylip.format(aa.acode, "new.pony.aa.acode.phy") # Amino acid in FASTA write.fasta(aa.aid, "new.pony.aa.phy", code.type = "AMINO_ACID") write.fasta.format(aa.aid, "new.pony.aa.aid.phy", sep = ",") write.fasta.format(aa.acode, "new.pony.aa.acode.fas") # Amino acid in PAML write.paml(aa.aid, "new.pony.aa.pam", code.type = "AMINO_ACID") write.paml.format(aa.aid, "new.pony.aa.aid.pam", sep = ",") write.paml.format(aa.acode, "new.pony.aa.acode.pam") ## End(Not run)
Based on input initialization procedures and methods, this function tries to find the best solution in terms of the highest log-likelihood value.
find.best(X, K, EMC = .EMC, manual.id = NULL, byrow = TRUE, init.procedure = .init.procedure, init.method = .init.method, file.tmp = NULL, visible = FALSE, save.all = FALSE)
find.best(X, K, EMC = .EMC, manual.id = NULL, byrow = TRUE, init.procedure = .init.procedure, init.method = .init.method, file.tmp = NULL, visible = FALSE, save.all = FALSE)
X |
nid/sid matrix with |
K |
number of clusters. |
EMC |
EM control. |
manual.id |
manually input class ids. |
byrow |
advanced option for |
init.procedure |
customized initialization procedures. |
init.method |
customized initialization methods. |
file.tmp |
a file for saving temporary results. |
visible |
TRUE for reporting iterations. |
save.all |
TRUE for saving all results. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
Note: gaps -
are not supported yet, drop them from data.
EMC
contains all options used for EM algorithms.
manual.id
manually input class ids as an initialization only for
the initialization method, 'manualMu'.
byrow
used in bootstraps to avoid transposing matrix 'X'. If
FALSE, then the 'X' should be have the dimension .
init.procedure
and init.method
are methods for searching
the best result. This function will try all combinations of these two
options.
file.tmp
is used to save temporary results due to long computing.
If NULL
, there will no saving in each combinations.
An list with class phyclust
will be returned containing
several elements, see phyclust
for detail.
implement codes for gaps -
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMControl(exhaust.iter = 1, short.iter = 5, EM.iter = 5) (ret.1 <- find.best(seq.data.toy$org, 2, EMC = EMC.1)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMControl(exhaust.iter = 1, short.iter = 5, EM.iter = 5) (ret.1 <- find.best(seq.data.toy$org, 2, EMC = EMC.1)) ## End(Not run)
Based on the input data, this function will search all data along all sites to find a consensus sequence which may be or may not be one of the data.
find.consensus(X, code.type = .code.type[1], with.gap = FALSE)
find.consensus(X, code.type = .code.type[1], with.gap = FALSE)
X |
nid/sid matrix with |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
with.gap |
FALSE (default) for no gap in consensus sequence. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
A vector containing the consensus sequence with length
will be returned.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) find.consensus(seq.data.toy$org) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) find.consensus(seq.data.toy$org) ## End(Not run)
Generate comprehensive trees for simulation studies.
gen.equal.star.anc.dec(K, N.K, rate.f = 0.5)
gen.equal.star.anc.dec(K, N.K, rate.f = 0.5)
K |
number of clusters, |
N.K |
number of sequences for each cluster, a vector with length |
rate.f |
|
These functions generates an ancestral tree in K tips and
generates descendent trees according to N.K
tips.
All trees, ancestral and descendent, are coerced to star shapes
and scaled their heights to fit the ratio rate.f
, and
the final tree has total height 1.
The returns are stored in a list, and the final tree is stored
with a name equal.star
.
A list contains all information of generation and results including:
'K' |
number of clusters. |
'N.K |
number of sequences for each cluster. |
'rate.f' |
|
'anc' |
an ancestral tree. |
'dec' |
all descendent trees. |
'equalstar' |
a tree that descendants are equal star trees. |
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) tree.K <- gen.equal.star.anc.dec(6, rep(3:5, 2), rate.f = 0.7) X.class <- as.numeric(gsub("d(.).(.)", "\\1", tree.K$equal.star$tip.label)) # Plot results plotnj(tree.K$equal.star, X.class, type = "p", edge.width.class = 2, main = "equal.star") axis(1) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) tree.K <- gen.equal.star.anc.dec(6, rep(3:5, 2), rate.f = 0.7) X.class <- as.numeric(gsub("d(.).(.)", "\\1", tree.K$equal.star$tip.label)) # Plot results plotnj(tree.K$equal.star, X.class, type = "p", edge.width.class = 2, main = "equal.star") axis(1) ## End(Not run)
These functions call seqgen
to generate sequences by
evolutions models based on a rooted tree. gen.seq.HKY
is to
generate nucleotide sequences, and gen.seq.SNP
is to generate
SNP sequences.
gen.seq.HKY(rooted.tree, pi, kappa, L, rate.scale = 1, anc.seq = NULL) gen.seq.SNP(rooted.tree, pi, L, rate.scale = 1, anc.seq = NULL)
gen.seq.HKY(rooted.tree, pi, kappa, L, rate.scale = 1, anc.seq = NULL) gen.seq.SNP(rooted.tree, pi, L, rate.scale = 1, anc.seq = NULL)
rooted.tree |
a rooted tree in |
pi |
equilibrium probabilities, sums to 1. |
kappa |
transition and transversion bias. |
L |
number of sites. |
rate.scale |
a scale to all branch lengths. |
anc.seq |
an ancestral sequence either in nids or sids, length = |
The rooted.tree
should be in a phylo
class of ape,
and may be created by ms
.
The pi
has length 4 for nucleotide sequences, and 2 for SNP sequences.
The rate.scale
is equivalent to rescale the height of
rooted.tree
.
Return an object in Class seqgen
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
library(phyclust, quiet = TRUE) # Generate a tree set.seed(1234) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") tree.ms <- read.tree(text = ret.ms[3]) # Generate nucleotide sequences anc.HKY <- rep(0:3, 3) pi.HKY <- c(0.2, 0.2, 0.3, 0.3) kappa <- 1.1 L <- length(anc.HKY) set.seed(1234) paste(nid2code(anc.HKY, lower.case = FALSE), collapse = "") (HKY.1 <- gen.seq.HKY(tree.ms, pi.HKY, kappa, L, anc.seq = anc.HKY)) # evolve 5 times longer (HKY.2 <- gen.seq.HKY(tree.ms, pi.HKY, kappa, L, rate.scale = 5, anc.seq = anc.HKY)) # Generate SNP sequences anc.SNP <- rep(0:1, 6) pi.SNP <- c(0.4, 0.6) L <- length(anc.SNP) set.seed(1234) paste(sid2snp(anc.SNP), collapse = "") (SNP.1 <- gen.seq.SNP(tree.ms, pi.SNP, L, anc.seq = anc.SNP)) # evolve 5 times longer (SNP.2 <- gen.seq.SNP(tree.ms, pi.SNP, L, rate.scale = 5, anc.seq = anc.SNP))
library(phyclust, quiet = TRUE) # Generate a tree set.seed(1234) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") tree.ms <- read.tree(text = ret.ms[3]) # Generate nucleotide sequences anc.HKY <- rep(0:3, 3) pi.HKY <- c(0.2, 0.2, 0.3, 0.3) kappa <- 1.1 L <- length(anc.HKY) set.seed(1234) paste(nid2code(anc.HKY, lower.case = FALSE), collapse = "") (HKY.1 <- gen.seq.HKY(tree.ms, pi.HKY, kappa, L, anc.seq = anc.HKY)) # evolve 5 times longer (HKY.2 <- gen.seq.HKY(tree.ms, pi.HKY, kappa, L, rate.scale = 5, anc.seq = anc.HKY)) # Generate SNP sequences anc.SNP <- rep(0:1, 6) pi.SNP <- c(0.4, 0.6) L <- length(anc.SNP) set.seed(1234) paste(sid2snp(anc.SNP), collapse = "") (SNP.1 <- gen.seq.SNP(tree.ms, pi.SNP, L, anc.seq = anc.SNP)) # evolve 5 times longer (SNP.2 <- gen.seq.SNP(tree.ms, pi.SNP, L, rate.scale = 5, anc.seq = anc.SNP))
Generate a rooted tree with a star shape based on a sequence calls of several functions.
gen.star.tree(N, total.height = 1)
gen.star.tree(N, total.height = 1)
N |
number of leaves. |
total.height |
total tree height. |
A tree with a star shape means that all internal branches are 0 and all leaf branches are equal.
This function combining with gen.seq.HKY
or gen.seq.SNP
is used in simulation studies and bootstrap tree samples
Return a rooted tree in Class phylo
with a star shape.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
ms
.
as.star.tree
,
get.rooted.tree.height
.
rescale.rooted.tree
.
as.phylo
,
plot.phylo
.
## Not run: library(phyclust, quiet = TRUE) ret.star <- gen.star.tree(5) plot(ret.star, type = "u") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) ret.star <- gen.star.tree(5) plot(ret.star, type = "u") ## End(Not run)
Generate comprehensive trees for simulation studies.
gen.unit.K(K, N.K, rate.anc = 10, rate.dec = 10)
gen.unit.K(K, N.K, rate.anc = 10, rate.dec = 10)
K |
number of clusters, |
N.K |
number of sequences for each cluster, a vector with length |
rate.anc |
|
rate.dec |
|
These functions generates an ancestral tree in K tips and
generates descendent trees according to N.K
tips,
and returns several types of trees, org
, equal
,
max
, and star
, as the following:
'org': original tree, adjacent the descendent trees to the ancestral tree.
'equal': descendent trees are scaled to the average height and attached to the ancestral tree, then scale the total height to be 1.
'max': descendent trees are attached to the ancestral tree, then scale the maximum height to be 1.
'star': descendent trees are applied as.star.tree
and
attached to the ancestral tree,
then scale the maximum height to be 1.
A list contains all information of generation and results including:
'K' |
number of clusters. |
'N.K |
number of sequences for each cluster. |
'rate.anc' |
|
'rate.dec' |
|
'height.anc' |
height of ancestral tree. |
'height.dec' |
height of all descendent trees. |
'anc' |
an ancestral tree. |
'dec' |
all descendent trees. |
'org' |
an original tree. |
'equal' |
a three that descendants are all equal height. |
'max' |
a tree that descendants are scaled by the maximum height. |
'star' |
a tree that descendants are star trees. |
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) # For gen.unit.K() set.seed(1234) tree.K <- gen.unit.K(6, rep(3:5, 2), rate.anc = 0.7, rate.dec = 1.1) X.class <- as.numeric(gsub("d(.)(.*)", "\\1", tree.K$org$tip.label)) # Plot results par(mfrow = c(2, 2)) plotnj(tree.K$org, X.class, type = "p", edge.width.class = 2, main = "org") axis(1) plotnj(tree.K$equal, X.class, type = "p", edge.width.class = 2, main = "equal") axis(1) plotnj(tree.K$max, X.class, type = "p", edge.width.class = 2, main = "max") axis(1) plotnj(tree.K$star, X.class, type = "p", edge.width.class = 2, main = "star") axis(1) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # For gen.unit.K() set.seed(1234) tree.K <- gen.unit.K(6, rep(3:5, 2), rate.anc = 0.7, rate.dec = 1.1) X.class <- as.numeric(gsub("d(.)(.*)", "\\1", tree.K$org$tip.label)) # Plot results par(mfrow = c(2, 2)) plotnj(tree.K$org, X.class, type = "p", edge.width.class = 2, main = "org") axis(1) plotnj(tree.K$equal, X.class, type = "p", edge.width.class = 2, main = "equal") axis(1) plotnj(tree.K$max, X.class, type = "p", edge.width.class = 2, main = "max") axis(1) plotnj(tree.K$star, X.class, type = "p", edge.width.class = 2, main = "star") axis(1) ## End(Not run)
This function gets a rooted tree height, and only meaningful for a ultrametric tree which has the equal height from the root to all leaves.
get.rooted.tree.height(rooted.tree, tol = .Machine$double.eps^0.5)
get.rooted.tree.height(rooted.tree, tol = .Machine$double.eps^0.5)
rooted.tree |
a rooted tree in |
tol |
for |
The rooted.tree
should be in a phylo
class of ape,
and should be ultrametric that may be created by ms
.
Return the rooted tree height.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
ms
,
read.tree
,
as.phylo
,
is.ultrametric
,
rescale.rooted.tree
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(5, 1, opts = paste("-T", sep = " ")) tree.ms <- read.tree(text = ret.ms[3]) is.ultrametric(tree.ms) get.rooted.tree.height(tree.ms) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(5, 1, opts = paste("-T", sep = " ")) tree.ms <- read.tree(text = ret.ms[3]) is.ultrametric(tree.ms) get.rooted.tree.height(tree.ms) ## End(Not run)
For SNP sequences only, Tzeng's method (2005) uses an evolution approach to group haplotypes based on a deterministic transformation of haplotype frequency. This function find the best number of clusters based on Shannon information content.
getcut.fun(pp.org, nn, plot = 0)
getcut.fun(pp.org, nn, plot = 0)
pp.org |
frequency of haplotypes, sorted in decreasing order. |
nn |
number of haplotypes. |
plot |
illustrated in a plot. |
pp.org
is summarized from X
in haplo.post.prob
,
nn
is equal to the number of rows of X
.
This function is called by haplo.post.prob
to determine
the best guess of number of clusters.
See Tzeng (2005) and Shannon (1948) for details.
Return the best guess of number of clusters.
Jung-Ying Tzeng.
Maintain: Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Tzeng, J.Y. (2005) “Evolutionary-Based Grouping of Haplotypes in Association Analysis”, Genetics Epidemiology, 28, 220-231. https://www4.stat.ncsu.edu/~jytzeng/software.php
Shannon, C.E. (1948) “A mathematical theory of communication”, Bell System Tech J, 27, 379-423, 623-656.
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") my.snp <- read.phylip(data.path, code.type = "SNP") ret <- haplo.post.prob(my.snp$org, ploidy = 1) getcut.fun(sort(ret$haplo$hap.prob, decreasing = TRUE), nn = my.snp$nseq, plot = 1) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") my.snp <- read.phylip(data.path, code.type = "SNP") ret <- haplo.post.prob(my.snp$org, ploidy = 1) getcut.fun(sort(ret$haplo$hap.prob, decreasing = TRUE), nn = my.snp$nseq, plot = 1) ## End(Not run)
For SNP sequences only, Tzeng's method (2005) uses an evolution approach
to group haplotypes based on a deterministic transformation of haplotype
frequency. This is a modified version of original function,
haplo.score.RD.unphased.fun
.
haplo.post.prob(X, ploidy = 2, skip.haplo = 1e-07, K = NULL)
haplo.post.prob(X, ploidy = 2, skip.haplo = 1e-07, K = NULL)
X |
sid matrix with |
ploidy |
ploidy, no effect for phase known, keep consistence only. |
skip.haplo |
lower bound of haplotypes frequencies. |
K |
number of clusters. |
X
should be a phase known haplotype data. For phase unknown and
Tzeng's method (2006) are not tested yet.
If K
is NULL, the result of getcut.fun
will be used.
See the original paper and source codes' documents for details. The function returns a list contains:
'haplo' |
summarized data set in a list contains:
|
|||||||||||||
'FD.id' |
unique ids of 'haplotype' for full dimension analysis. |
|||||||||||||
'RD.id' |
unique ids of 'haplotype' for reduced dimension analysis. |
|||||||||||||
'FD.post' |
posterior probabilities for full dimension analysis. |
|||||||||||||
'RD.post' |
posterior probabilities for reduced dimension analysis. |
|||||||||||||
'g.truncate' |
number of clusters |
test codes for phased unknown cases.
Jung-Ying Tzeng.
Maintain: Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Tzeng, J.Y. (2005) “Evolutionary-Based Grouping of Haplotypes in Association Analysis”, Genetics Epidemiology, 28, 220-231. https://www4.stat.ncsu.edu/~jytzeng/software.php
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") my.snp <- read.phylip(data.path, code.type = "SNP") ret <- haplo.post.prob(my.snp$org, ploidy = 1) str(ret) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) data.path <- paste(.libPaths()[1], "/phyclust/data/crohn.phy", sep = "") my.snp <- read.phylip(data.path, code.type = "SNP") ret <- haplo.post.prob(my.snp$org, ploidy = 1) str(ret) ## End(Not run)
This function modifies the original standalone code of ms()
developed by
Hudson (2002) for generating samples/coalescent trees under a Wright-Fisher
neutral model.
ms(nsam = NULL, nreps = 1, opts = NULL, temp.file = NULL, tbs.matrix = NULL)
ms(nsam = NULL, nreps = 1, opts = NULL, temp.file = NULL, tbs.matrix = NULL)
nsam |
number of samples/coalescent trees, usually greater than 2. |
nreps |
number of replications. |
opts |
options as the standalone version. |
temp.file |
temporary file for ms output. |
tbs.matrix |
a matrix for 'tbs' options given in opts. |
This function directly reuses the C code of ms
by arguments
as input from the opts
. The options opts
is followed from the
original ms
except nsam
and nreps
.
Note that stdin, stdout, and pipe are all disable from opts
.
For examples, options commonly used in phyclust are:
"-T": generate trees in a neutral model.
"-G": generate trees with a population growth rate, e.g. "-G 0.5".
These will return trees in a NEWICK format which can be read by the
read.tree()
of ape and passed to seqgen()
to generate
sequences.
temp.file
allows users to specify ms output file themselves, but
this file will not be deleted nor converted into R after the call to
ms()
. Users should take care the readings. By default, ms()
uses a system temp file to store the output which is converted into R
after the call and is deleted after converting.
tbs.matrix
is a matrix to specify the values of tbs
given
in opts
. See demo('simu_ms_tbs')
for an example how to
use this additional option. This option has been slightly tweaked by
utilizing tbs
options in the standalone ms
. However,
the output format is not the same as that in the standalone ms
.
Post-process is required with caution.
This function returns a vector, and each element stores one line of STDOUT
of ms()
separated by newline. The vector stores in a class ms
.
The details of output format can found on the website
http://home.uchicago.edu/~rhudson1/source.html and its manual.
Carefully read the ms
's original document before using the ms()
function.
Hudson, R.R. (2002).
Maintain: Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Hudson, R.R. (2002) “Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation”, Bioinformatics, 18, 337-338. http://home.uchicago.edu/~rhudson1/source.html
print.ms()
,
read.tree()
,
bind.tree()
,
seqgen()
.
## Not run: library(phyclust, quiet = TRUE) ms() # an ancestral tree set.seed(1234) (ret.ms <- ms(nsam = 3, opts = "-T -G 0.1")) (tree.anc <- read.tree(text = ret.ms[3])) tree.anc$tip.label <- paste("a", 1:K, sep = "") # adjacent descendant trees to the ancestral tree K <- 3 N <- 12 N.k <- c(3, 4, 5) ms.dec <- NULL # a list to store trees of ms tree.dec <- NULL # a list to store the trees in phylo class tree.joint <- tree.anc for(k in 1:K){ ms.dec[[k]] <- ms(N.k[k], opts = "-T -G 1.0") tree.dec[[k]] <- read.tree(text = ms.dec[[k]][3]) tree.dec[[k]]$tip.label <- paste("d", k, ".", 1:N.k[k], sep = "") tree.joint <- bind.tree(tree.joint, tree.dec[[k]], where = which(tree.joint$tip.label == paste("a", k, sep = ""))) } str(tree.joint) # plot trees par(mfrow = c(2, 3)) plot(tree.anc, main = paste("anc (", K, ")", sep = "")) axis(1) for(k in 1:K){ plot(tree.dec[[k]], main = paste("dec", k, " (", N.k[k], ")", sep = "")) axis(1) } plot(tree.joint, main = paste("joint (", N, ")", sep = "")) axis(1) # use tbs option (an example from msdoc.pdf by Hudson, R.R.) tbs.matrix <- matrix(c(3.0, 3.5, 5.0, 8.5), nrow = 2) ret <- ms(nsam = 5, nreps = 2, opts = "-t tbs -r tbs 1000", tbs.matrix = tbs.matrix) print(ret) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) ms() # an ancestral tree set.seed(1234) (ret.ms <- ms(nsam = 3, opts = "-T -G 0.1")) (tree.anc <- read.tree(text = ret.ms[3])) tree.anc$tip.label <- paste("a", 1:K, sep = "") # adjacent descendant trees to the ancestral tree K <- 3 N <- 12 N.k <- c(3, 4, 5) ms.dec <- NULL # a list to store trees of ms tree.dec <- NULL # a list to store the trees in phylo class tree.joint <- tree.anc for(k in 1:K){ ms.dec[[k]] <- ms(N.k[k], opts = "-T -G 1.0") tree.dec[[k]] <- read.tree(text = ms.dec[[k]][3]) tree.dec[[k]]$tip.label <- paste("d", k, ".", 1:N.k[k], sep = "") tree.joint <- bind.tree(tree.joint, tree.dec[[k]], where = which(tree.joint$tip.label == paste("a", k, sep = ""))) } str(tree.joint) # plot trees par(mfrow = c(2, 3)) plot(tree.anc, main = paste("anc (", K, ")", sep = "")) axis(1) for(k in 1:K){ plot(tree.dec[[k]], main = paste("dec", k, " (", N.k[k], ")", sep = "")) axis(1) } plot(tree.joint, main = paste("joint (", N, ")", sep = "")) axis(1) # use tbs option (an example from msdoc.pdf by Hudson, R.R.) tbs.matrix <- matrix(c(3.0, 3.5, 5.0, 8.5), nrow = 2) ret <- ms(nsam = 5, nreps = 2, opts = "-t tbs -r tbs 1000", tbs.matrix = tbs.matrix) print(ret) ## End(Not run)
Transfer nids (0, 1, ..., 4) , aids (0, 1, ..., 21) and cids (0, 1, ..., 64).
### S3 methods for a list, vector or matrix (default). nid2aid(nidseq, start = 1, end = NULL, drop.gap = FALSE, byrow = TRUE) nid2cid(nidseq, start = 1, end = NULL, drop.gap = FALSE, byrow = TRUE) cid2aid(cidseq) aid2acode(aidseq, lower.case = FALSE) acode2aid(acodeseq)
### S3 methods for a list, vector or matrix (default). nid2aid(nidseq, start = 1, end = NULL, drop.gap = FALSE, byrow = TRUE) nid2cid(nidseq, start = 1, end = NULL, drop.gap = FALSE, byrow = TRUE) cid2aid(cidseq) aid2acode(aidseq, lower.case = FALSE) acode2aid(acodeseq)
nidseq |
a numerical vector contains nucleotide ids, 0, 1, 2, 3, or 4. |
cidseq |
a numerical vector contains codon ids, 0, 1, ..., or 64. |
aidseq |
a numerical vector contains amino acid ids, 0, 1, ..., or 21. |
acodeseq |
a character vector contains amino acid codes. |
start |
the start site to translate. |
end |
the end site to translate. |
drop.gap |
ignore gaps if TRUE. |
byrow |
advanced option, default = TRUE. |
lower.case |
transfer in lower cases. |
These functions are based on the internal object .nucleotide
,
.codon
, .amino.acid
, and .genetic.code
to
transfer sequences.
nid2aid
and cid2aid
returns a numerical vector containing
amino acid ids, and nid2cid
returns a numerical vector containing
codon ids, aid2acode
returns a character vector containing
amino acid codes, and acode2aid
returns a numerical vector containing
amino acid ids.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.nucleotide
,
.amino.acid
,
.codon
,
.genetic.code
,
code2nid
.
## Not run: library(phyclust, quiet = TRUE) ### Test S3 methods by a vector a.vector <- c("A", "C", "G", "-", "T", "A") code2nid(a.vector) nid2cid(code2nid(a.vector)) cid2aid(nid2cid(code2nid(a.vector))) nid2aid(code2nid(a.vector)) aid2acode(nid2aid(code2nid(a.vector))) acode2aid(aid2acode(nid2aid(code2nid(a.vector)))) ### Test S3 methods by a matrix a.matrix <- rbind(a.vector, a.vector, a.vector) code2nid(a.matrix) nid2cid(code2nid(a.matrix)) cid2aid(nid2cid(code2nid(a.matrix))) nid2aid(code2nid(a.matrix)) aid2acode(nid2aid(code2nid(a.matrix))) acode2aid(aid2acode(nid2aid(code2nid(a.matrix)))) ### Test S3 methods by a list a.list <- list(a.vector, a.vector) code2nid(a.list) nid2cid(code2nid(a.list)) cid2aid(nid2cid(code2nid(a.list))) nid2aid(code2nid(a.list)) aid2acode(nid2aid(code2nid(a.list))) acode2aid(aid2acode(nid2aid(code2nid(a.list)))) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) ### Test S3 methods by a vector a.vector <- c("A", "C", "G", "-", "T", "A") code2nid(a.vector) nid2cid(code2nid(a.vector)) cid2aid(nid2cid(code2nid(a.vector))) nid2aid(code2nid(a.vector)) aid2acode(nid2aid(code2nid(a.vector))) acode2aid(aid2acode(nid2aid(code2nid(a.vector)))) ### Test S3 methods by a matrix a.matrix <- rbind(a.vector, a.vector, a.vector) code2nid(a.matrix) nid2cid(code2nid(a.matrix)) cid2aid(nid2cid(code2nid(a.matrix))) nid2aid(code2nid(a.matrix)) aid2acode(nid2aid(code2nid(a.matrix))) acode2aid(aid2acode(nid2aid(code2nid(a.matrix)))) ### Test S3 methods by a list a.list <- list(a.vector, a.vector) code2nid(a.list) nid2cid(code2nid(a.list)) cid2aid(nid2cid(code2nid(a.list))) nid2aid(code2nid(a.list)) aid2acode(nid2aid(code2nid(a.list))) acode2aid(aid2acode(nid2aid(code2nid(a.list)))) ## End(Not run)
This function modifies the original standalone code of baseml
in PAML developed by Yang (1997) for phylogenetic analysis by maximum
likelihood. This function provides a way to generate an ancestral tree
for given central sequences clustered by phyclust
.
paml.baseml(X, seqname = NULL, opts = NULL, newick.trees = NULL) paml.baseml.control(...) paml.baseml.show.default()
paml.baseml(X, seqname = NULL, opts = NULL, newick.trees = NULL) paml.baseml.control(...) paml.baseml.show.default()
X |
nid matrix with |
seqname |
sequence names. |
opts |
options as the standalone version, provided by |
newick.trees |
a vector/list contains NEWICK trees for |
... |
for other possible opts and values. See PAML manual for details. |
show |
show opts and values. |
The function paml.baseml
directly reuses the C code of baseml
of PAML, and the function paml.baseml.control
is to generate controls
for paml.baseml
as the file baseml.ctl
of PAML.
The seqname
should be consistent with X
, and the leaf nodes
of newick.trees
.
The options opts
is followed from the original baseml.ctl
except seqfile
, treefile
and outputfile
will be omitted.
paml.baseml.control
generates default opts
, and
paml.baseml.show.default
displays annotations for the default
opts
.
This function returns a list, and each element stores one line of outputs
of baseml
separated by newline. The list stores in a class
baseml
. All the output of baseml
of PAML will be saved in
several files, and these will be read in by scan
. Further post
processing can be done by parsing the returning vector. The details of
output format can found on the website
http://abacus.gene.ucl.ac.uk/software/paml.html and its manual.
Note that some functionalities of baseml
of PAML are changed in
paml.baseml
due to the complexity of input and output. The changes
include such as disable the option G
and rename the file 2base.t
to pairbase.t
.
Typically, the list contains the original output of baseml
including
pairbase.t
, mlb
, rst
, rst1
, and rub
if they
are not empty. The best tree (unrooted by default) will be stored in
best.tree
parsed from mlb
based on the highest log likelihood.
All output to STDOUT
are stored in stdout
. No STDIN
are
allowed.
Note that the print function for the class baseml
will only show
the best.tree
only. Use str
or names
to see the whole
returns of the list.
Carefully read the PAML's original document before using the
paml.baseml
function, and paml.baseml
may not be ported
well from baseml
of PAML. Please double check with the standalone
version.
baseml
may not be a well designed program, and may run slowly.
If it were stuck, temporary files would all store at a directory obtained
by tempfile("/paml.baseml.")
.
baseml
has its own options and settings which may be different
than phyclust and ape. For example, the following is from
the PAML's document, “In PAML, a rooted tree has a bifurcation at the root,
while an unrooted tree has a trifurcation or multifurcation at the root.”
i.e. paml.baseml
uses a rooted result for an unrooted tree, as well
as for a rooted tree.
baseml
also needs a sequence file which is dumped from R (duplicated
from memory) for paml.baseml
, and this file can be very big if
sequences are too long or number of sequences is too large. Also,
paml.baseml
may take long time to search the best tree if data are
large or initial trees are not provided.
Yang, Z. (1997) and Yang, Z. (2007)
Maintain: Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Yang, Z. (1997) “PAML: a program package for phylogenetic analysis by maximum likelihood”, Computer Applications in BioSciences, 13, 555-556.
Yang, Z. (2007) “PAML 4: a program package for phylogenetic analysis by maximum likelihood”, Molecular Biology and Evolution, 24, 1586-1591. http://abacus.gene.ucl.ac.uk/software/paml.html
## Not run: library(phyclust, quiet = TRUE) paml.baseml.show.default() ### Generate data. set.seed(123) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") ret.seqgen <- seqgen(opts = "-mHKY -l40 -s0.2", newick.tree = ret.ms[3]) (ret.nucleotide <- read.seqgen(ret.seqgen)) X <- ret.nucleotide$org seqname <- ret.nucleotide$seqname ### Run baseml. opts <- paml.baseml.control(model = 4, clock = 1) (ret.baseml <- paml.baseml(X, seqname = seqname, opts = opts)) (ret.baseml.init <- paml.baseml(X, seqname = seqname, opts = opts, newick.trees = ret.ms[3])) ret.ms[3] ### Unrooted tree. opts <- paml.baseml.control(model = 4) (ret.baseml.unrooted <- paml.baseml(X, seqname = seqname, opts = opts)) ### More information. opts <- paml.baseml.control(noisy = 3, verbose = 1, model = 4, clock = 1) ret.more <- paml.baseml(X, seqname = seqname, opts = opts) # ret.more$stdout ### Plot trees par(mfrow = c(2, 2)) plot(read.tree(text = ret.ms[3]), main = "true") plot(read.tree(text = ret.baseml$best.tree), main = "baseml") plot(read.tree(text = ret.baseml.init$best.tree), main = "baseml with initial") plot(unroot(read.tree(text = ret.baseml.unrooted$best.tree)), main = "baseml unrooted") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) paml.baseml.show.default() ### Generate data. set.seed(123) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") ret.seqgen <- seqgen(opts = "-mHKY -l40 -s0.2", newick.tree = ret.ms[3]) (ret.nucleotide <- read.seqgen(ret.seqgen)) X <- ret.nucleotide$org seqname <- ret.nucleotide$seqname ### Run baseml. opts <- paml.baseml.control(model = 4, clock = 1) (ret.baseml <- paml.baseml(X, seqname = seqname, opts = opts)) (ret.baseml.init <- paml.baseml(X, seqname = seqname, opts = opts, newick.trees = ret.ms[3])) ret.ms[3] ### Unrooted tree. opts <- paml.baseml.control(model = 4) (ret.baseml.unrooted <- paml.baseml(X, seqname = seqname, opts = opts)) ### More information. opts <- paml.baseml.control(noisy = 3, verbose = 1, model = 4, clock = 1) ret.more <- paml.baseml(X, seqname = seqname, opts = opts) # ret.more$stdout ### Plot trees par(mfrow = c(2, 2)) plot(read.tree(text = ret.ms[3]), main = "true") plot(read.tree(text = ret.baseml$best.tree), main = "baseml") plot(read.tree(text = ret.baseml.init$best.tree), main = "baseml with initial") plot(unroot(read.tree(text = ret.baseml.unrooted$best.tree)), main = "baseml unrooted") ## End(Not run)
The main function of phyclust implements finite mixture models for sequence data that the mutation processes are modeled by evolution processes based on Continuous Time Markov Chain theory.
phyclust(X, K, EMC = .EMC, manual.id = NULL, label = NULL, byrow = TRUE)
phyclust(X, K, EMC = .EMC, manual.id = NULL, label = NULL, byrow = TRUE)
X |
nid/sid matrix with |
K |
number of clusters. |
EMC |
EM control. |
manual.id |
manually input class ids. |
label |
label of sequences for semi-supervised clustering |
byrow |
advanced option for |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
EMC
contains all options used for EM algorithms.
manual.id
manually input class ids as an initialization only for
the initialization method, 'manualMu'.
label
indicates the known clusters for labeled sequences which is a
vector with length N
and has values from 0
to K
.
0
indicates clusters are unknown. label = NULL
is for
unsupervised clustering. Only un- and semi-supervised clustering are
implemented.
byrow
used in bootstraps to avoid transposing matrix 'X'. If
FALSE, then the 'X' should be have the dimension .
A list with class phyclust
will be returned containing
several elements as the following:
'N.X.org' |
number of sequences in the |
|||||||||||||||
'N.X.unique' |
number of unique sequences in the |
|||||||||||||||
'L' |
number of sites, length of sequences, number of column of the |
|||||||||||||||
'K' |
number of clusters. |
|||||||||||||||
'Eta' |
proportion of subpopulations, |
|||||||||||||||
'Z.normalized' |
posterior probabilities, |
|||||||||||||||
'Mu' |
centers of subpopulations, dim = |
|||||||||||||||
'QA' |
Q matrix array, information for the evolution model, a list contains:
|
|||||||||||||||
'logL' |
log likelihood values. |
|||||||||||||||
'p' |
number of parameters. |
|||||||||||||||
'bic' |
BIC, |
|||||||||||||||
'aic' |
AIC, |
|||||||||||||||
'N.seq.site' |
number of segregating sites. |
|||||||||||||||
'class.id' |
class id for each sequences based on the maximum posterior. |
|||||||||||||||
'n.class' |
number of sequences in each cluster. |
|||||||||||||||
'conv' |
convergence information, a list contains:
|
|||||||||||||||
'init.procedure' |
initialization procedure. |
|||||||||||||||
'init.method' |
initialization method. |
|||||||||||||||
'substitution.model' |
substitution model. |
|||||||||||||||
'edist.model' |
evolution distance model. |
|||||||||||||||
'code.type' |
code type. |
|||||||||||||||
'em.method' |
EM algorithm. |
|||||||||||||||
'boundary.method' |
boundary method. |
|||||||||||||||
'label.method' |
label method. |
make a general class for Q
and QA
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.EMC
,
.EMControl
,
find.best
,
phyclust.se
.
phyclust.se.update
.
library(phyclust, quiet = TRUE) X <- seq.data.toy$org set.seed(1234) (ret.1 <- phyclust(X, 3)) EMC.2 <- .EMC EMC.2$substitution.model <- "HKY85" # the same as EMC.2 <- .EMControl(substitution.model = "HKY85") (ret.2 <- phyclust(X, 3, EMC = EMC.2)) # for semi-supervised clustering semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 (ret.3 <- phyclust(X, 3, EMC = EMC.2, label = semi.label))
library(phyclust, quiet = TRUE) X <- seq.data.toy$org set.seed(1234) (ret.1 <- phyclust(X, 3)) EMC.2 <- .EMC EMC.2$substitution.model <- "HKY85" # the same as EMC.2 <- .EMControl(substitution.model = "HKY85") (ret.2 <- phyclust(X, 3, EMC = EMC.2)) # for semi-supervised clustering semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 (ret.3 <- phyclust(X, 3, EMC = EMC.2, label = semi.label))
This is a single E-step of phyclust
, usually following or followed
by the other M-step.
phyclust.e.step(X, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, Z.state = TRUE, label = NULL)
phyclust.e.step(X, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, Z.state = TRUE, label = NULL)
X |
nid/sid matrix with |
ret.phyclust |
an object with the class |
K |
number of clusters. |
Eta |
proportion of subpopulations, |
Mu |
centers of subpopulations, dim = |
pi |
equilibrium probabilities, each row sums to 1. |
kappa |
transition and transversion bias. |
Tt |
total evolution time, |
substitution.model |
substitution model. |
identifier |
identifier. |
code.type |
code type. |
Z.state |
see ‘Details’. |
label |
label of sequences for semi-supervised clustering. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
Either input ret.phyclust
or all other arguments for this function
except Z.state
. ret.phyclust
can be obtain either from an
EM iteration of phyclust
or from a M step of phyclust.m.step
.
Z.state
indicates the return values of . If
TRUE, the
Z.normalized
returned by this function will be
posterior probabilities. Otherwise, it will be logPt
, log of
transition probabilities, .
If label
is inputted, the label information will be used
the E-step, even the ret.phyclust
is the result of unsupervised
clustering.
This function returns a matrix with dimension =
. The values is dependent on
Z.state
, and
they are either posterior probabilities if TRUE or transition
probabilities otherwise.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
phyclust.em.step
,
phyclust.m.step
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) ret.2 <- phyclust.e.step(X, ret.phyclust = ret.1) str(ret.2) # For semi-supervised clustering. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 ret.3 <- phyclust.e.step(X, ret.phyclust = ret.1, label = semi.label) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) ret.2 <- phyclust.e.step(X, ret.phyclust = ret.1) str(ret.2) # For semi-supervised clustering. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 ret.3 <- phyclust.e.step(X, ret.phyclust = ret.1, label = semi.label) ## End(Not run)
This computes pair wised evolution distance of sequences.
phyclust.edist(X, edist.model = .edist.model[1])
phyclust.edist(X, edist.model = .edist.model[1])
X |
nid/sid matrix with |
edist.model |
evolution distance model. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
This function returns a object with class dist
.
incorporate dist.dna
of ape.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) X <- rbind(c(0, 2, 1, 3, 0, 2, 2, 0, 3, 2, 2), c(0, 0, 1, 3, 2, 2, 1, 0, 3, 1, 2), c(0, 2, 1, 1, 0, 2, 1, 3, 0, 0, 1), c(2, 2, 1, 1, 0, 0, 2, 3, 0, 2, 1), c(2, 2, 1, 1, 0, 0, 2, 3, 1, 2, 0)) (ret <- phyclust.edist(X, edist.model = "D_HAMMING")) str(ret) as.matrix(ret) plot(nj(ret), type = "u", no.margin = TRUE) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) X <- rbind(c(0, 2, 1, 3, 0, 2, 2, 0, 3, 2, 2), c(0, 0, 1, 3, 2, 2, 1, 0, 3, 1, 2), c(0, 2, 1, 1, 0, 2, 1, 3, 0, 0, 1), c(2, 2, 1, 1, 0, 0, 2, 3, 0, 2, 1), c(2, 2, 1, 1, 0, 0, 2, 3, 1, 2, 0)) (ret <- phyclust.edist(X, edist.model = "D_HAMMING")) str(ret) as.matrix(ret) plot(nj(ret), type = "u", no.margin = TRUE) ## End(Not run)
This is a single EM-step of phyclust
.
phyclust.em.step(X, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, label = NULL)
phyclust.em.step(X, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, label = NULL)
X |
nid/sid matrix with |
ret.phyclust |
an object with the class |
K |
number of clusters. |
Eta |
proportion of subpopulations, |
Mu |
centers of subpopulations, dim = |
pi |
equilibrium probabilities, each row sums to 1. |
kappa |
transition and transversion bias. |
Tt |
total evolution time, |
substitution.model |
substitution model. |
identifier |
identifier. |
code.type |
code type. |
label |
label of sequences for semi-supervised clustering. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
Either input ret.phyclust
or all other arguments for this function.
ret.phyclust
can be obtained either from an EM iteration of
phyclust
or from a M step of phyclust.m.step
.
If label
is inputted, the label information will be used
the EM-step, even the ret.phyclust
is the result of unsupervised
clustering.
This function returns an object with class phyclust
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
phyclust.e.step
,
phyclust.m.step
.
library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) ret.2 <- phyclust.em.step(X, ret.phyclust = ret.1) str(ret.2) # For semi-supervised clustering. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 ret.3 <- phyclust.em.step(X, ret.phyclust = ret.1, label = semi.label) str(ret.3)
library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) ret.2 <- phyclust.em.step(X, ret.phyclust = ret.1) str(ret.2) # For semi-supervised clustering. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 ret.3 <- phyclust.em.step(X, ret.phyclust = ret.1, label = semi.label) str(ret.3)
This computes a log-likelihood value of phyclust
.
phyclust.logL(X, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, label = NULL)
phyclust.logL(X, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, label = NULL)
X |
nid/sid matrix with |
ret.phyclust |
an object with the class |
K |
number of clusters. |
Eta |
proportion of subpopulations, |
Mu |
centers of subpopulations, dim = |
pi |
equilibrium probabilities, each row sums to 1. |
kappa |
transition and transversion bias. |
Tt |
total evolution time, |
substitution.model |
substitution model. |
identifier |
identifier. |
code.type |
code type. |
label |
label of sequences for semi-supervised clustering. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
Either input ret.phyclust
or all other arguments for this function.
ret.phyclust
can be obtain either from an EM iteration of
phyclust
or from a M step of phyclust.m.step
.
If label
is inputted, the label information will be used to
calculate log likelihood (complete-data), even the ret.phyclust
is the result of unsupervised clustering.
This function returns a log-likelihood value of phyclust
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) phyclust.logL(X, ret.phyclust = ret.1) # For semi-supervised clustering. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 phyclust.logL(X, ret.phyclust = ret.1, label = semi.label) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) phyclust.logL(X, ret.phyclust = ret.1) # For semi-supervised clustering. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 phyclust.logL(X, ret.phyclust = ret.1, label = semi.label) ## End(Not run)
This is a single M-step of phyclust
, usually following or followed
by the other E-step.
phyclust.m.step(X, ret.phyclust = NULL, K = NULL, pi = NULL, kappa = NULL, Tt = NULL, Z.normalized = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, label = NULL)
phyclust.m.step(X, ret.phyclust = NULL, K = NULL, pi = NULL, kappa = NULL, Tt = NULL, Z.normalized = NULL, substitution.model = NULL, identifier = NULL, code.type = NULL, label = NULL)
X |
nid/sid matrix with |
ret.phyclust |
an object with the class |
K |
number of clusters. |
pi |
equilibrium probabilities, each row sums to 1. |
kappa |
transition and transversion bias. |
Tt |
total evolution time, |
Z.normalized |
posterior probabilities obtained from an E-step. |
substitution.model |
substitution model. |
identifier |
identifier. |
code.type |
code type. |
label |
label of sequences for semi-supervised clustering. |
X
should be a numerical matrix containing sequence data that
can be transfered by code2nid
or code2sid
.
Either input ret.phyclust
or all other arguments for this function.
ret.phyclust
can be obtained either from an EM iteration of
phyclust
or from a E step of phyclust.e.step
.
If label
is inputted, the label information will be used
the M-step and Z.normalized
will be replaced, even the
ret.phyclust
is the result of unsupervised clustering.
This function returns an object with class phyclust
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
phyclust.em.step
,
phyclust.e.step
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$short.iter <- 1 EMC.1$EM.iter <- 1 # Test with phyclust. X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) # Test with an em step. ret.em <- phyclust.em.step(X, ret.1) # Test with an E- and M-step. ret.1$Z.normalized <- phyclust.e.step(X, ret.phyclust = ret.1) ret.m <- phyclust.m.step(X, ret.phyclust = ret.1) # Test with 2 em steps. set.seed(1234) EMC.2 <- EMC.1 EMC.2$EM.iter <- 2 ret.2 <- phyclust(X, 2, EMC = EMC.2) # Check logL. phyclust.logL(X, ret.1) phyclust.logL(X, ret.em) phyclust.logL(X, ret.m) phyclust.logL(X, ret.2) # For semi-supervised. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 ret.m.1 <- phyclust.m.step(X, ret.phyclust = ret.1, label = semi.label) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$short.iter <- 1 EMC.1$EM.iter <- 1 # Test with phyclust. X <- seq.data.toy$org ret.1 <- phyclust(X, 2, EMC = EMC.1) # Test with an em step. ret.em <- phyclust.em.step(X, ret.1) # Test with an E- and M-step. ret.1$Z.normalized <- phyclust.e.step(X, ret.phyclust = ret.1) ret.m <- phyclust.m.step(X, ret.phyclust = ret.1) # Test with 2 em steps. set.seed(1234) EMC.2 <- EMC.1 EMC.2$EM.iter <- 2 ret.2 <- phyclust(X, 2, EMC = EMC.2) # Check logL. phyclust.logL(X, ret.1) phyclust.logL(X, ret.em) phyclust.logL(X, ret.m) phyclust.logL(X, ret.2) # For semi-supervised. semi.label <- rep(0, nrow(X)) semi.label[1:3] <- 1 ret.m.1 <- phyclust.m.step(X, ret.phyclust = ret.1, label = semi.label) ## End(Not run)
This computes transition probabilities of phyclust
given time.
phyclust.Pt(Q, Tt, substitution.model = .substitution.model$model[1])
phyclust.Pt(Q, Tt, substitution.model = .substitution.model$model[1])
Q |
a list according to the substitution model. |
Tt |
total evolution time, |
substitution.model |
substitution model. |
The major models for Q
are:
Model | Author and Publication | Parameter |
JC69 | Jukes and Cantor 1969. | |
K80 | Kimura 1980. | |
F81 | Felsenstein 1981. | |
HKY85 | Hasegawa, Kishino, and Yano 1985. | |
A list of Q
should contains pi
, kappa
based on
substitution models and code types. Tt
may be separately stored.
Depending on identifiers, Q
s can be composite to a QA
,
Q matrix array.
A list with class Pt
will be returned containing several
elements as the following:
'Pt' |
a transition probability matrix. |
'log.Pt' |
a log transition probability matrix. |
'H' |
a negative entropy, |
vectorize Tt
for repeated computation in C.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.substitution.model
,
phyclust
,
phyclust.em.step
.
## Not run: library(phyclust, quiet = TRUE) Tt <- 0.5 Q <- list(pi = c(0.25, 0.25, 0.25, 0.25), kappa = 0.5) phyclust.Pt(Q, Tt, "HKY85") Q <- list(pi = c(0.5, 0.5), kappa = 0.5) phyclust.Pt(Q, Tt, "SNP_JC69") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) Tt <- 0.5 Q <- list(pi = c(0.25, 0.25, 0.25, 0.25), kappa = 0.5) phyclust.Pt(Q, Tt, "HKY85") Q <- list(pi = c(0.5, 0.5), kappa = 0.5) phyclust.Pt(Q, Tt, "SNP_JC69") ## End(Not run)
The phyclust.se
is an advanced function of phyclust
.
The phyclust.se
implements finite mixture models for sequence data
with sequencing errors. The same as phyclust
,
the mutation processes are modeled by evolution processes based on
Continuous Time Markov Chain theory.
phyclust.se(X, K, EMC = .EMC, manual.id = NULL, byrow = TRUE)
phyclust.se(X, K, EMC = .EMC, manual.id = NULL, byrow = TRUE)
X |
nid/sid matrix with |
K |
number of clusters. |
EMC |
EM control. |
manual.id |
manually input class ids. |
byrow |
advanced option for |
phyclust.se
considers mutations with sequencing error, but only
for NUCLEOTIDE data and only the EM algorithm is
implemented. Currently, phyclust.se
implements the following steps:
1 | assume non-sequencing errors as the phyclust does. |
2 | use the initialization as the phyclust does. |
3 | run the phyclust to find the non-sequencing error MLE's. |
4 | initial by the results of phyclust . |
5 | update all parameters including probabilities of sequencing errors. |
See the help page of phyclust
for the explanations of
input arguments since both functions share the same arguments. Note that
the underling model assumptions are very different of both functions.
A list with class phyclust
will be returned containing
several elements as described in phyclust
. But, the
phyclust.se
returns extra parameters for sequencing errors, and
they are shown in the following:
'SE' |
a list returning parameters of sequencing error models, including:
|
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.EMC
,
.EMControl
,
phyclust.se
,
phyclust.se.update
.
## Not run: library(phyclust, quiet = TRUE) X <- seq.data.toy$org set.seed(1234) (ret.1 <- phyclust.se(X, 3)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) X <- seq.data.toy$org set.seed(1234) (ret.1 <- phyclust.se(X, 3)) ## End(Not run)
Since phyclust.se
is difficult to optimize on a constrained high
dimension parameter space, the phyclust
is relatively easier to
find a better result, as well as the find.best
function.
This function will use the phyclust
result as initial parameters and
perform a sequencing error model. All parameters (Eta, Mu, Q, ...) in this
function will be updated through the EM algorithm as phyclust.se
.
Typically, this function run on the find.best
results will yield
a better result than on the phyclust.se
.
phyclust.se.update(X, EMC = .EMC, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, byrow = TRUE)
phyclust.se.update(X, EMC = .EMC, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, byrow = TRUE)
X |
nid/sid matrix with |
EMC |
EM control. |
ret.phyclust |
an object with the class |
K |
number of clusters. |
Eta |
proportion of subpopulations, |
Mu |
centers of subpopulations, dim = |
pi |
equilibrium probabilities, each row sums to 1. |
kappa |
transition and transversion bias. |
Tt |
total evolution time, |
byrow |
advanced option for |
All the input arguments are the same as the inputs of the function
phyclust.em.step
and phyclust.update
.
This function returns an object with class phyclust
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust.se
,
phyclust.update
,
phyclust
,
find.best
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) X <- seq.data.toy$org (ret.1 <- find.best(X, 4)) (ret.2 <- phyclust.se.update(X, ret.phyclust = ret.1)) .EMC$se.constant <- 1e-3 (ret.3 <- phyclust.se.update(X, ret.phyclust = ret.2)) ### Search optimal error func <- function(C){ .EMC$se.constant <<- C -phyclust.se.update(X, ret.phyclust = ret.1)$logL } (ret.opt <- optimize(f = func, lower = 1e-3, upper = 1e-1)) .EMC$se.constant <- ret.opt$minimum (ret.se.opt <- phyclust.se.update(X, ret.phyclust = ret.1)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) X <- seq.data.toy$org (ret.1 <- find.best(X, 4)) (ret.2 <- phyclust.se.update(X, ret.phyclust = ret.1)) .EMC$se.constant <- 1e-3 (ret.3 <- phyclust.se.update(X, ret.phyclust = ret.2)) ### Search optimal error func <- function(C){ .EMC$se.constant <<- C -phyclust.se.update(X, ret.phyclust = ret.1)$logL } (ret.opt <- optimize(f = func, lower = 1e-3, upper = 1e-1)) .EMC$se.constant <- ret.opt$minimum (ret.se.opt <- phyclust.se.update(X, ret.phyclust = ret.1)) ## End(Not run)
This function will run the EM algorithm on initial parameters specified by users or from other initial procedures. All parameters (Eta, Mu, Q, ...) in this function will be updated.
phyclust.update(X, EMC = .EMC, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, label = NULL, byrow = TRUE)
phyclust.update(X, EMC = .EMC, ret.phyclust = NULL, K = NULL, Eta = NULL, Mu = NULL, pi = NULL, kappa = NULL, Tt = NULL, label = NULL, byrow = TRUE)
X |
nid/sid matrix with |
EMC |
EM control. |
ret.phyclust |
an object with the class |
K |
number of clusters. |
Eta |
proportion of subpopulations, |
Mu |
centers of subpopulations, dim = |
pi |
equilibrium probabilities, each row sums to 1. |
kappa |
transition and transversion bias. |
Tt |
total evolution time, |
label |
label of sequences for semi-supervised clustering. |
byrow |
advanced option for |
This function is equivalent to run exhaustEM
on one specified
initial parameters, and no initial procedure is involved. While this
function is a little bit different to run phyclust
with
manual.id
where Mu
will be reestimated as the new initials.
Simply speaking, this function only runs the EM algorithm given the
initial parameters.
All the input arguments are the same as the inputs of the functions
phyclust
and phyclust.em.step
.
This function returns an object with class phyclust
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
find.best
,
phyclust.se
,
phyclust.se.update
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org (ret.1 <- phyclust(X, 2, EMC = EMC.1)) (ret.2 <- phyclust.update(X, ret.phyclust = ret.1)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) EMC.1 <- .EMC EMC.1$EM.iter <- 1 # the same as EMC.1 <- .EMControl(EM.iter = 1) X <- seq.data.toy$org (ret.1 <- phyclust(X, 2, EMC = EMC.1)) (ret.2 <- phyclust.update(X, ret.phyclust = ret.1)) ## End(Not run)
This function provides dots plots of data set given an idea how diverse the sequences are by drawing dots with different colors for all mutations.
plotdots(X, X.class = NULL, Mu = NULL, code.type = .code.type[1], diff.only = TRUE, fill = FALSE, label = TRUE, with.gap = FALSE, xlim = NULL, ylim = NULL, main = "Dots Plot", xlab = "Sites", ylab = "Sequences", missing.col = "gray95", ...)
plotdots(X, X.class = NULL, Mu = NULL, code.type = .code.type[1], diff.only = TRUE, fill = FALSE, label = TRUE, with.gap = FALSE, xlim = NULL, ylim = NULL, main = "Dots Plot", xlab = "Sites", ylab = "Sequences", missing.col = "gray95", ...)
X |
numerical data matrix with |
X.class |
class ids indicated for all sequences. |
Mu |
a center sequence with length |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
diff.only |
draw the segregating sites only, default = TRUE. |
fill |
fill in all dots, default = FALSE. |
label |
indicate segregating sites, default = TRUE. |
with.gap |
pass to |
xlim |
limit of x-axis. |
ylim |
limit of y-axis. |
main |
main label, default = "Dots Plot". |
xlab |
x-axis label, default = "Sites". |
ylab |
y-axis label, default = "Sequences". |
missing.col |
color for the missing allele, default = NA. |
... |
other options passed to |
The first rows in Mu
will be drawn entirely on dots plots
in colors which are "green3", "blue2", #CC00CC
, "red2", "gray",
and "white", according the ids + 1. If fill
is FALSE, other sequences
will be drawn by the mutation sites comparing to the first sequences.
Otherwise, they be drawn entirely.
If X.class
is set, the sequences will be drawn in cluster order.
If Mu
is NULL
, the consensus sequence of X
will be drawn.
If label
is TRUE, the bottom row will be drawn in color "orange"
to indicate segregating sites.
with.gap
is only used when Mu
is NULL
.
A dots plot will be drawn.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) # For nucleotide X <- seq.data.toy$org par(mfrow = c(2, 2)) plotdots(X) plotdots(X, diff.only = FALSE) plotdots(X, diff.only = FALSE, label = FALSE) plotdots(X, fill = TRUE, diff.only = FALSE, label = FALSE) # With class ids X.class <- as.numeric(gsub(".*-(.*)", "\\1", seq.data.toy$seqname)) plotdots(X, X.class) # For SNP X.SNP <- nid2sid(X) plotdots(X.SNP, X.class) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # For nucleotide X <- seq.data.toy$org par(mfrow = c(2, 2)) plotdots(X) plotdots(X, diff.only = FALSE) plotdots(X, diff.only = FALSE, label = FALSE) plotdots(X, fill = TRUE, diff.only = FALSE, label = FALSE) # With class ids X.class <- as.numeric(gsub(".*-(.*)", "\\1", seq.data.toy$seqname)) plotdots(X, X.class) # For SNP X.SNP <- nid2sid(X) plotdots(X.SNP, X.class) ## End(Not run)
This function provides gaps plots of data set to identify regions where gaps enriched. The plot show the proportions of context by sites and the diverse may be caused by mutations, sequencing errors, or alignment errors.
plotgaps(X, code.type = .code.type[1], main = "Gaps Plot", xlab = "Sites", ylab = "Proportion", ...)
plotgaps(X, code.type = .code.type[1], main = "Gaps Plot", xlab = "Sites", ylab = "Proportion", ...)
X |
numerical data matrix with |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
main |
main label, default = "Gaps Plot". |
xlab |
x-axis label, default = "Sites". |
ylab |
y-axis label, default = "Proportion". |
... |
other options passed to |
Proportions of gaps will be drawn.
A gaps plot will be drawn.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) # For nucleotide set.seed(1234) X <- seq.data.toy$org X[sample(c(T, F), length(X), replace = TRUE, prob = c(0.05, 0.95))] <- .nucleotide$nid[.nucleotide$code == "-"] plotgaps(X) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # For nucleotide set.seed(1234) X <- seq.data.toy$org X[sample(c(T, F), length(X), replace = TRUE, prob = c(0.05, 0.95))] <- .nucleotide$nid[.nucleotide$code == "-"] plotgaps(X) ## End(Not run)
Plot histogram to compare number of mutations.
plothist(X, X.class = NULL, Mu = NULL, fill.color = .Color, draw.all = TRUE, main = "Mutation counts", xlab = "Difference", ylab = "Counts", append = FALSE)
plothist(X, X.class = NULL, Mu = NULL, fill.color = .Color, draw.all = TRUE, main = "Mutation counts", xlab = "Difference", ylab = "Counts", append = FALSE)
X |
nid/sid matrix with |
X.class |
class ids indicated for all sequences. |
Mu |
a central sequence with length |
fill.color |
color to fill the histogram. |
draw.all |
draw a histogram use all sequences. |
main |
main label, default = "Mutation counts". |
xlab |
x-axis label, default = "Difference". |
ylab |
y-axis label, default = "Counts". |
append |
overwrite histograms. |
If X.class
is set, the histograms will be drawn by classes and
all sequences will be compared to the central sequence Mu
.
Otherwise, all sequences will be used to count mutations.
draw.all
is not effect if X.class
is not set.
If Mu
is set, it will be used to compare to all other sequences
to count mutations. Otherwise, the first sequence of X
will be
used, and the first sequence in the first class will be used if
X.class
is set. If Mu
is a matrix, the first row will be
used as the central sequence.
Histograms will be drawn to show the number of mutations away from the central sequence.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) X <- seq.data.toy$org plothist(X) # With class ids X.class <- as.numeric(gsub(".*-(.*)", "\\1", seq.data.toy$seqname)) plothist(X, X.class) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) X <- seq.data.toy$org plothist(X) # With class ids X.class <- as.numeric(gsub(".*-(.*)", "\\1", seq.data.toy$seqname)) plothist(X, X.class) ## End(Not run)
This is a enhanced version of plot.phylo
in ape which
can plot trees in Class phylo
including neighbor-joining trees,
unrooted trees, trees with star shapes, ... etc.
plotnj(unrooted.tree, X.class = NULL, type = "u", main = NULL, show.tip.label = FALSE, show.node.label = FALSE, edge.width = 1, edge.width.class = edge.width, ...)
plotnj(unrooted.tree, X.class = NULL, type = "u", main = NULL, show.tip.label = FALSE, show.node.label = FALSE, edge.width = 1, edge.width.class = edge.width, ...)
unrooted.tree |
an unrooted tree in |
X.class |
class ids indicated for all tips. |
type |
plot types, see |
main |
main label. |
show.tip.label |
show tip label if available. |
show.node.label |
show node label if available. |
edge.width |
edge width for all internal branches if |
edge.width.class |
edge width for tip branches if |
... |
other options passed to |
This function is built to plot unrooted trees, but it may also apply
for other trees in Class phylo
.
type
can be "u", "p", "c", "f", "r" as in plot.phylo
.
If X.class
is set, then the tip branches will be drawn with
colors by class ids, and the colors are controlled by .color
.
The width of branches is controlled by edge.width
for all internal
branches and by edge.width.class
for tip branches.
Return a tree plot.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(nsam = 24, opts = "-T -G 0.5") tree.anc <- read.tree(text = ret.ms[3]) is.rooted(tree.anc) tree.new <- as.star.tree(tree.anc) X.class <- rep(1:6, each = 4) par(mfrow = c(2, 2)) plotnj(tree.anc, X.class, type = "u", edge.width.class = 2, main = "unrooted tree") plotnj(tree.new, X.class, type = "u", edge.width.class = 2, main = "star tree") plotnj(tree.anc, X.class, type = "c", edge.width.class = 2, main = "unrooted tree in cladogram") plotnj(tree.new, X.class, type = "r", edge.width.class = 2, main = "star tree in radial") ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(nsam = 24, opts = "-T -G 0.5") tree.anc <- read.tree(text = ret.ms[3]) is.rooted(tree.anc) tree.new <- as.star.tree(tree.anc) X.class <- rep(1:6, each = 4) par(mfrow = c(2, 2)) plotnj(tree.anc, X.class, type = "u", edge.width.class = 2, main = "unrooted tree") plotnj(tree.new, X.class, type = "u", edge.width.class = 2, main = "star tree") plotnj(tree.anc, X.class, type = "c", edge.width.class = 2, main = "unrooted tree in cladogram") plotnj(tree.new, X.class, type = "r", edge.width.class = 2, main = "star tree in radial") ## End(Not run)
This function provides structure plots of data set given based on posterior
probabilities, the Z.normalized
matrix.
plotstruct(Z, X.class = NULL, sort.inside.class = TRUE, direction = "h", main = "Structure Plot", xlab = "Observations", ylab = "Posterior Probabilities", ...)
plotstruct(Z, X.class = NULL, sort.inside.class = TRUE, direction = "h", main = "Structure Plot", xlab = "Observations", ylab = "Posterior Probabilities", ...)
Z |
a Z matrix as Z.normalized in |
X.class |
class ids indicated for all observations |
sort.inside.class |
sort observations inside class by max posteriors. |
direction |
either "h" for horizontal or "v" for vertical. |
main |
main label, default = "Structure Plot". |
xlab |
x-axis label, default = "Observations". |
ylab |
y-axis label, default = "Posterior Probabilities". |
... |
other options passed to |
The posterior probabilities in ret.phyclust$Z.normalized
will
be drawn in colors.
If X.class
is submitted, the plot will draw in the order of
class ids and the sort.inside.class
will be skipped.
A structure plot will be drawn.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
phyclust
,
find.best
,
plotdots
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.1 <- phyclust(seq.data.toy$org, 3) plotstruct(ret.1$Z.normalized) windows() plotstruct(ret.1$Z.normalized, sort.inside.class = FALSE) # With class ids X.class <- as.numeric(gsub(".*-(.*)", "\\1", seq.data.toy$seqname)) windows() plotstruct(ret.1$Z.normalized, X.class = X.class) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.1 <- phyclust(seq.data.toy$org, 3) plotstruct(ret.1$Z.normalized) windows() plotstruct(ret.1$Z.normalized, sort.inside.class = FALSE) # With class ids X.class <- as.numeric(gsub(".*-(.*)", "\\1", seq.data.toy$seqname)) windows() plotstruct(ret.1$Z.normalized, X.class = X.class) ## End(Not run)
Several classes are declared in phyclust, and these are functions to print and summary objects.
## S3 method for class 'baseml' print(x, ...) ## S3 method for class 'ms' print(x, ...) ## S3 method for class 'phyclust' print(x, digits = max(4, getOption("digits") - 3), ...) ## S3 method for class 'Pt' print(x, ...) ## S3 method for class 'RRand' print(x, digits = max(4, getOption("digits") - 3), ...) ## S3 method for class 'seq.data' print(x, ...) ## S3 method for class 'seqgen' print(x, ...) ## S3 method for class 'phyclust' summary(object, ...)
## S3 method for class 'baseml' print(x, ...) ## S3 method for class 'ms' print(x, ...) ## S3 method for class 'phyclust' print(x, digits = max(4, getOption("digits") - 3), ...) ## S3 method for class 'Pt' print(x, ...) ## S3 method for class 'RRand' print(x, digits = max(4, getOption("digits") - 3), ...) ## S3 method for class 'seq.data' print(x, ...) ## S3 method for class 'seqgen' print(x, ...) ## S3 method for class 'phyclust' summary(object, ...)
x |
an object with the class attributes. |
digits |
for printing out numbers. |
object |
an object with the class attributes. |
... |
other possible options. |
These are useful functions for summarizing and debugging.
For ms
, seqgen
, and paml.baseml
, it will show the
result as standalone versions on the STDOUT out with line by line.
For other functions, they only show summaries of objects. Use
names
or str
to explore the details.
The results will cat or print on the STDOUT by default.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
ms
,
paml.baseml
,
phyclust
,
phyclust.Pt
,
RRand
,
seqgen
.
## Not run: library(phyclust, quiet = TRUE) # Functions applied by directly type the names of objects. ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) # Functions applied by directly type the names of objects. ## End(Not run)
This function prune the center sequences Mu where the sites will be reset as GAPs if all members within the same cluster are all GAPs.
prune.Mu(X, X.class, Mu, code.type = .code.type[1])
prune.Mu(X, X.class, Mu, code.type = .code.type[1])
X |
numerical data matrix with |
X.class |
class ids indicated for all sequences. |
Mu |
a center sequence with length |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
For each cluster indicated by X.class
, this function will prune
Mu
and reset the sites as GAPs if all members within cluster
are all GAPs. Mu
are usually the returning values of
phyclust()
.
A pruned Mu
will be returned.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) X <- seq.data.toy$org X[, 5] <- .nucleotide$nid[.nucleotide$code == "-"] ret <- phyclust(X, 2) Mu.GAPs <- prune.Mu(X, ret$class.id, ret$Mu) ret$Mu[, 5] Mu.GAPs[, 5] # Replace by GAPs. ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) X <- seq.data.toy$org X[, 5] <- .nucleotide$nid[.nucleotide$code == "-"] ret <- phyclust(X, 2) Mu.GAPs <- prune.Mu(X, ret$class.id, ret$Mu) ret$Mu[, 5] Mu.GAPs[, 5] # Replace by GAPs. ## End(Not run)
This function can read the results generated by seqgen
and
turn into a object in Class seq.data
.
read.seqgen(text, byrow = TRUE, code.type = .code.type[1])
read.seqgen(text, byrow = TRUE, code.type = .code.type[1])
text |
a text vector generated by |
byrow |
advanced option, default = TRUE. |
code.type |
either "NUCLEOTIDE" (default) or "SNP". |
If code.type
is "SNP", the A, G will be transfered to 1, and
the C, T will be transfered to 2.
Return an object in Class seq.data
.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
seqgen
,
gen.seq.HKY
,
gen.seq.SNP
.
## Not run: library(phyclust, quiet = TRUE) set.seed(123) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") ret.seqgen <- seqgen(opts = "-mHKY -l40 -s0.2", newick.tree = ret.ms[3]) (ret.nucleotide <- read.seqgen(ret.seqgen)) (ret.snp <- read.seqgen(ret.seqgen, code.type = "SNP")) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(123) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") ret.seqgen <- seqgen(opts = "-mHKY -l40 -s0.2", newick.tree = ret.ms[3]) (ret.nucleotide <- read.seqgen(ret.seqgen)) (ret.snp <- read.seqgen(ret.seqgen, code.type = "SNP")) ## End(Not run)
This function rescaled the input rooted tree height by a scale.height.
rescale.rooted.tree(rooted.tree, scale.height = 1)
rescale.rooted.tree(rooted.tree, scale.height = 1)
rooted.tree |
a rooted tree in |
scale.height |
a scale to all branch lengths. |
The rooted.tree
should be in a phylo
class of ape,
and may be created by ms
.
scale.height
is a positive number multiplying on the lengths of
all branches of the rooted tree.
Return a rooted tree in Class phylo
with scaled branches.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
ms
,
read.tree
,
as.phylo
,
get.rooted.tree.height
.
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(5, 1, opts = paste("-T", sep = " ")) tree.ms <- read.tree(text = ret.ms[3]) get.rooted.tree.height(tree.ms) tree.scaled <- rescale.rooted.tree(tree.ms, scale.height = 2) get.rooted.tree.height(tree.scaled) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(1234) ret.ms <- ms(5, 1, opts = paste("-T", sep = " ")) tree.ms <- read.tree(text = ret.ms[3]) get.rooted.tree.height(tree.ms) tree.scaled <- rescale.rooted.tree(tree.ms, scale.height = 2) get.rooted.tree.height(tree.scaled) ## End(Not run)
This function returns the Rand index and the adjusted Rand index for given true class ids and predicted class ids.
RRand(trcl, prcl, lab = NULL)
RRand(trcl, prcl, lab = NULL)
trcl |
true class ids. |
prcl |
predicted class ids. |
lab |
known ids for semi-supervised clustering. |
All ids, trcl
and prcl
, should be positive integers and
started from 1 to K, and the maximums are allowed to be different.
lab
used in semi-supervised clustering contains the labels which
are known before clustering. It should be positive integer and
started from 1 for labeled data and 0 for unlabeled data.
Return a Class RRand
contains Rand index and adjusted
Rand index.
Ranjan Maitra.
Maintain: Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) true.id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) pred.id <- c(2, 1, 2, 1, 1, 1, 2, 1, 1) label <- c(0, 0, 0, 0, 1, 0, 2, 0, 0) RRand(true.id, pred.id) RRand(true.id, pred.id, lab = label) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) true.id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) pred.id <- c(2, 1, 2, 1, 1, 1, 2, 1, 1) label <- c(0, 0, 0, 0, 1, 0, 2, 0, 0) RRand(true.id, pred.id) RRand(true.id, pred.id, lab = label) ## End(Not run)
A toy dataset, seq.data.toy
, with 100 nucleotide sequences
and 200 sites in 4 clusters.
seq.data.gap
contains some missing values indicated by "-".
This data contains a list with a seq.data
structure described
in the ‘Details’ section.
A toy dataset is generated to demonstrate phyclust. It has 100 nucleotide sequences and 200 sites in 4 clusters where the ancestral tree height 0.15 and the descendant tree height 0.09, and sequences are evolved by a HKY85 model.
The structre of class seq.data
is a list containing:
code.type |
either "NUCLEOTIDE" or "SNP". |
info |
header for phylip file. |
nseq |
number of sequences, . |
seqlen |
length of sequences, . |
seqname |
sequence names. |
org.code |
original codes, dim = . |
org |
transfered ids, dim = . |
byrow |
TRUE for dim = , FALSE for transpose. |
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
## Not run: library(phyclust, quiet = TRUE) seq.data.toy seq.data.gap par(mfrow = c(1, 2)) plotdots(seq.data.toy$org) plotdots(seq.data.gap$org) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) seq.data.toy seq.data.gap par(mfrow = c(1, 2)) plotdots(seq.data.toy$org) plotdots(seq.data.gap$org) ## End(Not run)
This function modifies the original standalone code of seq-gen
developed by Rambaut, A. and Grassly, N.C. (1997).
seqgen(opts = NULL, rooted.tree = NULL, newick.tree = NULL, input = NULL, temp.file = NULL)
seqgen(opts = NULL, rooted.tree = NULL, newick.tree = NULL, input = NULL, temp.file = NULL)
opts |
options as the standalone version. |
rooted.tree |
a rooted tree which sequences are generated according to. |
newick.tree |
a NEWICK tree which sequences are generated according to. |
input |
optional inputs of seq-gen, e.g. ancestral sequences. |
temp.file |
temporary file for seqgen output. |
This function directly reuses the C code of seq-gen
by arguments
as input from the STDIN. The options opts
is followed from the
original seq-gen
except an input tree.
Input either a rooted.tree
or a newick.tree
, and
rooted.tree
should have a Class phylo
.
For examples, options commonly used in phyclust are:
"-m": set an evolution model, e.g. "-mHKY".
"-t": set transition/transversion ratio, e.g. "-t0.7".
"-f": equilibrium probabilities of A, C, G, and T, e.g. "-f0.1,0.2,0.3,0.4".
"-l": length of sequences, e.g. "-l10".
"-s": scale rate for the total height of input tree, "-s0.2".
"-k": index of ancestral sequence in input file, see gen.seq.HKY
.
These will return sequences in Format phylip
which can be
read by read.seqgen()
and transfered into an object with
Class seq.data
.
The maximum number of tips is 2000 in seqgen()
by default, but an
extra option opts = "-u 2014 ..."
can be simply increase the size
to 2014.
Note:
input
and rooted.tree
/newick.tree
can not be
submitted at the same time.
seq-gen
use the order A, C, G, T.
-t
is ts/tv ratio which is not equal to .
See more examples in gen.seq.HKY()
and gen.seq.SNP()
.
temp.file
allows users to specify seqgen output file themselves, but
this file will not be deleted nor converted into R after the call to
seqgen()
. Users should take care the readings. By default,
seqgen()
uses a system temp file to store the output which is converted into R
after the call and is deleted after converting.
This function returns a vector, and each element stores one line of STDOUT
of seq-gen
separated by newline. The vector stores in a class
seqgen
. The details of output format can found on the website
http://tree.bio.ed.ac.uk/software/seqgen/ and its manual.
Carefully read the seq-gen
's original document before using the
seqgen()
function.
Rambaut, A. and Grassly, N.C. (1997).
Maintain: Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
Rambaut, A. and Grassly, N.C. (1997) “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees”, Computer Applications In The Biosciences, 13:3, 235-238. http://tree.bio.ed.ac.uk/software/seqgen/
print.seqgen()
,
read.tree()
,
ms()
,
gen.seq.HKY()
,
gen.seq.SNP()
.
## Not run: library(phyclust, quiet = TRUE) set.seed(123) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") seqgen(opts = "-mHKY -l40 -s0.2", newick.tree = ret.ms[3]) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) set.seed(123) ret.ms <- ms(nsam = 5, nreps = 1, opts = "-T") seqgen(opts = "-mHKY -l40 -s0.2", newick.tree = ret.ms[3]) ## End(Not run)
Transfer SNP codes (1, 2, -) and SNP ids (0, 1, 2).
snp2sid(snpseq) sid2snp(sidseq)
snp2sid(snpseq) sid2snp(sidseq)
snpseq |
a character vector contains SNP codes, 1, 2, or -. |
sidseq |
a numerical vector contains SNP ids, 0, 1, or 2. |
This function is based on the internal object .snp
to
transfer SNP codes and SNP ids.
snp2sid
returns a numerical vector containing SNP ids, and
sid2snp
returns a character vector containing SNP codes.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
.snp
,
code2nid
,
nid2code
,
code2snp
,
snp2code
.
## Not run: library(phyclust, quiet = TRUE) a <- c("1", "2", "1", "-", "2") snp2sid(a) sid2snp(snp2sid(a)) ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) a <- c("1", "2", "1", "-", "2") snp2sid(a) sid2snp(snp2sid(a)) ## End(Not run)
Standard codes and ids for nucleotides, SNPs, codon, amino acid and genetic code. All objects are used to transfer data. These are read-only objects and the elemental order is followed in C.
.nucleotide .snp .codon .amino.acid .genetic.code
.nucleotide .snp .codon .amino.acid .genetic.code
All objects are data frames containing ids and codes.
Note: All ids are coding started from 0.
Nucleotides, A, G, C, T, and - have codes 0, 1, 2, 3, and 4 where "-" is for gaps. SNPs, 1, 2, and - have code codes 0, 1, and 2.
These are objects are in data frames unlike other internal objects due to heavily used in processing data. The original data should be transfered to numerical codes in order to be passed to C codes. In C codes, we use integers, 0, 1, 2, ..., for coding nucleotides or SNPs and so on.
Now, models and methods are implemented only for .nucleotide
and
.snp
. Other objects are leaved for further extension.
Data frames use factor formats as the default, and as.character
is
the way to transfer to the characters.
Wei-Chen Chen [email protected]
Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/
.show.option
,
code2nid
,
nid2code
,
snp2sid
,
sid2snp
.
## Not run: library(phyclust, quiet = TRUE) .nucleotide .snp .codon .amino.acid .genetic.code .missing.code ## End(Not run)
## Not run: library(phyclust, quiet = TRUE) .nucleotide .snp .codon .amino.acid .genetic.code .missing.code ## End(Not run)