The Sense of Confidence during Probabilistic Learning: A Normative Account

by Florent Meyniel, Daniel Schlunegger, Stanislas Dehaene

Learning in a stochastic environment consists of estimating a model from a limited amount of noisy data, and is therefore inherently uncertain. However, many classical models reduce the learning process to the updating of parameter estimates and neglect the fact that learning is also frequently accompanied by a variable “feeling of knowing” or confidence. The characteristics and the origin of these subjective confidence estimates thus remain largely unknown. Here we investigate whether, during learning, humans not only infer a model of their environment, but also derive an accurate sense of confidence from their inferences. In our experiment, humans estimated the transition probabilities between two visual or auditory stimuli in a changing environment, and reported their mean estimate and their confidence in this report. To formalize the link between both kinds of estimate and assess their accuracy in comparison to a normative reference, we derive the optimal inference strategy for our task. Our results indicate that subjects accurately track the likelihood that their inferences are correct. Learning and estimating confidence in what has been learned appear to be two intimately related abilities, suggesting that they arise from a single inference process. We show that human performance matches several properties of the optimal probabilistic inference. In particular, subjective confidence is impacted by environmental uncertainty, both at the first level (uncertainty in stimulus occurrence given the inferred stochastic characteristics) and at the second level (uncertainty due to unexpected changes in these stochastic characteristics). Confidence also increases appropriately with the number of observations within stable periods. Our results support the idea that humans possess a quantitative sense of confidence in their inferences about abstract non-sensory parameters of the environment. 
This ability cannot be reduced to simple heuristics; instead, it seems to be a core property of the learning process.
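The claim that confidence should grow with the number of observations within a stable period can be illustrated with a much simpler model than the one in the paper. The sketch below assumes a single fixed transition probability with a Beta prior, and it deliberately ignores the change points that the actual task includes. The function names and the choice of log precision as the confidence measure are assumptions made for illustration, not the authors' implementation.

```python
from math import log

def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return (mean, variance) of the Beta posterior over one transition probability.

    Assumes a stationary environment (no change points), which is a
    simplification of the task described in the abstract.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

def confidence(successes, failures):
    """Confidence as log precision (negative log posterior variance).

    This definition is an illustrative assumption.
    """
    _, variance = beta_posterior(successes, failures)
    return -log(variance)

# Same observed proportion (75%), ten times more observations.
few = confidence(3, 1)
many = confidence(30, 10)
print(few < many)  # more data in a stable period gives higher confidence
```

In this simplified setting the posterior mean plays the role of the reported estimate and the posterior spread plays the role of the reported confidence, so both come from one inference.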

  • PLOS Computational Biology
  • 10 years ago

De novo meta-assembly of ultra-deep sequencing data

We introduce a new divide-and-conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized ‘slices’ and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more robust to high sequencing error rates than the base assembler.

Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/.

Contact: hamid.mirebrahim@email.ucr.edu

  • Bioinformatics
  • 10 years ago
  • GENES

A hierarchical Bayesian model for flexible module discovery in three-way time-series data

Motivation: Detecting modules of co-ordinated activity is fundamental in the analysis of large biological studies. For two-dimensional data (e.g. genes x patients), this is often done via clustering or biclustering. More recently, studies monitoring patients over time have added another dimension. Analysis is much more challenging in this case, especially when time measurements are not synchronized. New methods that can analyze three-way data are thus needed.

Results: We present a new algorithm for finding coherent and flexible modules in three-way data. Our method can identify both core modules that appear in multiple patients and patient-specific augmentations of these core modules that contain additional genes. Our algorithm is based on a hierarchical Bayesian data model and Gibbs sampling. The algorithm outperforms extant methods on simulated and on real data. The method successfully dissected key components of septic shock response from time series measurements of gene expression. Detected patient-specific module augmentations were informative for disease outcome. In analyzing brain functional magnetic resonance imaging time series of subjects at rest, it detected the pertinent brain regions involved.

Availability and implementation: R code and data are available at http://acgt.cs.tau.ac.il/twigs/.

Contact: rshamir@tau.ac.il

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

Cypiripi: exact genotyping of CYP2D6 using high-throughput sequencing data

Motivation: CYP2D6 is a highly polymorphic gene which encodes the CYP2D6 enzyme, involved in the metabolism of 20–25% of all clinically prescribed drugs and other xenobiotics in the human body. CYP2D6 genotyping is recommended prior to treatment decisions involving one or more of the numerous drugs sensitive to CYP2D6 allelic composition. In this context, high-throughput sequencing (HTS) technologies provide a promising time-efficient and cost-effective alternative to currently used genotyping techniques. To achieve accurate interpretation of HTS data, however, one needs to overcome several obstacles such as high sequence similarity and genetic recombinations between CYP2D6 and the evolutionarily related pseudogenes CYP2D7 and CYP2D8, high copy number variation among individuals and the short read lengths generated by HTS technologies.

Results: In this work, we present the first algorithm to computationally infer CYP2D6 genotype at basepair resolution from HTS data. Our algorithm is able to resolve complex genotypes, including alleles that are the products of duplication, deletion and fusion events involving CYP2D6 and its evolutionarily related cousin CYP2D7. Through extensive experiments using simulated and real datasets, we show that our algorithm accurately solves this important problem with potential clinical implications.

Availability and implementation: Cypiripi is available at http://sfu-compbio.github.io/cypiripi.

Contact: cenk@sfu.ca.

  • Bioinformatics
  • 10 years ago
  • GENES

Reconstructing 16S rRNA genes in metagenomic data

Metagenomic data, which contain sequenced DNA reads of uncultured microbial species from environmental samples, provide a unique opportunity to thoroughly analyze microbial species that have never been identified before. Reconstructing 16S ribosomal RNA, a phylogenetic marker gene, is usually required to analyze the composition of the metagenomic data. However, the massive volume of data, high sequence similarity between related species, skewed microbial abundance and lack of reference genes make 16S rRNA reconstruction difficult. Generic de novo assembly tools are not optimized for assembling 16S rRNA genes. In this work, we introduce a targeted rRNA assembly tool, REAGO (REconstruct 16S ribosomal RNA Genes from metagenOmic data). It addresses the above challenges by combining secondary structure-aware homology search, properties of rRNA genes and de novo assembly. Our experimental results show that our tool can correctly recover more rRNA genes than several popular generic metagenomic assembly tools and specially designed rRNA construction tools.

Availability and implementation: The source code of REAGO is freely available at https://github.com/chengyuan/reago.

Contact: yannisun@msu.edu

  • Bioinformatics
  • 10 years ago
  • GENES

ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes

Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed ‘bipartitions’. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent.

Results: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL’s running time is $$O(n^{2}k|X|^{2})$$, and ASTRAL-II’s running time is $$O(nk|X|^{2})$$, where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space.
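As a rough reading of the two bounds just stated, their ratio can be written out explicitly. The symbols T_ASTRAL and T_ASTRAL-II are introduced here only for illustration, and the comparison assumes the same n, k and X for both methods and ignores constant factors.

```latex
\frac{T_{\text{ASTRAL}}}{T_{\text{ASTRAL-II}}}
  \sim \frac{n^{2}\, k\, |X|^{2}}{n\, k\, |X|^{2}} = n
```

Under these assumptions the asymptotic saving is a factor of n, so for a dataset at the stated upper end of 1000 species the bound improves by roughly three orders of magnitude. Measured running times need not follow the same ratio.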

Availability and implementation: ASTRAL-II is available in open source at https://github.com/smirarab/ASTRAL and datasets used are available at http://www.cs.utexas.edu/~phylo/datasets/astral2/.

Contact: smirarab@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis

The analysis of concentrations of circulating antibodies in serum (antibody repertoire) is a fundamental, yet poorly studied, problem in immunoinformatics. The two current approaches to the analysis of antibody repertoires [next generation sequencing (NGS) and mass spectrometry (MS)] present difficult computational challenges since antibodies are not directly encoded in the germline but are extensively diversified by somatic recombination and hypermutations. Therefore, the protein database required for the interpretation of spectra from circulating antibodies is custom for each individual. Although such a database can be constructed via NGS, the reads generated by NGS are error-prone and even a single nucleotide error precludes identification of a peptide by the standard proteomics tools. Here, we present the IgRepertoireConstructor algorithm that performs error-correction of immunosequencing reads and uses mass spectra to validate the constructed antibody repertoires.

Availability and implementation: IgRepertoireConstructor is open source and freely available as a C++ and Python program running on all Unix-compatible platforms. The source code is available from http://bioinf.spbau.ru/igtools.

Contact: ppevzner@ucsd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

Reconstruction of clonal trees and tumor composition from multi-sample sequencing data

Motivation: DNA sequencing of multiple samples from the same tumor provides data to analyze the process of clonal evolution in the population of cells that give rise to a tumor.

Results: We formalize the problem of reconstructing the clonal evolution of a tumor using single-nucleotide mutations as the variant allele frequency (VAF) factorization problem. We derive a combinatorial characterization of the solutions to this problem and show that the problem is NP-complete. We derive an integer linear programming solution to the VAF factorization problem in the case of error-free data and extend this solution to real data with a probabilistic model for errors. The resulting AncesTree algorithm is better able to identify ancestral relationships between individual mutations than existing approaches, particularly in ultra-deep sequencing data when high read counts for mutations yield high confidence VAFs.

Availability and implementation: An implementation of AncesTree is available at: http://compbio.cs.brown.edu/software.

Contact: braphael@brown.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

Robust reconstruction of gene expression profiles from reporter gene data using linear inversion

Motivation: Time-series observations from reporter gene experiments are commonly used for inferring and analyzing dynamical models of regulatory networks. The robust estimation of promoter activities and protein concentrations from primary data is a difficult problem due to measurement noise and the indirect relation between the measurements and quantities of biological interest.

Results: We propose a general approach based on regularized linear inversion to solve a range of estimation problems in the analysis of reporter gene data, notably the inference of growth rate, promoter activity, and protein concentration profiles. We evaluate the validity of the approach using in silico simulation studies, and observe that the methods are more robust and less biased than indirect approaches usually encountered in the experimental literature based on smoothing and subsequent processing of the primary data. We apply the methods to the analysis of fluorescent reporter gene data acquired in kinetic experiments with Escherichia coli. The methods are capable of reliably reconstructing time-course profiles of growth rate, promoter activity and protein concentration from weak and noisy signals at low population volumes. Moreover, they capture critical features of those profiles, notably rapid changes in gene expression during growth transitions.
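For readers unfamiliar with regularized linear inversion, a generic Tikhonov-style formulation is sketched below. This is the textbook form, not necessarily the exact formulation used in the article. All symbols are assumptions introduced for illustration: y is the vector of measurements, A a known linear observation operator, x the unknown profile (for example promoter activity over time), L a smoothness operator such as a discrete derivative, and λ > 0 a regularization weight.

```latex
\hat{x}
  = \arg\min_{x} \; \lVert A x - y \rVert_{2}^{2} + \lambda \lVert L x \rVert_{2}^{2}
  = \left( A^{\top} A + \lambda L^{\top} L \right)^{-1} A^{\top} y
```

The closed form holds whenever the matrix in parentheses is invertible. A larger λ yields smoother estimates at the cost of more bias, which is the usual trade-off when recovering profiles from weak and noisy signals.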

Availability and implementation: The methods described in this article are made available as a Python package (LGPL license) and also accessible through a web interface. For more information, see https://team.inria.fr/ibis/wellinverter.

Contact: Hidde.de-Jong@inria.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

Misassembly detection using paired-end sequence reads and optical mapping data

Motivation: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis and used real optical mapping data for rice and budgerigar.

Results: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F. tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar.

Availability and implementation: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/.

Contact: muggli@cs.colostate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data

Motivation: High-dimensional single-cell snapshot data are becoming widespread in the systems biology community as a means to understand biological processes at the cellular level. However, as temporal information is lost with such data, mathematical models have been limited to capturing only static features of the underlying cellular mechanisms.

Results: Here, we present a modular framework that recovers temporal behaviour from single-cell snapshot data and reverse-engineers the dynamics of gene expression. The framework combines a dimensionality reduction method with a cell time-ordering algorithm to generate pseudo time-series observations. These are in turn used to learn transcriptional ODE models and to perform model selection on structural network features. We apply it to synthetic data and then to real hematopoietic stem cell data, to reconstruct gene expression dynamics during differentiation pathways and infer the structure of a key gene regulatory network.

Availability and implementation: C++ and Matlab code available at https://www.helmholtz-muenchen.de/fileadmin/ICB/software/inferenceSnapshot.zip.

Contact: fabian.theis@helmholtz-muenchen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

Inferring orthologous gene regulatory networks using interspecies data fusion

Motivation: The ability to jointly learn gene regulatory networks (GRNs) in, or leverage GRNs between related species would allow the vast amount of legacy data obtained in model organisms to inform the GRNs of more complex, or economically or medically relevant counterparts. Examples include transferring information from Arabidopsis thaliana into related crop species for food security purposes, or from mice into humans for medical applications. Here we develop two related Bayesian approaches to network inference that allow GRNs to be jointly inferred in, or leveraged between, several related species: in one framework, network information is directly propagated between species; in the second hierarchical approach, network information is propagated via an unobserved ‘hypernetwork’. In both frameworks, information about network similarity is captured via graph kernels, with the networks additionally informed by species-specific time series gene expression data, when available, using Gaussian processes to model the dynamics of gene expression.

Results: Results on in silico benchmarks demonstrate that joint inference, and leveraging of known networks between species, offers better accuracy than standalone inference. The direct propagation of network information via the non-hierarchical framework is more appropriate when there are relatively few species, while the hierarchical approach is better suited when there are many species. Both methods are robust to small amounts of mislabelling of orthologues. Finally, the use of Saccharomyces cerevisiae data and networks to inform inference of networks in the fission yeast Schizosaccharomyces pombe predicts a novel role in cell cycle regulation for Gas1 (SPAC19B12.02c), a 1,3-beta-glucanosyltransferase.

Availability and implementation: MATLAB code is available from http://go.warwick.ac.uk/systemsbiology/software/.

Contact: d.l.wild@warwick.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENES

MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms

Summary: Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information at the genomic and transcript levels. In proteogenomics, these multi-omics data are combined to analyze unannotated organisms and to allow more accurate sample-specific predictions. Existing analysis methods still mainly depend on six-frame translations or reference protein databases that are extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frame translations introduce an artificial sixfold increase of the target database, and SNP integration requires a suitable database summarizing results from previous experiments. We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq driven transcript databases. MSProGene is independent of existing reference databases or annotated SNPs and avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm. It thereby also allows resolving the ambiguity of shared peptides for protein inference. We applied MSProGene to three datasets and show that it facilitates database-independent, reliable and accurate prediction at the gene and protein levels and additionally identifies novel genes.

Availability and implementation: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/.

Contact: renardb@rki.de

  • Bioinformatics
  • 10 years ago
  • PROTEINS

Large-scale model quality assessment for improving protein tertiary structure prediction

Motivation: Sampling structural models and ranking them are the two major challenges of protein structure prediction. Traditional protein structure prediction methods generally use one or a few quality assessment (QA) methods to select the best-predicted models, which cannot consistently select relatively better models and rank a large number of models well.

Results: Here, we develop a novel large-scale model QA method in conjunction with model clustering to rank and select protein structural models. In an unprecedented step, it applied 14 model QA methods to generate consensus model rankings, followed by model refinement based on model combination (i.e. averaging). Our experiment demonstrates that the large-scale model QA approach is more consistent and robust in selecting models of better quality than any individual QA method. Our method was blindly tested during the 11th Critical Assessment of Techniques for Protein Structure Prediction (CASP11) as the MULTICOM group. It was officially ranked third out of all 143 human and server predictors according to the total scores of the first models predicted for 78 CASP11 protein domains and second according to the total scores of the best of the five models predicted for these domains. MULTICOM’s outstanding performance in the extremely competitive 2014 CASP11 experiment proves that our large-scale QA approach together with model clustering is a promising solution to one of the two major problems in protein structure modeling.

Availability and implementation: The web server is available at: http://sysbio.rnet.missouri.edu/multicom_cluster/human/.

Contact: chengji@missouri.edu

  • Bioinformatics
  • 10 years ago
  • PROTEINS

Using kernelized partial canonical correlation analysis to study directly coupled side chains and allostery in small G proteins

Motivation: Inferring structural dependencies among a protein’s side chains helps us understand their coupled motions. It is known that coupled fluctuations can reveal pathways of communication used for information propagation in a molecule. Side-chain conformations are commonly represented by multivariate angular variables, but existing partial correlation methods that can be applied to this inference task are not capable of handling multivariate angular data. We propose a novel method to infer direct couplings from this type of data, and show that this method is useful for identifying functional regions and their interactions in allosteric proteins.

Results: We developed a novel extension of canonical correlation analysis (CCA), which we call ‘kernelized partial CCA’ (or simply KPCCA), and used it to infer direct couplings between side chains, while disentangling these couplings from indirect ones. Using the conformational information and fluctuations of the inactive structure alone for allosteric proteins in the Ras and other Ras-like families, our method identified allosterically important residues not only as strongly coupled ones but also in densely connected regions of the interaction graph formed by the inferred couplings. Our results were in good agreement with other empirical findings. By studying distinct members of the Ras, Rho and Rab sub-families, we show further that KPCCA was capable of inferring common allosteric characteristics in the small G protein super-family.

Availability and implementation: https://github.com/lsgh/ismb15

Contact: lsoltang@uwaterloo.ca

  • Bioinformatics
  • 10 years ago
  • PROTEINS

Finding optimal interaction interface alignments between biological complexes

Motivation: Biological molecules perform their functions through interactions with other molecules. Structure alignment of interaction interfaces between biological complexes is an indispensable step in detecting their structural similarities, which are keys to understanding their evolutionary histories and functions. Although various structure alignment methods have been developed to successfully assess the similarities of protein structures or certain types of interaction interfaces, existing alignment tools cannot directly align arbitrary types of interfaces formed by protein, DNA or RNA molecules. Specifically, they require a blackbox preprocessing step to standardize interface types and chain identifiers. Yet their performance is limited and sometimes unsatisfactory.

Results: Here we introduce a novel method, PROSTA-inter, that automatically determines and aligns interaction interfaces between two arbitrary types of complex structures. Our method uses sequentially remote fragments to search for the optimal superimposition. The optimal residue matching problem is then formulated as a maximum weighted bipartite matching problem to detect the optimal sequence order-independent alignment. Benchmark evaluation on all non-redundant protein–DNA complexes in PDB shows significant performance improvement of our method over TM-align and iAlign (with the blackbox preprocessing). Two case studies where our method discovers, for the first time, structural similarities between two pairs of functionally related protein–DNA complexes are presented. We further demonstrate the power of our method in detecting structural similarities between a protein–protein complex and a protein–RNA complex, which is biologically known as a protein–RNA mimicry case.
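To make the matching formulation concrete, the following minimal sketch solves a tiny maximum weighted bipartite matching by brute force. It is not the PROSTA-inter implementation: the function name, the score matrix and the requirement that every residue be matched are illustrative assumptions, and a practical tool would use a polynomial-time method such as the Hungarian algorithm rather than enumerating permutations.

```python
from itertools import permutations

def max_weight_matching(weights):
    """Brute-force maximum weighted bipartite matching on a small square matrix.

    weights[i][j] is a hypothetical similarity score for pairing residue i of
    one interface with residue j of the other. Returns (best_total, assignment),
    where assignment[i] is the partner chosen for residue i.
    Runs in O(n!) time, so it is only suitable for tiny examples.
    """
    n = len(weights)
    best_total = float("-inf")
    best_assignment = None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_assignment = total, perm
    return best_total, best_assignment

# Made-up scores for three residues on each side.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.2],
]
total, assignment = max_weight_matching(scores)
print(assignment)  # residues 1 and 2 are paired in swapped order
```

Because the matching only maximizes the total score, the chosen pairs need not preserve sequence order, which is what makes the resulting alignment sequence order-independent.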

Availability and implementation: The PROSTA-inter web-server is publicly available at http://www.cbrc.kaust.edu.sa/prosta/.

Contact: xin.gao@kaust.edu.sa

  • Bioinformatics
  • 10 years ago
  • PROTEINS

Deconvolving molecular signatures of interactions between microbial colonies

Motivation: The interactions between microbial colonies through chemical signaling are not well understood. A microbial colony can use different molecules to inhibit or accelerate the growth of other colonies. A better understanding of the molecules involved in these interactions could lead to advancements in health and medicine. Imaging mass spectrometry (IMS) applied to co-cultured microbial communities aims to capture the spatial characteristics of the colonies’ molecular fingerprints. These data are high-dimensional and require computational analysis methods to interpret.

Results: Here, we present a dictionary learning method that deconvolves spectra of different molecules from IMS data. We call this method MOLecular Dictionary Learning (MOLDL). Unlike standard dictionary learning methods which assume Gaussian-distributed data, our method uses the Poisson distribution to capture the count nature of the mass spectrometry data. Also, our method incorporates universally applicable information on common ion types of molecules in MALDI mass spectrometry. This greatly reduces model parameterization and increases deconvolution accuracy by eliminating spurious solutions. Moreover, our method leverages the spatial nature of IMS data by assuming that nearby locations share similar abundances, thus avoiding overfitting to noise. Tests on simulated datasets show that this method has good performance in recovering molecule dictionaries. We also tested our method on real data measured on a microbial community composed of two species. We confirmed through follow-up validation experiments that our method recovered true and complete signatures of molecules. These results indicate that our method can discover molecules in IMS data reliably, and hence can help advance the study of interaction of microbial colonies.
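The difference between a Gaussian and a Poisson data model can be seen directly in the log-likelihood each one assigns to count data. The sketch below is a generic illustration, not the MOLDL objective: the function names, the fixed Gaussian variance and the example counts are assumptions, and the ion-type and spatial smoothness terms described above are left out.

```python
from math import lgamma, log, pi

def poisson_loglik(counts, rates):
    """Log-likelihood of non-negative integer counts under Poisson rates (rates > 0)."""
    return sum(k * log(r) - r - lgamma(k + 1) for k, r in zip(counts, rates))

def gaussian_loglik(values, means, sigma=1.0):
    """Log-likelihood under independent Gaussians with one fixed standard deviation."""
    return sum(
        -0.5 * log(2 * pi * sigma ** 2) - (v - m) ** 2 / (2 * sigma ** 2)
        for v, m in zip(values, means)
    )

# Hypothetical ion counts at three m/z bins and two candidate reconstructions.
counts = [2, 0, 5]
good_fit = [2.0, 0.5, 5.0]
poor_fit = [5.0, 5.0, 5.0]

print(poisson_loglik(counts, good_fit) > poisson_loglik(counts, poor_fit))
```

Under the Poisson model the variance equals the rate, so a deviation of a few counts is penalized heavily at a low-intensity bin and lightly at a high-intensity one. A Gaussian model with a single fixed variance treats both deviations the same way.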

Availability and implementation: The code used in this paper is available at: https://github.com/frizfealer/IMS_project.

Contact: vjojic@cs.unc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • PROTEINS

cNMA: a framework of encounter complex-based normal mode analysis to model conformational changes in protein interactions

Motivation: It remains both a fundamental and practical challenge to understand and anticipate motions and conformational changes of proteins during their associations. Conventional normal mode analysis (NMA) based on the anisotropic network model (ANM) addresses the challenge by generating normal modes reflecting the intrinsic flexibility of proteins, which follows a conformational selection model for protein–protein interactions. But earlier studies have also found cases where conformational selection alone could not adequately explain conformational changes, and other models have been proposed. Moreover, there is a pressing demand for constructing a much reduced but still relevant subset of protein conformational space to improve computational efficiency and accuracy in protein docking, especially for the difficult cases with significant conformational changes.

Method and results: With both conformational selection and induced fit models considered, we extend ANM to include concurrent but differentiated intra- and inter-molecular interactions and develop an encounter complex-based NMA (cNMA) framework. Theoretical analysis and empirical results over a large data set of significant conformational changes indicate that cNMA is capable of generating conformational vectors considerably better at approximating conformational changes with contributions from both intrinsic flexibility and inter-molecular interactions than conventional NMA only considering intrinsic flexibility does. The empirical results also indicate that a straightforward application of conventional NMA to an encounter complex often does not improve upon NMA for an individual protein under study and intra- and inter-molecular interactions need to be differentiated properly. Moreover, in addition to induced motions of a protein under study, the induced motions of its binding partner and the coupling between the two sets of protein motions present in a near-native encounter complex lead to the improved performance. A study to isolate and assess the sole contribution of intermolecular interactions toward improvements against conventional NMA further validates the additional benefit from induced-fit effects. Taken together, these results provide new insights into molecular mechanisms underlying protein interactions and new tools for dimensionality reduction for flexible protein docking.

Availability and implementation: Source codes are available upon request.

Contact: yshen@tamu.edu

  • Bioinformatics
  • 10 years ago
  • PROTEINS

Metabolome-scale de novo pathway reconstruction using regioisomer-sensitive graph alignments

Motivation: Recent advances in mass spectrometry and related metabolomics technologies have enabled the rapid and comprehensive analysis of numerous metabolites. However, biosynthetic and biodegradation pathways are only known for a small portion of metabolites, with most metabolic pathways remaining uncharacterized.

Results: In this study, we developed a novel method for supervised de novo metabolic pathway reconstruction with an improved graph alignment-based approach in the reaction-filling framework. We propose a novel chemical graph alignment algorithm, PACHA (Pairwise Chemical Aligner), to detect the regioisomer-sensitive connectivities between the aligned substructures of two compounds. Unlike other existing graph alignment methods, PACHA can efficiently detect only one common subgraph between two compounds. Our results show that the proposed method outperforms both previous descriptor-based methods and existing graph alignment-based methods in enzymatic reaction-likeness prediction for isomer-enriched reactions. It is also useful for reaction annotation, which assigns potential reaction characteristics such as EC (Enzyme Commission) numbers and PIERO (Enzymatic Reaction Ontology for Partial Information) terms to substrate–product pairs. Finally, we conducted a comprehensive enzymatic reaction-likeness prediction for all possible uncharacterized compound pairs, suggesting potential metabolic pathways for newly predicted substrate–product pairs.

Contact: maskot@bio.titech.ac.jp

  • Bioinformatics
  • 10 years ago
  • SYSTEMS

Exploring the structure and function of temporal networks with dynamic graphlets

Motivation: With the increasing availability of temporal real-world networks, how can these data be studied efficiently? One can model a temporal network as a single aggregate static network, or as a series of time-specific snapshots, each being an aggregate static network over the corresponding time window. One can then apply established methods for static analysis to the resulting aggregate network(s), but in the process valuable temporal information is lost, either completely or at the interface between different snapshots, respectively. Here, we develop a novel approach for studying a temporal network more explicitly, by capturing inter-snapshot relationships.
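The two conventional representations mentioned above can be sketched in a few lines. This is only an illustration of the aggregate and snapshot views, not of the dynamic graphlet method itself; the `(u, v, t)` event format, the undirected edges, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_network(events):
    """Collapse all timestamped edges into one static, undirected edge set."""
    return {frozenset((u, v)) for u, v, _t in events}


def snapshot_networks(events, window):
    """Group edges into consecutive time windows of width `window`.

    Returns one edge set per non-empty window, in time order.
    Windows that contain no events are skipped.
    """
    by_window = defaultdict(set)
    for u, v, t in events:
        by_window[int(t // window)].add(frozenset((u, v)))
    return [by_window[k] for k in sorted(by_window)]


# Hypothetical temporal network: (node, node, timestamp).
toy_events = [("a", "b", 0), ("b", "c", 1), ("a", "b", 5), ("c", "d", 6)]
static_view = aggregate_network(toy_events)      # all temporal order is lost
snapshot_view = snapshot_networks(toy_events, 5)  # order is lost within each window
```

In the aggregate view the edge a-b appears once even though it occurs at two different times, and in the snapshot view nothing links the a-b edge in the first window to the one in the second. That is the information loss the passage describes.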

Results: We base our methodology on well-established graphlets (subgraphs), which have proven useful in numerous contexts in static network research. We develop new theory to allow for graphlet-based analyses of temporal networks. Our new notion of dynamic graphlets differs from existing dynamic network approaches that are based on temporal motifs (statistically significant subgraphs). The latter have limitations: their results depend on the choice of a null network model that is required to evaluate the significance of a subgraph, and choosing a good null model is non-trivial. Our dynamic graphlets overcome these limitations of temporal motifs. Also, when we aim to characterize the structure and function of an entire temporal network or of individual nodes, our dynamic graphlets outperform static graphlets, indicating that accounting for temporal information helps. We apply dynamic graphlets to temporal age-specific molecular network data to deepen our limited knowledge about human aging.

Availability and implementation: http://www.nd.edu/~cone/DG.

Contact: tmilenko@nd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS

Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses

Motivation: Approaches to identifying new risk loci, training risk prediction models, imputing untyped variants and fine-mapping causal variants from summary statistics of genome-wide association studies are playing an increasingly important role in the human genetics community. Current summary statistics-based methods rely on global ‘best guess’ reference panels to model the genetic correlation structure of the dataset being studied. This approach, especially in admixed populations, has the potential to produce misleading results, ignores variation in local structure and is not feasible when appropriate reference panels are missing or small. Here, we develop a method, Adapt-Mix, that combines information across all available reference panels to produce estimates of local genetic correlation structure for summary statistics-based methods in arbitrary populations.
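To make the idea of combining reference panels concrete, here is a minimal sketch that forms a convex combination of per-panel correlation matrices for one local region. The abstract does not say how Adapt-Mix chooses or fits its combination, so the fixed `weights` and the function name `mix_correlations` are purely hypothetical; this shows only the generic mixing step, not the published estimator.

```python
def mix_correlations(panels, weights):
    """Return the weighted average of same-sized correlation matrices.

    panels: list of square matrices (nested lists), one per reference panel.
    weights: non-negative mixture weights summing to 1 (assumed given here;
    how such weights would be learned is outside this sketch).
    """
    if len(panels) != len(weights):
        raise ValueError("need exactly one weight per panel")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    size = len(panels[0])
    return [
        [sum(w * p[i][j] for p, w in zip(panels, weights)) for j in range(size)]
        for i in range(size)
    ]


# Two hypothetical panels with different correlation between the same two variants.
panel_one = [[1.0, 0.8], [0.8, 1.0]]
panel_two = [[1.0, 0.2], [0.2, 1.0]]
local_estimate = mix_correlations([panel_one, panel_two], [0.25, 0.75])
```

A convex combination keeps the diagonal at 1 and preserves positive semi-definiteness, so the result is still a valid correlation matrix. Using different weights for different genomic regions is what would make such an estimate local.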

Results: We applied Adapt-Mix to estimate the genetic correlation structure of both admixed and non-admixed individuals using simulated and real data. We evaluated our method by measuring the performance of two summary statistics-based methods: imputation and joint-testing. When using our method as opposed to the current standard of ‘best guess’ reference panels, we observed a 28% decrease in mean-squared error for imputation and a 73.7% decrease in mean-squared error for joint-testing.

Availability and implementation: Our method is publicly available as a software package, ADAPT-Mix, at https://github.com/dpark27/adapt_mix.

Contact: noah.zaitlen@ucsf.edu

  • Bioinformatics
  • 10 years ago
  • SYSTEMS

Inferring parental genomic ancestries using pooled semi-Markov processes

Motivation: A basic problem of broad public and scientific interest is to use the DNA of an individual to infer the genomic ancestries of the parents. In particular, we are often interested in the fraction of each parent’s genome that comes from specific ancestries (e.g. European, African, Native American). This has many applications, ranging from understanding the inheritance of ancestry-related risks and traits to quantifying human assortative mating patterns.

Results: We model the problem of parental genomic ancestry inference as a pooled semi-Markov process. We develop a general mathematical framework for pooled semi-Markov processes and construct efficient inference algorithms for these models. Applying our inference algorithm to genotype data from 231 Mexican trios and 258 Puerto Rican trios where we have the true genomic ancestry of each parent, we demonstrate that our method accurately infers the parameters of the semi-Markov processes and the parents’ genomic ancestries. We additionally validated the method on simulations. Our pooled semi-Markov process model and inference algorithms may be of independent interest in other settings in genomics and machine learning.

Contact: jazo@microsoft.com

  • Bioinformatics
  • 10 years ago
  • SYSTEMS

Integrative random forest for gene regulatory network inference

Motivation: Gene regulatory network (GRN) inference based on genomic data is one of the most actively pursued computational biological problems. Because different types of biological data usually provide complementary information regarding the underlying GRN, a model that integrates big data of diverse types is expected to increase both the power and accuracy of GRN inference. Towards this goal, we propose a novel algorithm named iRafNet: integrative random forest for gene regulatory network inference.

Results: iRafNet is a flexible, unified integrative framework that allows information from heterogeneous data, such as protein–protein interactions, transcription factor (TF)-DNA binding and gene knock-down, to be jointly considered for GRN inference. Using test data from the DREAM4 and DREAM5 challenges, we demonstrate that iRafNet outperforms the original random forest-based network inference algorithm (GENIE3) and is highly comparable to the community learning approach. We apply iRafNet to construct a GRN in Saccharomyces cerevisiae and demonstrate that it improves the performance in predicting TF-target gene regulations and provides additional functional insights into the predicted gene regulations.

Availability and implementation: The R code of iRafNet implementation and a tutorial are available at: http://research.mssm.edu/tulab/software/irafnet.html

Contact: zhidong.tu@mssm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS

Identification of causal genes for complex traits

Motivation: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants have been validated as causal. We define ‘causal variants’ as variants that are responsible for the association signal at a locus. Whereas association studies benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many variants that are closely correlated due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations.

Results: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability . Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall rate compared with existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2.

Availability and implementation: Software is freely available for download at genetics.cs.ucla.edu/caviar.

Contact: eeskin@cs.ucla.edu

  • Bioinformatics
  • 10 years ago
  • SYSTEMS