KAPPA, a simple algorithm for discovery and clustering of proteins defined by a key amino acid pattern: a case study of the cysteine-rich proteins

Motivation: Proteins defined by a key amino acid pattern are key players in the exchange of signals between bacteria, animals and plants, as well as important mediators for cell–cell communication within a single organism. Their description and characterization open the way to a better knowledge of molecular signalling in a broad range of organisms, and to possible application in medical and agricultural research. The contrasted pattern of evolution in these proteins makes it difficult to detect and cluster them with classical sequence-based search tools. Here, we introduce Key Aminoacid Pattern-based Protein Analyzer (KAPPA), a new multi-platform program to detect them in a given set of proteins, analyze their pattern and cluster them by comparison to reference patterns (ab initio search) or internal pairwise comparison (de novo search).

Results: In this study, we use the concrete example of cysteine-rich proteins (CRPs) to show that the similarity of two cysteine patterns can be precisely and efficiently assessed by a quantitative tool created for KAPPA: the κ-score. We also demonstrate the clear advantage of KAPPA over other classical sequence search tools for ab initio search of new CRPs. Finally, we present de novo clustering and subclustering functionalities that rapidly generate consistent groups of CRPs without a seed reference.

Availability and implementation: KAPPA executables are available for Linux, Windows and Mac OS at http://kappa-sequence-search.sourceforge.net.

Contact: dp.matton@umontreal.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Omics Pipe: a community-based framework for reproducible multi-omics data analysis

Motivation: Omics Pipe (http://sulab.scripps.edu/omicspipe) is a computational framework that automates multi-omics data analysis pipelines on high performance compute clusters and in the cloud. It supports best practice published pipelines for RNA-seq, miRNA-seq, Exome-seq, Whole-Genome sequencing, ChIP-seq analyses and automatic processing of data from The Cancer Genome Atlas (TCGA). Omics Pipe provides researchers with a tool for reproducible, open source and extensible next generation sequencing analysis. The goal of Omics Pipe is to democratize next-generation sequencing analysis by dramatically increasing the accessibility and reproducibility of best practice computational pipelines, which will enable researchers to generate biologically meaningful and interpretable results.

Results: Using Omics Pipe, we analyzed 100 TCGA breast invasive carcinoma paired tumor-normal datasets based on the latest UCSC hg19 RefSeq annotation. Omics Pipe automatically downloaded and processed the desired TCGA samples on a high throughput compute cluster to produce a results report for each sample. We aggregated the individual sample results and compared them to the analysis in the original publications. This comparison revealed high overlap between the analyses, as well as novel findings due to the use of updated annotations and methods.

Availability and implementation: Source code for Omics Pipe is freely available on the web (https://bitbucket.org/sulab/omics_pipe). Omics Pipe is distributed as a standalone Python package for installation (https://pypi.python.org/pypi/omics_pipe) and as an Amazon Machine Image in Amazon Web Services Elastic Compute Cloud that contains all necessary third-party software dependencies and databases (https://pythonhosted.org/omics_pipe/AWS_installation.html).

Contact: asu@scripps.edu or kfisch@ucsd.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

bbcontacts: prediction of β-strand pairing from direct coupling patterns

Motivation: It has recently become possible to build reliable de novo models of proteins if a multiple sequence alignment (MSA) of at least 1000 homologous sequences can be built. Methods of global statistical network analysis can explain the observed correlations between columns in the MSA by a small set of directly coupled pairs of columns. Strong couplings are indicative of residue-residue contacts, and from the predicted contacts a structure can be computed. Here, we exploit the structural regularity of paired β-strands that leads to characteristic patterns in the noisy matrices of couplings. The β–β contacts should be detected more reliably than single contacts, reducing the required number of sequences in the MSAs.

Results: bbcontacts predicts β–β contacts by detecting these characteristic patterns in the 2D map of coupling scores using two hidden Markov models (HMMs), one for parallel and one for antiparallel contacts. β-bulges are modelled as indel states. In contrast to existing methods, bbcontacts uses predicted instead of true secondary structure. On a standard set of 916 test proteins, 34% of which have MSAs with < 1000 sequences, bbcontacts achieves 50% precision for contacting β–β residue pairs at 50% recall using predicted secondary structure and 64% precision at 64% recall using true secondary structure, while existing tools achieve around 45% precision at 45% recall using true secondary structure.

Availability and implementation: bbcontacts is open source software (GNU Affero GPL v3) available at https://bitbucket.org/soedinglab/bbcontacts

Contact: jessica.andreani@mines.org or soeding@mpibpc.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS

Computational identification of MoRFs in protein sequences

Motivation: Intrinsically disordered regions of proteins play an essential role in the regulation of various biological processes. Key to their regulatory function is the binding of molecular recognition features (MoRFs) to globular protein domains in a process known as a disorder-to-order transition. Predicting the location of MoRFs in protein sequences with high accuracy remains an important computational challenge.

Method: In this study, we introduce MoRFCHiBi, a new computational approach for fast and accurate prediction of MoRFs in protein sequences. MoRFCHiBi combines the outcomes of two support vector machine (SVM) models that take advantage of two different kernels with high noise tolerance. The first, SVMS, is designed to extract maximal information from the general contrast in amino acid compositions between MoRFs, their surrounding regions (Flanks), and the remainders of the sequences. The second, SVMT, is used to identify similarities between regions in a query sequence and MoRFs of the training set.

Results: We evaluated the performance of our predictor by comparing its results with those of two currently available MoRF predictors, MoRFpred and ANCHOR. Using three test sets that have previously been collected and used to evaluate MoRFpred and ANCHOR, we demonstrate that MoRFCHiBi outperforms the other predictors with respect to different evaluation metrics. In addition, MoRFCHiBi is downloadable and fast, which makes it useful as a component in other computational prediction tools.

Availability and implementation: http://www.chibi.ubc.ca/morf/.

Contact: gsponer@chibi.ubc.ca.

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS

ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways

Motivation: Although gene-expression signature-based biomarkers are often developed for clinical diagnosis, many promising signatures fail to replicate during validation. One major challenge is that biological samples used to generate and validate the signature are often from heterogeneous biological contexts—controlled or in vitro samples may be used to generate the signature, but patient samples may be used for validation. In addition, systematic technical biases from multiple genome-profiling platforms often mask true biological variation. Addressing such challenges will enable us to better elucidate disease mechanisms and provide improved guidance for personalized therapeutics.

Results: Here, we present a pathway profiling toolkit, Adaptive Signature Selection and InteGratioN (ASSIGN), which enables robust and context-specific pathway analyses by efficiently capturing pathway activity in heterogeneous sets of samples and across profiling technologies. The ASSIGN framework is based on a flexible Bayesian factor analysis approach that allows for simultaneous profiling of multiple correlated pathways and for the adaptation of pathway signatures to specific diseases. We demonstrate the robustness and versatility of ASSIGN in estimating pathway activity in simulated data, in cell lines with perturbed pathways and in primary tissue samples, including The Cancer Genome Atlas breast carcinoma samples and liver samples exposed to genotoxic carcinogens.

Availability and implementation: Software for our approach is available for download at: http://www.bioconductor.org/packages/release/bioc/html/ASSIGN.html and https://github.com/wevanjohnson/ASSIGN.

Contact: andreab@genetics.utah.edu or wej@bu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

Bayesian feature selection for high-dimensional linear regression via the Ising approximation with applications to genomics

Motivation: Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets.

Results: Here, we introduce a new approach—the Bayesian Ising Approximation (BIA)—to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model with weak couplings. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30 000 features. These results also highlight the impact of correlations between features on Bayesian feature selection.

Availability and implementation: An implementation of the BIA in C++, along with data for reproducing our gene expression analyses, is freely available at http://physics.bu.edu/~pankajm/BIACode.

Contact: charleskennethfisher@gmail.com or ckfisher@bu.edu or pankajm@bu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

Gaussian process test for high-throughput sequencing time series: application to experimental evolution

Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but also monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analyzing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth.

Results: We present the beta-binomial Gaussian process model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as the proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from a Drosophila experimental evolution experiment on temperature adaptation.
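
As a rough illustration of the beta-binomial component only (not the authors' R implementation, and without the Gaussian process part), the sketch below computes the log-probability of observing k alternative-allele reads out of n total reads. The function and parameter names (`beta_binomial_logpmf`, `alpha`, `beta`) are assumptions chosen for this example.

```python
from math import lgamma, exp


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log-probability of k successes in n trials under a beta-binomial model.

    k     -- reads supporting the alternative allele (illustrative meaning)
    n     -- sequencing depth at the site
    alpha -- first shape parameter of the Beta prior on the true proportion
    beta  -- second shape parameter of the Beta prior
    """
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


if __name__ == "__main__":
    # Compare a shallow and a deep site with the same observed proportion (30%).
    for depth in (10, 100):
        k = int(0.3 * depth)
        print(depth, exp(beta_binomial_logpmf(k, depth, 3.0, 7.0)))
```

Compared with a plain binomial, the extra Beta layer lets the variance grow beyond what sequencing depth alone would explain, which is one common reason for choosing this distribution for read-count proportions.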

Availability and implementation: R software implementing the test is available at https://github.com/handetopa/BBGP.

Contact: hande.topa@aalto.fi, agnes.jonas@vetmeduni.ac.at, carolin.kosiol@vetmeduni.ac.at, antti.honkela@hiit.fi

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENETICS AND POPULATION ANALYSIS

Context-specific metabolic network reconstruction of a naphthalene-degrading bacterial community guided by metaproteomic data

Motivation: With the advent of meta-‘omics’ data, the use of metabolic networks for the functional analysis of microbial communities became possible. However, while network-based methods are widely developed for single organisms, their application to bacterial communities is currently limited.

Results: Herein, we provide a novel, context-specific reconstruction procedure based on metaproteomic and taxonomic data. Without previous knowledge of high-quality, genome-scale metabolic networks for each member of a bacterial community, we propose a meta-network approach, where the expression levels and taxonomic assignments of proteins are used as the most relevant clues for inferring an active set of reactions. Our approach was applied to draft the context-specific metabolic networks of two different naphthalene-enriched communities derived from an anthropogenically influenced, polyaromatic hydrocarbon-contaminated soil, with (CN2) or without (CN1) bio-stimulation. We were able to capture the overall functional differences between the two conditions at the metabolic level and predict an important activity for the fluorobenzoate degradation pathway in CN1 and for geraniol metabolism in CN2. Experimental validation was conducted, and good agreement with our computational predictions was observed. We also hypothesize different pathway organizations at the organismal level, which is relevant for disentangling the role of each member in the communities. The approach presented here can be easily transferred to the analysis of genomic, transcriptomic and metabolomic data.

Contact: fplanes@ceit.es or mferrer@icp.csic.es

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

The assembly of miRNA-mRNA-protein regulatory networks using high-throughput expression data

Motivation: Inference of gene regulatory networks from high throughput measurement of gene and protein expression is particularly attractive because it allows the simultaneous discovery of interactive molecular signals for numerous genes and proteins at a relatively low cost.

Results: We developed two score-based local causal learning algorithms that use the Markov blanket search to identify direct regulators of target mRNAs and proteins. These two algorithms were specifically designed for integrated high-throughput RNA and protein data. A simulation study showed that these algorithms outperformed other state-of-the-art gene regulatory network learning algorithms. We also generated integrated miRNA, mRNA and protein expression data based on high-throughput analysis of primary trophoblasts, derived from term human placenta and cultured under standard or hypoxic conditions. We applied the new algorithms to these data and identified gene regulatory networks for a set of trophoblastic proteins found to be differentially expressed under the specified culture conditions.

Contact: ysadovsky@mwri.magee.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

Similarity-based prediction for Anatomical Therapeutic Chemical classification of drugs by integrating multiple data sources

Motivation: The Anatomical Therapeutic Chemical (ATC) classification system, applied in almost all drug utilization studies, is currently the most widely recognized classification system for drugs. New drug entries are added to the system only on users’ requests, which leaves its drug coverage seriously incomplete; bioinformatics prediction can help during this process.

Results: Here we propose a novel prediction model of drug-ATC code associations, using logistic regression to integrate multiple heterogeneous data sources including chemical structures, target proteins, gene expression, side-effects and chemical–chemical associations. The model performs well in predicting not only ATC codes of unclassified drugs but also new ATC codes of classified drugs, as assessed by cross-validation and independent test sets, and it outperforms previous methods. To facilitate its use, the model has been developed into a user-friendly web service, SPACE (Similarity-based Predictor of ATC CodE), which, for each submitted compound, returns candidate ATC codes (ranked in decreasing order of the probability_score predicted by the model) together with the corresponding supporting evidence. This work not only contributes to knowledge of drugs’ therapeutic, pharmacological and chemical properties, but also provides clues for drug repositioning and side-effect discovery. In addition, the construction of the prediction model provides a general framework for similarity-based data integration that is suitable for other drug-related studies, such as target and side-effect prediction.

Availability and implementation: The web service SPACE is available at http://www.bprc.ac.cn/space

Contact: hefc@nic.bmi.ac.cn or lidong.bprc@foxmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

Plant photosynthesis phenomics data quality control

Motivation: Plant phenomics, the collection of large-scale plant phenotype data, is growing exponentially. The resulting resources have become an essential component of modern plant science. Such complex datasets are critical for understanding the mechanisms governing energy intake and storage in plants, which is essential for improving crop productivity. However, a major issue facing these efforts is determining the quality of phenotypic data. Automated methods are needed to identify and characterize alterations caused by system errors, which are difficult to remove in the data collection step, and to distinguish them from more interesting cases of altered biological responses.

Results: As a step towards solving this problem, we have developed a coarse-to-refined model called dynamic filter to identify abnormalities in plant photosynthesis phenotype data by comparing light responses of photosynthesis using a simplified kinetic model of photosynthesis. Dynamic filter employs an expectation-maximization process to adjust the kinetic model in coarse and refined regions to identify both abnormalities and biological outliers. The experimental results show that our algorithm can effectively identify most of the abnormalities in both real and synthetic datasets.

Availability and implementation: Software available at www.msu.edu/%7Ejinchen/DynamicFilter

Contact: jinchen@msu.edu or kramerd8@cns.msu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • DATA AND TEXT MINING

Prediction of potential disease-associated microRNAs based on random walk

Motivation: Identifying microRNAs associated with diseases (disease miRNAs) is helpful for exploring the pathogenesis of diseases. Because miRNAs fulfill their functions via the regulation of their target genes and because the current number of experimentally validated targets is insufficient, some existing methods have inferred potential disease miRNAs based on predicted targets. It is difficult for these methods to achieve excellent performance because of the high false-positive and false-negative rates of target prediction. Alternatively, several methods have constructed a network composed of miRNAs based on their associated diseases and have exploited the information within the network to predict disease miRNAs. However, these methods have failed to take into account the prior information regarding the network nodes and the respective local topological structures of the different categories of nodes. Therefore, it is essential to develop a method that exploits this additional information to predict reliable disease miRNA candidates.

Results: miRNAs with similar functions are normally associated with similar diseases and vice versa. Therefore, the functional similarity between a pair of miRNAs is calculated based on their associated diseases to construct a miRNA network. We present a new prediction method based on random walk on the network. For the diseases with some known related miRNAs, the network nodes are divided into labeled nodes and unlabeled nodes, and the transition matrices are established for the two categories of nodes. Furthermore, different categories of nodes have different transition weights. In this way, the prior information of nodes can be completely exploited. Simultaneously, the various ranges of topologies around the different categories of nodes are integrated. In addition, how far the walker can go away from the labeled nodes is controlled by restarting the walking. This is helpful for relieving the negative effect of noisy data. For the diseases without any known related miRNAs, we extend the walking on a miRNA-disease bilayer network. During the prediction process, the similarity between diseases, the similarity between miRNAs, the known miRNA-disease associations and the topology information of the bilayer network are exploited. Moreover, the importance of information from different layers of network is considered. Our method achieves superior performance for 18 human diseases with AUC values ranging from 0.786 to 0.945. Moreover, case studies on breast neoplasms, lung neoplasms, prostatic neoplasms and 32 diseases further confirm the ability of our method to discover potential disease miRNAs.
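
The restart mechanism mentioned above can be sketched with a generic random walk with restart on a small network. This is a simplified, textbook-style illustration, not the authors' method: it uses a single column-normalised transition matrix rather than separate matrices and weights for labeled and unlabeled nodes, and the names `random_walk_with_restart`, `adj`, `seeds` and `restart` are assumptions made for this example.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for each node.

    adj     -- symmetric adjacency (or similarity) matrix as a list of lists
    seeds   -- set of node indices treated as known (labeled) nodes
    restart -- probability of jumping back to the seed nodes at each step
    """
    n = len(adj)
    # Column-normalise so each column sums to 1 (a column of zeros stays zero).
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    # The walker starts (and restarts) uniformly on the seed nodes.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    # Toy path network 0 - 1 - 2 - 3 with node 0 as the only labeled node.
    path = [[0, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 1, 0, 1],
            [0, 0, 1, 0]]
    print(random_walk_with_restart(path, {0}))
```

A larger `restart` value keeps the walker closer to the labeled nodes, which is the lever the abstract describes for limiting the influence of noisy, distant parts of the network.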

Availability and implementation: A web service for the prediction and analysis of disease miRNAs is available at http://bioinfolab.stx.hk/midp/.

Contact: guoyahong_hlju@163.com or lixia@hrbmu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • DATA AND TEXT MINING

Factor graph analysis of live cell-imaging data reveals mechanisms of cell fate decisions

Motivation: Cell fate decisions have a strong stochastic component. The identification of the underlying mechanisms therefore requires a rigorous statistical analysis of large ensembles of single cells that were tracked and phenotyped over time.

Results: We introduce a probabilistic framework for testing elementary hypotheses on dynamic cell behavior using time-lapse cell-imaging data. Factor graphs, a class of probabilistic graphical models, are used to properly account for cell lineage and cell phenotype information. Our model is applied to time-lapse movies of murine granulocyte-macrophage progenitor (GMP) cells. It decides between competing hypotheses on the mechanisms of their differentiation. Our results theoretically substantiate previous experimental observations that lineage instruction, not selection, is the cause of the differentiation of GMP cells into mature monocytes or neutrophil granulocytes.

Availability and implementation: The Matlab source code is available at http://treschgroup.de/Genealogies.html

Contact: failmezger@mpipz.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • BIOIMAGE INFORMATICS

RAMPART: a workflow management system for de novo genome assembly

Motivation: The de novo assembly of genomes from whole-genome shotgun sequence data is a computationally intensive, multi-stage task, and it is not known a priori which methods and parameter settings will produce optimal results. In current de novo assembly projects, a popular strategy involves trying many approaches, using different tools and settings, and then comparing and contrasting the results in order to select a final assembly for publication.

Results: Herein, we present RAMPART, a configurable workflow management system for de novo genome assembly, which helps the user identify combinations of third-party tools and settings that provide good results for their particular genome and sequenced reads. RAMPART is designed to exploit high-performance computing environments, such as clusters and shared memory systems, where available.

Availability and implementation: RAMPART is available under the GPLv3 license at: https://github.com/TGAC/RAMPART.

Contact: daniel.mapleson@tgac.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online. In addition, the user manual is available online at: http://rampart.readthedocs.org/en/latest.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

Transposome: a toolkit for annotation of transposable element families from unassembled sequence reads

Motivation: Transposable elements (TEs) can be found in virtually all eukaryotic genomes and have the potential to produce evolutionary novelty. Despite the broad taxonomic distribution of TEs, the evolutionary history of these sequences is largely unknown for many taxa due to a lack of genomic resources and identification methods. Given that most TE annotation methods are designed to work on genome assemblies, we sought to develop a method to provide a fine-grained classification of TEs from DNA sequence reads. Here, we present a toolkit for the efficient annotation of TE families from low-coverage whole-genome shotgun (WGS) data, enabling the rapid identification of TEs in a large number of taxa. We compared our software, Transposome, with other approaches for annotating repeats from WGS data, and we show that it offers significant improvements in run time and produces more precise estimates of genomic repeat abundance. Transposome may also be used as a general toolkit for working with Next Generation Sequencing (NGS) data, and for constructing custom genome analysis pipelines.

Availability and implementation: The source code for Transposome is freely available (http://sestaton.github.io/Transposome), implemented in Perl and is supported on Linux.

Contact: statonse@biodiversity.ubc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

MetAmp: combining amplicon data from multiple markers for OTU analysis

Motivation: We present a novel method and corresponding application, MetAmp, to combine amplicon data from multiple genomic markers into Operational Taxonomic Units (OTUs) for microbial community analysis, calibrating the markers using data from known microbial genomes. When amplicons for multiple markers such as the 16S rRNA gene hypervariable regions are available, MetAmp improves the accuracy of OTU-based methods for characterizing bacterial composition and community structure. MetAmp works best with at least three markers, and is applicable to non-bacterial analyses and to non 16S markers. Our application and testing have been limited to 16S analysis of microbial communities.

Results: We clustered standard test sequences derived from the Human Microbiome Mock Community test sets and compared MetAmp and other tools with respect to their ability to recover OTUs for these benchmark bacterial communities. MetAmp compared favorably to QIIME, UPARSE and Mothur using amplicons from one, two, and three markers.

Availability and implementation: MetAmp is available at http://izhbannikov.github.io/MetAmp/

Contact: ilyaz@uidaho.edu, foster@uidaho.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

SFA-SPA: a suffix array based short peptide assembler for metagenomic data

Summary: The determination of protein sequences from a metagenomic dataset enables the study of metabolism and functional roles of the organisms that are present in the sampled microbial community. We previously introduced an algorithm and software for the accurate reconstruction of protein sequences from short peptides identified on nucleotide reads in a metagenomic dataset. Here, we present significant computational improvements to the short peptide assembly algorithm that make it practical to reconstruct proteins from large metagenomic datasets containing several hundred million reads, while maintaining accuracy. The improved computational efficiency is achieved using a suffix array data structure that allows for fast querying during the assembly process, and a significant redesign of assembly steps that enables multi-threaded execution.
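
To make the suffix array idea concrete, here is a minimal sketch of building a suffix array over a peptide string and querying it by binary search. It is a naive illustration, not the SFA-SPA implementation: construction by sorting whole suffixes is far too slow for datasets of the size described above, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction: sorting suffix slices is fine for short strings
    but not for large read sets.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions where pattern occurs in text."""
    m = len(pattern)
    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    peptides = "MKVLAAGMKVLT"  # made-up sequence for demonstration
    sa = build_suffix_array(peptides)
    print(find_occurrences(peptides, sa, "MKVL"))
```

Because all suffixes sharing a prefix sit next to each other in the array, each query needs only two binary searches, which is what makes repeated lookups during assembly cheap.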

Availability and implementation: The program is available under the GPLv3 license from sourceforge.net/projects/spa-assembler.

Contact: syooseph@jcvi.org

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Learning HMMs for nucleotide sequences from amino acid alignments

Profile hidden Markov models (profile HMMs) are known to efficiently predict whether an amino acid (AA) sequence belongs to a specific protein family. Profile HMMs can also be used to search for protein domains in genome sequences. In this case, HMMs are typically learned from AA sequences and then used to search on the six-frame translation of nucleotide (NT) sequences. However, this approach demands additional processing of the original data and search results. Here, we propose an alternative and more direct method which converts an AA alignment into an NT one, after which an NT-based HMM is trained to be applied directly on a genome.
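
For readers unfamiliar with the conventional step that this work aims to avoid, the sketch below shows a six-frame translation of a nucleotide sequence using the standard genetic code. It illustrates only that preprocessing step, not the proposed AA-to-NT alignment conversion; the function names are assumptions for this example, and any codon containing a non-ACGT character is rendered as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Each genome region therefore yields six candidate protein strings to search, and hits must be mapped back to nucleotide coordinates afterwards, which is the additional processing the abstract refers to.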

Contact: carlos@rc.unesp.br

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

MethylMix: an R package for identifying DNA methylation-driven genes

Summary: DNA methylation is an important mechanism regulating gene transcription, and its role in carcinogenesis has been extensively studied. Hyper- and hypomethylation of genes is an alternative mechanism to deregulate gene expression in a wide range of diseases. At the same time, high-throughput DNA methylation assays have been developed, generating vast amounts of genome-wide DNA methylation measurements. Yet, few tools exist that can formally identify hypo- and hypermethylated genes that are predictive of transcription and thus functionally relevant for a particular disease. To address this lack of tools, we developed MethylMix, an algorithm implemented in R to identify disease-specific hyper- and hypomethylated genes. MethylMix is based on a beta mixture model to identify methylation states and compares them with the normal DNA methylation state. MethylMix introduces a novel metric, the ‘Differential Methylation value’ or DM-value, defined as the difference between a methylation state and the normal methylation state. Finally, matched gene expression data are used to identify methylation states that are not only differential but also transcriptionally predictive, by focusing on methylation changes that affect gene expression.

Availability and implementation: MethylMix was implemented as an R package and is available in Bioconductor.

Contact: olivier.gevaert@stanford.edu

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

dslice: an R package for nonparametric testing of associations with application in QTL and gene set analysis

Summary: Many statistical problems in bioinformatics and genetics can be formulated as the testing of associations between a categorical variable and a continuous variable. A dynamic slicing method was proposed for non-parametric dependence testing and has been demonstrated to have higher power than traditional methods such as the Kolmogorov–Smirnov test. We introduce an R package, dslice, to facilitate the use of the dynamic slicing method in bioinformatic applications such as quantitative trait loci studies and gene set enrichment analysis.
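
For context, the traditional baseline named above can be sketched as follows. This is the two-sample Kolmogorov–Smirnov statistic for a binary categorical variable, not the dynamic slicing method. The function names are hypothetical and unrelated to the dslice package.

```python
from bisect import bisect_right


def split_by_label(values, labels):
    """Group continuous values by their categorical label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups


def ks_statistic(x, y):
    """Largest absolute gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # fraction of x values <= v
        fy = bisect_right(ys, v) / len(ys)  # fraction of y values <= v
        d = max(d, abs(fx - fy))
    return d


if __name__ == "__main__":
    expression = [1.2, 0.8, 1.1, 2.9, 3.4, 3.1]
    genotype = ["AA", "AA", "AA", "BB", "BB", "BB"]
    groups = split_by_label(expression, genotype)
    print(ks_statistic(groups["AA"], groups["BB"]))
```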

Availability and implementation: dslice is implemented in Rcpp and available in the Comprehensive R Archive Network. The package is distributed under the GNU General Public License (version 2 or later).

Contact: zhangxg@tsinghua.edu.cn or jliu@stat.harvard.edu.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

3USS: a web server for detecting alternative 3'UTRs from RNA-seq experiments

Summary: Protein-coding genes with multiple alternative polyadenylation sites can generate mRNA 3'UTR sequences of different lengths, thereby causing the loss or gain of regulatory elements, which can affect stability, localization and translation efficiency. 3USS is a web server that lets experimentalists automatically identify alternative 3'UTRs (shorter or longer than those in a reference transcriptome), an option that is not available in standard RNA-seq data analysis procedures. The tool reports 3'UTRs that are not annotated in available databases as putative novel 3'UTRs. Furthermore, if data from two related samples are uploaded, the server identifies and reports the alternative 3'UTRs that are common to both samples and those that are specific to each.

Availability and implementation: 3USS is freely available at http://www.biocomputing.it/3uss_server

Contact: anna.tramontano@uniroma1.it

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

GenoExp: a web tool for predicting gene expression levels from single nucleotide polymorphisms

Summary: Understanding the effect of single nucleotide polymorphisms (SNPs) on the expression level of genes is an important goal. We recently published a study in which we devised a multi-SNP predictive model for gene expression in lymphoblastoid cell lines (LCL), and showed that it can robustly predict the expression of a small number of genes in test individuals. Here, we validate the generality of our models by predicting expression profiles for genes in LCL in an independent study, and extend the pool of predictable genes, for which we are able to explain more than 25% of the expression variability, to 232 genes across 14 different cell types. As the number of people who have obtained their SNP profiles through companies such as 23andMe is rising rapidly, we developed GenoExp, a web-based tool in which users can upload their individual SNP data and obtain predicted expression levels for the set of predictable genes across the 14 different cell types. Our tool thus allows users with biological knowledge to study the possible effects that their set of SNPs might have on these genes and to predict their cell-specific expression levels relative to the population average.
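
The notion of a predictable gene can be illustrated with a short sketch. It assumes that explained variability is measured by the coefficient of determination between observed and predicted expression. The abstract does not state the exact metric, so this is an assumption, and all names and numbers below are hypothetical.

```python
def r_squared(observed, predicted):
    """Fraction of variance in observed values explained by predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - ss_res / ss_tot


def predictable_genes(results, threshold=0.25):
    """Keep genes whose explained variability exceeds the threshold.

    `results` maps a gene name to a pair (observed, predicted) of
    equal-length lists of expression values.
    """
    return [
        gene
        for gene, (observed, predicted) in results.items()
        if r_squared(observed, predicted) > threshold
    ]


if __name__ == "__main__":
    toy = {
        "GENE_A": ([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]),  # well predicted
        "GENE_B": ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # no better than the mean
    }
    print(predictable_genes(toy))
```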

Availability and implementation: GenoExp is freely available at http://genie.weizmann.ac.il/pubs/GenoExp/.

Contact: eran.segal@weizmann.ac.il

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

ClassifyR: an R package for performance assessment of classification with applications to transcriptomics

Although a large collection of classification software packages exists in R, a new generic framework for linking custom classification functions with classification performance measures is needed. A generic classification framework has been designed and implemented as an R package in an object-oriented style. Its design places emphasis on parallel processing, reproducibility and extensibility. Finally, a comprehensive set of performance measures is available to ease post-processing. Taken together, these characteristics enable rapid and reproducible benchmarking of alternative classifiers.

Availability and implementation: ClassifyR is implemented in R and can be obtained from the Bioconductor project: http://bioconductor.org/packages/release/bioc/html/ClassifyR.html

Contact: dario.strbenac@sydney.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

Expitope: a web server for epitope expression

Motivation: Adoptive T cell therapies based on the introduction of new T cell receptors (TCRs) into patient recipient T cells are a promising new treatment for various kinds of cancers. A major challenge, however, is the choice of target antigens. If an engineered TCR can cross-react with self-antigens in healthy tissue, the side effects can be devastating. We present the first web server for assessing epitope sharing when designing new potential lead targets. We enable users to find all known proteins containing their peptide of interest. The web server returns not only exact matches but also approximate ones, allowing a number of mismatches of the user's choice. For the identified candidate proteins, the expression values in various healthy tissues, representing all vital human organs, are extracted from RNA sequencing (RNA-Seq) data, as are those from some cancer tissues as a control. All results are returned to the user sorted by a score, which is calculated using well-established methods and tools for immunological predictions. The score depends on the probability that the epitope is created by proteasomal cleavage and on its affinities to the transporter associated with antigen processing and to the major histocompatibility complex class I alleles. With this framework, we hope to provide a helpful tool to exclude potential cross-reactivity in the early stage of TCR selection for use in the design of adoptive T cell immunotherapy.
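
The approximate matching described above can be sketched as a sliding-window search with a mismatch budget. This is a minimal illustration that assumes substitutions only (no insertions or deletions). It is not the Expitope implementation, and the function name and example sequences are hypothetical.

```python
def find_approximate_matches(peptide, protein, max_mismatches):
    """Return (start, window, mismatches) for every window of the protein
    that differs from the peptide in at most `max_mismatches` positions."""
    hits = []
    k = len(peptide)
    for start in range(len(protein) - k + 1):
        window = protein[start:start + k]
        mismatches = sum(a != b for a, b in zip(peptide, window))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


if __name__ == "__main__":
    peptide = "SIINFEKL"
    protein = "MASIINFEKLGGSIINYEKL"  # toy sequence, not a real protein
    for start, window, mismatches in find_approximate_matches(peptide, protein, 1):
        print(start, window, mismatches)
```

Setting `max_mismatches` to 0 reduces the search to exact matching. Larger budgets return more candidate proteins, which would then be ranked by the expression and immunological scores the abstract describes.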

Availability and implementation: The Expitope web server can be accessed via http://webclu.bio.wzw.tum.de/expitope.

Contact: d.frishman@wzw.tum.de

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION