MADGiC: a model-based approach for identifying driver genes in cancer

Motivation: Identifying and prioritizing somatic mutations is an important and challenging area of cancer research that can provide new insights into gene function as well as new targets for drug development. Most methods for prioritizing mutations rely primarily on frequency-based criteria, where a gene is identified as having a driver mutation if it is altered in significantly more samples than expected according to a background model. Although useful, frequency-based methods are limited in that all mutations are treated equally. It is well known, however, that some mutations have no functional consequence, while others may have a major deleterious impact. The spatial pattern of mutations within a gene provides further insight into their functional consequence. Properly accounting for these factors improves both the power and accuracy of inference. Also important is an accurate background model.

Results: Here, we develop a Model-based Approach for identifying Driver Genes in Cancer (termed MADGiC) that incorporates both frequency and functional impact criteria and accommodates a number of factors to improve the background model. Simulation studies demonstrate advantages of the approach, including a substantial increase in power over competing methods. Further advantages are illustrated in an analysis of ovarian and lung cancer data from The Cancer Genome Atlas (TCGA) project.
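
As an illustration of the model-based idea only (not MADGiC's actual model, which also incorporates functional impact, the spatial pattern of mutations and a richer background model), the toy sketch below contrasts a Poisson background mutation rate with a hypothetical elevated "driver" rate and returns the posterior probability that a gene is a driver; all rates and priors are illustrative assumptions.

    # Toy two-component sketch of a model-based driver call (illustrative only).
    from scipy.stats import poisson

    def driver_posterior(n_mutations, expected_background,
                         fold_if_driver=10.0, prior_driver=0.05):
        """Posterior probability that a gene is a driver, given its observed
        mutation count and the count expected under the background model.
        The 10x fold-change and the prior are placeholder assumptions."""
        lik_passenger = poisson.pmf(n_mutations, expected_background)
        lik_driver = poisson.pmf(n_mutations, expected_background * fold_if_driver)
        numerator = prior_driver * lik_driver
        return numerator / (numerator + (1.0 - prior_driver) * lik_passenger)

    # Example: 8 observed mutations where the background model expects 1.5.
    print(round(driver_posterior(8, 1.5), 3))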

Availability and implementation: R code to implement this method is available at http://www.biostat.wisc.edu/~kendzior/MADGiC/.

Contact: kendzior@biostat.wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

An integrative approach to predicting the functional effects of non-coding and coding sequence variation

Motivation: Technological advances have enabled the identification of an increasingly large spectrum of single nucleotide variants within the human genome, many of which may be associated with monogenic disease or complex traits. Here, we propose an integrative approach, named FATHMM-MKL, to predict the functional consequences of both coding and non-coding sequence variants. Our method utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source.

Results: We show that our method outperforms current state-of-the-art algorithms, CADD and GWAVA, when predicting the functional consequences of non-coding variants. In addition, FATHMM-MKL is comparable to the best of these algorithms when predicting the impact of coding variants. The method includes a confidence measure to rank order predictions.

Availability and implementation: The FATHMM-MKL webserver is available at: http://fathmm.biocompute.org.uk

Contact: H.Shihab@bristol.ac.uk or Mark.Rogers@bristol.ac.uk or C.Campbell@bristol.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

PANNZER: high-throughput functional annotation of uncharacterized proteins in an error-prone environment

Motivation: The last decade has seen remarkable growth in protein databases. This growth comes at a price: a growing number of submitted protein sequences lack functional annotation. Approximately 32% of sequences submitted to UniProtKB, the most comprehensive protein database, are labelled as ‘Unknown protein’ or similar. The functionally annotated entries are also reported to contain 30–40% errors. Here, we introduce a high-throughput tool for more reliable functional annotation called Protein ANNotation with Z-score (PANNZER). PANNZER predicts Gene Ontology (GO) classes and free-text descriptions of protein functionality. PANNZER uses weighted k-nearest-neighbour methods with statistical testing to maximize the reliability of a functional annotation.
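
As a rough sketch of the weighted k-nearest-neighbour idea only (PANNZER's z-score based significance testing and description-line prediction are not shown, and the data structures are assumptions), each sequence-similarity neighbour votes for its GO classes with a weight, and classes are ranked by their summed votes.

    # Minimal weighted k-NN GO scoring sketch (illustrative only).
    from collections import defaultdict

    def weighted_knn_go_scores(neighbors, go_annotations):
        """neighbors: list of (protein_id, weight) from a similarity search;
        go_annotations: dict protein_id -> set of GO class ids.
        Returns GO classes ranked by normalized, weight-summed votes."""
        scores = defaultdict(float)
        total_weight = sum(weight for _, weight in neighbors) or 1.0
        for protein_id, weight in neighbors:
            for go_class in go_annotations.get(protein_id, ()):
                scores[go_class] += weight
        return sorted(((go, s / total_weight) for go, s in scores.items()),
                      key=lambda item: -item[1])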

Results: In free-text description line prediction, our method outperformed all competing methods by a clear margin. In GO prediction, we show a clear improvement over our older method, which performed well in the CAFA 2011 challenge.

Availability and implementation: The PANNZER program was developed in the Python programming language (version 2.6). The stand-alone installation of PANNZER requires a MySQL database for data storage and the BLAST (BLASTALL v.2.2.21) tools for sequence similarity searches. The tutorial, evaluation test sets and results are available on the PANNZER web site. PANNZER is freely available at http://ekhidna.biocenter.helsinki.fi/pannzer.

Contact: patrik.koskinen@helsinki.fi

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping

Motivation: Calculating the edit-distance (i.e. minimum number of insertions, deletions and substitutions) between short DNA sequences is the primary task performed by seed-and-extend based mappers, which compare billions of sequences. In practice, only sequence pairs with a small edit-distance provide useful scientific data. However, the majority of sequence pairs analyzed by seed-and-extend based mappers differ by significantly more errors than what is typically allowed. Such error-abundant sequence pairs needlessly waste resources and severely hinder the performance of read mappers. Therefore, it is crucial to develop a fast and accurate filter that can rapidly and efficiently detect error-abundant string pairs and remove them from consideration before more computationally expensive methods are used.

Results: We present a simple and efficient algorithm, Shifted Hamming Distance (SHD), which accelerates the alignment verification procedure in read mapping by quickly filtering out error-abundant sequence pairs using bit-parallel and SIMD-parallel operations. SHD only filters out string pairs that contain more errors than a user-defined threshold, making it fully comprehensive. It also maintains high accuracy at moderate error thresholds (up to 5% of the string length) while achieving a 3-fold speedup over the best previous algorithm (Gene Myers’s bit-vector algorithm). SHD is compatible with all mappers that perform sequence alignment for verification.
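
A character-level sketch of the shifted-Hamming idea is given below; it omits the bit-parallel SIMD encoding and the mask amalgamation of the actual SHD implementation and only shows why ANDing mismatch masks across shifts yields a conservative filter.

    # Simplified Shifted Hamming Distance filter (character level, no SIMD).
    def hamming_mask(a, b):
        """1 where the two equal-length strings mismatch, 0 where they match."""
        return [0 if x == y else 1 for x, y in zip(a, b)]

    def shd_passes(read, ref, max_errors):
        """AND the mismatch masks of read vs. ref shifted by -max_errors..+max_errors;
        a position still set to 1 mismatches under every shift and so cannot be
        explained by up to max_errors indels. Reject pairs with too many such
        positions. read and ref are assumed to have equal length; positions not
        covered by a shifted comparison are treated as matches (lenient)."""
        n = len(read)
        combined = [1] * n
        for shift in range(-max_errors, max_errors + 1):
            if shift >= 0:
                mask = hamming_mask(read[shift:], ref[:n - shift]) + [0] * shift
            else:
                mask = [0] * (-shift) + hamming_mask(read[:n + shift], ref[-shift:])
            combined = [c & m for c, m in zip(combined, mask)]
        return sum(combined) <= max_errors  # True = forward to full alignment

    # Example: a pair with a single substitution passes a 2-error filter.
    print(shd_passes("ACGTACGT", "ACGAACGT", 2))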

Availability and implementation: We provide an implementation of SHD in C with Intel SSE instructions at: https://github.com/CMU-SAFARI/SHD.

Contact: hxin@cmu.edu, calkan@cs.bilkent.edu.tr or onur@cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets

Motivation: The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention as most early motif discovery methods lack the ability to handle large data of recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little regard to the accuracy of motif detection. Unlike such methods, our method focuses on increasing the detection accuracy while maintaining the computation efficiency at an acceptable level. The major advantage of our method is that it can mine diverse multiple motifs undetectable by current methods.

Results: The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC runs parallel interacting motif samplers. A repulsive force is generated when motifs produced by different samplers come near each other. Thus, different samplers explore different motifs. In this way, we can detect much more diverse motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover.

Availability and implementation: A C++ implementation of RPMCMC and discovered cofactor motifs for the 228 ENCODE ChIP-seq datasets are available from http://daweb.ism.ac.jp/yoshidalab/motif.

Contact: ikebata.hisaki@ism.ac.jp, yoshidar@ism.ac.jp

Supplementary information: Supplementary data are available from Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

KMC 2: fast and resource-frugal k-mer counting

Motivation: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.

Results: We present a novel method for k-mer counting that, on large datasets, is about twice as fast as the strongest competitors (Jellyfish 2, KMC 1) while using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers significantly reduces the I/O, and a highly parallel overall architecture achieves unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human read collection with 44-fold coverage (106 GB compressed) in about 20 min on a 6-core Intel i7 PC with a solid-state disk.
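
The sketch below shows only the minimizer-based partitioning that underlies such counters: k-mers sharing a minimizer are counted in the same bucket, so buckets can be processed (or spilled to disk) independently. KMC 2's signatures, (k, x)-mers and disk handling are not reproduced.

    # Toy, in-memory analogue of minimizer-partitioned k-mer counting.
    from collections import Counter, defaultdict

    def minimizer(kmer, m):
        """Lexicographically smallest m-mer of a k-mer; KMC 2's 'signatures' are a
        restricted subset of such minimizers (restriction omitted here)."""
        return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

    def count_kmers_in_buckets(reads, k, m):
        """Assign every k-mer to the bucket of its minimizer and count per bucket.
        In a disk-based counter each bucket would be a temporary file processed
        independently, which is what keeps the memory footprint moderate."""
        buckets = defaultdict(Counter)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                buckets[minimizer(kmer, m)][kmer] += 1
        return buckets

    print(count_kmers_in_buckets(["ACGTACGTT"], k=4, m=2))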

Availability and implementation: KMC 2 is freely available at http://sun.aei.polsl.pl/kmc.

Contact: sebastian.deorowicz@polsl.pl

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Ultrafast SNP analysis using the Burrows-Wheeler transform of short-read data

Motivation: Sequence-variation analysis is conventionally performed on mapping results that are highly redundant and occasionally contain undesirable heuristic biases. A straightforward approach to single-nucleotide polymorphism (SNP) analysis, using the Burrows–Wheeler transform (BWT) of short-read data, is proposed.

Results: The BWT makes it possible to process collections of read fragments of the same sequences simultaneously; accordingly, SNPs were found from the BWT much faster than from the mapping results. It took only a few minutes to find SNPs from the BWT (together with supplementary data, the fragment depth of coverage [FDC]) using a desktop workstation in the case of human exome or transcriptome sequencing data, and 20 min using a dual-CPU server in the case of human genome sequencing data. The SNPs found with the proposed method agreed almost entirely with those found by a time-consuming state-of-the-art tool, except in cases where the use of read fragments led to a loss of sensitivity or the sequencing depth was insufficient. These exceptions were predictable in advance on the basis of the minimum length for uniqueness (MLU) and the FDC defined on the reference genome. Moreover, the BWT and FDC were computed in less time than it took to obtain the mapping results, provided that the data were large enough.
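
For readers unfamiliar with the transform itself, a naive construction is sketched below; it shows why identical fragments end up adjacent after sorting, which is the property that lets a BWT-based method process repeated read fragments collectively (the paper's scalable construction and the FDC computation are not shown).

    # Naive Burrows-Wheeler transform of a small text (illustrative only).
    def bwt(text, terminator="$"):
        """Sort all rotations of text+terminator and return the last column.
        Identical substrings of the input sort next to each other, so repeated
        fragments can be handled as one group rather than one at a time."""
        text = text + terminator
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("GATTACA"))  # -> 'ACTGA$TA'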

Availability and implementation: A proof-of-concept binary code for a Linux platform is available on request to the corresponding author.

Contact: kouichi.kimura.hh@hitachi.com

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations

Motivation: Identifying alterations in gene expression associated with different clinical states is important for the study of human biology. However, clinical samples used in gene expression studies are often derived from heterogeneous mixtures with variable cell-type composition, complicating statistical analysis. Considerable effort has been devoted to modeling sample heterogeneity, and presently, there are many methods that can estimate cell proportions or pure cell-type expression from mixture data. However, there is no method that comprehensively addresses mixture analysis in the context of differential expression without relying on additional proportion information, which can be inaccurate and is frequently unavailable.

Results: In this study, we consider a clinically relevant situation where neither accurate proportion estimates nor pure cell expression is of direct interest, but where we are rather interested in detecting and interpreting relevant differential expression in mixture samples. We develop a method, Cell-type COmputational Differential Estimation (CellCODE), that addresses the specific statistical question directly, without requiring a physical model for mixture components. Our approach is based on latent variable analysis and is computationally transparent; it requires no additional experimental data, yet outperforms existing methods that use independent proportion measurements. CellCODE has few parameters that are robust and easy to interpret. The method can be used to track changes in proportion, improve power to detect differential expression and assign the differentially expressed genes to the correct cell type.
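
The sketch below conveys only the latent-variable intuition under simplified assumptions (a fixed set of marker genes per cell type and a genes x samples expression matrix); it is not CellCODE's algorithm, which additionally refines marker sets and guards against confounding with the group labels.

    # Surrogate proportion variables as first principal components of marker genes.
    import numpy as np

    def surrogate_proportion_variables(expression, marker_rows):
        """expression: genes x samples array; marker_rows: dict cell_type -> list of
        row indices of its marker genes (hypothetical inputs for illustration).
        Returns one surrogate variable (length = number of samples) per cell type."""
        surrogates = {}
        for cell_type, rows in marker_rows.items():
            sub = expression[rows, :].astype(float)
            sub -= sub.mean(axis=1, keepdims=True)      # center each marker gene
            _, _, vt = np.linalg.svd(sub, full_matrices=False)
            surrogates[cell_type] = vt[0]               # first right singular vector
        return surrogates

Such surrogates can then be included as covariates when testing for differential expression in the mixture samples.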

Availability and implementation: The CellCODE R package can be downloaded at http://www.pitt.edu/~mchikina/CellCODE/ or installed from the GitHub repository ‘mchikina/CellCODE’.

Contact: mchikina@pitt.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

Bias in microRNA functional enrichment analysis

Motivation: Many studies have investigated the differential expression of microRNAs (miRNAs) in disease states and between different treatments, tissues and developmental stages. Given a list of perturbed miRNAs, it is common to predict the shared pathways on which they act. The standard test for functional enrichment typically yields dozens of significantly enriched functional categories, many of which appear frequently in the analysis of apparently unrelated diseases and conditions.

Results: We show that the most commonly used functional enrichment test is inappropriate for the analysis of sets of genes targeted by miRNAs. The hypergeometric distribution used by the standard method consistently results in significant P-values for functional enrichment for targets of randomly selected miRNAs, reflecting an underlying bias in the predicted gene targets of miRNAs as a whole. We developed an algorithm to measure enrichment using an empirical sampling approach, and applied this in a reanalysis of the gene ontology classes of targets of miRNA lists from 44 published studies. The vast majority of the miRNA target sets were not significantly enriched in any functional category after correction for bias. We therefore argue against continued use of the standard functional enrichment method for miRNA targets.
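
A minimal resampling sketch of the empirical idea follows (the published Python script should be preferred; the data structures here are assumptions): random miRNA lists of the same size define the null distribution, so the global bias of predicted miRNA targets is absorbed into the test.

    # Empirical (resampling-based) enrichment test sketch.
    import random

    def empirical_enrichment_p(mirna_list, targets_of, category_genes, all_mirnas,
                               n_permutations=10000, seed=1):
        """mirna_list: perturbed miRNAs; targets_of: dict miRNA -> set of target genes;
        category_genes: set of genes annotated to the functional category under test;
        all_mirnas: list of all miRNAs eligible for resampling."""
        rng = random.Random(seed)

        def overlap(mirnas):
            targets = set().union(*(targets_of[m] for m in mirnas)) if mirnas else set()
            return len(targets & category_genes)

        observed = overlap(mirna_list)
        k = len(mirna_list)
        as_extreme = sum(overlap(rng.sample(all_mirnas, k)) >= observed
                         for _ in range(n_permutations))
        return (as_extreme + 1) / (n_permutations + 1)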

Availability and implementation: A Python script implementing the empirical algorithm is freely available at http://sgjlab.org/empirical-go/.

Contact: sam.griffiths-jones@manchester.ac.uk or janine.lamb@manchester.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

DDIG-in: detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels

Motivation: Frameshifting (FS) indels and nonsense (NS) variants disrupt the protein-coding sequence downstream of the mutation site by changing the reading frame or introducing a premature termination codon, respectively. Despite such drastic changes to the protein sequence, FS indels and NS variants have been discovered in healthy individuals. How to discriminate disease-causing from neutral FS indels and NS variants is an understudied problem.

Results: We have built a machine learning method called DDIG-in (FS) based on real human genetic variations from the Human Gene Mutation Database (inherited disease-causing) and the 1000 Genomes Project (GP) (putatively neutral). The method incorporates both sequence and predicted structural features and yields a robust performance by 10-fold cross-validation and independent tests on both FS indels and NS variants. We showed that human-derived NS variants and FS indels derived from animal orthologs can be effectively employed for independent testing of our method trained on human-derived FS indels. DDIG-in (FS) achieves a Matthews correlation coefficient (MCC) of 0.59, a sensitivity of 86%, and a specificity of 72% for FS indels. Application of DDIG-in (FS) to NS variants yields essentially the same performance (MCC of 0.43) as a method that was specifically trained for NS variants. DDIG-in (FS) was shown to make a significant improvement over existing techniques.

Availability and implementation: The DDIG-in web-server for predicting NS variants, FS indels, and non-frameshifting (NFS) indels is available at http://sparks-lab.org/ddig.

Contact: yaoqi.zhou@griffith.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENETICS AND POPULATION ANALYSIS

Selection of models for the analysis of risk-factor trees: leveraging biological knowledge to mine large sets of risk factors with application to microbiome data

Motivation: Establishing a statistical association between microbiome features and clinical outcomes is of growing interest because of the potential for yielding insights into biological mechanisms and pathogenesis. Extracting microbiome features that are relevant for a disease is challenging, and existing variable selection methods are limited by the large number of risk-factor variables derived from microbiome sequence data and by their complex biological structure.

Results: We propose a tree-based scanning method, Selection of Models for the Analysis of Risk factor Trees (referred to as SMART-scan), for identifying taxonomic groups that are associated with a disease or trait. SMART-scan is a model selection technique that uses a predefined taxonomy to organize the large pool of possible predictors into optimized groups, and it hierarchically searches and determines variable groups for association testing. We investigate the statistical properties of SMART-scan through simulations, in comparison to a regular single-variable analysis and three commonly used variable selection methods: stepwise regression, the least absolute shrinkage and selection operator (LASSO) and classification and regression tree (CART). When there are taxonomic group effects in the data, SMART-scan can significantly increase power by using bacterial taxonomic information to split large numbers of variables into groups. Through an application to microbiome data from a vervet monkey diet experiment, we demonstrate that SMART-scan can identify important phenotype-associated taxonomic features missed by single-variable analysis, stepwise regression, LASSO and CART.
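
The toy sketch below conveys only the taxonomy-guided grouping intuition under assumed inputs (leaf counts per sample and a parent map): counts are aggregated up the tree and every node is tested, whereas SMART-scan performs a proper model-selection scan over groupings.

    # Aggregate OTU counts up a taxonomy and test each node for association.
    import numpy as np
    from scipy.stats import pearsonr

    def taxonomy_scan(leaf_counts, parent_of, phenotype):
        """leaf_counts: dict leaf taxon -> per-sample count array;
        parent_of: dict taxon -> parent taxon (roots map to None);
        phenotype: per-sample values. Returns a P-value per taxon or node."""
        totals = {leaf: np.asarray(c, dtype=float) for leaf, c in leaf_counts.items()}
        for leaf, counts in leaf_counts.items():
            node = parent_of.get(leaf)
            while node is not None:                 # push counts up to every ancestor
                totals[node] = totals.get(node, 0.0) + np.asarray(counts, dtype=float)
                node = parent_of.get(node)
        phenotype = np.asarray(phenotype, dtype=float)
        return {node: pearsonr(values, phenotype)[1] for node, values in totals.items()}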

Availability and implementation: The SMART-scan approach is implemented in R and is available at https://dsgweb.wustl.edu/qunyuan/software/smartscan/

Contact: qunyuan@wustl.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

Combining tree-based and dynamical systems for the inference of gene regulatory networks

Motivation: Reconstructing the topology of gene regulatory networks (GRNs) from time series of gene expression data remains an important open problem in computational systems biology. Existing GRN inference algorithms face one of two limitations: model-free methods are scalable but suffer from a lack of interpretability and cannot in general be used for out of sample predictions. On the other hand, model-based methods focus on identifying a dynamical model of the system. These are clearly interpretable and can be used for predictions; however, they rely on strong assumptions and are typically very demanding computationally.

Results: Here, we propose a new hybrid approach for GRN inference, called Jump3, exploiting time series of expression data. Jump3 is based on a formal on/off model of gene expression but uses a non-parametric procedure based on decision trees (called ‘jump trees’) to reconstruct the GRN topology, allowing the inference of networks of hundreds of genes. We show the good performance of Jump3 on in silico and synthetic networks and applied the approach to identify regulatory interactions activated in the presence of interferon gamma.

Availability and implementation: Our MATLAB implementation of Jump3 is available at http://homepages.inf.ed.ac.uk/vhuynht/software.html.

Contact: vhuynht@inf.ed.ac.uk or G.Sanguinetti@ed.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

Deep profiling of multitube flow cytometry data

Motivation: Deep profiling the phenotypic landscape of tissues using high-throughput flow cytometry (FCM) can provide important new insights into the interplay of cells in both healthy and diseased tissue. But often, especially in clinical settings, the cytometer cannot measure all the desired markers in a single aliquot. In these cases, tissue is separated into independently analysed samples, leaving a need to electronically recombine these to increase dimensionality. Nearest-neighbour (NN) based imputation fulfils this need but can produce artificial subpopulations. Clustering-based NNs can reduce these, but requires prior domain knowledge to be able to parameterize the clustering, so is unsuited to discovery settings.

Results: We present flowBin, a parameterization-free method for combining multitube FCM data into a higher-dimensional form suitable for deep profiling and discovery. FlowBin allocates cells to bins defined by the common markers across tubes in a multitube experiment, then computes aggregate expression for each bin within each tube, to create a matrix of expression of all markers assayed in each tube. We show, using simulated multitube data, that flowType analysis of flowBin output reproduces the results of that same analysis on the original data for cell types of >10% abundance. We used flowBin in conjunction with classifiers to distinguish normal from cancerous cells. We used flowBin together with flowType and RchyOptimyx to profile the immunophenotypic landscape of NPM1-mutated acute myeloid leukemia, and present a series of novel cell types associated with that mutation.
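
To illustrate the data flow only (not flowBin's actual binning or aggregation choices, and with assumed input structures), the sketch below bins cells in the shared-marker space and then averages each tube-specific marker within each bin, yielding one combined bins x markers profile.

    # Toy multitube recombination by binning on the common markers.
    import numpy as np
    from sklearn.cluster import KMeans

    def combine_tubes(tubes, common_markers, n_bins=30, seed=0):
        """tubes: list of dicts {'markers': [names], 'data': cells x markers array};
        common_markers: names of markers measured in every tube.
        Returns dict tube-specific marker -> per-bin mean expression."""
        def common_matrix(tube):
            cols = [tube['markers'].index(m) for m in common_markers]
            return tube['data'][:, cols]

        # Define bins once, on the pooled common-marker space, so they match across tubes.
        pooled = np.vstack([common_matrix(t) for t in tubes])
        km = KMeans(n_clusters=n_bins, n_init=10, random_state=seed).fit(pooled)

        profile = {}
        for tube in tubes:
            labels = km.predict(common_matrix(tube))
            for marker in tube['markers']:
                if marker in common_markers:
                    continue
                values = tube['data'][:, tube['markers'].index(marker)]
                profile[marker] = np.array(
                    [values[labels == b].mean() if np.any(labels == b) else np.nan
                     for b in range(n_bins)])
        return profile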

Availability and implementation: FlowBin is available in Bioconductor under the Artistic 2.0 free open source license. All data used are available in FlowRepository under accessions: FR-FCM-ZZYA, FR-FCM-ZZZK and FR-FCM-ZZES.

Contact: rbrinkman@bccrc.ca.

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

Topology-function conservation in protein-protein interaction networks

Motivation: Proteins underlie the functioning of a cell, and the wiring of proteins in a protein–protein interaction network (PIN) relates to their biological functions. Proteins with similar wiring in the PIN (topology around them) have been shown to have similar functions. This property has been successfully exploited for predicting protein functions. Topological similarity is also used to guide network alignment algorithms that find similarly wired proteins between PINs of different species; these similarities are used to transfer annotation across PINs, e.g. from model organisms to human. To refine these functional predictions and annotation transfers, we need to gain insight into the variability of the topology-function relationships. For example, a function may be significantly associated with specific topologies, while another function may be weakly associated with several different topologies. Also, the topology-function relationships may differ between species.

Results: To improve our understanding of topology-function relationships and of their conservation among species, we develop a statistical framework that is built upon canonical correlation analysis. Using the graphlet degrees to represent the wiring around proteins in PINs and gene ontology (GO) annotations to describe their functions, our framework: (i) characterizes statistically significant topology-function relationships in a given species, and (ii) uncovers the functions that have conserved topology in PINs of different species, which we term topologically orthologous functions. We apply our framework to PINs of yeast and human, identifying seven biological process and two cellular component GO terms to be topologically orthologous for the two organisms.
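
A bare-bones version of the CCA step is sketched below under assumed inputs (a proteins x graphlet-orbit degree matrix and a proteins x GO-term indicator matrix); the statistical significance assessment and the cross-species comparison are omitted.

    # Canonical correlation between wiring (graphlet degrees) and function (GO terms).
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def topology_function_correlations(graphlet_degrees, go_indicators, n_components=3):
        """graphlet_degrees: proteins x orbit-degree matrix;
        go_indicators: proteins x GO-term 0/1 matrix.
        Returns the correlation of each pair of canonical variates."""
        cca = CCA(n_components=n_components).fit(graphlet_degrees, go_indicators)
        u, v = cca.transform(graphlet_degrees, go_indicators)
        return [float(np.corrcoef(u[:, i], v[:, i])[0, 1]) for i in range(n_components)]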

Availability and implementation: http://bio-nets.doc.ic.ac.uk/goCCA.zip

Contact: natasha@imperial.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SYSTEMS BIOLOGY

Large-scale extraction of brain connectivity from the neuroscientific literature

Motivation: In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity.

Results: NERs and connectivity extractors are evaluated against a manually annotated corpus. The complete in litero extraction models are also evaluated against in vivo connectivity data from ABA with an estimated precision of 78%. The resulting database contains over 4 million brain region mentions and over 100 000 (ABA) and 122 000 (BAMS) potential brain region connections. This database drastically accelerates connectivity literature review, by providing a centralized repository of connectivity data to neuroscientists.

Availability and implementation: The resulting models are publicly available at github.com/BlueBrain/bluima.

Contact: renaud.richardet@epfl.ch

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • DATA AND TEXT MINING

GrammR: graphical representation and modeling of count data with application in metagenomics

Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations.

Results: We adapt a new measure of dissimilarity, the penalized Kendall's τ-distance, which does not depend on a phylogenetic tree and is hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performance with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.
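
The two building blocks can be sketched as follows (the penalization of Kendall's τ for count data and the mPAM procedure are not reproduced): a tree-free τ-based dissimilarity between samples, embedded by metric MDS in two or more dimensions.

    # Kendall's tau-based sample dissimilarity followed by metric MDS embedding.
    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.manifold import MDS

    def tau_dissimilarity(counts):
        """counts: samples x taxa matrix. Returns a samples x samples dissimilarity
        matrix, (1 - tau) / 2, without the penalization used in the paper."""
        n = counts.shape[0]
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                tau, _ = kendalltau(counts[i], counts[j])
                d[i, j] = d[j, i] = (1.0 - tau) / 2.0
        return d

    def mds_embedding(dissimilarity, dim=3, seed=0):
        """Metric MDS coordinates in 'dim' dimensions from a precomputed dissimilarity."""
        mds = MDS(n_components=dim, dissimilarity='precomputed', random_state=seed)
        return mds.fit_transform(dissimilarity)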

Availability and implementation: GrammR is implemented as an R-package available at http://www.stat.osu.edu/~statgen/SOFTWARE/GrammR/. It may also be downloaded from http://cran.r-project.org/web/packages/GrammR/.

Contact: shili@stat.osu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • DATA AND TEXT MINING

Acquire: an open-source comprehensive cancer biobanking system

Motivation: The probability of effective treatment of cancer with a targeted therapeutic can be improved for patients with defined genotypes containing actionable mutations. To this end, many human cancer biobanks are integrating more tightly with genomic sequencing facilities and with those creating and maintaining patient-derived xenografts (PDX) and cell lines to provide renewable resources for translational research.

Results: To support the complex data management needs and workflows of several such biobanks, we developed Acquire. It is a robust, secure, web-based, database-backed open-source system that supports all major needs of a modern cancer biobank. Its modules allow for i) up-to-the-minute ‘scoreboard’ and graphical reporting of collections; ii) end user roles and permissions; iii) specimen inventory through caTissue Suite; iv) shipping forms for distribution of specimens to pathology, genomic analysis and PDX/cell line creation facilities; v) robust ad hoc querying; vi) molecular and cellular quality control metrics to track specimens’ progress and quality; vii) public researcher request; viii) resource allocation committee distribution request review and oversight and ix) linkage to available derivatives of specimen.

Availability and implementation: Acquire implements standard controlled vocabularies, ontologies and objects from the NCI, CDISC and others. Here we describe the functionality of the system, its technological stack and the processes it supports. A test version of Acquire is available at https://tcrbacquire-stg.research.bcm.edu; the software is available at https://github.com/BCM-DLDCC/Acquire; and UML models, data and workflow diagrams, behavioral specifications and other documents are available at https://github.com/BCM-DLDCC/Acquire/tree/master/supplementaryMaterials.

Contact: becnel@bcm.edu

  • Bioinformatics
  • 10 years ago
  • DATABASES AND ONTOLOGIES

rbamtools: an R interface to samtools enabling fast accumulative tabulation of splicing events over multiple RNA-seq samples

Summary: The open-source environment R is the most widely used software for the statistical exploration of biological data sets, including sequence alignments. BAM is the de facto standard file format for sequence alignments. With rbamtools, we now provide R users with a full spectrum of access to BAM files, including reading, writing, extraction of subsets and plotting of alignment depth, with a script syntax that closely follows the SAM/BAM format. Additionally, rbamtools enables fast accumulative tabulation of splicing events over multiple BAM files.

Availability and implementation: rbamtools is available on CRAN and on R-Forge.

Contact: kaisers@med.uni-duesseldorf.de

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

NLR-parser: rapid annotation of plant NLR complements

Motivation: The repetitive nature of plant disease resistance genes encoding for nucleotide-binding leucine-rich repeat (NLR) proteins hampers their prediction with standard gene annotation software. Motif alignment and search tool (MAST) has previously been reported as a tool to support annotation of NLR-encoding genes. However, the decision if a motif combination represents an NLR protein was entirely manual.

Results: The NLR-parser pipeline is designed to use the MAST output from six-frame translated amino acid sequences and filters for predefined, biologically curated motif compositions. Input reads can be derived from, for example, raw long-read sequencing data or contigs and scaffolds coming from plant genome projects. The output is a tab-separated file with information on the start and frame of the first NLR-specific motif, whether the identified sequence is a TNL or CNL, and whether it is potentially full-length or fragmented. In addition, the output of the NB-ARC domain sequence can be used directly for phylogenetic analyses. In comparison to other prediction software, the highly complex NB-ARC domain is described in detail using several individual motifs.
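
A toy version of the filtering decision is sketched below; the motif identifiers are hypothetical placeholders, since the biologically curated motif compositions are shipped with NLR-parser itself.

    # Toy classification of a candidate from its MAST motif hits.
    # The motif ID sets below are placeholders, not NLR-parser's curated lists.
    TIR_MOTIFS = {"tir_motif_a", "tir_motif_b"}
    CC_MOTIFS = {"cc_motif_a"}
    NBARC_MOTIFS = {"nbarc_motif_a", "nbarc_motif_b", "nbarc_motif_c"}

    def classify_candidate(hit_motifs):
        """hit_motifs: set of motif IDs found by MAST in one translated frame.
        Returns 'TNL', 'CNL', 'NL' or None, mimicking the filter-by-composition step."""
        if not (NBARC_MOTIFS & hit_motifs):
            return None                  # no NB-ARC evidence: not an NLR candidate
        if TIR_MOTIFS & hit_motifs:
            return "TNL"
        if CC_MOTIFS & hit_motifs:
            return "CNL"
        return "NL"                      # NB-ARC present, N-terminal class unresolved

    print(classify_candidate({"nbarc_motif_a", "tir_motif_b"}))  # -> TNL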

Availability and implementation: The NLR-parser tool can be downloaded from GitHub (github.com/steuernb/NLR-Parser). It requires a valid Java installation as well as MAST as part of the MEME Suite. The tool is run from the command line.

Contact: burkhard.steuernagel@jic.ac.uk; fjupe@salk.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

PVAAS: identify variants associated with aberrant splicing from RNA-seq

Motivation: RNA-seq has been widely used to study the transcriptome. Compared to microarrays, sequencing-based RNA-seq is able to identify splicing variants and single-nucleotide variants simultaneously in one experiment. This provides a unique opportunity to detect variants that are associated with aberrant splicing. Despite the popularity of RNA-seq, no bioinformatics tool has been developed to leverage this advantage to identify variants associated with aberrant splicing.

Results: We have developed PVAAS, a tool to identify single-nucleotide variants that are associated with aberrant alternative splicing from RNA-seq data. PVAAS works in three steps: (i) identify aberrant splicing events; (ii) use user-provided variants or perform variant calling; (iii) assess the significance of the association between variants and aberrant splicing events.

Availability and implementation: PVAAS is written in Python and C. Source code and a comprehensive user’s manual are freely available at: http://pvaas.sourceforge.net/.

Contact: wang.liguo@mayo.edu or kocher.jeanpierre@mayo.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

PASPA: a web server for mRNA poly(A) site predictions in plants and algae

Motivation: Polyadenylation is an essential process during eukaryotic gene expression. Prediction of poly(A) sites helps to define the 3' end of genes, which is important for gene annotation and for elucidating gene regulation mechanisms. However, due to limited knowledge of poly(A) signals, it is still challenging to predict poly(A) sites in plants and algae. PASPA is a web server for poly(A) site prediction in plants and algae, which integrates many in-house tools as add-ons to facilitate poly(A) site prediction, visualization and mining. The server can predict poly(A) sites for ten species, including seven species whose poly(A) signals had not previously been characterized, with sensitivity and specificity ranging between 0.80 and 0.95.

Availability and implementation: http://bmi.xmu.edu.cn/paspa

Contact: xhuister@xmu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Summary: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing such as partitioning or normalization was needed. When compared with previous methods on assembling the soil data, MEGAHIT generated a three-fold larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, a fourfold improvement.

Availability and implementation: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under GPLv3 license.

Contact: rb@l3-bioinfo.com or twlam@cs.hku.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

LINKPHASE3: an improved pedigree-based phasing algorithm robust to genotyping and map errors

Summary: Many applications in genetics require haplotype reconstruction. We present a phasing program designed for large half-sib families (as observed in plants and animals) that is robust to genotyping and map errors. We demonstrate that it is more efficient than previous versions and other programs, particularly in the presence of genotyping errors.

Availability and implementation: The software LINKPHASE3 is included in the PHASEBOOK package and can be freely downloaded from www.giga.ulg.ac.be/jcms/prod_381171/software. The package is written in FORTRAN and contains source codes. A manual is provided with the package.

Contact: tom.druet@ulg.ac.be

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENETICS AND POPULATION ANALYSIS

scrm: efficiently simulating long sequences using the approximated coalescent with recombination

Motivation: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations.

Results: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure.

Availability and implementation: The open source implementation scrm is freely available at https://scrm.github.io under the conditions of the GPLv3 license.

Contact: staab@bio.lmu.de or gerton.lunter@well.ox.ac.uk.

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENETICS AND POPULATION ANALYSIS