Modified screening and ranking algorithm for copy number variation detection

Motivation: Copy number variation (CNV) is a type of structural variation, usually defined as genomic segments that are 1 kb or larger, which present variable copy numbers when compared with a reference genome. The screening and ranking algorithm (SaRa) was recently proposed as an efficient approach for multiple change-points detection, which can be applied to CNV detection. However, some practical issues arise from application of SaRa to single nucleotide polymorphism data.

Results: In this study, we propose a modified SaRa for CNV detection to address these issues. First, we apply quantile normalization to the original intensities to guarantee that the normal mean model-based SaRa is a robust method. Second, a novel normal mixture model coupled with a modified Bayesian information criterion is proposed for candidate change-point selection and for further clustering the potential CNV segments into copy number states. Simulations revealed that the modified SaRa became a robust method for identifying change-points and achieved better performance than the circular binary segmentation (CBS) method. By applying the modified SaRa to real data from the HapMap project, we illustrated its performance in detecting CNV segments. In conclusion, our modified SaRa method improves SaRa theoretically and numerically for identifying CNVs with high-throughput genotyping data.

Availability and Implementation: The modSaRa package is implemented in R and is freely available at http://c2s2.yale.edu/software/modSaRa.

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

Clonality inference in multiple tumor samples using phylogeny

Motivation: Intra-tumor heterogeneity presents itself through the evolution of subclones during cancer progression. Although recent research suggests that this heterogeneity has clinical implications, in silico determination of the clonal subpopulations remains a challenge.

Results: We address this problem through a novel combinatorial method, named clonality inference in tumors using phylogeny (CITUP), that infers clonal populations and their frequencies while satisfying phylogenetic constraints and is able to exploit data from multiple samples. Using simulated datasets and deep sequencing data from two cancer studies, we show that CITUP predicts clonal frequencies and the underlying phylogeny with high accuracy.

Availability and implementation: CITUP is freely available at: http://sourceforge.net/projects/citup/.

Contact: cenk@sfu.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

LocNES: a computational tool for locating classical NESs in CRM1 cargo proteins

Motivation: Classical nuclear export signals (NESs) are short cognate peptides that direct proteins out of the nucleus via the CRM1-mediated export pathway. CRM1 regulates the localization of hundreds of macromolecules involved in various cellular functions and diseases. Due to the diverse and complex nature of NESs, reliable prediction of the signal remains a challenge despite several attempts made in the last decade.

Results: We present a new NES predictor, LocNES. LocNES scans query proteins for NES consensus-fitting peptides and assigns these peptides probability scores using a Support Vector Machine model, whose feature set includes amino acid sequence, disorder propensity, and the rank of position-specific scoring matrix score. LocNES demonstrates both higher sensitivity and higher precision than existing NES prediction tools in a comparative analysis using experimentally identified NESs.

Availability and implementation: LocNES is freely available at http://prodata.swmed.edu/LocNES

Contact: yuhmin.chook@utsouthwestern.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

MicroRNA modules prefer to bind weak and unconventional target sites

Motivation: MicroRNAs (miRNAs) play critical roles in gene regulation. Although it is well known that multiple miRNAs may work as miRNA modules to synergistically regulate common target mRNAs, the understanding of miRNA modules is still in its infancy.

Results: We employed the recently generated high throughput experimental data to study miRNA modules. We predicted 181 miRNA modules and 306 potential miRNA modules. We observed that the target sites of these predicted modules were in general weaker compared with those not bound by miRNA modules. We also discovered that miRNAs in predicted modules preferred to bind unconventional target sites rather than canonical sites. Surprisingly, contrary to a previous study, we found that most adjacent miRNA target sites from the same miRNA modules were not within the range of 10–130 nucleotides. Interestingly, the distance of target sites bound by miRNAs in the same modules was shorter when miRNA modules bound unconventional instead of canonical sites. Our study shed new light on miRNA binding and miRNA target sites, which will likely advance our understanding of miRNA regulation.

Availability and implementation: The software miRModule can be freely downloaded at http://hulab.ucf.edu/research/projects/miRNA/miRModule.

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact: haihu@cs.ucf.edu or xiaoman@mail.ucf.edu.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

A Bayesian framework for de novo mutation calling in parents-offspring trios

Motivation: Spontaneous (de novo) mutations play an important role in the etiology of a range of complex diseases. Identifying de novo mutations (DNMs) in sporadic cases provides an effective strategy to find genes or genomic regions implicated in the genetics of disease. High-throughput next-generation sequencing enables genome- or exome-wide detection of DNMs by sequencing parents-proband trios. It is challenging to sift true mutations from the massive amount of noise due to sequencing errors and alignment artifacts. One of the critical limitations of existing methods is that the same pre-specified mutation rate is assumed for all genomic regions, which has a significant impact on the DNM calling accuracy.

Results: In this study, we developed and implemented a novel Bayesian framework for DNM calling in trios (TrioDeNovo), which overcomes these limitations by disentangling prior mutation rates from evaluation of the likelihood of the data so that flexible priors can be adjusted post-hoc at different genomic sites. Through extensive simulations and application to real data, we showed that this new method has improved sensitivity and specificity over existing methods, and provides a flexible framework to further improve the efficiency by incorporating proper priors. The accuracy is further improved using effective filtering based on sequence alignment characteristics.

Availability and implementation: The C++ source code implementing TrioDeNovo is freely available at https://medschool.vanderbilt.edu/cgg.

Contact: bingshan.li@vanderbilt.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

UProC: tools for ultra-fast protein domain classification

Motivation: With rapidly increasing volumes of biological sequence data the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics.

Results: The ultrafast protein classification (UProC) toolbox implements a novel algorithm (‘Mosaic Matching’) for large-scale sequence analysis. UProC is three orders of magnitude faster than profile-based methods and, in a metagenome simulation study, achieved up to 80% higher sensitivity on unassembled 100 bp reads.

Availability and implementation: UProC is available as an open-source software at https://github.com/gobics/uproc. Precompiled databases (Pfam) are linked on the UProC homepage: http://uproc.gobics.de/.

Contact: peter@gobics.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Disk-based compression of data from genome sequencing

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. The more interesting solutions to this problem are disk based; the better of the two existing ones, from Cox et al. (2012), is based on the Burrows–Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage.

Results: We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method uses the conceptually simple and easily parallelizable idea of minimizers to obtain a compression ratio of 0.317 bits per base, allowing the 134.0 Gbp dataset to fit into only 5.31 GB of space.
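The minimizer idea and the size arithmetic in this abstract can be illustrated with a minimal sketch. It assumes the simplest definition of a minimizer (the lexicographically smallest k-mer of a read) and a decimal gigabyte (10^9 bytes); the function names, the value of k and the toy reads are hypothetical, and the published tool may use a different minimizer definition and a far more elaborate pipeline.

```python
def minimizer(read: str, k: int = 8) -> str:
    """Return the lexicographically smallest k-mer of the read (a simple minimizer)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k: int = 8):
    """Group reads by minimizer so that overlapping reads tend to share a bucket."""
    buckets = {}
    for read in reads:
        buckets.setdefault(minimizer(read, k), []).append(read)
    return buckets


def size_in_gb(bases: float, bits_per_base: float) -> float:
    """Compressed size in decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return bases * bits_per_base / 8 / 1e9


# Toy reads: the first two overlap by 11 bases, the third is unrelated.
reads = ["ACGTACGTTAGC", "CGTACGTTAGCA", "TTTTGGGGCCCC"]
buckets = bucket_by_minimizer(reads, k=4)

# Check the figure quoted in the abstract: 134.0 Gbp at 0.317 bits per base.
compressed_gb = size_in_gb(134.0e9, 0.317)
```

Because each bucket is processed independently, reads that share a minimizer can be handled on disk or in parallel without holding the whole dataset in memory, which is the property the abstract highlights.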

Availability and implementation: http://sun.aei.polsl.pl/orcom under a free license.

Contact: sebastian.deorowicz@polsl.pl

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized.

Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences.

Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html.

Contact: ivan.borozan@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Knowledge-based modeling of peptides at protein interfaces: PiPreD

Motivation: Protein–protein interactions (PPIs) underpin virtually all cellular processes both in health and disease. Modulating the interaction between proteins by means of small (chemical) agents is therefore a promising route for future novel therapeutic interventions. In this context, peptides are gaining momentum as emerging agents for the modulation of PPIs.

Results: We report a novel computational, structure- and knowledge-based approach to model orthosteric peptides that target PPIs: PiPreD. PiPreD relies on a precompiled, bespoke library of structural motifs, iMotifs, extracted from protein complexes, and on a fast structural modeling algorithm driven by the location of native chemical groups on the interface of the protein target, named anchor residues. PiPreD comprehensively and systematically samples the entire interface, deriving peptide conformations best suited for the given region on the protein interface. PiPreD complements the existing technologies and provides new solutions for the disruption of selected interactions.

Availability and implementation: Database and accessory scripts and programs are available upon request to the authors or at http://www.bioinsilico.org/PIPRED.

Contact: narcis.fernandez@gmail.com

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome

Motivation: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM.

Results: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.

Availability and implementation: The webserver, Java Applet, user instructions, datasets, and predicted glycosylation sites in the human proteome are freely available at http://www.structbioinfor.org/Lab/GlycoMine/.

Contact: Jiangning.Song@monash.edu or James.Whisstock@monash.edu or zhangyang@nwsuaf.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS

Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models

Motivation: In recent years, gene expression studies have increasingly made use of high-throughput sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression (DGE) has flourished, primarily in the context of normalization and differential analysis.

Results: In this work, we focus on the question of clustering DGE profiles as a means to discover groups of co-expressed genes. We propose a Poisson mixture model using a rigorous framework for parameter estimation as well as the choice of the appropriate number of clusters. We illustrate co-expression analyses using our approach on two real RNA-seq datasets. A set of simulation studies also compares the performance of the proposed model with that of several related approaches developed to cluster RNA-seq or serial analysis of gene expression data.

Availability and implementation: The proposed method is implemented in the open-source R package HTSCluster, available on CRAN.

Contact: andrea.rau@jouy.inra.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

Inferring single-cell gene expression mechanisms using stochastic simulation

Motivation: Stochastic promoter switching between transcriptionally active (ON) and inactive (OFF) states is a major source of noise in gene expression. It is often implicitly assumed that transitions between promoter states are memoryless, i.e. promoters spend an exponentially distributed time interval in each of the two states. However, increasing evidence suggests that promoter ON/OFF times can be non-exponential, hinting at more complex transcriptional regulatory architectures. Given the essential role of gene expression in all cellular functions, efficient computational techniques for characterizing promoter architectures are critically needed.

Results: We have developed a novel model reduction for promoters with arbitrary numbers of ON and OFF states, allowing us to approximate complex promoter switching behavior with Weibull-distributed ON/OFF times. Using this model reduction, we created bursty Monte Carlo expectation-maximization with modified cross-entropy method (‘bursty MCEM2’), an efficient parameter estimation and model selection technique for inferring the number and configuration of promoter states from single-cell gene expression data. Application of bursty MCEM2 to data from the endogenous mouse glutaminase promoter reveals nearly deterministic promoter OFF times, consistent with a multi-step activation mechanism consisting of 10 or more inactive states. Our novel approach to modeling promoter fluctuations together with bursty MCEM2 provides powerful tools for characterizing transcriptional bursting across genes under different environmental conditions.
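To make the contrast between memoryless and nearly deterministic dwell times concrete, the following minimal sketch simulates promoter ON/OFF switching with Weibull-distributed times. It is not the bursty MCEM2 method; the function names and every parameter value are hypothetical. A Weibull shape of 1 recovers the exponential (memoryless) case with a coefficient of variation (CV) of 1, while larger shapes give a smaller CV, that is, more regular dwell times.

```python
import math
import random


def weibull_cv(shape: float) -> float:
    """Coefficient of variation of a Weibull distribution (independent of scale)."""
    m1 = math.gamma(1.0 + 1.0 / shape)   # E[T] / scale
    m2 = math.gamma(1.0 + 2.0 / shape)   # E[T^2] / scale^2
    return math.sqrt(m2 / m1 ** 2 - 1.0)


def simulate_switching(n_cycles, on_scale, on_shape, off_scale, off_shape, seed=0):
    """Return a list of (on_start, on_end) intervals for a two-state promoter.

    Each cycle draws one OFF time followed by one ON time, both Weibull
    distributed. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    t = 0.0
    intervals = []
    for _ in range(n_cycles):
        off_time = rng.weibullvariate(off_scale, off_shape)
        on_time = rng.weibullvariate(on_scale, on_shape)
        intervals.append((t + off_time, t + off_time + on_time))
        t += off_time + on_time
    return intervals


# Exponential ON times (shape 1) and much more regular OFF times (shape 10).
intervals = simulate_switching(n_cycles=5, on_scale=1.0, on_shape=1.0,
                               off_scale=10.0, off_shape=10.0)
cv_exponential = weibull_cv(1.0)
cv_regular = weibull_cv(10.0)
```

Comparing `cv_exponential` with `cv_regular` shows how a single shape parameter can summarize the gap between a one-step switch and a more regular, multi-step-like waiting time.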

Availability and implementation: R source code implementing bursty MCEM2 is available upon request.

Contact: absingh@udel.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENE EXPRESSION

Empowering biologists with multi-omics data: colorectal cancer as a paradigm

Motivation: Recent completion of the global proteomic characterization of The Cancer Genome Atlas (TCGA) colorectal cancer (CRC) cohort resulted in the first tumor dataset with complete molecular measurements at DNA, RNA and protein levels. Using CRC as a paradigm, we describe the application of the NetGestalt framework to provide easy access and interpretation of multi-omics data.

Results: The NetGestalt CRC portal includes genomic, epigenomic, transcriptomic, proteomic and clinical data for the TCGA CRC cohort, data from other CRC tumor cohorts and cell lines, and existing knowledge on pathways and networks, giving a total of more than 17 million data points. The portal provides features for data query, upload, visualization and integration. These features can be flexibly combined to serve various needs of the users, maximizing the synergy among omics data, human visualization and quantitative analysis. Using three case studies, we demonstrate that the portal not only provides user-friendly data query and visualization but also enables efficient data integration within a single omics data type, across multiple omics data types, and over biological networks.

Availability and implementation: The NetGestalt CRC portal can be freely accessed at http://www.netgestalt.org.

Contact: bing.zhang@vanderbilt.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • DATABASES AND ONTOLOGIES

Using isoelectric point to determine the pH for initial protein crystallization trials

Motivation: The identification of suitable conditions for crystallization is a rate-limiting step in protein structure determination. The pH of an experiment is an important parameter and has the potential to be used in data-mining studies to help reduce the number of crystallization trials required. However, the pH is usually recorded as that of the buffer solution, which can be highly inaccurate.

Results: Here, we show that a better estimate of the true pH can be predicted by considering not only the buffer pH but also any other chemicals in the crystallization solution. We use these more accurate pH values to investigate the disputed relationship between the pI of a protein and the pH at which it crystallizes.

Availability and implementation: Data used to generate models are available as Supplementary Material.

Contact: julie.wilson@york.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • DATABASES AND ONTOLOGIES

A haplotype-based framework for group-wise transmission/disequilibrium tests for rare variant association analysis

Motivation: A major focus of current sequencing studies for human genetics is to identify rare variants associated with complex diseases. Aside from the reduced power to detect associated rare variants, controlling for population stratification is particularly challenging for rare variants. Transmission/disequilibrium tests (TDT) based on family designs are robust to population stratification and admixture, and therefore provide an effective approach to rare variant association studies to eliminate spurious associations. To increase the power of rare variant association analysis, gene-based collapsing methods have become standard approaches for analyzing rare variants. Existing methods that extend this strategy to rare variants in families usually combine TDT statistics at individual variants and therefore lack the flexibility to incorporate other genetic models.

Results: In this study, we describe a haplotype-based framework for group-wise TDT (gTDT) that is flexible enough to encompass a variety of genetic models such as additive, dominant and compound heterozygous (CH) (i.e. recessive) models as well as other complex interactions. Unlike existing methods, gTDT constructs haplotypes by transmission when possible and inherently takes into account the linkage disequilibrium among variants. Through extensive simulations we showed that type I error was correctly controlled for rare variants under all models investigated, and this remained true in the presence of population stratification. Under a variety of genetic models, gTDT showed increased power compared with the single marker TDT. Application of gTDT to autism exome sequencing data from 118 trios identified potentially interesting candidate genes with CH rare variants.

Availability and implementation: We implemented gTDT in C++ and the source code and the detailed usage are available on the authors’ website (https://medschool.vanderbilt.edu/cgg).

Contact: bingshan.li@vanderbilt.edu or wei.chen@chp.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENETIC AND POPULATION ANALYSIS

PBOOST: a GPU-based tool for parallel permutation tests in genome-wide association studies

Motivation: The importance of testing associations allowing for interactions has been demonstrated by Marchini et al. (2005). A fast method for detecting associations allowing for interactions has been proposed by Wan et al. (2010a). The method is based on a likelihood ratio test with the assumption that the statistic follows the $${\chi}^{2}$$ distribution. Many single nucleotide polymorphism (SNP) pairs with significant associations allowing for interactions have been detected using their method. However, the $${\chi}^{2}$$ approximation requires the expected values in each cell of the contingency table to be at least five. This assumption is violated in some identified SNP pairs. In this case, the likelihood ratio test may no longer be applicable. A permutation test is an ideal approach to checking the P-values calculated in the likelihood ratio test because of its non-parametric nature. The P-values of SNP pairs having significant associations with disease are always extremely small. Thus, we need a huge number of permutations to achieve correspondingly high resolution for the P-values. To investigate whether the P-values from likelihood ratio tests are reliable, a fast permutation tool that can perform a large number of permutations is desirable.

Results: We developed a permutation tool named PBOOST. It is GPU based and provides highly reliable P-value estimation. Using simulation data, we found that the P-values from likelihood ratio tests have a relative error of >100% when 50% of the cells in the contingency table have an expected count of less than five or when any cell of the contingency table has an expected count of zero. In terms of speed, PBOOST completed $${10}^{7}$$ permutations for a single SNP pair from the Wellcome Trust Case Control Consortium (WTCCC) genome data (Wellcome Trust Case Control Consortium, 2007) within 1 min on a single Nvidia Tesla M2090 device, while the same task took 60 min on a single Intel Xeon E5-2650 CPU. More importantly, when simultaneously testing 256 SNP pairs for $${10}^{7}$$ permutations, our tool took only 5 min, while the CPU program took 10 h. By permuting on a GPU cluster consisting of 40 nodes, we completed $${10}^{12}$$ permutations for all 280 SNP pairs reported with P-values smaller than $$1.6\times {10}^{-12}$$ in the WTCCC datasets in 1 week.
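The reason so many permutations are needed can be seen in a small CPU-only sketch of a permutation test on a likelihood ratio (G) statistic. This is an illustrative assumption about the general technique, not the PBOOST implementation: the function names, the encoding of a SNP pair as one combined genotype code per individual, and the toy data are all hypothetical. With n permutations the smallest P-value that can be reported is 1/(n + 1), so resolving P-values near $${10}^{-12}$$ requires on the order of $${10}^{12}$$ permutations.

```python
import math
import random
from collections import Counter


def g_statistic(genotypes, labels):
    """Likelihood ratio statistic G = 2 * sum(O * ln(O / E)) for a genotype-by-status table."""
    n = len(labels)
    cells = Counter(zip(genotypes, labels))
    rows = Counter(genotypes)
    cols = Counter(labels)
    g = 0.0
    for (genotype, label), observed in cells.items():
        expected = rows[genotype] * cols[label] / n
        g += 2.0 * observed * math.log(observed / expected)
    return g


def permutation_p_value(genotypes, labels, n_perm=1000, seed=0):
    """Estimate the P-value by shuffling case/control labels n_perm times."""
    rng = random.Random(seed)
    observed = g_statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise in equal statistics.
        if g_statistic(genotypes, shuffled) >= observed - 1e-12:
            hits += 1
    # The +1 terms keep the estimate strictly positive: min P = 1 / (n_perm + 1).
    return (hits + 1) / (n_perm + 1)


# Toy data: combined genotype code of a SNP pair (0 or 1 here) and disease status.
genotypes = [0] * 20 + [1] * 20
labels = [0] * 20 + [1] * 20
p_value = permutation_p_value(genotypes, labels, n_perm=200)
```

Each permutation is independent of the others, which is what makes the procedure a natural fit for the GPU parallelism described in the abstract.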

Availability and implementation: The source code and sample data are available at http://bioinformatics.ust.hk/PBOOST.zip.

Contact: gyang@ust.hk; eeyu@ust.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

J-Circos: an interactive Circos plotter

Summary: Circos plots are graphical outputs that display three-dimensional chromosomal interactions and fusion transcripts. However, the Circos plot tool is not an interactive visualization tool, but rather a figure generator. For example, it does not enable data to be added dynamically, nor does it provide information for specific data points interactively. Recently, an R-based Circos tool (RCircos) has been developed to integrate Circos into R, but similarly, RCircos can only be used to generate plots. Thus, we have developed a Circos plot tool (J-Circos) that is an interactive visualization tool: it can plot Circos figures, dynamically add data to the figure, and provide information for specific data points using mouse hover display and zoom in/out functions. J-Circos is written in Java, which enables it to be used on most operating systems (Windows, MacOS, Linux). Users can input data into J-Circos using flat data formats, as well as from the graphical user interface (GUI). J-Circos will enable biologists to better study more complex chromosomal interactions and fusion transcripts that are otherwise difficult to visualize from next-generation sequencing data.

Availability and implementation: J-Circos and its manual are freely available at http://www.australianprostatecentre.org/research/software/jcircos

Contact: j.an@qut.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

PRSice: Polygenic Risk Score software

Summary: A polygenic risk score (PRS) is a sum of trait-associated alleles across many genetic loci, typically weighted by effect sizes estimated from a genome-wide association study. The application of PRS has grown in recent years as their utility for detecting shared genetic aetiology among traits has become appreciated; PRS can also be used to establish the presence of a genetic signal in underpowered studies, to infer the genetic architecture of a trait, for screening in clinical trials, and can act as a biomarker for a phenotype. Here we present the first dedicated PRS software, PRSice (‘precise’), for calculating, applying, evaluating and plotting the results of PRS. PRSice can calculate PRS at a large number of thresholds (‘high resolution’) to provide the best-fit PRS, as well as provide results calculated at broad P-value thresholds, can thin Single Nucleotide Polymorphisms (SNPs) according to linkage disequilibrium and P-value or use all SNPs, handles genotyped and imputed data, can calculate and incorporate ancestry-informative variables, and can apply PRS across multiple traits in a single run. We exemplify the use of PRSice via application to data on schizophrenia, major depressive disorder and smoking, illustrate the importance of identifying the best-fit PRS and estimate a P-value significance threshold for high-resolution PRS studies.
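The definition of a PRS given above, a weighted sum of allele counts restricted to SNPs passing a P-value threshold, can be sketched in a few lines. This is a minimal illustration and not PRSice code: the function names, effect sizes, P-values and genotypes are hypothetical, and steps such as linkage disequilibrium thinning and best-fit selection are omitted.

```python
def polygenic_risk_score(allele_counts, effect_sizes, p_values, threshold):
    """Sum effect-weighted allele counts over SNPs with GWAS P-value <= threshold."""
    return sum(count * beta
               for count, beta, p in zip(allele_counts, effect_sizes, p_values)
               if p <= threshold)


def scores_across_thresholds(individuals, effect_sizes, p_values, thresholds):
    """Compute one score per individual at each P-value threshold."""
    return {
        t: [polygenic_risk_score(ind, effect_sizes, p_values, t) for ind in individuals]
        for t in thresholds
    }


# Hypothetical GWAS summary statistics for three SNPs.
effect_sizes = [0.2, -0.1, 0.05]
p_values = [1e-8, 0.01, 0.4]

# Allele counts (0, 1 or 2 copies of the effect allele) for two individuals.
individuals = [[2, 1, 0], [0, 2, 2]]

scores = scores_across_thresholds(individuals, effect_sizes, p_values,
                                  thresholds=[1e-5, 0.05, 0.5])
```

Scanning many closely spaced thresholds in this way is what the abstract calls high-resolution scoring; choosing among them would additionally require a phenotype to evaluate each score against.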

Availability and implementation: PRSice is written in R, including wrappers for bash data management scripts and PLINK-1.9 to minimize computational time. PRSice runs as a command-line program with a variety of user-options, and is freely available for download from http://PRSice.info

Contact: jack.euesden@kcl.ac.uk or paul.oreilly@kcl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • GENOME ANALYSIS

VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications

Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing.

Availability and implementation: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim.

Contact: rd@bina.com

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

PrimerDesign-M: a multiple-alignment based multiple-primer design tool for walking across variable genomes

Summary: Analyses of entire viral genomes or mtDNA require comprehensive design of many primers across their genomes. Furthermore, simultaneous optimization of several DNA primer design criteria may improve overall experimental efficiency and downstream bioinformatic processing. To achieve these goals, we developed PrimerDesign-M. It includes several options for multiple-primer design, allowing researchers to efficiently design walking primers that cover long DNA targets, such as entire HIV-1 genomes. It optimizes primers using both the genetic diversity in multiple alignments and the experimental design constraints given by the user. PrimerDesign-M can also design primers that include DNA barcodes and minimize primer dimerization. PrimerDesign-M finds optimal primers for highly variable DNA targets and facilitates design flexibility by suggesting alternative designs to adapt to experimental conditions.

Availability and implementation: PrimerDesign-M is available as a webtool at http://www.hiv.lanl.gov/content/sequence/PRIMER_DESIGN/primer_design.html

Contact: tkl@lanl.gov or seq-info@lanl.gov.

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

CDvist: a webserver for identification and visualization of conserved domains in protein sequences

Summary: Identification of domains in protein sequences allows them to be assigned to biological functions. Several webservers exist for identification of protein domains using similarity searches against various databases of protein domain models. However, none of them provides comprehensive domain coverage while allowing bulk querying, and their visualization schemes can be improved. To address these issues, we developed CDvist (a comprehensive domain visualization tool), which combines the best available search algorithms and databases into a user-friendly framework. First, a given protein sequence is matched to domain models using high-specificity tools; only then are the unmatched segments subjected to more sensitive algorithms, resulting in the best possible comprehensive coverage. Bulk querying and rich visualization and download options provide improved functionality for domain architecture analysis.
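
The tiered strategy described above (specific tools first, more sensitive tools only on what remains unmatched) can be sketched in Python as follows. This is an illustration, not CDvist's implementation: the function names, the 1-based inclusive coordinates, the `min_len` threshold and the representation of each search tool as a plain callable are assumptions.

```python
def unmatched_segments(seq_len, hits, min_len=1):
    """Return 1-based inclusive (start, end) stretches not covered by any hit."""
    gaps = []
    next_free = 1
    for start, end in sorted(hits):
        if start - next_free >= min_len:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if seq_len - next_free + 1 >= min_len:
        gaps.append((next_free, seq_len))
    return gaps


def tiered_search(sequence, tools, min_len=30):
    """Run `tools` in order; each tool only sees segments still unmatched.

    Each tool is a hypothetical callable that takes a subsequence and returns
    (start, end, domain_name) hits in coordinates local to that subsequence.
    Segments shorter than `min_len` are not searched.
    """
    hits = []
    for tool in tools:
        covered = [(start, end) for start, end, _ in hits]
        for gap_start, gap_end in unmatched_segments(len(sequence), covered, min_len):
            segment = sequence[gap_start - 1:gap_end]
            offset = gap_start - 1  # convert local hit coordinates to global
            for start, end, name in tool(segment):
                hits.append((start + offset, end + offset, name))
    return sorted(hits)


# Toy stand-ins for a specific and a sensitive search tool.
def strict_tool(segment):
    return [(3, 5, "A")] if len(segment) == 10 else []


def sensitive_tool(segment):
    return [(1, len(segment), "B")]


result = tiered_search("ABCDEFGHIJ", [strict_tool, sensitive_tool], min_len=2)
```

Because the sensitive tool never sees residues already claimed by the specific tool, its looser matches cannot override the higher-confidence assignments.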

Availability and implementation: Freely available on the web at http://cdvist.utk.edu

Contact: oadebali@vols.utk.edu or ijouline@utk.edu

  • Bioinformatics
  • 10 years ago
  • SEQUENCE ANALYSIS

Epock: rapid analysis of protein pocket dynamics

Summary: The volume of an internal protein pocket is fundamental to ligand accessibility. Few programs that compute such volumes manage dynamic data from molecular dynamics (MD) simulations. Limited performance often prohibits analysis of large datasets. We present Epock, an efficient command-line tool that calculates pocket volumes from MD trajectories. A plugin for the VMD program provides a graphical user interface to facilitate input creation, run Epock and analyse the results.

Availability and implementation: Epock C++ source code, Python analysis scripts, VMD Tcl plugin, documentation and installation instructions are freely available at http://epock.bitbucket.org.

Contact: benoist.laurent@gmail.com or baaden@smplinux.de

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS

CONSRANK: a server for the analysis, comparison and ranking of docking models based on inter-residue contacts

Summary: Herein, we present CONSRANK, a web tool for analyzing, comparing and ranking protein–protein and protein–nucleic acid docking models, based on the conservation of inter-residue contacts and its visualization in 2D and 3D interactive contact maps.
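
As an illustration of ranking by contact conservation, the Python sketch below scores each docking model by how often its inter-residue contacts recur across the whole set of models. The summary does not spell out the scoring formula, so the averaging used here, the function name and the contact labels are assumptions rather than a description of the server's method.

```python
from collections import Counter


def rank_by_contact_conservation(models):
    """Rank models by the mean conservation of their inter-residue contacts.

    `models` maps a model name to a set of contacts, each contact being a
    pair of residue labels. A contact's conservation is the fraction of
    models that contain it; a model's score is the mean over its contacts.
    """
    n_models = len(models)
    counts = Counter(c for contacts in models.values() for c in contacts)
    scores = {}
    for name, contacts in models.items():
        if contacts:
            total = sum(counts[c] / n_models for c in contacts)
            scores[name] = total / len(contacts)
        else:
            scores[name] = 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical contact sets for three docking models.
models = {
    "m1": {("A10", "B5"), ("A11", "B6")},
    "m2": {("A10", "B5"), ("A12", "B7")},
    "m3": {("A10", "B5"), ("A11", "B6")},
}
ranking = rank_by_contact_conservation(models)
```

In this toy set, `m2` ranks last because one of its two contacts appears in no other model.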

Availability and implementation: CONSRANK is accessible as a public web tool at https://www.molnac.unisa.it/BioTools/consrank/.

Contact: romina.oliva@uniparthenope.it

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS

RRDistMaps: a UCSF Chimera tool for viewing and comparing protein distance maps

Motivation: Contact maps are a convenient way for structural biologists to identify structural features through two-dimensional simplification. Binary (yes/no) contact maps with a single cutoff distance can be generalized to show continuous distance ranges. We have developed a UCSF Chimera tool, RRDistMaps, to compute such generalized maps in order to analyze pairwise variations in intramolecular contacts. An interactive utility, RRDistMaps, visualizes conformational changes, both local (e.g. binding-site residues) and global (e.g. hinge motion), between unbound and bound proteins through distance patterns. Users can target residue pairs in RRDistMaps for further navigation in Chimera. The interface contains the unique features of identifying long-range residue motion and aligning sequences to simultaneously compare distance maps.
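
The relationship between a binary contact map and a continuous distance map can be shown with a short Python sketch. It does not use the Chimera API: the function names, the one-point-per-residue coordinates and the 8 Å cutoff are assumptions chosen for illustration.

```python
import math


def distance_map(coords):
    """Return the full matrix of pairwise Euclidean distances."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(dmap, cutoff=8.0):
    """Threshold a distance map into a binary (yes/no) contact map."""
    return [[d <= cutoff for d in row] for row in dmap]


def difference_map(dmap_a, dmap_b):
    """Element-wise change in distance from conformation A to conformation B."""
    return [
        [b - a for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(dmap_a, dmap_b)
    ]


# Hypothetical per-residue coordinates (in Å) for two conformations.
unbound = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
bound = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 10.0, 0.0)]

d_unbound = distance_map(unbound)
d_bound = distance_map(bound)
delta = difference_map(d_unbound, d_bound)
contacts_bound = contact_map(d_bound)
```

The binary map only records that the third residue lost its contacts in the bound form, whereas the difference map also records by how much each distance changed.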

Availability and implementation: RRDistMaps was developed as part of UCSF Chimera release 1.10, which is freely available at http://rbvi.ucsf.edu/chimera/download.html, and operates on Linux, Windows, and Mac OS.

Contact: conrad@cgl.ucsf.edu

  • Bioinformatics
  • 10 years ago
  • STRUCTURAL BIOINFORMATICS