Analysis of nanopore data using hidden Markov models

Motivation: Nanopore-based sequencing techniques can reconstruct properties of biosequences by analyzing the sequence-dependent ionic current steps produced as biomolecules pass through a pore. Typically this involves alignment of new data to a reference, where both reference construction and alignment have been performed by hand.

Results: We propose an automated method for aligning nanopore data to a reference through the use of hidden Markov models. Several features that arise from prior processing steps and from the class of enzyme used can be simply incorporated into the model. Previously, the M2MspA nanopore was shown to be sensitive enough to distinguish between cytosine, methylcytosine and hydroxymethylcytosine. We validated our automated methodology on a subset of that data by automatically calculating an error rate for the distinction between the three cytosine variants and show that the automated methodology produces a 2–3% error rate, lower than the 10% error rate from previous manual segmentation and alignment.

Availability and implementation: The data, output, scripts and tutorials replicating the analysis are available at https://github.com/UCSCNanopore/Data/tree/master/Automation.

Contact: karplus@soe.ucsc.edu or jmschreiber91@gmail.com

Supplementary information: Supplementary data are available from Bioinformatics online.
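
The alignment idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model. It assumes a left-to-right HMM with one state per reference position, Gaussian emissions around hypothetical reference current levels, and three illustrative transitions (stay, step, skip) to absorb over-segmentation and missed steps. The function names, probabilities and standard deviation are all assumptions made for the example.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi_align(segments, ref_means, std=1.0, p_stay=0.1, p_step=0.8, p_skip=0.1):
    """Return the most likely reference index for each observed segment mean.

    Hypothetical model: state j emits N(ref_means[j], std); from state i the
    path may stay at i, step to i + 1, or skip to i + 2.
    """
    n_states = len(ref_means)
    log_trans = {0: math.log(p_stay), 1: math.log(p_step), 2: math.log(p_skip)}
    neg_inf = float("-inf")

    # score[j] holds the best log-probability of any path ending in state j.
    score = [neg_inf] * n_states
    score[0] = gaussian_logpdf(segments[0], ref_means[0], std)  # assume the read starts at position 0
    back = []

    for x in segments[1:]:
        new_score = [neg_inf] * n_states
        pointers = [0] * n_states
        for j in range(n_states):
            for jump, log_p in log_trans.items():
                i = j - jump
                if i >= 0 and score[i] + log_p > new_score[j]:
                    new_score[j] = score[i] + log_p
                    pointers[j] = i
            if new_score[j] > neg_inf:
                new_score[j] += gaussian_logpdf(x, ref_means[j], std)
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi_align([10.1, 19.8, 20.3, 40.2], [10, 20, 30, 40]))
```

In this toy input the third segment repeats the second reference level and the third reference level is never observed, so the decoded path has to use both a stay and a skip transition.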

  • Bioinformatics
  • 10 years ago
  • ORIGINAL PAPER

FR database 1.0: a resource focused on fruit development and ripening

Fruits form a unique growing period in the life cycle of higher plants. They provide essential nutrients and have beneficial effects on human health. Characterizing the genes involved in fruit development and ripening is fundamental to understanding the biological process and improving horticultural crops. Although numerous characterized genes participate in regulating fruit development and ripening at different stages, no dedicated bioinformatic resource for fruit development and ripening is available. In this study, we have developed such a database, FR database 1.0, using manual curation from 38 423 articles published before 1 April 2014, and integrating protein interactomes and several transcriptome datasets. It provides detailed information for 904 genes derived from 53 organisms reported to participate in fleshy fruit development and ripening. Genes from climacteric and non-climacteric fruits are also annotated, with several interesting Gene Ontology (GO) terms being enriched for these two gene sets and seven ethylene-related GO terms found only in the climacteric fruit group. Furthermore, protein–protein interaction analysis integrating information from FR database presents a possible functional network that affects fleshy fruit size formation. Collectively, FR database will be a valuable platform for comprehensive understanding and future experiments in fruit biology.

Database URL: http://www.fruitech.org/

  • Database
  • 9 years ago
  • Database Tool

LocSigDB: a database of protein localization signals

LocSigDB (http://genome.unmc.edu/LocSigDB/) is a manually curated database of experimental protein localization signals for eight distinct subcellular locations, primarily in eukaryotic cells, with brief coverage of bacterial proteins. Proteins must be localized at their appropriate subcellular compartment to perform their desired function. Mislocalization of proteins to unintended locations is a causative factor for many human diseases; therefore, collection of known sorting signals will help support many important areas of biomedical research. By performing an extensive literature study, we compiled a collection of 533 experimentally determined localization signals, along with the proteins that harbor such signals. Each signal in LocSigDB is annotated with its localization, source and PubMed references, and is linked to the proteins in the UniProt database that contain the same amino acid pattern as the given signal, along with their organism information. From the LocSigDB webserver, users can download the whole database or browse/search for data using an intuitive query interface. To date, LocSigDB is the most comprehensive compendium of protein localization signals for eight distinct subcellular locations.

Database URL: http://genome.unmc.edu/LocSigDB/

  • Database
  • 9 years ago
  • Database tool

Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora

Concept recognition tools rely on the availability of textual corpora to assess their performance and enable the identification of areas for improvement. Typically, corpora are developed for specific purposes, such as gene name recognition. Gene and protein name identification are longstanding goals of biomedical text mining, and therefore a number of different corpora exist. However, phenotypes only recently became an entity of interest for specialized concept recognition systems, and hardly any annotated text is available for performance testing and training. Here, we present a unique corpus, capturing text spans from 228 abstracts manually annotated with Human Phenotype Ontology (HPO) concepts and harmonized by three curators, which can be used as a reference standard for free text annotation of human phenotypes. Furthermore, we developed a test suite for standardized concept recognition error analysis, incorporating 32 different types of test cases corresponding to 2164 HPO concepts. Finally, three established phenotype concept recognizers (NCBO Annotator, OBO Annotator and Bio-LarK CR) were comprehensively evaluated, and results are reported against both the text corpus and the test suites. The gold standard and test suites corpora are available from http://bio-lark.org/hpo_res.html.

Database URL: http://bio-lark.org/hpo_res.html

  • Database
  • 9 years ago
  • Original Article

PathCards: multi-source consolidation of human biological pathways

The study of biological pathways is key to a large number of systems analyses. However, many relevant tools consider a limited number of pathway sources, missing out on many genes and gene-to-gene connections. Simply pooling several pathway sources would result in redundancy and a lack of systematic pathway interrelations. To address this, we applied a combination of hierarchical clustering and nearest neighbor graph representation, with judiciously selected cutoff values, thereby consolidating 3215 human pathways from 12 sources into a set of 1073 SuperPaths. Our unification algorithm finds a balance between reducing redundancy and optimizing the level of pathway-related informativeness for individual genes. We show a substantial enhancement of the SuperPaths’ capacity to infer gene-to-gene relationships when compared with individual pathway sources, separately or taken together. Further, we demonstrate that the chosen 12 sources entail nearly exhaustive gene coverage. The computed SuperPaths are presented in a new online database, PathCards, showing each SuperPath, its constituent network of pathways, and its contained genes. This provides researchers with a rich, searchable systems analysis resource.

Database URL: http://pathcards.genecards.org/
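
The consolidation step described in this abstract can be pictured with a small sketch. This is only an illustration of clustering pathways by gene content, not the PathCards algorithm: the Jaccard similarity, the greedy merge order, the cutoff value of 0.5 and the function names are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two gene sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consolidate(pathways, cutoff=0.5):
    """Greedily merge the most similar pair of clusters until none reaches the cutoff.

    pathways: dict mapping a pathway name to a set of gene symbols.
    Returns a list of (set_of_pathway_names, set_of_genes) tuples.
    """
    clusters = [({name}, set(genes)) for name, genes in pathways.items()]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = jaccard(clusters[i][1], clusters[j][1])
                if s >= cutoff and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:
            return clusters
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    toy = {
        "A": {"g1", "g2", "g3"},
        "B": {"g1", "g2", "g3", "g4"},
        "C": {"g9", "g10"},
    }
    for names, genes in consolidate(toy):
        print(sorted(names), sorted(genes))
```

Raising the cutoff keeps more pathways separate, while lowering it merges more aggressively, which is the redundancy versus informativeness trade-off the abstract refers to.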

  • Database
  • 9 years ago
  • Original Article

PreDREM: a database of predicted DNA regulatory motifs from 349 human cell and tissue samples

PreDREM is a database of DNA regulatory motifs and motif modules predicted from DNase I hypersensitive sites in 349 human cell and tissue samples. It contains 845–1325 predicted motifs in each sample, which result in a total of 2684 non-redundant motifs. In comparison with seven large collections of known motifs, more than 84% of the 2684 predicted motifs are similar to the known motifs, and 54–76% of the known motifs are similar to the predicted motifs. PreDREM also stores 43 663–2 013 288 motif modules in each sample, which provide the cofactor motifs of each predicted motif. Compared with motifs of known interacting transcription factor (TF) pairs in eight resources, on average, 84% of motif pairs corresponding to known interacting TF pairs are included in the predicted motif modules. Through its web interface, PreDREM allows users to browse motif information by tissues, datasets, individual non-redundant motifs, etc. Users can also search for motifs, motif modules, instances of motifs and motif modules in given genomic regions, the tissue or cell types in which a motif occurs, etc. PreDREM thus provides a useful resource for the understanding of cell- and tissue-specific gene regulation in the human genome.

Database URL: http://server.cs.ucf.edu/predrem/.

  • Database
  • 9 years ago
  • Database Tool

Quantitative visualization of alternative exon expression from RNA-seq data

Motivation: Analysis of RNA sequencing (RNA-Seq) data revealed that the vast majority of human genes express multiple mRNA isoforms, produced by alternative pre-mRNA splicing and other mechanisms, and that most alternative isoforms vary in expression between human tissues. As RNA-Seq datasets grow in size, it remains challenging to visualize isoform expression across multiple samples.

Results: To help address this problem, we present Sashimi plots, a quantitative visualization of aligned RNA-Seq reads that enables quantitative comparison of exon usage across samples or experimental conditions. Sashimi plots can be made using the Broad Integrated Genome Viewer or with a stand-alone command line program.

Availability and implementation: Software code and documentation are freely available at http://miso.readthedocs.org/en/fastmiso/sashimi.html

Contact: mesirov@broadinstitute.org, airoldi@fas.harvard.edu or cburge@mit.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 10 years ago
  • APPLICATIONS NOTE

Merging and scoring molecular interactions utilising existing community standards: tools, use-cases and a case study

The evidence that two molecules interact in a living cell is often inferred from multiple different experiments. Experimental data is captured in multiple repositories, but there is no simple way to assess the evidence of an interaction occurring in a cellular environment. Merging and scoring of data are commonly required operations after querying for the details of specific molecular interactions, to remove redundancy and assess the strength of accompanying experimental evidence. We have developed both a merging algorithm and a scoring system for molecular interactions based on the proteomics standard initiative–molecular interaction standards. In this manuscript, we introduce these two algorithms and provide community access to the tool suite, describe examples of how these tools are useful to selectively present molecular interaction data and demonstrate a case where the algorithms were successfully used to identify a systematic error in an existing dataset.

  • Database
  • 9 years ago
  • Database Tool

EcoliNet: a database of cofunctional gene network for Escherichia coli

During the past several decades, Escherichia coli has been a treasure chest for molecular biology. The molecular mechanisms of many fundamental cellular processes have been discovered through research on this bacterium. Although much basic research now focuses on more complex model organisms, E. coli still remains important in metabolic engineering and synthetic biology. Despite its long history as a subject of molecular investigation, more than one-third of the E. coli genome has no pathway annotation supported by either experimental evidence or manual curation. Recently, a network-assisted genetics approach to the efficient identification of novel gene functions has increased in popularity. To accelerate the speed of pathway annotation for the remaining uncharacterized part of the E. coli genome, we have constructed a database of cofunctional gene network with near-complete genome coverage of the organism, dubbed EcoliNet. We find that EcoliNet is highly predictive for diverse bacterial phenotypes, including antibiotic response, indicating that it will be useful in prioritizing novel candidate genes for a wide spectrum of bacterial phenotypes. We have implemented a web server where biologists can easily run network algorithms over EcoliNet to predict novel genes involved in a pathway or novel functions for a gene. All integrated cofunctional associations can be downloaded, enabling orthology-based reconstruction of gene networks for other bacterial species as well.

Database URL: http://www.inetbio.org/ecolinet
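
As a rough illustration of how a cofunctional network can prioritize candidate genes for a pathway, the sketch below ranks genes by the summed edge weight linking them to a set of known "seed" genes. This guilt-by-association scoring is an assumption chosen for simplicity. The abstract does not state which network algorithms the EcoliNet server runs, and the gene names, weights and function name here are hypothetical.

```python
def prioritize(network, seeds):
    """Rank non-seed genes by total edge weight to the seed genes.

    network: dict mapping an undirected edge (gene_a, gene_b) to a weight.
    seeds: set of genes already known to belong to the pathway.
    Returns a list of (gene, score) sorted by descending score, then name.
    """
    scores = {}
    for (a, b), weight in network.items():
        if a in seeds and b not in seeds:
            scores[b] = scores.get(b, 0.0) + weight
        elif b in seeds and a not in seeds:
            scores[a] = scores.get(a, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    toy_network = {
        ("geneA", "geneB"): 2.0,
        ("geneB", "geneC"): 1.5,
        ("geneA", "geneC"): 1.0,
        ("geneC", "geneD"): 3.0,
        ("geneB", "geneD"): 0.5,
    }
    for gene, score in prioritize(toy_network, {"geneA", "geneB"}):
        print(gene, score)
```

Edges between two seed genes, or between two non-seed genes, do not contribute to any score in this simple scheme.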

  • Database
  • 9 years ago
  • Original Article

Fishing for data and sorting the catch: assessing the data quality, completeness and fitness for use of data in marine biogeographic databases

Being able to assess the quality and level of completeness of data has become indispensable in marine biodiversity research, especially when dealing with large databases that typically compile data from a variety of sources. Very few integrated databases offer quality flags on the level of the individual record, making it hard for users to easily extract the data that are fit for their specific purposes. This article describes the different steps that were developed to analyse the quality and completeness of the distribution records within the European and international Ocean Biogeographic Information Systems (EurOBIS and OBIS). Records are checked on data format, completeness and validity of information, quality and detail of the taxonomy and geographic indications used, and whether or not the record is a putative outlier. The corresponding quality control (QC) flags will not only help users with their data selection, they will also help the data management team and the data custodians to identify possible gaps and errors in the submitted data, providing scope to improve data quality. The results of these quality control procedures are now available on both the EurOBIS and OBIS databases. Through the Biology portal of the European Marine Observation and Data Network (EMODnet Biology), a subset of EurOBIS records, those passing a specific combination of these QC steps, is offered to users. In the future, EMODnet Biology will offer a wide range of filter options through its portal, allowing users to make specific selections themselves. Through LifeWatch, users can already upload their own data and check them against a selection of the quality control procedures described here.

Database URL: www.eurobis.org (www.iobis.org; www.emodnet-biology.eu/)

  • Database
  • 9 years ago
  • Original Article

Comparison of human cell signaling pathway databases--evolution, drawbacks and challenges

Elucidating the complexities of cell signaling pathways is of immense importance for understanding various biological phenomena, such as the dynamics of gene/protein expression regulation, cell fate determination, embryogenesis and disease progression. The successful completion of the human genome project has also helped experimental and theoretical biologists to analyze various important pathways. To advance this study, during the past two decades, systematic collections of pathway data from experimental studies have been compiled and distributed freely by several databases, which also integrate various computational tools for further analysis. Despite significant advancements, several drawbacks and challenges remain, such as pathway data heterogeneity, annotation, regular updates and automated image reconstruction, which motivated us to perform a thorough review of 24 popular and actively functioning cell signaling databases. Based on two major characteristics, pathway information and technical details, freely accessible data from commercial and academic databases are examined to understand their evolution and enrichment. This review not only helps to identify some novel and useful features that are not yet included in any of the databases, but also highlights their current limitations and proposes reasonable solutions for future database development, which could be useful to the whole scientific community.

  • Database
  • 9 years ago
  • Review

PhenoMiner: a quantitative phenotype database for the laboratory rat, Rattus norvegicus. Application in hypertension and renal disease

Rats have been used extensively as animal models to study physiological and pathological processes involved in human diseases. Numerous rat strains have been selectively bred for certain biological traits related to specific medical interests. Recently, the Rat Genome Database (http://rgd.mcw.edu) has initiated the PhenoMiner project to integrate quantitative phenotype data from the PhysGen Program for Genomic Applications and the National BioResource Project in Japan as well as manual annotations from biomedical literature. PhenoMiner, the search engine for these integrated phenotype data, facilitates mining of data sets across studies by searching the database with a combination of terms from four different ontologies/vocabularies (Rat Strain Ontology, Clinical Measurement Ontology, Measurement Method Ontology and Experimental Condition Ontology). In this study, salt-induced hypertension was used as a model to retrieve blood pressure records of Brown Norway, Fawn-Hooded Hypertensive (FHH) and Dahl salt-sensitive (SS) rat strains. The records from these three strains served as a basis for comparing records from consomic/congenic/mutant offspring derived from them. We examined the cardiovascular and renal phenotypes of consomics derived from FHH and SS, and of SS congenics and mutants. The availability of quantitative records across laboratories in one database, such as those provided by PhenoMiner, can empower researchers to make the best use of publicly available data.

Database URL: http://rgd.mcw.edu

  • Database
  • 9 years ago
  • Original Article

Modeling a microbial community and biodiversity assay with OBO Foundry ontologies: the interoperability gains of a modular approach

The advent of affordable sequencing technology provides for a new generation of explorers who probe the world’s microbial diversity. Projects such as Tara Oceans, Moorea Biocode Project and Gut Microbiome rely on sequencing technologies to probe community diversity. Either targeted gene surveys (also known as community surveys) or complete metagenomes are evaluated. The former, being the less costly of the two methods, relies on the identification of specific genomic regions, which can be used as a proxy to estimate genetic distance between related species in a phylum. For instance, 16S ribosomal RNA gene surveys are used to probe bacterial communities, while internal transcribed spacer surveys can be used for probing fungal communities. With the explosion of projects and frenzy to explore new domains of life, scientists in the field have issued guidelines to report minimal information (following a checklist), ensuring that information is contextualized in a meaningful way. Yet the semantics of a checklist are not explicit. We demonstrate here how a tabular template can be used to collect information on microbial diversity using an explicit representation in the Resource Description Framework that is consistent with community agreed-upon knowledge representation patterns found in the Ontology for Biomedical Investigations.

  • Database
  • 9 years ago
  • Original Article

LFQC: a lossless compression algorithm for FASTQ files

Motivation: Next-generation sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is economic storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large Fastq files using innovative compression techniques.

Results: We introduce a new lossless non-reference-based FASTQ compression algorithm named lossless FastQ compressor (LFQC). We have compared our algorithm with other state-of-the-art big data compression algorithms, namely gzip, bzip2, fastqz, fqzcomp, G-SQZ, SCALCE, Quip, DSRC, DSRC-LZ, etc. This comparison reveals that our algorithm achieves better compression ratios. The improvement obtained is up to 225%. For example, on one of the datasets (SRR065390_1), the average improvement (over all the algorithms compared) is 74.62%.

Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/~rajasek/FastqPrograms.zip.

Contact: rajasek@engr.uconn.edu
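
To make "lossless, non-reference-based" concrete, the sketch below splits a FASTQ file into identifier, sequence and quality streams, compresses each stream separately with the standard-library `zlib` module, and restores the original text exactly. It is not the LFQC algorithm. Field-wise splitting is a common design in FASTQ compressors and is assumed here purely for illustration, as are the function names and the requirement that every separator line is a bare `+`.

```python
import zlib


def compress_fastq(text):
    """Split FASTQ text into three streams and compress each with zlib.

    Assumes four lines per record and a bare '+' separator line, so the
    separator carries no information and can be dropped without loss.
    """
    lines = text.strip("\n").split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    if any(sep != "+" for sep in lines[2::4]):
        raise ValueError("this sketch only handles bare '+' separator lines")
    streams = (lines[0::4], lines[1::4], lines[3::4])  # ids, sequences, qualities
    return tuple(zlib.compress("\n".join(s).encode("ascii"), 9) for s in streams)


def decompress_fastq(streams):
    """Rebuild the FASTQ text from the three compressed streams."""
    ids, seqs, quals = (zlib.decompress(s).decode("ascii").split("\n") for s in streams)
    records = []
    for identifier, seq, qual in zip(ids, seqs, quals):
        records.extend([identifier, seq, "+", qual])
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    fastq = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nFFFF\n"
    packed = compress_fastq(fastq)
    assert decompress_fastq(packed) == fastq  # round trip is lossless
    print(sum(len(s) for s in packed), "compressed bytes")
```

Separating the streams lets each one be modelled on its own alphabet and statistics. On a toy input this small the compressed output is larger than the input, so meaningful ratios only appear on realistically sized files.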

  • Bioinformatics
  • 10 years ago
  • ORIGINAL PAPER

OntoMate: a text-mining tool aiding curation at the Rat Genome Database

The Rat Genome Database (RGD) is the premier repository of rat genomic, genetic and physiologic data. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. RGD spends considerable effort manually curating gene, Quantitative Trait Locus (QTL) and strain information. The rapidly growing volume of biomedical literature and the active research in the biological natural language processing (bioNLP) community have given RGD the impetus to adopt text-mining tools to improve curation efficiency. Recently, RGD has initiated a project to use OntoMate, an ontology-driven, concept-based literature search engine developed at RGD, as a replacement for the PubMed (http://www.ncbi.nlm.nih.gov/pubmed) search engine in the gene curation workflow. OntoMate tags abstracts with gene names, gene mutations, organism name and most of the 16 ontologies/vocabularies used at RGD. All terms/entities tagged to an abstract are listed with the abstract in the search results. All listed terms are linked both to data entry boxes and a term browser in the curation tool. OntoMate also provides user-activated filters for species, date and other parameters relevant to the literature search. Using the system for literature search and import has streamlined the process compared to using PubMed. The system was built with a scalable and open architecture, including features specifically designed to accelerate the RGD gene curation process. With the use of bioNLP tools, RGD has added more automation to its curation workflow.

Database URL: http://rgd.mcw.edu

  • Database
  • 9 years ago
  • Database Tool

An enteric virus can replace the beneficial function of commensal bacteria

Intestinal microbial communities have profound effects on host physiology. Whereas the symbiotic contribution of commensal bacteria is well established, the role of eukaryotic viruses that are present in the gastrointestinal tract under homeostatic conditions is undefined. Here we demonstrate that a common enteric RNA virus can replace the beneficial function of commensal bacteria in the intestine. Murine norovirus (MNV) infection of germ-free or antibiotic-treated mice restored intestinal morphology and lymphocyte function without inducing overt inflammation and disease. The presence of MNV also suppressed an expansion of group 2 innate lymphoid cells observed in the absence of bacteria, and induced transcriptional changes in the intestine associated with immune development and type I interferon (IFN) signalling. Consistent with this observation, the IFN-α receptor was essential for the ability of MNV to compensate for bacterial depletion. Importantly, MNV infection offset the deleterious effect of treatment with antibiotics in models of intestinal injury and pathogenic bacterial infection. These data indicate that eukaryotic viruses have the capacity to support intestinal homeostasis and shape mucosal immunity, similarly to commensal bacteria.

  • Nature
  • 9 years ago
  • Letter

Inhibition of cell expansion by rapid ABP1-mediated auxin effect on microtubules

The prominent and evolutionarily ancient role of the plant hormone auxin is the regulation of cell expansion. Cell expansion requires ordered arrangement of the cytoskeleton, but the molecular mechanisms underlying its regulation by signalling molecules, including auxin, are unknown. Here we show in the model plant Arabidopsis thaliana that, in elongating cells, exogenous application of auxin or redistribution of endogenous auxin induces very rapid microtubule re-orientation from transverse to longitudinal, consistent with the inhibition of cell expansion. This fast auxin effect requires auxin binding protein 1 (ABP1) and involves a contribution of downstream signalling components such as ROP6 GTPase, ROP-interactive protein RIC1 and the microtubule-severing protein katanin. These components are required for rapid auxin- and ABP1-mediated re-orientation of microtubules to regulate cell elongation in roots and dark-grown hypocotyls as well as asymmetric growth during gravitropic responses.

  • Nature
  • 9 years ago
  • Letter