TeloPIN: a database of telomeric proteins interaction network in mammalian cells

The interaction network surrounding telomeres has been intensively studied during the past two decades. However, no dedicated resource that integrates telomere interaction data is currently available. To facilitate the understanding of the molecular interaction network by which telomeres are associated with biological processes and diseases, we have developed the TeloPIN (Telomeric Proteins Interaction Network) database (http://songyanglab.sysu.edu.cn/telopin/), a novel database that aims to provide comprehensive information on protein–protein, protein–DNA and protein–RNA interactions of telomeres. The TeloPIN database contains four types of interaction data: (i) protein–protein interaction (PPI) data, (ii) telomeric protein ChIP-seq data, (iii) telomere-associated protein data and (iv) telomeric repeat-containing RNA (TERRA)-interacting protein data. By analyzing these four types of interaction data, we found that 358 and 199 proteins have more than one type of interaction information in human and mouse cells, respectively. We also developed a table browser and the TeloChIP genome browser to give researchers a better integrated visualization of interaction data from different studies. The current release of the TeloPIN database includes 1111 PPIs, eight telomeric protein ChIP-seq data sets, 1391 telomere-associated proteins and 183 TERRA-interacting proteins from 92 independent studies in mammalian cells. The interaction information provided by the TeloPIN database will greatly expand our knowledge of the telomeric protein interaction network.

Database URL: http://songyanglab.sysu.edu.cn/telopin. The TeloPIN database is freely available for non-commercial use.
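
The count of proteins with more than one type of interaction information can be pictured as a simple set operation. The sketch below is only an illustration under stated assumptions: the function name, the dictionary layout and the protein identifiers (P1, P2, ...) are hypothetical placeholders, not TeloPIN data or its actual implementation.

```python
from collections import Counter

def multi_evidence(proteins_by_type):
    """Return the proteins that appear in more than one interaction data type.

    proteins_by_type maps a data type label (hypothetical labels here) to an
    iterable of protein identifiers reported for that type.
    """
    counts = Counter(
        protein
        for proteins in proteins_by_type.values()
        for protein in set(proteins)  # count each protein once per data type
    )
    return {protein for protein, n in counts.items() if n > 1}

# Toy input with placeholder identifiers (not real database content).
toy = {
    "ppi": ["P1", "P2", "P3"],
    "chip_seq": ["P2"],
    "telomere_associated": ["P3", "P4"],
    "terra_interacting": ["P5"],
}
shared = multi_evidence(toy)
```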

  • Database
  • 9 years ago
  • Database Tool

SOMP: web server for in silico prediction of sites of metabolism for drug-like compounds

Summary: A new freely available web server, the site of metabolism predictor, has been developed to predict sites of metabolism (SOMs) from the structural formula of chemicals. It is based on the analysis of ‘structure-SOM’ relationships using a Bayesian approach and labelled multilevel neighbourhoods of atoms descriptors to represent the structures of over 1000 metabolized xenobiotics. The server predicts SOMs for reactions catalysed by the 1A2, 2C9, 2C19, 2D6 and 3A4 isoforms of cytochrome P450 and by enzymes of the UDP-glucuronosyltransferase family. The average invariant accuracy of prediction calculated for the training sets (using leave-one-out cross-validation) and the evaluation sets is 0.9 and 0.95, respectively.

Availability and implementation: Freely available on the web at http://www.way2drug.com/SOMP.

Contact: rudik_anastassia@mail.ru

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • APPLICATIONS NOTE

Reference-based compression of short-read sequences using path encoding

Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.
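
The connection between reads and paths in a de Bruijn graph can be illustrated with a toy sketch: a read is stored as its first k-mer plus, at each step, the index of the outgoing edge it follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names and the plain-list output are hypothetical, the graph is built from a single reference string, and the context-dependent arithmetic coding stage is omitted entirely.

```python
from collections import defaultdict

def build_graph(sequences, k):
    """Map each k-mer to the sorted list of bases that follow it in the input."""
    successors = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            successors[seq[i:i + k]].add(seq[i + k])
    return {kmer: sorted(bases) for kmer, bases in successors.items()}

def encode_read(read, graph, k):
    """Return (first k-mer, branch indices). The read must be a path in the graph."""
    choices = []
    for i in range(len(read) - k):
        kmer, next_base = read[i:i + k], read[i + k]
        choices.append(graph[kmer].index(next_base))
    return read[:k], choices

def decode_read(start, choices, graph):
    """Rebuild the read by following the recorded branch choices."""
    k = len(start)
    read = start
    for choice in choices:
        read += graph[read[-k:]][choice]
    return read

# Toy example (hypothetical sequences): k = 3.
graph = build_graph(["ACGTACGGACGT"], 3)
start, choices = encode_read("GTACGG", graph, 3)
restored = decode_read(start, choices, graph)
```

In a real encoder the branch indices would be fed to an entropy coder rather than kept as a list; most nodes have a single outgoing edge, which is what makes the choices cheap to store.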

Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

Contact: carlk@cs.cmu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER

Methods for the detection and assembly of novel sequence in high-throughput sequencing data

Motivation: Large insertions of novel sequence are an important type of structural variants. Previous studies used traditional de novo assemblers for assembling non-mapping high-throughput sequencing (HTS) or capillary reads and then tried to anchor them in the reference using paired read information.

Results: We present approaches for detecting insertion breakpoints and targeted assembly of large insertions from HTS paired data: BASIL and ANISE. On near identity repeats that are hard for assemblers, ANISE employs a repeat resolution step. This results in far better reconstructions than obtained by the compared methods. On simulated data, we found our insert assembler to be competitive with the de novo assemblers ABYSS and SGA while yielding already anchored inserted sequence as opposed to unanchored contigs as from ABYSS/SGA. On real-world data, we detected novel sequence in a human individual and thoroughly validated the assembled sequence. ANISE was found to be superior to the competing tool MindTheGap on both simulated and real-world data.

Availability and implementation: ANISE and BASIL are available for download at http://www.seqan.de/projects/herbarium under a permissive open source license.

Contact: manuel.holtgrewe@fu-berlin.de or knut.reinert@fu-berlin.de

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER

Dumbbell-PCR: a method to quantify specific small RNA variants with a single nucleotide resolution at terminal sequences

Recent advances in next-generation sequencing technologies have revealed that cellular functional RNAs are not always expressed as single entities with fixed terminal sequences but as multiple isoforms bearing complex heterogeneity in both length and terminal sequences, such as isomiRs, the isoforms of microRNAs. Unraveling the biogenesis and biological significance of heterogeneous RNA expression requires distinctive analysis of each RNA variant. Here, we report the development of dumbbell PCR (Db-PCR), an efficient and convenient method to distinctively quantify a specific individual small RNA variant. In Db-PCR, 5'- and 3'-stem–loop adapters are specifically hybridized and ligated to the 5'- and 3'-ends of target RNAs, respectively, by T4 RNA ligase 2 (Rnl2). The resultant ligation products with ‘dumbbell-like’ structures are subsequently quantified by TaqMan RT-PCR. We confirmed that the high specificity of Rnl2 ligation and TaqMan RT-PCR toward target RNAs discriminated both the 5'- and 3'-terminal sequences of target RNAs with single nucleotide resolution, so that Db-PCR specifically detected target RNAs but not their corresponding terminal variants. Db-PCR had broad applicability for the quantification of various small RNAs in different cell types, and the results were consistent with those from other quantification methods. Therefore, Db-PCR provides a much-needed simple method for analyzing RNA terminal heterogeneity.

  • Nucleic Acids Research
  • 9 years ago
  • Methods Online

RNABP COGEST: a resource for investigating functional RNAs

Structural bioinformatics of RNA has evolved mainly in response to the rapidly accumulating evidence that non-(protein)-coding RNAs (ncRNAs) play critical roles in gene regulation and development. The structures and functions of most ncRNAs are however still unknown. Most of the available RNA structural databases rely heavily on known 3D structures, and contextually correlate base pairing geometry with actual 3D RNA structures. None of the databases provide any direct information about stabilization energies. However, the intrinsic interaction energies of constituent base pairs can provide significant insights into their roles in the overall dynamics of RNA motifs and structures. Quantum mechanical (QM) computations provide the only approach toward their accurate quantification and characterization. ‘RNA Base Pair Count, Geometry and Stability’ (http://bioinf.iiit.ac.in/RNABPCOGEST) brings together information, extracted from literature data, regarding occurrence frequency, experimental and quantum chemically optimized geometries, and computed interaction energies, for non-canonical base pairs observed in a non-redundant dataset of functional RNA structures. The database is designed to enable the QM community, on the one hand, to identify appropriate biologically relevant model systems and also enable the biology community to easily sift through diverse computational results to gain theoretical insights which could promote hypothesis driven biological research.

Database URL: http://bioinf.iiit.ac.in/RNABPCOGEST

  • Database
  • 9 years ago
  • Original Article

EpiDBase: a manually curated database for small molecule modulators of epigenetic landscape

We have developed EpiDBase (www.epidbase.org), an interactive database of small molecule ligands of epigenetic protein families that brings together experimental, structural and chemoinformatic data in one place. Currently, EpiDBase encompasses 5784 unique ligands (11 422 entries) of various epigenetic markers such as writers, erasers and readers. EpiDBase includes experimental IC50 values, ligand molecular weight, hydrogen bond donor and acceptor count, XlogP, number of rotatable bonds, number of aromatic rings, InChIKey, and two-dimensional and three-dimensional (3D) chemical structures. A catalog of all EpiDBase ligands based on molecular weight is also provided. A structure editor is provided for 3D visualization of ligands. EpiDBase is integrated with tools such as text search, disease-specific search, advanced search, substructure and similarity analysis. Advanced analysis can be performed using substructure and OpenBabel-based chemical similarity fingerprints. EpiDBase is curated to identify unique molecular scaffolds. Initially, molecules were selected by removing peptides, macrocycles and other complex structures and then processed for conformational sampling by generating 3D conformers. Subsequent filtering through Zinc Is Not Commercial (ZINC: a free database of commercially available compounds for virtual screening) and Lilly MedChem regular rules retained many distinctive drug-like molecules. These molecules were then analyzed for physicochemical properties using OpenBabel descriptors and clustered using various methods such as hierarchical clustering, binning partition and multidimensional scaling. EpiDBase provides comprehensive resources for further design, development and refinement of small molecule modulators of epigenetic markers.

Database URL: www.epidbase.org

  • Database
  • 9 years ago
  • Original Article

LMPID: A manually curated database of linear motifs mediating protein-protein interactions

Linear motifs (LMs), used by a subset of all protein–protein interactions (PPIs), bind to globular receptors or domains and play an important role in signaling networks. LMPID (Linear Motif mediated Protein Interaction Database) is a manually curated database that provides comprehensive, experimentally validated information about the LMs mediating PPIs from all organisms on a single platform. About 2200 entries have been compiled by detailed manual curation of PubMed abstracts, of which about 1000 LM entries are annotated for the first time, as compared with the Eukaryotic LM resource. Users can submit queries through a user-friendly search page and browse the data in alphabetical order of the bait gene names or according to the domains interacting with the LM. LMPID is freely accessible at http://bicresources.jcbose.ac.in/ssaha4/lmpid and contains 1750 unique LM instances found within 1181 baits interacting with 552 prey proteins. In summary, LMPID is an attempt to enrich the existing repertoire of resources available for studying the LMs implicated in PPIs. It may help in understanding the patterns of LMs binding to a specific domain, in developing prediction models to identify novel LMs specific to a domain, and further in predicting inhibitors/modulators of a PPI of interest.

Database URL: http://bicresources.jcbose.ac.in/ssaha4/lmpid

  • Database
  • 9 years ago
  • Original Article

Large-scale exploration and analysis of drug combinations

Motivation: Drug combinations are a promising strategy for combating complex diseases by improving efficacy and reducing the corresponding side effects. Currently, a widely studied problem in pharmacology is the prediction of effective drug combinations, either through empirical screening in the clinic or through purely experimental trials. However, the large-scale prediction of drug combinations by a systems method is rarely considered.

Results: We report a systems pharmacology framework to predict drug combinations (PreDCs) based on a computational model, termed the probability ensemble approach (PEA), for analysis of both the efficacy and the adverse effects of drug combinations. First, a Bayesian network integrated with a similarity algorithm is developed to model the combinations from drug molecular and pharmacological phenotypes, and the predictions are then assessed with both clinical efficacy and adverse effects. We show that PEA can predict the combination efficacy of drugs spanning different therapeutic classes with high specificity and sensitivity (AUC = 0.90), which was further validated by independent data or new experimental assays. PEA also evaluates the adverse effects quantitatively (AUC = 0.95) and detects the therapeutic indications for drug combinations. Finally, the PreDC database includes 1571 known and 3269 predicted optimal combinations as well as their potential side effects and therapeutic indications.
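
For readers unfamiliar with the AUC figures quoted above, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way; the function name and the toy scores are hypothetical and unrelated to the PEA results.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy scores (invented for illustration only).
value = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```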

Availability and implementation: The PreDC database is available at http://sm.nwsuaf.edu.cn/lsp/predc.php.

Contact: yh_wang@nwsuaf.edu.cn

Supplementary Information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER

Genomic data assimilation using a higher moment filtering technique for restoration of gene regulatory networks

Background: As a result of recent advances in biotechnology, many findings related to intracellular systems have been published, e.g., transcription factor (TF) information. Although we can reproduce biological systems by incorporating such findings and describing their dynamics as mathematical equations, simulation results can be inconsistent with data from biological observations if there are inaccurate or unknown parts in the constructed system. To complete such systems, relationships among genes have been inferred through several computational approaches, which typically apply several abstractions, e.g., linearization, to handle the heavy computational cost of evaluating biological systems. However, since these approximations can generate false regulations, there is a strong need for computational methods that can infer regulatory relationships based on less abstract models incorporating existing knowledge.

Results: We propose a new data assimilation algorithm that utilizes a simple nonlinear regulatory model and a state space representation to infer gene regulatory networks (GRNs) using time-course observation data. For the estimation of the hidden state variables and the parameter values, we developed a novel method termed a higher moment ensemble particle filter (HMEnPF) that can retain the first four moments of the conditional distributions through filtering steps. Starting from the original model, e.g., derived from the literature, the proposed algorithm can sequentially evaluate candidate models, which are generated by partially changing the current best model, to find the model that can best predict the data. For the performance evaluation, we generated six synthetic data sets based on two real biological networks and evaluated the effectiveness of the proposed algorithm by improving the networks inferred by previous methods. We then applied the algorithm to time-course observation data of rat skeletal muscle stimulated with corticosteroid. Since a corticosteroid pharmacogenomic pathway, its kinetics/dynamics and TF candidate genes have been partially elucidated, we incorporated these findings and inferred an extended pathway of rat pharmacogenomics.

Conclusions: Through the simulation study, the proposed algorithm outperformed previous methods and successfully improved the regulatory structure inferred by the previous methods. Furthermore, the proposed algorithm could extend a corticosteroid-related pathway, which has been partially elucidated, by incorporating several information sources.

  • BMC Systems Biology 2015, null:14
  • 9 years ago

A fluorescence-based helicase assay: application to the screening of G-quadruplex ligands

Helicases, enzymes that unwind DNA or RNA structure, are present in the cell nucleus and in the mitochondrion. Although the majority of helicases unwind DNA or RNA duplexes, some of these proteins are known to resolve unusual structures such as G-quadruplexes (G4) in vitro. G4 may form a stable barrier to the progression of molecular motors tracking on DNA. Monitoring G4 unwinding by these enzymes may reveal the mechanisms of the enzymes and provide information about the stability of these structures. In the experiments presented herein, we developed a reliable, inexpensive and rapid fluorescence-based technique to monitor the activity of G4 helicases in real time in a 96-well plate format. This system was used to screen a series of G4 structures and G4 binders for their effect on the Pif1 enzyme, a 5' to 3' DNA helicase. This simple assay should be adaptable to the analysis of other helicases and G4 structures.

  • Nucleic Acids Research
  • 9 years ago
  • Methods Online

A statistical framework for revealing signaling pathways perturbed by DNA variants

Much of the inter-individual variation in gene expression is triggered via perturbations of signaling networks by DNA variants. We present a novel probabilistic approach for identifying the particular pathways by which DNA variants perturb the signaling network. Our procedure, called PINE, relies on a systematic integration of established biological knowledge of signaling networks with data on transcriptional responses to various experimental conditions. Unlike previous approaches, PINE provides statistical aspects that are critical for prioritizing hypotheses for followup experiments. Using simulated data, we show that higher accuracy is attained with PINE than with existing methods. We used PINE to analyze transcriptional responses of immune dendritic cells to several pathogenic stimulations. PINE identified statistically significant genetic perturbations in the pathogen-sensing signaling network, suggesting previously uncharacterized regulatory mechanisms for functional DNA variants.

  • Nucleic Acids Research
  • 9 years ago
  • Methods Online

Probing a label-free local bend in DNA by single molecule tethered particle motion

Being capable of characterizing DNA local bending is essential to understand thoroughly many biological processes because they involve a local bending of the double helix axis, either intrinsic to the sequence or induced by the binding of proteins. Developing a method to measure DNA bend angles that does not perturb the conformation of the DNA itself or the DNA-protein complex is a challenging task. Here, we propose a joint theory-experiment high-throughput approach to rigorously measure such bend angles using the Tethered Particle Motion (TPM) technique. By carefully modeling the TPM geometry, we propose a simple formula based on a kinked Worm-Like Chain model to extract the bend angle from TPM measurements. Using constructs made of 575 base-pair DNAs with in-phase assemblies of one to seven 6A-tracts, we find that the sequence CA6CGG induces a bend angle of 19° ± 4°. Our method is successfully compared to more theoretically complex or experimentally invasive ones such as cyclization, NMR, FRET or AFM. We further apply our procedure to TPM measurements from the literature and demonstrate that the angles of bends induced by proteins, such as Integration Host Factor (IHF) can be reliably evaluated as well.

  • Nucleic Acids Research
  • 9 years ago
  • Methods Online

Broadening the versatility of lentiviral vectors as a tool in nucleic acid research via genetic code expansion

With the aim of broadening the versatility of lentiviral vectors as a tool in nucleic acid research, we expanded the genetic code in the propagation of lentiviral vectors for site-specific incorporation of chemical moieties with unique properties. Through systematic exploration of the structure–function relationship of lentiviral VSVg envelope by site-specific mutagenesis and incorporation of residues displaying azide- and diazirine-moieties, the modifiable sites on the vector surface were identified, with most at the PH domain that neither affects the expression of envelope protein nor propagation or infectivity of the progeny virus. Furthermore, via the incorporation of such chemical moieties, a variety of fluorescence probes, ligands, PEG and other functional molecules are conjugated, orthogonally and stoichiometrically, to the lentiviral vector. Using this methodology, a facile platform is established that is useful for tracking virus movement, targeting gene delivery and detecting virus–host interactions. This study may provide a new direction for rational design of lentiviral vectors, with significant impact on both basic research and therapeutic applications.

  • Nucleic Acids Research
  • 9 years ago
  • Methods Online

Improving the consistency of domain annotation within the Conserved Domain Database

When annotating protein sequences with the footprints of evolutionarily conserved domains, conservative score or E-value thresholds need to be applied for RPS-BLAST hits, to avoid many false positives. We notice that manual inspection and classification of hits gathered at a higher threshold can add a significant amount of valuable domain annotation. We report an automated algorithm that ‘rescues’ valuable borderline-scoring domain hits that are well-supported by domain architecture (DA, the sequential order of conserved domains in a protein query), including tandem repeats of domain hits reported at a more conservative threshold. This algorithm is now available as a selectable option on the public conserved domain search (CD-Search) pages. We also report on the possibility to ‘suppress’ domain hits close to the threshold based on a lack of well-supported DA and to implement this conservatively as an option in live conserved domain searches and for pre-computed results. Improving domain annotation consistency will in turn reduce the fraction of NR sequences with incomplete DAs.

URL: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

  • Database
  • 9 years ago
  • Original Article

Identification of AMP-activated protein kinase targets by a consensus sequence search of the proteome

Background: AMP-activated protein kinase (AMPK) is a heterotrimeric serine/threonine protein kinase that is activated by cellular perturbations associated with ATP depletion or stress. While AMPK modulates the activity of a variety of targets containing a specific phosphorylation consensus sequence, the number of AMPK targets and their influence over cellular processes is currently thought to be limited.

Results: We queried the human and the mouse proteomes for proteins containing AMPK phosphorylation consensus sequences. Integration of this database into Gaggle software facilitated the construction of probable AMPK-regulated networks based on known and predicted molecular associations. In vitro kinase assays were conducted for preliminary validation of 12 novel AMPK targets across a variety of cellular functional categories, including transcription, translation, cell migration, protein transport, and energy homeostasis. Following initial validation, pathways that include NAD synthetase 1 (NADSYN1) and protein kinase B (AKT2) were hypothesized and experimentally tested to provide a mechanistic basis for AMPK regulation of cell migration and maintenance of cellular NAD+ concentrations during catabolic processes.

Conclusions: This study delineates an approach that encompasses both in silico procedures and in vitro experiments to produce testable hypotheses for AMPK regulation of cellular processes.
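
A consensus sequence search of a proteome can be sketched as a regular-expression scan over protein sequences. The example below is only an illustration under stated assumptions: the function name, the toy sequences and the pattern `R..T` are hypothetical placeholders. The pattern is not the AMPK phosphorylation consensus, which the abstract does not spell out.

```python
import re

def find_motif_sites(proteome, pattern):
    """Return (protein name, 1-based start, matched text) for every motif hit.

    proteome maps protein names to amino acid sequences; pattern is a regular
    expression over one-letter residue codes.
    """
    # A lookahead with a capture group reports overlapping matches as well.
    regex = re.compile(f"(?=({pattern}))")
    hits = []
    for name, sequence in proteome.items():
        for match in regex.finditer(sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits

# Toy proteome and placeholder pattern (both invented for illustration).
toy_proteome = {"protA": "MKLSRAATLQ", "protB": "GGGG"}
sites = find_motif_sites(toy_proteome, "R..T")
```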

  • BMC Systems Biology 2015, null:13
  • 9 years ago

Starcode: sequence clustering based on all-pairs search

Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem.

Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman–Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision.
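
To make the all-pairs task concrete, the sketch below finds every pair of sequences within a given Levenshtein distance by brute force, using the textbook dynamic programme. It is a baseline for illustration only, not starcode's poucet search: there is no trie and no pruning, and the function names and toy barcodes are hypothetical.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance computed row by row with the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def pairs_within(sequences, max_dist):
    """Return all unordered pairs whose edit distance is at most max_dist."""
    return [(a, b) for a, b in combinations(sequences, 2)
            if levenshtein(a, b) <= max_dist]

# Toy barcodes (invented for illustration).
matched = pairs_within(["ACGT", "ACGA", "TTTT"], 1)
```

The brute-force comparison is quadratic in the number of sequences, which is the cost that motivates a shared-prefix search structure in the first place.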

Availability and implementation: The C source code is available at http://github.com/gui11aume/starcode.

Contact: guillaume.filion@gmail.com

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER

PolyMarker: A fast polyploid primer design pipeline

Summary: The design of genetic markers is of particular relevance in crop breeding programs. Despite many economically important crops being polyploid organisms, the current primer design tools are tailored for diploid species. Bread wheat, for instance, is a hexaploid comprising three related genomes, and the performance of genetic markers is diminished if the primers are not genome specific. PolyMarker is a pipeline that generates SNP markers by selecting candidate primers for a specified genome using local alignments and standard primer design tools to test the viability of the primers. A command line tool and a web interface are available to the community.

Availability and implementation: PolyMarker is available as a Ruby BioGem: bio-polyploid-tools. Web interface: http://polymarker.tgac.ac.uk.

Contact: Ricardo.Ramirez-Gonzalez@tgac.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • APPLICATIONS NOTE

A multiobjective memetic algorithm for PPI network alignment

Motivation: There recently has been great interest in aligning protein–protein interaction (PPI) networks to identify potentially orthologous proteins between species. It is thought that the topological information contained in these networks will yield better orthology predictions than sequence similarity alone. Recent work has found that existing aligners have difficulty making use of both topological and sequence similarity when aligning, with either one or the other being better matched. This can be at least partially attributed to the fact that existing aligners try to combine these two potentially conflicting objectives into a single objective.

Results: We present Optnetalign, a multiobjective memetic algorithm for the problem of PPI network alignment that uses extremely efficient swap-based local search, mutation and crossover operations to create a population of alignments. This algorithm optimizes the conflicting goals of topological and sequence similarity using the concept of Pareto dominance, exploring the tradeoff between the two objectives as it runs. This allows us to produce many high-quality candidate alignments in a single run. Our algorithm produces alignments that are much better compromises between topological and biological match quality than previous work, while better characterizing the diversity of possible good alignments between two networks. Our aligner’s results have several interesting implications for future research on alignment evaluation, the design of network alignment objectives and the interpretation of alignment results.
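
The concept of Pareto dominance used above can be sketched in a few lines. Each candidate is scored on two objectives to be maximised (here labelled topological and sequence similarity), and the non-dominated set is kept. This is an illustrative sketch only: the function names and the scores are hypothetical, and it does not reproduce the Optnetalign search itself.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives maximised)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Toy (topological score, sequence score) pairs, invented for illustration.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
front = pareto_front(candidates)
```

Keeping the whole front, rather than collapsing both objectives into one weighted score, is what lets a single run return several alternative trade-offs.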

Availability and Implementation: The C++ source code to our program, along with compilation and usage instructions, is available at https://github.com/crclark/optnetaligncpp/

Contact: connor.r.clark@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER

Development of a high-resolution NGS-based HLA-typing and analysis pipeline

The human leukocyte antigen (HLA) complex contains the most polymorphic genes in the human genome. The classical HLA class I and II genes define the specificity of adaptive immune responses. Genetic variation at the HLA genes is associated with susceptibility to autoimmune and infectious diseases and plays a major role in transplantation medicine and immunology. Currently, the HLA genes are characterized using Sanger- or next-generation sequencing (NGS) of a limited amplicon repertoire or labeled oligonucleotides for allele-specific sequences. High-quality NGS-based methods are in proprietary use and not publicly available. Here, we introduce the first highly automated open-kit/open-source HLA-typing method for NGS. The method employs in-solution targeted capturing of the classical class I (HLA-A, HLA-B, HLA-C) and class II HLA genes (HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1). The calling algorithm allows for highly confident allele-calling to three-field resolution (cDNA nucleotide variants). The method was validated on 357 commercially available DNA samples with known HLA alleles obtained by classical typing. Our results showed on average an accurate allele call rate of 0.99 in a fully automated manner, identifying also errors in the reference data. Finally, our method provides the flexibility to add further enrichment target regions.

  • Nucleic Acids Research
  • 9 years ago
  • Methods Online

GenoMetric Query Language: a novel approach to large-scale genomic data management

Motivation: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction levels beyond available tool capabilities.

Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic ‘big data’ analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata, and interoperability between many data formats. Based on the Hadoop framework and the Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.

Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.

Contact: marco.masseroli@polimi.it

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER

mycoCLAP, the database for characterized lignocellulose-active proteins of fungal origin: resource and text mining curation support

Enzymes active on components of lignocellulosic biomass are used for industrial applications ranging from food processing to biofuels production. These include a diverse array of glycoside hydrolases, carbohydrate esterases, polysaccharide lyases and oxidoreductases. Fungi are prolific producers of these enzymes, spurring fungal genome sequencing efforts to identify and catalogue the genes that encode them. To facilitate the functional annotation of these genes, biochemical data on over 800 fungal lignocellulose-degrading enzymes have been collected from the literature and organized into the searchable database, mycoCLAP (http://mycoclap.fungalgenomics.ca). First implemented in 2011, and updated as described here, mycoCLAP is capable of ranking search results according to closest biochemically characterized homologues: this improves the quality of the annotation, and significantly decreases the time required to annotate novel sequences. The database is freely available to the scientific community, as are the open source applications based on natural language processing developed to support the manual curation of mycoCLAP.

Database URL: http://mycoclap.fungalgenomics.ca

  • Database
  • 9 years ago
  • Original Article

Combining computational models, semantic annotations and simulation experiments in a graph database

Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries, and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It is built on a graph database, reflects the models’ structure, incorporates semantic annotations and simulation descriptions, and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves access to computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried.
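
The idea of linking models from different formats through shared annotations can be illustrated with a tiny in-memory graph. This is only a sketch under stated assumptions, not the storage layer described in the abstract: the class `ModelGraph`, the relation names `has_entity` and `annotated_with`, and the node and term identifiers are all hypothetical.

```python
class ModelGraph:
    """Minimal property graph: nodes with properties, edges as triples."""

    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.edges = []   # (source id, relation, target id)

    def add_node(self, node_id, kind, **props):
        self.nodes[node_id] = {"kind": kind, **props}

    def link(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges.append((source, relation, target))

    def models_annotated_with(self, term_id):
        """Return (model id, format) pairs for models with an entity annotated by the term."""
        entities = {
            s for s, rel, t in self.edges
            if rel == "annotated_with" and t == term_id
        }
        models = {
            s for s, rel, t in self.edges
            if rel == "has_entity" and t in entities
        }
        return sorted((m, self.nodes[m]["format"]) for m in models)


# Hypothetical content: two models in different formats share one ontology term.
graph = ModelGraph()
graph.add_node("m1", "model", format="SBML")
graph.add_node("m2", "model", format="CellML")
graph.add_node("m3", "model", format="SBML")
for entity in ("e1", "e2", "e3"):
    graph.add_node(entity, "entity")
graph.add_node("term:glucose", "ontology_term")
graph.add_node("term:atp", "ontology_term")

graph.link("m1", "has_entity", "e1")
graph.link("m2", "has_entity", "e2")
graph.link("m3", "has_entity", "e3")
graph.link("e1", "annotated_with", "term:glucose")
graph.link("e2", "annotated_with", "term:glucose")
graph.link("e3", "annotated_with", "term:atp")

glucose_models = graph.models_annotated_with("term:glucose")
```

Because the ontology term is a node of its own, a query that starts from a biological fact reaches models of either format through the same two hops.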

Database URL: https://sems.uni-rostock.de/projects/masymos/

  • Database
  • 9 years ago
  • Original Article

mFASD: a structure-based algorithm for discriminating different types of metal-binding sites

Motivation: A large number of proteins contain metal ions that are essential for their stability and biological activity. Identifying and characterizing metal-binding sites through computational methods is necessary when experimental clues are lacking. Almost all published computational methods are designed to distinguish metal-binding sites from non-metal-binding sites. However, discrimination between different types of metal-binding sites is also needed to make more accurate predictions.

Results: In this work, we propose a novel algorithm called mFASD, which effectively discriminates between different types of metal-binding sites based on 3D structure data and is useful for accurate metal-binding site prediction. mFASD captures the characteristics of a metal-binding site by investigating the local chemical environment of a set of functional atoms that are considered to be in contact with the bound metal. A distance measure defined on functional atom sets then enables comparison between different metal-binding sites. The algorithm can discriminate most types of metal-binding sites from each other with high sensitivity and accuracy. We show that cascading our method with existing ones can substantially improve the accuracy of metal-binding site prediction.
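
To make the notion of a distance defined on functional atom sets concrete, here is a deliberately simple stand-in. The abstract does not specify the mFASD distance, so this sketch swaps in a multiset Jaccard distance over atom labels and a nearest-neighbour assignment. The label format (`RESIDUE.ATOM`), the function names and the example sites are hypothetical, and the sketch ignores the 3D geometry that the real method relies on.

```python
from collections import Counter


def site_distance(site_a, site_b):
    """Multiset Jaccard distance between two collections of functional atom labels."""
    a, b = Counter(site_a), Counter(site_b)
    union = sum((a | b).values())
    if union == 0:
        return 0.0  # two empty sites are treated as identical
    shared = sum((a & b).values())
    return 1.0 - shared / union


def nearest_metal(query, labelled_sites):
    """Return the metal label of the known site closest to the query site."""
    return min(labelled_sites, key=lambda item: site_distance(query, item[1]))[0]


# Hypothetical reference sites: (metal, functional atoms in contact with it).
known_sites = [
    ("Zn", ["CYS.SG", "CYS.SG", "HIS.NE2", "HIS.NE2"]),
    ("Ca", ["ASP.OD1", "ASP.OD2", "GLU.OE1", "HOH.O"]),
]
query_site = ["CYS.SG", "CYS.SG", "CYS.SG", "HIS.NE2"]
predicted = nearest_metal(query_site, known_sites)
```

Any distance that is small for chemically similar environments could be substituted for `site_distance` without changing the nearest-neighbour step.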

Availability and implementation: Source code and data used are freely available from http://staff.ustc.edu.cn/~liangzhi/mfasd/

Contact: liangzhi@ustc.edu.cn or hwkobe@mail.ustc.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

  • Bioinformatics
  • 9 years ago
  • ORIGINAL PAPER