The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated, and allergic disease research.
Explore NIAID data science publications on PubMed:
- NIAID-funded publications that use data science or computational biology (since 2023).
- Publications funded or co-funded by ODSET.
- Publications from the Harnessing Big Data to Halt HIV initiative.
If you would like to feature a publication on this page, please contact datascience@niaid.nih.gov. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:
- The title of your published article.
- A link to the article.
- A 50-60 word description of the article.
Harnessing Big Heterogeneous Data to Evaluate the Potential Impact of HIV Responses Among Key Populations in Sub-Saharan Africa: Protocol for the Boloka Data Repository Initiative
January 22, 2025 JMIR Publications
This study aimed to harness big heterogeneous data on HIV in South Africa, harmonizing and analyzing it to inform a targeted HIV response for greater impact in Sub-Saharan Africa.
Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts
January 15, 2025 Nature Communications
Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a significant challenge in nanopore sequencing bioinformatics. Here, authors report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications.
Fragmenstein: predicting protein–ligand structures of compounds derived from known crystallographic fragment hits using a strict conserved-binding–based methodology
January 13, 2025 Journal of Cheminformatics
Current strategies centered on either merging or linking initial hits from fragment-based drug design (FBDD) crystallographic screens generally do not fully leverage 3D structural information. Here, the authors show that an algorithmic approach (Fragmenstein) that ‘stitches’ the ligand atoms from this structural information together can provide more accurate and reliable predictions for protein–ligand complex conformation than general methods such as pharmacophore-constrained docking. This method is likely to further benefit future drug design campaigns and be integrated in future pipelines.
Prediction of cccDNA dynamics in hepatitis B patients by a combination of serum surrogate markers
January 9, 2025 PLoS Computational Biology
Authors applied a recently developed multiscale mathematical model describing intra- and intercellular viral propagation to HBV-infected patients under treatment. They developed a model that can predict intracellular HBV dynamics from extracellular viral markers in peripheral blood, including HBsAg, HBV DNA, and HBcrAg. Importantly, the predicted amount of cccDNA in patients over time correlated well with cccDNA quantified by paired liver biopsy.
Learning predictive signatures of HLA type from T-cell repertoires
January 6, 2025 PLoS Computational Biology
The deep sequencing of immune repertoires from blood samples promises to offer diagnostic and precision medicine tools, and to help with the analysis of treatments and vaccinations and the design of immunotherapies. This study shows how, by training models on large datasets of annotated and unannotated repertoire sequencing, the human leukocyte antigen (HLA) type of patients can be computationally determined from their T-cell repertoire. This tool could be useful for typing repertoires for which the HLA is unknown, and for gaining insight into the HLA restriction of T-cell receptor epitope specificity. It also provides lists of T-cell receptors that are associated with each HLA allele.
Characterizing cell-type spatial relationships across length scales in spatially resolved omics data
January 3, 2025 Nature Communications
Authors introduce CRAWDAD, an R package for quantifying cell-type spatial relationships across length scales in tissues using spatial omics data, enabling the identification of consistent as well as sample-specific cell-type spatial relationships across multiple spatial omics datasets.
Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages
January 2, 2025 Nature Communications
The authors propose a Genetic Progression Score (GPS) model to predict autoimmune disease progression from preclinical stages, integrating genome-wide association study summary statistics and electronic health records from the All of Us and BioVU biobanks.
The Impact of Ancestry on Genome-Wide Association Studies
January 1, 2025 Pacific Symposium on Biocomputing
Decisions regarding how the genetic data are combined in genome-wide association studies (GWAS) can mask important findings that might serve individuals of ancestries that have been historically underrepresented in genetic studies. New methods that consider ancestry-specific variants in conjunction with the shared variants need to be developed.
A Prospective Comparison of Large Language Models for Early Prediction of Sepsis
January 1, 2025 Pacific Symposium on Biocomputing
Authors present a comparative study on the performance of two popular open-source large language models for early prediction of sepsis: Llama-3 8B and Mixtral 8x7B. The primary goal was to determine whether a smaller model could achieve comparable predictive accuracy to a significantly larger model in the context of sepsis prediction using clinical data.
A multivalent binding model infers antibody Fc species from systems serology
December 23, 2024 PLoS Computational Biology
Authors developed a computational model that interprets data from systems serology, allowing them to infer detailed, engagement-relevant structural information about antibodies that is difficult to measure directly. The model was applied to existing data from COVID-19 and HIV studies, revealing new insights into how antibody structure relates to vaccine efficacy and disease progression. The model also helped to identify which measurements in systems serology are most informative, potentially streamlining future studies.
Predicting the infecting dengue serotype from antibody titre data using machine learning
December 23, 2024 PLoS Computational Biology
Authors analyzed antibody data from Thai children, collected both before and after dengue infection. By applying machine learning models to this data, they predicted the infecting serotype with an average accuracy of 71% to 80%. This approach improves our understanding of how the immune system responds to different dengue serotypes and has the potential to inform the evaluation of vaccine efficacy in future trials.
Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE
December 13, 2024 PLoS One
Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing single-cell RNA sequencing (scRNA-seq) data, and it can be used in many downstream machine learning tasks. In this work, the authors use CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). Using 21 publicly available datasets, the authors found that CCP significantly improves UMAP and tSNE visualization and dramatically improves their accuracy.
A near-tight lower bound on the density of forward sampling schemes
December 12, 2024 Oxford Bioinformatics
Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. Authors prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes.
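As an illustration of the terms used in this summary, the following is a minimal sketch of a random minimizer scheme and its density. It is not taken from the paper: the function names `random_minimizer_positions` and `density` are hypothetical, the random k-mer order is drawn from a seeded generator, and ties are broken toward the leftmost position as an assumption.

```python
import random


def random_minimizer_positions(seq, k, w, seed=0):
    """Return sorted start positions of k-mers sampled by a random minimizer scheme.

    Hypothetical helper for illustration: each distinct k-mer gets a random rank,
    and every window of w consecutive k-mers selects its lowest-ranked k-mer
    (leftmost on ties).
    """
    rng = random.Random(seed)
    rank_of = {}  # random rank assigned once per distinct k-mer

    n_kmers = len(seq) - k + 1
    ranks = []
    for i in range(n_kmers):
        kmer = seq[i:i + k]
        if kmer not in rank_of:
            rank_of[kmer] = rng.random()
        ranks.append(rank_of[kmer])

    sampled = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Pick the position with the smallest (rank, position) pair in this window.
        sampled.add(min(window, key=lambda i: (ranks[i], i)))
    return sorted(sampled)


def density(seq, k, w, seed=0):
    """Fraction of all k-mers in seq that the scheme samples."""
    n_kmers = len(seq) - k + 1
    return len(random_minimizer_positions(seq, k, w, seed)) / n_kmers


if __name__ == "__main__":
    gen = random.Random(42)
    demo = "".join(gen.choice("ACGT") for _ in range(1000))
    print(density(demo, k=7, w=10))
```

Because each window contributes one selected position, every run of w consecutive k-mers contains at least one sampled k-mer, which is the guarantee described above; a lower density means fewer k-mers are passed to downstream methods.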
Borrelia PeptideAtlas: A proteome resource of common Borrelia burgdorferi isolates for Lyme research
December 2, 2024 Scientific Data
Lyme disease is caused by infection with the spirochete Borrelia burgdorferi and is the most common vector-borne disease in North America. Authors applied both transcriptome analysis and mass spectrometry-based proteomics to assemble peptide datasets of B. burgdorferi laboratory isolates B31, MM1, and the infective isolate B31-5A4, to provide a publicly available Borrelia PeptideAtlas. The Borrelia PeptideAtlas is an extensible, comprehensive peptide repository with proteomic information from B. burgdorferi isolates useful for Lyme disease research.
Machine learning-enhanced immunopeptidomics applied to T-cell epitope discovery for COVID-19 vaccines
November 28, 2024 Nature Communications
The identification of T cell epitopes is a critical step in understanding the immune response to infection and in designing vaccine-based approaches. Here the authors introduce a framework of antigen discovery called MHCvalidator and Epitrack to identify new antigenic features for T-cell COVID-19 vaccines and characterize a novel non-canonical epitope from a truncated Spike variant and mutation of an immunodominant epitope in the BNT162b4 vaccine.
Accurate RNA 3D structure prediction using a language model-based deep learning approach
November 21, 2024 Nature Methods
Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. RhoFold+ is an end-to-end language model-based deep learning method to predict three-dimensional structures of single-chain RNAs from sequences.
Rothman diagrams: the geometry of confounding and standardization
November 14, 2024 International Journal of Epidemiology
Rothman diagrams, where the risk of disease in the unexposed is plotted on the x-axis and the risk in the exposed is plotted on the y-axis, provide a geometric perspective on the control of confounding using standardization. The authors outline a geometric perspective on causal inference in cohort studies that can help epidemiologists understand the role of standardization in controlling for confounding.
Assessing acute kidney injury risk after COVID vaccination and infection in a large cohort study
November 8, 2024 NPJ Vaccines
Acute kidney injury (AKI) has been observed after both COVID-19 vaccination and infection. The authors used data from the National COVID Cohort Collaborative (N3C) dataset to compare AKI incidence following COVID-19 vaccination and infection, revealing that COVID-19 vaccination is associated with a significantly lower AKI risk compared to COVID-19 infection.
The Type 1 Diabetes T Cell Receptor and B Cell Receptor Repository in the AIRR Data Commons: a practical guide for access, use and contributions through the Type 1 Diabetes AIRR Consortium
October 29, 2024 Diabetologia
The authors introduce the Type 1 Diabetes Adaptive Immune Receptor Repertoire (AIRR) Consortium goals and outline methods to use and deposit data to this comprehensive repository. The repository’s goal is to facilitate research community access to rich, carefully annotated immune AIRR datasets to enable new scientific inquiry and insight into the natural history and pathogenesis of type 1 diabetes.
Interpretable GWAS by linking clinical phenotypes to quantifiable immune repertoire components
October 20, 2024 Communications Biology
Analyzing the impact of genetic variants on T cell receptor repertoire components reveals the mechanisms behind susceptibility variants in autoimmune diseases and cancers.
Explainable artificial intelligence and domain adaptation for predicting HIV infection with graph neural networks
October 17, 2024 Annals of Medicine
Researchers investigated graph neural network deep learning methods for predicting HIV infections from social network information. Using network data from two cohorts of men in different cities, they applied GNNExplainer to determine feature importance in graph attention network (GAT) models. Their findings reinforced the potential application of GAT models in predicting HIV infections.
SepsisLab: Early Sepsis Prediction with Uncertainty Quantification and Active Sensing
October 12, 2024 Knowledge Discovery and Data Mining
Researchers developed a robust active sensing algorithm that recommends that clinicians observe the most informative variables for patients at high risk of sepsis. The researchers validated the proposed models on publicly available and proprietary data. The experimental results show that the proposed algorithm outperforms state-of-the-art active sensing methods. The team implemented a SepsisLab system for early sepsis prediction and active sensing based on pre-trained models.
AI and immunology as a new research paradigm
October 4, 2024 Nature Immunology
NIAID’s Division of Allergy, Immunology, and Transplantation (DAIT) hosted a symposium on ‘AI and Immunology: Exploring Challenges and Opportunities’. The event considered how artificial intelligence (AI) can advance research on the multi-scale, adaptive immune system to improve human health, covering basic research and clinical use cases, data and AI model fundamentals.
Pairwise Accelerated Failure Time Regression Models for Infectious Disease Transmission in Close‐Contact Groups With External Sources of Infection
October 3, 2024 Statistics in Medicine
The research team introduces a pairwise accelerated failure time regression model for infectious disease transmission, and shows that this model produces consistent and asymptotically normal parameter estimates. In a simulation study, the researchers evaluate bias and confidence interval coverage probabilities, explore the role of epidemiologic study design, and investigate the effects of model misspecification.
Machine learning approaches identify immunologic signatures of total and intact HIV DNA during long-term antiretroviral therapy
September 9, 2024 eLife
Understanding the interplay between the HIV reservoir and the host immune system may yield insights into HIV persistence during antiretroviral therapy (ART) and inform strategies for a cure. The research team applied machine learning (ML) approaches to cross-sectional high-parameter HIV reservoir and immunology data in order to characterize host–reservoir associations and generate new hypotheses about HIV reservoir biology.