Data Science Publications

The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated and allergic disease research.

Explore NIAID data science publications on PubMed:

If you would like to feature a publication on this page, please contact datascience@niaid.nih.gov. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:

  • The title of your published article.
  • A link to the article.
  • A 50-60 word description of the article. 

55 Results

TRain: T-cell receptor automated immunoinformatics

March 6, 2025
BMC Bioinformatics

Authors introduced an open-source Python tool that streamlines going from full T-cell receptor (TCR) sequence information to predicted 3D structures of TCRs bound to peptide-major histocompatibility complexes (pMHCs), using well-established tools. Analyzing these predicted complexes can provide deeper insights into the binding properties of TCRs and can help shed light on one of the key steps in adaptive immune responses.

Revisiting the Plasmodium falciparum druggable genome using predicted structures and data mining

March 4, 2025
npj Drug Discovery

Leveraging recent advances in protein structure prediction, authors systematically assessed the Plasmodium falciparum genome, with review eventually yielding 27 high-priority antimalarial target candidates. This study also provides a genome-wide data resource for P. falciparum and implements a generalizable framework for systematically evaluating and prioritizing novel pathogenic disease targets.

Viral genomic features predict Orthopoxvirus reservoir hosts

February 26, 2025
Communications Biology

Authors applied machine learning models incorporating both host ecological and viral genomic features to predict likely reservoirs of orthopoxviruses (OPVs). They demonstrated that adding viral genomic features to host ecological traits enhanced the accuracy of potential OPV host predictions, highlighting the importance of host-virus molecular interactions in predicting potential host species. They also identified geographic hotspots rich in potential OPV hosts.

VaxBot-HPV: a GPT-based chatbot for answering HPV vaccine-related questions

February 19, 2025
JAMIA Open

The human papillomavirus (HPV) vaccine is an effective measure to prevent and control diseases caused by HPV. However, widespread misinformation and vaccine hesitancy remain significant barriers to its uptake. This study focuses on the development of VaxBot-HPV, a chatbot aimed at improving health literacy and promoting vaccination uptake by providing information and answering questions about the HPV vaccine.

A comparative study of antibiotic resistance patterns in Mycobacterium tuberculosis

February 11, 2025
Scientific Reports

This study leverages the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) to analyze over 27,000 Mycobacterium tuberculosis (MTB) genomes, providing a comprehensive, large-scale overview of antimicrobial resistance (AMR) prevalence and resistance patterns. Authors used MTB++, the newest and most comprehensive AI-based MTB drug resistance profiler, to predict the resistance profile of each of the 27,000 MTB isolates and then used feature analysis to identify key genes associated with resistance.

Conditional similarity triplets enable covariate-informed representations of single-cell data

February 9, 2025
BMC Bioinformatics

Authors introduce a novel approach that incorporates measured covariates when optimizing model parameters, ultimately specifying per-sample encodings that accurately reflect both immune signatures and additional clinical information.

Challenging a paradigm: Staggered versus single-pulse mass dog vaccination strategy for rabies elimination

February 7, 2025
PLOS Computational Biology

Authors constructed a stochastic metapopulation model to examine how the timing of pulsed vaccination campaigns across patches can affect metapopulation dynamics. They explored general metapopulation dynamics for pulsed vaccinations, parameterized the model for canine rabies in Arequipa, Peru, and simulated how the timing of the planned vaccination campaign, staggered over 6 months versus a single yearly pulse, affected the prospects for regional rabies elimination.

zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters

February 5, 2025
Nucleic Acids Research

Many universally and conditionally important genes are genomically aggregated within clusters. Here, the authors introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile genetic elements, such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck in reliably performing comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes.

Multi-strain phage induced clearance of bacterial infections

February 4, 2025
PLOS Computational Biology

Authors combine theory and computational models of in vivo phage therapy to study the efficacy of a phage cocktail, composed of two complementary phages motivated by the example of Pseudomonas aeruginosa facing two phages that exploit different surface receptors, LUZ19v and PAK_P1.

Leveraging public AI tools to explore systems biology resources in mathematical modeling

February 4, 2025
npj Systems Biology and Applications

Authors investigated the usage of public Artificial Intelligence (AI) tools in exploring systems biology resources in mathematical modeling. They tested public AI’s understanding of mathematics in models, related systems biology data, and the complexity of model structures.

RSero: A user-friendly R package to reconstruct pathogen circulation history from seroprevalence studies

February 3, 2025
PLOS Computational Biology

The authors introduce an R package, Rsero, that implements a series of serocatalytic models and estimates the force of infection (FOI, i.e., the rate at which susceptible individuals become infected) from age-stratified seroprevalence data using Bayesian methods. The package also contains a series of features to perform model comparison and visualise model fit. 
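As a rough illustration of the serocatalytic idea summarized above, the sketch below fits the simplest such model, in which a constant FOI λ gives an expected seroprevalence of 1 − exp(−λ·age) at each age. It is not the Rsero implementation: it is written in Python rather than R, it uses a grid-search maximum-likelihood estimate in place of the Bayesian methods the package uses, and the function names and the age-stratified counts are hypothetical.

```python
import math


def seroprevalence(age, foi):
    """Expected proportion seropositive at a given age under a constant FOI."""
    return 1.0 - math.exp(-foi * age)


def log_likelihood(foi, ages, n_tested, n_positive):
    """Binomial log-likelihood (up to an additive constant) of the counts."""
    total = 0.0
    for age, n, k in zip(ages, n_tested, n_positive):
        p = seroprevalence(age, foi)
        # Keep p strictly inside (0, 1) so both logarithms are defined.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total


def estimate_foi(ages, n_tested, n_positive, grid_max=1.0, steps=10000):
    """Grid-search maximum-likelihood estimate of a constant FOI.

    This stands in for the Bayesian estimation described in the summary.
    """
    candidates = [grid_max * i / steps for i in range(1, steps + 1)]
    return max(
        candidates,
        key=lambda f: log_likelihood(f, ages, n_tested, n_positive),
    )


# Hypothetical age-stratified survey, generated from a true FOI of 0.05 per year.
ages = [5, 15, 25, 35, 45]
n_tested = [100, 100, 100, 100, 100]
n_positive = [round(n * seroprevalence(a, 0.05)) for a, n in zip(ages, n_tested)]

foi_hat = estimate_foi(ages, n_tested, n_positive)
print(f"Estimated FOI: {foi_hat:.4f} per year")
```

Because the example counts are generated from a known FOI, the estimate should land close to 0.05 per year; real seroprevalence data would call for the time-varying models and uncertainty estimates that the package description mentions.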

A multi-omics spatial framework for host-microbiome dissection within the intestinal tissue microenvironment

January 31, 2025
Nature Communications

Authors introduce Microbiome Cartography (MicroCart), a framework for simultaneous in situ probing of host and microbiome across multiple spatial modalities. They demonstrate MicroCart by investigating gut host and microbiome changes in a murine colitis model, using spatial proteomics, transcriptomics, and glycomics. 

Longitudinal Viral Load Clustering for People With HIV Using Functional Principal Component Analysis

January 29, 2025
AIDS Research and Treatment

Longitudinal measures of viral load (VL) are critical in monitoring HIV status. While multiple lab indicators exist for monitoring measures of VL, research on clustering historical/longitudinal VL measures is limited. This study uses functional data clustering to classify longitudinal VL patterns and characterize each cluster by demographics, comorbidities, social behaviors, and CD4 count.

Harnessing Big Heterogeneous Data to Evaluate the Potential Impact of HIV Responses Among Key Populations in Sub-Saharan Africa: Protocol for the Boloka Data Repository Initiative

January 22, 2025
JMIR Publications

This study aimed to harness big heterogeneous data on HIV in South Africa, harmonizing and analyzing it to inform a targeted HIV response for greater impact in Sub-Saharan Africa.

A practical guide to FAIR data management in the age of multi-OMICS and AI

January 20, 2025
Frontiers in Immunology

With the rapid development of, and increasing interest in, artificial intelligence (AI) for the life sciences, improving how biological data are organized and shared is more pressing than ever for scientific progress. Here, the authors outline a practical approach to multi-modal data management and FAIR sharing that extends the longevity and utility of data by making use and reuse easier, accelerating scientific discovery in the biomedical sciences.

Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts

January 15, 2025
Nature Communications

Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a significant challenge in nanopore sequencing bioinformatics. Here, authors report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications.

Fragmenstein: predicting protein–ligand structures of compounds derived from known crystallographic fragment hits using a strict conserved-binding–based methodology

January 13, 2025
Journal of Cheminformatics

Current strategies centered on either merging or linking initial hits from fragment-based drug design (FBDD) crystallographic screens generally do not fully leverage 3D structural information. Here, the authors show that an algorithmic approach (Fragmenstein) that ‘stitches’ the ligand atoms from this structural information together can provide more accurate and reliable predictions for protein–ligand complex conformation than general methods such as pharmacophore-constrained docking. This method is likely to further benefit future drug design campaigns and be integrated in future pipelines.

nipalsMCIA: flexible multi-block dimensionality reduction in R via nonlinear iterative partial least squares

January 12, 2025
Bioinformatics

With the increased reliance on multi-omics data for bulk and single-cell analyses, the availability of robust approaches to perform unsupervised learning for clustering, visualization, and feature selection is imperative. Authors introduce nipalsMCIA, an implementation of multiple co-inertia analysis (MCIA) for joint dimensionality reduction that solves the objective function using an extension to Nonlinear Iterative Partial Least Squares.

Prediction of cccDNA dynamics in hepatitis B patients by a combination of serum surrogate markers

January 9, 2025
PLOS Computational Biology

Authors employed a recently developed multiscale mathematical model describing intra- and intercellular viral propagation and applied it in HBV-infected patients under treatment. They developed a model that can predict intracellular HBV dynamics by use of extracellular viral markers, including HBsAg, HBV DNA, and HBcrAg in peripheral blood. Importantly, the model prediction of the amount of cccDNA in patients over time was confirmed to be well correlated with the data for quantified cccDNA by paired liver biopsy.

Learning predictive signatures of HLA type from T-cell repertoires

January 6, 2025
PLOS Computational Biology

The deep sequencing of immune repertoires from blood samples promises to offer diagnostic and precision medicine tools, and to help with the analysis of treatments and vaccinations and the design of immunotherapies. This study shows how, by training models on large datasets of annotated and unannotated repertoire sequencing, the human leukocyte antigen (HLA) type of patients can be computationally determined from their T-cell repertoire. This tool could be useful for typing repertoires for which the HLA is unknown and for gaining insight into the HLA restriction of T-cell receptor epitope specificity. It also provides lists of T-cell receptors that are associated with each HLA allele.

Characterizing cell-type spatial relationships across length scales in spatially resolved omics data

January 3, 2025
Nature Communications

Authors introduce CRAWDAD, an R package for quantifying cell-type spatial relationships across length scales in tissues using spatial omics data, enabling the identification of consistent as well as sample-specific cell-type spatial relationships across multiple spatial omics datasets.

Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages

January 2, 2025
Nature Communications

The authors propose a Genetic Progression Score (GPS) model to predict autoimmune disease progression from preclinical stages, integrating genome-wide association study summary statistics and electronic health records from the All of Us and BioVU biobanks.

The Impact of Ancestry on Genome-Wide Association Studies

January 1, 2025
Pacific Symposium on Biocomputing 

Decisions regarding how the genetic data are combined in genome-wide association studies (GWAS) can mask important findings that might serve individuals of ancestries that have been historically underrepresented in genetic studies. New methods that consider ancestry-specific variants in conjunction with the shared variants need to be developed.

A Prospective Comparison of Large Language Models for Early Prediction of Sepsis

January 1, 2025
Pacific Symposium on Biocomputing 

Authors present a comparative study on the performance of two popular open-source large language models for early prediction of sepsis: Llama-3 8B and Mixtral 8x7B. The primary goal was to determine whether a smaller model could achieve comparable predictive accuracy to a significantly larger model in the context of sepsis prediction using clinical data.

A multivalent binding model infers antibody Fc species from systems serology

December 23, 2024
PLOS Computational Biology

Authors developed a computational model that interprets data from systems serology, allowing them to infer detailed, engagement-relevant structural information about antibodies that is difficult to measure directly. The model was applied to existing data from COVID-19 and HIV studies, revealing new insights into how antibody structure relates to vaccine efficacy and disease progression. The model also helped to identify which measurements in systems serology are most informative, potentially streamlining future studies.