Refining microbiome diversity analysis by concatenating and integrating dual 16S rRNA amplicon reads
npj Biofilms and Microbiomes volume 11, Article number: 57 (2025)
Abstract
Understanding the role of human gut microbiota in health and disease requires insights into its taxonomic composition and functional capabilities. This study evaluates whether concatenating paired-end reads enhances data output for gut microbiome analysis compared to the merging approach across various regions of the 16S rRNA gene. We assessed this approach in both mock communities and Korean cohorts with or without ulcerative colitis. Our results indicate that using the direct joining method for the V1-V3 or V6-V8 regions improves taxonomic resolution compared to merging paired-end reads (ME) in post-sequencing data. While predicting microbial function based on 16S rRNA sequencing has inherent limitations, integrating sequencing reads from both the V1-V3 and V6-V8 regions enhanced functional predictions. This was confirmed by whole metagenome sequencing (WMS) of Korean cohorts, where our approach improved taxa detection that was lost using the ME method. Thus, we propose that the integrated dual 16S rRNA sequencing technique serves as a valuable tool for microbiome research by bridging the gap between amplicon sequencing and WMS.
Introduction
Recent explorations into the human gut microbiome have captured widespread interest due to its complex composition, functional capabilities, and significant influence on human health and disease states1,2. The surge in research activity is largely attributed to advancements in next-generation sequencing (NGS) technologies, which have transformed our ability to discern gut microbiota variances associated with a broad range of conditions such as cancer3, obesity4, diabetes5, inflammatory bowel diseases (IBD)6,7, neurological disorders8, and antibiotic resistance9,10. These technological advances have enabled large-scale population studies, providing deeper insights into the epidemiology of infectious diseases11 and facilitating the analysis of extensive microbiome datasets12,13.
Predominantly, 16S rRNA amplicon sequencing and whole metagenome sequencing (WMS) are pivotal in unraveling gut microorganism diversity and exploring the epidemiological factors that influence microbiome configurations14,15. These methods have greatly advanced our understanding of the dynamics that shape the human gut microbiome, encompassing microbial taxa, epidemiological impacts, evolutionary patterns, and demographic variables such as ethnicity, environmental conditions, dietary habits, and age16,17,18. However, gut microbiome studies often face challenges due to inherent experimental biases. Such biases in taxonomic identification may stem from the choice of taxonomic marker genes (e.g., 16S rRNA for bacteria, 18S rRNA for eukaryotes, and ITS regions for fungi) and their target regions19,20, diversity in sequencing platforms21,22, inconsistencies in data quality23, and variations in reference databases24. For example, the selection of the 16S rRNA region critically affects the resolution and precision of bacterial detection and classification25, leading to discrepancies in estimating the presence of certain bacterial groups26,27. Notably, the V4-V5 region should be avoided for infant feces28, whereas the V1-V3 region is recommended for soil and saliva samples29. Utilizing the full read length (V1-V9 region) is also recommended to reduce sequencing error rates30.
Both 16S rRNA sequencing and WMS have their unique benefits and face distinct challenges. WMS provides in-depth insights into microbial communities and functional data but requires substantial computational resources and ongoing reference database updates31,32,33. It also deals with challenges, such as host DNA depletion and variability in 16S rRNA primer coverage34,35,36. In contrast, 16S rRNA sequencing is a cost-effective and efficient alternative for specific applications, particularly when using methodologies that minimize inherent biases37. Our study compares analytical methodologies within 16S rRNA sequencing, focusing on merging paired-end reads (ME) and direct joining (DJ). These methods aim to broaden the range of captured microbial data and reduce biases associated with merging methods. ME merges reads based on overlapping sequences, potentially losing valuable genetic information when overlaps are minimal. DJ, however, concatenates forward and reverse reads directly, retaining all genetic information and enhancing dataset completeness, which is essential for accurately depicting microbial communities38,39.
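The difference between the two strategies can be illustrated with a minimal Python sketch. It is not the JTax or DADA2 implementation: it assumes the reverse read is reverse-complemented before joining, it uses an exact suffix/prefix match as a stand-in for real overlap alignment, and the function names and the `min_overlap` parameter are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def direct_join(forward: str, reverse: str) -> str:
    """Concatenate the forward read with the reverse-complemented reverse read.

    No overlap is required, so every base of both reads is retained.
    Assumption: real tools may orient or pad the reads differently.
    """
    return forward + reverse_complement(reverse)


def merge_by_overlap(forward: str, reverse: str, min_overlap: int = 12) -> Optional[str]:
    """Merge two reads on an exact overlap.

    Returns None when no overlap of at least `min_overlap` bases exists,
    which is how a read pair is lost under a merging strategy.
    """
    rc = reverse_complement(reverse)
    max_len = min(len(forward), len(rc))
    # Try the longest candidate overlap first.
    for k in range(max_len, min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None
```

With a short overlap, `merge_by_overlap` returns `None` and the pair is discarded, whereas `direct_join` always returns a sequence that contains both reads in full.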
We compare the quality of sequencing data between concatenated and merged reads, focusing on sequencing errors and the impact of different 16S rRNA regions on identifying rare microbial taxa in diverse cohorts, including healthy individuals and patients with ulcerative colitis (UC). Using correction formulas derived from mock community datasets24, we refined the precision of taxonomic classification, aiding in the identification of unique metabolic pathways associated with health and UC. Through comparative functional profiling with multiple analytical pipelines based on 16S rRNA sequencing and WMS, we seek potential diagnostic markers and therapeutic targets. This comprehensive approach elucidates the role of the gut microbiome in health and disease, utilizing dual 16S rRNA amplicon sequencing to improve clarity and specificity, advancing our understanding of microbial ecosystems and promoting targeted interventions that could profoundly affect patient care and therapeutic outcomes.
Results
Comparative validation of concatenation and merging methods for gut microbiome analysis
Our research assessed the effectiveness of concatenating versus merging paired-end reads across various 16S rRNA regions, using the ZIEL-II mock community dataset (SRP291583), which includes 19 bacteria across 18 genera. We applied ME and DJ alongside inside-out (IO) concatenation techniques (Fig. 1). Observations revealed a decline in sequence quality towards the 3’-end across all regions (Supplementary Fig. 1), with concatenation generally achieving better alignment of non-chimeric reads with the SILVA database across all tested 16S rRNA regions (Fig. 2a and Supplementary Data 1).
Fig. 1: Analytical strategy for region-specific 16S rRNA amplicon sequencing and whole metagenome sequencing (WMS).

This workflow outlines the analysis of amplified 16S rRNA sequences (V1-V3, V6-V8, and V1-V9). The steps include: 1) Adapter trimming with fastp; 2) Merging paired-end sequences for regions excluding V1-V9 using DADA2 in QIIME2, while V1-V9 single-end sequences undergo separate analysis; 3) Concatenating paired-end sequences through JTax using both direct joining (DJ) and inside-out (IO) techniques, with lengths trimmed based on a median quality score of 20 (detailed trim positions are shown in Supplementary Fig. 1); 4) Subjecting all region-derived sequences (merged, concatenated, or intact V1-V9) to quality filtering, denoising, and chimera removal via DADA2; 5) Classifying amplicon sequence variants (ASVs) produced across three analytical pipelines against 16S rRNA DBs (GG2, SILVA, and RDP); 6) Conducting functional profiling via PICRUSt2. For WMS, data processing ranged from shallow (4 GB) to deep (36 GB) sequencing reads, using Trimmomatic and TRF, aligned against hg38 with default settings, and analyzed for taxonomic and functional profiling with HUMAnN 3.0. This figure was created using Biorender.com.
Fig. 2: Validation of taxonomic resolution and richness of 16S rRNA sequencing reads using mock community datasets.

a The comparison of the mapped reads ratio from the SRP291583 dataset. ME: the method merging raw paired-end reads, DJ: concatenating raw paired-end reads. b Alpha diversity metrics across methods, including observed species and Shannon index. c NMDS plot based on Jaccard distance across the analytical methods. d Family-level relative abundance, with “Others” indicating taxa below 1% relative abundance. The relative abundance for the theoretical composition of ZIEL-II mock community24 is indicated by an asterisk (*). e The distribution of difference values between theoretical and actual values for each family. f Comparison of relative abundance between theoretical values (T.V.) and specific 16S rRNA regions from ZIEL-II mock datasets. g Evaluation of precision, recall, and F-measure for the ME and DJ pipelines using the V1-V3 and V6-V8 regions, based on the SILVA DB.
Concatenation using the DJ method notably enhanced microbial diversity and evenness, evidenced by higher richness and Shannon effective numbers compared to the ME method, particularly in the V1-V3, V3-V4 and V7-V9 regions (Fig. 2b). Non-metric multidimensional scaling (NMDS) suggested that adjacent regions within the 16S rRNA gene exhibit similar microbial communities, with significant differences in the V3-V4 and V7-V9 regions (Fig. 2c), highlighting each region’s distinct response to the concatenation and merging techniques. The ME method particularly overestimated Enterobacteriaceae abundance in the V3-V4 (1.95-fold) and V4-V5 (1.92-fold) regions; these discrepancies were largely corrected by the DJ method, though not entirely in the V4-V5 region (Fig. 2d).
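For readers unfamiliar with the metrics named above, the following Python sketch shows how richness and the Shannon effective number are commonly computed from per-taxon read counts. It assumes the natural logarithm and the standard definition of the effective number as the exponential of the Shannon index; the source does not state which software or log base was used, and the function names are illustrative.

```python
import math
from typing import Sequence


def richness(counts: Sequence[float]) -> int:
    """Number of taxa observed with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts: Sequence[float]) -> float:
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def shannon_effective(counts: Sequence[float]) -> float:
    """Effective number of taxa, exp(H): equals richness for a perfectly even community."""
    return math.exp(shannon_index(counts))
```

A perfectly even community of four taxa yields an effective number of four, while a community dominated by one taxon yields a value close to one even if its richness is high.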
To substantiate the performance benefits of concatenation over the merging approach, we conducted a detailed correlation analysis by comparing the theoretical and actual measured relative abundances across different region-specific methods, excluding the less reliable V1-V2 and V4-V5 regions (Supplementary Fig. 2). The DJ method improved the detection accuracy of several microbial families not identified by the ME method in the V3-V4 region, though it continued to face challenges with overestimations in the V4-V5 region (Fig. 2e). Detailed comparisons to theoretical values (TVs) revealed that specific DJ methods, especially V13-DJ (1.021) and V68-DJ (1.023), achieved median values closest to the ideal of 1.0 (Fig. 2e). However, there were notable discrepancies in other regions, indicating some inherent limitations in family-specific detection accuracy. For instance, V13-DJ significantly underdetected Bifidobacteriaceae (0.17), and V68-DJ overestimated Atopobiaceae (2.13) (Fig. 2e). The results showed that both V68-DJ and V13-DJ methods provided a more accurate and consistent representation of microbial abundances, enhancing the quality of taxonomic and functional insights derived from gut microbiome analyses (Supplementary Fig. 2).
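The comparison to theoretical values can be sketched as follows, under the assumption that the reported numbers (for example 1.021, 0.17, and 2.13) are observed-to-theoretical ratios of relative abundance, with 1.0 meaning perfect agreement. The function names and the treatment of undetected families as a ratio of 0.0 are choices made for this sketch, not details confirmed by the source.

```python
from statistics import median
from typing import Dict


def abundance_ratios(observed: Dict[str, float], theoretical: Dict[str, float]) -> Dict[str, float]:
    """Observed / theoretical relative abundance per family.

    Families absent from `observed` are treated as undetected (ratio 0.0).
    """
    return {
        family: observed.get(family, 0.0) / expected
        for family, expected in theoretical.items()
        if expected > 0
    }


def median_ratio(observed: Dict[str, float], theoretical: Dict[str, float]) -> float:
    """Median of the per-family ratios; values near 1.0 indicate good overall agreement."""
    return median(abundance_ratios(observed, theoretical).values())
```

A median near 1.0 can coexist with strong per-family outliers, which is why the per-family ratios are worth inspecting alongside the summary value.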
Additional issues were observed in the V6-V8 and V7-V9 regions, where unclassified Enterobacterales were detected. Moreover, the V34-ME analysis demonstrated poor performance with significant outliers such as Microbacteriaceae and Pseudomonadaceae. V34-DJ (0.88) improved but still presented outliers, such as Prevotellaceae (Fig. 2e). In addition, V4-DJ still presented one outlier (Microbacteriaceae), and V45-DJ still failed to detect several families. Given these findings, we excluded V1-V2, V4, V4-V5, and V7-V9 from further analysis due to their lower correlation values (<0.66) and the presence of outliers or undetected families, which could skew the gut microbiome analysis.
In conclusion, our results indicate the importance of selecting suitable 16S rRNA regions for analysis, advocating for the use of V1-V3 and V6-V8 regions when employing concatenating methods with the SILVA database (DB) to increase accuracy and reduce biases in analyzing the gut microbial community (Fig. 2f, g). This approach highlighted that the V1-V3 region consistently achieved higher recall values than the V6-V8 region. The ME method exhibited the lowest F-measure values, with significant discrepancies observed in the detection of families like Listeriaceae, Bifidobacteriaceae, and Eggerthellaceae (Fig. 2g). Remarkably, Coprobacillaceae detection was excessively high in the V13-ME method (23.4%), compared to the ideal (9.60%) and V13-DJ (17.4%) approaches (Fig. 2d). The V13-DJ method notably increased precision by 8% and the F-measure value by 5% relative to the V13-ME method. Despite challenges in estimating relative abundance, the V6-V8 region demonstrated superior precision in amplifying gut microbial 16S rRNA genes, underscoring the crucial role of method selection in microbiome analysis (Fig. 2g).
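Precision, recall, and F-measure for a mock community can be sketched in Python as below. The sketch assumes a presence/absence evaluation at the family level, where a true positive is an expected family that was detected; the source does not specify the exact evaluation procedure, and the function name is hypothetical.

```python
from typing import Dict, Set


def detection_scores(detected: Set[str], expected: Set[str]) -> Dict[str, float]:
    """Presence/absence scores for detected taxa against a known mock composition."""
    tp = len(detected & expected)   # expected taxa that were found
    fp = len(detected - expected)   # taxa reported but not in the mock community
    fn = len(expected - detected)   # expected taxa that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}
```

Under this definition, a method that misses an expected family lowers recall, while a method that reports an unexpected family lowers precision.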
Optimizing accuracy in gut microbiome analysis: the role of concatenated method and selection of 16S rRNA databases
The accuracy of estimating microbial relative abundance critically depends on the choice of 16S rRNA gene regions and read processing methodologies. We conducted an in-depth analysis using mock community data, focusing on the V1-V3 and V6-V8 regions, and calibrated coefficient values for specific family groups, selecting the most appropriate 16S rRNA DBs: Greengenes2 (GG2), SILVA, and the Ribosomal Database Project (RDP). Sequences were trimmed and processed for database matching using either the merging or concatenating method using V1-V3 and V6-V8 regions (Supplementary Fig. 3).
Our findings showed that the V13-ME method consistently overestimates relative abundance, particularly inflating families like Enterobacteriaceae_A and Pseudomonadaceae up to 93% relative to their expected values (24.7%) in the Zymo mock dataset. In contrast, concatenating methods—DJ and IO—yielded more accurate estimations at 22.0% and 24.1%, respectively (Supplementary Fig. 4a). Further assessments across different databases and primer sets revealed that the ME method consistently displayed the lowest correlation coefficients (R-values), particularly in the ZIEL-I mock dataset with the lowest R-values linked to the GG2 database (Supplementary Figs. 5–7). The ME method exhibited biases in the V1-V3 16S rRNA region, notably underrepresenting families such as Bacillaceae (1.1%), Enterococcaceae (0.3%), Lachnospiraceae (0.0%), and Staphylococcaceae (0.8%) (Supplementary Figs. 4a and 6a). Conversely, concatenation approaches, particularly using SILVA and RDP databases, markedly improved accuracy over the ME method. While the ME method achieved the highest R-value with the V6-V8 region and SILVA database in the ZIEL-II dataset, it faced challenges, particularly with the GG2 database (Supplementary Fig. 7). Notably, updates to family names like Eggerthellaceae, Erysipelatoclostridiaceae, and Verrucomicrobiaceae were observed, and the ME method’s failure to detect Listeria welshimeri within the V1-V3 region contrasted with the successful identification of Listeriaceae by the concatenating methods (Supplementary Fig. 4).
Comparative efficacy of concatenation versus merging in gut microbiome analysis
In our analysis, we evaluated the effectiveness of ME and concatenating methods using two significant datasets: SRP131748, which includes 60 samples from fecal and oronasal secretion, and SRP115494, containing 69 rectal samples, serving as supplementary data (Supplementary Data 2)7,40. We investigated the performance of concatenating methods (DJ and IO) over ME in the primary dataset (SRP131748), targeting the V1-V3 region. These methods achieved significantly higher non-chimeric read alignment with 16S rRNA databases, marking 58.64% for DJ and 59.12% for IO, compared to only 47.42% for ME, demonstrating their enhanced efficiency in SRP131748 (P < 0.05) (Supplementary Data 3). In addition, in the SRP115494 dataset targeting the V4 region, the concatenation methods marked 66.05% for DJ and 68.17% for IO, compared to only 42.12% for ME, underlining their improved detection capabilities (P < 0.05).
Further investigations within SRP131748 demonstrated that DJ and IO methods not only provided higher alpha diversity but also depicted more distinct microbial profiles than ME (Supplementary Figs. 8a, 9a–d, and Supplementary Data 3). A notable finding was the detection of an Enterococcus strain in the oronasal secretion-prediabetes group (disease) by the concatenating methods, which was absent in the control group (healthy) analyzed by ME (Supplementary Fig. 9e, f). Additionally, Phascolarctobacterium, previously undetectable by ME, was significantly identified with the DJ and IO methods, highlighting their increased detection capabilities (Supplementary Fig. 10).
The supplementary dataset, SRP115494, targeting the V4 region, also indicated a decline in sequence quality towards the 3’-end (Supplementary Fig. 8b). As in SRP131748, we reaffirmed the consistency of microbiome diversity between concatenating methods (DJ and IO) and ME across different sample types and conditions (Supplementary Fig. 11a and Supplementary Data 3). Microbial alpha diversity analysis revealed distinct differences across patient groups. IBD patients, including those with Crohn’s disease (CD) and UC, showed lower microbial diversity than non-IBD controls, with ME showing the least diversity (Supplementary Fig. 11b). At the family level, the DJ method enriched taxa such as Veillonellaceae, Erysipelotrichales, Pseudomonadales, and Staphylococcaceae, contrasting with ME, which predominantly identified Lachnospiraceae_NK4A136_g (Supplementary Fig. 11c–e). Moreover, taxa like Oscillospirales, [Eubacterium] eligens_g, and Romboutsia were detected only by the concatenating methods (Supplementary Fig. 12a, b).
In conclusion, the concatenation approach has demonstrated its potential to enhance microbial community diversity analysis, enabling a more comprehensive identification of specific microbial families and genera across both IBD and non-IBD cohorts (Supplementary Fig. 10). This method may improve the resolution and accuracy of gut microbiota analysis, bridging the gap between traditional sequencing approaches and the nuanced demands of modern microbiome research.
Impact of 16S rRNA region selection on taxonomic assignment in gut microbiota
To evaluate the effect of different 16S rRNA regions on taxonomic assignments, we analyzed fecal samples from a Korean cohort, comprising healthy individuals (n = 8) and patients with UC (n = 8) (Supplementary Data 4). We employed primer pairs targeting V1-V3, V6-V8, and V1-V9 regions, analyzing sequence quality and read integrity across these regions (Supplementary Fig. 13a). Quality analysis indicated sequence deterioration towards the 3’-end in both forward and reverse reads, which could introduce biases, particularly in the reverse reads (Supplementary Fig. 13b–d).
The concatenation methods (DJ and IO) showed increased non-chimeric reads compared to the ME method, particularly in the V1-V3 and V6-V8 regions, demonstrating their effectiveness in managing sequence quality impacts (Supplementary Fig. 14a, b). Differences in non-chimeric reads between DJ and IO were not significant, nor were differences in richness and taxonomic resolution, consistent with the analyses of the previous datasets (Supplementary Fig. 9). Alpha diversity analyses further supported the increased sensitivity of the DJ method in the V1-V3 region over ME (Supplementary Fig. 14c). Significant variations were observed in the detection of Bacteroidota and Actinobacteriota between healthy and UC samples across the studied regions. The V1-V3 region showed less variability in detecting Actinobacteriota, whereas the V6-V8 region was more consistent for Bacteroidota (Fig. 3a–c). At the family level, the V1-V3 and V1-V9 regions more consistently identified Bacteroidaceae, whereas the V6-V8 region was more sensitive to Bifidobacteriaceae, irrespective of the method used (Fig. 3b, c).
Fig. 3: Comparison of richness and taxa classification between analytical methods.

a Phylum-level relative abundance comparison. b Family-level relative abundance, with “Others” denoting taxa below 1% relative abundance. c Comparative relative abundances of Bacteroidota, Bacteroidaceae, Actinobacteriota, and Bifidobacteriaceae across ME and DJ methods. Asterisk (*) means a P value < 0.05.
Genus-level heat tree visualizations between healthy and UC groups highlighted the influence of 16S rRNA region selection on taxonomic assignments, showing the DJ method’s ability to reduce bias for families like Akkermansiaceae and Lactobacillaceae in the V1-V3 region (Fig. 4a, b). However, biases persisted for families such as Clostridiaceae and Bifidobacteriaceae based on the V6-V8 region. The V1-V3 region analysis also enriched Enterobacteriaceae and Bacteroidales, including families like Bacteroidaceae, Rikenellaceae, and Marinifilaceae, showing a clear contrast to the V6-V8 region.
Fig. 4: Heat trees and heatmaps of gut microbial communities in health and UC.

a Phylogenetic distribution of bacterial taxa within the gut microbiome. The tree illustrates the hierarchical relationships among various bacterial orders, families, and genera. The hierarchical structure includes labels for all taxa demonstrating significant differences in any pairwise comparison between 16S rRNA regions. b Pairwise comparisons between specific variable 16S rRNA regions in health and UC. Red or blue colors of nodes and leaves indicate a log-2-fold increase in median abundance (Benjamini-Hochberg adjusted Wilcoxon rank sum q < 0.05). c, d Heatmap showing the median relative abundance (%) of detected families across methods (i.e., 16S rRNA sequencing and WMS) for healthy individuals (c) and UC patients (d). Families not detected by each method are indicated with an “x” on a white background.
Comparative taxonomic resolution with the WMS method revealed significant discrepancies in the relative abundance of Bacteroidota, Actinobacteriota, and Bifidobacteriaceae between V1-V3, V1-V9, and WMS methods for both healthy individuals and UC patients (Fig. 3c). Eleven families detected by WMS were missed by the V13-ME method in the healthy group, whereas V13-DJ identified nine families (e.g., Barnesiellaceae, Lactobacillaceae, Streptococcaceae, and Sutterellaceae) also seen in WMS (Fig. 4c). However, Barnesiellaceae, Lactobacillaceae and Sutterellaceae, detected in WMS, were absent in both V68-ME and V68-DJ analyses. In the UC group, families like Eggerthellaceae, Lactobacillaceae, Streptococcaceae, Oscillospiraceae, and Selenomonadaceae, missed by V13-ME, were identified by V13-DJ. Notably, V68-DJ uniquely detected Oscillospiraceae, absent in WMS. Analysis using V19 indicated overestimated abundances of Erysipelotrichaceae, Veillonellaceae, and Selenomonadaceae compared to other methods (Fig. 4d). While concatenation-based methods demonstrated superior microbial detection relative to merging methods when validated with WMS, they still exhibited biased taxonomic resolution.
In conclusion, our findings underscore the critical importance of selecting appropriate 16S rRNA regions and analytical methods to better represent gut microbial diversity and minimize taxonomic biases. Despite the advantages of the concatenating method, relying solely on one 16S rRNA region may still result in biased outcomes, emphasizing the need for comprehensive methodological approaches in microbiome research.
Advancing gut microbiota profiling accuracy with correction coefficient-based adjustments for dual 16S rRNA reads
To improve the precision of gut microbiota analysis, we developed a methodology based on correction coefficients that integrates reads from both the V1-V3 and V6-V8 16S rRNA regions. We introduced the adjusted 16S rRNA (Adj-16S) sequencing method, applying correction coefficients based on analyses of dual 16S rRNA regions to more accurately adjust relative abundance values (Fig. 5a). For instance, the correction coefficient for Enterobacteriaceae using V13-DJ ($\omega_{i13.En}$) was calculated as 1.26 by comparing the theoretical value (29.5) with the observed value (23.5), and similarly for V68-DJ ($\omega_{i68.En}$) (Fig. 5b). These adjustments provided a more accurate representation of taxonomic profiles. Utilizing mock datasets (Zymo, ZIEL-I, ZIEL-II) and the SILVA DB, we calculated correction coefficients for each 16S rRNA region, using weighted averages41 across 22 families. Weighted coefficient values for the V1-V3 ($\omega_{i13.f}$) and V6-V8 ($\omega_{i68.f}$) regions were determined by dividing the relative abundance from the respective regions, V1-V3 ($x_{i13.f}$) and V6-V8 ($x_{i68.f}$), by the total abundance from both regions. The means of these weighted coefficient values ($\varpi_{i13.f}$ and $\varpi_{i68.f}$) were computed using data from eight independent datasets across three mock datasets. The adjusted relative abundances for both the V1-V3 ($x'_{i13.f}$) and V6-V8 ($x'_{i68.f}$) regions were then calculated, leading to a formula representing the total adjusted relative abundance: $x \simeq \sum_{i=1}^{n} \left( x'_{i13.f} + x'_{i68.f} \right)$. These adjustments showed that the V6-V8 region more accurately reflected ideal community compositions, particularly for families like Actinomycetaceae, Bifidobacteriaceae, and Tannerellaceae. Conversely, the V1-V3 region was more responsive to Coprobacillaceae, Microbacteriaceae, and Pseudomonadaceae (Fig. 5c).
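The arithmetic described above can be sketched in Python. Two parts follow directly from the text: the correction coefficient as theoretical divided by observed abundance (29.5 / 23.5 ≈ 1.26), and the per-region weight as that region's abundance divided by the sum over both regions. The final step is an assumption of this sketch: the text does not give the formula for the adjusted abundances, so each region's abundance is simply scaled by its mean weight and the two are summed. All function names are hypothetical.

```python
from typing import List, Tuple


def correction_coefficient(theoretical: float, observed: float) -> float:
    """Theoretical / observed relative abundance for one family in one region."""
    return theoretical / observed


def region_weights(x13: float, x68: float) -> Tuple[float, float]:
    """Share of a family's abundance contributed by V1-V3 and by V6-V8."""
    total = x13 + x68
    if total == 0:
        return 0.0, 0.0
    return x13 / total, x68 / total


def mean_weights(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average the per-dataset weights (the text uses eight mock datasets)."""
    weights = [region_weights(x13, x68) for x13, x68 in pairs]
    n = len(weights)
    return sum(w[0] for w in weights) / n, sum(w[1] for w in weights) / n


def adjusted_abundance(x13: float, x68: float, w13: float, w68: float) -> float:
    """ASSUMPTION: adjusted abundance = w13 * x13 + w68 * x68 for one family."""
    return w13 * x13 + w68 * x68
```

Because the two mean weights for a family sum to one, the assumed adjusted value always lies between the V1-V3 and V6-V8 estimates, which matches the pattern of the Bacteroidaceae figures reported for the Korean cohort.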
Fig. 5: Adjusted relative abundance in Korean gut microbial communities.

a Schematic diagram illustrates a systematic approach to adjusting microbiome composition by applying correction coefficients, leading to a desired microbial balance. b Bar graphs representing detailed taxonomic resolution derived from V1-V3 and V6-V8 regions compared to theoretical values using mock community dataset. c Weighted coefficient values derived from the V1-V3 and V6-V8 regions using eight independent datasets for 22 families. d A raincloud plot showing the adjusted total relative abundance by applying weighted coefficient values for the Korean cohort. e Comparison of adjusted 16S rRNA (Adj-16S) and WMS profiles for healthy and UC cohorts. [Eubacterium]: [Eubacterium]_coprostanoligenes_group. f Heatmap of family-level gut microbiota differences between healthy and UC groups by analytical method. The color scale is gray (0.1 ≤ P value), red (P value < 0.1), and white (Not detected). Statistical analysis was performed using Welch’s t test.
Based on these findings, we recommend adopting the Adj-16S for more precise profiling of 16 Korean gut microbial communities (Fig. 5d, e and Supplementary Data 5). This approach showed that Bacteroidaceae was more prevalent in the V1-V3 region (16.44%) compared to the V6-V8 region (3.50%). In the UC cohort, Bacteroidaceae levels were 8.91% in V1-V3 and only 2.27% in V6-V8. However, the Adj-16S revealed relative abundances of Bacteroidaceae at 9.06% for healthy individuals and 5.22% for UC patients, values that closely align with those obtained from WMS, which were 7.74% for healthy individuals and 5.08% for UC patients. Similarly, Bifidobacteriaceae was more abundant in the V6-V8 region (20.97%) than in the V1-V3 region (0.29%) among healthy individuals. Applying the calculated coefficients to balance discrepancies between the two regions resulted in a uniform representation of abundances, with improved concordance evidenced by correlation metrics at the family and genus levels compared to WMS data (Supplementary Fig. 15). Furthermore, the Adj-16S method detected 21 families, including Butyricicoccaceae, Clostridia_UCG-014, Coprobacillaceae, and Monoglobaceae, which were not identified in WMS analyses (Supplementary Data 5).
The Adj-16S method significantly delineated the microbial differences between healthy individuals and UC patients, revealing disparities in the detection of families like Odoribacteraceae and Bacillota_unclassified that were pronounced in WMS but not in 16S rRNA analyses (Fig. 5f). Families such as [Eubacterium] and Rikenellaceae distinctly categorized healthy from UC groups. Oscillospiraceae and Akkermansiaceae, predominantly found in the healthy cohort, illustrate the nuanced capability of concatenated 16S rRNA methods alongside WMS in detecting critical microbial differences. Marinifilaceae, Ruminococcaceae, and Anaerovoracaceae were only significantly detected in the merging methods. These findings underscore the necessity of methodological precision in 16S rRNA-based profiling, affirming the concatenated approach for its improved accuracy and consistency in representing microbial abundances, closely aligned with the comprehensive insights provided by WMS analyses.
Comparative functional profiling in gut microbiota: insights from Adj-16S and WMS analysis
To further explore the functional capabilities of the gut microbiota, we conducted a comparative analysis using the Adj-16S method alongside traditional 16S rRNA amplicon sequencing techniques and WMS (Fig. 1). For the 16S rRNA data, predictive functional profiling was performed using PICRUSt2, while the WMS data were analyzed using HUMAnN 3.0. Our study identified several key functional pathways that were significantly different (P value < 0.05) between healthy individuals and UC patients (Fig. 6a). The V13-ME method identified 28 pathways in healthy subjects and 35 in UC patients, numbers which were at least twice those identified by other analytical methods. Conversely, the Adj-16S method pinpointed fewer pathways—14 in healthy subjects and 11 in UC patients. The WMS approach uniquely detected 12 pathways not identified by any 16S-based methods, with some pathways found to be common across 16S rRNA methods and WMS (Supplementary Fig. 16a, b). A direct comparison revealed that five pathways were common between Adj-16S and WMS analyses, indicating 25 unique pathways in WMS and 20 unique to Adj-16S across both subject groups.
Fig. 6: Functional pathway analysis in healthy and UC groups by WMS and 16S rRNA-derived PICRUSt2.

a Number of functional pathways associated with each group. b Comparison of 50 genes between healthy (n = 8) and UC (n = 8) groups using qRT-PCR. The analysis aimed to calculate log2(2^−ΔΔCt) values (UC vs. Healthy) for candidate genes relative to 16S Ct values. The significance of differences between healthy and UC samples was assessed using a t test (*, P value < 0.1). c Histogram of precision, recall, accuracy, misclassification rate, and F1 score for the datasets. Values in parentheses indicate the sum of true positive (TP) plus true negative (TN) results. For score calculation, metabolic pathways with a P value < 0.1 were used. d The false positive (FP) rate was determined by dividing FP by the sum of FP and TN. e Venn diagrams illustrating the distribution of TPs (left/healthy) and TNs (right/UC) identified by various analytical methods.
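The quantity named in panel b can be illustrated with a minimal sketch, assuming the standard comparative Ct approach: each gene Ct is normalized to the 16S Ct of the same sample (ΔCt), group means are compared (ΔΔCt, UC minus healthy), and the fold change 2^−ΔΔCt is reported on a log2 scale. The function name and all Ct values below are hypothetical and are not taken from the study data.

```python
import math
from statistics import mean


def log2_fold_change(gene_ct_uc, ref_ct_uc, gene_ct_healthy, ref_ct_healthy):
    """Return log2(2^-ddCt) for one gene, UC versus healthy.

    Each argument is a list of Ct values, one per sample. The reference
    Ct is assumed to be the 16S Ct measured in the same sample.
    """
    # dCt: gene Ct normalized to the 16S reference Ct, per sample
    dct_uc = [g - r for g, r in zip(gene_ct_uc, ref_ct_uc)]
    dct_healthy = [g - r for g, r in zip(gene_ct_healthy, ref_ct_healthy)]

    # ddCt: difference of the group means (UC minus healthy)
    ddct = mean(dct_uc) - mean(dct_healthy)

    # Fold change, then its log2; this equals -ddCt by construction
    fold_change = 2 ** (-ddct)
    return math.log2(fold_change)


# Hypothetical Ct values for illustration only
example = log2_fold_change(
    gene_ct_uc=[24.0, 25.0], ref_ct_uc=[15.0, 15.0],
    gene_ct_healthy=[26.0, 27.0], ref_ct_healthy=[15.0, 15.0],
)
```

A positive value indicates that the gene is more abundant in UC relative to 16S, and a negative value indicates that it is more abundant in the healthy group.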
Further validation through quantitative real-time PCR (qRT-PCR) analysis of 50 genes representing these pathways confirmed 18 significantly divergent pathways between the groups, with 12 additional pathways differing in the reverse comparison (Fig. 6b and Supplementary Data 7). This validation underlined the predictive accuracy of our methods, with the concatenation-based approach using V13-DJ and V68-DJ demonstrating relatively higher accuracy and F1 scores compared to merged methods (Fig. 6c). The false positive rate (FPR) for Adj-16S was 0.36, lower than that of V13-DJ (FPR: 0.53) and V68-DJ (FPR: 0.38). However, the V19 method, despite having the lowest FPR, was limited in its detection of true positives (TPs) and true negatives (TNs) (Fig. 6c, d).
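The scores reported in Fig. 6c, d follow the usual confusion-matrix definitions, with the FPR computed as FP divided by the sum of FP and TN. A minimal sketch of these calculations is given below; the function name and the example counts are hypothetical and do not reproduce the counts underlying the figure.

```python
def classification_scores(tp, fp, tn, fn):
    """Compute confusion-matrix scores from TP, FP, TN, and FN counts."""

    def safe_div(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero
        return numerator / denominator if denominator else 0.0

    total = tp + fp + tn + fn
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": safe_div(tp + tn, total),
        "misclassification_rate": safe_div(fp + fn, total),
        "f1": safe_div(2 * precision * recall, precision + recall),
        # FPR = FP / (FP + TN), as stated in the Fig. 6d caption
        "fpr": safe_div(fp, fp + tn),
    }


# Hypothetical counts for illustration only
scores = classification_scores(tp=6, fp=2, tn=6, fn=4)
```

A method can reach a low FPR simply by calling few pathways significant, which is why the FPR is read together with the TP and TN counts rather than on its own.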
A Venn diagram analysis emphasized the efficacy of the Adj-16S method in detecting the highest number (12) of TP and TN pathways compared to other techniques (Fig. 6e). The Adj-16S method detected all pathways except for SULFATE-CYS-PWY, uniquely identified by the V13-DJ method. While the V13-DJ method missed three pathways (PYRIDOXSYN-PWY, P23-PWY, and PWY-3781), the V68-DJ method failed to detect five (PWY-5855, PWY-7456, PWY-6467, PWY-6590, and SULFATE-CYS-PWY). Notably, the Adj-16S method closely aligns with other methods in capturing nearly all TN pathways for UC.
Collectively, our findings emphasize the importance of selecting appropriate 16S rRNA regions and employing concatenating methods to enhance accuracy and reduce biases in the functional profiling of gut microbiota. This study not only clarifies the differences between functional profiles derived from Adj-16S compared to the ME and DJ methods but also highlights the potential for discovering unique biomarkers or therapeutic targets within these methodologies (Fig. 6e).
Discussion
In this study, we enhanced gut microbiota profiling through a concatenation approach using pivotal 16S rRNA gene regions, V1-V3 and V6-V8. This method diverges from conventional practices that rely primarily on single-region amplicons and merged reads. Our goal was to refine taxonomic assignment and deepen functional characterization of the gut microbiome, crucial for deciphering its role in health and disease. Previous research has often failed to provide robust experimental validation linking specific microbes to health outcomes or distinguishing functional differences between diseased and healthy states6,42,43.
Our Adj-16S method aimed to minimize biases inherent in using single 16S rRNA regions. This approach significantly increased mapped read ratios and microbial identification resolution, thereby improving our understanding of taxonomic structures within the gut microbiota (Supplementary Figs. 4 and 9e–g). For instance, it clarified the presence of specific taxa such as Oscillospirales and Romboutsia in non-IBD individuals and taxa such as the Family_XII_AD3011_group in CD, with greater clarity compared to conventional methods. Notably, Romboutsia, known for its beneficial acetate and propionate production44, is diminished in CD compared to healthy individuals45,46. Similarly, the Oscillospiraceae family, associated with anti-inflammatory valeric acid, shows higher abundance in healthy individuals than in those with CD47. Notably, beneficial taxa associated with anti-inflammatory and metabolic benefits showed differential abundance in healthy versus CD individuals, highlighting potential therapeutic targets. Additionally, our analysis has refined the taxonomic resolution for taxa such as Roseburia and Phascolarctobacterium in the V1-V3 region (Supplementary Figs. 9 and 10). Furthermore, the concatenating method improved differentiation between gut microbiota of healthy individuals and those with UC, emphasizing the value of multiple regions for a comprehensive analysis (Fig. 3). By applying correction coefficients derived from mock community datasets, we aligned our relative abundance profiles more closely with WMS data, enhancing the accuracy of our analyses (Fig. 5 and Supplementary Fig. 15).
Our predictive functional profiling further delineated significant metabolic pathways associated with both healthy individuals and UC patients. Notably, pathways such as HEME-BIOSYNTHESIS-II, PWY-5845 (menaquinol-9 biosynthesis), and PWY-5850 (menaquinol-6 biosynthesis) were prevalent in healthy individuals and were not detected by other 16S rRNA-based methods or WMS (Supplementary Fig. 16b). These pathways are implicated in vitamin K deficiencies commonly observed in IBD patients48 and may serve as vital biomarkers for IBD diagnosis48,49. Interestingly, these pathways did not appear significant in WMS findings, underscoring the unique strengths of the Adj-16S method. Conversely, pathways such as NAGLIPASYN-PWY were identified as more prevalent in healthy individuals than in UC, which contradicts some reports associated with CD50. For UC, consistent detection of six metabolic pathways such as PWY-5189 (tetrapyrrole biosynthesis II), PWY-621 (sucrose degradation III), METH-ACETATE-PWY, and PWY-7315 (dTDP-N-acetylthomosamine biosynthesis) across all analytical methods, including Adj-16S and various concatenated 16S rRNA sequencing methods, corresponded with literature suggesting altered metabolic states in UC patients (Fig. 6e). These findings align with observations that healthy individuals have higher levels of tetrapyrrole and its derivatives compared to the UC group51, suggesting that a compensatory response in UC might drive heightened biosynthesis (PWY-5189). The β-fructofuranosidase gene linked to Eubacterium rectale was found in healthy samples and genomes related to Lachnospiraceae bacterium isolate MGYG-HGUT-02492, which is more abundant in UC. Furthermore, we observed the activation of starch degradation pathways in individuals with inflammatory bowel syndrome with diarrhea52, possibly indicating a connection to gut dysbiosis in UC.
Pathways involved in the biosynthesis of compounds linked to inflammation, such as kynurenic acid (METH-ACETATE-PWY) and lipopolysaccharide biosynthesis (PWY-7315) were found to be elevated in UC53,54, suggesting potential involvement in inflammatory processes. Interestingly, these pathways did not manifest as significant in WMS data, except for NAGLIPASYN-PWY and METH-ACETATE-PWY, indicating nuanced differences in the detection capabilities of various sequencing methodologies. Additionally, ARG + POLYAMINE-SYN, GALACT-GLUCUROCAT-PWY, PWY-5130, PWY-7663, and PWY-8073, all associated with a healthy status, and 1CMET2-PWY, HSERMETANA-PWY, P125-PWY, PWY-1861, PWY-6270, PWY-6527 associated with UC, were corroborated by the WMS method.
The metabolic pathways uncovered using the Adj-16S and WMS methodologies offer promising avenues for a deeper understanding of UC, providing potential pathways for diagnostics and therapeutic development. The advanced 16S rRNA-based analytical method and WMS could illuminate our understanding of gut microbiota structure55,56. Despite exploring deep shotgun sequencing analysis, our findings indicate that this technique did not notably improve our discrimination between the functional pathways of healthy individuals and those with UC (Supplementary Fig. 17). Even when varying sequencing depths (4 GB, 18 GB, and 36 GB) were employed, the ability to distinguish between these states did not significantly change57. However, WMS is recognized for its capability to capture both taxonomic and functional features of bacteria and fungi, which may remain elusive with 16S rRNA sequencing55. Recent advancements in metagenome-assembled genomes (MAGs) present promising opportunities for exploring the ‘dark matter’ of the human gut microbiota58. Yet, the WMS approach in this study, based on reference-based methods like MetaPhlAn3, is limited to detecting only cataloged species, thus missing a vast array of uncultivated microbes. To overcome these challenges, newer methods such as MetaPhlAn 4 integrate both reference genomes and MAGs to expand species-level genome bins, enabling more comprehensive taxonomic profiling59. Continuous updates to databases (e.g., GTDB60 and UHGG61) and hybrid approaches that blend reference-based and assembly-based strategies are crucial for refining metagenomic analysis. Although functional analysis was partially validated through qRT-PCR in this study, the predictive results from 16S rRNA-based functional analysis using PICRUSt2 rely on a limited set of reference data. Therefore, to enhance the efficiency of the Adj-16S method, updating transitional ecological classifications to align with continuously updated databases would be beneficial.
Additionally, assessing these computational methods with integrated multiomics data is critical for advancing our understanding of microbial functions and interactions in the gut microbiota.
Our study acknowledges the limitations inherent in the scale and diversity of our mock community dataset. To refine the accuracy of our equations, expanding this dataset with a more extensive and diverse range of mock communities is essential. Such expansion would strengthen our foundation for using concatenated V1-V3 and V6-V8 16S rRNA regions to achieve a thorough gut microbiome analysis. While this method is optimized for analyzing the adult gut microbiome, its applicability to other environments (e.g., soil and marine environments) or human body sites (e.g., skin, saliva, and urinary tract) might require tailored analytical approaches. Additionally, obtaining robust results regarding differences in bacterial composition between healthy individuals and UC patients requires the application of multiple differential abundance methods (e.g., ALDEx2 and ANCOM-II)62 rather than only the LEfSe method used in this study. However, a small sample size (fewer than 10 per group) can lead to a higher false discovery rate with bias correction algorithms such as ANCOM-II compared to Wilcoxon63. In addition, the method’s performance can be optimized by implementing pipelines (e.g., TIC pipeline) that enhance the clustering of unclassified taxa64.
In summary, concatenating unmergeable reads has fine-tuned the resolution of our gut microbiome profiling, giving us a more detailed representation of the gut ecosystem. We have identified distinct metabolic pathways that differentiate healthy individuals from those with UC. Our method offers an efficient, cost-effective, and less labor-intensive way to unravel the complex interactions between hosts and microbes in the gut. By enhancing our ability to accurately map and comprehend these interactions, this advancement is poised to make a substantial impact on the development of targeted interventions, potentially revolutionizing patient care and therapeutic approaches.
Methods
Microbial community datasets
We utilized the SRP115494 (Longitudinal Multiomics of the Human Microbiome in IBD)7 and SRP131748 (Human Metagenome on pre-diabetic humans)40 datasets from NCBI for our microbiome analysis strategy. Additionally, the SRP291583 dataset24, comprising mock community datasets, was employed to validate gut microbiome analysis and develop correction coefficient formulas.