Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity
Nature Microbiology volume 7, pages 2128–2150 (2022)
Abstract
Despite advances in sequencing, lack of standardization makes comparisons across studies challenging and hampers insights into the structure and function of microbial communities across multiple habitats on a planetary scale. Here we present a multi-omics analysis of a diverse set of 880 microbial community samples collected for the Earth Microbiome Project. We include amplicon (16S, 18S, ITS) and shotgun metagenomic sequence data, and untargeted metabolomics data (liquid chromatography-tandem mass spectrometry and gas chromatography mass spectrometry).
We used standardized protocols and analytical methods to characterize microbial communities, focusing on relationships and co-occurrences of microbially related metabolites and microbial taxa across environments, thus allowing us to explore diversity at extraordinary scale. In addition to a reference database for metagenomic and metabolomic data, we provide a framework for incorporating additional studies, enabling the expansion of existing knowledge in the form of an evolving community resource. We demonstrate the utility of this database by testing the hypothesis that every microbe and metabolite is everywhere but the environment selects.
Our results show that metabolite diversity exhibits turnover and nestedness related to both microbial communities and the environment, whereas the relative abundances of microbially related metabolites vary and co-occur with specific microbial consortia in a habitat-specific manner. We additionally show the power of certain chemistry, in particular terpenoids, in distinguishing Earth’s environments (for example, terrestrial plant surfaces and soils, freshwater and marine animal stool), as well as that of certain microbes including Conexibacter woesei (terrestrial soils), Haloquadratum walsbyi (marine deposits) and Pantoea dispersa (terrestrial plant detritus). This Resource provides insight into the taxa and metabolites within microbial communities from diverse habitats across Earth, informing both microbial and chemical ecology, and provides a foundation and methods for multi-omics microbiome studies of hosts and the environment.
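Note: A 16S amplicon is a DNA fragment amplified from a region of the 16S rRNA gene; it is widely used to analyse bacterial communities. The 18S rRNA gene is the ribosomal RNA gene of eukaryotes, and ITS (internal transcribed spacer) is a gene region used to identify fungal species. Shotgun metagenomics is a method that analyses all of the DNA extracted from an environmental sample.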
Main
A major goal in microbial ecology is to understand structure in microbial communities, how this is related to microbial taxonomic, phylogenetic and functional composition, and how those relationships vary across space and time. As any single study is not able to sample all environments repeatedly to allow for such inferences, fostering the use of standardized methods that permit meta-analysis across distinct studies is of utmost importance1,2,3,4. Initial efforts focused on standardized protocols for 16S ribosomal RNA (rRNA) sequencing of bacterial/archaeal communities provided insight into how communities structure in the environment, supporting strong axes of separation of microbes along gradients of host association and salinity1,5. More recent efforts focused on shotgun metagenomics data6,7,8,9 have begun to provide additional insight regarding functional potential across environments10,11,12,13,14, and the current state-of-the-art methods employ multi-omics approaches including metagenomics, transcriptomics, proteomics and/or metabolomics15,16,17,18,19,20,21,22,23,24.
Microbes produce diverse secondary metabolites that perform vital functions from communication to defence25,26,27 and can benefit human health and environmental sustainability28,29,30,31,32,33,34. Whereas metagenome mining and transcriptomics are powerful ways to characterize function in microbial communities10,14,24, a more powerful approach to understanding functional diversity is to generate chemical evidence that confirms the presence of metabolites19,20,21 and accurately describes their distribution across Earth. Here we present an approach that directly assesses the presence and relative abundance of metabolites, and provides an accurate description of metabolite profiles in microbial communities across Earth’s environments. Although several studies have previously employed tandem metagenomics and metabolomics22,23,35,36,37,38,39,40, many employed relatively limited technical methods or profiled a relatively small number of classes of metabolites23,35,40, preventing comparison across studies that could expand our understanding. Further, several previous studies are limited in scope to a single environment or habitat20,23,24,35,36,37,38,39. Our work goes substantially beyond what has been reported previously regarding multi-omics analysis of microbial communities using metagenomics and metabolomics, by including multiple ecosystems. The approach we apply complements metagenomics with a direct survey of secondary metabolites using untargeted metabolomics.
Liquid chromatography with untargeted tandem mass spectrometry (LC–MS/MS) is a versatile method that detects tens of thousands of metabolites in biological samples19. Although LC–MS/MS metabolomics has historically suffered from low metabolite annotation rates when applied to non-model organisms, recent computational advances can systematically assign chemical classes to metabolites using their fragmentation spectra41. Untargeted mass-spectrometry-based metabolomics provides the relative abundance (that is, intensity) of each metabolite detected across samples rather than just counts of unique structures (that is, presence/absence data), and thus provides a direct readout of the surveyed environment, complementing a purely genomics-based approach. Although there is a clear need to use untargeted metabolomics to quantify the metabolic activities of microbiota, the approach has been limited by the challenge of distinguishing the secondary metabolites produced exclusively by microbes from other compounds detected in the environment (for example, those produced by multicellular hosts). To resolve this bottleneck, we devised a computational method for recognizing and annotating putative secondary metabolites of microbial origin from fragmentation spectra (see Online Methods).
We used this methodology to quantify microbial secondary metabolites from diverse microbial communities from the Earth Microbiome Project (EMP, http://earthmicrobiome.org). The EMP was founded in 2010 to sample Earth’s microbial communities at unprecedented scale, in part to advance our understanding of biogeographic processes that shape community structure. To avoid confusion with terminology, we define ‘microbial community’ as consisting of members of the domains Bacteria and Archaea. To build on the first analysis of the EMP archive focused on profiling bacterial and archaeal 16S rRNA1, we crowd-sourced a previously undescribed set of roughly 900 samples from the scientific community specifically for multi-omics analysis. We expanded the scalable framework of the EMP to include standardized methods for shotgun metagenomic sequencing and untargeted metabolomics for cataloguing microbiota globally. As a result, we provide a rich resource for addressing outstanding questions and to serve as a benchmark for acquiring additional data. To provide an example for using this resource, we present a multi-omics analysis of this undescribed sample set, tracking not just individual sequences but also genomes and metabolites. Our analysis includes diverse studies with sample types classified using an updated and standardized environmental ontology, describes large-scale ecological patterns and explores important questions in microbial ecology.
Specifically, we explore the hypothesis that ‘everything is everywhere but the environment selects’42,43,44,45,46. We predict that although most major classes of metabolites have cosmopolitan distributions14, their relative abundances will vary strongly among different environments. Therefore, whereas the presence/absence of metabolites alone may show profiles that are relatively uniform across samples, their relative abundances will provide great power in distinguishing among habitats. We predict that similar to microbes1, metabolites will exhibit both turnover and nestedness across habitats. Furthermore, we expect variation in metabolite profiles among environments to be in part driven by variation in microbial community composition. Therefore, we explore the hypothesis that metabolite alpha- and beta-diversity will be strongly correlated with microbial diversity. We anticipate strong positive relationships between microbial diversity and metabolite diversity, but that environmental similarity based on microbial composition may be distinct from that based on metabolite composition. We suspect that this is in part due to deterministic processes unique to microbial community assembly and similarity in metabolite profiles across the microbial phylogeny47,48,49. Regardless, if profiles for metabolites and microbes are habitat-specific, we predict that certain members can be used to classify samples among environments. We also predict that metabolites will co-occur with specific microbial taxa such that metabolite–microbe pairs can be described as features in the environment that define specific habitats.
Results
A resource for multi-omics in microbial ecological research
Here we generated data for 880 environmental samples that span 19 major environments contributed by 34 principal investigators as part of the Earth Microbiome Project 500 (EMP500). The EMP500 is a previously unreported sample set for multi-omics protocol development and data exploration (Fig. 1 and Supplementary Table 1). To normalize sample collection for this and future studies, we updated and followed the existing Earth Microbiome Project (EMP) sample submission guide (https://earthmicrobiome.org/protocols-and-standards/emp500-sample-submission-guide/)50, which we highlight here to encourage its use. In parallel, we followed standardized protocols for sample collection, sample tracking, sample metadata curation, sample shipping and data release, which are also detailed on the EMP website (https://earthmicrobiome.org/protocols-and-standards/) and described here (see Online Methods). Importantly, we updated the previous EMP Metadata Guide to accommodate the EMP500 sampling design as well as updates to other standardized ontologies (see Online Methods), including the Earth Microbiome Project Ontology (EMPO). EMPO classifies microbial environments (level 4) on the basis of host association (level 1), salinity (level 2), host kingdom (if host-associated) or phase (if free-living) (level 3) (Fig. 1a). EMPO now recognizes an important split within host-associated samples representing saline and non-saline environments (Fig. 1a) not detected in the EMP’s previous analysis of 16S rRNA from a separate set of <23,000 samples1.
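As a minimal sketch of how the EMPO levels nest, the Python snippet below derives levels 1 and 2 from an EMPO 4 style label. The label strings are environments named in this article; the `HOST_PREFIXES` lookup, the `EmpoRecord` class and the `classify` helper are illustrative assumptions rather than part of any EMP software, and level 3 is omitted because its full vocabulary is not listed in this section.

```python
from dataclasses import dataclass

# Assumption: labels starting with one of these words denote host-associated samples.
HOST_PREFIXES = ("Animal", "Plant", "Fungus")


@dataclass(frozen=True)
class EmpoRecord:
    empo_1: str  # host association
    empo_2: str  # host association combined with salinity
    empo_4: str  # specific environment label


def classify(empo_4: str) -> EmpoRecord:
    """Derive EMPO levels 1 and 2 from a label such as 'Water (saline)'."""
    salinity = "non-saline" if empo_4.endswith("(non-saline)") else "saline"
    if not empo_4.endswith(f"({salinity})"):
        raise ValueError(f"label has no salinity suffix: {empo_4!r}")
    empo_1 = "Host-associated" if empo_4.startswith(HOST_PREFIXES) else "Free-living"
    return EmpoRecord(empo_1, f"{empo_1} ({salinity})", empo_4)


for label in ("Soil (non-saline)", "Animal proximal gut (saline)"):
    print(classify(label))
```

Combining two host-association states with two salinity states yields four level 2 groups, which agrees with the four EMPO 2 groups reported for the PERMANOVA tests in Fig. 3.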
Fig. 1: Environment type and provenance of samples.
a, Distribution of samples (n = 880) among the Earth Microbiome Project Ontology (EMPO version 2) categories. EMPO recognizes strong axes of variation in microbial communities, and thus organizes all microbial environments (level 4) on the basis of host association (level 1), salinity (level 2), host taxon (for host-associated) or phase (free-living) (level 3). For EMPO 3 and EMPO 4: n-s, non-saline; s, saline. Colours indicate environments. Numbers indicate sample counts for each environment. Made with JSFiddle. b, Geographic distribution of samples with points coloured by EMPO 4. Points are transparent to highlight cases where multiple samples derive from a single location. We note here that our intent was to sample across environments rather than geography, in part because we previously showed that microbial community composition is more influenced by the former rather than the latter, but also to motivate finer-grained geographic exploration as sample analyses decrease in cost. Extensive information about each sample set is described in Supplementary Table 1. Made with Natural Earth.
For the majority of samples, we successfully generated data for bacterial and archaeal 16S rRNA, eukaryotic 18S rRNA, internal transcribed spacer (ITS) 1 of the fungal ITS region, bacterial full-length rRNA operon, shotgun metagenomics and untargeted metabolomics (that is, LC–MS/MS and gas chromatography coupled with mass spectrometry (GC–MS)) (Supplementary Table 2). To foster exploration of this previously unreported dataset, we have made the raw sequence and metabolomics data publicly available through Qiita (https://qiita.ucsd.edu; study ID: 13114)51 and GNPS (https://gnps.ucsd.edu; MassIVE IDs: MSV000083475, MSV000083743)52, respectively. We also provide complete protocols for laboratory and computational workflows for both metagenomics and metabolomics data for use by the broader community (available on GitHub at https://github.com/biocore/emp/blob/master/methods/methods_release2.md). We hope that the dataset and workflows presented here serve as useful tools for others, in addition to providing a framework for launching additional future studies. As an example of the utility of the dataset for addressing important questions in microbial community ecology, we present an analysis of microbially related metabolites and microbe–metabolite co-occurrences across Earth’s environments (Extended Data Fig. 1).
Metabolite intensities reveal habitat-specific distributions
In total, we generated untargeted metabolomics data (that is, LC–MS/MS) for 618 of 880 samples (Supplementary Table 2), resulting in 52,496 unique molecular structures, or metabolites, across all samples. We then refined that dataset to include only putative, microbially related metabolites (that is, defined as being produced, modified by, or otherwise associated with a microbe), resulting in 6,588 metabolites across all samples (12.55% of all metabolites). Focusing on this subset, we found that although the presence/absence of major classes of microbially related metabolites is relatively conserved across habitats, their relative intensities (that is, analogous to relative abundances for microbes) reveal specific chemistry that is lacking or enriched in particular environments (Fig. 2 and Extended Data Fig. 2).
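To illustrate the distinction between presence/absence and relative intensity, the following Python sketch uses two hypothetical habitats with invented intensity values (the class names appear in this article, the numbers do not). Both habitats contain the same classes, so their presence/absence profiles are identical, whereas their relative intensities differ. The final lines recompute the reported share of microbially related metabolites from the two counts given in the text.

```python
# Hypothetical summed intensities per metabolite class (illustrative values only).
habitats = {
    "habitat_A": {"Terpenoids": 900.0, "Polyketides": 50.0, "Alkaloids": 50.0},
    "habitat_B": {"Terpenoids": 100.0, "Polyketides": 700.0, "Alkaloids": 200.0},
}


def presence(profile):
    """Set of classes detected at any intensity above zero."""
    return {name for name, intensity in profile.items() if intensity > 0}


def relative_intensity(profile):
    """Each class's intensity as a fraction of the sample total."""
    total = sum(profile.values())
    return {name: intensity / total for name, intensity in profile.items()}


same_presence = presence(habitats["habitat_A"]) == presence(habitats["habitat_B"])
rel_a = relative_intensity(habitats["habitat_A"])
rel_b = relative_intensity(habitats["habitat_B"])
print("identical presence/absence:", same_presence)
print("terpenoid share:", rel_a["Terpenoids"], "vs", rel_b["Terpenoids"])

# Counts reported in the text: 6,588 microbially related of 52,496 metabolites.
microbial_fraction = 6588 / 52496
print(f"microbially related: {100 * microbial_fraction:.2f}%")
```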
Fig. 2: Distribution of microbially related secondary metabolite pathways and superclasses among environments.
a–d, Individual metabolites are represented by their higher-level classifications. Both chemical pathway and chemical superclass annotations are shown on the basis of presence/absence (a,c) and relative intensities (b,d) of molecular features, respectively. For superclass annotations in c and d, we included pathway annotations (when possible) for metabolites where superclass annotations were not available, and colours identify superclasses and pathways.
Importantly, when considering differences in the relative intensities of all microbially related metabolites, profiles for each habitat were so distinct that we could identify particular metabolites whose abundances were significantly enriched in certain environments (Fig. 3a and Supplementary Table 3). For example, metabolites annotated as carbohydrates (that is, excluding glycosides) were enriched in aquatic samples (log fold change (LFC): Water (non-saline) = 0.31 ± 1.22; Water (saline) = 0.54 ± 1.45) (Fig. 3a). Similarly, sediment, marine plant surface and fungal samples were enriched in polyketides (LFC: Sediment (non-saline) = 1.69 ± 0.64; Sediment (saline) = 1.56 ± 1.11; Plant surface (saline) = 1.22 ± 0.35; Fungus corpus (non-saline) = 1.68 ± 1.10) and soil, lake sediment and marine plant surface samples were enriched in shikimates and phenylpropanoids (LFC: Sediment (non-saline) = 1.90 ± 0.69; Soil (non-saline) = 1.33 ± 0.65; Plant surface (saline) = 1.09 ± 0.43) (Fig. 3a).
Fig. 3: Structural-level associations between microbially related secondary metabolites and specific environments.
a, Differential abundance of metabolites across environments. For each panel, the y axis represents the natural log-ratio of the intensities of ingroup metabolites divided by the intensities of reference group metabolites (that is, pathway reference: Amino acids and peptides, n = 615; superclass reference: Flavonoids, n = 42). The number of metabolites in each ingroup and the chi-squared statistic from a Kruskal–Wallis (KW) test for differences across environments are shown. For each test, n = 606 samples and P < 2.2 × 10^−16. Boxplots are Tukey’s, where the centre indicates the median, lower and upper hinges the first and third quartiles, respectively, and each whisker is 1.5× the interquartile range (IQR) from its hinge. b, Relationship between metabolite richness and microbial taxon richness, with significant correlations noted. P values are from two-tailed tests and were adjusted using the Benjamini-Hochberg procedure. c, Turnover in composition of metabolites across environments, visualized using RPCA, showing samples separated on the basis of metabolite abundances. Shapes represent samples. Arrows represent metabolites and are coloured by chemical pathway. The direction and magnitude of each arrow corresponds to the correlation between the metabolite’s abundance and the ordination axes. Samples close to arrow heads have strong positive associations, samples at arrow origins have no association, and those beyond arrow origins have strong negative associations. Metabolites are described in Supplementary Table 4. Metabolites annotated in red and purple were also highly differentially abundant across environments (Supplementary Table 3), and those in purple were also identified as important in co-occurrence analyses (Fig. 4). d, Turnover in composition of microbial taxa across environments, visualized using PCoA of weighted UniFrac distances.
For c and d, results from PERMANOVA (999 permutations) for each level of EMPO are shown (all tests had P = 0.001; group sizes for metabolites: k(EMPO 1) = 2, k(EMPO 2) = 4, k(EMPO 3) = 9, k(EMPO 4) = 18; group sizes for microbial taxa: k(EMPO 1) = 2, k(EMPO 2) = 4, k(EMPO 3) = 9, k(EMPO 4) = 19). Sample sizes in a refer to metabolites, but in all other panels refer to samples.
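The log-ratio and Kruskal–Wallis test described for panel a can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: intensities are assumed to be summed within the ingroup and the reference group before taking the ratio, the H statistic is computed without a tie correction, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def log_ratio(ingroup, reference):
    """Natural log of summed ingroup intensity over summed reference intensity."""
    return math.log(sum(ingroup) / sum(reference))


def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H (no tie correction) for a dict of environment -> values."""
    labels, values = [], []
    for label, vals in groups.items():
        labels.extend([label] * len(vals))
        values.extend(vals)
    rank_sums = defaultdict(float)
    for label, rank in zip(labels, average_ranks(values)):
        rank_sums[label] += rank
    n = len(values)
    total = sum(rank_sums[g] ** 2 / len(groups[g]) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical per-sample log-ratios grouped by environment.
per_environment = {
    "Water (saline)": [log_ratio([8.0, 4.0], [3.0]), log_ratio([9.0], [2.0])],
    "Soil (non-saline)": [log_ratio([1.0], [4.0]), log_ratio([2.0], [5.0])],
}
print(kruskal_wallis_h(per_environment))
```

With k environments, H is compared against a chi-squared distribution with k − 1 degrees of freedom, which is why the caption reports a chi-squared statistic.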
The total number of distinct metabolites (that is, richness) also varied strongly across environments (Fig. 3b). We note that whereas saline sediments were most rich, the surfaces of terrestrial plants were especially lacking in metabolite diversity (Fig. 3b). This contrasted with metabolite diversity in detritus of terrestrial plants, which was also high (Fig. 3b).
When considering the identity and relative intensity of each metabolite in the analysis of beta-diversity, we observed a separation of samples based on host association and salinity (permutational multivariate analysis of variance (PERMANOVA) for EMPO 2: pseudo-F = 92.66, P = 0.001), and among specific environments (PERMANOVA for EMPO 4: pseudo-F = 48.63, P = 0.001). We also observed specific environments clustering in ordination space and identified certain metabolite features that differentiate all samples (Fig. 3c and Supplementary Table 4). For the latter, we identified three metabolites also listed among the 10 most differentially abundant metabolites for each environment (Supplementary Table 3): one chalcone associated with the surfaces of terrestrial plants (C13H10O, ID: 4949), one glycerolipid associated with freshwater (C28H58O15, ID: 14665) and one cholane steroid associated with the distal guts of terrestrial animals (C24H34O2, ID: 25552) (Fig. 3c). As the separation of samples based on metabolite profiles appeared to mirror those based on microbial taxa (Fig. 3c,d), we additionally explored our shotgun metagenomics data.
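For readers unfamiliar with the pseudo-F statistic quoted above, the following is a minimal pure-Python sketch of a one-factor PERMANOVA on a precomputed distance matrix. It assumes the common convention P = (hits + 1) / (permutations + 1); the function names and the toy data are hypothetical, and the distance measure actually used in the study is described in its Online Methods, not here.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """Pseudo-F from squared pairwise distances partitioned among and within groups."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(groups)
    # Raises ZeroDivisionError if every group has zero internal spread.
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation P value) by shuffling group labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: six samples on a line, three per hypothetical environment.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = ["free-living"] * 3 + ["host-associated"] * 3
print(permanova(dist, labels))
```

Under this convention, 999 permutations cannot yield a P value below 1/1000, so a reported P = 0.001 means that no permuted labelling produced a pseudo-F as large as the observed one.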
Correlation between metabolite and microbial alpha-diversity
We first explored whether metabolite alpha-diversity was related to microbe alpha-diversity. We found significant positive correlations between metabolite richness and microbial taxon richness across all samples (r = 0.20, P < 0.001), within host-associated samples (r = 0.19, P < 0.01), within free-living samples (r = 0.18, P < 0.05) and for certain environments: Animal proximal gut (saline) (r = 0.73, P < 0.01), Plant detritus (non-saline) (r = 0.74, P < 0.001), Sediment (non-saline) (r = 0.42, P = 0.05) and Water (saline) (r = 0.57, P = 0.01) (Fig. 3b and Supplementary Table 6). We observed non-significant trends in correlations for Plant surface (non-saline) (r = −0.36, P = 0.2) and Sediment (saline) (r = 0.27, P = 0.1) (Fig. 3b and Supplementary Table 6). Relationships for other environments were weaker (Fig. 3b and Supplementary Table 6). Sediment samples had the highest alpha-diversity of both microbial taxa and metabolites (Fig. 3b). Correlations with metabolite richness were weaker when using Faith’s phylogenetic diversity (PD) and weighted Faith’s PD for microbial taxa (Supplementary Table 6).
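Because many environments are tested separately, the P values for these correlations are reported as Benjamini-Hochberg adjusted (Fig. 3b caption). The sketch below shows the standard step-up adjustment in plain Python; the input P values are invented for illustration and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values from four per-environment correlation tests.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw value is scaled by m / rank, so the smallest P values are penalized most heavily, and the running minimum keeps the adjusted values in the same order as the raw ones.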
Turnover and nestedness are related to the environment
Next, we examined whether metabolite diversity among environments (that is, beta-diversity) was driven by either turnover (that is, the replacement of features) or nestedness (gain/loss of features leading to differences in richness)1,53. We first looked at turnover. We already noted similarity in the clustering of samples by environment between microbially related metabolite and microbial taxon datasets (Fig. 3c,d). We also observed a strong correlation between sample–sample distances based on metabolites vs microbial taxa (Table 1). Interestingly, we observed a stronger effect of salinity when comparing samples on the basis of microbial taxa vs metabolites (PERMANOVA on salinity: pseudo-F = 40.94 for microbes vs 8.25 for metabolites, P = 0.001 for both tests) (Fig. 3c,d). Furthermore, when focusing on the separation of samples within a single environment such as soil, we observed much more variability between metabolite and microbial taxon datasets (Mantel r = 0.32 for soil vs 0.43 for all environments, P = 0.001 for both tests). This highlights the unique composition among soil samples from distinct locations (Extended Data Fig. 3), and also the insight that was gained from analysis at different scales (that is, only soils vs all habitats). To assess whether metabolite profiles were more similar to those for microbial taxa vs microbial functions, we annotated our metagenomic reads to profile enzymes. We found the separation of samples based on microbial functions to be unique and largely driven by animal gut samples as compared to separation based on either metabolites or microbial taxa (Extended Data Fig. 4). However, correlations in sample–sample distances between microbial functional data and other datasets were strong (Table 1).
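The Mantel r values quoted above measure how well two sample-by-sample distance matrices agree. As a rough illustration, the following pure-Python sketch computes a Mantel statistic with a one-sided permutation test. It assumes Pearson correlation and the (hits + 1) / (permutations + 1) convention, neither of which is stated in this section; the function names and toy matrices are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(d1, d2, permutations=999, seed=0):
    """Correlate the upper triangles of d1 and d2; permute sample order of d2."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    x = [d1[i][j] for i, j in pairs]
    observed = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        y = [d2[order[i]][order[j]] for i, j in pairs]
        if pearson(x, y) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: "metabolite" distances and "taxon" distances for four samples.
coords = [0.0, 1.0, 3.0, 7.0]
metabolite_dist = [[abs(p - q) for q in coords] for p in coords]
taxon_dist = [[2.0 * abs(p - q) for q in coords] for p in coords]
print(mantel(metabolite_dist, taxon_dist))
```

Samples are permuted as whole rows and columns rather than as individual distances, because the entries of a distance matrix are not independent of one another.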
Table 1: Mantel test results comparing data layers generated for the EMP500 samples
In the absence of complete turnover in metabolites and microbial taxa across environments, apparent in the overlap of clusters representing different habitats in our ordinations (Fig. 3c,d), we quantified nestedness. Nestedness describes the degree to which features in one environment are nested subsets of another environment, and can provide insight into community assembly dynamics1,53. We found that samples were significantly nested on the basis of both metabolites (Extended Data Fig. 5) and microbial taxa (Extended Data Fig. 6), and that certain environments were consistently nested within others, although this pattern varied between datasets. For example, on the basis of microbial taxa, we observed host-associated samples to be nested within free-living ones (Extended Data Fig. 6a); however, the opposite was true for metabolites (Extended Data Fig. 5a). When considering host association and salinity (that is, EMPO 2) for metabolites, free-living samples were more nested than host-associated ones, and within each group, non-saline samples were more nested than saline ones (Extended Data Fig. 5d). This pattern remained consistent when describing metabolites at the superclass, class and molecular formula levels (Extended Data Fig. 5d). Patterns of nestedness were less consistent across taxonomic levels when based on microbial taxa, although non-saline, free-living samples were the most nested across the family, genus and species levels (Extended Data Fig. 6d). When considering all environments together (that is, for EMPO 3 and 4), we observed stronger patterns of nestedness among environments for microbial taxa (Extended Data Fig. 6b,c) vs metabolites (Extended Data Fig. 5b,c). However, we observed that patterns of nestedness were somewhat similar between microbial taxa and metabolites for host-associated environments, except for plant surfaces (Extended Data Figs. 5e and 6e).
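To make the contrast between nestedness and turnover concrete, the sketch below scores how often the features of a poorer sample are a subset of those of a richer sample. It is a simplified paired-overlap measure in the spirit of the sample (row) component of NODF; whether the study used NODF or another metric is not stated in this section, and the function name and feature sets are hypothetical.

```python
from itertools import combinations


def paired_nestedness(samples):
    """Mean percentage of each poorer sample's features found in the richer sample.

    Pairs with equal richness score 0, as in NODF-style measures.
    """
    scores = []
    for a, b in combinations(samples.values(), 2):
        rich, poor = (a, b) if len(a) > len(b) else (b, a)
        if len(rich) == len(poor) or not poor:
            scores.append(0.0)
        else:
            scores.append(100.0 * len(poor & rich) / len(poor))
    return sum(scores) / len(scores)


# Perfect nestedness: each poorer sample is a subset of every richer one.
nested = {"s1": {"m1", "m2", "m3"}, "s2": {"m1", "m2"}, "s3": {"m1"}}
# Pure turnover: the two samples share no features.
turnover = {"s1": {"m1", "m2", "m3"}, "s2": {"m4", "m5"}}

print(paired_nestedness(nested), paired_nestedness(turnover))
```

A score near 100 indicates that differences among samples arise mainly from gain or loss of features, whereas a score near 0 indicates replacement of features, that is, turnover.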
Metabolites and microbes distinguish habitats
On the basis of the strong relationships among metabolites, microbes and the environment, we next tested the hypothesis that specific metabolites, microbial taxa or microbial functional products (that is, enzymes) could be used to classify samples among environments. Importantly, features that classify samples among habitats can serve as indicators for detecting certain environmental states or environmental change, or for predicting the diversity of other features. Using a machine-learning classifier (see Online Methods), we identified specific metabolites that classified samples among environments with 88.0% overall accuracy (Fig. 4a, Extended Data Fig. 7a, and Supplementary Fig. 1 and Table 7). After ranking all metabolites on the basis of their impact in distinguishing environments, we found those top ranked to include a diterpenoid negatively associated with non-saline soils (C20H32, ID: 04492), an undescribed metabolite positively associated with marine sediments (ID: 42202) and a lignan negatively associated with freshwater sediments (C20H20O5, ID: 07899) (Fig. 4a and Supplementary Table 7). Among the top 20 ranked metabolites with annotations, the majority were alkaloids, fatty acids or terpenoids, with terpenoids being the most impactful among the top 10 ranked metabolites, including the most highly ranked one (Fig. 4a and Supplementary Table 7).
Fig. 4: Machine-learning analysis of microbially related metabolites, microbial taxa and microbial functions, highlighting the top 20 most impactful features for each dataset.
a, The top 20 most impactful microbially related metabolites. Features are coloured by metabolite pathway. Metabolites in bold font are those also identified as important in differential abundance analysis (Supplementary Table 3). b, The top 20 most impactful microbial taxa (that is, OGUs). Taxa are coloured by phylum. c, The top 20 most impactful microbial functions (that is, KEGG ECs). Boxplots are in the style of Tukey, where the centre line indicates the median, lower and upper hinges the first and third quartiles, respectively, and each whisker is 1.5× IQR from its respective hinge. Enzymes are coloured by class. For all features, ranks are based on impacts derived from SHAP values. Associations with environments are indicated, where + indicates a positive association and – indicates a negative association based on feature abundances. Diamonds and values to the right of boxes indicate means. Values in parentheses indicate (1) the number of iterations (n = 20) in which a feature had no impact and (2) the number of iterations in which the reported association was observed, for cases in which values were <20. Environments are described by the Earth Microbiome Project Ontology (EMPO 4).
We also found strong support among methods for the importance of particular metabolites in distinguishing environments. For example, the undescribed metabolite positively associated with marine sediments (that is, ID: 42202) and one fatty acid—a monoacylglycerol (that is, ID: 42202)—revealed as useful in classification in this analysis also stood out in our analysis of differential abundance (Fig. 4a, and Supplementary Tables 3 and 7). Similarly, distinct analytical approaches identified specific metabolites as particularly important for distinguishing aquatic samples (that is, one glycerolipid, C28H58O15, ID: 14665 and one pseudoalkaloid, C18H22N7O5, ID: 14675), non-saline plant surface samples (that is, one chalcone, C13H10O, ID: 4949) and non-saline animal distal gut samples (that is, one cholane steroid, C24H38O4, ID: 2552 and one prenyl quinone monoterpenoid, C29H46O2, ID: 22299) (Fig. 3c, and Supplementary Tables 3 and 4).
Using the same machine-learning approach on our metagenomic sequence data, we identified specific microbial taxa and microbial functional products (that is, enzymes) useful in classifying samples to environments, with 88.8% and 88.9% overall accuracy, respectively (Fig. 4b,c, Extended Data Fig. 7a, and Supplementary Figs. 2 and 3). We observed that the majority of the top 20 ranked microbial taxa with respect to classification performance were Proteobacteria (Fig. 4b). Cyanobacteria, Firmicutes and Actinobacteria were represented by a few members each, and Candidatus Tectomicrobia and Euryarchaeota were represented as singletons (Fig. 4b). The most highly ranked taxon, Conexibacter woesei (G000424625, Actinobacteria), was positively associated with non-saline soils, and is an early-diverging member of the class Actinobacteria first isolated from temperate forest soil in Italy54 (Fig. 4b). Also among the top ranked taxa were Haloquadratum walsbyi (Euryarchaeota) positively associated with saline soils, and Pantoea dispersa (Gammaproteobacteria) positively associated with the detritus of terrestrial plants (Fig. 4b). For microbial functions, we note that the majority of the top 20 most highly ranked enzymes with respect to classification performance were oxidoreductases or transferases, followed by hydrolases, and then isomerases and lyases (Fig. 4c). The most highly ranked enzyme was positively associated with non-saline soils and was a trehalohydrolase (enzyme code (EC): 3.2.1.141), an enzyme that binds trehalose, a carbon-source commonly produced by soil inhabitants including plants, invertebrates, bacteria and fungi, with potential roles in symbioses55. Also among the most highly ranked enzymes were a glutamate carboxylase (EC: 4.1.1.90) positively associated with the surfaces of marine plants, and a linoleate lipoxygenase (EC: 1.13.11.60) positively associated with lichen thalli (Fig. 4c).
Metabolite–microbe co-occurrences are habitat-specific
In addition to exploring relationships between metabolite and microbial diversity, we sought to explicitly quantify metabolite–microbe co-occurrence patterns. Beyond relating metabolites to the microbes that potentially interact with them, certain metabolite–microbe pairs may have stronger associations with the environment than any one feature set alone and may serve as emergent indicators. In particular, we examined associations between metabolites and the environment (for example, Fig. 3a,c) while also considering each metabolite’s co-occurrence with all microbes in the dataset (Extended Data Fig. 1). In that regard, we first generated metabolite–microbe co-occurrences learned from both LC–MS/MS- and shotgun metagenomic profiles across all samples, for a cross-section of 6,501 microbially related metabolites and 4,120 microbial taxa (Extended Data Figs. 8 and 9). Whereas most metabolites co-occurred with at least a few microbes, few metabolites were found to co-occur with many microbes (Extended Data Fig. 8a). The distribution of co-occurrences was not heavily shifted towards any particular pathway (Extended Data Fig. 8b); however, certain superclasses exhibited co-occurrences with many microbes, including diarylheptanoids and phenylethanoids (C6-C3) (Extended Data Fig. 8c). Similarly for microbes, co-occurrences with metabolites were not heavily skewed towards particular phyla, although specific clades were enriched, such as the most recently diverged members of the Bacteroidetes (Extended Data Fig. 9). In contrast to their co-occurrences with metabolites, changes in microbial abundances with respect to the environment appear to be phylogenetically conserved, and correlated with salinity and association with the animal gut environment (Extended Data Fig. 9).
Next, using metabolite–metabolite distances based on co-occurrence profiles considering all microbes, we ordinated metabolites in microbe space. We then examined correlations between metabolite loadings on the principal coordinates of that co-occurrence ordination and (1) log fold changes of metabolites across environments (for example, Fig. 3a) and (2) distributions of metabolites across all samples (that is, loadings and overall magnitude from ordination of all samples) (Fig. 3c), and found strong relationships with each (Fig. 5a). In particular, the abundances of microbially related metabolites in plant surface (saline), sediment (saline) and aquatic samples (that is, those from water) had strong correlations with microbe–metabolite co-occurrences (Fig. 5a). Focusing on seawater (that is, Water (saline)), we visualized the correlation between metabolite loadings on PC1 of the co-occurrence ordination, which represent differences based on co-occurrences with microbes (Fig. 5b), and log fold changes in metabolite abundances with respect to seawater (Fig. 5c). In this space, features with high values for both vectors should be associated with the same microbes and also highly abundant in the ocean, whereas features with low values for both vectors should be associated with the same microbes and have low-to-zero abundance in the ocean (Fig. 5c). Focusing on one group of carbohydrates (excluding glycosides) and one group of terpenoids (Fig. 5c,d), we found significant differences in their intensities in seawater vs all other environments (Fig. 5e), as well as in the abundances of their top co-occurring microbial taxa (Fig. 5f). Importantly, by relying on our metabolite intensity data, this result validates patterns identified in our analyses of differential abundance across environments and co-occurrence with microbial taxa. We used this same approach to explore metabolite–microbe co-occurrences specific to other environments (Extended Data Fig. 10 and Supplementary Table 5), further revealing strong turnover in metabolite–microbe co-occurrences across habitats.
Fig. 5: Metabolite–microbe co-occurrences vary across environments.
a, Correlation between metabolite loadings from the co-occurrence ordination (that is, co-occurrence PCs) and (1) log fold changes in metabolite abundances across environments, (2) metabolite loadings from the ordination in Fig. 3d (that is, Global distribution, axes 1–3) and (3) a vector representing the overall magnitude of microbial taxon abundances from the ordination in Fig. 3d (that is, Global distribution, Overall magnitude). Values are Spearman correlation coefficients. Asterisks indicate significant correlations (*P < 0.05, **P < 0.01, ***P < 0.001). b, The relationship between log fold changes in metabolite abundance with respect to ‘Water (saline)’ and the first three PCs of the co-occurrence ordination. Points represent metabolites, and the distance between metabolites indicates similarity in their co-occurrences with microbial taxa. Metabolites are coloured on the basis of log fold changes with respect to ‘Water (saline)’. Arrows represent specific microbial taxa (colours), distances between arrow tips indicate similarity in their co-occurrence with specific metabolites, and the direction of each arrow indicates which metabolites each microbe co-occurs most strongly with. c, The relationship between log fold changes in metabolite abundances with respect to ‘Water (saline)’ and loadings for metabolites on PC1 of the co-occurrence ordination. The correlation is one example from a. Metabolites are coloured by pathway. Select carbohydrates (excluding glycosides) (the focal group) and select terpenoids (the reference group) are highlighted. d, The top 10 co-occurring microbial taxa for all select carbohydrates and all select terpenoids, with a heat map showing co-occurrence strength. e, Log-ratio of metabolite intensities for select carbohydrates and select terpenoids. f, Log-ratio of abundances of the top 10 microbial taxa associated with select carbohydrates and with select terpenoids.
For e and f, points represent samples, and results from a t-test comparing ‘Water (saline)’ vs all other environments are shown. Boxplots are Tukey’s, where the centre indicates the median, lower and upper hinges the first and third quartiles, respectively, and each whisker represents 1.5× IQR from its hinge. For a, c, e and f, P values are from two-sided tests. For a and c, P values were adjusted using the Benjamini-Hochberg procedure.
Correlations with amplicon sequence data and GC–MS data
To begin to explore the additional data generated for EMP500 samples, including GC–MS and amplicon sequence data (that is, bacterial and archaeal 16S and full-length rRNA operon, eukaryotic 18S, fungal ITS), we compared sample–sample distances (that is, beta-diversity) between each pair of datasets. Beyond providing insight into how certain community data are related, strong correlations between datasets may indicate similarity in the structuring of features among samples or habitats. Importantly, we found further support for a strong relationship between microbially related metabolites and microbial taxa (LC–MS/MS vs 16S; r = 0.27, P = 0.001) (Table 1). The relationships between the metabolomics data (that is, LC–MS/MS or GC–MS) and sequence data from eukaryotes (that is, 18S or ITS) were weaker (for example, LC–MS/MS vs ITS; r = 0.07, P = 0.006) (Table 1). The weakest relationships were between sequence data from Bacteria and Archaea (that is, 16S or shotgun metagenomics) and sequence data from eukaryotes (that is, 18S or ITS) (for example, shotgun metagenomics for taxa vs. 18S; r = –0.002, P = 0.9) (Table 1). The strongest relationships were between different layers of sequence data from Bacteria and Archaea (Table 1). For example, correlations between 16S rRNA profiles and those from full-length rRNA operons had r = 0.55 (P = 0.001), and 16S vs shotgun metagenomics (taxa) had r = 0.51 (P = 0.001) (Table 1). These results highlight the strong relationship between metabolic profiles and microbial taxonomic composition across habitats spanning the globe.
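The comparisons of sample–sample distances summarized in Table 1 are Mantel tests, which correlate two distance matrices computed on the same samples and assess significance by permuting sample labels. The following is a minimal sketch of that idea, assuming a Pearson correlation and a two-sided permutation test; the choice of statistic, the number of permutations, the function names (`mantel`, `pearson`, `distance_matrix`) and the toy coordinates are all assumptions made for illustration rather than details taken from the study.

```python
import math
import random


def distance_matrix(coords):
    """Toy helper: absolute differences between one-dimensional sample coordinates."""
    return [[abs(a - b) for b in coords] for a in coords]


def upper_triangle(dist):
    """Flatten the upper triangle (excluding the diagonal) of a square matrix."""
    n = len(dist)
    return [dist[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, p) for a two-sided Mantel test between two distance matrices."""
    n = len(d1)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        # Permute rows and columns of d2 together, i.e. shuffle sample labels.
        rng.shuffle(order)
        permuted = [[d2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if abs(pearson(x, upper_triangle(permuted))) >= abs(r_obs):
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value


# Hypothetical example: two data layers that order five samples similarly.
layer_a = distance_matrix([0, 1, 3, 6, 10])
layer_b = distance_matrix([0, 2, 5, 9, 14])
r, p = mantel(layer_a, layer_b)
print(f"Mantel r = {r:.3f}, P = {p:.3f}")
```

With 999 permutations the smallest attainable P value in this sketch is 0.001. Permuting whole samples, rather than individual distances, is the design choice that respects the non-independence of entries in a distance matrix.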
Discussion
Here we discuss some of the caveats and limitations of our study, and further highlight how our approach advances understanding of microbial community dynamics and functional diversity. Due to their extensive nature, we provide additional important points of discussion as Supplementary Information. We begin by recognizing that certain environments included in EMPO are represented here by only a handful of samples (Fig. 1) and/or a single sample set (Supplementary Table 1), and note that we had to exclude them from some of our analyses due to low representation (for example, machine learning and co-occurrence analyses). We recommend that future efforts focus on additional sampling of these environments to further generalize our findings to those habitats. Similarly, we hope to expand sampling geographically to broaden our scope of inference, as many important environments and locations could not be included here (or, indeed, in the EMP’s 27,000-sample dataset1). We also note that the inherent design of the EMP (that is, crowd-sourced samples from experts in respective fields) prevented us from explicitly exploring causation with respect to the environment in our analysis, and thus our findings are based largely on observations and correlations among feature sets and associated metadata.
In our example analysis, we explored whether every metabolite is everywhere but the environment selects (that is, the Baas Becking hypothesis42,43, but for microbially related metabolites). Whereas we interpret our findings as strong evidence that every metabolite is everywhere but the environment selects, our study was not designed to address this hypothesis explicitly, and further evidence is needed to support this hypothesis. For example, features at abundances below the detection limit of our approach could not be considered here, but may alter our view of these patterns. Similarly, although input sample volumes were normalized as best as possible, they may influence estimates of alpha-diversity, and the values reported here probably exhibit some error in part due to this influence. We also identified metabolite–microbe co-occurrences, and note that our approach for characterizing co-occurrences, ‘mmvec’56, does not currently allow for controlling for covariates and this may influence results. However, in our analysis we were able to include EMPO as a variable, which we designed to account for variation among environments that may not be captured by available metadata.
Here we described patterns of turnover, nestedness and co-occurrence of metabolites and microbes across a diverse set of environments while addressing ecological questions surrounding the distribution of metabolites and their relationships with microbial taxonomic and functional diversity. One outstanding question in microbial ecology asks how microbial taxon profiles can be integrated with functional ones57. Here, in addition to describing microbial taxa, their functions and their metabolites, we explicitly tested for metabolite–microbe co-occurrences and explored how they relate to the environment, for which we have outlined our approach (Extended Data Fig. 1). Our analysis provides insight into biological processes including microbial community assembly and links microbial taxonomic profiles with metabolism and functional diversity (that is, enzymes) at planetary scale. Our work provides an initial view of how microbially related metabolites are structured with respect to factors including host association, salinity and the presence of certain microbes (Figs. 3 and 5). Importantly, we identified the most abundant and highly ranked pathway representing the metabolites best able to distinguish environments to be terpenoids58, highlighting the importance of this group of metabolites in distinguishing Earth’s environments (Fig. 4a and Supplementary Table 7).
We acknowledge that previous studies describing microbial taxa and function using globally distributed sample sets, such as for the human gut, soils and the ocean, have shown that both can vary across locations59,60,61,62. Similarly, studies examining metabolite profiles across changes in microbial community composition, or environmental stress such as from heat, have shown variation associated with either20,21 or both23. Furthermore, among previous multi-omics studies combining metagenomics with metatranscriptomics, metaproteomics and/or metabolomics, some of which have shown the correlation between data layers to vary across sites, the majority are focused on a single environment63,64,65,66,67,68,69,70,71,72,73. Here we performed multi-omics integration of a dataset encompassing a diversity of environmental sample types representing several habitats, generated using standardized methods allowing for robust meta-analysis with data from other studies using the same approach.
Our approach illustrates that recent advances in computational annotation tools offer a powerful toolbox to interpret untargeted metabolomics data41. We anticipate that parallel advances in metagenomic sequencing, genome assembly and genome mining will improve the discovery and classification of functional products from among microbes and provide additional insight into these findings. By following standardized methods available on GitHub and making this dataset publicly available in Qiita and GNPS, this study will serve as an important resource for continued collaborative investigations. In the same manner, the development of optimized instrumentation and computational methods for metabolomics will expand the depth of metabolites surveyed in microbiome studies.
Methods
Dataset description
Sample collection
Our research complies with all relevant ethical regulations following policies at the University of California, San Diego (UCSD). Animal samples that were sequenced were not collected at UCSD and are not for vertebrate animals research at UCSD following the UCSD Institutional Animal Care and Use Committee (IACUC). Samples were contributed by 34 principal investigators of the Earth Microbiome Project 500 (EMP500) Consortium and are samples from studies at their respective institutions (Supplementary Table 1). Relevant permits and ethics information for each parent study are described in the ‘Permits for sample collection’ section below. Samples were contributed as distinct sets referred to here as studies, where each study represented a single environment (for example, terrestrial plant detritus). To achieve more even coverage across microbial environments, we devised an ontology of sample types (microbial environments), the EMP Ontology (EMPO) (http://earthmicrobiome.org/protocols-and-standards/empo/)1, and selected samples to fill out EMPO categories as broadly as possible. EMPO recognizes strong gradients structuring microbial communities globally, and thus classifies microbial environments (level 4) on the basis of host association (level 1), salinity (level 2), host kingdom (if host-associated) or phase (if free-living) (level 3) (Fig. 1a). As we anticipated previously1, we have updated the number of levels as well as states therein for EMPO (Fig. 1b) on the basis of an important additional salinity gradient observed among host-associated samples when considering the previously unreported shotgun metagenomic and metabolomic data generated here (Fig. 3c,d). We note that although we were able to acquire samples for all EMPO categories, some categories are represented by a single study.
Samples were collected following the Earth Microbiome Project sample submission guide50. Briefly, samples were collected fresh, split into 10 aliquots and then frozen, or alternatively collected and frozen, and subsequently split into 10 aliquots with minimal perturbation. Aliquot size was sufficient to yield 10–100 ng genomic DNA (approximately 10^7–10^8 cells). To leave samples amenable to chemical characterization (metabolomics), buffers or solutions for sample preservation (for example, RNAlater) were avoided. Ethanol (50–95%) was allowed as it is compatible with LC–MS/MS although it should also be avoided if possible.
Sampling guidance was tailored for four general sample types: bulk unaltered (for example, soil, sediment, faeces), bulk fractionated (for example, sponges, corals, turbid water), swabs (for example, biofilms) and filters. Bulk unaltered samples were split fresh (or frozen), sampled into 10 pre-labelled 2 ml screw-cap bead beater tubes (Sarstedt, 72.694.005 or similar), ideally with at least 200 mg biomass, and flash frozen in liquid nitrogen (if possible). Bulk fractionated samples were fractionated as appropriate for the sample type, split into 10 pre-labelled 2 ml screw-cap bead beater tubes, ideally with at least 200 mg biomass, and flash frozen in liquid nitrogen (if possible). Swabs were collected as 10 replicate swabs using 5 BD SWUBE dual cotton swabs with wooden stick and screw cap (281130). Filters were collected as 10 replicate filters (47 mm diameter, 0.2 µm pore size, polyethersulfone (preferred) or hydrophilic PTFE filters), placed in pre-labelled 2 ml screw-cap bead beater tubes, and flash frozen in liquid nitrogen (if possible). All sample types were stored at –80 °C if possible, otherwise –20 °C.
To track the provenance of sample aliquots, we employed a QR coding scheme. Labels were affixed to aliquot tubes before shipping when possible. QR codes had the format ‘name.99.s003.a05’, where ‘name’ is the PI name, ‘99’ is the study ID, ‘s003’ is the sample number and ‘a05’ is the aliquot number. QR codes (version 2, 25 pixels × 25 pixels) were printed on 1.125″ × 0.75″ rectangular and 0.437″ circular cap Cryogenic Direct Thermal labels (GA International, DFP-70) using a Zebra model GK420d printer and ZebraDesigner Pro 3 software for Windows. After receipt but before aliquots were stored in freezers, QR codes were scanned into a sample inventory spreadsheet using a QR scanner.
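The label format described above lends itself to simple programmatic validation when scanned codes are loaded into an inventory. The following is a minimal sketch of a parser for strings of the form ‘name.99.s003.a05’. The names `AliquotLabel`, `LABEL_PATTERN` and `parse_label` are hypothetical, and the sketch assumes that the PI name contains no full stops and that the numeric fields may be zero-padded to any width; neither assumption is specified in the text.

```python
import re
from typing import NamedTuple

# Assumed layout: <PI name>.<study ID>.s<sample number>.a<aliquot number>
LABEL_PATTERN = re.compile(
    r"^(?P<pi>[^.]+)\.(?P<study>\d+)\.s(?P<sample>\d+)\.a(?P<aliquot>\d+)$"
)


class AliquotLabel(NamedTuple):
    pi_name: str
    study_id: int
    sample_number: int
    aliquot_number: int


def parse_label(text):
    """Split a scanned label into its four fields, or raise ValueError."""
    match = LABEL_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Label does not match the expected format: {text!r}")
    return AliquotLabel(
        pi_name=match["pi"],
        study_id=int(match["study"]),
        sample_number=int(match["sample"]),
        aliquot_number=int(match["aliquot"]),
    )


label = parse_label("name.99.s003.a05")
print(label)
```

Rejecting malformed strings at scan time, rather than later during analysis, is one way to keep aliquot provenance consistent across the ten aliquots generated per sample.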
Sample metadata
Environmental metadata were collected for all samples on the basis of the EMP Metadata Guide, which combines guidance from the Genomics Standards Consortium MIxS (Minimum Information about any Sequence) standard74 and the Qiita Database (https://qiita.ucsd.edu)51. The metadata guide provides templates and instructions for each MIxS environmental package (that is, sample type). Relevant information describing each PI submission, or study, was organized into a separate study metadata file (Supplementary Table 1).
Metabolomics