Barcoding coffee grounds—Exploring pteropod gastropod biodiversity with dregs in collection jars

Summary: Despite their cosmopolitan occurrence and massive plankton sampling during expeditions, the genetic diversity within Pteropoda Cuvier, 1804 is still largely unexplored. In this study we present a next-generation environmental bar-coding approach to zooplankton bulk samples, which were collected during the circumglobal 2010 Malaspina expedition to evaluate pteropod diversity. We introduce a technique that avoids destructive procedures and leaves material intact for further morphological investigations. We extracted DNA out of the dregs (organic material such as mucus or body parts) of 27 sample containers for molecular barcoding (average 100-260 bp of COI). We were able to identify 7128 operational taxonomic units corresponding to the species composition contained in the examined samples. Among them were three species of thecosome pteropods, Creseis acicula , Creseis virgula and Cavolinia inflexa , which are discussed with respect to their taxonomy and their geographic distribution. Unidentified gymnosomes were also present in our samples from warmer regions in oceanic waters of the southern Indian Ocean. To facilitate identification of species, it is beneficial to create a better database of pteropod COI barcodes. Furthermore, gathering environmental barcoding data on a broad global scale will help to better understand species abundance and distribution of pteropods in the world’s oceans, and potentially those of other planktonic organisms.


INTRODUCTION
There are more than 230000 known metazoan species populating the world's oceans. However, as a result of climate change, ocean acidification and marine pollution, the increasing loss of biodiversity presents a daunting challenge to taxonomists, requiring the discovery and analysis of biodiversity at a greatly accelerated pace. In the face of growing extinction rates that are without much doubt outpacing the number of discoveries of new taxa, fast and accurate biodiversity analysis methods are urgently needed (Bucklin et al. 2011). Especially problematic is the taxonomic treatment of large-scale environmental bulk samples such as phyto-and zooplankton. Sorting and identifying the various organisms requires a lot of time before reliable diversity assessments are possible. Furthermore, traditional morphological approaches are limited and less efficient for analysing bulk samples or samples lacking distinguishing phenotypic features (for example immature or damaged specimens). It is well established that genetic markers, and especially COI barcoding, are a complementary tool to traditional morphology-based taxonomic research for the identification and delimitation of different lineages (Hajibabaei et al. 2007).
Applying barcoding methods to (environmental) organismic DNA material improves traditional biomonitoring activity: excluding uncertainties such as morphological identification, low detection probabilities and sampling methods (challenges of gear deployment) increases confidence in the monitoring results (e.g. Bucklin et al. 2021, Wang et al. 2021, Di Capua et al. 2022. DNA barcoding could thus accelerate the inventory analysis of biological diversity, especially of bulk samples such as those of sediments or plankton, and of older museum samples. A good example of such bulk sample collection is the worldwide multidisciplinary Malaspina expedition, in which over 70000 samples of water, air and plankton were gathered in different ocean regions from the surface down to 5000 m depth. This immense collection was sorted and divided into several sub-collections that were accessible for scientific research. Our focus is on the holopelagic group Pteropoda (thecosomes and gymnosomes), a group of gastropods with an important ecological role in the marine environment as microplankton grazers and as prey for fish and other zooplankton. Despite their cosmopolitan distribution, the genetic diversity within this group is still largely uncertain (Burridge et al. 2017a,b).
A comprehensive insight into the present diversity and distribution of these planktonic molluscs is an important prerequisite for stating possible future changes in species composition and also in species-specific responses to changing conditions in the marine environment. So far, studies focusing on pteropod species distribution patterns were geographically limited to certain marine regions (e.g. Jennings et al. 2010, Burridge et al. 2017b, and there are still areas where pteropod diversity remains unknown (Burridge et al. 2017b). Here, with the help of the 2010 Malaspina zooplankton samples, we generated new data using an environmental barcoding approach on the debris taken from 27 selected bulk samples from different oceans in order to gain a better understanding of species abundance and distribution of pteropods and to explore the potential of the method for broad-scale application.

Sampling
Plankton samples were gathered globally at 154 locations. The hauls were conducted in the morning and in the evening. Out of 154 we selected 27 locations of interest in the Caribbean Sea, the Atlantic Ocean and the Indian Ocean (Table 1). The locations were chosen to coincide with those where frequent occurrence of certain pteropod species had been previously described (e.g. Burridge et al. 2017b), (Fig. 1).

Sample preparation
Out of 27 bulk samples the preservation medium and the bottom content in the collection jars were extracted and filtered for organic material. Species identification of organic material was performed using DNA metabarcoding following the protocol published in Hausmann et al. (2020). Each single sample was dried in a 60°C oven for at least eight hours and subsequently homogenized in a FastPrep96 machine (MP Biomedicals) using sterile steel beads in order to generate a homogeneous mixture of faeces material before it was submitted for metabarcoding (conducted by AIM GmbH). Prior to DNA extraction, 1 mg of each homogenizate was weighed into sample vials and processed using adapted volumes of lysis buffer with the DNeasy 96 Blood and Tissue Kit (Qiagen) following the manufacturer's instructions. For amplification of the CO1-5P target region and preparation of the MiSeq libraries, a two-step PCR was performed. First, a 313-bp-long mini-barcode region was amplified by PCR (Leray et al. 2013, Morinière et al. 2016) using forward and reverse high-throughput sequencing (HTS) primers equipped with complementary sites for the Illumina sequencing tails. In a subsequent PCR reaction, index primers with unique i5 and i7 inline tags and sequencing tails were used for amplification of indexed amplicons. Equimolar amplicon pools were then created and size-selected using preparative gel electrophoresis. Cleanup and concentration of amplicons were performed using the GeneJet Extraction Kit (Life Technologies). A bioanalyser (High Sensitivity DNA Kit, Agilent Technologies) was used for a final check of the bp distribution and concentration of the amplicons before the creation of the final library. All samples were pooled into one library, equimolar adjusted to 100 ng µL -1 . Samples had DNA concentrations between 10.4 and 12.4 ng µL -1 . HTS was performed on an Illumina MiSeq using v2 chemistry (2*250 bp, 500 cycles, maximum of 20mio reads) (Illumina). All samples were analysed on a single MiSeq run. The bioinformatics processing of raw FASTQ files from Illumina was carried out using the VSEARCH suite v2.9.1 (Rognes et al. 2016) and Cutadapt v1.18 (Martin 2011). Forward and reverse reads in each sample were merged using the VSEARCH program "fastq_mergepairs" with a minimum overlap of 10 bp, yielding approximately 313 bp sequences. Forward and reverse primers which were not reliably detected at >90% identity, were removed with Cutadapt using the "discard_untrimmed" option. Quality filtering was done with the "fastq_filter" in VSEARCH, and sequences with zero expected errors were kept ("fastq_maxee" 1).
Sequences were dereplicated with "derep_fulllength," first at the sample level and then concatenated into one FASTA file, which was subsequently dereplicated. Chimeric sequences were filtered out from the FASTA file using the "uchime_denovo" VSEARCH program. The remaining sequences were then clustered into operational taxonomic units (OTUs) at 97% identity with "cluster_size", a greedy centroid-based clustering program. OTUs were blasted against a custom Animalia database downloaded from BOLD on 28 November 2018, including taxonomy and barcode index number (BIN) information, by means of Geneious (v.10.2.5, Biomatters, Auckland, New Zealand) and following methods described in Morinière et al. (2016). The resulting csv file, which included the OTU ID, BOLD Process ID, BIN, Hit-%-ID value (percentage of overlap similarity (identical basepairs) of an OTU query sequence with its closest counterpart in the database), length of the top BLAST hit sequence, phylum, class, order, family, genus, and species information for each detected out, was exported from Geneious and combined with the OTU table generated by the bioinformatic pipeline. The combined results table was then filtered by Hit-%-ID value and total read numbers per OTU. All entries with identifications below 97% and total read numbers below 0.01% of the summed reads per sample were removed from the analysis. OTUs were then assigned to the respective BIN. Additionally, the API provided by BOLD was used to retrieve BIN species and BIN countries for every OTU, and the Hit-%-IDs were aggregated over OTUs that found a hit in the same BIN and shown in the corresponding column as % range. To validate the BOLD BLAST results, a separate BLAST search was carried out in Geneious (using the same parameters) against a local copy of the NCBI nucleotide database downloaded from ftp://ftp.ncbi.nlm.nih.gov/ blast/db/. Interactive Krona charts were produced from the taxonomic information using KronaTools v1.3 (Ondov et al. 2011). Species identification was based on the HTS of OTUs, before blasting and assignment to BINs (Ratnasingham and Hebert 2013) which are considered to be a good proxy for species numbers (Hausmann et al. 2013, Ratnasingham andHebert 2013).

Analyses
Generated sequence data were further analysed with a focus on three objectives: 1) assessment of species diversity using ABGD molecular species delineation (Puillandre et al. 2012); 2) discovery of new spe-cies or potentially cryptic species assemblages; and 3) testing of distribution ranges of the respective lineages, especially in regions where pteropod diversity remains poorly understood so far.
In the final analysis, publicly available sequence data from GenBank and BOLD were included for phylogenetic testing. To code the COI gene, the sequences were first aligned in protein and then converted into nucleotide using ClustalW implemented in the software package MEGA Version 3.0. This method allowed us to maximize the homology between nucleotide positions when amino acid deletion/insertion occurred.

Experimental approach
Standard DNA barcoding approaches for pteropod gastropods have already been successfully applied (e.g. Hunt et al. 2010, Jennings et al. 2010) and provide a suitable tool for assessing large-scale biodiversity (Makiola et al. 2020, Chimeno et al. 2022. With our approach we established a feasible protocol that overcomes current obstacles of having to sort the visible specimens or tissues. The sediment on the bottom of the provided plankton samples contained enough organic material (torn body parts, mucus and other secretions) to extract DNA. By using the dreg of the bottom, this method accelerates taxonomic procedures as the specimens in the sample jar remained unharmed and therefore stayed in suitable condition for further morphologic studies that might be of interest. This protocol might also be suitable in processing older collection material and therefore serves in the growing field of museomics, here enhancing comparative studies with modern and historical DNA material.

Quality of data
This non-destructive method is based on the use of scarce material. The quantity of DNA material is already limited by the relatively small size of our targeted specimens as well as their restricted geographical and circadian presence in the field. DNA degradation, collection age, preparation treatment and storage conditions have a big impact on the quality and quantity of the already limited DNA material (Janik et al. 2020). Nevertheless, we were able to detect targeted molluscan and pteropod DNA. Many benthic marine gastropod species have pelagic larvae, so they may be encountered in traces in any plankton samples (e.g. Pulmonata, Caenogastropoda s.o and Stylommatophora). Surprisingly, we found quite a large amount of DNA of terrestrial specimens in our samples, even though secure laboratory guidelines applied in order to avoid cross contamination during extraction and amplification were reasonable. Cross contamination with land snail material in the field due to net storage ashore is possible. Contamination is ruled out regarding the amount and diversity of terrestrial gastropods in our data set. Scarce availability of DNA material in combination with too many PCR amplification cycles can lead to formation of chimeric products (Fonseca et al. 2012). Using a smaller number of PCR cycles served as a precaution of formation of such chimeric DNA fragments here, so we discard this as a main explanation. Terrestrial DNA matches with species identity percentages of 0.787% to 0.837%. This might be due to a lack of comparative data as it is well known that the likelihood of finding matches in public databases for invertebrates is lower than for vertebrates (Harris et al. 2016). This could lead to mis-assignment of a barcode to the wrong species with high confidence. Missing target taxa on reference databases leads to the risk of making both false positive and false negative taxonomic assignments. False positives occur due to mis-assignment of a barcode to the wrong species with high confidence because the target species is missing in the database, and false negatives occur because of gaps in the database resulting in low confidence assignments (Porter and Hajibabaei 2018). Matches to terrestrial data were not as high as those of our data assigned to marine snail DNA (0.93-1). As organisms in our samples varied in size, shape and anatomy (e.g. crustaceans vs. molluscans), we expected a disproportion in organismic material and therefore in the presence of usable DNA material. Length of the sequences is ruled out as they had a satisfying length of >221 bp. For further analyses the application of more sensitive and stricter quality filters seems a beneficial recommendation to avoid a trade-off between quality and quantity, low-input DNA material and bioinformatic obstacles (chimera), and most importantly, to eradicate statistical imbalances of species material in the dreg and preservation medium to avoid such statistic outbreaks in the future.
We hope to contribute to and update the state of knowledge by proceeding with future investigations on interoceanic differences among the available sampling material on Cavolinia inflexa, Creseis acicula and Creseis virgula, as so far molecular backup is missing (e.g. Gasca and Janssen 2014). We plan to use more markers and longer barcodes in our future studies and to investigate the respective collection jars and pursue morphological studies if necessary.

Gymnosomata
Shelled pteropod taxa have been in focus because of their usefulness in studying global climate change (i.e. ocean acidification), but unshelled pteropods have not yet been examined to the same extent. Some work has been done on an ecological and anatomic level, but little on genetics despite an increase in the recent years (e.g. Stromek et al. 2015, Yamazaki et al. 2017, Kohnert et al. 2020. Gymnosomes are less abundant than thecosomes, but they are ecologically very important because of their feeding manners, primarily predating on thecosomes (Lalli and Gilmer 1989). Interestingly, some of our sequenced material from samples from a free water day catch in the Indian Ocean (roughly 28° S, 66° E) was assigned to Gymnosomata sp. According to Burridge et al. (2017b), the presence of gymnosomes (and thecosomes) in their sampling material came from (sub)tropical free waters within a longitude gradient of ∼28°N and ∼28°S (Atlantic Ocean), and were most abundant in sub-Antarctic waters and rather less in warmer waters (e.g. Weldrick et al. 2019). Burridge et al. (2017b) did not assign any species level. In our study, comparative data from online databanks were lacking at lower levels beyond order. To our knowledge, emergence of gymnosomes in our sampling locality (oceanic free water) is rather rare. Putatively, a connection between migration via the Agulhas current and influences by cold waters from the circumpolar current led to the appearance in our material.

Thecosomata
Thecosome material indicates presence of the "usual suspects" in the sample jars: the cosmopolite species Creseis acicula, Creseis virgula and Cavolinia inflexa (Lesueur, 1813). Present in almost all the world's oceans, these species are mostly found in warm water territories. C. inflexa was the most dominant we had in our samples, which we anticipated, as it is in general the most common representative of Cavolinia in the Atlantic Ocean. Our findings (appearance/detection of C. inflexa in spacious, scattered samples) agree with the accepted knowledge that C. inflexa is common and distributed widely and provide further input in the ongoing discussions about its taxonomic status, as it is a putative species complex (Rampal 2002, Janssen et al. 2019). The same is true for Creseis acicula and Creseis virgula. Morphological distinctions of Creseis species have always been subject to many controversies, as can be seen in Frontier (1965), Rampal (1985Rampal ( , 2002, Janssen (2007), Gasca and Janssen (2014). Genetic studies (e.g. Klussmann-Kolb and Dinapoli (2006) and Corse et al. (2013)) did help to build a more solid classification, but it is still open for taxonomic debates.

E-DNA barcoding approach for Pteropoda
Using tools like BOLD and GenBank allows quick and easy barcode and phylogenetic analysis. But there are persisting issues. Identification errors found in already published sequence data are rarely re-evaluated. Using standard procedures there is a 95% probability of finding incorrectly described metazoan sequences in GenBank, ranging from 1% (Mollusca and Arthropoda) to 6.9% (Gastrotricha). Consequently, the increasing popularity of DNA barcoding and metabarcoding analysis may lead to overestimation of species diversity (e.g. Mioduchowska et al. 2018). Lack of species-specific comparative data and misidentifications/ incorrect sequence data in previously and newly published data are due to amplification of non-target taxa and insufficient analysis of the obtained sequences (Mioduchowska et al. 2018). The difficult taxonomic history of Creseis is still mirrored in our bioinformatic analyses (current study: Bold and Genbank). Unfortunately, outdated synonyms are still used for certain taxa, in our case C. acicula, which was wrongly assigned to C. clava. Furthermore, BIN sharing in two samples occurred. This is correctable on a small scale, but for a bigger approach it might lead to problematic consequences and therefore shows the need for updates.

Barcoding as a tool for monitoring
We advocate using collection material from previous expeditions to monitor environmentally influenced changes in pteropod abundance/behaviour, using it as a benchmark for the occurrence of common traits or derivations. In addition to measuring biodiversity by monitoring species availability, putative changes in behaviour might also help to understand the effects of climate change on these marine organisms. Diel vertical migration is a known phenomenon for most thecosome species, and in our samples we found DNA material in night and day hauls in anticipated quantities. Swimming and sinking behaviour by these pelagic snails is important in their ecology, predator-prey interaction, and vertical distribution (Karakas et al. 2020). Despite the costs, benefits like niche partitioning, metabolic advantage due to colder temperatures at depth, avoidance of light, high temperatures and predators seem to advocate this behaviour (Hays 2003, Antezana 2009). Factors such as ocean acidification might harmfully affect the migrating ability by altering shell condition (and thickness), leading to misbalancing in factors involved in locomotion and buoyancy processes, as already shown for other thecosomes (Limacina retroversa, here: Manno et al. 2012, Adhikari et al. 2016). This will have fatal consequences for the individuals and will potentially result in ecological cascades in the long run. This process will be mirrored by the catch success in future plankton hauls. Thus, if data are accessible for longer time periods and large geographic areas, comparisons and statements about harmful effects and future developments can be made for certain ecological key species, as in our case thecosome pteropods.

CONCLUSION
In this pioneering study we applied a non-invasive, next-generation environmental barcoding approach to several selected (meso)zooplankton bulk samples collected during the 2010 Malaspina global circumnavigation. On a small scale we were able to support existing knowledge of the distribution of Cavolinia inflexa, C. virgula and C. acicula, and we made surprising findings about the putative broader distribution of gymnosomes hypothesized today. Environmental DNA approaches may streamline the search for new pteropod species in unsorted museum jars and streamline historic and future monitoring efforts. Limitations to our approach are related to the lack of comparative barcoding data from pteropods, especially of gymnosome data, and outdated use of synonyms and potential misidentifications in online sequence bases, which call for re-evaluation and up-dating of existing published data. With more and taxonomically broader barcoding sequences available in public databases, our environmental barcoding approach will improve our understanding of global species diversity and distribution patterns of Pteropoda and other planktonic organisms.