Previous research using mitochondrial and microsatellite DNA markers found some evidence of population structure, as well as evidence that the European invasion originated from a single population in central Chile . However, determination of higher-resolution population structure, migration events, divergence times, and population size can benefit from using a larger number of markers, such as what is produced from genome-wide sequencing studies . Additionally, few genetic studies have been conducted to understand how Tuta absoluta has performed so successfully as an agricultural pest beyond targeted examinations of known insecticide resistance alleles. One reason for this has been the lack of a highly contiguous genome with annotated genes. A short-read based assembly has been previously published for the purpose of developing molecular diagnostics ; however, it is highly fragmented and duplicated. In this study, we addressed these issues by using long-read sequencing technology to produce a highly contiguous genome assembly for Tuta absoluta. We then use short-read technology to sequence genomes of individuals collected across Latin America, as well as a Spanish population, blueberry pot to identify single nucleotide polymorphisms in an unbiased manner. We use these SNPs to detect population structure and estimate population history parameters to understand how and when Tuta absoluta spread across Latin America. Finally, we use genome scanning statistics to identify genes putatively under selection that may explain Tuta absoluta’s success as an agricultural pest. We expect the genome assembly and population data will be an asset toward developing new strategies to manage this pest.
For genome assembly, a single Tuta absoluta larva was collected from a colony originally sourced from the Institute of Agrifood Research and Technology , Cabrils, Spain and held in the Contained Research Facility in UC Davis and frozen on dry ice. The larva was pulverized in liquid nitrogen with a pestle in a 2 mL microcentrifuge tube using 740 mL of lysis buffer , 135 ug/mL Prot K. After a 37℃ overnight incubation step, 240 μL of 5M NaCl was added and gently mixed in by rocking before centrifuging at 10,000 RCF, 4℃, for 15 minutes. Supernatant was transferred using a wide-bore pipetteto a 2 mL DNA low-bind tube , precipitated with 1 mL of 100% ethanol, and centrifuged at 10,000 RCF, 4℃, for 5 minutes. The DNA pellet was washed with 500 μL of ice-cold 70% ethanol twice before air-drying for 5 minutes. Dry pellet was resuspended in DEPC-treated water and allowed to dissolve for 1 hour at room temperature before being stored at 4℃ for no more than 2 weeks. Absorbance ratios were measured with a Nanodrop Lite , DNA concentration was measured with a Qubit 4 Fluorometer using a dsDNA High-Sensitivity Assay , and DNA fragment size was measured with a Tapestation genomic DNA ScreenTape . Approximately 700 ng of DNA was sent to QB3-Berkeley for library preparation and PacBio HiFi sequencing with 1 SMRTcell.We analyzed whole-genome sequencing data from individuals previously collected from field and greenhouse sites across South America and Costa Rica, as well as a lab colony from Spain . Mapping rates to the new genome assembly ranged between 70%-90%, although sequencing depth per individual was low . One population from Argentina had extremely low mapping rates and read depth, so we excluded it from further analysis. Wherever possible, we used methods based on genotype likelihoods, rather than genotype calls, to account for uncertainty that results from the low read depth.
To investigate population structure in our samples, we used Principal Component Analysis and admixture estimation based on allele frequencies from over 900,000 SNPs. The first two PCs captured 18.7% and 16.3% of the total variance in the data, with the remaining PCs each capturing less than 5% of the total data variance . Samples primarily cluster together based on collection site but also formed three distinct regional groups . Samples from Chile, Peru, and Ecuador form an “Andes” cluster west of the Andes Mountains; samples from Brazil, Uruguay, Paraguay, and Argentina form a “Central” cluster, east of the Andes Mountains; and samples from Columbia and Costa Rica form a “North” cluster. Spanish samples grouped tightly with the Andes cluster, particularly the VA site. When three clusters were allowed in admixture estimation, samples group into the same three clusters as in PCA, while at four clusters, the Spanish samples become their own group, with VA samples sharing a large proportion of admixture. Compared to other Andes populations, the VA samples are more differentiated from Central and North sites as well, with little signal of admixture at all levels of k tested. The other Andes populations all had low admixture proportions from Central at k=2, although at k=3 we see that all RI samples exhibited admixture from the North populations. This suggests that the non-VA Andes populations are more closely related to Central populations than VA, and that VA could represent an admixture between the population that gave rise to the Spanish lineage and the other Andes populations. Additionally, we see that RI represents an intermediate population between the Andes and North, which makes sense given its geographic location between the two clusters. To further quantify population structure between these clusters, we calculated nucleotide diversity, Tajima’s D, and Fst using genotype likelihoods . For all clusters, nucleotide diversity was approximately 2%, which is fairly high compared to most Lepidopterans .
If we look at the weighted Fst, we see differentiation between clusters is high, particularly between North and all other clusters. The combination of high diversity levels and high Fst could mean these regions diverged from each other a long time ago, prior to the detection of Tuta absoluta by growers across Latin America in the 1960s to 1980s. If divergence had occurred recently, we might expect reduced diversity levels in invasive populations relative to the ancestral population.To detect potential migration events between populations, we used Treemix to build a maximum likelihood tree based on allele frequencies, as well as predict migration edges and calculate F3 statistics. As Treemix was designed to take allele count data per population, we called genotypes using PCAngsd using a 95% accuracy cutoff and counted alleles within each sampling location. After filtering out loci with missing data, 47,535 SNPs were available for use. In general, the tree topography aligns with results from PCA and admixture analyses. We see sampled sites cluster into the same three clusters, North, Andes, and Central, with the Spanish samples sister to the VA site . In agreement with Fst estimates, North populations have experienced more genetic drift from the Andes and Central populations, compared to the Andes and Central populations with each other. Interestingly, the RI site does not form a clade with other Central populations but descends from the common ancestor of the Central/Andes group. Based on admixture analysis that showed low levels of admixture in RI from the North, the position of RI in the tree could be further evidence that Ecuador represents an intermediate mixing zone between populations north and south of it. To investigate further, we re-ran Treemix allowing between one to five migration events and calculated the F3 statistic between all combinations of three populations to see if admixture was supported. At m=2,4 and 5, Treemix reported a strong migration from the Spain or Spain/VA branch to RI, while at m=3 and 5, Treemix reported a weak migration event from the North to RI. F3 statistics F3 and F3 were significantly negative , indicating that a simple bifurcating tree does not explain RI’s relationship with CH, CR, and SP. While Treemix infers a migration from the Spanish branch to the RI branch, it is important to remember that this migration is inferred to have occurred somewhere along the branch between the current day Spanish population and the most recent common ancestor of Spain and VA . This migration could have occurred early in the branch, nursery pots when the population was still in Chile, or late in the branch, when the population moved to Spain. Based on fresh tomato trade between the two countries, in 2006 Chile shipped over 29,000 kg of fresh tomatoes to Spain while importing none back . This makes admixture from Spain back to RI unlikely and suggests that RI contains admixture from North and Chilean populations. In addition to admixture in RI, Treemix and F3 statistics also detected admixture in AR from the Spanish population. At m=1,2,3, and 4, Treemix detected a migration edge from the Spanish branch to AR with migration weight varying between 9% to 17%. F3 statistics of AR and Spain with any population from North or Central resulted in a significantly negative value, providing strong evidence of a migration event from a Spanish ancestor like the signal detected with RI.
This suggests that the admixture signal we see in AR is from the same Chilean population that gave rise to the Spanish invasion and RI admixture.Based on the high levels of nucleotide diversity and Fst between the Andes, Central, and North clusters, we hypothesized that the three regions may have diverged many generations ago, before the appearance and detection of Tuta absoluta in agricultural crops throughout South America in the mid- 20th century. This would suggest a model in which Tuta absoluta may have adapted from local, wild host plants to nearby tomato fields independently, rather than a single population that became adapted to tomatoes and was spread through human activity. To investigate this, we calculated the folded two dimension site frequency spectrum between populations and estimated parameter values under various population models using maximum likelihood coalescent methods . We excluded the VA samples from the Andes cluster to avoid potential modeling issues due to VA appearing to originate from a distinct ancestor than other Andes populations. The simplest model allows for two population splits with constant population sizes, while the exponential growth model adds an exponential growth rate to each population. As exponential growth may not be appropriate if divergence times are long, we also tested a model with a simple resizing event for each population at some point in time. We used a post-hoc comparison of simulated linkage disequilibrium decay rates between models to test model fit. We found that while all three models simulated decay rates within the 95% confidence interval of the Andes population data, none simulated decay rates that overlapped with Central and North decay rate estimates, although the resizing population model was closest. Under the resizing population model, divergence of the North occurred 252,383 generations ago , followed by a Central population divergence 187,034 generations ago . Reports of Tuta absoluta generation times can be as high as 6 to 12 or more generations per year , dating these divergence events to tens of thousands of years ago. This suggests that Tutaabsoluta was already present across Latin America prior to the 1960s, and as tomato agriculture surged, adapted locally to the new host plant.The Population Branch Statistic is an Fst-based statistic that uses Fst data between 3 populations to calculate the population-specific allele frequency changes. Regions of the genome with abnormally high PBS may be under strong selective forces, causing the loci allele frequencies to change faster than expected by drift. We calculated PBS across the genome for all three populations and found several peaks in contigs 2, 9, 15, and 22 that were exceptionally high and broad, particularly in the North cluster . The peak in contig 9 contained the gene paralytic , a neuronal sodium channel protein that is the active target of pyrethroid insecticides . While PBS peaks in the North population between 13.1 and 13.2Mb on contig009, we note that the allelic diversity was low in the Andes and Central clusters relative to the North . We calculated allele frequencies of known resistance-inducing mutations in each cluster , and found one mutation, an alanine to leucine substitution at position 1014, was fixed in the Central and Andes, while at 41% frequency in the North . In addition, we found low to intermediate frequencies of other resistance alleles, including M918T, T929I, V1016G, L925M, and I254T. A similar selective sweep signal was also seen in the PBS hotspot on contig 2 , with high PBS and diversity levels in the North, and a large region of low allelic diversity in the Andes and Central .