The many industrial sites along the northern Adriatic Sea have caused decades of hydrocarbon pollution. In order to establish a baseline for future studies, the concentrations of hydrocarbons in surface sediments and the bacterial diversity at seven selected sites were studied (1). This paper extends the previous study through bioprospecting, the systematic search for the genetic and biochemical potential of bacterial communities, here for hydrocarbon degradation activity. This potential was assessed by searching metagenomic sequences with 22 hidden Markov model (HMM) profiles of the best-characterised hydrocarbon-degrading enzymes. Many studies have characterised the bacterial aerobic degradation of linear n-alkanes, which are the major component of petroleum products. As n-alkanes are rather inert chemically, the first step in the degradation pathway is activation, usually involving oxidation to an alcohol by an enzyme using molecular oxygen as a substrate (2). Short-chain n-alkanes (C2–C4) are oxidised by enzymes related to methane monooxygenases: soluble methane monooxygenases (sMMO) and particulate methane monooxygenases (PMO). Medium-chain (C5–C17) n-alkanes can be activated by soluble cytochrome P450s (e.g. CYP153) or integral membrane non-heme iron monooxygenases, e.g. AlkB. Long-chain (>C18) n-alkanes are hydroxylated by unrelated enzymes including AlmA and LadA (3). Anaerobic biodegradation of hydrocarbons also occurs, with the most widely reported mechanism for activation of the substrate being enzymatic addition of the hydrocarbon across the double bond of fumarate. Recently, three phylogenetically related enzymes catalysing this addition have been identified: alkylsuccinate synthase (AssA) activates n-alkanes, benzylsuccinate synthase (BssA) activates toluene and xylene, and 2-methylnaphthylsuccinate synthase (Nms) activates 2-methylnaphthalene (4-6).
The University of Minnesota Biocatalysis/Biodegradation Database (7) provides comprehensive information about known hydrocarbon-degrading enzymes.
Hydrocarbon-degrading enzymes are of interest for the development of bioremediation strategies for polluted sites. They are also potential industrial enzymes, and the identification of novel enzymes will increase the armoury available to biotechnologists for the development of new processes. Most bacterial species in the environment cannot be grown as pure cultures in the laboratory, and a metagenomic approach (8) can be used to discover enzymes not present in culturable species. Such bioprospecting of metagenome data presents considerable problems for bioinformatics. Low sequence coverage makes assembly of genes difficult. The probable functions of genes are deduced on the basis of similarity to known genes. This can be achieved using BLAST similarity searches (9) of in silico-translated DNA sequences. However, BLAST may not be effective for identifying sequences that are dissimilar to known ones and, thus, may miss some of the most interesting enzymes. HMM profiles (10) have the advantage that amino acid residues are weighted according to their degree of conservation in a protein family, which makes them better at identifying dissimilar members of that family.
In this paper, we describe metagenome sequences from surface sediments in the northern Adriatic Sea. The sequences were incorporated in a newly developed custom database called REDPET (REDucing environmental impact from local PETrochemical industry by novel bioremediation strategies), which was designed for bioprospecting novel hydrocarbon-degrading enzymes. The use of this database is exemplified by the choice of five putative AlkB sequences.
MATERIALS AND METHODS
Three metagenomes were characterized by shotgun sequencing. The metagenome MET1 was derived from a heavily polluted sediment sampled from the Uljanik shipyard (marked as BN in Fig. 1, 44.866665°N 13.840400°E) as previously described (1). The second metagenome, MET2, was derived from an unpolluted coastal surface sediment sample from Cuvi beach at Rovinj (marked as CU in Fig. 1, 45.062290°N 13.652326°E). The third metagenome, MET3, was derived by pooling two moderately polluted sediment samples taken from a tanker berth station (marked as TV in Fig. 1, 45.276365°N 14.549654°E) (1). These two samples had been enriched for potential hydrocarbon-degrading bacteria by incubating under aerobic or anaerobic conditions in the presence of crude oil (1). The sampling procedure for the CU sample and the assay for hydrocarbon content were as previously described for the samples BN and TV (1).
Total DNA was isolated from 10 g of each sample with the PowerMax Soil DNA Isolation Kit (MO BIO, Carlsbad, CA, USA) according to the manufacturer’s instructions. The column was eluted in 5 mL of 10 mM Tris. Sodium acetate (1:10 by volume; Kemika, Zagreb, Croatia) was added and DNA was precipitated for 30 min at -20 °C in one volume of isopropanol (Alkaloid, Skopje, Macedonia). After centrifugation (Eppendorf 5430 R, Leipzig, Germany) for 21 min at 4 °C and 20 000×g the DNA pellet was washed with 2 mL of 70% ethanol solution (Gram-mol, Zagreb, Croatia). A second centrifugation step lasted for 5 min (4 °C, 20 000×g). The pellet was dissolved in 120 µL of 10 mM Tris. Samples were sent to Eurofins MWG GmbH, Ebersberg, Germany for sequencing with the Roche 454 GS FLX+ chemistry. The sequences have been deposited in the European Nucleotide Archive (study accession number: PRJEB13497 and respective sample accession numbers: SAMEA3928486, SAMEA3928487 and SAMEA3928488).
The REDPET database was constructed using the MEGGASENSE platform (11). HMM profiles for hydrocarbon-degrading enzymes were generated with HMMER v. 3.0 (10) using protein sequences downloaded from the KEGG database (12) as a primary source. If fewer than 10 sequences were present for a specific KEGG orthologue (KO), all protein sequences of the respective KO were used to search the UniRef50 database (13). The unique UniRef50 clusters identified were used to build the HMMs. A metagenomic read was assigned to a specific HMM profile if the E-value was lower than 10⁻⁵ and the length of the alignment was greater than a third of the HMM profile length. Taxonomic profiling of the metagenomes was done using Kaiju (14) on the entire metagenomic dataset. Default settings were used, with RefSeq Genomes (15) as the reference database.
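The read assignment rule described above can be illustrated with a short Python sketch. It is not the MEGGASENSE implementation: the `Hit` record and its field names are hypothetical, and it assumes that the alignment length and the profile length are measured in the same units (amino acid positions). Only the two thresholds are taken from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-5  # E-value threshold stated in the text


@dataclass
class Hit:
    """One read-versus-profile alignment (hypothetical record layout)."""
    read_id: str
    profile: str
    e_value: float
    alignment_length: int  # aligned positions (assumed: amino acids)
    profile_length: int    # length of the HMM profile (same units)


def passes_filter(hit: Hit) -> bool:
    """True if the hit meets both criteria given in the text."""
    significant = hit.e_value < E_VALUE_CUTOFF
    # "greater than a third of the profile length", in integer arithmetic
    long_enough = 3 * hit.alignment_length > hit.profile_length
    return significant and long_enough


def assigned_hits(hits):
    """Keep only the hits that would be assigned to a profile."""
    return [h for h in hits if passes_filter(h)]


example_hits = [
    Hit("read_1", "AlkB", 1e-8, 150, 400),   # passes both criteria
    Hit("read_2", "AlkB", 1e-3, 300, 400),   # E-value too high
    Hit("read_3", "AlkB", 1e-10, 100, 400),  # alignment too short
]
kept = [h.read_id for h in assigned_hits(example_hits)]
```

Comparing `3 * alignment_length` with `profile_length` avoids floating-point division at the one-third boundary. How a read matching several profiles is resolved is not stated in the text, so the sketch leaves that open.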
RESULTS AND DISCUSSION
A metagenomic library, MET1, was constructed from a heavily polluted sample from the Uljanik shipyard (BN, Fig. 1). In order to understand the effects of pollution, it was also necessary to have a sample from an uncontaminated area in the northern Adriatic Sea. A surface sediment sample was collected from Cuvi beach in Rovinj (CU, Fig. 1). On a dry mass basis, this sample had low levels of aliphatic hydrocarbons (resolved n-alkanes 4.13 μg/g, unresolved complex mixture 22.48 μg/g) and polycyclic aromatic hydrocarbons (PAHs; 0.08 μg/g) compared to the previously reported values (1) for the two polluted sites, BN and TV (resolved n-alkanes 38.07 and 6.59 μg/g, unresolved complex mixture 518.19 and 10.75 μg/g, PAHs 73.53 and 0.40 μg/g). This sample was used to construct the metagenomic library MET2. The third library, MET3, was derived from a moderately polluted sample from a tanker berth station (TV, Fig. 1), which was incubated under crude oil selection pressure. Samples from aerobic and anaerobic selection were pooled to construct the library.
The three libraries were sequenced using pyrosequencing and each yielded similar amounts of sequence data (Table 1) with similar read lengths. However, while over 60% of the reads in MET3 could be assembled into contigs, fewer than 3% of the reads in the other two libraries could be assembled.
All three metagenomic libraries were taxonomically classified using Kaiju (14). Each read was translated in all six possible reading frames and searched against the RefSeq Genomes database (15) to identify the originating taxon. For MET1 and MET2, 48% of reads were unclassified, whereas for MET3 only 24% were unclassified. The most abundant phylum in all three metagenomes is Proteobacteria, with 56, 52 and 75% of all classified reads (MET1, MET2 and MET3, respectively). The second most abundant phylum is Bacteroidetes, with 23, 23 and 11% of all classified reads, followed by Firmicutes (4, 4 and 5%) and Actinobacteria (4, 5 and 1%). The most abundant species in MET1 is Woeseia oceani with 3% of all classified reads, followed by Halioglobus pacificus (2%) and Halioglobus japonicus (2%). The most abundant species in MET2 is also Woeseia oceani with 3% of all classified reads, followed by Maribacter sp. HTCC2170 (2%). MET3 is dominated by Immundisolibacter cernigliae, corresponding to 16% of all classified reads, followed by several Marinobacter species (15%). All three metagenomes show high diversity, with 3280, 3349 and 3279 bacterial species identified, which is reflected in high Shannon indices (6.8, 6.7 and 5.5). This high diversity may partly result from the classification method, which tries to assign a taxon to every metagenomic read, and from the retention of taxa present in low numbers, as singletons were kept for analysis. When only bacterial species present at more than 0.1% abundance are taken into account, the number of identified species falls to 193, 182 and 111. Enrichment with crude oil was reflected in the bacterial composition at the species level: together, Immundisolibacter cernigliae and the Marinobacter species represent 31% of all classified reads in MET3, and both genera are involved in hydrocarbon degradation (4, 6, 16, 17).
The dominance of a few bacterial species in the MET3 metagenome is also reflected in the greater number of reads assembled into contigs.
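For readers who wish to reproduce this kind of summary from their own per-species read counts, the following Python sketch computes a Shannon index and applies the 0.1% abundance cut-off. It is an illustration only: the text does not state which logarithm base was used for the reported indices, so the natural logarithm is an assumption here, and the function names and the example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)); natural log is an assumption."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log(n / total)
        for n in counts.values()
        if n > 0  # taxa with zero reads contribute nothing
    )


def abundant_taxa(counts, min_fraction=0.001):
    """Taxa whose share of classified reads exceeds min_fraction (0.1%)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n for taxon, n in counts.items() if n / total > min_fraction}


# Hypothetical read counts, not data from the study
example_counts = {"species_a": 9979, "species_b": 20, "species_c": 1}
h_prime = shannon_index(example_counts)
kept_taxa = abundant_taxa(example_counts)
```

With these made-up counts, the singleton `species_c` is dropped by the abundance filter, which mirrors how retaining singletons inflates the number of species reported.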
Development of the REDPET database
The results of the analyses of the metagenomes were used to construct the REDPET relational database (18), which was based on the MEGGASENSE platform (11) and incorporates the generic functionality of this platform. This includes assembly of reads, taxonomic analysis and an intelligent search engine. General functional analysis of the metagenomes uses a set of profiles based on KEGG orthologues (KO), which are ordered according to the BRITE hierarchical functional scheme (12). The taxonomic analysis was carried out using the Kaiju program (14) and the database incorporates the Krona viewer (19) for display of the phylogenetic composition of the metagenome. An example is shown for the MET2 library (Fig. 2) at a phylum level. As Krona is an interactive viewer, it is possible to view the taxonomy results at different hierarchical levels down to the individual species level.
As the main aim of the project was bioprospecting for potential hydrocarbon-degrading enzymes, it was necessary to add some custom features to the REDPET database. A collection of HMM profiles was developed to recognise these enzymes (Table 2). Some of the enzyme activities were already included in the KEGG orthology database, so that the MEGGASENSE profiles corresponding to the KEGG KOs could be used (e.g. AlkB, PmoA). If there were only a small number of sequences present in the KO (fewer than 10 protein sequences), UniRef50 clusters of all members belonging to the KO were obtained and the distinct UniRef50 clusters were used to construct the profile (e.g. MmoB, MmoX). If the enzyme did not have a corresponding KO entry (e.g. AlmA), its sequence was identified in the UniProt database (13) based on a literature search and the UniRef50 cluster of the identified protein was then used to construct the HMM profile. The collection of 22 HMM profiles (Table 2) was used to mine the assembled contigs of the metagenomic libraries. The HMM profiles are available from the REDPET database (18).
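The three-way choice of sequence source for each profile can be summarised as a small decision rule. The Python sketch below is only a restatement of the paragraph above, not code from the REDPET or MEGGASENSE projects; the function name and the returned labels are hypothetical, and `None` is used here to stand for an enzyme without a KO entry.

```python
from typing import Optional

MIN_KO_SEQUENCES = 10  # threshold stated in the text


def profile_source(ko_sequence_count: Optional[int]) -> str:
    """Return a label for the sequence set used to build an HMM profile.

    ko_sequence_count is the number of protein sequences in the KEGG
    orthologue, or None if the enzyme has no KO entry (assumed encoding).
    """
    if ko_sequence_count is None:
        # e.g. AlmA: protein found by literature search in UniProt,
        # then its UniRef50 cluster is used
        return "uniref50_cluster_of_literature_protein"
    if ko_sequence_count < MIN_KO_SEQUENCES:
        # e.g. MmoB, MmoX: distinct UniRef50 clusters of all KO members
        return "uniref50_clusters_of_ko_members"
    # e.g. AlkB, PmoA: the existing KO-based profile is used directly
    return "existing_ko_profile"
```

For example, `profile_source(4)` selects the UniRef50 clusters of the KO members, whereas `profile_source(10)` keeps the existing KO-based profile.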
The number of reads corresponding to putative hydrocarbon-degrading genes found in each of the metagenomic libraries is shown in Table 3. In addition to scanning the reads, the assembled contigs were also analysed to find complete genes. As expected for the small contig sizes (Table 1), no complete genes were found in MET1 and MET2, but 67 complete genes were found in MET3 (Table 3).
As the three metagenomic libraries have similar numbers of reads and total sequence lengths (Table 1), it is possible to compare the numbers of hits directly. It can be seen (Table 3) that the three libraries have comparable total numbers of hits. There is little difference between the sample from the highly polluted shipyard (MET1) and the sample from the unpolluted site (MET2). However, the samples selected on crude oil (MET3) have an increased proportion of genes for aerobic n-alkane degradation. Genes for short- (MmoC), medium- (AlkB) and long-chain (AlmA, LadA) n-alkane degradation are all present in an increased proportion. In contrast, there was a lower proportion of the anaerobic n-alkane-degrading enzyme AssA and of the anaerobic cyclic hydrocarbon-degrading enzymes (BssA and Nms).
An important aim of this research was to identify novel hydrocarbon-degrading enzymes with initial interest focusing on the 19 complete alkB genes (Table 3) encoded in the MET3 metagenomic library. BLAST searches (9) were carried out with the translated sequences and all best hits showed at least 85% coverage of the query sequence. Most of the hits were annotated as alkane 1-monooxygenase (Table 4). Five of the hits were chosen for further work, including cloning by synthesis. Table 5 shows the six putative almA genes. None of them are annotated as n-alkane monooxygenase.
Three metagenomic libraries were constructed from samples collected in the northern Adriatic Sea (Fig. 1). All the data from the metagenomes are presented in the REDPET database (18). HMM profiles (10) are the main analysis tool used to assign gene function in the REDPET database. The profiles are derived from sequence families and match residues in a weighted manner depending on their degree of conservation in the family. This gives better recognition of function than BLAST searches (9), which do not use family-specific weighting. This advantage is particularly important when short sequences or sequences evolutionarily distant from most of those in the databases are being analysed; both these cases occur with metagenomic data. Use of BLAST searches (9) with uncurated databases such as GenBank (20) is particularly problematic, because there is no fixed standard of annotation and annotation errors are common. For general analysis of gene function, HMM profiles derived from KEGG orthologues are used (12), which are presented using the hierarchical BRITE classification, thus simplifying functional analysis of the genes present. In order to detect putative hydrocarbon-degrading enzymes and potential enzymatic functions of interest, 22 HMM profiles were developed.
Although similar amounts of sequence data were gathered from each sample, MET3 allowed a much greater degree of sequence assembly than the other two libraries (Table 1). This is because selection on crude oil resulted in a large reduction in the number of species present (1), so that the depth of sequencing per genome was higher for MET3 than for MET1 and MET2. Use of an alternative method such as Illumina sequencing would increase the depth of sequencing, but might cause problems due to the shorter length of reads (21).
The REDPET database includes a taxonomic browser (Fig. 2) so that the biodiversity of samples can be assessed. The metabolic activities of the microbial consortia represented in the metagenomes can be assessed using the results of the analyses with the HMM profiles. General metabolic activities are organised using the BRITE hierarchical classification of the KEGG database (12), allowing systematic analyses. The hydrocarbon-degrading enzymes (Table 3) in the different metagenomes were analysed using the specially constructed HMM profiles (Table 2). A more detailed analysis of the metabolic activities is planned in order to establish baselines for assessing the future development of pollution and remediation in the northern Adriatic Sea. Metagenomic analyses have already proven useful for the analysis of the consequences of oil spills (4).
For bioprospecting, it is necessary to analyse the putative genes of interest in more detail. The REDPET database allows easy downloading of genes for external analysis. The graphical user interface also allows BLAST searches (9) to be launched. Although useful results may also be extracted from partial gene sequences, it is much easier to assess complete assembled genes. When putative almA genes were analysed (Table 5), none of the best hits were annotated as long-chain (>C18) n-alkane monooxygenases despite high sequence identity. This reflects the fact that such enzymes are less well studied than enzymes degrading shorter-chain n-alkanes (3) and illustrates the advantage of using HMM profiles to identify putative genes. In contrast, for the well-studied alkB genes, many of the putative genes were annotated as n-alkane monooxygenases (Table 4). The sequence identity in the BLAST searches, combined with prediction of active sites, can be used as a criterion for the potential novelty of the genes. Sequences for five potentially novel alkB genes will next be synthesised and putative degradation of medium-chain (C5–C17) n-alkanes then assessed following expression in a heterologous host.
Metagenomics is a powerful tool for dealing with environmental problems and for bioprospecting for novel enzymes. However, the volume and quality of the generated data make analyses difficult, and standard analysis pipelines and databases do not provide the tools needed for specific projects. The REDPET database was constructed with metagenome sequences from sediments in the Adriatic Sea and was designed to facilitate analysis of genes involved in hydrocarbon degradation. This allowed easy identification of interesting target genes for bioprospecting. The construction of specialized databases using the MEGGASENSE platform is a general strategy that can be applied to any metagenomic dataset.