Contigs smaller than 1800 bp were assembled using Newbler (Life Technologies) to generate larger contigs (flags: -tr, -rip, -mi 98, -ml 80). Contigs larger than 1800 bp, as well as contigs generated from the final Newbler run, were combined using minimus2 (flags: -D MINID=98 -D OVERLAP=80) [AMOS (http://sourceforge.net/projects/amos)]. Read depth estimates were based on mapping the trimmed, screened, paired-end Illumina reads to assembled contigs using BWA (http://bio-bwa.sourceforge.net/). Un-assembled, paired reads were merged with FLASH (http://sourceforge.net/projects/flashpage). Assembled contigs along with the merged, un-assembled reads were submitted to
the Integrated Metagenome Analysis System (https://img.jgi.doe.gov/) for functional annotation. Submitted sequences were trimmed to remove low-quality regions, and stretches of
undetermined sequences at the ends of contigs were removed. Each sequence was checked with the DUST algorithm (Morgulis et al., 2006) for low-complexity regions. Sequences with fewer than 80 unmasked nt were removed. Additionally, very similar sequences (similarity > 95%) with identical 5′ pentanucleotides were replaced by one representative using UCLUST (www.drive5.com). The feature prediction pipeline included the detection of non-coding RNA genes followed by prediction of protein-coding genes. Identification of tRNAs was performed using tRNAscan-SE-1.23 (Lowe and Eddy, 1997). In case of conflicting predictions, the best-scoring predictions were selected. The last 150 nt of the sequences were also checked
by comparing these to a database containing tRNA sequences identified in isolate genomes using blastn (Altschul et al., 1997). Hits with high similarity were kept. Ribosomal RNA genes were predicted using hmmsearch (Eddy, 2011) with internally developed models for the three types of RNAs for the domains of life. Identification of protein-coding genes was performed using four different gene calling tools, GeneMark (v.2.6r) (Besemer and Borodovsky, 2005), Metagene (v. Aug08) (Noguchi et al., 2006), Prodigal (v2.50) (Hyatt et al., 2010) and FragGeneScan (Rho et al., 2010), all of which are ab initio gene prediction programs. We typically followed a majority-rule decision scheme to select the gene calls. When there was a tie, we selected genes based on an order of gene callers determined by runs on simulated metagenomic datasets (GeneMark > Prodigal > Metagene > FragGeneScan). Finally, CDS and other feature predictions were consolidated. Regions identified previously as RNA genes were preferred over protein-coding genes. Subsequent functional prediction involved comparison of predicted protein sequences to the public IMG database using the USEARCH algorithm (www.drive5.com), the COG database using the NCBI-developed PSSMs (Tatusov et al., 2003), and the PFAM database (Punta et al., 2012) using hmmsearch.
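The majority-rule selection with a caller-priority tie-break described above can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the function name select_gene_call, the representation of a gene model as a (start, end, strand) tuple, and the requirement that competing calls match exactly in order to count as agreeing are all assumptions made for illustration. Only the priority order is taken from the text.

```python
from collections import defaultdict

# Tie-break order as stated in the text; everything else here is assumed.
CALLER_PRIORITY = ["GeneMark", "Prodigal", "Metagene", "FragGeneScan"]


def select_gene_call(calls):
    """Choose one gene model for a locus from per-caller predictions.

    `calls` maps a caller name (one of CALLER_PRIORITY) to a hypothetical
    (start, end, strand) tuple, or to None if that caller predicted nothing.
    Returns the chosen tuple, or None if no caller made a prediction.
    """
    # Group callers by the exact gene model they support.
    supporters = defaultdict(list)
    for caller, model in calls.items():
        if model is not None:
            supporters[model].append(caller)
    if not supporters:
        return None

    # Majority rule: keep the model(s) backed by the most callers.
    best_votes = max(len(names) for names in supporters.values())
    tied = [m for m, names in supporters.items() if len(names) == best_votes]
    if len(tied) == 1:
        return tied[0]

    # Tie: prefer the model backed by the highest-priority caller.
    def rank(model):
        return min(CALLER_PRIORITY.index(c) for c in supporters[model])

    return min(tied, key=rank)


example = {
    "GeneMark": (10, 400, "+"),
    "Prodigal": (10, 400, "+"),
    "Metagene": (40, 400, "+"),
    "FragGeneScan": (40, 400, "+"),
}
print(select_gene_call(example))
```

In this example the two candidate models each have two supporters, so the tie is resolved in favour of the model that GeneMark supports. A real consolidation step would also need to decide when overlapping but non-identical calls count as the same gene, which the text does not specify.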