To train the Random Forest ML classifier based on protein features, RefSeq-annotated ORFs (<= 150 amino acids) of E. coli str. K12 substr. MG1655 (NC_000913.3) were used as the positive set. For the negative set, a random sample of the same size was drawn, in the length range 20-150 aa, from all theoretically enumerated ORFs in the E. coli genome other than those in the annotation. The positive and negative sets each contained 851 ORFs. The training set can be downloaded here.
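The negative-set sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_negative_set`, the `(orf_id, length_aa)` input layout, and the fixed seed are all assumptions made for the example.

```python
import random


def sample_negative_set(candidate_orfs, annotated_ids, n_positive,
                        min_aa=20, max_aa=150, seed=0):
    """Draw a negative set of the same size as the positive set.

    candidate_orfs: iterable of (orf_id, length_aa) for all enumerated ORFs
                    (hypothetical input layout).
    annotated_ids:  set of ORF ids present in the annotation.
    n_positive:     size of the positive set (851 in the text above).
    """
    # Keep only unannotated ORFs inside the allowed length range.
    pool = [orf_id for orf_id, length_aa in candidate_orfs
            if min_aa <= length_aa <= max_aa and orf_id not in annotated_ids]
    if len(pool) < n_positive:
        raise ValueError("not enough unannotated ORFs in the length range")
    # Sample without replacement; the seed is only there for reproducibility.
    return random.Random(seed).sample(pool, n_positive)
```

Calling `sample_negative_set(candidates, annotated, 851)` would then give a negative set of the same size as the positive set, drawn only from unannotated ORFs of 20-150 aa.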
Performance was benchmarked by comparing the prediction results with those of PRODIGAL and RANSEPS on RefSeq-annotated small ORFs (<= 100 aa) from a large dataset of bacterial genomes. The small ORFs in this dataset had an average length of around 70-80 residues.
Additional benchmarking was carried out using datasets of experimentally identified small ORFs in the length range of 20-50 residues. The set of experimentally identified ORFs in E. coli str. K12 substr. MG1655 was compiled from multiple studies (PubMed: 19121005, 30904393, 27013550, 29645342, 30837344). Other datasets of experimental ORFs, available for Salmonella enterica serovar Typhimurium str. 14028s (PubMed: 28122954), Salmonella enterica serovar Typhimurium str. SL1344 (Link), Staphylococcus aureus subsp. aureus str. Newman (PubMed: 34061833), Caulobacter crescentus (PubMed: 25078267), Streptococcus pneumoniae D39 (PubMed: 35852327) and Bacteroides thetaiotaomicron (PubMed: 31402174, 31841667), were also used. A subset of these ORFs that is validated at the protein level, the gold-standard set, was also used. For the purpose of benchmarking, ORFs catalogued from these studies were first filtered by length (10-100 amino acids) and then filtered for overlaps, such that only one start was considered for each stop. For both analyses, the positive set was derived from annotated or experimental smORFs, while the negative set comprised all smORFs in the length range (10-100 aa) other than those annotated or experimentally identified.
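The two filtering steps above (length filter, then one start per stop) can be sketched as follows. This is only an illustration under stated assumptions: the `Orf` record and `filter_orfs` are hypothetical names, and keeping the longest remaining ORF for each stop is an assumed tie-breaking rule, since the text does not say which start is retained.

```python
from collections import namedtuple

# Hypothetical ORF record; the field names are assumptions for this sketch.
Orf = namedtuple("Orf", ["contig", "strand", "start", "stop", "length_aa"])


def filter_orfs(orfs, min_aa=10, max_aa=100):
    """Length-filter ORFs, then keep a single start per stop codon."""
    # Step 1: length filter (10-100 aa, as in the text above).
    in_range = [o for o in orfs if min_aa <= o.length_aa <= max_aa]

    # Step 2: ORFs sharing contig, strand and stop differ only in their start.
    # ASSUMPTION: keep the longest one; the source does not specify the rule.
    best = {}
    for o in in_range:
        key = (o.contig, o.strand, o.stop)
        if key not in best or o.length_aa > best[key].length_aa:
            best[key] = o

    return sorted(best.values(), key=lambda o: (o.contig, o.start))
```

Because the length filter runs first, an ORF longer than 100 aa does not suppress a shorter in-range ORF that shares its stop.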
Supplementary Datasets Discussed in the Paper:
Supplementary Dataset I: Positive and negative training set.
Supplementary Dataset II: All experimentally identified smORFs used for benchmarking.
Supplementary Dataset III: Subset of experimentally identified smORFs validated using epitope tagging (protein level).
Supplementary Dataset IV: Assembly accessions of 3153 prokaryotic whole genomes used for conservation analysis.