ProsmORF-pred - Benchmarking

1. Datasets:

To train the Random Forest classifier on protein features, RefSeq-annotated ORFs (<= 150 amino acids) of E. coli str. K12 substr. MG1655 (NC_000913.3) were used as the positive set. The negative set was a random sample of the same size, drawn from all theoretically enumerated ORFs in the E. coli genome in the length range 20-150 aa that are not part of the annotation. The positive and negative sets each contained 851 ORFs. The training set can be downloaded here.
Performance was benchmarked by comparing the predictions with those of PRODIGAL and RANSEPS on RefSeq-annotated small ORFs (<= 100 aa) from a large dataset of bacterial genomes. The average length of these small ORFs was around 70-80 residues.
Additional benchmarking was carried out on datasets of experimentally identified small ORFs in the length range of 20-50 residues. The set of experimentally identified ORFs in E. coli str. K12 substr. MG1655 was compiled from multiple studies (Pubmed:19121005, 30904393, 27013550, 29645342, 30837344). Other datasets of experimental ORFs, available for Salmonella enterica serovar Typhimurium str. 14028s (Pubmed:28122954), Salmonella enterica serovar Typhimurium str. SL1344 (Link), Staphylococcus aureus subsp. aureus str. Newman (Pubmed:34061833), Caulobacter crescentus (Pubmed:25078267), Streptococcus pneumoniae D39 (Pubmed:35852327) and Bacteroides thetaiotaomicron (Pubmed:31402174, 31841667), were also used. A subset of these ORFs that is validated at the protein level, the gold-standard set, was also used. For benchmarking, the ORFs catalogued from these studies were first filtered by length (10-100 amino acids) and then filtered for overlaps, such that only one start was considered for each stop. For both analyses, the positive set was derived from annotated or experimentally identified smORFs, while the negative set comprised all other smORFs in the length range 10-100 aa.
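The construction of the size-matched negative set can be sketched as follows. This is a minimal illustration, not the authors' code: the function name sample_negatives, the dictionary keys "id" and "length_aa", the fixed seed and the toy ORF list are all assumptions made for the example.

```python
import random


def sample_negatives(candidate_orfs, annotated_ids, n, min_len=20, max_len=150, seed=0):
    """Draw n unannotated ORFs whose length lies in [min_len, max_len] aa.

    candidate_orfs: list of dicts with hypothetical keys "id" and "length_aa".
    annotated_ids:  set of ORF ids that belong to the positive (annotated) set.
    """
    pool = [
        orf for orf in candidate_orfs
        if orf["id"] not in annotated_ids and min_len <= orf["length_aa"] <= max_len
    ]
    if len(pool) < n:
        raise ValueError("not enough unannotated ORFs in the requested length range")
    # Sampling without replacement; the seed only makes the toy example repeatable.
    return random.Random(seed).sample(pool, n)


# Toy data: 200 enumerated ORFs with lengths 10..209 aa, three of them "annotated".
candidates = [{"id": f"orf{i}", "length_aa": 10 + i} for i in range(200)]
annotated = {"orf15", "orf40", "orf90"}

# The negative set is drawn to match the size of the positive set.
negatives = sample_negatives(candidates, annotated, n=len(annotated))
```

In the setting described above, n would be 851, the size of the positive set.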
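The two filtering steps applied to the experimental ORFs (length filter, then one start per stop) could look like the sketch below. The record layout (keys "contig", "strand", "start", "stop", "length_aa") is hypothetical. The text does not state which start is retained when several share a stop, so keeping the longest ORF is an assumption made here purely for illustration.

```python
def filter_orfs(orfs, min_len=10, max_len=100):
    """Length-filter ORFs, then keep a single start per (contig, strand, stop)."""
    best = {}
    for orf in orfs:
        # Step 1: discard ORFs outside the 10-100 aa window.
        if not (min_len <= orf["length_aa"] <= max_len):
            continue
        # Step 2: ORFs sharing a stop codon on the same strand are alternatives
        # of one another; only one of them is kept.
        key = (orf["contig"], orf["strand"], orf["stop"])
        # ASSUMPTION: the longest candidate wins; the source does not specify this.
        if key not in best or orf["length_aa"] > best[key]["length_aa"]:
            best[key] = orf
    return list(best.values())


# Toy records with illustrative coordinates.
orfs = [
    {"contig": "chr", "strand": "+", "start": 101, "stop": 250, "length_aa": 49},
    {"contig": "chr", "strand": "+", "start": 131, "stop": 250, "length_aa": 39},  # same stop
    {"contig": "chr", "strand": "-", "start": 900, "stop": 601, "length_aa": 99},
    {"contig": "chr", "strand": "+", "start": 12, "stop": 500, "length_aa": 162},  # too long
]

filtered = filter_orfs(orfs)
```

With the toy records, the 162 aa ORF is removed by the length filter and the two ORFs ending at position 250 collapse into one entry.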
Supplementary Datasets Discussed in the Paper:
Supplementary Dataset I : Positive and negative training set.
Supplementary Dataset II : All experimentally identified smORFs used for benchmarking.
Supplementary Dataset III : Subset of experimentally identified smORFs validated using epitope tagging (protein level).
Supplementary Dataset IV : Assembly accessions of 3153 prokaryotic whole genomes used for conservation analysis.

2. Comparison with PRODIGAL and RANSEPS on RefSeq annotated smORFs

Fig 1: Performance of PRODIGAL, RANSEPS and ProsmORF-pred on annotated smORFs across representatives of diverse bacterial groups. A single species was chosen per group. ProsmORF-pred achieves a median sensitivity of 80% at an average false positive rate of 0.40%. Detailed information is given in Table 1 and Table 2.
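For reference, sensitivity, false positive rate and specificity follow the standard definitions and can be computed from a set of predicted smORFs as sketched below. The function name benchmark and the toy identifiers are hypothetical; this is not the evaluation script used for the figures.

```python
def benchmark(predicted, positives, negatives):
    """Compute standard binary-classification rates from sets of ORF ids."""
    tp = len(predicted & positives)   # positives that were predicted
    fn = len(positives - predicted)   # positives that were missed
    fp = len(predicted & negatives)   # negatives wrongly predicted
    tn = len(negatives - predicted)   # negatives correctly rejected
    return {
        "sensitivity": tp / (tp + fn),          # TP / (TP + FN)
        "false_positive_rate": fp / (fp + tn),  # FP / (FP + TN)
        "specificity": tn / (fp + tn),          # 1 - false positive rate
    }


# Toy example: 5 positives, 20 negatives, 4 true hits and 1 false hit.
positives = {f"p{i}" for i in range(1, 6)}
negatives = {f"n{i}" for i in range(1, 21)}
predicted = {"p1", "p2", "p3", "p4", "n1"}

metrics = benchmark(predicted, positives, negatives)
```

For the toy sets this gives a sensitivity of 4/5 and a false positive rate of 1/20.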

3. Benchmarking Results on smORFs identified in recent experimental studies

Fig 2: Performance of ProsmORF-pred and other tools (PRODIGAL, RANSEPS, SMORFER and SMORFINDER) on experimental datasets. Panels (A) and (B) show the sensitivity and specificity of all tools on these datasets, and (C) shows the length distribution of each dataset. Detailed information is given in Table 3.

Datasets of smORFs identified in recent experimental studies used for benchmarking:

Datasets	Species	Positive	Negative	Source
Set A1	E. coli K12 substr. MG1655	268	96373	Ribo seq/Epitope Tag
Set A2	E. coli K12 substr. MG1655	87	96373	Epitope Tag
Set B1	S. enterica Typhimurium str. SL1344	131	94957	Ribo seq/Transcription/Epitope Tag
Set B2	S. enterica Typhimurium str. SL1344	13	94957	Epitope Tag
Set C1	S. enterica Typhimurium str. 14028s	116	94772	Ribo seq/Epitope Tag
Set C2	S. enterica Typhimurium str. 14028s	18	94772	Epitope Tag
Set D	Staphylococcus aureus str. Newman	164	68609	Mass Spectrometry
Set E	Caulobacter crescentus	61	41219	Ribo seq/MS
Set F	Streptococcus pneumoniae D39V	78	45394	Ribo seq
Set G	Bacteroides thetaiotaomicron VPI-5482	39	134110	MS/Ribo seq
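The table shows that negatives outnumber positives by several hundred to several thousand to one in every dataset. The short sketch below computes this ratio from the Positive and Negative columns. The counts are copied from the table; the variable names are illustrative.

```python
# (dataset, positives, negatives) copied from the table above.
datasets = [
    ("Set A1", 268, 96373),
    ("Set A2", 87, 96373),
    ("Set B1", 131, 94957),
    ("Set B2", 13, 94957),
    ("Set C1", 116, 94772),
    ("Set C2", 18, 94772),
    ("Set D", 164, 68609),
    ("Set E", 61, 41219),
    ("Set F", 78, 45394),
    ("Set G", 39, 134110),
]

# Number of negative smORFs per positive smORF in each dataset.
ratios = {name: neg / pos for name, pos, neg in datasets}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} negatives per positive")
```

With this degree of imbalance, even a small false positive rate corresponds to a number of false predictions comparable to or larger than the positive set, which is worth keeping in mind when reading the sensitivity and specificity values.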