ProsmORF-pred

1. Datasets:

For training the Random Forest ML classifier based on protein features, Refseq annotated ORFs (<= 150 amino acids) of E. coli str. K12 substr. MG1655 (NC_000913.3) were used as the positive set. The negative set was a random sample of the same size, drawn in the length range 20-150 aa from all theoretically enumerated ORFs in the E. coli genome other than those in the annotation. The equal-sized positive and negative sets had a size of 851 ORFs. The training set can be downloaded here.
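The balanced sampling described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sample_negative_set`, the dictionary layout (ORF identifier mapped to length in amino acids) and the fixed seed are assumptions made for the example.

```python
import random


def sample_negative_set(candidates, annotated_ids, n_positive,
                        min_len=20, max_len=150, seed=0):
    """Draw a negative set matching the positive set in size.

    candidates    : dict mapping a hypothetical ORF id to its length in aa
    annotated_ids : ids of annotated ORFs, excluded from the negatives
    n_positive    : size of the positive set to match
    """
    # Keep only unannotated ORFs within the requested length range.
    pool = sorted(
        orf_id for orf_id, aa_len in candidates.items()
        if orf_id not in annotated_ids and min_len <= aa_len <= max_len
    )
    if len(pool) < n_positive:
        raise ValueError("not enough candidate ORFs to match the positive set")
    # Sample without replacement; the seed only makes the toy run repeatable.
    return random.Random(seed).sample(pool, n_positive)


# Toy input: six enumerated ORFs, one of which is annotated.
toy_candidates = {"orf1": 25, "orf2": 160, "orf3": 90,
                  "orf4": 15, "orf5": 40, "orf6": 120}
toy_annotated = {"orf3"}
toy_negatives = sample_negative_set(toy_candidates, toy_annotated, n_positive=2)
```

Here `orf2` and `orf4` fall outside the 20-150 aa range and `orf3` is annotated, so the two negatives are drawn from the remaining three ORFs.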
Benchmarking of the performance was carried out by comparing the prediction results with those of PRODIGAL and RANSEPS on Refseq annotated small ORFs (<=100 aa) from a large dataset of bacterial genomes. These small ORFs had an average length of around 70-80 residues.
Additional benchmarking was also carried out using datasets of experimentally identified small ORFs in the length range of 20-50 residues. The set of experimentally identified ORFs in E. coli str. K12 substr. MG1655 was compiled from multiple studies (Pubmed:19121005, 30904393, 27013550, 29645342, 30837344). Other datasets of experimental ORFs, available for Salmonella enterica serovar Typhimurium str. 14028s (Pubmed:28122954), Salmonella enterica serovar Typhimurium str. SL1344 (Link), Staphylococcus aureus subsp. aureus str. Newman (Pubmed:34061833), Caulobacter crescentus (Pubmed:25078267), Streptococcus pneumoniae D39 (Pubmed:35852327) and Bacteroides thetaiotaomicron (Pubmed:31402174, 31841667), were also used. A subset of these ORFs that is validated at the protein level, the gold-standard set, was also used. For the purpose of benchmarking, ORFs catalogued from these studies were first length filtered (10-100 amino acids) and then filtered for overlaps, such that only one start for each stop was considered. For both analyses, the positive set was derived from annotated or experimental smORFs, while the negative set comprised all smORFs in the length range (10-100 aa) other than those annotated or experimentally identified.
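The two filtering steps (length filter, then one start per stop) can be sketched as below. The text does not say which start is retained when several share a stop; keeping the longest remaining ORF is an assumption made here for illustration, as are the record fields (`contig`, `strand`, `stop`, `aa_len`) and the function name `filter_orfs`.

```python
def filter_orfs(orfs, min_len=10, max_len=100):
    """Length-filter ORFs, then keep a single start per stop codon.

    Each ORF is a dict with hypothetical keys:
    'id', 'contig', 'strand', 'start', 'stop', 'aa_len'.
    """
    best_per_stop = {}
    for orf in orfs:
        # Step 1: length filter (10-100 amino acids by default).
        if not (min_len <= orf["aa_len"] <= max_len):
            continue
        # Step 2: ORFs sharing contig, strand and stop overlap in frame;
        # retain one of them (ASSUMPTION: the longest one).
        key = (orf["contig"], orf["strand"], orf["stop"])
        kept = best_per_stop.get(key)
        if kept is None or orf["aa_len"] > kept["aa_len"]:
            best_per_stop[key] = orf
    return list(best_per_stop.values())


# Toy input: 'a' and 'b' share a stop, 'c' is too short, 'd' is too long.
toy_orfs = [
    {"id": "a", "contig": "chr", "strand": "+", "start": 100, "stop": 400, "aa_len": 99},
    {"id": "b", "contig": "chr", "strand": "+", "start": 160, "stop": 400, "aa_len": 79},
    {"id": "c", "contig": "chr", "strand": "-", "start": 900, "stop": 870, "aa_len": 9},
    {"id": "d", "contig": "chr", "strand": "+", "start": 1000, "stop": 1399, "aa_len": 132},
]
toy_filtered = filter_orfs(toy_orfs)
```

Because the length filter runs first, a shorter in-range start is still kept when a longer start for the same stop exceeds the upper limit.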
Supplementary Datasets Discussed in the Paper:
Supplementary Dataset I : Positive and negative training set.
Supplementary Dataset II : All experimentally identified smORFs used for benchmarking.
Supplementary Dataset III : Subset of experimentally identified smORFs validated using epitope-tagging (protein level).
Supplementary Dataset IV : Assembly accessions of 3153 prokaryotic whole genomes used for conservation analysis.

2. Comparison with PRODIGAL and RANSEPS on Refseq annotated smORFs

Picture 1
Fig 1: Performance of PRODIGAL, RANSEPS and ProsmORF-pred on annotated smORFs across representatives from diverse bacterial groups. A single species was chosen for each group. ProsmORF-pred achieves a median sensitivity of 80 % at an average False Positive Rate of 0.40 %. Detailed information is shown in Table 1 and Table 2.
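For reference, the metrics named in the caption follow the standard definitions: sensitivity = TP / (TP + FN), False Positive Rate = FP / (FP + TN), and specificity = 1 - FPR. The sketch below computes them per genome and summarises them with a median and a mean; the confusion-matrix counts are made-up toy values, not results from the benchmark.

```python
from statistics import mean, median


def sensitivity(tp, fn):
    """Fraction of true smORFs that are predicted (TP / (TP + FN))."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of negative smORFs wrongly predicted (FP / (FP + TN))."""
    return fp / (fp + tn)


def specificity(fp, tn):
    """Fraction of negative smORFs correctly rejected (1 - FPR)."""
    return 1.0 - false_positive_rate(fp, tn)


# Toy per-genome counts (tp, fn, fp, tn); purely illustrative values.
toy_counts = [
    (80, 20, 4, 996),
    (45, 5, 10, 1990),
    (30, 20, 3, 997),
]

toy_sens = [sensitivity(tp, fn) for tp, fn, _, _ in toy_counts]
toy_fpr = [false_positive_rate(fp, tn) for _, _, fp, tn in toy_counts]

median_sensitivity = median(toy_sens)  # summary across genomes
average_fpr = mean(toy_fpr)            # summary across genomes
```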

3. Benchmarking Results on smORFs identified in recent experimental studies

Picture 3
Fig 2: Performance of ProsmORF-pred and other tools (PRODIGAL, RANSEPS, SMORFER and SMORFINDER) on experimental datasets. (A) and (B) show the sensitivity and specificity of all tools on these datasets, while (C) shows the length distribution of the respective datasets. Detailed information is given in Table 3.
Datasets of smORFs identified in recent experimental studies used for benchmarking:
Dataset  Species                                Positive  Negative  Source
Set A1   E. coli K12 substr. MG1655             268       96373     Ribo seq/Epitope Tag
Set A2   E. coli K12 substr. MG1655             87        96373     Epitope Tag
Set B1   S. enterica Typhimurium str. SL1344    131       94957     Ribo seq/Transcription/Epitope Tag
Set B2   S. enterica Typhimurium str. SL1344    13        94957     Epitope Tag
Set C1   S. enterica Typhimurium str. 14028s    116       94772     Ribo seq/Epitope Tag
Set C2   S. enterica Typhimurium str. 14028s    18        94772     Epitope Tag
Set D    Staphylococcus aureus str. Newman      164       68609     Mass Spectrometry
Set E    Caulobacter crescentus                 61        41219     Ribo seq/MS
Set F    Streptococcus pneumoniae D39V          78        45394     Ribo seq
Set G    Bacteroides thetaiotaomicron VPI-5482  39        134110    MS/Ribo seq
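The counts above show how strongly the negatives outnumber the positives in every dataset, which is why even a small False Positive Rate translates into many false calls. The sketch below computes the negatives-per-positive ratio from the table and the number of false positives implied by a given FPR. The FPR passed in is a hypothetical input chosen for illustration, not a result reported for these datasets.

```python
# (positive, negative) counts transcribed from the table above.
DATASETS = {
    "Set A1": (268, 96373),
    "Set A2": (87, 96373),
    "Set B1": (131, 94957),
    "Set B2": (13, 94957),
    "Set C1": (116, 94772),
    "Set C2": (18, 94772),
    "Set D": (164, 68609),
    "Set E": (61, 41219),
    "Set F": (78, 45394),
    "Set G": (39, 134110),
}


def negatives_per_positive(positive, negative):
    """Class imbalance expressed as negatives per positive smORF."""
    return negative / positive


def implied_false_positives(negative, fpr):
    """Number of false positives implied by an FPR on `negative` candidates."""
    return round(negative * fpr)


imbalance = {name: negatives_per_positive(p, n) for name, (p, n) in DATASETS.items()}

# ASSUMPTION: 0.004 is an arbitrary example FPR, not a measured value.
example_fp = implied_false_positives(DATASETS["Set A1"][1], fpr=0.004)
```

For Set A1, for example, there are roughly 360 negative candidates for every positive smORF.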