
# Differential Diagnosis of Hematologic and Solid Tumors Using Targeted Transcriptome and Artificial Intelligence

Open Access. Published: October 12, 2022
Diagnosis and classification of tumors are increasingly dependent on biomarkers. RNA expression profiling using next-generation sequencing provides reliable and reproducible information on the biology of cancer. This study investigated targeted transcriptome and artificial intelligence for differential diagnosis of hematologic and solid tumors. RNA samples from hematologic neoplasms (N = 2606), solid tumors (N = 2038), normal bone marrow (N = 782), and lymph node control (N = 24) were sequenced by next-generation sequencing using a targeted 1408-gene panel. There were 20 subtypes of hematologic neoplasms and 24 subtypes of solid tumors. Machine learning was used to distinguish between two diagnostic classes. A geometric mean naïve bayesian classifier was used for differential diagnosis across 45 diagnostic entities with assigned rankings. Machine learning showed high accuracy in distinguishing between two diagnoses, with area under the curve ranging from 0.841 to 1. The geometric mean naïve bayesian algorithm was trained using 3045 samples and tested on 1415 samples, and it gave the correct first-choice diagnosis in 100%, 88%, 85%, 82%, 88%, 72%, and 72% of acute lymphoblastic leukemia, acute myeloid leukemia, diffuse large B-cell lymphoma, colorectal cancer, lung cancer, chronic lymphocytic leukemia, and follicular lymphoma cases, respectively. We conclude that targeted transcriptome combined with artificial intelligence is highly useful for diagnosis and classification of various cancers. Mutation profiles and clinical information can improve these algorithms and minimize errors in diagnoses.
Diagnosis and classification of tumors are increasingly dependent on biological and molecular biomarkers. The management and therapy of cancer vary significantly, depending on the proper classification of cancer. Relying on the expertise of a pathologist and the morphologic features of the tumor alone leads to significant discrepancies in diagnosis because of the subjective nature of this assessment. More important, the possibility of an incorrect diagnosis is relatively high. Numerous studies have shown that errors in the diagnosis and classification of cancers continue to be a significant issue in current clinical practice (Troyanskaya et al., "Artificial intelligence and cancer"; Chen and Dhaliwal, "Next-generation artificial intelligence for diagnosis: from predicting diagnostic labels to 'wayfinding'"; Elemento et al., "Artificial intelligence in cancer research, diagnosis and therapy"; Moon and Nakai, "Stable feature selection based on the ensemble L1-norm support vector machine for biomarker discovery"; Hong et al., "RNA sequencing: new technologies and applications in cancer research"; Govindarajan et al., "High-throughput approaches for precision medicine in high-grade serous ovarian cancer").
Recent advances in utilizing machine learning to determine the morphology and immunohistochemistry of tumors are promising for improving cancer diagnosis and classification and reducing variability (Mercer et al., "Targeted RNA sequencing reveals the deep complexity of the human transcriptome"; Reeser et al., "Validation of a targeted RNA sequencing assay for kinase fusion detection in solid tumors"; Togni et al., "Identification of the NUP98-PHF23 fusion gene in pediatric cytogenetically normal acute myeloid leukemia by whole-transcriptome sequencing").
RNA profiling of cancer cells has been highly useful for providing information on the tumor, microenvironment, and immune response (Veeraraghavan et al., "Recurrent and pathological gene fusions in breast cancer: current advances in genomic discovery and clinical implications"; Kloosterman et al., "A systematic analysis of oncogenic gene fusions in primary colon cancer").
Using next-generation sequencing to analyze RNA makes expression profiling a reliable clinical tool and an approach for discovering biomarkers, characterizing the biology of tumors, and predicting the efficacy of various therapeutic approaches (Veeraraghavan et al.; Kloosterman et al.).
RNA sequencing and quantification using next-generation sequencing are more reliable and reproducible than older technologies, such as microarrays or PCR-based RNA quantification (Zhou et al., "The functions and clinical significance of circRNAs in hematological malignancies"; Liu et al., "Role of microRNAs, circRNAs and long noncoding RNAs in acute myeloid leukemia"; Sun and Chen, "Principles and innovative technologies for decrypting noncoding RNAs: from discovery and functional prediction to clinical application").
Targeted RNA sequencing of various tissue samples allows us to focus on relevant oncogenic markers and to sequence at greater depth, which improves quantification of genes that are expressed at low levels but might be major regulators of the complex biology of cells. This study explored the potential of using targeted transcriptomes and artificial intelligence for the differential diagnosis and classification of hematologic and solid tumors.

## Materials and Methods

### Patients and Samples

A total of 5450 fresh bone marrow (BM) and formalin-fixed, paraffin-embedded (FFPE) cancer samples were obtained for this study (Supplemental Table S1). The samples were collected consecutively, without any selection, during routine clinical molecular profiling using next-generation sequencing of DNA and RNA between November 2018 and November 2021. The tumor fractions varied between 30% and 80%, and the sample mix reflected real-world occurrence. Diagnoses of the samples are listed in Table 1. The samples for various types of leukemia, myelodysplasia, and normal tissues were collected from fresh BM, whereas lymphoma cases and solid tumors were based on FFPE samples. Tumor diagnosis was confirmed using morphologic analysis, flow cytometry, immunohistochemistry, and molecular profiling of the DNA and RNA. The DNA and RNA were extracted from the BM samples using an automated Maxwell System platform (Promega, Madison, WI). The Agencourt FormaPure Total 96-Prep Kit was used to extract both the DNA and RNA from FFPE samples using an automated KingFisher Flex, following the manufacturer's recommendations. The Agencourt FormaPure Kit provides a split protocol to extract both the DNA and RNA from the same FFPE lysate. Blood and BM samples were collected in EDTA. The DNA and RNA were extracted from fresh samples within 72 hours of collection. The study protocol was approved by the Western Copernicus Group Institutional Review Board (New England Institutional Review Board, Aspire Institutional Review Board, and Midlands Institutional Review Board; number 1-1476184-1). Consent was waived because of incidental collection and lack of risk. This study was conducted in accordance with the principles of the Declaration of Helsinki and its later amendments.
Table 1. List of Neoplasms and Samples

| Disease | N |
|---|---|
| Aplastic anemia | 12 |
| Acute lymphoblastic leukemia | 89 |
| Acute myeloid leukemia | 352 |
| Brain tumors | 44 |
| Breast cancer | 137 |
| Burkitt lymphoma | 10 |
| Carcinoma (not otherwise specified) | 32 |
| Clear cell renal cell carcinoma | 8 |
| Cholangiocarcinoma | 9 |
| Chronic lymphocytic leukemia | 167 |
| Chronic myeloid leukemia | 46 |
| Chronic myelomonocytic leukemia | 97 |
| Colorectal carcinoma | 308 |
| Diffuse large B-cell lymphoma | 746 |
| Endometrial cancer | 113 |
| Esophageal carcinoma | 34 |
| Follicular lymphoma | 145 |
| Gastric carcinoma | 10 |
| Gastrointestinal stromal tumor | 11 |
| Hairy cell leukemia | 5 |
| Hodgkin lymphoma | 65 |
| Lung cancer | 794 |
| Lymphoma (not otherwise classified) | 3 |
| Mantle cell lymphoma | 93 |
| Marginal zone lymphoma | 76 |
| Myelodysplastic syndrome | 316 |
| Melanoma | 21 |
| Multiple myeloma | 113 |
| Myeloproliferative neoplasms | 88 |
| Neuroendocrine tumor | 5 |
| Normal bone marrow, fresh | 782 |
| Normal lymph node | 24 |
| Ovarian cancer | 126 |
| Pancreatic cancer | 96 |
| Prostate cancer | 36 |
| Sarcoma | 137 |
| Squamous cell carcinoma of skin | 15 |
| T-cell acute lymphoblastic leukemia | 7 |
| T-cell lymphoma | 145 |
| Thyroid cancer | 24 |
| Upper gastrointestinal cancer | 23 |
| Urothelial cancer | 38 |
| Vulva cancer | 9 |
| Waldenstrom macroglobulinemia | 31 |
| Total | 5450 |

### RNA Library Construction and Sequencing

The samples were selectively enriched for 1408 cancer-associated genes using the reagents provided in the Illumina TruSight RNA pan-cancer panel (Illumina, San Diego, CA) (Supplemental Table S1). cDNA was generated from the cleaved RNA fragments using random primers during first- and second-strand synthesis. Sequencing adapters were ligated to the resulting double-stranded cDNA fragments. The coding regions of the expressed genes were captured from this library using sequence-specific probes to generate the final library. Sequencing was performed using the Illumina NextSeq 550 platform. Ten million reads per sample were generated in a single run, and the read length was 2 × 150 bp. An expression profile was generated from the sequencing coverage profile of each sample using Cufflinks. Expression levels were measured as fragments per kilobase of transcript per million mapped fragments (FPKM).
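As a rough illustration of the expression unit, the value for a gene can be computed from its fragment count, its transcript length, and the total number of mapped fragments. The sketch below is not the Cufflinks implementation, which additionally models isoforms and fragment assignment; the function name `fpkm`, the gene names, and the toy counts are hypothetical.

```python
def fpkm(fragment_counts, gene_lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragment_counts: dict gene -> number of fragments assigned to the gene
    gene_lengths_bp: dict gene -> transcript length in base pairs
    """
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        raise ValueError("no mapped fragments")
    # count / (length in kb) / (total in millions) == count * 1e9 / (length * total)
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_fragments)
        for gene, count in fragment_counts.items()
    }


# Hypothetical toy counts for three genes (not data from this study).
counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 8000}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}
values = fpkm(counts, lengths)
```

In this toy case GENE_A and GENE_B receive the same value, because GENE_B has three times the fragments but is also three times as long; the length normalization is what makes expression comparable across genes.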

### Using Machine Learning Algorithm for Classification of Two Diagnostic Classes

The RNA expression data were used in the machine learning algorithm to distinguish between any two diagnostic classes. Recognizing that the high dimensionality of the problem would make it vulnerable to overfitting with many artificial intelligence techniques, we applied a modified version of naïve Bayes (geometric mean naïve Bayes). The conditional independence assumption of naïve Bayes is almost never strictly satisfied in practical applications. However, it is still a useful tool, especially in situations like this one: with a high dimension and a limited sample size, estimating the correlations between genes would be counterproductive. The naïve Bayes approach has a small number of parameters and, hence, a lower capacity as a learning system, which helps address the overfitting problem according to statistical learning theory. We developed the geometric mean naïve Bayes method to address the numeric underflow issue of standard naïve Bayes when applied to a high-dimensional problem. When the likelihood is the product of thousands of conditional probabilities, underflow is unavoidable, even with proportional scaling. In geometric mean naïve Bayes, we apply the geometric mean to the conditional probabilities. The method is documented in a separate article (Albitar et al., "Determining clinical course of diffuse large B-cell lymphoma using targeted transcriptome and machine learning algorithms").
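The underflow problem is easy to reproduce. The following minimal sketch, assuming a hypothetical conditional density of 0.01 for each of 1408 genes, compares the plain product with the geometric mean computed in log space. The function names are illustrative and are not taken from the authors' software.

```python
import math


def naive_product(probs):
    """Standard naive Bayes likelihood: plain product of per-gene terms."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def geometric_mean(probs):
    """Geometric mean computed in log space: exp(mean(log p))."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


# Hypothetical input: 1408 genes, each contributing a density of 0.01.
probs = [0.01] * 1408
product = naive_product(probs)   # true value is 1e-2816, below double precision range
gmean = geometric_mean(probs)    # stays on the scale of a single term
```

The product collapses to zero in double precision, so two competing classes could no longer be compared, whereas the geometric mean remains on the scale of one conditional probability regardless of how many genes are combined.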
We proved that the geometric mean is essentially the only operation that preserves the conditional independence of naïve Bayes without causing underflow. A gene selection/filtering step was used to eliminate irrelevant genes and improve the training. Two statistical criteria were applied to each individual gene for the sole purpose of eliminating irrelevant genes; consequently, they were not used to measure the performance and confidence of the final classifier. The two measures indicate how relevant a gene is for distinguishing the classes. Although they are used for the same purpose, they do not give the same ranking of the genes. To explain the rationale for these measures, we refer to them as the performance measure and the stability measure. In selecting genes that distinguished between the two classes, we used the standard naïve bayesian classifier on each gene with k-fold cross-validation.
$$\text{accuracy} = \frac{1}{m} \sum_{i=1}^{m} \frac{t_i}{n_i} \tag{1}$$

where $m$ is the number of classes, $n_i$ is the number of cases in class $i$, and $t_i$ is the number of correctly classified cases in class $i$, estimated using the k-fold cross-validation.
When the number of classes was m = 2, the measure was the average of the sensitivity and specificity. In general, it was the average of the accuracies of the individual classes. Overall accuracy is not appropriate for gene selection, because it can be misleading in data sets with unbalanced classes (eg, in a data set with 80 negative and 20 positive cases, a trivial classifier that labels every case negative yields 80% accuracy). The coefficient 1/m was usually ignored in our study, because m was a constant and did not affect the ranking. The k-fold cross-validation was usually implemented with k = n (ie, leave one out). Although the leave-one-out method was computationally more expensive, the efficiency of the naïve Bayes algorithm still made the selection process reasonably fast. This accuracy value provided a direct measure of how well a gene classified the groups; however, it did not provide confidence information. For the confidence measure for gene selection, we relied on the P value of a gene to differentiate the classes. Analysis of variance was applied to compute the P value for a gene to discriminate between groups.
$$F = \frac{MSB}{MSW} \tag{2}$$

where MSB was the mean sum of squares between groups, MSW was the mean sum of squares within groups, and F was the analysis of variance statistic, which follows the F distribution. The P value was obtained from the F value. This confidence value provided the measure of the stability and robustness of the gene in classifying the groups. It did not provide concrete classification accuracy but contributed to the overall confidence in the differences of the class means. Both criteria provided quantitative measures of the relevance of a gene for classification; however, these two relevance measures did not always produce the same ranking. Applying both measures would produce effective and stable gene selection methods for machine learning–based classification systems.
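The two per-gene relevance measures can be sketched as follows, assuming class labels and one gene's expression values are held in plain Python lists. The function names are hypothetical. The P value itself is not computed here; it would be read from the F distribution with (m − 1, n − m) degrees of freedom, for example with `scipy.stats.f.sf`.

```python
from statistics import mean


def mean_class_accuracy(true_labels, predicted_labels):
    """Equation 1: average over classes of t_i / n_i."""
    per_class = []
    for c in sorted(set(true_labels)):
        n_i = sum(1 for t in true_labels if t == c)
        t_i = sum(1 for t, p in zip(true_labels, predicted_labels)
                  if t == c and p == c)
        per_class.append(t_i / n_i)
    return mean(per_class)


def anova_f(groups):
    """Equation 2: F = MSB / MSW for one gene, one list of values per class."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    m, n = len(groups), len(all_values)
    group_means = [mean(g) for g in groups]
    ssb = sum(len(g) * (gm - grand_mean) ** 2
              for g, gm in zip(groups, group_means))
    ssw = sum((v - gm) ** 2
              for g, gm in zip(groups, group_means) for v in g)
    return (ssb / (m - 1)) / (ssw / (n - m))


# The unbalanced example from the text: 80 negatives, 20 positives,
# and a trivial classifier that calls every case negative.
truth = ["neg"] * 80 + ["pos"] * 20
trivial = ["neg"] * 100
overall_accuracy = sum(t == p for t, p in zip(truth, trivial)) / len(truth)
balanced = mean_class_accuracy(truth, trivial)

# Hypothetical expression values of one gene in two classes.
f_value = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

For the trivial classifier the overall accuracy is 0.8 while the per-class average is 0.5, which is why the per-class average is the safer ranking criterion when class sizes differ.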
After selecting individual genes, we used a naïve bayesian classifier to distinguish between diagnostic classes using multiple selected genes with both confidence and P values. However, because the naïve bayesian classifier has severe numerical underflow problems when the dimensionality of the data is high, we developed the geometric mean naïve bayesian (GMNB) classifier, which eliminated the underflow problem by applying a multiplicative positive increasing function to the likelihood. In particular,

$$L(x \mid c) = \left( \prod_{j=1}^{d} P(x_j \mid c) \right)^{1/d} \tag{3}$$

where $x_j$ is the expression of gene $j$, $c$ is the diagnostic class, and $d$ is the number of genes (the dimension). This formula represented the geometric mean of conditional probabilities. We proved that the GMNB method resolved the underflow problem for high-dimensional data by showing that the expected value of such a likelihood approached 1/e when dimension d → ∞. We also proved that such a function is unique up to a constant multiple of the exponent (Sun and Chen, "Principles and innovative technologies for decrypting noncoding RNAs: from discovery and functional prediction to clinical application").
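A minimal sketch of a geometric mean naïve Bayes classifier is given below. It rests on two assumptions that are not stated in the text: each gene's class-conditional density is modeled as a Gaussian with a per-class mean and variance, and all classes have equal prior probability. The function names and the toy expression values are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_gaussians(samples_by_class):
    """Per class, estimate (mean, variance) of every gene independently."""
    params = {}
    for label, rows in samples_by_class.items():
        genes = list(zip(*rows))  # transpose: one tuple of values per gene
        # A small variance floor avoids division by zero for constant genes.
        params[label] = [(mean(g), max(pvariance(g), 1e-6)) for g in genes]
    return params


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def gmnb_score(x, gene_params):
    """Geometric mean of per-gene densities = exp(mean of log densities)."""
    logs = [log_gaussian(xj, mu, var) for xj, (mu, var) in zip(x, gene_params)]
    return math.exp(sum(logs) / len(logs))


def classify(x, params):
    """Return the best-scoring class and the score of every class."""
    scores = {label: gmnb_score(x, gp) for label, gp in params.items()}
    return max(scores, key=scores.get), scores


# Hypothetical two-gene training data for two classes.
training = {
    "A": [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5]],
    "B": [[3.0, 1.0], [3.2, 1.5], [2.8, 0.5]],
}
params = fit_gaussians(training)
label, scores = classify([1.1, 5.1], params)
```

Because the score is an average in log space, adding more genes does not drive it toward zero, and the scores of different classes remain directly comparable.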
To reduce the effects of noise and avoid overfitting when selecting these genes, we employed leave-one-out cross-validation to obtain a robust performance measure. For an individual gene, a GMNB was constructed on the training subset and examined on the testing subset. The complement of the cross-validation error rate was used as the discriminant measure for bins.
$Equation 4.$
(4)

Instead of the overall error rate, the value d takes the sum of the error rates of individual classes. This definition avoided bias when the sample sizes were not balanced across classes. The genes were ranked by d, with higher values corresponding to better-performing genes for classification. To address stability issues, we used a t-test to measure the significance of a bin separating the two classes. By setting a P-value threshold, insignificant bins could be filtered out. The selected genes were used to distinguish between the two classes using a k-fold cross-validation procedure (with k = 12). A naïve bayesian classifier was constructed on the k − 1 training subsets and evaluated on the remaining testing subset. The training and testing subsets were then rotated, and the average of the classification errors was used to measure the relevance of the gene. The classification system was trained using a selected subset of the most relevant genes. The processes of gene selection and class selection were applied iteratively to obtain an optimal classification system, and a subset of genes relevant to distinguishing between the two classes was defined and isolated.
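The rotation of training and testing subsets can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: folds are formed by interleaving sample indices, and a nearest-class-mean rule on a single gene stands in for the naïve bayesian classifier. All names and values are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n-1 into k interleaved folds (deterministic)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def cross_validated_predictions(xs, ys, k, fit, predict):
    """Train on k-1 folds, predict the held-out fold, rotate through all folds."""
    predictions = [None] * len(xs)
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        for i in test_idx:
            predictions[i] = predict(model, xs[i])
    return predictions


def fit_class_means(train_x, train_y):
    """Stand-in classifier: remember the mean expression of each class."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, x):
    """Assign the class whose mean expression is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


# Hypothetical single-gene expression values for two well-separated classes.
xs = [1.0 + 0.1 * i for i in range(12)] + [5.0 + 0.1 * i for i in range(12)]
ys = [0] * 12 + [1] * 12
predictions = cross_validated_predictions(
    xs, ys, k=12, fit=fit_class_means, predict=predict_nearest_mean
)
folds = k_fold_indices(len(xs), 12)
```

Every sample is predicted exactly once by a model that never saw it during training, which is what makes the resulting error rate a fair estimate of a gene's relevance.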

### Using GMNB Classifier for Ranking Diagnostic Classes

The GMNB classifier described above was also used to classify each sample against multiple diagnostic classes. The GMNB method resolved the underflow problem of high-dimensional data.

## Results

### High Accuracy in the Differential Diagnosis between Two Diagnostic Classes

Initially, we evaluated the ability of machine learning to distinguish between two disease classes. Using a machine learning algorithm, we first selected the proper genes to distinguish between the two classes by using the best classifier biomarkers based on the P value for predicting a specific diagnosis. This approach showed high sensitivity and specificity for distinguishing between the two diagnoses (Table 2). As shown in Figure 1 and Table 2, the area under the curve for most classifications was >0.90. Distinguishing between normal bone marrow and myelodysplastic syndrome (MDS) was relatively less reliable because of the significant overlap between the two entities. The algorithm used the expression of 400 genes to achieve a sensitivity of 78.1% and a specificity of 75.3%. However, the presence or absence of mutations can easily distinguish between these entities. Similarly, distinguishing between MDS and myeloproliferative neoplasms (MPNs) was relatively less robust because of the known overlap between the two entities. Using the expression of 500 genes, the algorithm achieved a sensitivity of 90.9% and a specificity of 70.8% without using mutation profiles. As expected, mutation profiling can reliably distinguish between these entities. Furthermore, clinical cases with features of both MDS and MPN are well documented, and the relatively poor distinction between these two entities might be due to the presence of such cases among the samples used in this study. Distinguishing between chronic lymphocytic leukemia, mantle cell lymphoma, and marginal zone lymphoma was remarkably reliable (Table 2). The expression of a mere 10 genes was adequate to distinguish between chronic lymphocytic leukemia and mantle cell lymphoma, with a sensitivity of 94.6% and a specificity of 95.2%.
Distinguishing between chronic lymphocytic leukemia and marginal zone lymphoma required the expression profile of 25 genes to achieve a sensitivity of 98.7% and a specificity of 91%, due to the overlap between the two entities. Furthermore, distinguishing between Hodgkin lymphoma and normal lymph node, or T-cell lymphoma, was highly reliable using the expression of 100 and 500 genes, respectively (Table 2). Similarly, distinguishing between various solid tumors was also highly reliable. Distinguishing between various sarcoma cases and gastrointestinal stromal tumors was highly reliable using the expression of 100 genes.
Table 2. Transcriptome and Differential Diagnosis between Two Diagnostic Classes

| Two classes | AUC (95% CI) | Sensitivity, % | Specificity, % | Genes, N | AUC − 1 (95% CI) |
|---|---|---|---|---|---|
| Normal versus AML | 0.9764 (0.954–0.974) | 90.9 | 93.2 | 100 | 0.945 (0.933–0.957) |
| Normal versus ALL | 0.981 (0.973–0.989) | 95.1 | 95.5 | 200 | 0.977 (0.968–0.985) |
| Normal versus CLL | 0.997 (0.994–0.999) | 96.4 | 98.8 | 100 | 0.980 (0.973–0.988) |
| Normal versus mantle | 0.992 (0.987–0.997) | 95.1 | 97.8 | 100 | 0.969 (0.959–0.980) |
| Normal versus MDS | 0.831 (0.801–0.861) | 78.1 | 75.3 | 400 | 0.826 (0.796–0.856) |
| Normal versus MPN | 0.923 (0.884–0.962) | 90.9 | 82.3 | 400 | 0.903 (0.860–0.946) |
| MDS versus MPN | 0.884 (0.837–0.931) | 90.9 | 70.8 | 500 | 0.806 (0.748–0.864) |
| AML versus MDS | 0.880 (0.854–0.906) | 86.1 | 70.2 | 400 | 0.864 (0.837–0.892) |
| CLL versus mantle | 0.986 (0.968–1.000) | 94.6 | 95.2 | 10 | 0.986 (0.968–1.00) |
| Marginal versus CLL | 0.984 (0.964–1.00) | 98.7 | 91 | 25 | 0.864 (0.809–0.920) |
| Marginal versus follicular | 0.946 (0.917–0.974) | 91 | 93.4 | 550 | 0.942 (0.912–0.971) |
| Hodgkin versus normal LN | 0.990 (0.972–1.00) | 95.4 | 100 | 100 | 1.00 (1.00–1.00) |
| Hodgkin versus T-cell lymphoma | 0.963 (0.930–0.996) | 92.3 | 91 | 500 | 0.902 (0.850–0.954) |
| Hodgkin versus DLBCL | 0.975 (0.948–1.00) | 96.9 | 95.3 | 500 | 0.965 (0.934–0.997) |
| DLBCL versus follicular | 0.986 (0.972–0.999) | 95.9 | 93.1 | 600 | 0.975 (0.957–0.993) |
| DLBCL versus T-cell lymphoma | 0.967 (0.946–0.988) | 91.7 | 89.8 | 600 | 0.942 (0.915–0.969) |
| Lung versus colorectal | 0.982 (0.975–0.989) | 97.2 | 94.5 | 900 | 0.977 (0.969–0.985) |
| Lung versus breast | 0.988 (0.982–0.994) | 98 | 92.7 | 700 | 0.988 (0.982–0.994) |
| Breast versus ovarian | 0.994 (0.984–1.00) | 100 | 94.2 | 700 | 0.989 (0.976–1.00) |
| Ovarian versus endometrial | 0.959 (0.933–0.984) | 92.9 | 91.2 | 600 | 0.853 (0.803–0.902) |
| Breast versus colorectal | 0.997 (0.991–1.00) | 97.8 | 98.7 | 800 | 0.987 (0.973–1.00) |
| Pancreas versus colorectal | 0.989 (0.980–0.997) | 94.5 | 95.8 | 550 | 0.971 (0.956–0.985) |
| Pancreas versus esophageal | 0.999 (0.990–1.00) | 97.1 | 98.9 | 550 | 0.960 (0.914–1.00) |
| Ovarian versus lung | 0.994 (0.984–1.00) | 97.6 | 96.6 | 600 | 1.00 (0.997–1.00) |
| Lung versus DLBCL | 0.996 (0.992–0.999) | 97.2 | 97.3 | 800 | 0.988 (0.983–0.993) |
| Sarcoma versus ovarian | 0.995 (0.986–1.00) | 99.2 | 95.7 | 300 | 1.00 (0.997–1.00) |
| Sarcoma versus GIST | 1.00 (0.997–1.00) | 99.3 | 100 | 300 | 1.00 (0.997–1.00) |

AUC, area under the curve; ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; CI, confidence interval; CLL, chronic lymphocytic leukemia; DLBCL, diffuse large B-cell lymphoma; GIST, gastrointestinal stromal tumor; LN, lymph node; MDS, myelodysplastic syndrome; MPN, myeloproliferative neoplasm.

### Differential Diagnosis between 47 Different Diagnostic Classes with Ranking

Distinguishing between multiple classes was significantly more complex, because the specific biomarkers that are suitable for distinguishing between two diagnostic classes may not be relevant for determining the differences between these two classes and the rest of the diagnostic classes.
The most difficult initial step was the selection of biomarkers for distinguishing between one class and the rest of the 47 diagnostic classes. To overcome this problem, we used all 1408 biomarkers without selection to provide a score that could be ranked to predict a specific diagnostic class. We used a machine learning approach that is based on a generalized naïve bayesian classifier to train the system to distinguish between 47 different diagnostic classes. We first used 3045 cases for training; then, we used 1415 cases for testing. Because some of the diagnostic classes had too few samples, the testing set included only 23 different diagnostic classes diagnosed against the 47 diagnostic classes (Table 3). To eliminate the underflow associated with the use of naïve bayesian classification, we applied the geometric mean to the likelihood produced by the naïve bayesian classifier. This allowed us to obtain a score for each diagnostic class. The score was used to rank the likelihood of each diagnosis. The purpose of this approach was to use additional information, particularly other clinical data, to select the best diagnosis.
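The ranking step can be sketched as follows, assuming one score per diagnostic class has already been computed for a sample. The class names and score values are hypothetical, and the relative gap between the two best scores, defined here as (first − second) / first, is only one possible way to flag cases in which the second choice deserves consideration.

```python
def rank_diagnoses(scores):
    """Sort diagnostic classes by score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_two_with_gap(scores):
    """Return the first choice, the second choice, and their relative score gap."""
    ranked = rank_diagnoses(scores)
    (first, s1), (second, s2) = ranked[0], ranked[1]
    relative_gap = (s1 - s2) / s1
    return first, second, relative_gap


# Hypothetical scores for one sample (not data from this study).
sample_scores = {"DLBCL": 0.52, "Follicular": 0.50, "Hodgkin": 0.31, "Lung": 0.12}
first, second, gap = top_two_with_gap(sample_scores)
```

When the gap is small, as in this toy example, the ranking alone does not separate the two leading diagnoses, and additional clinical or mutation data would be the natural tie-breaker.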
Table 3. Transcriptome and Differential Diagnosis Using Machine Learning Trained Using 47 Different Diagnostic Classes

| Diagnosis | Cases, N | Accurate diagnosis as first choice (PPA), n (%) | PPV, % | Accurate diagnosis as second choice, n (%) | PPA by first and second choices, % |
|---|---|---|---|---|---|
| ALL | 26 | 26 (100) | 84 | 0 (0) | 100 |
| Colorectal | 101 | 83 (82) | 79 | 4 (4) | 86 |
| Brain | 16 | 12 (75) | 75 | 0 (0) | 75 |
| Lung | 201 | 177 (88) | 73 | 7 (3) | 91 |
| DLBCL | 149 | 127 (85) | 73 | 8 (5) | 91 |
| Breast | 31 | 25 (81) | 71 | 2 (6) | 87 |
| CLL | 61 | 44 (72) | 69 | 5 (8) | 80 |
| Endometrial | 31 | 21 (68) | 66 | 3 (10) | 78 |
| MM | 31 | 22 (71) | 65 | 0 (0) | 71 |
| Ovarian | 41 | 29 (71) | 63 | 6 (15) | 85 |
| Pancreas | 31 | 19 (61) | 58 | 5 (16) | 77 |
| Follicular | 36 | 26 (72) | 53 | 5 (14) | 86 |
| Mantle | 31 | 18 (58) | 50 | 3 (10) | 68 |
| Sarcoma | 40 | 26 (65) | 45 | 1 (3) | 68 |
| Hodgkin | 26 | 16 (62) | 41 | 9 (35) | 97 |
| Normal | 201 | 92 (46) | 37 | 39 (19) | 65 |
| AML | 120 | 106 (88) | 35 | 6 (5) | 93 |
| T cell | 41 | 21 (51) | 34 | 8 (20) | 71 |
| Marginal | 26 | 8 (31) | 26 | 4 (15) | 46 |
| MDS | 101 | 19 (19) | 13 | 47 (47) | 65 |
| MPN | 26 | 3 (12) | 9 | 3 (12) | 23 |
| CMML | 31 | 2 (6) | 4 | 2 (6) | 13 |
| CML | 17 | 0 (0) | 0 | 1 (6) | 6 |

ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; CLL, chronic lymphocytic leukemia; CML, chronic myeloid leukemia; CMML, chronic myelomonocytic leukemia; DLBCL, diffuse large B-cell lymphoma; MDS, myelodysplastic syndrome; MM, multiple myeloma; MPN, myeloproliferative neoplasm; PPA, positive percentage agreement; PPV, positive predictive value.
The testing set included cases of hematologic and solid tumors. Evaluating each of the diagnostic classes individually showed variations between the diseases (Table 3). In acute lymphoblastic leukemia, 100% of the cases were correctly diagnosed, and the overall positive predictive value (PPV) was 84%. In contrast, none of the 17 chronic myeloid leukemia cases was correctly diagnosed, and most were classified as MPN or chronic myelomonocytic leukemia. As expected, all chronic myeloid leukemia cases were classified correctly when molecular abnormalities were assessed using the same RNA sequencing data. All chronic myeloid leukemia cases demonstrated the presence of breakpoint cluster region–abelson 1 (BCR-ABL1) fusion mRNA, and the PPV was 100%. There was a significant overlap in the diagnosis based on RNA expression alone between the normal BM, MDS, chronic myelomonocytic leukemia, MPN, and acute myeloid leukemia. Similarly, normal BM was easily distinguished from these myeloid neoplasms when the mutation profile was considered. The same was true when distinguishing between acute myeloid leukemia and MDS/chronic myelomonocytic leukemia. Mutation profiles were crucial for distinguishing between these myeloid entities.
The machine learning software correctly diagnosed 85% of diffuse large B-cell lymphoma cases, with a PPV of 73%. The diagnosis was 91% correct when the first and second choices were considered.
Among solid tumors, colorectal cancers were diagnosed correctly in 82% of the cases, with a PPV of 79%. Similarly, lung cancers were diagnosed correctly in 88% of the cases as the first choice, with a PPV of 73%; 91% were diagnosed correctly when both the first and second choices were considered (Table 3).
Sarcoma was correctly predicted in 65% of the cases, with a PPV of 45%. Most of the misdiagnosed sarcoma cases were classified as ovarian cancer, and vice versa. This is most likely due to the presence of stromal elements in the ovarian tumors used for training and diagnosis.
We also evaluated the diagnostic accuracy of this system by grouping the diagnostic classes into five groups: lymphoid, myeloid, carcinoma (including brain tumors), sarcoma, and normal (Table 4). As shown in Table 4, correct diagnosis was obtained in 84% of the cases as the first ranked diagnostic choice, and an additional 8% as the second ranked diagnostic choice, with a final diagnostic accuracy of 92%. Most of the cases that were missed by the first choice and captured by the second choice had scores close to each other (<4%), indicating that the second option should be considered. Mostly, the missed cases were normal BM, most of which were misdiagnosed as MDS because of the significant similarity between the BM from MDS and normal BM. Particularly, all the normal BM samples were collected from patients having cytopenia and were considered negative for neoplasm because of lack of cytogenetic, morphologic, and molecular abnormalities. These cases were easily distinguished from MDS when mutation data were considered. Generally, misdiagnosed cases were commonly misdiagnosed within the same diagnostic category.
Table 4. Transcriptome and Differential Diagnosis between Five Major Diagnostic Classes Using Machine Learning and Targeted Transcriptome

| Diagnosis | Cases, N | Cases correctly diagnosed as first choice (PPA), n (%) | Sensitivity (95% CI), % | Specificity (95% CI), % | Cases correctly diagnosed as second choice (PPA), n (%) | Cases correctly diagnosed as first and second choices (PPA), % |
|---|---|---|---|---|---|---|
| Lymphoid | 427 | 389 (91) | 77 (72–81) | 88 (86–90) | 20 (5) | 96 |
| Myeloid | 295 | 258 (87)* | 44 (38–49) | 77 (75–80) | 26 (9) | 96 |
| Carcinoma | 452 | 427 (94)† | 81 (77–84) | 95 (92–96) | 17 (4) | 98 |
| Normal | 201 | 93 (46)‡ | 46 (39–53) | 96 (95–97) | 41 (20) | 67 |
| Sarcoma | 40 | 26 (65)§ | 65 (48–79) | 99 (98–99) | 1 (3) | 68 |
| Total | 1415 | 1189 (84) | | | 109 (8) | 92 |

PPA, positive percentage agreement.
* Of 37 patients, 36 were classified as normal.
† Of 25 patients, 14 were lymphoid because the tumor was metastatic to the lymph node or pleural fluid.
‡ Of 108 cases, 107 were myeloid neoplasms (chronic myelomonocytic leukemia, myelodysplastic syndrome, chronic myeloid leukemia, or acute myeloid leukemia).
§ Of 14 cases, 4 were classified as ovarian (Mullerian tumors) and 2 in lymph nodes.

## Discussion

Cancer is a genomic disease (Berger and Mardis, "The emerging clinical relevance of genomics in cancer medicine"). The DNA changes in cancer lead to a cascade of abnormalities in RNA expression, which, in turn, lead to abnormal phenotypes, including abnormal or absent differentiation, uncontrolled growth and proliferation, and abnormal apoptosis (Curtius et al., "An evolutionary perspective on field cancerization"). Furthermore, these changes in cells trigger alterations in the host response that can be observed in the tumor microenvironment (Hanahan, "Hallmarks of cancer: new dimensions"). Analyzing the RNA of cancerous tissues can provide tremendous information on the biology of the tumor, its differentiation, and the surrounding microenvironment. This information provides insight into the clinical behavior of tumors and the efficacy of therapy (Jarosz-Biej et al., "Tumor microenvironment as a 'game changer' in cancer radiotherapy"). It can be used to determine diagnosis, clinical course, potential therapeutic targets, and prognosis. In this study, we explored the potential of using RNA expression profiles to precisely diagnose cancer.
Determining the initial diagnosis is the first step in cancer management and therapy. Pathologists typically use the morphology and large panels of immunohistochemistry to confirm cancer diagnosis. Nevertheless, misdiagnosis may have a significant impact on clinical decisions and treatment, because morphologic evaluation is subjective and depends on the expertise of the pathologist. A more objective approach may help pathologists to diagnose precisely and reduce errors. This study investigated the potential of using RNA expression profiling in a machine learning approach to aid pathologists in making diagnoses and determining the cell of origin of the tumors. The approach described in this study was not intended to replace the clinical decision by the pathologist and clinician, but rather to aid the decision making and to add objectivity, efficiency, and reproducibility.
Although the focus of this article was to make a diagnosis and determine the cell of origin, the RNA data generated in this process also provided information on mutations, fusion genes, the microenvironment, and the immune response. As proof of principle, we focused on RNA expression profiling and did not incorporate mutation profiling in the diagnostic algorithms at this time. We used a targeted transcriptome to profile a wide range of hematologic neoplasms and solid tumors. We elected to use the targeted transcriptome rather than the whole transcriptome to exclude highly expressed housekeeping genes and to improve the dynamic range for detecting genes that are expressed at low levels but may have a significant impact on oncogenesis and cell differentiation. Furthermore, targeted transcriptome sequencing by hybrid capture is reliable with FFPE tissue, more amenable to clinical testing, and more cost-effective.
We first explored the potential of targeted RNA profiling combined with machine learning to distinguish between two diagnostic classes. We used a unique approach in our machine learning to select the proper genes for classification, as described in Materials and Methods. We elected not to use convolutional neural networks or other deep learning methods, which have been shown to be effective in image-related applications, because expression data are structured differently. The design of convolutional neural networks, such as multiple convolution layers with small windows, naturally fits the grid structure of image data. Gene expression analysis is a different problem, because little is known about the relations or structures among different genes.
Our approach is to combine the classifiers on individual genes without estimating their correlations. Estimating the mean and variance of one gene is certainly feasible, and our geometric mean naïve Bayes method provides a way to construct a stable and deterministic combined classifier. The risk of overfitting is low, because the parameter estimation is stable and there is no hyperparameter to fit. The power of our classifier comes from the contribution of a large number of single-gene classifiers. Each single-gene classifier is usually weak, but this weakness is overcome by combining many weak classifiers, as we demonstrate in this article. We first ranked genes whose expression could distinguish between the two diagnostic classes in question. We then used machine learning in combination with this gene information to distinguish between the two classes. As shown in Table 2, Figure 1, and Supplemental Figure S1, distinguishing between two diagnostic classes was reliable. The accuracy of prediction can be calculated from the receiver operating characteristic curves and varies depending on which cutoff point is used and whether sensitivity or specificity is emphasized. For example, distinguishing between normal BM and BM in acute myeloid leukemia, and various types of leukemia, can be achieved with high sensitivity and specificity (>90%). As expected, distinguishing between normal BM and more chronic diseases was less conclusive (area under the curve of 0.781 for MDS and 0.909 for MPN) because of the overlap between these entities and normal or reactive BM. In particular, the normal BM samples were obtained because of some indication of abnormality but were determined to be negative for a neoplastic process by morphology, flow cytometry, and lack of mutations. Adding a mutation profile to the data and allowing the algorithm to consider mutations would likely improve prediction.
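The combination of single-gene classifiers described above can be illustrated with a minimal sketch. It assumes a Gaussian model for each gene within each class and scores a sample by the geometric mean of the single-gene likelihoods (equivalently, the mean of the log-likelihoods). The function names, the variance floor, and the simulated data are hypothetical and chosen for illustration only; the exact procedure used in this study is the one given in Materials and Methods.

```python
import numpy as np

def fit_gene_gaussians(X, y):
    """Estimate a per-class mean and variance for every gene independently."""
    labels = np.asarray(y)
    params = {}
    for label in sorted(set(y)):
        rows = X[labels == label]
        # Assumed small variance floor so the log-density stays finite.
        params[label] = (rows.mean(axis=0), rows.var(axis=0) + 1e-6)
    return params

def rank_classes(params, x):
    """Rank classes by the geometric mean of single-gene Gaussian likelihoods."""
    scores = {}
    for label, (mu, var) in params.items():
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        # Mean of the log-likelihoods = log of the geometric mean likelihood.
        scores[label] = float(log_lik.mean())
    return sorted(scores, key=scores.get, reverse=True)

# Simulated expression values for two hypothetical classes, "A" and "B".
rng = np.random.default_rng(0)
n_genes = 50
X = np.vstack([
    rng.normal(0.0, 1.0, size=(40, n_genes)),  # class A samples
    rng.normal(2.0, 1.0, size=(40, n_genes)),  # class B samples
])
y = ["A"] * 40 + ["B"] * 40

params = fit_gene_gaussians(X, y)
sample = np.full(n_genes, 2.0)  # a sample lying at the class B mean
ranking = rank_classes(params, sample)
print(ranking)
```

Because no gene-gene correlations are estimated and nothing is tuned, the fitted parameters are only the per-gene means and variances, which is consistent with the stability argument made above.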
Another example is distinguishing between Hodgkin lymphoma and normal lymph nodes, which can be difficult based on morphology and immunophenotyping by flow cytometry or immunohistochemistry. The machine learning algorithm was able to distinguish between these two diagnostic classes with high sensitivity and specificity (95.4% and 100%, respectively) using the expression profiles of 100 genes. Similarly, in solid tumors, distinguishing between two tumors based on the site of origin was achievable with high accuracy, with the area under the curve ranging from 0.959 to 1.00. As expected, distinguishing between endometrial cancer and ovarian cancer was more challenging than distinguishing between other tumors (area under the curve = 0.959 using 600 genes). In solid tumors, the expression of many genes (between 300 and 900) was required to achieve high accuracy in predicting various diagnostic classes. In contrast, only 10 to 500 genes were needed to distinguish between various diagnostic classes of hematologic neoplasms.
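To make the relationship between the area under the curve and the choice of cutoff concrete, the following sketch computes the area under the curve as the probability that a positive sample outscores a negative one, and then reports sensitivity and specificity at two cutoffs. The scores are invented toy values, not data from this study, and the function names are hypothetical.

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def sens_spec(scores_pos, scores_neg, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    sensitivity = float(np.mean(np.asarray(scores_pos) >= cutoff))
    specificity = float(np.mean(np.asarray(scores_neg) < cutoff))
    return sensitivity, specificity

# Toy classifier scores for a two-class problem (illustrative values only).
tumor = [0.9, 0.8, 0.7, 0.4]
normal = [0.6, 0.3, 0.2, 0.1]

auc = roc_auc(tumor, normal)
strict = sens_spec(tumor, normal, cutoff=0.65)   # favors specificity
lenient = sens_spec(tumor, normal, cutoff=0.35)  # favors sensitivity
print(auc, strict, lenient)
```

In this toy case the stricter cutoff gives perfect specificity at the cost of sensitivity, and the lenient cutoff does the reverse, while the area under the curve is unchanged because it summarizes all cutoffs.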
When all 47 diagnostic classes were considered and no prior knowledge of the tumor site and cell of origin was assumed, prediction of diagnosis based solely on RNA expression profiling was more challenging (Table 3). However, in this classification, the machine learning algorithm provided a ranking of potential diagnostic classes. This ranking listed the biologically overlapping diagnostic classes, so that other information, including mutation profile, morphology, and clinical data, can be considered to reach the final diagnostic decision. The candidates identified in this way can then be evaluated with the two-class algorithm described above. As shown in Table 3, the positive percentage agreement was 100% for acute lymphoblastic leukemia and remained high for most of the major epithelial tumors (colorectal, brain, lung, and breast). The positive percentage agreement improved further when the second-ranking diagnosis was considered, particularly for lung cancer (88% to 91%). For hematologic neoplasms, high positive percentage agreement was obtained for lymphoid neoplasms, particularly diffuse large B-cell lymphoma, and improved further when a second-ranking diagnosis was considered (from 88% to 91% for diffuse large B-cell lymphoma, from 62% to 97% for Hodgkin lymphoma, and from 72% to 86% for follicular lymphoma). As expected, distinguishing between normal BM and chronic myeloid neoplasms was less reliable. Chronic myeloid leukemia was practically indistinguishable from other diseases without molecular data. However, the same targeted RNA sequencing detected BCR-ABL1 fusion mRNA, which allowed the diagnosis to be confirmed. MDS, chronic myelomonocytic leukemia, MPN, and normal BM could be distinguished if the mutation profile was considered.
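The effect of considering the second-ranking diagnosis can be expressed as a per-class top-k agreement. The sketch below assumes each test sample has a true label and a ranked list of predicted classes; the labels and rankings are toy values, and `top_k_agreement` is a hypothetical helper, not the software used in this study.

```python
from collections import defaultdict

def top_k_agreement(true_labels, ranked_predictions, k):
    """Percentage of samples per class whose true label is in the top k ranks."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, ranking in zip(true_labels, ranked_predictions):
        totals[truth] += 1
        if truth in ranking[:k]:
            hits[truth] += 1
    return {label: 100.0 * hits[label] / totals[label] for label in totals}

# Toy example: four samples, each with a ranked list of candidate diagnoses.
truth = ["DLBCL", "DLBCL", "FL", "FL"]
rankings = [
    ["DLBCL", "FL"],   # correct as first choice
    ["FL", "DLBCL"],   # correct only as second choice
    ["FL", "DLBCL"],   # correct as first choice
    ["CLL", "DLBCL"],  # missed in the top two
]

first_choice = top_k_agreement(truth, rankings, k=1)
top_two = top_k_agreement(truth, rankings, k=2)
print(first_choice, top_two)
```

Agreement can only stay the same or increase as k grows, which is why a ranked output is useful when biologically overlapping classes compete for the first position.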
By grouping samples into five classes (lymphoid, myeloid, carcinoma, normal, and sarcoma), we calculated the sensitivity and specificity of diagnosis for each class. As shown in Table 4, carcinomas and lymphomas were correctly classified with good sensitivity and specificity. Sarcoma and normal tissues were classified with high specificity.
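Sensitivity and specificity for grouped classes are typically computed one class at a time against all the others. The following sketch shows that one-vs-rest calculation on invented labels; the function name and the toy predictions are assumptions for illustration and do not reproduce Table 4.

```python
def one_vs_rest_metrics(true_groups, predicted_groups, group):
    """Sensitivity and specificity for one group against all other groups."""
    pairs = list(zip(true_groups, predicted_groups))
    tp = sum(t == group and p == group for t, p in pairs)
    fn = sum(t == group and p != group for t, p in pairs)
    tn = sum(t != group and p != group for t, p in pairs)
    fp = sum(t != group and p == group for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels for the five groups (illustrative values only).
true_groups = ["lymphoid", "lymphoid", "myeloid", "carcinoma",
               "carcinoma", "normal", "sarcoma", "sarcoma"]
predicted_groups = ["lymphoid", "lymphoid", "myeloid", "carcinoma",
                    "lymphoid", "normal", "sarcoma", "carcinoma"]

sarcoma = one_vs_rest_metrics(true_groups, predicted_groups, "sarcoma")
lymphoid = one_vs_rest_metrics(true_groups, predicted_groups, "lymphoid")
print(sarcoma, lymphoid)
```

In this toy case sarcoma has perfect specificity but reduced sensitivity, because no other sample is called sarcoma while one sarcoma is missed, which mirrors how a class can show high specificity without high sensitivity.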
This study demonstrated the potential of combining artificial intelligence with genomics in the routine practice of oncology and in determining the diagnosis and cell of origin of tumors, so that clinical decisions can be based on solid objective data. However, some diagnostic classes contained too few cases, and more cases and further validation are needed. Some myeloid samples, particularly those of chronic leukemia, were in the early stage of disease. Samples with solid tumors were metastatic, involving lymph nodes. A metastatic epithelial tumor in a lymph node can show a lymphoid profile in addition to a carcinoma profile, particularly if the carcinoma fraction is not dominant, and this could confuse the diagnostic algorithm. Microdissection of the tumor was performed on all analyzed solid tumor samples, but the tumor fraction varied between 30% and 90%. Therefore, ranking the diagnoses was important, because it allowed us to consider other information to reach the final diagnosis. This approach was implemented in software that accepts RNA data for automated diagnosis and classification of tumors. Furthermore, the software and its algorithms can be retrained continuously as more samples or new diagnostic classes are added.
A limitation of this study is that the exact molecular mutations and chromosomal abnormalities were not included in the algorithms. Integrating such abnormalities will most likely improve the prediction significantly.

## Acknowledgment

We thank Editage for English-language editing.

## Author Contributions

H.Z. and M.A. developed artificial intelligence algorithms and analyzed data; M.A.Q., M.W., and A.C. performed blind testing; A.E., A.I., J.M., M.D., D.S., M.G., A.P., A.G., and M.A. contributed samples, concept design, and data interpretation; and I.D.D., W.M., and I.S. performed RNA sequencing.

## Supplemental Data

• Supplemental Figure S1

Receiver operating characteristic curves for the prediction of diagnoses between two diagnostic classes using RNA combined with the machine learning algorithm. The area under the curve (AUC) and 95% CI are shown for various diagnostic classes. The number of genes used for distinguishing between diagnostic classes is shown. ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; CLL, chronic lymphocytic leukemia; FPF, false positive fraction (1 − specificity); GIST, gastrointestinal stromal tumor; TPF, true positive fraction (sensitivity).

• Supplemental Table S1

## References

• Troyanskaya O.
• Trajanoski Z.
• Carpenter A.
• Thrun S.
• Razavian N.
• Oliver N.
Artificial intelligence and cancer.
Nat Cancer. 2020; 1: 149-152
• Chen J.H.
• Dhaliwal G.
Next-generation artificial intelligence for diagnosis: from predicting diagnostic labels to “wayfinding.”
JAMA. 2021; 326: 2467-2468
• Elemento O.
• Leslie C.
• Lundin J.
• Tourassi G.
Artificial intelligence in cancer research, diagnosis and therapy.
Nat Rev Cancer. 2021; 21: 747-752
• Moon M.
• Nakai K.
Stable feature selection based on the ensemble L1-norm support vector machine for biomarker discovery.
BMC Genomics. 2016; 17: 65-74
• Hong M.
• Tao S.
• Zhang L.
• Diao L.-T.
• Huang X.
• Huang S.
• Xie S.-J.
• Xiao Z.-D.
• Zhang H.
RNA sequencing: new technologies and applications in cancer research.
J Hematol Oncol. 2020; 13: 1-16
• Govindarajan M.
• Wohlmuth C.
• Waas M.
• Bernardini M.Q.
• Kislinger T.
High-throughput approaches for precision medicine in high-grade serous ovarian cancer.
J Hematol Oncol. 2020; 13: 1-20
• Mercer T.R.
• Gerhardt D.J.
• Dinger M.E.
• Crawford J.
• Trapnell C.
• Jeddeloh J.A.
• Mattick J.S.
• Rinn J.L.
Targeted RNA sequencing reveals the deep complexity of the human transcriptome.
Nat Biotechnol. 2012; 30: 99-104
• Reeser J.W.
• Martin D.
• Miya J.
• Kautto E.A.
• Lyon E.
• Zhu E.
• Wing M.R.
• Smith A.
• Reeder R.
• Samorodnitsky E.
• Parks H.
• Naik K.R.
• Gozgit J.
• Nowacki N.
• Davies K.D.
• Varella-Garcia M.
• Yu L.
• Freud A.G.
• Coleman J.
• Aisner D.L.
• Roychowdhury S.
Validation of a targeted RNA sequencing assay for kinase fusion detection in solid tumors.
J Mol Diagn. 2017; 19: 682-696
• Togni M.
• Masetti R.
• Pigazzi M.
• Astolfi A.
• Zama D.
• Indio V.
• Serravalle S.
• Manara E.
• Bisio V.
• Rizzari C.
• Basso G.
• Pession A.
• Locatelli F.
Identification of the NUP98-PHF23 fusion gene in pediatric cytogenetically normal acute myeloid leukemia by whole-transcriptome sequencing.
J Hematol Oncol. 2015; 8: 1-3
• Veeraraghavan J.
• Ma J.
• Hu Y.
• Wang X.-S.
Recurrent and pathological gene fusions in breast cancer: current advances in genomic discovery and clinical implications.
Breast Cancer Res Treat. 2016; 158: 219-232
• Kloosterman W.P.
• van den Braak R.R.C.
• Pieterse M.
• Van Roosmalen M.J.
• Sieuwerts A.M.
• Stangl C.
• Brunekreef R.
• Lalmahomed Z.S.
• Ooft S.
• Galen A.V.
• Smid M.
• Lefebvre A.
• Zwartkruis F.
• Martens J.W.M.
• Foekens J.A.
• Biermann K.
• Koudijs M.J.
• Ijzermans J.N.M.
• Voest E.E.
A systematic analysis of oncogenic gene fusions in primary colon cancer.
Cancer Res. 2017; 77: 3814-3822
• Zhou X.
• Zhan L.
• Huang K.
• Wang X.
The functions and clinical significance of circRNAs in hematological malignancies.
J Hematol Oncol. 2020; 13: 1-15
• Liu Y.
• Cheng Z.
• Pang Y.
• Cui L.
• Qian T.
• Quan L.
• Zhao H.
• Shi J.
• Ke X.
• Fu L.
Role of microRNAs, circRNAs and long noncoding RNAs in acute myeloid leukemia.
J Hematol Oncol. 2019; 12: 1-20
• Sun Y.-M.
• Chen Y.-Q.
Principles and innovative technologies for decrypting noncoding RNAs: from discovery and functional prediction to clinical application.
J Hematol Oncol. 2020; 13: 1-27
• Albitar M.
• Zhang H.
• Goy A.
• Xu-Monette Z.Y.
• Bhagat G.
• Visco C.
• Tzankov A.
• Fang X.
• Zhu F.
• Dybkaer K.
• Chiu A.
• Tam W.
• Zu Y.
• Hsi E.D.
• Hagemeister F.B.
• Huh J.
• Ponzoni M.
• Ferreri A.J.M.
• Møller M.B.
• Parsons B.M.
• van Krieken J.H.
• Piris M.A.
• Winter J.N.
• Li Y.
• Xu B.
• Young K.H.
Determining clinical course of diffuse large B-cell lymphoma using targeted transcriptome and machine learning algorithms.
Blood Cancer J. 2022; 12: 25
• Berger M.F.
• Mardis E.R.
The emerging clinical relevance of genomics in cancer medicine.
Nat Rev Clin Oncol. 2018; 15: 353-365
• Curtius K.
• Wright N.A.
• Graham T.A.
An evolutionary perspective on field cancerization.
Nat Rev Cancer. 2018; 18: 19-32
• Hanahan D.
Hallmarks of cancer: new dimensions.
Cancer Discov. 2022; 12: 31-46
• Jarosz-Biej M.
• Smolarczyk R.
• Cichoń T.
• Kułach N.
Tumor microenvironment as a “game changer” in cancer radiotherapy.
Int J Mol Sci. 2019; 20: 3212