Convolutional Neural Networks for the Evaluation of Chronic and Inflammatory Lesions in Kidney Transplant Biopsies


In kidney transplant biopsies, both inflammation and chronic changes are important features that predict long-term graft survival. Quantitative scoring of these features is important for transplant diagnostics and kidney research. However, visual scoring is poorly reproducible and labor intensive. The goal of this study was to investigate the potential of convolutional neural networks (CNNs) to quantify inflammation and chronic features in kidney transplant biopsies. A structure segmentation CNN and a lymphocyte detection CNN were applied on 125 whole-slide image pairs of periodic acid–Schiff– and CD3-stained slides. The CNN results were used to quantify healthy and sclerotic glomeruli, interstitial fibrosis, tubular atrophy, and inflammation within both nonatrophic and atrophic tubuli, and in areas of interstitial fibrosis. The computed tissue features showed high correlations with Banff lesion scores of five pathologists. Analyses on a small subset showed that higher CD3+ cell density within scarred regions and higher CD3+ cell count inside atrophic tubuli correlated moderately with long-term change of estimated glomerular filtration rate. The presented CNNs are valid tools to yield objective quantitative information on glomeruli number, fibrotic tissue, and inflammation within scarred and non-scarred kidney parenchyma in a reproducible manner. CNNs have the potential to improve kidney transplant diagnostics and will benefit the community as a novel method to generate surrogate end points for large-scale clinical studies.
Although much progress has been made toward the prevention of acute kidney transplant rejection, long-term graft loss remains a major issue in donor kidney [...] demonstrated the prognostic value of inflammation and tubulitis in regions with interstitial fibrosis and tubular atrophy (i-IFTA and t-IFTA, respectively). 1–4 Accurate scoring of these chronic, inflammatory parameters is therefore pivotal in strategies to prevent graft loss. The commonly used scoring system for kidney transplant biopsy assessment is the Banff classification system. 5,6 The Banff classification system was the first standardized, international classification system for kidney transplant diagnostics and facilitated uniformity in the reporting of renal transplant pathology. 7 It is internationally applied by kidney researchers and physicians, and it is the globally accepted quantification tool for histopathologic transplant evaluation. However, the Banff classification system has increasingly been criticized for its limited reproducibility and its suboptimal patient stratification. Multiple studies show poor to moderate interobserver agreement, specifically for the scoring of fibrotic changes and inflammatory lesions. 8–12 Moreover, the Banff classification system is based on semiquantitative scoring on an ordinal scale, whereas inflammatory and chronic parameters represent a continuous spectrum and should therefore preferably be quantified on a granular, continuous scale.
Quantitative assessment of transplant biopsies may be improved by the application of digital image analysis techniques. 13–15 Specifically, deep learning, the use of data-driven learning systems where multilayered (deep) neural networks are trained to generate output from input, has proven to be a powerful tool for histopathologic tissue assessment. 16–19 The most widely applied neural networks in medical image analysis are convolutional neural networks (CNNs). CNN-based image analysis could benefit biopsy assessment by increasing reproducibility and efficiency. In addition, CNNs can output absolute values, which may provide more insight into the stage of ongoing pathologic processes. A second and important advantage of CNN-based image analysis is the ability to decrease interobserver variability, a major problem in any form of histologic assessment by human observers.
The notable performance of CNNs on medical imaging data has resulted in an increasing number of studies focused on deep learning applications for kidney tissue. These efforts were pioneered by the segmentation and classification of the glomerulus and were expanded toward other applications, such as multiclass segmentation, the segmentation of sclerotic glomeruli and interstitial fibrosis and tubular atrophy (IFTA), and diabetic nephropathy classification. 20–24 This study investigated the potential of CNNs as quantification tools for the assessment of chronic and inflammatory lesions, going beyond the current arbitrary semiquantitative thresholds and showing the absolute quantification of tubulointerstitial inflammation as a continuous parameter in areas with and without IFTA. Ideally, CNNs can be used in addition to the Banff classification system to support kidney researchers and physicians in their studies on chronic kidney tissue changes.
For this purpose, two previously developed CNNs aimed at the segmentation of periodic acid–Schiff (PAS)–stained tissue and detection of lymphocytes in immunohistochemistry (IHC) were used. 25,26 The CNNs were retrained and applied on a cohort of PAS- and CD3-stained kidney transplant biopsy slides. Quantifications were performed on the basis of the CNN results. The reliability of the CNN-based quantifications was evaluated by assessing the correlation with the following visually scored components of the Banff classification system: glomerular count, total inflammation (ti), interstitial inflammation (i), tubulitis (t), interstitial fibrosis (ci), tubular atrophy (ct), i-IFTA, and t-IFTA.

Materials and Methods
A visual overview of this study can be found in Figure 1.

Patient Cohort
Tissue [...] Table 1. The local institutional review board waived the need for approval of using Radboudumc tissue blocks in this study (number 2016-2269).

Regions of Interest
The cortical regions were manually annotated using the Automated Slide Analysis Platform (ASAP) software, version 1.9 (https://github.com/computationalpathologygroup/ASAP, last accessed June 13, 2022). The pathologists were asked to perform their analyses within these regions of interest, and the CNN-based quantifications were performed within these same regions. Tissue folds, subcapsular inflammation, and inflammatory infiltrates surrounding large arteries were excluded from the regions of interest.

Visual Pathologists' Assessment of the Patient Cohort Biopsies
Five pathologists, specialized in kidney transplant pathology, manually counted the number of glomeruli and scored the following Banff lesion categories on the PAS WSI according to criteria listed in Supplemental Table 1 (based on Banff 2018 5 ): ti, i, t, ci, ct, i-IFTA, and t-IFTA. After a washout period of at least 4 weeks, the pathologists repeated the scoring for the Banff ti, i, t, i-IFTA, and t-IFTA categories, now using the PAS WSI in combination with the CD3 WSI. The interobserver variability was assessed for both scenarios by calculating quadratic weighted Cohen κ coefficients. The visual glomerular counts and Banff ti, i, t, ct, ci, i-IFTA, and t-IFTA scores were compared with their equivalent tissue feature, quantified by CNNs (listed in Supplemental Table 2).
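As an illustration of the agreement statistic named above, the following is a minimal sketch of a quadratic weighted Cohen κ for two raters, written in plain Python. The function name and the integer coding of the grades (0 to 3) are assumptions made for this example. The text does not state how the five raters were combined; averaging the pairwise values would be one option.

```python
from collections import Counter


def quadratic_weighted_kappa(scores_a, scores_b, n_categories):
    """Cohen's kappa with quadratic weights for two raters.

    Scores are integers in 0..n_categories-1 (for example Banff grades 0-3).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n = len(scores_a)
    joint = Counter(zip(scores_a, scores_b))  # observed pair counts
    marg_a = Counter(scores_a)                # marginal counts, rater A
    marg_b = Counter(scores_b)                # marginal counts, rater B
    max_sq = (n_categories - 1) ** 2
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / max_sq
            observed += weight * joint[(i, j)] / n
            expected += weight * marg_a[i] * marg_b[j] / (n * n)
    if expected == 0.0:
        return float("nan")  # undefined: both raters used one identical grade
    return 1.0 - observed / expected
```

With quadratic weights, a disagreement of two grades is penalized four times as heavily as a disagreement of one grade, which suits ordinal lesion scores.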

Structure Segmentation CNN Development
The authors previously presented a U-net architectural CNN for the multiclass structure segmentation of PAS-stained kidney sections into relevant tissue classes, such as healthy and globally sclerotic glomeruli, interstitium, and proximal, distal, and atrophic tubuli. 25 For the current study, this CNN was improved by including more training data and improved post-processing techniques (see below). There was no overlap between the cases that were used for CNN development and the slides that were used in the formerly described PAS-CD3 patient cohort. A novel method was developed for the segmentation of interstitial fibrosis based on image processing of the multiclass structure segmentation results, further described in Indirect Segmentation Method for Interstitial Fibrosis and IFTA.

Ground Truth
For development of the structure segmentation network, the data set (60 WSIs) that was described in the authors' earlier publication on kidney tissue segmentation 25 was complemented with 36 additional PAS-stained transplant biopsies (Radboudumc, n = 19; Mayo Clinic, n = 17) and 3 tumor nephrectomy samples (Mayo Clinic), resulting in 99 WSIs. The slides were digitized on a Pannoramic 250 Flash II digital slide scanner (3DHISTECH; Radboudumc) or an Aperio ScanScope XT System scanner (Leica Biosystems, Germany; Mayo Clinic) at a resolution of 0.24 and 0.49 µm/pixel, respectively. The data set was annotated using the Automated Slide Analysis Platform software, applying the following predefined classes: glomeruli, sclerotic glomeruli, empty Bowman capsules, proximal tubuli, distal tubuli, atrophic tubuli, capsule, arteries/arterioles, interstitium, and border (being the basement membranes of the tubuli). All annotations were checked and corrected where necessary by a pathologist. The WSIs were randomly divided into training (n = 63), validation (n = 16), and test (n = 20) sets. The total number of annotations per tissue class is listed in Supplemental Table 3. Mayo Clinic tissue samples were scanned with institutional review board approval (numbers 17-002391 and 10-004644), and digital image file transfer was approved under institutional review board number 18-005592.
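A random slide-level split of this kind could be sketched as follows. The function name, the slide identifiers, and the fixed seed are hypothetical; the text only states that the 99 WSIs were divided at random into sets of 63, 16, and 20.

```python
import random


def split_slides(slide_ids, n_train, n_val, n_test, seed=0):
    """Randomly partition slide identifiers into three disjoint sets."""
    if n_train + n_val + n_test != len(slide_ids):
        raise ValueError("set sizes must add up to the number of slides")
    shuffled = list(slide_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers for the 99 WSIs
slides = [f"wsi_{i:02d}" for i in range(99)]
train_set, val_set, test_set = split_slides(slides, 63, 16, 20)
```

Splitting by slide rather than by patch keeps all patches of one biopsy in the same set, so the test set contains only tissue the network has not seen.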

Network Design
A U-net architecture was used as the structure segmentation network design. 27 The network was trained for 95 epochs at 512 iterations per epoch with a batch size of eight patches (412 × 412 pixels at a resolution of 1.0 µm/pixel). Adam was used as weight optimization algorithm and categorical cross entropy as loss function. 28 Spatial and color augmentation techniques were applied to increase the network's robustness for variations in tissue morphology, staining intensity, and image quality. Before inference of the structure segmentation network, a tissue-background segmentation network was applied, separating tissue from background and removing dust particles and tissue artifacts. 29

Post-Processing
Post-processing was used to optimize the structure segmentation results, applying the following steps at a pixel spacing of 1.0 µm/pixel: i) pixels classified as empty glomeruli positioned at the edge of the biopsy were removed; ii) pixels classified as border or interstitium were temporarily set to 0, grouping pixels of all the other classes into discrete objects; iii) holes (ie, value 0 regions) with an area <150 pixels inside objects were filled with their dominant surrounding object label; iv) objects with an area <300 pixels were considered noise and set to the interstitium class; v) objects that consisted of more than one tubule class were assigned to the predominant tubule class, and objects that consisted of more than one glomerulus class were assigned to the predominant glomerulus class; vi) regions <50 pixels inside objects were assigned to their dominant surroundings; vii) objects classified as glomeruli, having an area <2500 pixels, were set to the interstitium class; and viii) pixels classified as border were labeled as interstitium, and all interstitium pixels were subsequently placed back unless they were filled during step iii. The decision to use a minimum area of 2500 pixels for glomeruli was based on the knowledge that the diameter of a complete glomerulus ranges from approximately 100 to 200 µm, depending on the level of sectioning. This corresponds to a minimum area of 7854 pixels [based on the following formula: area = (π × diameter²)/4]. By using 2500 pixels as a minimum area, corresponding to a diameter of approximately 56 µm, we avoided the risk of excluding complete glomeruli.
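The area threshold and the small-object filtering described above can be sketched in plain Python. This is an illustration only: the function names, the label value used for interstitium, and the choice of 4-connectivity are assumptions, and the original pipeline is not reproduced here.

```python
import math
from collections import deque

INTERSTITIUM = 0  # hypothetical label value for the interstitium class


def disc_area_pixels(diameter_um, spacing_um=1.0):
    """Area in pixels of a circular profile: (pi * d^2) / 4."""
    diameter_px = diameter_um / spacing_um
    return math.pi * diameter_px ** 2 / 4


def disc_diameter_um(area_pixels, spacing_um=1.0):
    """Diameter in micrometers of a circular profile with the given pixel area."""
    return math.sqrt(4 * area_pixels / math.pi) * spacing_um


def relabel_small_objects(mask, min_area, background=INTERSTITIUM):
    """Set connected non-background objects smaller than min_area to background.

    All non-background classes are grouped into objects, and 4-connectivity
    is assumed.
    """
    rows, cols = len(mask), len(mask[0])
    out = [list(row) for row in mask]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or mask[r][c] == background:
                continue
            queue = deque([(r, c)])  # flood fill from this seed pixel
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and mask[ny][nx] != background):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_area:
                for y, x in component:
                    out[y][x] = background
    return out
```

At 1.0 µm/pixel, `disc_area_pixels(100)` gives about 7854 pixels and `disc_diameter_um(2500)` gives about 56 µm, matching the figures quoted in the text.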

Structure Segmentation Performance
The segmentation performance of the network was assessed by calculating the CNN's precision, recall, and Dice score on pixel level on the test set, where precision = TP/(TP + FP), recall = TP/(TP + FN), and Dice score = 2TP/(2TP + FP + FN), with TP, FP, and FN denoting the numbers of true-positive, false-positive, and false-negative pixels for a class. The test set that was used to assess the performance metrics of the structure segmentation CNN was composed of PAS-stained slides from the Mayo Clinic and Radboudumc. Because the material from the current patient cohort contains PAS-stained biopsies from Radboudumc, it can be assumed that the performance on the test set will correspond with that on this patient cohort. Therefore, the performance metrics of the structure segmentation CNN were not additionally calculated for the patient cohort.
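The pixel-level metrics can be illustrated with a short sketch that uses the standard definitions of precision, recall, and Dice score for one class at a time. The function name and the flattened label lists are assumptions for this example.

```python
def pixel_metrics(truth, prediction, target_class):
    """Precision, recall, and Dice score for one class over paired pixel labels."""
    tp = fp = fn = 0
    for t, p in zip(truth, prediction):
        if p == target_class and t == target_class:
            tp += 1  # predicted as the class and correct
        elif p == target_class:
            fp += 1  # predicted as the class but wrong
        elif t == target_class:
            fn += 1  # class pixel that was missed
    nan = float("nan")
    precision = tp / (tp + fp) if tp + fp else nan
    recall = tp / (tp + fn) if tp + fn else nan
    dice = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan
    return precision, recall, dice


# Toy example with hypothetical class names
truth = ["glom", "glom", "tub", "tub", "int"]
pred = ["glom", "tub", "tub", "tub", "glom"]
glom_precision, glom_recall, glom_dice = pixel_metrics(truth, pred, "glom")
```

The Dice score is the harmonic mean of precision and recall, so a class that is easily confused with its neighbors scores low on Dice even when one of the two is high.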

Indirect Segmentation Method for Interstitial Fibrosis and IFTA
The structure segmentation CNN was subsequently applied to the 125 PAS WSIs from the patient cohort (see Materials and Methods; Patient Cohort; Tissue Samples). Interstitial fibrosis regions were derived from the structure segmentation masks by computing distance maps for interstitial pixels with respect to atrophic tubuli and to all other structures. Pixels were assigned to the interstitial fibrosis class if they were closer to atrophic tubuli than to any other structure, under the biological assumption that interstitial fibrosis and tubular atrophy develop in tandem. This allowed for the quantification of interstitial fibrosis alone and IFTA. Because the CNN was not directly trained on interstitial fibrosis and IFTA, Dice score, precision, and recall could not be calculated for these classes. Instead, three human observers visually estimated the percentage interstitial fibrosis and IFTA on 20 cases from the patient cohort. Similar to the automated scoring method, the visual score was a continuous score, ranging from 0% to 100%, and was not limited to categories. To assess the soundness of our automatic interstitial fibrosis/IFTA scoring method, the intraclass correlation coefficient (ICC) was calculated for the percentages given by the human observers and the percentages based on CNN results.
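The nearest-structure rule can be sketched on a toy label grid as follows. This is a brute-force illustration in plain Python, not the implementation used in the study: the label symbols, the function names, and the handling of ties (a tie is not counted as "closer") are assumptions. On whole-slide masks, a Euclidean distance transform such as `scipy.ndimage.distance_transform_edt` would be a more practical way to obtain the two distance maps.

```python
import math

# Hypothetical label symbols for this sketch
INTERSTITIUM, ATROPHIC, OTHER, FIBROSIS = "I", "A", "O", "F"


def nearest_distance(r, c, targets):
    """Euclidean distance from pixel (r, c) to the closest target pixel."""
    return min((math.hypot(r - tr, c - tc) for tr, tc in targets),
               default=math.inf)


def assign_interstitial_fibrosis(mask):
    """Relabel interstitial pixels that lie closer to atrophic tubuli
    than to any other non-interstitial structure."""
    atrophic = [(r, c) for r, row in enumerate(mask)
                for c, v in enumerate(row) if v == ATROPHIC]
    other = [(r, c) for r, row in enumerate(mask)
             for c, v in enumerate(row) if v not in (INTERSTITIUM, ATROPHIC)]
    out = [list(row) for row in mask]
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            if v != INTERSTITIUM:
                continue
            if nearest_distance(r, c, atrophic) < nearest_distance(r, c, other):
                out[r][c] = FIBROSIS
    return out


def area_percentage(mask, labels):
    """Percentage of all pixels in the mask carrying one of the given labels."""
    total = sum(len(row) for row in mask)
    hits = sum(v in labels for row in mask for v in row)
    return 100.0 * hits / total


toy = [[ATROPHIC, INTERSTITIUM, INTERSTITIUM, INTERSTITIUM, OTHER]]
result = assign_interstitial_fibrosis(toy)
if_percent = area_percentage(result, {FIBROSIS})
ifta_percent = area_percentage(result, {FIBROSIS, ATROPHIC})
```

In this toy row, only the interstitial pixel directly next to the atrophic tubule is relabeled; the middle pixel is equidistant from both structures and stays interstitium under the tie rule assumed here.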

Lymphocyte Detection CNN
A recently developed lymphocyte detection CNN was adapted and used for the detection of lymphocytes in this study. This network was developed in a previous study, 26 in which four network architectures were trained with 171,166 manually annotated CD3+ and CD8+ lymphocytes: a fully convolutional network, a U-net, a you only look at lymphocytes once network, and a locally sensitive method network. The networks were evaluated for their detection performance of lymphocytes within normal tissue, artifacts, and immune cell clusters, using IHC-stained sections originating from nine medical centers. The best performing network for all the tasks was used in the current study (U-net). Because this network was trained on conventional IHC, it was retrained for the current study using 6237 lymphocyte annotations (15 WSIs) in restained kidney slides (PAS-CD3) in addition to the original training data. This retrained network was subsequently used for the cell detections in this study.

Image Registration
The PAS WSI and the CD3 WSI pairs display the same biopsies and are therefore roughly aligned. Nevertheless, tissue deformations may occur during IHC staining, and the rescanning of the slides causes a slight alteration of the tissue's coordinates in the image. This was corrected by nonlinear image registration, using the noncommercially available software HistokatFusion (Fraunhofer MEVIS lab, Bremen, Germany). The software offers a three-step registration pipeline, consisting of a manual or automated prealignment, a parametric registration computed on coarse resolution images, and an accurate nonlinear registration. 30 This allowed for an accurate spatial translation of tissue features between slides and corresponding masks.

Automatically Quantified Tissue Features
On the basis of the registered results of the structure segmentation CNN and the lymphocyte detection CNN, the following features were calculated: the number of nonsclerotic glomeruli and globally sclerotic glomeruli; the highest CD3+ cell count inside proximal tubuli or distal tubuli; the highest CD3+ cell count inside atrophic tubuli; the CD3+ cell density inside the total cortical area; the CD3+ cell density inside the cortical area, excluding interstitial fibrosis; and the CD3+ cell density inside regions of interstitial fibrosis.
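Two of these features, a cell density in cells/mm² and the highest cell count inside a single tubule, can be sketched as follows. The function names, the object-label grid (0 meaning "not inside a tubule"), and the detection coordinates are hypothetical and serve only to make the arithmetic explicit.

```python
def cell_density_per_mm2(n_cells, area_pixels, spacing_um=1.0):
    """Cells per square millimeter for a region given as a pixel count."""
    area_mm2 = area_pixels * spacing_um ** 2 / 1e6  # 1 mm^2 = 1e6 um^2
    if area_mm2 == 0:
        raise ValueError("region has no area")
    return n_cells / area_mm2


def highest_count_per_object(detections, object_labels):
    """Highest number of detected cells that fall inside any single object.

    detections: iterable of (row, col) cell coordinates
    object_labels: 2D grid of object ids, where 0 means outside all objects
    """
    counts = {}
    for r, c in detections:
        obj = object_labels[r][c]
        if obj:
            counts[obj] = counts.get(obj, 0) + 1
    return max(counts.values(), default=0)


# Toy example: two tubule objects (ids 1 and 2) and four detected cells
labels = [[1, 1, 0],
          [2, 0, 0]]
cells = [(0, 0), (0, 1), (1, 0), (0, 2)]
peak = highest_count_per_object(cells, labels)
density = cell_density_per_mm2(50, 2_000_000, spacing_um=1.0)
```

Restricting `area_pixels` and the counted cells to the fibrosis mask, or to the cortex outside it, yields the three density variants listed in the text.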

Correlation between Automated Feature Quantification and Visual Banff Lesion Scoring
To assess the correlation of glomerular counting performed by pathologists with automated glomerular quantification, the average ICC among the pathologists and the average ICC between the pathologists and the CNN are reported.

Correlation between Automated and Visual Scoring of Chronic Lesions and the Course of Kidney Function
In contrast to the ordinal scoring by human observers, the deep learning–based results are reported as a continuum. It should be investigated whether these continuous values hold more prognostic information than the current lesion scoring system. As an illustration for such a validation study, we assessed the correlation between manually and automatically scored chronic lesions and long-term change in kidney function. More extensive validation should be performed on a larger data set, specifically designed for this purpose. The Δ estimated glomerular filtration rate (ΔeGFR) was defined as the difference between eGFR measured at 1 week before the biopsy procedure (according to the Modification of Diet in Renal Disease formula) and the eGFR measured at 2 years after the biopsy procedure. These data were available for 46 cases. One biopsy sample per patient was used for these analyses. When biopsy samples from multiple time points were included from a single patient in the patient cohort, only the last sample was included (n = 39). Cases were only included if no clinical event occurred (defined as the need for a biopsy for cause) between the biopsy procedure and eGFR measurement 2 years after the biopsy procedure (n = 29). Subsequently, 11 cases were excluded, where the biopsy for cause was obtained <60 days after transplantation, to prevent early transplantation-related lesions, such as acute tubular necrosis, from distorting the analyses. This resulted in 18 eligible cases for the correlation assessment. Spearman correlation was calculated to assess the relationship between ΔeGFR and visually scored i-IFTA, t-IFTA, ci, and ct scores. The Spearman correlation was also calculated between ΔeGFR and automatically quantified CD3+ cell density inside fibrotic regions, CD3+ cell counts per atrophic tubule, area percentage of interstitial fibrosis, and percentage of atrophic tubuli.
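The correlation analysis described above can be sketched in plain Python. The sign convention for ΔeGFR (follow-up minus baseline, so a decline is negative) is an assumption, because the text defines it only as a difference; the function names are likewise hypothetical. Tied values receive average ranks, which matters for ordinal Banff scores.

```python
import math


def delta_egfr(baseline, followup):
    """Change in eGFR; assumed here to be follow-up minus baseline."""
    return followup - baseline


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two values")
    return pearson(average_ranks(x), average_ranks(y))
```

Under the assumed sign convention, a negative coefficient between a tissue feature and ΔeGFR means that higher feature values go with a larger decline, or smaller improvement, in kidney function.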

Validation of the Indirect Interstitial Fibrosis and IFTA Segmentation Method with Visually Estimated Percentages
The correlation of automatically generated interstitial fibrosis and IFTA percentages with percentages provided by human observers was assessed to validate the indirect segmentation method of fibrotic regions. The average ICC of three human observers for scoring interstitial fibrosis was 0.655, and the average agreement of the observers and the [...]). This validation confirmed the rationale of the indirect interstitial fibrosis and IFTA segmentation strategy and justified the use of this method to define fibrotic tissue regions in the entire patient cohort. These regions were used to automatically include and exclude interstitial fibrotic regions in CD3+ cell density calculations and to quantify interstitial fibrosis.

[...] boxed areas on the low-resolution images represent the areas depicted in the high-resolution images. C and D: The segmentation of atrophic tubuli by the structure segmentation convolutional neural network is visualized in green. E and F: Using image processing, pixels in closer proximity to atrophic tubuli than to any other structures (excluding interstitium) were assigned to the interstitial fibrosis class (green). The interstitial fibrosis (IF) percentage based on the cortical area in this figure is 1% for the nonfibrotic biopsy and 36% for the fibrotic biopsy. Scale bars: 500 µm (A and B, low resolution); 50 µm (A and B, high resolution).

Agreement between Automated Feature Quantification and Visual Banff Lesion Scoring
The results of the structure segmentation CNN and the lymphocyte detection CNN were used to quantify numerous tissue features from the patient cohort. ICCs and Spearman correlations were calculated between these features and the average Banff lesion scoring of five kidney pathologists. The mean ICC of the CNN and the panel of pathologists for glomerular counting was 0.941. As supported by Figure 4, visual assessment of the segmentation result showed highly accurate segmentations with occasional false-positive segmentations of sclerotic glomeruli. Limiting the automated glomerular count to non-sclerotic glomeruli led to a mean ICC of the CNN and the pathologists of 0.972 (Table 3). Next, the CNN assessment of interstitial fibrosis (pixel percentage), tubular atrophy (object percentage), inflammation in the total tubulointerstitium (cells/mm²), inflammation in nonfibrotic regions (cells/mm²), inflammation in fibrotic regions (cells/mm²), tubulitis (highest cell count), and tubulitis in atrophic tubuli (highest cell count) was compared with the average score of pathologists for the following Banff categories: ci, ct, ti, i, i-IFTA, t, and t-IFTA (Table 4 and Figure 6). The highest correlation was reported for automatically assessed CD3+ cell density in the total cortical area with the mean ti score of the pathologists, followed by the CD3+ cell density in non-scarred cortical regions and the mean i score of the pathologists. Good correlations were reported for automatic and visual assessment of interstitial fibrosis and tubular atrophy, as well as for CD3+ cell density in scarred cortical regions and the mean i-IFTA score of the pathologists. The lowest correlations are reported for the highest CD3+ cell count in nonatrophic tubuli and the mean t score of the pathologists, and the highest CD3+ cell count in atrophic tubuli and the mean t-IFTA score of the pathologists.

Correlation between Chronic Tissue Scores and the Course of Kidney Function
The correlation of ci, ct, i-IFTA, and t-IFTA with the long-term course of kidney function was evaluated for the CNN-based quantification method and the visually assessed Banff scores. On average, an improvement of eGFR was found over time in this subset of the patient cohort. Nevertheless, moderate inverse correlations were found between the ΔeGFR and the average i-IFTA score of the pathologists (ICC = −0.567; P = 0.014) (Supplemental Figure 3A), and ΔeGFR and automatically assessed cell density inside interstitial fibrotic regions of the cortex (ICC = −0.515; P = 0.029) (Supplemental Figure 3B). The highest CD3+ cell count inside atrophic tubuli segmented by the structure segmentation CNN also inversely correlated with ΔeGFR (ICC = −0.782; P < 0.001) (Supplemental Figure 4B). A weaker inverse correlation was found between the average t-IFTA score of the pathologists and ΔeGFR compared with the correlation with the automated method (ICC = −0.568; P = 0.014) (Supplemental Figure 4A). The visual ci and ct Banff score and the automatically assessed interstitial fibrosis area percentage and tubular atrophy percentage did not correlate with ΔeGFR.

Discussion
In this study, deep learning was used to quantify both inflammation and chronic lesions in kidney transplant biopsies. Two CNNs were applied: a structure segmentation CNN for PAS-stained kidney tissue and a lymphocyte [...] inflammation in non-scarred cortical regions, and inflammation in areas with interstitial fibrosis correlated with Banff ti, i, and i-IFTA scoring, respectively. In addition, glomerular counts based on CNN results correlated highly with visual glomerular counts. A correlation was found between higher inflammatory cell density inside areas of interstitial fibrosis and long-term decline in eGFR. Lower kidney function also correlated with higher inflammatory cell count inside atrophic tubuli. This was in agreement with the correlations that were found for visual Banff i-IFTA and t-IFTA scoring with long-term changes in eGFR. The literature on kidney tissue segmentation using deep learning has expanded drastically over the past few years. 31–34 Many of the models described in the literature were trained in a binary manner (ie, glomeruli versus nonglomeruli or tubuli versus nontubuli).
The current study demonstrates a segmentation performance for healthy and globally sclerotic glomeruli comparable to that reported in literature, despite the challenge of nonbinary segmentation. 35–37 Also, glomerular quantifications based on our CNN results correlated highly with glomerular counts performed by five pathologists. In a study by Jayapandian et al, 38 multiple networks were presented for segmenting glomerular, vascular, and tubular structures. The authors are one of the few to report separate segmentation performance of proximal and distal tubular segments, with impressive results. Unfortunately, atrophic tubuli were not included in this study. 38 Bouteldja et al 36 demonstrated a multiclass segmentation network for PAS-stained kidney tissue, showing excellent segmentation performances. However, healthy and atrophic tubuli were combined in their evaluation. The current study presents the only multiclass structure segmentation CNN that is developed for the segmentation and classification of the interstitium, healthy and sclerotic glomeruli, and proximal, distal, and atrophic tubuli. Such discrimination (especially that between healthy and atrophic/sclerotic structures) is crucial for developing an assay that yields clinically relevant and actionable data.
Interstitial fibrosis and tubular atrophy have been shown to correlate with chronic kidney disease and chronic rejection in kidney transplants. The quantification of fibrosis has been the subject of several studies. 39–42 Artificial neural networks have been developed for the assessment of fibrosis in trichrome-stained kidney slides 43,44 and recently the first neural network for sclerotic glomeruli and IFTA segmentation in PAS-stained slides was presented, showing good agreement with manual annotations in deceased-donor tissue. 24 In the current study, a novel approach was presented for the segmentation of interstitial fibrosis by generating an interstitial fibrosis mask based on atrophic tubuli segmentations resulting from the structure segmentation CNN. The segmentation of pixels in closer proximity to atrophic tubuli than to other structures resulted in a convincing definition of interstitial fibrotic regions. The correlation of the manual scoring of interstitial fibrosis percentage by three human observers was similar to the correlation between manual scoring and the automated method. In addition, the automated quantification of interstitial fibrosis showed high correlations with the average Banff ci lesion scores of five kidney pathologists. These results convincingly show that the presented CNN can be used as a valid quantification tool for interstitial fibrosis in kidney tissue.
Although the segmentation performance of atrophic tubuli has significantly improved since earlier studies, the Dice coefficient is still relatively low compared with that of some of the other classes. The confusion matrix in Supplemental Figure S2 shows that this can largely be attributed to mix-ups with distal tubuli and interstitium. It can be doubted whether the confusion with distal tubuli can be entirely prevented as the transition from a healthy tubule to an atrophic tubule is a continuous process. However, the false-positive atrophic tubuli segmentations inside (inflamed) interstitium possibly result from a relatively low number of inflamed interstitial regions in the training set. This can be improved in future work, by expanding training data sets.
Over the past two decades, studies have demonstrated the detrimental effect of inflammation within areas of interstitial fibrosis and tubular atrophy on kidney transplant outcome. 1–4,45–47 As a result, inflammatory fibrosis (i-IFTA) was introduced to the Banff lesion scoring system in 2015. 6 Accurate scoring of i-IFTA requires the visual exclusion of non-scarred parenchyma, followed by an estimation of inflammatory burden inside the scarred region.
This makes i-IFTA hard to score, also considering the novelty of the category. The low interobserver agreement for the scoring of i-IFTA in the current study (with and without IHC available) emphasizes the necessity of a supporting scoring tool as presented in this study. Yi et al 48 recently presented the so-called composite damage score, composed of abnormal interstitium areas, tubuli density, and areas of mononuclear leukocyte infiltration. Although the authors did not directly compare the composite damage score with i-IFTA, it was shown to be predictive for late eGFR decline and patient survival and will possibly approximate this Banff category. 48 Instead of presenting an entirely new scoring system, the aim of this study was to stay close to the commonly used definitions while increasing the scoring granularity, accuracy, and reproducibility. To do so, the automatically generated segmentations and cell detections were combined using an award-winning image registration technique. 49 This allowed us to calculate CD3+ cell density within scarred and non-scarred parenchyma and perform absolute CD3+ cell counts in healthy and atrophic tubuli, enabling comparison to ti, i, t, i-IFTA, and t-IFTA scores. Automatically quantified cell densities in the complete cortical area were highly correlated to the average ti scores of the pathologists. Excluding scarred regions from the analysis allowed for the calculation of an equivalent for the Banff i score, which showed a high correlation with visual scoring as well. The Banff ti and i scores and their computational equivalents require minimal segmentation of the tissue in specific compartments. This may explain why the highest interobserver agreements and the highest correlations between automated and visual assessment were found for these categories. Lower correlations were found between cell densities inside regions of interstitial fibrosis and visual i-IFTA scores.
This was possibly due in part to the low interobserver agreement among pathologists. In addition, we observed false-positive tubuli detections in inflamed interstitial regions. As a consequence, the automatically generated interstitial fibrosis mask will not reach these regions, causing an underestimation of i-IFTA. In turn, these false-positive segmentations can lead to an overestimation of (atrophic) tubulitis. This can be improved by including more inflamed interstitial regions during development of the structure segmentation network.
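The region-wise density computation discussed above might look roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline used in the study: it assumes the CD3+ detections have already been registered into the coordinate space of the segmentation, that regions are given as boolean masks, and that the pixel size is known; `cell_density` and `um_per_px` are hypothetical names.

```python
import numpy as np


def cell_density(points_rc, region_mask: np.ndarray, um_per_px: float) -> float:
    """Detected cells per mm^2 inside a boolean region mask.

    points_rc: iterable of (row, col) detection coordinates, assumed to be
    already registered to the mask and to fall within its bounds.
    """
    pts = np.asarray(points_rc, dtype=int).reshape(-1, 2)
    # Look up, for every detection, whether it falls inside the region.
    inside = region_mask[pts[:, 0], pts[:, 1]]
    area_mm2 = region_mask.sum() * (um_per_px / 1000.0) ** 2
    if area_mm2 == 0:
        return float("nan")  # density is undefined for an empty region
    return float(inside.sum() / area_mm2)


# Toy example: a 10 x 10 cortex whose left half is scarred,
# with 100 um pixels (0.01 mm^2 each) and three detections.
cortex = np.ones((10, 10), dtype=bool)
scarred = np.zeros_like(cortex)
scarred[:, :5] = True
detections = [(0, 0), (1, 1), (2, 7)]

density_scarred = cell_density(detections, scarred, um_per_px=100.0)
density_nonscarred = cell_density(detections, cortex & ~scarred, um_per_px=100.0)
density_total = cell_density(detections, cortex, um_per_px=100.0)
```

Under these assumptions, the density inside the scarred mask would correspond to an i-IFTA equivalent, the density outside it to an i equivalent, and the density over the whole cortex to a ti equivalent. A false-positive tubule inside inflamed interstitium removes those pixels from the scarred mask, which is how the underestimation described above would arise.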
Finally, the correlation between the change in kidney function and automatically and visually scored ci, ct, i-IFTA, and t-IFTA was assessed as a proof of principle. Higher serum creatinine levels at the time of the biopsy could cause an artifact when looking at ΔeGFR. To avoid this artifact, we used the eGFR measured 1 week before the biopsy for cause as a baseline. In reality, the serum creatinine levels appeared to be close at both time points [mean eGFR at minus 1 week: 29.53 mL/minute per 1.73 m² (SD, 8.70 mL/minute per 1.73 m²); mean eGFR at time of biopsy: 28.26 mL/minute per 1.73 m² (SD, 7.99 mL/minute per 1.73 m²)]. This causes most patients to show an improvement of eGFR over time. Nonetheless, a significant inverse correlation was found between the inflammatory burden inside areas of interstitial fibrosis and the subsequent course of kidney function. This held for both the automated quantifications by the CNNs and the visual lesion scoring by pathologists. This shows that the presented method can support uniform assessment of inflammatory burden inside fibrotic and nonfibrotic kidney tissue.
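As an illustration of this kind of analysis, the sketch below computes ΔeGFR against a baseline and correlates it with a tissue feature. The choice of Spearman's rank correlation is an assumption (this section does not name the statistic used), the helper `delta_egfr` is hypothetical, and all numbers are made-up toy values, not study data.

```python
from scipy.stats import spearmanr


def delta_egfr(baseline, follow_up):
    """Change in eGFR per patient: follow-up minus baseline."""
    return [f - b for b, f in zip(baseline, follow_up)]


# Made-up toy values for four hypothetical patients (not study data).
egfr_baseline = [30.0, 28.0, 25.0, 32.0]    # mL/minute per 1.73 m^2
egfr_follow_up = [40.0, 33.0, 26.0, 30.0]   # mL/minute per 1.73 m^2
cd3_density_scarred = [100.0, 200.0, 300.0, 400.0]  # cells per mm^2

change = delta_egfr(egfr_baseline, egfr_follow_up)
# spearmanr returns the rank correlation coefficient and a p-value.
rho, p_value = spearmanr(cd3_density_scarred, change)
```

In this toy data, higher density always goes with a less favorable ΔeGFR, so the rank correlation is perfectly inverse. With a cohort as small as the one described, the p-value deserves cautious interpretation.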
There were some limitations in this study. First, the method presented in this article relies on the restaining of PAS-stained slides with IHC, followed by image registration. Most clinical centers will not include these methods in their routine transplant diagnostics procedure. Therefore, future studies should target the development of an inflammatory cell detection network for PAS-stained sections that detects macrophages, B lymphocytes, and T lymphocytes. Second, our automated method does not correct for tangential sectioning. Third, the data show a trend toward an inverse correlation between visual and automated scores of inflammation inside areas of interstitial fibrosis and tubular atrophy and the course of kidney function. However, the number of patients eligible for these analyses was too small to draw strong conclusions from these results. The predictive potential of automated quantification of specific tissue features should be assessed in a larger cohort designed for this purpose. Finally, cortical regions were manually annotated as regions of interest for visual and automated assessment. A cortex segmentation CNN is required for fully automated assessment and will therefore be developed in future work.
Although this study supports a positive view toward the inclusion of CNN-based quantifications in routine transplant diagnostics, the true, short-term, clinical value of this study can be found in the application of CNNs for prospective kidney (transplantation) research. The results demonstrate that the presented CNNs produce reliable quantifications of (inflammatory) fibrotic regions that could be used to monitor pathologic processes in detail over time in a uniform manner. In particular, the CNN-based results can be used as surrogate end points in large-scale clinical studies, relieving pathologists from tedious scoring tasks. Predictive models often require histologic revisions of large cohorts, where uniform assessment is challenged by variation between countries, laboratories, and observers. The presented CNNs can be used to compute tissue features in a reproducible manner, which can subsequently function as input for a clinical prediction model. The continuous output of the CNNs can be used to reevaluate the thresholds of the Banff categories, which might result in a different patient grouping and a better prognostic system.
In conclusion, two CNNs were developed, applied, and combined for the segmentation of kidney tissue and the detection of CD3+ inflammatory cells. Good correlations were found for the automated quantification of glomeruli, interstitial fibrosis, and (total) inflammation with the manual scoring of their equivalent Banff lesion categories. The segmentation performance of (atrophic) tubuli should be improved to achieve better correlation with visual scoring of (atrophic) tubulitis and i-IFTA. Analyses on a small subset indicate an inverse correlation between long-term changes in eGFR and inflammation within scarred regions, based on both automated and visual assessment.
Further validations are necessary to continuously assess the prospects of deep learning in kidney transplant pathology.