Performance Evaluation of Single and Multi-Objective Calibration in Rainfall-Runoff Modelling

: Hydrologic modelling is widely used as a tool for making decisions and predictions in planning and managing water resources. Successful application of a hydrologic model depends upon careful calibration and validation using appropriate performance measures in satisfying corresponding performance evaluation criteria. Performance of MIKE 11 NAM rainfall runoff model was assessed using four calibration objectives and their different combinations, as applied to upper Gin catchment, Sri Lanka. The four calibration objectives measured different aspects of hydrograph: good water balance, good overall agreement of the shape of the hydrograph, good agreement for peak flows, and good agreement for low flows. Their numerical performance was measured using four objective functions from which fifteen calibration schemes were formed (four single objective schemes and eleven multi-objective schemes). Using aggregated distance measure, equal weights were assigned to the four objective functions. Shuffled complex evolution algorithm was used to solve the multiple calibration objective problems. Model performance was evaluated using three criteria: Nash-Sutcliffe coefficient (NSE), percent bias (PBIAS) and coefficient of determination (R 2 ). Results revealed significant trade-offs between the objective functions, highlighting that no single calibration objective was able to depict all the aspects of the hydrograph simultaneously. However, multi-objective calibration yielded more accurate and consistent simulations covering different aspects of the hydrograph, simultaneously with overall best performance shown for combination of the four objective functions satisfying all the performance evaluation criteria (NSE=0.56, R 2 = 0.56, PBIAS=17%), compared to the single-objective calibration. Instead of using R 2 alone, use of the corresponding regression slope as a weighing factor of R 2 was recommended following further analysis of simulation results.


Introduction
Hydrologic models have been widely used for decades as a tool applied in hydrological assessments, including impact assessments on water resources due to human activities and climate change [1], [2], [3], [4], [5], [6]. In distributed hydrologic modelling, the spatial variability of catchment characteristics is taken into account; however, distributed hydrologic models require huge amount of data and identification of considerable number of model parameters which possibly limit their applicability for operational purposes. Besides, there is no clear evidence to prove the distributed models' contribution towards additional model efficiency [7]. Thus, lumped rainfall-runoff models are widely applied as these models use well established physical laws at micro scale, include simple conceptual parameterization and efficient computation, and represent an appropriate alternative to understand hydrological system dynamics in areas with limited data [8], [9]. MIKE 11 NAM, a lumped, conceptual rainfallrunoff model, simulates the rainfall-runoff processes occurring at the catchment scale. It had been applied to a number of catchments around the world, representing many different hydrological regimes and accepted as a wellproven engineering tool that can be used to simulate the watershed physical processes [10], [11], [12], [13], [14, [15]. MIKE 11 NAM model was setup in this study to simulate the rainfallrunoff process.
In hydrologic modelling studies, proper model calibration is important as model performance highly depends on the selection of objective functions [16], [17]. Particularly, in conceptual hydrological models, since model parameters are conceptual representations of catchment processes, they cannot be determined through direct measurements and have to be estimated through matching observed and modelled catchment behaviour in the calibration process [18]. In most of the modelling approaches, objective functions are used to measure different aspects of the hydrograph which generally contain overall water balance, overall shape of the hydrograph, low flows and peak flows [19]. Single objective calibration is widely used in many practical applications, however, identification of an objective function that reproduces all the characteristics of a hydrograph may be daunting [20], [21]. To circumvent this problem, use of multi-objective functions for hydrologic model calibration has been assessed for different types of models and verified in providing better model performance: semi-distributed model [22], [23]; lumped conceptual model [18], [19]; and dynamic flood operation model [24]. This study incorporated the most widely used method for solving the multi-objective global optimization problem which transformed the multi-objective problem into a single objective problem by using aggregated distance measure, a method to assign equal weights to the different objectives in the optimization [19], [20], [25].
Performance evaluation criteria are defined as mathematical determination of how well a model simulation matches the available observations [26]. In general, many efficiency criteria contain a summation of the absolute or squared errors, to evade the cancelling of errors of opposite sign. Thus, the calibration attempt which aimed at minimizing these errors neglects smaller errors associated with low streamflow values while placing an emphasis on larger errors associated with high streamflow values leading to fit the higher portions of the hydrograph at the expense of the lower portions [27]. Accordingly, many studies demonstrated the advantages of using multiple evaluation criteria over a single criterion to cover entire hydrograph since the use of a single evaluation technique led to undue emphasis on matching one aspect of the hydrograph at the expense of other aspects [19], [27], [28], [29], [30]. In complying with them, this study included three performance evaluation criteria, namely Nash-Sutcliffe coefficient (NSE), percent bias (PBIAS) and coefficient of determination (R 2 ).

Study Area
Gin catchment covers an area of about 930 km 2 and drains to Indian Ocean by Gin River with a total length of about 116 km. The catchment is located in southern Sri Lanka between 80°08" E to 80°40" E and 6°03" N to 6°26" N. It is characterized by bimodal rainfall pattern predominantly between the months of May and September (south west monsoon), and between November and February (north east monsoon) followed by inter monsoon rains during the remaining months of the year with mean annual rainfall between 2,500 mm (at the upstream) and 3,500 mm (at the downstream). The upper reaches of the river flow through hilly region, and the lower reaches of the river flow through alluvial plain with relatively flat slopes, most of which is flood prone. Land use in the catchment primarily consists of forests, cultivations and human settlements with most of the settlements located within the extensive floodplain of Gin River. Red Yellow Podzolic soils are predominant in the upland areas while the lowland floodplains are dominated by alluvial and bog soils [31].
The catchment receives heavy rainfall during the south west monsoon period resulting in frequent flooding which greatly threatens the safety of life and property in the catchment. May, 2017 flood event approximately conform to a flood with a return period of 200 years [32]. Gin River is the primary drinking water source from which pipe-borne water is supplied to Galle city, the capital of southern Sri Lanka. Thus, accurate rainfall-runoff simulations are vital to understand the catchment hydrological processes in minimizing the flood hazards and managing the water resources efficiently.

Data Sets and River Network Generation
Daily rainfall data at Aningkanda, Baddegama, Hiniduma, and Labuduwa rain gauging stations (from 2000 to 2016) and other meteorological data were obtained from the Department of Meteorology, Sri Lanka. Daily river discharge data at Thawalama river gauging station (from 2000 to 2016) and river cross-sections were obtained from the Department of Irrigation, Sri Lanka. The weighted areal rainfall was computed using Thiessen's polygon method. Gin catchment, and its rain gauging stations weighted by Thiessen polygon method, are shown in Figure  1 (a) and Figure 1 (b), respectively.
River network developed in MIKE 11 with the locations of cross sections is shown in Figure 2. The uppermost sub catchment was linked at Thawalama defining the location of the subcatchment outlet.

2.3
MIKE 11 NAM Model Set-up MIKE 11 NAM hydrological model was applied in this study to simulate the rainfallrunoff processes occurring at the catchment scale. MIKE 11 NAM, which is a lumped, conceptual rainfall-runoff model, considers the catchment as a single entity. The model consists of four different interrelated storages: groundwater storage, root zone storage, surface storage, and snow storage. Overland were simulated as a function of the moisture contents in the four storages. Parameters related to surface, root zone and groundwater storages were calibrated in this study. The parameter values of the model corresponding to catchment characteristics were estimated using different mathematical algorithms. Schematic representation of the MIKE 11 NAM model is shown in Figure 3. Description of the model parameters and the range of each parameter used for the model calibration are shown in Table 1.

Calibration Objectives and Objective Functions
Calibration objectives have to be formulated as numerical goodness-of-fit measures that are optimized in the model calibration. In the model calibration, the following four calibration objectives were considered: (i) a good agreement between the average simulated and observed catchment runoff (i.e. a good water balance); (ii) a good overall agreement of the shape of the hydrograph; (iii) a good agreement of the peak flows with respect to timing, rate and volume; and (iv) a good agreement for low flows.
The parameters which represent the catchment characteristics of lumped, conceptual hydrological models are not measured manually in the calibration process. They are estimated using mathematical equations (objective functions) such that the model simulated values match with observed values of the catchment [19], [34].  is the number of time steps in the calibration period, Mp is the number of peak flow events in Equation (3), Ml is the number of low flow events in Equation (4) in the calibration period and nj is the number of time steps in event number j.
Both peak and low flow events were specified separately with respect to threshold values.

Flow-Duration Analysis and Threshold Values for Low Flows and High Flows
Flow-duration analysis was performed by using daily average flow during 2000-2016 to identify the high flow segment and the low flow segment. Weibull plotting position was used to develop the flow-duration curves as it provided an unbiased estimate of exceedance probability, regardless of the underlying probability distribution of the ranked observations [36]. Low flow was defined as the values falling above the 75 th percentile in the distribution of streamflow values. High flows were the streamflow values below the 25 th percentile. For 75% of the time, the streamflow equalled or exceeded 13 m 3 /s and for 25% of the time, it equalled or exceeded 38 m 3 /s.

2.5
Model Calibration Since most of the MIKE 11 NAM parameters are of an empirical and conceptual nature, generally it is not possible to determine their values on the basis of the physiographic, climatic and soil physical characteristics of the catchment [37]. Hence the final parameter estimation was performed by calibration against time series of hydrological observations which automatically generated the model parameters and assessed the performance of the model according to the model performance measures using the Shuffled Complex Evolution (SCE) algorithm [38], [39]. SCE algorithm which combined the global search method (genetic algorithm) with the local search method (simplex method) had been widely applied all over the world and proven to be consistent, effective, and efficient global optimization procedure in locating the optimal model parameters of a hydrological model [29], [40]. where Ai are transformation constants. A balanced aggregated measure was incorporated by assigning the transformation constants in order to make the different objectives having equal weight in the optimization. The transformation constants were automatically calculated based on the initially generated population of parameter sets in the optimization loop [19], [37]. By minimising Equation (6) with respect to ∅, the optimal parameter set was found and by using the SCE algorithm [38], [39], the optimization was performed automatically.

Calibration Schemes
The rainfall-runoff model was established using single and combined objective functions in the calibration process. Fifteen different calibration schemes were used including the four single objective functions (Table 2). Data from year 2000 to 2015 was used for the model calibration. Calibration was carried out for each scheme using the sixteen single year's data for the purpose of detailed comparison of performance evaluation variations in the calibration schemes. To ensure optimal performance for both the calibration period and the validation period, the validation period was also kept similar to the calibration period [41], [42]. Thus, the model was validated for the year 2016. Following the calibration and validation, the performance of the model was evaluated using the three performance criteria.

2.6
Performance Evaluation Criteria It was shown in the previous studies that the model performance strongly depended on the selection of performance criteria [27], [43]. Both graphical and numerical performance measures should be applied in the performance evaluation; while the graphical evaluation includes comparison of the simulated and observed hydrograph, the numerical performance measure includes the overall water balance error (i.e. the difference between average simulated and observed runoff), and measure of the overall shape of the hydrograph based on coefficient of determination or Nash-Sutcliffe coefficient [37]. According to Moriasi et al. [41], three evaluation indexes which had been commonly used, accepted and recommended in published literature were selected to evaluate the model performance: percent bias, PBIAS  ... (8) ... (9) where Qo,i is the observed discharge at time i, Qs,i is the simulated discharge at time i, Ōo is the mean observed discharge, Ōs is the mean simulated discharge and N is the total number of time steps.

Results and Discussion
Averages of absolute values of the three performance evaluation criteria obtained for the model calibration are shown in Table 3. In assessing the goodness-of-fit of the observed and the simulated hydrological processes,

Single Objective Function Calibration (Schemes 1-4)
According to Table 3 and Figure 4, significant differences existed in the performance evaluation of the four single objective function schemes (schemes 1-4). The best value for PBIAS was given by F1(∅) compared to the other single objective function schemes but it showed the highest variation ( Figure 4). F1(∅) underestimated the low flows and overestimated the high flows. Among all the single objective function schemes, F2(∅) performed best for both the NSE and R 2 values, followed by F3(∅). However, objective functions F2(∅) and F3(∅) were having the worst values and the highest variations for PBIAS compared to the F1(∅). F4(∅) which placed emphasis on simulation of the low flow did not perform well in matching the observed and simulated overall water volumes and overall shape of the hydrographs since NSE and R 2 were more sensitive to high flows which agreed with Krause et al. [27] and Zhao et al. [42]. To illustrate the differences of the hydrograph shapes calibrated using the four single objective function schemes, representative hydrographs for the observed and simulated discharge for year 2012 are  Figure 5, high and low flows were not simulated well in the calibration using F1(∅); the model over predicted as much as it under predicted giving deceiving rating of model performance which agreed with Moriasi et al. [46]. Moreover, it is clearly shown in Figure 5 that the shape of the observed and simulated hydrographs were having reasonable agreement for F2(∅) which measured the overall agreement of the shape of the hydrograph. However, owing to the comparatively high PBIAS, F2(∅) did not show good agreement between the average simulated and observed catchment runoff. Therefore, it is evident significant trade-offs between the single objective functions, except for less significant trade-off shown between F2(∅) and F3(∅).
Thus, when single objective functions were used for the calibration, catchment characteristics were not represented well due to the trade-off between bias and variance which prevented the model to reach the optimal balance between two. This is where the multi-objective calibration approach is being used as shown by Gupta et al. [22] and Moriasi et al. [46] to represent the catchment characteristics while balancing all the objectives, since the objectives are of equal importance.

Multi-objective Function Calibration (Schemes 5 -15)
In the multi-objective function calibration process, the four different calibration objectives were optimized using aggregated measure in which different objective functions were assigned equal weights. Table 3 lists the mean values of the performance evaluations for different multi-objective functions (Schemes 5 -15) and Figure 3 illustrates their variations. According to Table 3, the simulations by the multi-objective functions integrated the characteristics of the single objective functions and primarily produced either improved or balanced results for the performance evaluation criteria. For example, PBIAS value calibrated by the multi-objective function F1(∅) F2(∅) was better than that by F2(∅) and F1(∅), while the value of NSE calibrated by the multi-objective function F1(∅)F2(∅) was better than that by F1(∅) and similar to that by F2(∅). The value of R 2 calibrated by the multi-objective function F1(∅) F2(∅) was better than that by F1(∅) and a little worse than that by F2(∅). As depicted by the model output, this slightly poor performance was attributed the oversensitivity of R 2 to outliers than to observations near the mean which was illustrated by Legates and Davis [47] as an obvious limitation of correlationbased measures.
From the box plots shown in Figure 3 which present the variation of the model's performance for different objective functions, it can be observed that schemes 5 -15 have approximately similar mean values for NSE and R 2 , except for the schemes 6, 7 and 13. NSE showed high variation for the schemes 6 and 7, whereas R 2 showed high variation for the scheme 14. PBIAS showed significant differences in both mean value and variation for schemes through 5 to 15. The best PBIAS mean value was obtained by the scheme 5 followed by the scheme 15. Considering overall performance of all the evaluation  5, 12, and 15) criteria of multi-objective calibration functions, schemes 5, 12 and 15 showed better values for NSE, R 2 and PBIAS in comparison to the other schemes. However, when schemes 5 and 12 were used for the calibration, a large variation was observed for NSE and R 2 .
Owing to the above aspects, scheme 15 which includes combination of all the objective functions was recommended as the best. This further confirmed the findings by Jie et al. [20] and Moriasi et al. [46], who recommended synthesizing the performance of the multiple evaluation indexes in evaluating the model performance.
To verify the above, the simulated hydrographs which were calibrated by the multiple objective functions (scheme 5, 12 and 15) were compared with the observed hydrograph as shown in Figure 6 using representative hydrographs for year 2012. Simulated hydrographs showed better agreement with the observed (Figure 6), depicting more comprehensive performance in simulating the hydrological behaviour of the catchment (including high, intermediate and low flows) when calibrated using the multiobjective calibration functions, compared to the single objective calibration functions ( Figure 5). Out of the three schemes, overall best agreement between the simulated and observed hydrographs was shown for the scheme 15.
Though the simulated hydrographs followed the trend of the observed flows, the highest flows (> 90 m 3 /s) were underestimated by the model. Underestimation of the highest flows might be due to the inadequate representation of spatial variability of rainfall owing to limited number of rainfall gauging stations, particularly in the upper and middle reaches of the catchment [ Figure 1(b)] [48], [49]. Moreover, according to Makungo et al. [13], MIKE 11 NAM model has generally been known to underestimate peak flows owing to lack of ability of rainfall-runoff models to simulate complex hydrological phenomena during high flow periods, accurately. Additionally, observational errors, technical errors associated with inaccurate calibration of recording devices and extrapolation of ratings beyond the range of gauged discharges, might have individually or collectively led to underestimation of the aforesaid flows.

Calibration vs. Validation
Performance Criteria Reliability and hydrological soundness of the calibrated MIKE 11 NAM model was evaluated by the validation process carried out using year 2016 data for the scheme 15 [F1(∅)F2(∅)F3(∅)F4(∅)]. As recommended by Morasai et al. [46], the same performance evaluation criteria which were used for the model calibration, were applied for the model validation. Model validation performance data are stated in Table 4. PBIAS, NSE and R 2 values obtained in the multi-objective calibration and validation runs were acceptable since they were comparable or better than those obtained in the studies carried out by Morasai et al. [46] and Odiyo et al. [14] ("satisfactory" if R 2 > 0.60, NSE > 0.50, and PBIAS ≤ ±15%). This implied that, following the multi-objective calibration and validation processes, the MIKE 11 NAM model performed reasonably well in representing the hydrological characteristics of the catchment that influence the runoff generation process.    To provide more insight into the model performance over different flow regimes, hydrologically oriented model diagnostics which are flow-duration curves were used ( Figure 9) [46], [50]. The flow-duration curves showed that the model tended to underestimate the observed data for the highest flows (< 5% exceedance probability); had a good agreement for the high flows (> 5% and < 25% probability); overestimated during the intermediate and low flows (> 25% and < 90% probability); and had a good agreement for the lowest flows (> 90% exceedance probability) except for a few lowest flows exceeding 98% probability which was underestimated.

Figure 9 -Comparison of Observed and Simulated Flow-duration Curves for the Validated Scheme 15
For the validated scheme 15 [F1(∅) F2(∅)F3(∅)F4(∅)], the value of the NSE and R 2 were 0.68 and 0.69, respectively. Though NSE and R 2 were having high values, it was indicated that both NSE and R 2 were not very sensitive to the quantification of systematic under prediction or over prediction errors, which agreed with Krause et al. [27]. NSE and R 2 were based on squared differences between observation and prediction and, as a result, an emphasis was placed on larger errors (associated with high streamflow values) while smaller errors (associated with low streamflow values) tended to be neglected leading to fit the higher portions of the hydrograph (e.g., high flows) at the expense of the lower portions (e.g., low flows).
Previous studies revealed that, good R 2 values close to 1.0 could still be resulted from a model which systematically under predicts or over predicts all the time. R 2 standardizes the differences between the simulated and observed variances and it only evaluates linear relationships between the variables. Thus, R 2 is insensitive to additive and proportional differences between the model simulations and observed data [27], [30]. To cope with that problem, Krause et al. [27] developed a weighted version of R 2 (ωR 2 ) by combining the R 2 and gradient of the regression on which R 2 was based. By weighting R 2 , model simulation results were reflected more comprehensively by quantifying over predictions or under predictions (Equation 10).
where, ωR 2 is the weighted version of R 2 ; b is gradient of the regression on which R 2 is based.
According to Equation (10), the weighted R 2 value of 0.50 for the validated scheme 15 reflected the aforementioned poor simulation better than that of the R 2 value alone (0.69). Thus, it is evident that more reliable inferences could be obtained by weighting the coefficient of determination by the slope of the regression, which forms modified form of R 2 . This is stressing to further investigate for more conservative performance evaluation criteria for NSE as well, such as its modified form, by incorporating absolute values rather than squared differences that evaluate model performance without being influenced by extreme values.

Conclusion
Performance of fifteen calibration schemes which formed using four objective functions and their combinations were investigated using the most commonly used performance evaluation criteria for rainfall-runoff modelling using MIKE 11 NAM. SCE algorithm was used to calibrate the model while aggregated distance measure was used to set equal weights to different objective functions in obtaining a compromise solution. Study revealed that no single objective function was adequate to measure all the aspects of a hydrograph including the low flows, peak flows and shape of the hydrograph, and water balance simultaneously, since the different objective functions were adopted for different purposes. Single objective functions reduced the error with respect to simulation of some aspects of the hydrograph at the expense of other aspects; trade-off between overall root mean square error (overall agreement of shape of the hydrograph) and average RMSE of peak flow events (agreement of peak flows) were less significant, while significant trade-offs between other objective functions were evident. Model calibration conducted using the multi-objective approach has been very successful in comparison to the single objective approach; simulations by the multi-objective functions integrated the characteristics of the single objective functions and primarily produced either improved or balanced results for the performance evaluation criteria, showing better agreement between observed and simulated flows. In particular, multi-objective calibration scheme 15 produced the best combination of objective functions to capture the catchment hydrological response. Weighting the R 2 by integrating the slope of the regression was recommended to obtain more reliable inferences, reflecting the systematic biases in the simulated flow.
Though this study was performed on one catchment by using one conceptual hydrologic model, the findings provide useful reference for rainfall-runoff simulation studies in tropical region catchments, since the choice of calibration objective is governed by the particular aspect of the hydrograph rather than by the catchment and the model. Further research is needed to identify additional performance evaluation criteria, like modified form of NSE, for more comprehensive reflection of model results.