A Comparison of Methods of Estimating Missing Daily Rainfall Data

The availability of a long and complete rainfall record is very important for carrying out a hydrological study successfully. In general, however, the data series in these records may contain gaps for various reasons. The objective of this study is to analyse the different methods available for filling gaps in rainfall data records and propose a method suitable for a river basin situated in a mountainous area in Sri Lanka. Towards this end, daily rainfall data from ten gauging stations in the upper catchment area of BaduluOya were collected. Seven different techniques were studied to ascertain their suitability: the Arithmetic Mean method, Normal Ratio method, Inverse Distance Weighting method, Linear Regression method, Weighted Linear Regression method, Multiple Linear Regression method and the Probabilistic method. The data generated for the target stations were compared with the actual observations using three error statistics: Error Standard Deviation (STD), Root Mean Square Error (RMSE) and Correlation Coefficient (CC). The results of the study showed that for target stations that have only one neighbouring station with a high correlation coefficient, the Probabilistic method and the Linear Regression method give good predictions. For stations that have relatively low correlation coefficients with the neighbouring stations, the Inverse Distance Weighting method and the Normal Ratio method outperformed the others. To obtain accurate results from the Multiple Linear Regression method and the Weighted Linear Regression method, it is necessary to have a set of neighbouring stations that have fairly high correlation coefficients with the target station.


Introduction
A lengthy rainfall data series plays a major role in all water related studies. Consistency and continuity of rainfall data series are very important for obtaining reliable results from such studies. However, these rainfall data series very often contain gaps or missing values due to various reasons such as the absence of observers, problems with measuring devices, loss of records etc. The use of a rainfall data series with missing values may critically influence the statistical power and accuracy of a study. By estimating and filling the missing rainfall data, a series could be made longer to make the water related study more reliable. Diverse techniques have been proposed and adopted in filling missing data with a view to obtaining a continuous and lengthy rainfall data series.
Basically, the procedures can be grouped into three major classes: deterministic, stochastic and artificial intelligence based methods [1]. Deterministic approaches are often favoured because of their robustness, ease of implementation and computational efficiency [1,2]. They are mathematical models that always produce the same output from a given initial condition; they neither consider randomness nor attach a probability of occurrence to their results. The arithmetic mean method, the normal ratio method and the inverse distance weighting method are examples of deterministic methods. Stochastic and artificial intelligence approaches are sophisticated but they are more costly and complex [3,4]. Stochastic methods provide probabilistic estimates of the outcome. Artificial intelligence methods such as artificial neural networks (ANNs) have a complex mathematical formulation and are thus difficult to implement.
The best method for estimating missing rainfall data can vary for different areas depending on their rainfall patterns and spatial distributions [2]. The study in [2] filled data at monthly time steps.
The objective of this paper is to present the analysis carried out to evaluate the different methods that are available for filling gaps in rainfall data records and to propose a suitable method from among them. A river basin situated in a mountainous area in Sri Lanka was used for the study. The study area is shown in Figure 1.
Figure 1 - Distribution of ten rain gauging stations in BaduluOya upper catchment

Methodology

Data Collection and Study Area
Daily rainfall data at ten gauging stations in BaduluOya upper catchment, collected over a period of 10 years, were considered for the study based on their availability and spatial variability. The elevation of the selected catchment varied between 740 m and 1440 m above mean sea level and it covered an area of about 600 km². Figure 1 shows the gauging stations and their data availability is shown in Figure 2. The data were obtained from the Department of Meteorology.

Method
Seven different techniques for estimating the missing data were used to evaluate the suitability of each method for mountainous areas. The seven methods used included six deterministic methods and one probabilistic method.
Gauging stations located at Bandarawela, Badulla and Lower Spring Valley, each of which had 100% data availability during a period of 10 years, were selected for the evaluation.
As shown in Table 1, during the analysis, rainfall data of the above mentioned three stations (target stations) relating to one or more months of every year were randomly deleted. These months were considered as months for which rainfall data were missing. Thereafter, for each station, the missing (deleted) data of each month were estimated using rainfall data available at other neighbouring gauging stations. This was repeated for all the seven methods. Subsequently, the estimated data were compared with the actual observations making use of three error statistics.
The stations which were nearest and spread out well around the target station were the ones that were selected for each target station. Thereafter, the correlation coefficients of each target station with the above selected neighbouring stations were calculated. As some neighbouring stations had gaps, a time period in which all the stations had 100% data was considered in calculating the correlation coefficients. The neighbouring stations were ranked according to their correlation coefficients for rainfall data estimation. The distribution of neighbouring stations for each target station is shown in Figure 3. The names of target stations with the names of their corresponding neighbouring gauging stations along with their respective correlation coefficients are given in Table 2. The seven methods used in the estimation of missing data are given below.
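The deletion step described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dictionary keyed by year, month and day), the function names `pick_months_to_delete` and `withhold`, and the seed are hypothetical and are not taken from the study.

```python
import random


def pick_months_to_delete(years, months_per_year=1, seed=0):
    """Choose, for every year, the calendar months whose data are withheld."""
    rng = random.Random(seed)
    return {y: sorted(rng.sample(range(1, 13), months_per_year)) for y in years}


def withhold(records, chosen):
    """Split daily records into a masked series and the withheld observations.

    records maps (year, month, day) -> rainfall in mm.
    """
    masked, withheld = {}, {}
    for (year, month, day), rain in records.items():
        if month in chosen.get(year, []):
            masked[(year, month, day)] = None      # treated as missing
            withheld[(year, month, day)] = rain    # kept for the later comparison
        else:
            masked[(year, month, day)] = rain
    return masked, withheld


# Hypothetical two-year record: 1.0 mm on the first 28 days of every month.
records = {(y, m, d): 1.0
           for y in (2001, 2002) for m in range(1, 13) for d in range(1, 29)}
chosen = pick_months_to_delete([2001, 2002], months_per_year=1, seed=42)
masked, withheld = withhold(records, chosen)
print(chosen, len(withheld))
```

The withheld values play the role of the actual observations against which each method's estimates are later compared.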

Arithmetic Mean (AM) Method
If the normal annual rainfalls at surrounding gauges are within 10% of the normal annual precipitation at the stations concerned, then the arithmetic procedure could be adopted to estimate the missing data [5]. This assumes equal weights from all nearby rain gauge stations and uses the arithmetic mean of the precipitation data.
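A minimal sketch of the AM estimate and of the 10% applicability check is given here for illustration. The function names and the sample figures are hypothetical.

```python
def arithmetic_mean_estimate(neighbour_rain):
    """Equal-weight mean of the same-day rainfall at the neighbouring gauges."""
    return sum(neighbour_rain) / len(neighbour_rain)


def am_applicable(target_normal, neighbour_normals, tolerance=0.10):
    """True if every neighbour's normal annual rainfall is within 10% of the target's."""
    return all(abs(n - target_normal) <= tolerance * target_normal
               for n in neighbour_normals)


daily = [12.0, 8.0, 10.0]   # hypothetical same-day rainfall at three neighbours (mm)
print(arithmetic_mean_estimate(daily))
print(am_applicable(2000.0, [1900.0, 2150.0, 2050.0]))   # all within 10%
print(am_applicable(2000.0, [1500.0, 2150.0, 2050.0]))   # one neighbour differs by 25%
```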

Normal Ratio (NR) Method
This method is used if the normal annual precipitation at any of the surrounding gauges differs by more than 10% from that at the gauge under consideration. The method weights the effect of each surrounding station [6].
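In its commonly used form, the NR method scales the rainfall at each neighbouring station by the ratio of the target station's normal annual precipitation to that of the neighbour, and then averages the scaled values. The sketch below assumes that form; the function name and the numbers are hypothetical.

```python
def normal_ratio_estimate(target_normal, neighbour_normals, neighbour_rain):
    """Average of neighbour rainfall, each scaled by target_normal / neighbour_normal."""
    scaled = [target_normal / n_i * p_i
              for n_i, p_i in zip(neighbour_normals, neighbour_rain)]
    return sum(scaled) / len(scaled)


# Hypothetical case: one neighbour is drier than the target, the other is wetter.
estimate = normal_ratio_estimate(
    target_normal=2000.0,
    neighbour_normals=[1000.0, 4000.0],
    neighbour_rain=[5.0, 20.0],
)
print(estimate)
```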

Inverse Distance Weighting (IDW) Method
In this method, the weight for each neighbouring station is assumed to be inversely proportional to the square of its distance from the target station [7].
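A short sketch of the inverse squared distance weighting described above follows. The function name, the distances and the rainfall values are hypothetical.

```python
def idw_estimate(distances_km, neighbour_rain, power=2):
    """Weighted mean of neighbour rainfall with weights 1 / distance**power."""
    weights = [1.0 / d ** power for d in distances_km]
    weighted_sum = sum(w * p for w, p in zip(weights, neighbour_rain))
    return weighted_sum / sum(weights)


# Hypothetical case: the nearer station (1 km) dominates the farther one (2 km).
print(idw_estimate([1.0, 2.0], [10.0, 20.0]))
```

With these inputs the weights are 1 and 0.25, so the estimate lies much closer to the value at the nearer station.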

Linear Regression (LR) Method
The correlation coefficients between the target station and each of the neighbouring stations are initially calculated and then ranked. Then the missing data are estimated using a linear regression equation with the station that has the highest correlation. The correlation and the equation of the regression line are obtained using Microsoft Excel software.
In order to be able to generate zero values together with non-zero values, the regression line is forced through the origin. …(4)
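Forcing the regression line through the origin reduces the fit to a single slope, which by least squares is the sum of the products of the paired values divided by the sum of the squared predictor values. The sketch below assumes that standard result; it does not reproduce the spreadsheet workflow used in the study, and the names and data are hypothetical.

```python
def fit_through_origin(x, y):
    """Least-squares slope b of the line y = b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# x: rainfall at the best-correlated neighbour, y: rainfall at the target (mm).
neighbour = [0.0, 2.0, 4.0]
target = [0.0, 3.0, 6.0]
slope = fit_through_origin(neighbour, target)

print(slope * 10.0)   # estimate for a day with 10 mm at the neighbour
print(slope * 0.0)    # a dry day at the neighbour gives a dry day at the target
```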

Weighted Linear Regression (WLR) Method
This method takes into account the impact of the distance between the target station and data station in addition to their correlation with respect to data. It uses Equation 5, in which a weighting factor is introduced [1].

Multiple Linear Regression (MLR) Method
Rainfall data are estimated considering a linear correlation between the target station and some of the other (multiple) neighboring stations [1].
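One way to obtain such a relationship is an ordinary least-squares fit solved through the normal equations, as sketched below. The inclusion of an intercept is an assumption, since the text does not state whether the MLR fit was forced through the origin. All names and data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_mlr(neighbour_rows, target):
    """Return [b0, b1, ..., bk] minimising the squared error of b0 + sum(bi * xi)."""
    X = [[1.0] + list(row) for row in neighbour_rows]   # assumed intercept column
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, target)) for i in range(k)]
    return solve(XtX, Xty)


def predict_mlr(coeffs, row):
    """Estimate target rainfall from one day's rainfall at the neighbours."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


# Hypothetical days: each row holds rainfall at two neighbours (mm).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (5.0, 1.0)]
obs = [1.0, 3.0, 2.0, 9.0, 11.5]    # built as 1 + 2*x1 + 0.5*x2
coeffs = fit_mlr(rows, obs)
print(coeffs, predict_mlr(coeffs, (2.0, 2.0)))
```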

Probabilistic (PRB) Method
The PRB method assumes that the missing rainfall values and available data for a particular gauging station have similar statistical properties [3]. The data estimation procedure starts with the calculation of the monthly total rainfall at the missing station using the IDW method, based on the observations made at the neighbouring rain gauging stations located within a 10 km radius. Thereafter, using the non-zero values in the data series, a probability distribution is fitted to the rainfall data of that month over a ten year period and the parameters of the probability distribution are determined. Daily rainfall data are usually strongly right skewed and therefore probability distributions that match well with them are used in the study. Only non-zero values for the missing month are randomly generated using the estimated probability distribution parameters, until the average of the generated values becomes equal to the monthly rainfall estimated using the IDW method. Finally, the generated data are distributed chronologically within each month by matching them with the data available at the gauging station that has the highest correlation.
The data were generated using three probability distributions, viz., Generalized Gamma, Weibull and Pearson 6 in order to determine the probability distribution that was most suitable for the daily rainfall data.
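The generation and placement steps can be sketched as below, under several assumptions that are not taken from the study. A two-parameter Weibull distribution stands in for the fitted distribution because it is available in the Python standard library. The target mean is matched within a tolerance rather than exactly. The chronological placement is interpreted as rank matching against the reference station, with dry days at the reference station kept dry. All names, parameters and data are hypothetical.

```python
import random


def generate_wet_day_values(n_wet_days, target_mean, shape, scale,
                            tolerance=0.05, max_trials=100, seed=0):
    """Draw non-zero rainfall values until their mean is close to target_mean.

    Returns the sample whose mean was closest within max_trials attempts.
    """
    rng = random.Random(seed)
    best, best_gap = None, float("inf")
    for _ in range(max_trials):
        # random.weibullvariate takes the scale first, then the shape
        sample = [rng.weibullvariate(scale, shape) for _ in range(n_wet_days)]
        gap = abs(sum(sample) / n_wet_days - target_mean)
        if gap < best_gap:
            best, best_gap = sample, gap
        if gap <= tolerance * target_mean:
            break
    return best


def place_chronologically(generated, reference_daily):
    """Put the largest generated value on the wettest reference day, and so on."""
    wet_days = [i for i, r in enumerate(reference_daily) if r > 0]
    wet_days.sort(key=lambda i: reference_daily[i], reverse=True)
    series = [0.0] * len(reference_daily)
    for day, value in zip(wet_days, sorted(generated, reverse=True)):
        series[day] = value
    return series


reference = [0.0, 5.0, 0.0, 20.0, 1.0]        # best-correlated station (mm)
n_wet = sum(1 for r in reference if r > 0)
values = generate_wet_day_values(n_wet, target_mean=8.0, shape=0.8, scale=7.0)
print(place_chronologically(values, reference))
```

Capping the number of trials means the accepted sample may miss the target mean, which is one way the final estimate can lose accuracy.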
The notations used for the seven methods are given below.

Comparison of Estimates
The data estimated for the target stations were compared with the actual observations made, based on the following error statistics:
a) Error Standard Deviation (STD)
b) Root Mean Square Error (RMSE)
c) Correlation Coefficient (CC)
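The three statistics can be computed as in the following sketch. It assumes that the error is the estimate minus the observation, that STD is the sample standard deviation of those errors (divisor n - 1) and that CC is the Pearson correlation coefficient. The exact definitions used in the study are not stated, and the function name and data are hypothetical.

```python
import math


def error_statistics(observed, estimated):
    """Return (STD of errors, RMSE, Pearson CC) for paired daily values."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    mean_err = sum(errors) / n
    std = math.sqrt(sum((x - mean_err) ** 2 for x in errors) / (n - 1))
    rmse = math.sqrt(sum(x * x for x in errors) / n)
    mo, me = sum(observed) / n, sum(estimated) / n
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in estimated)
    cc = cov / math.sqrt(var_o * var_e)
    return std, rmse, cc


# Hypothetical case: every estimate is exactly 1 mm too high.
std, rmse, cc = error_statistics([0.0, 2.0, 4.0, 6.0], [1.0, 3.0, 5.0, 7.0])
print(std, rmse, cc)
```

In this example a constant bias leaves STD at zero and CC at one while RMSE equals the bias, which illustrates why the three statistics are read together.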

Results and Discussion
The AM method could not be applied to any of the target stations since the average annual rainfalls at their surrounding gauges were not within the 10% range of the normal annual precipitation at the target station. This means that the annual rainfall values can be significantly different among the gauging stations even though they are located close to each other, probably due to the considerable variations in their elevations. Table 3 presents an example of a comparison of the observed rainfalls with the generated rainfalls at the three gauging stations for three different months. Data generated were for the months given in Table 1. As Table 3 reveals, the results indicate that different methods perform differently for the three stations.
When the correlation coefficients with the neighboring stations are each more than 0.7, the MLR method performed acceptably. As Table 2 shows, the Bandarawela station has only two neighbouring gauging stations each having a correlation coefficient greater than 0.7 whereas the Badulla station has no neighbouring gauging station with a correlation coefficient exceeding 0.7. Therefore, the MLR method could not be used for these two stations. However for the Lower Spring Valley station, each of the neighbouring stations except one station had a correlation coefficient greater than 0.7. Therefore, the MLR method generated fairly good results for the Lower Spring Valley station.
The performance of the MLR method was poorer than that of the LR method. The MLR method requires a set of neighbouring stations with good correlation coefficients (greater than 0.7) to perform well. Sometimes there are negative coefficient values due to the presence of zero values on the daily scale. This method may perform well on the monthly scale because monthly records usually consist of non-zero values.

Table 3 - Comparison of generated rainfalls with actual rainfalls at the three stations
When generating data using the PRB method, the monthly average rainfalls obtained from the IDW method were used as constraints. Initially, the most representative probability distribution for the data set at the station where data were missing was identified. Of the three probability distributions considered (Generalized Gamma, Weibull and Pearson 6), Generalized Gamma was found to be the most suitable. Non-zero values for the missing month were randomly generated thereafter. The monthly average rainfall computed based on the IDW method is itself an estimate and thus can be erroneous. Moreover, random values were generated only up to 100 seeds or trials. These facts can reduce the accuracy of the final outcome of the PRB method.
The WLR method showed that when the highest correlated stations are closer to the target station, the generated data are acceptable. For instance, for the Bandarawela station, this method gave significantly better results than for the Badulla station. Finally, as shown in Table 4, the generated data were compared with the observed data based on the three above mentioned error statistics.
When the data values generated using the above mentioned methods were compared with the actual data values, there were no significant differences among the STD values in most of

Table 4 - Error statistics at the three stations

Table 5 presents the most suitable methods for filling missing data at the three stations based on error statistics. Data generation was done on a daily basis and thereafter their monthly totals were computed. Table 6 presents recorded and generated monthly rainfalls at the three locations.
As shown in Table 7

The Bandarawela station has a very high correlation with its neighbouring Diyathalawa station. These two stations are close to each other, and the difference in elevation between them is very small. In the case of the Bandarawela station, the PRB method outperformed all the other methods, suggesting that the PRB method is highly dependent on the correlation between the target and the neighbouring station. The MLR method also depends on the correlation coefficient.
The Lower Spring Valley station has the highest annual rainfall in the catchment and its rainfall pattern differs markedly from those of the neighbouring stations. Data generation based on a single neighbouring station was not effective for the Lower Spring Valley station, and thus neither the PRB method nor the LR method performed well. Four or five surrounding gauging stations were considered in the NR method and the results obtained were acceptable.
Compared to the other two stations, the Badulla station has a relatively lower CC value and higher STD and RMSE values. It has the lowest elevation above the mean sea level and all of its neighboring gauging stations are located at considerably higher elevations. Besides, the distances to the neighboring stations are also relatively high. Therefore, the rainfall pattern at the Badulla station will be different from those of its neighboring stations. These facts lead to lower CCs at the Badulla station compared to those of the others. The IDW method gave good predictions for the Badulla station. Table 8 summarizes the methods observed to be suitable based on the CCs and the number of neighboring stations that were considered.

Conclusions
Based on the analysis, it can be concluded that it is not possible to name one single method from among the seven methods studied as the most suitable one for all of the stations. However, the analysis identified appropriate methods to be employed in filling missing rainfall data at a gauging station based on the number of neighboring gauging stations available for use and their correlations with that particular station for which data are filled.