Defect Detection in Woven Fabrics by Analysis of Co-occurrence Texture Features as a Function of Gray-level Quantization and Window Size

In this experimental study, the effects of gray-level quantization and tiling window size on 22 gray-level co-occurrence matrix (GLCM) features were investigated in the context of automated woven fabric defect detection. A dataset comprising 1426 images of 128×128 pixels was used, in which defective and defect-free images were split in a 50:50 ratio. Experiments were carried out with seven quantization levels (L = 4, 8, 16, 32, 64, 128 and 256) and four window sizes (N = 8, 16, 32, 64). The features were extracted from each image in the training set for each <L, N> combination and were then ranked using the joint mutual information metric. Next, for each <L, N> combination, a k-nearest neighbour classifier was trained, first with only the highest-ranking feature and then iteratively by adding features of lower rank. It was observed that a minimum of nine features was needed to achieve an acceptable (>90%) F1 score for any <L, N> combination, except when N is relatively large. The two features that contribute to improving the F1 score for every <L, N> combination were found to be Homogeneity I and Homogeneity II. It was also noted that using an 8×8 window on images with 128 gray levels resulted in a practically usable high F1 score (96.39%) with the fewest features (14).


Introduction
Almost 30% of fabric defects go unnoticed by the human eye during a typical fabric inspection process in the apparel manufacturing industry [19]. Undetected defects can result in large production losses, as they affect the quality of the output and often result in the rejection of garments at the final quality-control stages. Many researchers have attempted to solve this problem using computer vision-based techniques. However, owing to the diversity of fabric types used in the industry and the sensitivity of the algorithms to their properties, most of these studies limit their scope to defect detection in a few fabric types using a few select algorithms.
Many of the more common fabric defects manifest as local anomalies in the fabric texture. Following this assumption, texture analysis has been widely used to separate defective areas of the fabric from defect-free areas. Among the many methods available, the gray-level co-occurrence matrix (GLCM) [1] has frequently been used to calculate textural features, owing to its ability to discriminate between textures that are easily separable visually.
An important preprocessing step that is known to improve the textural features estimated from the GLCM is gray-level re-quantization. At present, studies on the effect of gray-level quantization on texture classification have been conducted for irregular textures in the geological [5] [7] and biomedical [6] [8] [9] fields, but not for near-regular textures such as those in textile samples. All of these studies recommend using fewer gray levels in order to obtain better texture classification accuracy, while two of them [5] [6] recommend gray-level quantization values that are dataset-specific. It is therefore clear that the currently available literature is limited to irregular textures. As woven fabric has a near-regular texture [18], we apply our method and conduct our analysis on such textures.
Motivated by the absence of a basis upon which an optimal gray-level quantization can be reliably chosen for co-occurrence feature extraction in plain-weave fabrics, this study investigates the effect of different gray-level quantization values on fabric defect detection. The study uses multiple window sizes for partitioning the image into smaller tiles, which is common practice for the GLCM in this domain. Moreover, the behaviors of 22 textural features that can be computed from the GLCM are considered in the classification. The contributions of this paper are the highlighting of the co-occurrence features that contribute most to the classification, the determination of the fewest features required to achieve appreciable detection accuracy, and the establishment of an objective framework for choosing the optimal quantization/window-size combination in future research on plain-weave fabric defect detection.

Dataset
Our dataset was created as follows. From the 195 512×512-pixel images in the publicly available Cotton Incorporated dataset [20], the images of unpatterned plain-weave fabrics containing Broken End, Broken Pick, Burl, Coarse End, Coarse Pick, Dirty Yarn, Foreign Yarn, Knot, Oil Spot and Slubs were selected. They were randomly cropped with a 128×128 window and annotated with the support of domain experts. As the defect-free images thus obtained outnumbered the defective images, the set of defect-free images was undersampled to obtain a perfectly balanced dataset comprising 1426 images.

Gray-level Quantization
The gray-level depth (L) of an image has an effect on the textural features calculated from the GLCM. Many studies encourage gray-level depth reduction through down-sampling for multiple reasons. First, a lower depth requires less computational time and cost and thereby facilitates real-time processing with limited resources. Additionally, when similar gray levels are merged during reduction, the effect of noise on the image is also averaged to a certain degree [5]. Similarly, the down-sampling process mitigates the undesirable effects of uneven illumination [6]. All these factors help reduce the extent of information included in the sample to match the extent required for isolating defects from other natural variation in the fabric texture.
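As a concrete illustration of this preprocessing step, uniform re-quantization of an 8-bit grayscale image can be sketched as follows. This is a minimal NumPy sketch; the function name and the uniform-binning scheme are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def requantize(img, levels):
    """Reduce an 8-bit grayscale image (values 0-255) to `levels` gray
    levels by uniform binning; output values lie in [0, levels)."""
    return (img.astype(np.uint16) * levels // 256).astype(np.uint8)
```

With `levels = 4`, for example, the 256 original intensities are merged into four bins, so pixels with nearby intensities collapse to the same gray level.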

Window Size
In fabric defect detection, GLCMs are calculated for non-overlapping subregions of the image, since most defects are extremely small compared to the image size. The usual approach is to split the image into tiles using an N×N sliding window. Theoretically, larger windows provide a more accurate classification of texture [7], but having a window smaller than the smallest defect ensures its detection [4]. Using a smaller window also means splitting the image into more tiles and hence computing more GLCMs, thereby increasing the computational cost. Even though the largest window always provides the fastest computation, since it splits the image into the fewest tiles, it will fail to detect defects that are much smaller than its size.
To determine the window size that is sufficiently large to provide an accurate estimation of texture without ignoring the distortion at the defect, four window sizes (N = 8, 16, 32, 64) are studied.
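The non-overlapping tiling described above can be sketched as follows (a minimal NumPy sketch; the function name is illustrative):

```python
import numpy as np

def tile(img, n):
    """Split a square image into non-overlapping n-by-n tiles,
    scanned row by row."""
    h, w = img.shape
    return [img[r:r + n, c:c + n]
            for r in range(0, h, n)
            for c in range(0, w, n)]
```

For a 128×128 image and N = 8 this yields 256 tiles, whereas N = 64 yields only 4, which illustrates the computational trade-off noted above.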

Gray-level Co-occurrence Matrix
The GLCM is used to extract the second-order statistics of an image. The matrix represents the number of occurrences of gray-level pairs, i and j, that are a distance d apart along a direction θ. Hence, the GLCM captures the frequency of spatial relationships among pixels in textured images. The GLCM is defined as [1] P(i, j) = ‖{[(x1, y1), (x2, y2)] : x2 − x1 = d cos θ, y2 − y1 = d sin θ, I(x1, y1) = i, I(x2, y2) = j}‖ …(1) where (x1, y1) and (x2, y2) are pixels, I(•) is the gray level of a pixel and ‖•‖ is the number of pixel pairs that satisfy the conditions. Studies on textile images of multiple resolutions have revealed that a distance of one pixel gives better classification accuracy [2]. For this reason, and to allow relative comparisons with respect to the parameters under consideration, a distance of one pixel is used in this study. To make the textural features invariant to rotation, the GLCMs of all four directions (θ = 0°, 45°, 90°, 135°) are obtained from a tile, and the features calculated from them are averaged [1] [4]. Prior to the calculation of features, the GLCMs derived from each tile are normalized.
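A direct, unoptimized construction of the normalized GLCM for one offset at distance one can be sketched as follows. The offset list corresponds to the four directions, and the code is an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

# offsets (dr, dc) for theta = 0, 45, 90 and 135 degrees at distance d = 1
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]

def glcm(patch, levels, offset):
    """Normalized co-occurrence matrix of `patch` (integer gray levels
    in [0, levels)) for one (dr, dc) offset."""
    dr, dc = offset
    h, w = patch.shape
    P = np.zeros((levels, levels))
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                P[patch[r, c], patch[r2, c2]] += 1
    return P / P.sum()   # normalize before feature extraction, as in the text
```

In practice one would compute the four matrices of a tile with each offset in `OFFSETS` and average the features derived from them, per the rotational-invariance scheme described above.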

Extracting the Feature Vector
In the interest of improving the detection accuracy, the textural feature vector extracted from each image is compared with that of a defect-free image of the same fabric. For referential clarity, images belonging to the latter group will be known as reference images and images in the dataset will be known as query images.
The procedure that is used to extract the textural feature vector from a query image and compare it with the fabric's reference image is detailed in the rest of this subsection. Apart from the gray scaling and gray-level requantization, no additional preprocessing was applied to the images, in order to preserve all micro-level features in their natural form.
An image that is re-quantized to L gray levels and tiled with a window of size N results in a set of T tiles, where T = (128/N)². Four GLCMs, one for each direction θ, are derived from each tile. The 22 textural features presented in [9] are calculated from each GLCM, and the same features calculated from the four GLCMs of one tile are averaged to ensure rotational invariance. This results in each of the T tiles being described by a 22-element feature vector, given by the corresponding matrices for the reference image and query image, respectively, in Figure 1. The aforementioned set of 22 features includes GLCM features such as energy, entropy, correlation II and homogeneity II [2] [12] that are commonly used in fabric defect detection. The remaining features are used for texture classification in general [4]-[8].
For the defect-free reference image of the fabric, it is assumed that the value of a given feature is similar across all tiles, because woven fabrics have near-regular texture. Therefore, to represent the reference image by a single textural feature vector, each feature is averaged across all tiles.
A query image may be either defect-free or defective. If a query image is defective, distortions in the texture of a tile caused by the presence of a defect lead to one or more textural features having abnormally high or low values. Therefore, both the maximum and the minimum of each textural feature across tiles are compared with the reference feature vector by calculating the absolute deviations between them. For each feature, the larger of the two absolute deviations is taken, and the result is the feature vector of the query image.
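The comparison just described can be sketched in a few lines of NumPy. The function name and array shapes are illustrative assumptions:

```python
import numpy as np

def query_vector(tile_feats, ref_vec):
    """tile_feats: (T, 22) array of per-tile features of a query image;
    ref_vec: (22,) per-feature average over the reference image's tiles.
    Returns, per feature, the larger of |max - ref| and |min - ref|."""
    dev_max = np.abs(tile_feats.max(axis=0) - ref_vec)
    dev_min = np.abs(tile_feats.min(axis=0) - ref_vec)
    return np.maximum(dev_max, dev_min)
```

A defective tile pushes one of the per-tile extremes away from the reference average, so the resulting deviation vector carries the defect signal even when most tiles are defect-free.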
According to a previous study [12], comparing the feature average across tiles with the feature maximum or minimum across tiles provides a feature set suitable for more accurate classification.

Feature Ranking
Some features in high-dimensional feature spaces may carry non-discriminative or correlated information from the problem domain's perspective. Such features do not contribute to classifier performance and, in some cases, can even influence it negatively by biasing the classifier in directions outside the problem domain. Feature ranking determines the features that are most discriminative and relevant to a given classifier model.
In order to separate the feature-ranking process from the classification, joint mutual information (JMI) is used to rank features in this study. The classifier-independent features selected in this way are generic. Additional advantages of using JMI are a lower chance of overfitting during classification and faster computation [10] compared to classifier-dependent techniques. JMI is chosen from among the classifier-independent methods because it provides the best trade-off between accuracy and stability [11].
The mutual information between two random variables X and Y is defined as I(X; Y) = Σ_x Σ_y p(x, y) log[p(x, y) / (p(x) p(y))] …(4) where p(x, y) is their joint probability density function and p(x) and p(y) are their respective marginal probabilities. When the mutual information is conditioned on a third variable Z, it can be expressed as I(X; Y | Z) = Σ_z p(z) Σ_x Σ_y p(x, y | z) log[p(x, y | z) / (p(x | z) p(y | z))] …(5) The JMI score finds the feature that adds the most new information to a set of already selected features (S). The score is formulated for a feature under investigation (X_k) as J_JMI(X_k) = Σ_{X_j ∈ S} I(X_k, X_j; Y) …(6) where Y is the target class.
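A greedy ranking driven by the JMI score of equation 6 can be sketched for discretized features as follows. The pair-encoding trick (combining two small-integer features into one discrete variable) and all names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from itertools import product

def mi(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    x, y = np.asarray(x), np.asarray(y)
    total = 0.0
    for a, b in product(np.unique(x), np.unique(y)):
        pxy = np.mean((x == a) & (y == b))
        if pxy > 0:
            total += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return total

def jmi_rank(features, y):
    """Rank the columns of `features` (small non-negative integers) by
    greedily maximizing the JMI score against the class labels `y`."""
    remaining = list(range(features.shape[1]))
    # seed with the feature of highest plain mutual information with y
    first = max(remaining, key=lambda k: mi(features[:, k], y))
    order = [first]
    remaining.remove(first)
    while remaining:
        def score(k):
            # encode the pair (X_k, X_j) as one discrete variable
            # (valid while feature values stay below 100)
            return sum(mi(features[:, k] * 100 + features[:, j], y)
                       for j in order)
        best = max(remaining, key=score)
        order.append(best)
        remaining.remove(best)
    return order
```

In the actual pipeline the 22 continuous features would first be discretized (e.g. binned) before this kind of count-based mutual-information estimate applies.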

Detection
Past studies on texture analysis [7] [8] [9] similar to our work have opted to use Fisher's linear discriminant analysis (FLDA) for classification. According to the earliest of these studies [7], the reason for choosing FLDA over a machine learning technique was the scarcity of samples in their dataset. Because the dataset used in this research has an adequate number of samples for a binary classification problem, and because of the ubiquity and success of machine learning techniques at present, this study uses the k-nearest neighbour (k-NN) classifier with standard five-fold cross-validation (FFCV) to determine the optimal number of neighbours.
Since the classes in the dataset are balanced, k-NN is an acceptable algorithm for the classification, and the fact that it is a non-parametric algorithm is an advantage. When the k-NN classifier is given a sample to classify, it calculates the dissimilarity between that sample and each sample in the training set within the feature space. Here, the Manhattan distance is used as the dissimilarity measure. The Manhattan distance between two samples a and b is given by d(a, b) = Σ_i |a_i − b_i| …(7) where a_i and b_i are the elements of the feature vectors of a and b. Then the k nearest samples are considered and the test sample is assigned the majority class.
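A minimal k-NN classifier under this distance can be sketched as follows (an illustrative NumPy sketch, not the authors' implementation):

```python
import numpy as np

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training samples
    under the Manhattan (L1) distance."""
    d = np.abs(train_X - x).sum(axis=1)       # L1 distance to each sample
    nearest = train_y[np.argsort(d)[:k]]      # labels of the k nearest
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]
```

In the study's setting, `train_X` would hold the per-image deviation feature vectors and `k` would be chosen by five-fold cross-validation.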
The F1 score is used to measure classifier performance. Accuracy has also been calculated for the purpose of comparing our results with those of recent studies. The definitions of these metrics are given below: Precision = TP / (TP + FP) …(8) Recall = TP / (TP + FN) …(9) F1 = 2 × Precision × Recall / (Precision + Recall) …(10) Accuracy = (TP + TN) / (TP + FP + TN + FN) …(11) where TP, FP, TN and FN are the true positives, false positives, true negatives and false negatives, respectively.
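Computed directly from the confusion-matrix counts, these metrics can be sketched as follows (an illustrative helper, assuming non-degenerate counts so no denominator is zero):

```python
def f1_and_accuracy(tp, fp, tn, fn):
    """F1 score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return f1, accuracy
```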

Assessment of Optimal Feature Set
The dataset is split into a training set containing 1018 images and a test set containing 408 images. The reference images used in the feature extraction are not considered part of the dataset.
For notational clarity, let <L, N> denote a single combination of the gray-level quantization L and the window size N. The experiments were performed as follows.
The feature vectors for all <L, N> combinations were extracted using the method in Figure 1 for all images in the training set. For each <L, N> combination, JMI, given by equation 6, was used to rank the 22 features in the feature vector. The ranked features were normalized to the range −1 to 1. For each <L, N> combination, binary classification was carried out using a k-NN classifier with an increasing number of features, starting with the feature of the highest rank and adding features of lower rank until the full feature set was used, as shown in Figure 2. The classifier performance is measured using the F1 score, given by equation 10. The set of features that produced the best F1 score is selected as the optimal feature set.

Results and Discussion
This section is split into two subsections. Subsection 3.1 discusses the results presented in Tables 1 and 2, and Figure 3, which shows the outcomes of the classification of the full dataset. Subsection 3.2 discusses the results presented in Figure 4, which shows the outcomes of the classification on subsets of the dataset that represent each defect.

Analysis of Results for the Full Dataset
From the results of the JMI ranking of each texture feature at each gray-level quantization and window size in Table 1, it is clear that a subset of features, most prominently energy (P8), homogeneity I (P10), homogeneity II (P11) and maximum probability (P12), have considerably higher JMI scores than the rest of the features. The higher scores are most noticeable when the window size is small and fewer gray levels are used. The reason most likely stems from the texture characteristics of the fabric images.
Owing to the near-regularity of textures that occur in woven fabrics, the pixel intensities in their grayscale images take only a few values that repeat throughout the image. This results in a GLCM in which a few elements are comparatively much larger than the others. These elements correspond to the dominant intensity pairs occurring in the images. When the number of gray levels is reduced, pixels with similar intensities are merged, making the GLCM fairly sparse while increasing the magnitude of the GLCM elements that correspond to the dominant intensity pairs at those gray levels. The occurrence of a defect will either introduce new intensities or disturb the pattern of co-occurrence in its vicinity, resulting in GLCM elements of smaller magnitude in the tile(s) in which the defect occurs.
As the energy (P8) is simply the sum of the squares of the GLCM elements, the energy of GLCMs of defect-free tiles will be comparatively higher than that of defective tiles. Because the squaring of the elements further emphasizes the difference, energy should theoretically be able to distinguish between defect-free tiles and tiles with defects that introduce new intensities to the image. This is presumably the reason why energy has a higher score. The maximum probability (P12) extracts the number of occurrences of the most frequently occurring pixel pair, i.e. the maximum of the GLCM, and therefore behaves similarly to the energy. The two homogeneity features (P10 and P11) depend not only on the magnitude of the GLCM elements but also on their row and column indices. In other words, the homogeneity features depend on the difference of the intensities of pixel pairs as well as on the number of occurrences. Therefore, P10 and P11 should be able to detect tiles with defects that change the contrast of the image in their vicinity.
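For concreteness, these four features can be computed from a normalized GLCM as follows. The homogeneity formulas shown are the common inverse-difference and inverse-difference-moment forms, which may differ in detail from the paper's exact P10/P11 definitions:

```python
import numpy as np

def glcm_features(P):
    """Energy, maximum probability and two homogeneity variants
    for a normalized GLCM P."""
    i, j = np.indices(P.shape)
    energy = (P ** 2).sum()                              # sum of squares
    max_prob = P.max()                                   # largest element
    homogeneity_1 = (P / (1 + np.abs(i - j))).sum()      # inverse difference
    homogeneity_2 = (P / (1 + (i - j) ** 2)).sum()       # inv. diff. moment
    return energy, max_prob, homogeneity_1, homogeneity_2
```

Note how the homogeneity variants weight each element by its distance from the diagonal, so off-diagonal mass introduced by a contrast-changing defect lowers their values.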
From the results of the k-NN classification on the test set with an increasing number of features in Figure 3, it is clear that using a very large window, relative to the image size, will invariably result in lower F1 scores. This result is contrary to the statement in [5] that larger windows, in theory, provide more accurate classifications. The reason smaller windows produce better results in this study is the small size of most defects that commonly occur in fabrics. Using larger windows will fail to capture the texture distortion in the locality of a defect which is significantly smaller than the defect-free area. The claim that the window size should be smaller than the defective area [4] cannot be verified here, as the smallest window size used in this study is 8×8 and there are much smaller defects in the dataset. However, it is believed that using overly small windows will have a negative effect on the classification since even subtle differences in the texture that are not considered defects will affect the textural features being calculated. These subtle differences might arise as a result of the changeable nature of the yarn structure or stray yarn fibres sticking out of the fabric.
It should also be noted in Figure 3 that, even if a greater number of gray levels is used, a sufficiently large textural feature set will eventually result in acceptably accurate classifications (F1 scores > 90%), provided the window size used for tiling is reasonably small. This finding contrasts with the result in [9], which claims such invariance is present only if the textural features used in the classification have not been averaged over multiple orientations of the GLCM. The reason for the difference might be that the textures analysed in [9] were natural, irregular textures, whereas the textures analysed in this study are near-regular and thus respond differently to the various orientations of the pixel pairs compared when deriving the GLCM.
The maximum F1 scores obtained for each <L, N> combination, and the features used to obtain them, are given in Table 2. The combination that uses the fewest features (14) to produce an acceptable accuracy is 128 gray levels with an 8×8 window. The F1 score of the classification in this case is 96.39%, which is only 2.4% less than the maximum.
It can also be seen from Figure 3 that using 9 features in the classification will raise the F1 score above 90% for any <L, N> combination, excluding the 64×64 window case. Therefore, it can be inferred that at least 9 feature computations are required to reliably achieve an F1 score greater than 90%, given that the window size is relatively small. From Table 2, judging by the features that are consistently ranked among the top 9 for a majority of <L, N> combinations, it can be inferred that the features P10, P11, P21, P8, P22, P12, P3, P20 and P4 are most likely to improve the classification F1 score for any <L, N> combination. It is also clear that the two features P10 and P11 are crucial in raising classification accuracy, as they are ranked among the top 9 features for all <L, N> combinations. A set of 9 features is considerably smaller than the 17 features proposed in another study [9]. The reduction in the number of features might also be a consequence of the near-regularity of woven fabric texture compared to the natural, irregular textures of the ultrasound images used in that study. A performance comparison of several commonly used classifiers with the k-NN is given in Table 3, showing the superior performance of the k-NN for our feature set, which is the reason for its choice in this analysis. All classifiers were optimized by performing a grid search with FFCV.

Analysis of Results for Each Defect
This subsection compares the results of the classification on each defect type as shown in Figure 4 with inferences made in the previous subsection.
It is immediately apparent that the Dirty Yarn defect is classified with a very high F1 score for all window sizes and all quantization levels. Judging by the sample given in Figure 5, the defect is distinct enough not to merge with the surrounding defect-free region during gray-level down-sampling, and it spans the fabric in a tapering manner such that the textural information it produces can be captured by both large and small windows. These properties are not prominent in the other defects shown in the same figure. Even though the 64×64 window generally produces lower classification F1 scores, the Dirty Yarn defect is perfectly classified for all quantization levels, whereas Oil Spot and Burl are perfectly classified for six of the seven quantization levels at this window size. Observation of samples of these defects in Figure 5 shows that they are all "global" defects, i.e. ones that occupy a considerable area of the image. The issue of large windows failing to capture the texture distortion created by small defects does not apply to such defects.
Regarding the number of features required for an optimal F1 score, it can be inferred that Oil Spot, Slubs and Coarse End require many features to quantify the texture information they produce at any <L, N> combination, whereas all other defects can be classified accurately with fewer features.
It is also clear from Figure 4 why the overall classification F1 score was highest at L = 16 and N = 8, as all defects except Broken Pick and Coarse End have been classified with near-perfect F1 scores. However, it should be noted that, even though the fewest features for an appreciable F1 score in the overall classification are obtained with an 8×8 window and L = 128, the fewest features required to achieve the maximum F1 score for each defect vary considerably.
The best performance achieved in this study for the full dataset is compared with the results of similar studies in Table 4.

Conclusions
The aim of this research was to analyze the effect of the window size and the number of gray levels on varying sets of GLCM features.
The results of the experimental study show that a set of 9 features is sufficient to reach an F1 score exceeding 90%. Of these 9 features, Homogeneity I and Homogeneity II contributed to improving the classification accuracy regardless of the window size or the number of gray levels. The combination that produced an acceptable F1 score (96.39%) using the fewest features (14) was 128 gray levels with an 8×8 window. Another outcome of the study is that using a greater number of gray levels still produces reasonably high F1 scores if a greater number of features is used in the classification. Furthermore, the finding that large windows impact the classification accuracy negatively has been confirmed. This work has shown that the requirements for classifying near-regular textures with the GLCM differ from those for classifying irregular textures, mainly in that fewer features suffice to achieve acceptable classification accuracies.