Generalized linear Bayesian models for standardizing CPUE: an application to a squid-jigging fishery in the northwest Pacific Ocean

1 College of Marine Sciences, Shanghai Ocean University, 999 Hucheng Ring Road, Shanghai 201306, China. E-mail: xjchen@shou.edu.cn 2 The Key Laboratory of Shanghai Education Commission for Oceanic Fisheries Resources Exploitation, 999 Hucheng Ring Road, Shanghai 201306, China. 3 The Key Laboratory of Sustainable Exploitation of Oceanic Fisheries Resources, Ministry of Education, 999 Hucheng Ring Road, Shanghai 201306, China. 4 School of Marine Sciences, University of Maine, Orono, Maine 04469, USA.


INTRODUCTION
Catch per unit effort (CPUE) of a fishery is often used as an abundance index which is usually assumed to be proportional to the stock abundance (Hilborn and Walters, 1992). For many fish species, such as those with short life spans and weak relationships between spawning stock biomass and recruitment (Yatsu et al., 1998;Ichii, 2003;Ichii and Mahapatra, 2004), it is often difficult to conduct a traditional stock assessment based only on age-or length-structured data. Thus, developing a reliable time series of abundance indices from fisheries data is critical to monitoring the population dynamics of these fish species.
Observed fisheries CPUE data are, however, influenced by many factors, in addition to fish population abundance, including spatial-temporal factors such as area and season and environmental factors such as sea surface temperature (SST) and sea surface salinity (SSS; Rodhouse, 2001). The impacts of these factors on CPUE may shift the assumed proportionality between observed CPUE and stock abundance. Thus, CPUE standardization is needed to remove the impacts of factors other than population abundance (Gavaris, 1980;Quinn and Deriso, 1999;Tian et al., 2009). Many statistical models have been developed for CPUE standardization: the generalized linear model (GLM) (Nelder and Wedderburn, 1972) is the most commonly used method in CPUE standardization, and its extension the generalized linear mixed model is also commonly used in evaluating the random effect of interactions (Pinheiro and Bates, 2000). Because of the limitations of the GLM in dealing with nonlinearity between dependent variables and multiple predictive variables, the generalized additive model (GAM) model is increasingly used recently for CPUE standardization (Bellido et al., 2001;Tian et al., 2009). However, these approaches may not work properly for some fisheries with unique characteristics, such as the neon flying squid (Ommastrephes bartramii) fishery in the north Pacific Ocean.
The neon flying squid is an oceanic cephalopod widely distributed in subtropical and temperate waters of the world (Roper et al., 1984). It supports a major commercial squid fishery in the northwest Pacific Ocean. Japan began the commercial fishery in 1974 after catches of the Japanese common squid (Todarodes pacificus) dropped sharply during the early 1970s, and was followed by Korea and the Taiwan province of China. In the late 1980s, with the development of the driftnet fishery, annual catch increased rapidly. Mainland China started a small-scale commercial neon flying squid fishery in 1994, and since then the fishery has expanded greatly (Chen et al., 2008).
The North Pacific population of O. bartramii comprises two seasonal cohorts (autumn and winter-spring) and four stocks: (1) the central stock of the autumn cohort; (2) the east stock of the autumn cohort; (3) the west stock of the winter-spring cohort; and (4) the central-east stock of the winter-spring cohort (Mori, 1997;Nagasawa et al., 1998a, b;Yatsu et al., 1998). One of the stocks for O. bartramii, the western stock of winter-spring cohort, is the main target of the Chinese jigging fishery (Chen et al., 2007(Chen et al., , 2008. This species plays an important role in the ecosystem of the North Pacific Ocean (Wang and Chen, 2005). Thus it is critical to monitor its population dynamics for the sustainable exploitation of this squid. However, because of its short life span and possible weak relationship between spawning stock biomass (SSB) and recruitment (Yatsu et al., 1998;Ichii, 2003;Ichii and Mahapatra, 2004), it is difficult to conduct a traditional stock assessment based on age-or length-structured data. Current stock assessment methods for this squid including the depletion method (Chen et al., 2008) and squid recruitment prediction (Cao et al., 2009) are based on the availability of abundance indices. Therefore, developing a reliable abundance index from the fishery data is critically important for this squid fishery.
For the western stock of the winter-spring cohort of O. bartramii, previous studies suggest that the distribution of this squid and its fishery CPUE can be significantly influenced by many environmental variables on the fishing grounds. Chen (1999) showed that the distribution and abundance of neon flying squid in the central north Pacific Ocean were strongly affected by water temperature and salinity, with temperature having a higher predictive power for abundance. The relationships between CPUE of neon flying squid fishery and SST were established in several studies (Yatsu and Watanabe, 1996;Chen, 1997;Chen and Tian, 2005). However, the CPUE standardization for this squid fishery using the GLM model showed that environment variables (SST, SSS) had no significant impacts on CPUE (Tian et al., 2009). Additionally, because of communication among fishing vessels and the use of SST information in squid searching, the monthly distribution of fishing effort is always concentrated on the highly productive areas perceived by the fishermen. Using such observed CPUE as squid abundance index may cause serious problems in monitoring the squid population. In order to overcome the problems, we explored the use of the generalized linear Bayesian model (GLBM) for the squid CPUE standardization in this study.
Bayesian statistical methods are widely used in fisheries stock assessment to provide a conceptually elegant approach for providing fishery management advice under uncertainty (McAllister and Kirkwood, 1998), but they have been rarely used for CPUE standardization. Bayesian methods are appealing as they can easily incorporate heterogeneous data and are flexible in allowing the use of new data (Zhang et al., 2009). In this study we explored and developed alternative statistical methods for CPUE standardization and applied them to neon flying squid. Factors that might affect the CPUE of the neon flying squid fishery were identified. We determined whether our approach could deal with the problems of developing an abundance index for squid in the presence of large heterogeneity in the spatial dynamics of fishing effort.

Data
Commercial squid fisheries data on the western stock of the winter-spring cohort in the northwest Pacific Ocean west of 170°E were obtained from June to November 1995-2004 from the database of the Chinese Squid-Jigging Technology Working Group. The main fishing ground of Chinese squid-jigging vessels targeting the winter-spring cohort is shown in Figure 1. The Observer Program of Chinese squid-jigging fishery, coordinated by the Technology Working Group, covers more than 70% of Chinese squid fishing vessels in the northwestern Pacific Ocean. The fishery data include fishing date, fishing location (longitude, latitude), number of fishing vessels per day, and daily catch per fishing vessel. A spatial scale of 0.5°×0.5° (referred to as a fishing grid) was used in grouping the data. The CPUE of a fishing grid was calculated by dividing the total catch by the number of fishing vessels per day within the fishing grid. The mean observed CPUE i (metric tons per fishing day for one fishing vessel, t/d) for a fishing grid was calculated as: where Catch i is the sum of catch for all the fishing vessels within a fishing grid, and Fishing days is the sum of fishing days for all fishing vessels in the fishing grid. We chose month as the time-step in grouping observed CPUE for each fishing grid. In addition to catch and effort data, each monthly record also includes temporal information (year and month) and spatial information (latitude and longitude), and data on SST, SSS and sea level height (SLH). These environmental data were acquired from the website http://iridl.ldeo.columbia. edu/ SOURCES/.CARTON-GIESE/ and grouped by the 0.5×0.5° grid for each month, consistent spatially and temporally with the above fisheries data.

Covariates and error models
The covariates considered in the analysis were selected if they were found to influence the squid fishery CPUE or squid habitats in previous studies (Chen, 1997(Chen, , 1999Murata and Nakamura, 1998;Chen and Tian, 2005). These covariates were divided into two groups: categorical covariates, including year, month, latitude and longitude (the centre point value of each fishing grid); and continuous covariates, including SST, SSS, and SLH. Moreover, as all Chinese squidjigging vessels are modified from the same inshore bottom-trawlers and equipped with a main engine power of 2×120 kW, squid-attracting lamp power of 112 kW and 16 squid-jigging machines, and thus almost identical in fishing power (Chen et al., 2007), fishing power was not considered as a factor in modelling.
Two-way interactions among the main factors considered in this study were restricted to spatial effects versus temporal effects. The interactions between spatial-temporal factors and environmental variables were excluded in the GLM due to possible correlations between these terms (Bigelow et al., 1999). Because the life-span of neon flying squid is only one year (Yatsu et al., 1998) and this species is highly migratory, the interactive effects of spatial versus temporal factors on abundance are considered to be more important (Chen et al., 2003). Correlation matrix between variables (including interaction terms) was calculated in order to avoid the high correlated terms in the GLBM model. All of the analyses were based on the assumption that CPUE data are log-normally distributed. An examination of plots of standardized residuals and quantile-quantile plots provided no evidence for violation of this assumption (results not shown because of space limitation).

Lognormal basic model
The GLBM is based on the Bayesian estimator for the GLM. In this study we used lognormal GLM for the Bayesian estimation. Therefore, CPUEs were modelled using the log-normal distribution: where U i,j,k,l is the observed CPUE at Latitude j and Longitude k in Month l of Fishing Year i, U i,j,k,l is the mean of the distribution on the log scale for U i,j,k,l , and σ is the standard deviation of the distribution on the log scale. Following Campbell and Tuck (1996), to overcome the problem of CPUE having zero values we added a small constant (D) equivalent to 10% of the overall CPUE (total catch divided by total effort over all the year) to each observed CPUE value: The mean, U i,j,k,l , was estimated based on the effects of explanatory variables: where k is the intercept; year i , month l , lat j , and lon k are the effects of Year i, Month l, Latitude j, and Longitude k, respectively; and interactions are the effects between spatial covariates and temporal covariates. The abundance index at Latitude j and Longitude k in Month l of Fishing Year i was estimated as:

Non-hierarchical and hierarchical models
In this study we developed lognormal generalized linear Bayesian models of both hierarchical and nonhierarchical form for standardizing CPUE. Hierarchical generalized linear models are also known as Generalized Linear Mixed Models. For a non-hierarchical model, interaction between spatial covariates and temporal covariates has a fixed effect; while for a hierarchical model we can assume that the interaction has a random effect (i.e. use of nested parameters), which can lead a more reliable estimation of the interactions.

Prior distribution of parameters and model initialization
Bayesian analyses require all model parameters to have prior probability distributions. Because we only had limited knowledge about all the model parameters, we assumed that the priors for all the parameters were non-informative. We assigned a normal distribution with mean = 0 and a large variance (100000) for k, year i , month l , lat j and lon k . The normal distribution with such a large variance is similar to a uniform distribution. For σ 2 we assigned an inverse gamma distribution, 1/σ 2 ~ Gamma (0.001, 0.0001), where 0.001 and 0.0001 are the parameters of shape and rate (the inverse scale) of the gamma distribution (Ntzoufras, 2009). For the non-hierarchical lognormal model, the prior of each Interaction was assigned a normal distribution defined as N (0, 100000). For the hierarchical models, the priors of Interactions were assumed to follow a normal distribution ~N (U c , σ c 2 ), the hyperparameter, U c , was assigned a normal distribution, ~N (0, 100000) and 1/σ c 2 ~ Gamma (0.001,0.0001). Sensitivity tests were conducted to evaluate the robustness of these priors especially for the shape of the variance parameter of the Gamma.

Estimated abundance index using predictive CPUE
In order to deal with the high spatial heterogeneity of fishing effort distribution in the squid fishery, we used our models to predict CPUE in unfished areas based on the estimated effects of the explanatory variables as long as these areas were fished in some of the other months. Therefore, the abundance index for Year i and Month l, A i,l , was estimated in two different ways, depending on whether predicted CPUE values in the unfished areas were used in the calculation of the abundance index. If predicted catch rates were used, A i,l was calculated as: where TG is the total number of fishing grids fished in the time series. If predicted catch rates were not used, TG refers to the number of fished grids in Year i and Month l.

Model selection
We started the process by analyzing the GLBM based on Equation (5) without adding interaction covariates and selected the most fitted model using the stepwise approach based on deviance information criteria (DIC) (Spiegelhalter et al., 2002). The smaller the DIC value is, the better the model is. We then processed it by adding interaction covariates (year i *month l , year i *lon k , year i *lat j , month l *lon k and month l *lat j ) both non-hierarchically and hierarchically, and identified the best-fitting models by these two approaches based on the DIC. Uncorrelated terms in each model were ensured during the modelling process.

Convergence diagnostics
Convergence is a fundamental assumption for the application of any method for sampling posterior distribution (Cowles and Carlin, 1996;Brooks and Gelman, 1997). In this paper, a large number of MCMC "burnin" and iterations were done in order to stabilize the central tendency of posterior distributions for all parameters. "Burn-in" was discarded before the analysis. Two chains were used for convergence checking by comparing the variances within the chains and across the chains (Gelman and Rubin, 1992). The within-chain variance was calculated by taking n draws of m parameters with Equation (8), and the estimated variance across the chains was computed with Equation (9).
Alternate convergence diagnostic of Geweke (1992) was also applied in this study because we used the posterior mean as the estimated value.

Solving models
All of the Bayesian analyses and modelling were implemented with R version 2.10.0 and WinBUGS14 (Bayesian inference Using Gibbs Sampling; Spiegelhalter et al., 2003). Appropriate MCMC iterations were used on the basis of the convergence assessment and a burn-in period with 100000 initial iterations being discarded was used in order to ensure that samples were drawn from a distribution close enough to the true stationary distribution to be usable for the estimation and inference. As initial values were part of the computing process, we assigned the covariates different initial values for two chains when we did MCMC runs. Autocorrelations among draws from posteriors were eliminated by thinning the samples for every 50 MCMC runs (Geweke, 1992).

Spatial distribution
The analyses were based on 3361 records, of which 3357 had positive catches and 4 had zero catch. The distribution of effort is shown in Figure 1a, indicating that there were 587 fishing grids fished from 1995 to 2004, while effort in a given month was only distributed in less than 100 fishing grids (Fig. 1b). Most fishing effort was deployed in August to October and effort data in June and November only consisted of a small proportion of the overall efforts.

Correlation
The correlations of all covariates including interaction terms are summarized in the matrix plot (Fig. 2). Non-significant correlations were observed between covariates without interaction terms. However, interaction terms of month l * lon k and month l * lat j showed high correlations with other interaction terms. Therefore, we excluded these two terms from the modelling process. There is no significant correlation between CPUE and any single factor alone considered in this study (Fig. 2).

Model selection and comparison
The best-fitting models derived from three approaches (i.e. interaction terms being excluded, in-teraction terms of fixed effects being included and interaction terms of random effects being included) are listed in Table 1. The DIC results showed that the model selected in GLBM analyses and containing month l , lat j, SST, SSS and SLH had the lowest value, suggesting that this model outperformed the other two best-fitting models which were derived from considering interaction terms of fixed and random effects ( Table 1). The results of this model are shown in Table 2.
For the GLBM analyses from which interaction terms were excluded, the best-fitting model revealed that all the covariates except for year i and lon k considered in this study had significant effects (95% posterior interval does not cover zero) on CPUE and SSS had the most significant effect on CPUE (Fig.  3a). The covariates, month l lat j and SST, were the second, third and fourth most significant covariates, respectively (Fig. 3a), and SLH had a weaker impact on CPUE than the other four significant covariates (Fig. 3a). For the non-hierarchical GLBM analyses in  which interaction terms were assumed to have fixed effects on CPUE, the best-fitting model showed that the covariates of lat j SST, SSS, SLH, year i × month l and year i × lon k had significant effects on CPUE (Fig. 3b). Of the 6 significant covariates, SSS, lat j , and SST were the most, second and third significant covariates, respectively, and year i × month l , year i × lon k and SLH had very weak impacts on CPUE, compared with the other 3 significant covariates (Fig.  3b). For the hierarchical GLBM analyses in which interaction terms were assumed to have random effects on CPUE, the best-fitting model was the same as the second approach but with lower DIC value. The impacts of covariates on CPUE were almost the same as the best-fitting model derived from the second approach (Fig. 3c).
The spatial-temporal interaction terms with both fixed and random effects were found to be less important in affecting CPUE according to the DIC results (Table 1), and the spatial and temporal factors tended to influence CPUE independently as a covariate rather than together as a covariate.

Convergence diagnostics
There is no evidence of non-convergence (e.g. for posterior distributions of parameters for the best-fitting model based on the convergence diagnostic (Table  2). Marginal posterior density plots of parameters for the three best-fitting models also showed a relatively stable central tendency (Fig. 4). Alternate convergence diagnostic of Geweke (1992) indicated that there was no evidence of non-convergence (P>0.05) for the posterior distributions of parameters in the best-fitting model derived from the three approaches.

Comparison of observed CPUE and standardized CPUEs
The CPUE values standardized using the best-fitting model with two different sets of data (with and without predicted abundance index) were compared with the observed CPUE under different spatial and temporal scales. For yearly CPUEs which were averaged across all fishing months, mean standardized CPUE estimates from the best-fitting model with the lowest DIC value not including predicted CPUE data in unfished grids in modelling had a similar trend to the observed CPUE, in particular for the later years, but tended to fluctuate less than the observed CPUE. Both showed that the highest stock size occurred in 2000. However, the derived abundance index not including the predicted CPUE data in unfished grids showed that the lowest abundance appeared in 2002, while for the observed CPUE it occurred in 2001 (Fig. 5). Yearly mean standardized CPUE with the inclusion of predicted CPUE data in the unfished grids showed a stable trend in 1995 to 2004 with little fluctuation (Fig. 5). For monthly CPUE, the standardized CPUE with and without predicted CPUE data in the unfished grids showed the same trend, i.e. the stock size increased with the months, but in general the standardized CPUE with predicted CPUE data in the unfished grids was lower than the observed CPUE, in particular for the months when fishing activity peaked (August to October; Fig. 6).

DISCUSSION
Here we developed Bayesian CPUE standardization models, non-hierarchical lognormal and hierarchical lognormal models, and applied these models to the CPUE standardization for the neon flying squid fishery to address the problems caused by the spatial heterogeneity of fishing effort distribution in the neon flying squid fisheries (Tian et al., 2009). This study suggests that the approaches developed in this study can effectively deal with the CPUE standardization and estimation when spatial distribution of fishing efforts is highly heterogeneous, a phenomenon commonly observed in commercial fisheries (Hilborn and Walters, 1992).
In commercial fisheries for neon flying squid, fishing effort is usually concentrated in one small fishing area showing the highest returns because of high efficiency in searching fishing grounds and communications between vessels (Tian et al., 2009). Our results show that monthly observed CPUEs appeared to be higher than the standardized CPUE, and the standardized CPUE based on data without predicted CPUE values in the unfished grids was lower than that with  predicted CPUE values (Fig. 6), especially during the fishing peak of August to October (Chen, 1997). This indicates that high concentration of spatial distribution of fishing efforts may result in the observed CPUE and even the standardized CPUE without predicted CPUE values in the unfished grids having hyperstability relationships with abundance, which implies that CPUE stays high as abundance drops (Hilborn and Walters, 1992). Thus, previous studies for neon flying squid, which used observed CPUE as an abundance index directly in stock assessment (Chen et al., 2003) and evaluating impacts of environmental variables on abundance (Cao et al., 2009), may be biased. On the other hand, the models developed in this study have the capacity to predict abundance indices for unfished areas based on the estimated effects of the explanatory variables as long as these areas are fished in other months. Using such monthly CPUEs standardized with predicted abundance indices can deal with the spatial aggregation of fishing effort to a certain extent and reflect the abundance distribution more accurately. Previous studies also indicated that ignoring the unfished grids implicitly assumed that they behaved in the same way as the fished grids and could lead to severe hyperdepletion in abundance indices for fisheries, and several approaches were proposed to deal with the unfished grids (Walters, 2003;Campbell, 2004). However these approaches tend to bypass the unfished areas because of interpolation or weighting of the spatial regions.
The GLBM developed in this study, on the other hand, aims to estimate the abundance index of unfished areas directly by sharing the information using the Bayesian statistical method.
The best-fitted model identified in the study includes the covariates of month l , lat j , SST, SSS and SLH. These spatio-temporal factors were considered important in determining the spatial distribution of this squid fishery (Chen, 1997;Chen et al., 2003;Cao et al., 2009). The selection of the environmental factors in this study is consistent with the results of previous studies (Chen, 1997;Chen et al., 2003;Cao et al., 2009). In May or June when the fishing vessels of Chinese mainland started fishing operation, the squid gradually migrated northward from their spawning ground to the fishing ground for feeding and then southward back to spawning ground for spawning (Murata and Nakamura, 1998). The stock biomass would be closely related to the timing when peak recruitments migrated into the fishing ground and the latitudinal location where peak recruitments could migrate. Therefore lat j , which is the most significant covariate of all the spatial-temporal covariates of the best-fitted model (Fig.  3), is a critical factor for determining the distribution of peak recruitment. Our result indicates that SST and SSS were the main factors affecting CPUE. The other species of Ommastrephidae squids also have the similar linear relationships between abundance and SST or SSS (Rodhouse, 2001). However, the results of using traditional GLM models showed SST and SSS were not significant in explaining CPUE (Tian et al., 2009). The spatial-temporal interactions of both hierarchical and non-hierarchical models were identified as not significant in influencing CPUE as they were driven by environmental factors which were already included. For example, special oceanographic zones in the North Pacific such as transition zones, chlorophyll fronts (TZCF) and optimum-SST front zones for neon flying squid, which are closely related to the life history of neon flying squid, have a significant seasonal northsouth movement (Ichii et al., 2009). We also added the year effect (non-significant) for re-running the GLBM (Maunder and Punt, 2004), but found that it was hard to achieve convergence if we included the year effect.
The estimated posteriors were similar for different priors in the sensitivity analyses, indicating that the posteriors were determined by the data rather than by the priors. Compared with the observed CPUEs for neon flying squid stock, the standardized CPUEs tended to have smaller fluctuations and were in general lower than the observed CPUE, indicating that the impact of spatial, temporal and environmental factors on catch rates was largely removed through the standardization process.
There are also two issues that need to be addressed. Firstly, although we have a relatively large number of squid fishery data sets with a 10-year time span, these data lack spatial contrast. For example, there are almost no zero catch data because the fishing vessels would immediately depart the area when CPUEs dropped below economically viable levels. In a given time of the fishing season, the fishing vessels are usually concentrated at a few locations with similar environmental conditions, and there are small environmental gradients on the fishing grounds. This may result in loss of effective information for the Bayesian estimation. Additionally, the limited data of June and November could lead the results of monthly standardized CPUE for June and November to have large uncertainties. Future studies should compare the temporal trend of standardized CPUE with abundance indices derived from a fishery-independent scientific survey which is often considered as a more reliable abundance index because of its scientifically rigorous design (Hilborn and Walters, 1992).
Secondly, the basic model of GLBM developed in this study is GLM, which assumes linearity between the dependent and explanatory variables. However, many studies suggest that functional relationships between CPUE and environmental variables are likely to be non-linear (Bigelow et al., 1999;Daniel and Michel, 2004). The general additive model (GAM), known as the extension of GLM with additive models, can analyze non-linear relationships (Hastie and Tibshirani, 1990), and Tian et al. (2009) showed that GAM models could explain more variations in CPUE standardization for neon flying squid with weak effects on CPUE. However, overfitting can be a problem with GAMs, which often make the fitted relationship perform less well on a new data set than on the data set used for model fitting (Everitt, 2002). Bayesian models can easily incorporate heterogeneous data and are flexible in allowing the use of new data (Zhang et al., 2009). Therefore, although we have to assume linearity of the relationship between neon flying squid CPUE and environmental variables (e.g. SST and SSS), the issue of formally detecting nonlinearity deserves further investigation before it is attempted to use GAM for neon flying squid CPUE standardization.
In summary, we explored a new modelling approach in this study to standardize fisheries CPUE, which tends to yield a more reliable abundance index and deals with the problem of spatial aggregations of fishing efforts by including predicted CPUEs of unfished grids in the CPUE standardization. However, we warn of potential problems associated with lack of contrast in the data used in this study, suggesting that the proposed modelling approach should be further tested using data from other fisheries.