
# A deep dive on GLMs in frequency-severity models

Jupyter notebook here

This notebook is a deep dive into Generalized Linear Models (GLMs), with a focus on the GLMs used in insurance risk modeling and pricing (Yan, J. 2010). I have used GLMs before, including a logistic regression for landslide geo-hazards (Postance, 2017) and for modeling extreme rainfall and developing catastrophe models (Postance, 2017). The motivation for this post is to develop a deeper knowledge of the assumptions and application of the models and methods used by insurance actuaries, and to better understand how these compare to machine learning methods.

### Case study dataset: motorcycle insurance

The Ohlsson dataset is from the former Swedish insurance company Wasa. It includes aggregated customer, policy and claims data for 64,548 motorcycle coverages over the period 1994-1998. The data is used extensively in actuarial training and syllabuses worldwide (Ohlsson, 2010).

Variables include:

| Variable | dtype | null | nunique |
|---|---|---|---|
| Age | float64 | 0 | 85 |
| Sex | category | 0 | 2 |
| Geog | category | 0 | 7 |
| EV | category | 0 | 7 |
| VehAge | int64 | 0 | 85 |
| NCD | category | 0 | 7 |
| PYrs | float64 | 0 | 2577 |
| Claims | int64 | 0 | 3 |
| Severity | float64 | 0 | 590 |
| Claim | int64 | 0 | 2 |
| SeverityAvg | float64 | 0 | 590 |

### EDA

All data: a low number of claims and a low claim frequency (~1%).

| | Age | VehAge | PYrs | Claims | Severity | Claim | SeverityAvg |
|---|---|---|---|---|---|---|---|
| count | 64548 | 64548 | 64548 | 64548 | 64548 | 64548 | 64548 |
| mean | 42.416062 | 12.540063 | 1.010759 | 0.010798 | 264.017785 | 0.010380 | 246.964360 |
| std | 12.980960 | 9.727445 | 1.307356 | 0.107323 | 4694.693604 | 0.101352 | 4198.994975 |
| min | 0.000000 | 0.000000 | 0.002740 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 31.000000 | 5.000000 | 0.463014 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 44.000000 | 12.000000 | 0.827397 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 52.000000 | 16.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 92.000000 | 99.000000 | 31.339730 | 2.000000 | 365347.000000 | 1.000000 | 211254.000000 |

Claims data:

| | Age | VehAge | PYrs | Claims | Severity | Claim | SeverityAvg |
|---|---|---|---|---|---|---|---|
| count | 670 | 670 | 670 | 670 | 670 | 670 | 670 |
| mean | 35.476119 | 7.965672 | 1.579415 | 1.040299 | 25435.552239 | 1.0 | 23792.620149 |
| std | 12.851056 | 6.768896 | 2.983317 | 0.196805 | 38539.415033 | 0.0 | 33765.250000 |
| min | 16.000000 | 0.000000 | 0.002740 | 1.000000 | 16.000000 | 1.0 | 16.000000 |
| 25% | 25.000000 | 2.000000 | 0.430137 | 1.000000 | 3031.500000 | 1.0 | 3007.750000 |
| 50% | 30.000000 | 7.000000 | 0.790411 | 1.000000 | 9015.000000 | 1.0 | 8723.500000 |
| 75% | 47.000000 | 12.000000 | 1.497945 | 1.000000 | 29304.500000 | 1.0 | 26787.750000 |
| max | 68.000000 | 55.000000 | 31.167120 | 2.000000 | 365347.000000 | 1.0 | 211254.000000 |

Claims and losses by each variable:

### Modeling

Train-Test split

```python
# train-test split stratified on claims
# take copies to overcome chained assignment
train, test = train_test_split(df, test_size=0.3, random_state=1990, stratify=df['Claim'])
train = df.loc[train.index].copy()
test = df.loc[test.index].copy()

a, b, c, d = train['Claim'].sum(), train['Severity'].sum(), test['Claim'].sum(), test['Severity'].sum()
print(f"Frequency\nTrain:\t{len(train)}\t{a}\t${b}\nTest:\t{len(test)}\t{c}\t${d}\n")

# severity train/test: rows with at least one claim
train_severity = train.loc[train['Claim'] > 0].copy()
test_severity = test.loc[test['Claim'] > 0].copy()
a, b, c, d = train_severity['Claim'].sum(), train_severity['Severity'].sum(), test_severity['Claim'].sum(), test_severity['Severity'].sum()
print(f"Severity\nTrain:\t{len(train_severity)}\t{a}\t${b}\nTest:\t{len(test_severity)}\t{c}\t${d}\n")
```
```
Frequency
Train:	45183	469	$11664342.0
Test:	19365	201	$5377478.0

Severity
Train:	469	469	$11664342.0
Test:	201	201	$5377478.0
```

### Claim Frequency

For predicting the occurrence of a single claim (i.e. binary classification) one can use the Binomial distribution (a.k.a Bernoulli trial or coin-toss experiment).

When predicting claim counts or frequency, $Y$, a model that produces Poisson distributed outputs is required. For instance, a Poisson model is suitable for estimating the number of insurance claims per policy per year, or to estimate the number of car crashes per month.

The key components and assumptions of a Poisson distributed process are:

1. event occurrence is independent of other events.
2. events occur within a fixed period of time.
3. the mean and variance of the distribution are equal, i.e. $\mu(X) = \mathrm{Var}(X) = \lambda$

STAT 504: Poisson Distribution

If the mean and variance are unequal, the distribution is said to be over-dispersed (variance > mean) or under-dispersed (variance < mean). Over-dispersion commonly arises in data with a large number of zeros (a.k.a. zero-inflated data).

In the case of zero-inflated data, "a sound practice is to estimate both Poisson and negative binomial models" (Cameron, 2013). Also see this practical example for beverage consumption in Python.

In GLMs, link functions are applied so that the mean outcome (prediction) can be expressed as a linear function of the input variables, even for non-normal response distributions. "A natural fit for count variables that follow the Poisson or negative binomial distribution is the log link. The log link exponentiates the linear predictors. It does not log transform the outcome variable." - Count Models: Understanding the Log Link Function, The Analysis Factor

Lastly, for any form of count prediction model one can also set an offset or exposure. An offset, if it is known, is applied in order to account for the relative differences in exposure time for a set of inputs. For example, in insurance claims we might expect to see more claims on an account with 20 years worth of annual policies compared to an account with a single policy year. Offsets account for the relative exposure, surface area, population size, etc., and are akin to modeling the relative frequency of occurrence (claims per year). See these intuitive SO answers here, here, and here.
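To make the role of the offset concrete: with a log link it enters the linear predictor as a fixed term with coefficient 1, so the model effectively estimates a rate per unit of exposure,

$$\log E[Y_i] = \log(t_i) + \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \;\;\Longleftrightarrow\;\; E[Y_i] = t_i \, e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}}$$

where $t_i$ is the exposure for observation $i$ (here, policy years `PYrs`).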

```python
# Mean & Variance
mu = df.Claims.mean()
var = np.mean([abs(x - mu)**2 for x in df.Claims])
print(f'mu = {mu:.4f}\nvar = {var:.4f}')
```
```
mu = 0.0108
var = 0.0115
```

Here we observe an over-dispersed, zero-inflated case, as the variance of claim counts ($\mathrm{Var}=0.0115$) exceeds the mean ($\mu=0.0108$).

As suggested in Cameron (2013) we should therefore try both $Poisson$ and $Negative Binomial$ distributions.

For good measure, and to illustrate its relationship to the other models, let's also include a $Binomial$ model. The Binomial model with a logit link is equivalent to a binary Logistic Regression model [a,b]. Modeling a binary outcome is not an unreasonable approach here, given that the number of accounts with more than one claim is low (22), and the $Binomial$ distribution converges to a $Poisson$ when the number of trials is high ($N>20$) and the probability of success is low ($p<0.05$) (see wiki, here and here).
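To illustrate the approximation (this check is not in the original notebook), here is a quick comparison of the two probability mass functions at a hypothetical 1% claim rate using scipy:

```python
# Illustrative only: Binomial(N, p) is close to Poisson(N*p) for large N and small p
import numpy as np
from scipy import stats

N, p = 100, 0.01   # hypothetical: 100 exposures at a 1% claim rate
k = np.arange(4)
for ki, b, po in zip(k, stats.binom.pmf(k, N, p), stats.poisson.pmf(k, N * p)):
    print(f"k={ki}: Binomial={b:.4f}  Poisson={po:.4f}")
```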

However, there is one change we need to make to the Binomial model: how we handle exposure. A few hours of research on the matter led me down a rabbit hole of conflicting ideas in textbooks, papers [i] and debates on CrossValidated [a,b]. In contrast to the Poisson and negative binomial models, there is no way to add a constant offset term in the binomial formulation (see here). Instead, it is appropriate either to include the exposure as a predictor variable, or to use weights for each observation (see here and the statsmodels guidance on GLMs with weights and observed frequencies). I opted for the weighted GLM. The output of the binomial GLM is the probability of at least one claim occurring, weighted by the observation time $t=PYrs$. Note there is no equivalent setting on the predict side; predictions assume a discrete time exposure of $t=1$.

In addition, it is common in insurance risk models to use a quasi-Poisson or zero-inflated Poisson (ZIP) model in scenarios with a high proportion of zero claims (a minimal ZIP sketch follows the list below). In data science and machine learning we would refer to this as an imbalanced learning problem (see bootstrap, cross validation, SMOTE). A ZIP model mixes a point mass at zero with a Poisson count distribution; more generally, the combined frequency-severity approach uses:

• a binomial model to determine the likelihood of one or more claims occurring (0/1)
• a negative binomial or Poisson model to estimate the number of claims (0…n)
• a severity model to estimate the avg size of each claim (1…n)
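For completeness, statsmodels ships a `ZeroInflatedPoisson` class in its discrete count models. The following is an illustrative sketch only, not part of the original analysis; the choice of inflation covariates and fit options here are assumptions:

```python
# Illustrative ZIP sketch: a logit model for the excess zeros plus a Poisson for the counts
import numpy as np
import patsy
import statsmodels.api as sm

# build design matrices from the same formula used for the frequency GLMs
y, X = patsy.dmatrices("Claims ~ Age + Sex + Geog + EV + VehAge + NCD",
                       data=train, return_type='dataframe')

zip_fit = sm.ZeroInflatedPoisson(endog=y, exog=X,
                                 exog_infl=X,                   # assume the same covariates drive the zero-inflation part
                                 inflation='logit',
                                 offset=np.log(train['PYrs'])).fit(method='bfgs', maxiter=200)
print(zip_fit.summary())
```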

Statsmodels uses patsy formula notation. This includes notation for categorical variables, setting reference/base levels, encoding options, and operators.

```python
#
# examples of formula notation in smf
# print(' + '.join(train.columns))
# expr = "Claims ~ Age+C(Sex)+C(Geog, Treatment(reference=3))+EV+VehAge+NCD"

# including PYrs as a parameter is commented out in glm()
expr = "Claims ~ Age + Sex + Geog + EV + VehAge + NCD"  # + np.log(PYrs)

FreqPoisson = smf.glm(formula=expr, data=train,
                      offset=np.log(train['PYrs']),
                      family=sm.families.Poisson(link=sm.families.links.log())).fit()

FreqNegBin = smf.glm(formula=expr, data=train,
                     offset=np.log(train['PYrs']),
                     family=sm.families.NegativeBinomial(link=sm.families.links.log())).fit()

# uses the binary "Claim" field as target
# weighted by PYrs (policy years of exposure, 0.0...n)
# aka a binary logistic regression
FreqBinom = smf.glm(formula="Claim ~ Age + Sex + Geog + EV + VehAge + NCD", data=train,
                    freq_weights=train['PYrs'],
                    family=sm.families.Binomial(link=sm.families.links.logit())).fit()
```

### Model coefficients

Poisson Parameters

We can derive the model output using predict or from the raw coefficients. Sampling the Poisson rate ($\lambda$) illustrates the difference in predicted rates for the intercept and for when the driver is male.

```python
FreqPoisson.summary()
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 Claims   No. Observations:                45183
Model:                            GLM   Df Residuals:                    45161
Model Family:                 Poisson   Df Model:                           21
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2541.7
Date:                Wed, 14 Oct 2020   Deviance:                       4135.6
Time:                        21:43:22   Pearson chi2:                 1.83e+05
No. Iterations:                    22
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9052      0.284     -3.184      0.001      -1.462      -0.348
Sex[T.M]       0.3881      0.162      2.392      0.017       0.070       0.706
Geog[T.2]     -0.6478      0.135     -4.811      0.000      -0.912      -0.384
Geog[T.3]     -0.9043      0.137     -6.606      0.000      -1.173      -0.636
Geog[T.4]     -1.3826      0.123    -11.254      0.000      -1.623      -1.142
Geog[T.5]     -1.5218      0.389     -3.912      0.000      -2.284      -0.759
Geog[T.6]     -1.4581      0.315     -4.623      0.000      -2.076      -0.840
Geog[T.7]    -22.0308   1.77e+04     -0.001      0.999   -3.47e+04    3.46e+04
EV[T.2]        0.0923      0.237      0.389      0.697      -0.372       0.557
EV[T.3]       -0.4219      0.199     -2.115      0.034      -0.813      -0.031
EV[T.4]       -0.3602      0.215     -1.678      0.093      -0.781       0.061
EV[T.5]       -0.0334      0.204     -0.164      0.870      -0.433       0.367
EV[T.6]        0.4132      0.202      2.042      0.041       0.017       0.810
EV[T.7]        0.3316      0.483      0.687      0.492      -0.614       1.278
NCD[T.2]      -0.1441      0.181     -0.795      0.426      -0.499       0.211
NCD[T.3]       0.0184      0.192      0.096      0.923      -0.357       0.394
NCD[T.4]       0.3047      0.184      1.660      0.097      -0.055       0.664
NCD[T.5]      -0.0535      0.215     -0.249      0.804      -0.475       0.368
NCD[T.6]       0.0967      0.206      0.470      0.639      -0.307       0.501
NCD[T.7]       0.1835      0.137      1.334      0.182      -0.086       0.453
Age           -0.0580      0.004    -13.633      0.000      -0.066      -0.050
VehAge        -0.0762      0.008     -9.781      0.000      -0.091      -0.061
==============================================================================
```
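The printed rates below are simply the exponentiated coefficients (because of the log link). The original cell isn't shown, but a sketch of the conversion for the intercept and for a male driver would be:

```python
# Convert log-link coefficients to Poisson rates (lambda)
lam_intercept = np.exp(FreqPoisson.params['Intercept'])
lam_male = np.exp(FreqPoisson.params['Intercept'] + FreqPoisson.params['Sex[T.M]'])
print(f'Lambda intercept: {lam_intercept:.2f}')
print(f'Lambda intercept + male: {lam_male:.2f}')
```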
```
Lambda intercept: 0.40
Lambda intercept + male: 0.60
```

Negative binomial coefficients and incidence rate ratio

• https://stats.idre.ucla.edu/stata/dae/negative-binomial-regression/
• https://stats.stackexchange.com/questions/17006/interpretation-of-incidence-rate-ratios
• https://stats.stackexchange.com/questions/414752/how-to-interpret-incidence-rate-ratio
• https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section5.html
```python
print(FreqNegBin.summary())
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 Claims   No. Observations:                45183
Model:                            GLM   Df Residuals:                    45161
Model Family:        NegativeBinomial   Df Model:                           21
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2535.9
Date:                Wed, 14 Oct 2020   Deviance:                       3754.7
Time:                        21:43:23   Pearson chi2:                 1.82e+05
No. Iterations:                    21
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.8851      0.288     -3.069      0.002      -1.450      -0.320
Sex[T.M]       0.3927      0.163      2.403      0.016       0.072       0.713
Geog[T.2]     -0.6545      0.137     -4.778      0.000      -0.923      -0.386
Geog[T.3]     -0.9104      0.139     -6.541      0.000      -1.183      -0.638
Geog[T.4]     -1.3888      0.125    -11.100      0.000      -1.634      -1.144
Geog[T.5]     -1.5350      0.391     -3.928      0.000      -2.301      -0.769
Geog[T.6]     -1.4693      0.317     -4.635      0.000      -2.091      -0.848
Geog[T.7]    -21.1594   1.13e+04     -0.002      0.999   -2.22e+04    2.22e+04
EV[T.2]        0.0957      0.240      0.399      0.690      -0.375       0.566
EV[T.3]       -0.4359      0.202     -2.155      0.031      -0.832      -0.039
EV[T.4]       -0.3671      0.217     -1.690      0.091      -0.793       0.059
EV[T.5]       -0.0319      0.207     -0.154      0.877      -0.437       0.373
EV[T.6]        0.4212      0.205      2.054      0.040       0.019       0.823
EV[T.7]        0.3309      0.486      0.681      0.496      -0.622       1.283
NCD[T.2]      -0.1456      0.183     -0.795      0.427      -0.505       0.214
NCD[T.3]       0.0171      0.194      0.089      0.929      -0.362       0.397
NCD[T.4]       0.2965      0.186      1.591      0.112      -0.069       0.662
NCD[T.5]      -0.0509      0.217     -0.234      0.815      -0.477       0.375
NCD[T.6]       0.1034      0.208      0.498      0.618      -0.303       0.510
NCD[T.7]       0.1871      0.139      1.343      0.179      -0.086       0.460
Age           -0.0581      0.004    -13.509      0.000      -0.067      -0.050
VehAge        -0.0766      0.008     -9.730      0.000      -0.092      -0.061
==============================================================================
```
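Because the negative binomial model also uses a log link, exponentiating a coefficient yields its incidence rate ratio (IRR): the multiplicative change in the expected claim rate for a one-unit change in that variable (or for that level versus the reference level). A short sketch, not part of the original notebook:

```python
# Incidence rate ratios with 95% confidence intervals: exp(coef)
irr = np.exp(FreqNegBin.params).rename('IRR')
irr_ci = np.exp(FreqNegBin.conf_int()).rename(columns={0: '2.5%', 1: '97.5%'})
print(pd.concat([irr, irr_ci], axis=1))
```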

Binomial Model Coefficients: Logits (Log-Odds), Odds and Probabilities

• https://sebastiansauer.github.io/convert_logit2prob/
```python
print(FreqBinom.summary())
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  Claim   No. Observations:                45183
Model:                            GLM   Df Residuals:                 45690.29
Model Family:                Binomial   Df Model:                           21
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -3504.3
Date:                Wed, 14 Oct 2020   Deviance:                       7008.7
Time:                        21:43:23   Pearson chi2:                 4.72e+04
No. Iterations:                    24
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.7911      0.289     -9.672      0.000      -3.357      -2.225
Sex[T.M]       0.7379      0.153      4.832      0.000       0.439       1.037
Geog[T.2]     -0.4605      0.141     -3.265      0.001      -0.737      -0.184
Geog[T.3]     -0.6028      0.140     -4.320      0.000      -0.876      -0.329
Geog[T.4]     -0.1851      0.111     -1.668      0.095      -0.403       0.032
Geog[T.5]     -1.8854      0.514     -3.667      0.000      -2.893      -0.878
Geog[T.6]     -2.0365      0.456     -4.469      0.000      -2.930      -1.143
Geog[T.7]    -21.7246   1.56e+04     -0.001      0.999   -3.07e+04    3.06e+04
EV[T.2]       -0.0613      0.261     -0.234      0.815      -0.574       0.451
EV[T.3]        0.3441      0.198      1.734      0.083      -0.045       0.733
EV[T.4]       -0.5492      0.226     -2.426      0.015      -0.993      -0.105
EV[T.5]        0.2531      0.204      1.238      0.216      -0.147       0.654
EV[T.6]        0.3888      0.207      1.882      0.060      -0.016       0.794
EV[T.7]       -1.5831      1.025     -1.544      0.123      -3.593       0.427
NCD[T.2]      -0.3731      0.198     -1.887      0.059      -0.761       0.015
NCD[T.3]      -0.0929      0.203     -0.458      0.647      -0.490       0.305
NCD[T.4]       0.0394      0.206      0.191      0.849      -0.365       0.443
NCD[T.5]      -0.7865      0.295     -2.671      0.008      -1.364      -0.209
NCD[T.6]      -0.6098      0.273     -2.230      0.026      -1.146      -0.074
NCD[T.7]       1.0612      0.123      8.654      0.000       0.821       1.302
Age           -0.0384      0.003    -11.086      0.000      -0.045      -0.032
VehAge        -0.0705      0.006    -11.589      0.000      -0.082      -0.059
==============================================================================
```
```python
# Example conversions from logits to probabilities
const = FreqBinom.params[0]
odds = np.exp(const)
probability = odds / (1 + odds)
print(f'Intercept: p = {probability:.3f}')

_ = np.exp(const + FreqBinom.params[1]) / (1 + (np.exp(const + FreqBinom.params[1])))
print(f'Intercept + Male: p = {_:.3f} ({_-probability:.3f})')
```
```
Intercept: p = 0.058
Intercept + Male: p = 0.114 (0.056)
```

### Prediction

```python
test['Fnb'] = FreqNegBin.predict(transform=True, exog=test, offset=np.log(test['PYrs']))
test['Fpo'] = FreqPoisson.predict(transform=True, exog=test, offset=np.log(test['PYrs']))
test['Fbi'] = FreqBinom.predict(transform=True, exog=test)

fig, axs = plt.subplots(1, 3, figsize=(13, 3.3), sharex=True, sharey=True)
sns.histplot(test['Fpo'], ax=axs[0], label='Poisson')
sns.histplot(test['Fnb'], ax=axs[1], label='NegBinomial')
sns.histplot(test['Fbi'], ax=axs[2], label='Binomial')
test.sample(5)
```
| | Age | Sex | Geog | EV | VehAge | NCD | PYrs | Claims | Severity | Claim | SeverityAvg | Fnb | Fpo | Fbi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63584 | 69.0 | M | 4 | 1 | 5 | 3 | 0.361644 | 0 | 0.0 | 0 | 0.0 | 0.000694 | 0.000690 | 0.004815 |
| 2446 | 21.0 | M | 4 | 5 | 8 | 3 | 0.767123 | 0 | 0.0 | 0 | 0.0 | 0.018421 | 0.018198 | 0.030839 |
| 63505 | 69.0 | M | 1 | 1 | 11 | 7 | 0.665753 | 0 | 0.0 | 0 | 0.0 | 0.003833 | 0.003781 | 0.011952 |
| 7309 | 25.0 | M | 3 | 5 | 1 | 6 | 0.263014 | 0 | 0.0 | 0 | 0.0 | 0.015058 | 0.014718 | 0.017250 |
| 29573 | 42.0 | M | 6 | 6 | 7 | 7 | 0.257534 | 0 | 0.0 | 0 | 0.0 | 0.003391 | 0.003337 | 0.008624 |

Looking at the model summaries, the histograms and the predicted values on the test set, we see that each model weights the covariates similarly and produces similar scores on the test data. Again, note that the $Binomial$ model was only included to demonstrate its similarity in this case; this may not hold for other data.

### Claim Severity

```python
# including PYrs as parameter commented out in glm()
expr = "SeverityAvg ~ Age + Sex + Geog + EV + VehAge + NCD"

### Estimate severity using GLM-Gamma with the default inverse-power link
SevGamma = smf.glm(formula=expr, data=train_severity,
                   family=sm.families.Gamma(link=sm.families.links.inverse_power())).fit()
```

Ignore the warning for now; we will come back to it.

After fitting a GLM-Gamma, how do we find the Gamma shape ($a$) and scale ($b$) parameters of the predictions for each $X_i$?

"Regression with the gamma model is going to use input variables $X_i$ and coefficients to make a prediction about the mean of $y_i$, but in actuality we are really focused on the scale parameter $\beta_i$. This is so because we assume that $\alpha_i$ is the same for all observations, and so variation from case to case in $\mu_i=\beta_i\alpha$ is due simply to variation in $\beta_i$." - technical overview of gamma GLM

Alternatively you can infer gamma parameters from CI
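In statsmodels terms, that parameterisation means the shape is shared across observations and the scale carries all of the variation in the predicted mean:

$$\alpha = \frac{1}{\phi}, \qquad \beta_i = \mu_i \,\phi, \qquad E[y_i] = \alpha\,\beta_i = \mu_i, \qquad \operatorname{Var}(y_i) = \alpha\,\beta_i^2 = \phi\,\mu_i^2$$

where $\phi$ is the estimated dispersion (`SevGamma.scale` in statsmodels) and $\mu_i$ is the predicted mean severity for observation $i$.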

Below I illustrate the range of predicted severity values for the intercept and each level in Geography.

```python
# shape is 1/dispersion
shape = 1/SevGamma.scale
print(f'Shape: {shape:.2f}')

# intercept
constant, intercept = SevGamma.params[0], SevGamma.scale/SevGamma.params[0]
print(f'Intercept: {intercept:.0f}')

# under the inverse-power link the predicted mean G(Yi) is 1/(Bo + Bi*Xi..);
# the gamma scale for each Geog level is dispersion * predicted mean
geogs = [(i, SevGamma.scale/(constant + c)) for i, c in zip(SevGamma.params.index, SevGamma.params) if 'Geog' in i]

# plot
fig, axs = plt.subplots(1, 6, sharex=True, sharey=True, figsize=(15, 3))
for ax, x in zip(axs.flatten(), geogs):
    sns.histplot(np.random.gamma(shape=shape, scale=x[1], size=10000), stat="count", element="step", ax=ax)
    ax.set_title(x[0])
    ax.set_xlim(0, 14e4)
```
```
Shape: 0.54
Intercept: 52203
```

The GLM-Gamma model gives us a prediction of the average severity of a claim should one occur.

```python
test_severity['Giv'] = SevGamma.predict(transform=True, exog=test_severity)
test_severity[:3]
```
| | Age | Sex | Geog | EV | VehAge | NCD | PYrs | Claims | Severity | Claim | SeverityAvg | Giv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35445 | 45.0 | M | 4 | 6 | 2 | 7 | 4.920548 | 1 | 2480.0 | 1 | 2480.0 | 28728.757724 |
| 9653 | 26.0 | M | 6 | 6 | 9 | 5 | 0.589041 | 1 | 46000.0 | 1 | 46000.0 | 21782.480267 |
| 2039 | 21.0 | M | 2 | 2 | 19 | 1 | 0.432877 | 1 | 11110.0 | 1 | 11110.0 | 11537.649676 |

Now, remember the warning we got when using the inverse-power link function. It is fairly self-explanatory: "the inverse_power link function does not respect the domain of the Gamma family". Oddly, this is by design in statsmodels at the time of writing. It is better to use the log link function, which ensures all predicted values are greater than zero.

```python
# formula
expr = "SeverityAvg ~ Age + Sex + Geog + EV + VehAge + NCD"

### Estimate severity using GLM-Gamma with the log link
SevGamma = smf.glm(formula=expr, data=train_severity,
                   family=sm.families.Gamma(link=sm.families.links.log())).fit()
```
```python
# dispersion aka rate
dispersion = SevGamma.scale
print(f'Dispersion: {dispersion:.4f}')

# shape is 1/dispersion
shape = 1/dispersion
print(f'Shape: {shape:.4f}')

# intercept
constant, intercept = SevGamma.params[0], np.exp(SevGamma.params[0])
print(f'Intercept: {intercept:.2f}')

# predicted mean G(Yi) is exp(Bo + Bi*Xi..)
# tuple(name, Yi, scale)
geogs = [(i, np.exp(constant + c), np.exp(constant + c)*dispersion)
         for i, c in zip(SevGamma.params.index, SevGamma.params) if 'Geog' in i]

# plot
fig, axs = plt.subplots(1, 6, sharex=True, sharey=True, figsize=(13, 3))
for ax, x in zip(axs.flatten(), geogs):
    sns.kdeplot(np.random.gamma(shape=shape, scale=x[2], size=10000), shade=True, ax=ax)
    ax.set_title(x[0])
```
```
Dispersion: 2.0528
Shape: 0.4872
Intercept: 69330.81
```

Statsmodels uses patsy design matrices behind the scenes. We can apply the design matrix to calculate the distribution parameters for both the frequency and severity models, and for any data set. Train, test, synthetic data and portfolios. You name it.

This is an incredibly powerful approach, as it enables the development of highly parameterised Monte Carlo and risk simulation models. I will walk through the steps using pandas dataframes so that it is clear:

```python
# 1. Define a dummy model for your data using the same formula and settings defined earlier.
dummy_ = smf.glm(formula=expr, data=test_severity, family=sm.families.Gamma(link=sm.families.links.log()))

# 2. We can then access the design matrix,
#    which is simply the data values, but also handles the intercept and reference categories.
a = pd.DataFrame(dummy_.data.exog, columns=dummy_.data.param_names)

# 3. Retrieve and transpose the trained model coefficients
b = pd.DataFrame(SevGamma.params).T

# 4. And multiply together
#    (this only works if the indexes are equal)
c = a.multiply(b, axis=0)

# 5. It is much cleaner to use arrays - et voila
c = pd.DataFrame(dummy_.data.exog * SevGamma.params.values, columns=dummy_.data.param_names)
```

We can then use the coefficients to estimate and plot the values and samples of each row, or the entire dataset.

```python
dispersion = SevGamma.scale
shape = 1/dispersion

num = 5
fig, axs = plt.subplots(1, num, figsize=(15, 4), sharex=True, sharey=True)
for i, pred, ax in zip(c.index[:num], SevGamma.predict(test_severity)[:num], axs.flatten()):
    scale = np.exp(c.loc[[i]].sum(axis=1))*dispersion
    sample = np.random.gamma(shape=shape, scale=scale, size=10000)
    sns.kdeplot(sample, label=i, ax=ax, shade=True)
    ax.set_title(f'Sample Mu {sample.mean():.0f}\nPredicted Mu {pred:.0f}', fontsize=12)
```

# Putting it all together: Frequency-Severity model

Portfolio price

Given that this is motor insurance, let's assume a thin margin of 5% (tolerable_loss = 0.05).
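That is, each policy's technical premium is its expected annual loss grossed up by the pricing ratio:

$$\text{premium}_i = \frac{E[\text{freq}_i] \times E[\text{sev}_i]}{1 - \text{margin}} = \frac{\text{annual\_exp\_loss}_i}{0.95}$$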

```python
# make a dummy portfolio and make predictions
portfolio = test.reset_index(drop=True).copy()
portfolio['annual_frq'] = FreqBinom.predict(portfolio)
portfolio['expected_sev'] = SevGamma.predict(portfolio)

# expected annual loss
portfolio['annual_exp_loss'] = portfolio['annual_frq'] * portfolio['expected_sev']

# set pricing ratio
tolerable_loss = 0.05
pricing_ratio = 1.0 - tolerable_loss
portfolio['annual_tech_premium'] = portfolio['annual_exp_loss']/pricing_ratio
portfolio['result'] = portfolio['annual_tech_premium'] - portfolio['Severity']

portfolio.iloc[:3, -4:]
```
| | expected_sev | annual_exp_loss | annual_tech_premium | result |
|---|---|---|---|---|
| 0 | 27818.169806 | 277.316838 | 291.912461 | 291.912461 |
| 1 | 21753.203517 | 300.218299 | 316.019262 | 316.019262 |
| 2 | 18524.970346 | 298.920877 | 314.653555 | 314.653555 |

This summary illustrates the:

• observed Claims and Losses (Severity)
• the predicted frq, losses, and expected-losses
• the profit loss result (calculated on full Severity rather than SeverityAvg)
| | Claim | Severity | annual_frq | expected_sev | annual_exp_loss | annual_tech_premium | result |
|---|---|---|---|---|---|---|---|
| Summary | 201.0 | 5377478.0 | 249.038867 | 4.054346e+08 | 5.764319e+06 | 6.067705e+06 | 690226.503245 |
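The original summary cell isn't shown; assuming the column names above, a row like this is simply the column totals of the observed and predicted fields:

```python
# Portfolio totals: observed claims and losses vs predicted frequency, severity and premium
cols = ['Claim', 'Severity', 'annual_frq', 'expected_sev',
        'annual_exp_loss', 'annual_tech_premium', 'result']
print(portfolio[cols].sum().rename('Summary'))
```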

Now we can build a Monte Carlo simulation to:

• provide a more robust estimate of our expected profit / loss
• sample the parameter space to gauge the variance and uncertainty in our model and assumptions
• calculate metrics for the AAL, AEP, and AXS loss and return
```python
# simulate portfolio
simulation = dict()
frq = list()
sev = list()
N = 999

for it in range(N):
    print(f"{it}/{N}", end='\r')
    frq.append(np.random.binomial(n=1, p=portfolio['annual_frq']))
    sev.append(np.random.gamma(shape=portfolio['gamma_shape'], scale=portfolio['gamma_scale']))

# calculate Frq * Sev
frq_sev = np.array(frq)*np.array(sev)

# summarise the simulations
simulation['sim'] = pd.DataFrame({'iteration': range(N),
                                  'claim_num': np.array(frq).sum(axis=1),     # num claims
                                  'loss_min': np.array(frq_sev).min(axis=1),  # min claim
                                  'loss_sum': np.array(frq_sev).sum(axis=1),  # total
                                  'loss_max': np.array(frq_sev).max(axis=1),  # max claim
                                  'loss_avg': np.array(frq_sev).mean(axis=1)
                                  })
```
```
998/999
```
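From the simulated years we can read off headline portfolio metrics, for example the average annual loss (AAL) as the mean simulated total loss across iterations. A small sketch, not from the original notebook:

```python
# Headline metrics across the simulated years
aal = simulation['sim']['loss_sum'].mean()          # average annual loss
worst_year = simulation['sim']['loss_sum'].max()    # largest simulated annual loss
avg_claims = simulation['sim']['claim_num'].mean()  # expected number of claims per year
print(f'AAL: {aal:,.0f}   Worst year: {worst_year:,.0f}   Avg claims/yr: {avg_claims:.1f}')
```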

Calculate the Annual Exceedance Probability

• https://sciencing.com/calculate-exceedance-probability-5365868.html
• https://serc.carleton.edu/quantskills/methods/quantlit/RInt.html
• https://www.earthdatascience.org/courses/use-data-open-source-python/use-time-series-data-in-python/floods-return-period-and-probability/
```python
def exceedance_prob(df, feature, ascending=False):
    data = df.copy()
    data['value'] = data[feature]
    data['Rank'] = data['value'].rank(ascending=ascending, method='dense')
    data['RI'] = (len(data) + 1.0)/data['Rank']
    data['EP'] = 1.0/data['RI']
    data.sort_values(['value'], ascending=ascending, inplace=True)
    data.reset_index(drop=True, inplace=True)
    return data

# profit loss based on technical premium
simulation['sim']['technical_premium'] = portfolio['annual_tech_premium'].sum()
simulation['sim']['result'] = (simulation['sim']['technical_premium'] - simulation['sim']['loss_sum'])

# recurrence intervals of each measure
simulation['loss_sum'] = exceedance_prob(simulation['sim'], 'loss_sum')
simulation['loss_max'] = exceedance_prob(simulation['sim'], 'loss_max')
simulation['result'] = exceedance_prob(simulation['sim'], 'result', ascending=True)
```

This illustrates the profit and loss scenarios for our pricing strategy across $N$ iterations.

And more intuitive EP curves.
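A minimal sketch of how an EP curve can be drawn from the output of `exceedance_prob` (the original plotting code isn't shown): plot each loss level against its exceedance probability.

```python
# EP curve: probability that the simulated annual loss exceeds a given level
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(simulation['loss_sum']['EP'], simulation['loss_sum']['value'])
ax.set_xlabel('Annual exceedance probability')
ax.set_ylabel('Annual loss')
ax.set_title('EP curve for simulated annual losses')
```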