Using machine learning to cleanse datasets: classifying column headers in data tables
Post
Cancel

# Using machine learning to cleanse datasets: classifying column headers in data tables

When dealing with large volumes of inbound data files and from multiple different sources, the data recieved can often come in a variety of formats, structures and to varying standards. One particularly challenging issue is data files that, although representing the same type of information, feature a variety of different label and data formats. For instance, addresses coded with “Zip” or “Postal Code”, “Street” or “Line 1” and “£1000”, “£1 K”, “GBP 1000” or “one thousand pounds”.

The Machine Learning solution is to build a model that can ingest messy labelled data (i.e. missing and with variable field names) and to make predctions for what the data fields are. These models can then be integrated within data transformation pipelines to automatically or to make suggestions of the correct data labels.

*Jupyter notebooks to recreate the synthetic dataset and to train and test the model are avaliable in this Git repo

1 2 3 4 5 6 7 # load packages import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn import metrics import os %matplotlib inline 
1 2 3 4 # change the paths to your local directory data_path = 'P:\\MyWork\\demoColumnTyper\\data' data_path_external = 'P:\\MyWork\\demoColumnTyper\\data\\external' data_path_model = 'P:\\MyWork\\demoColumnTyper\\data\\model' 

Fields & Formats
The training-data has the correct headers attached. We want to predict these on inbound messy “unlabelled” data. I have included some very generic data fields and common formats. For instance, money varies between text and symbol currency values and phone includes a variety of formats and extensions. We also have some generic text_categorical and numeric values in there.

1 2 training_data = pd.read_csv(os.path.join(data_path_model,'training_data.csv')) training_data[:5] 
01745 T Street SoutheastNaNWashingtonDC2002038.867033-76.979235GBP 487760.350SPCH308mex@yahoo.comKimberlee Turlington0345 42 0274KimberleeTurlingtonB35181
16007 Applegate LaneNaNLouisvilleKY4021938.134301-85.649851€ 7321963.108VDEY870rg@aol.comMiguel Eveland095086-173-31-37MiguelEvelandA42163
2560 Penstock DriveNaNGrass ValleyCA9594539.213076-121.077583EUR 3341992.053ZFPH671ejbyy@hotmail.comAlonzo Schroyer057843 018 15-85AlonzoSchroyerD48193
3150 Carter StreetNaNManchesterCT604041.765560-72.473091€ 4397323.917WMDG542tfanfw@gmail.comDevon Osei0698-1368378DevonOseiC16134
Xstreetroadave£$@.co.comdrive0 27454Gural0.00.00.00.00.00.00.00.00.00.0 22112004-570 4-610.00.00.00.00.00.00.00.00.01.0 2262007 62460140.00.00.00.00.00.00.00.00.01.0 1052138.5128940.00.00.00.00.00.00.00.00.00.0 16661VSQO4040.00.00.00.00.00.00.00.00.01.0 ## Feature Selection - univariate Ok. So that lest step actually created about 11,000 numeric features on our training data. We need to trim this down by selecting the data value features that we believe are most likley to be correlated to our target column headers. Lets use some Chi-square correlation analyses to test the features. If you want to skip the stats, the bar plot below shows the 75 features that were found to have the highest correlation to our target column headers that we want to predict. ### Chi-squared The chi-square test is a statistical test of independence to determine the dependency of two variables. It shares similarities with coefficient of determination, R². However, chi-square test is only applicable to categorical or nominal data while R² is only applicable to numeric data. • If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent. There IS a relationship. • If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent. There IS NOT a relationship. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 # chi-squared test with similar proportions from scipy.stats import chi2_contingency from scipy.stats import chi2 prob = 0.95 alpha = 1.0 - prob # run test stat, p, dof, expected = chi2_contingency(pd.crosstab(Xy['Y'],Xy['n_vowels'])) # dof print('\ndof=%d' % dof) # interpret test-statistic prob = 0.95 critical = chi2.ppf(prob, dof) print('\nprobability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat)) if abs(stat) >= critical: print('\tDependent (reject H0)') else: print('\tIndependent (fail to reject H0)') # interpret p-value print('\nsignificance=%.3f, p=%.3f' % (alpha, p)) if p <= alpha: print('\tDependent (reject H0)') else: print('\tIndependent (fail to reject H0)')  1 2 3 4 5 6 7 dof=240 probability=0.950, critical=277.138, stat=60001.619 Dependent (reject H0) significance=0.050, p=0.000 Dependent (reject H0)  1 2 3 4 5 6 7 8 9 10 11 12 13 14 features = Xy_vect.columns[2:] stats = list() p_values = list() dofs = list() for feat in features: # run test stat, p, dof, expected = chi2_contingency(pd.crosstab(Xy_vect['Y'],Xy_vect[feat])) stats.append(stat) p_values.append(p) dofs.append(dof) chi2_results = pd.DataFrame({'feature':features,'X2':stats,'DoF':dofs,'pvalue':p_values,'sig':[x<=0.05 for x in p_values]}) chi2_results.sort_values(by='X2',ascending=False, inplace=True)  1 2 3 4 5 6 fig,axs = plt.subplots(figsize=(25,5)) n = 75 axs.bar(x = range(n), height=chi2_results['X2'][:n]) axs.set_xticks(range(n)); axs.set_xticklabels(chi2_results['feature'][:n],rotation=90) axs.set_title('Results of Correlation Analysis (top 75 features)',fontsize=15);  # Model Training Here we are building a model that can predict the column type using: i) the training data, and ii) the subset of engineered-features selected by correlation analysis. 1 2 3 from sklearn.pipeline import Pipeline from sklearn.model_selection import GridSearchCV from sklearn.naive_bayes import MultinomialNB  1 model_features = chi2_results['feature'][:100].values  1 2 3 4 5 6 7 8 # model parameters NB_params = {'model__alpha':(1e-1, 1e-3)} NB_pipe = Pipeline([('model', MultinomialNB())]) gs_NB = GridSearchCV(NB_pipe, param_grid=NB_params, n_jobs=2, cv=5) gs_NB = gs_NB.fit(Xy_vect[model_features], Xy_vect['Y']) print(gs_NB.best_score_, gs_NB.best_params_)  1 0.7691470588235294 {'model__alpha': 0.1}  # Load Test Data Here lets load the testing data. Note that the data is missing column headers. If we were cleaning up this data we would have to manually add these. 1 2 3 testing_data = [pd.read_csv(os.path.join(data_path_model,x),skiprows=1,header=None) for x in os.listdir(data_path_model) if 'testing' in x] testing_data = pd.concat(testing_data) testing_data[:10]  012345678910111213141516 011501 Maple WayNaNLouisvilleKY4022938.097617-85.659825EUR 2274211.206WBSW110vmtjcbp@gmail.comJena Quilliams039520 805 804JenaQuilliamsC24795 198 Lee DriveNaNAnnapolisMD2140338.933313-76.493310EUR 3405197.530SWLP214hdst@gmail.comLaila Arpin08249 265 6568LailaArpinB25021 2126 Sunshine RoadOSavannahGA3140532.059784-81.202271EUR 7458286.525VUYW948aminwvel@yahoo.comCesar Severson08-095 51-90CesarSeversonA05600 34313 Wisconsin Street#APT 000007AnchorageAK9951761.181060-149.942792€ 5481056.898SEXL728gevypgf@mail.kzChi Hollinsworth030164-378 59 20ChiHollinsworthB13746 4829 Main StreetNaNManchesterCT604041.770678-72.520917GBP 4247524.506VQZY333yibhnaha@mail.kzJan Reagans00132-612-74-32JanReagansC47541 537 Spring StreetNaNGrotonCT634041.320683-71.991475$ 5717317.815BSRO462lppsimxwb@mail.kzMarcos Hoistion018-8094-25MarcosHoistionE35201
6266 South J StreetNaNLivermoreCA9455037.680570-121.768021$1921095.325SNPX594tgjt@aol.comLoan Wadsworth040 9272 23LoanWadsworthC37463 77952 South Algonquian WayNaNAuroraCO8001639.573350-104.716211$ 8932865.979ZNNS529snjral@aol.comMarisa Blaskovich087 469 8912MarisaBlaskovichB07738
89223 Elgin CircleNaNAnchorageAK9950261.136803-149.965463GBP 3666134.645ZCJN210ebl@gmail.comNguyet Lytch00748296 39 4NguyetLytchA16734
9224 Michael Sears RoadNaNBelchertownMA100742.234610-72.359730EUR 2406344.382ZOSG574qeaf@aol.comCorrie Tolhurst053601-410-69-0CorrieTolhurstD21521

Below we apply the same feature engineering and selection as applied to our training data.

1 2 3 4 5 6 7 8 9 10 11 testing_data = [pd.read_csv(os.path.join(data_path_model,x)) for x in os.listdir(data_path_model) if 'testing' in x] testing_data = pd.concat(testing_data) Xy_test = pd.concat([pd.DataFrame(data={'X':list(testing_data[col]),'Y':col}) for col in testing_data.columns]) Xy_test.reset_index(drop=True,inplace=True) Xy_test.fillna('',inplace=True) # fill nan values with empty string Xy_test = feat_eng(Xy_test) test_vect = vectorizer.transform(Xy_test['X'].astype(str).str.encode('utf-8')) test_vect = pd.DataFrame(data = test_vect.todense(), columns=vectorizer.get_feature_names()) Xy_test = Xy_test.merge(test_vect,how='left',left_index=True,right_index=True).copy() 

### How good is the model?

Overall our model is able to correctly identify and label 77 % of the test data. Remember the model has never seen the test data before so that is pretty good.

1 2 Xy_test['pred'] = gs_NB.predict(Xy_test[model_features]) print('Model accuracy: %.3f' %np.mean(Xy_test['pred'] == Xy_test['Y'])) 
1 Model accuracy: 0.774 

There are also some other things we could try to improve our model further. Below we see a more detailed report and bar plot that shows us how good our model was on each data column we were trying to predict. We could for isntance, look where our model is underperfoming and try to create and select other features that better capture the data fields.

See this guide for interpretation of accuracy, precision, recall etc..

1 print(metrics.classification_report(Xy_test['Y'], Xy_test['pred'])) 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 precision recall f1-score support address1 0.53 0.99 0.69 1220 address2 0.86 0.08 0.14 1220 city 0.64 0.56 0.60 1220 email 1.00 1.00 1.00 1220 first_name 0.46 0.51 0.48 1220 last_name 0.51 0.46 0.49 1220 lat 0.98 1.00 0.99 1220 lng 0.88 1.00 0.94 1220 money 1.00 1.00 1.00 1220 num_cat 0.82 1.00 0.90 1220 numeric 0.70 0.68 0.69 1220 person_name 0.88 0.98 0.92 1220 phone 0.99 0.85 0.91 1220 reference 1.00 0.96 0.98 1220 state 0.80 0.50 0.61 1220 txt_cat 0.74 1.00 0.85 1220 zip 0.68 0.59 0.63 1220 micro avg 0.77 0.77 0.77 20740 macro avg 0.79 0.77 0.75 20740 weighted avg 0.79 0.77 0.75 20740 

The bar charts below show the distribution of predictions for each target.

For instance:

• address1: the model was able to identify all of the address1 values in the testing data.
• city: The model indentified roughly 50% of the correct city values, but often confused these with human names.
1 2 3 4 5 6 7 8 9 10 11 fig,axs = plt.subplots(5,4,figsize=(10,20),sharey=True) for target,ax in zip(Xy_test['Y'].unique(),axs.flatten()): heights = Xy_test.loc[Xy_test['Y']==target,'pred'].value_counts() x = range(len(heights)) ax.bar(x=x,height=heights/1220,color='blue',edgecolor='black',alpha=0.6) ax.set_xticks(x) ax.set_xticklabels(heights.index,rotation=90) ax.set_title(target) plt.tight_layout() 

That’s all folks.