Jupyter notebook here The Financial Crimes Enforcement Network (finCEN) files are more than 2,500 documents, most of which were suspicious activity reports (SARs) files that banks sent to the US authorities between 2000 and 2017. The SARs document the flow of $2 trillion of transactions between some of the world’s biggest banks. Whilst SARs are filed by bank compliance officers, and are therefore not necessarily evidence of wrongdoing by themselves. The leaked documents raise concerns about what the banks clients might be doing and how global banking jurisdictions facilitate money laundering at global scale.

These documents are some of the international banking system’s most closely guarded secrets. Banks use them to report suspicious behaviour but they are not proof of wrongdoing or crime. They were leaked to Buzzfeed News and shared with a group that brings together investigative journalists from around the world, which distributed them to 108 news organisations in 88 countries, including the BBC’s Panorama programme. Hundreds of journalists have been sifting through the dense, technical documentation, uncovering some of the activities that banks would prefer the public not to know about.BBC

1
2
3
4
5
6
7
8
9
import os
import pandas as pd
import numpy as np
import re
import glob
import requests, zipfile, io
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
import seaborn as sns

Download, Extract, and Load data

There are two data tables:

  1. Connections: detailing Bank’s by terriroty that are related to one another by finCEN reported transactions.
  2. Transactions: detailing the transaction counts and value ($) between banks and intermediaries.
1
2
3
4
5
# download and extract zip file
zip_file_url = 'https://media.icij.org/uploads/2020/09/download_data_fincen_files.zip'
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./data/fincen")
1
2
3
4
# load data
# methods to extract str between patterns https://stackoverflow.com/a/4917004/4538066
data = {re.search('download_(.+?).csv', f).group(1):pd.read_csv(f,encoding='utf-8-sig') for f in glob.glob(os.path.join('./data/fincen', "*.csv"))}
data.keys()

Data Transforms

1
2
3
connecitons = data['bank_connections']
connecitons['filer_org_name'].replace('Soci�t� G�n�rale SA','Societe Generale SA',inplace=True)
transactions = data['transactions_map']
1
2
3
4
5
6
7
# create clean name columm
names = ['filer_org_name','originator_bank', 'beneficiary_bank',]
for nm in names: 
    transactions[f'{nm}_'] = transactions[nm].str.upper().str.replace('(?i)[^a-z&0-9 ]','').str.strip()
    
# can't have 0 transactions if amount exists
transactions['number_transactions'].replace(0.0,1.0,inplace=True)
1
2
3
4
o_num = len(data['transactions_map']['originator_bank_id'].unique())
f_num = len(data['transactions_map']['filer_org_name_id'].unique())
b_num = len(data['transactions_map']['beneficiary_bank_id'].unique())
print(f"Originators:\t{o_num}\nFilers:\t\t{f_num}\nBenficiaries:\t{b_num}")
1
2
3
Originators:	892
Filers:		26
Benficiaries:	1392

Explore the data

Transaction Values between Originator and Beneficiary countries

The heatmaps below illustrate patterns of suspicious transactions ocurring both within and between countries. Interesting patterns are shown including between politically linked former USSR countries Latvia and Russia, and Great Britain and self-governing British oversees territory The Cayman Islands (right). The top countries (left) that are both sending and recieving high volumes and values of suspicous transactions include:

  • Latvia
  • Russia
  • Switzerland
  • The Netherlands
  • Singapore
  • Great Britain
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
fig,axs = plt.subplots(1,2,figsize=(25,10))

# calculate flow between ISO countries
iso_matrix = pd.crosstab(data['transactions_map']['beneficiary_iso'],
                         data['transactions_map']['originator_iso'],
                         values=data['transactions_map']['amount_transactions'],aggfunc=np.sum)
iso_matrix = iso_matrix.dropna(thresh=10,axis=0).dropna(thresh=10,axis=1) # filter results to top n crossovers
iso_matrix = (iso_matrix/1000000).fillna(0.0) # make $ values in Millions

# sort by total x (originator)
iso_matrix = iso_matrix[iso_matrix.sum(axis=0).sort_values().index]

# sort by total y (benficiary)
iso_matrix = iso_matrix.assign(sort=iso_matrix.sum(axis=1)).sort_values('sort',ascending=False).drop('sort',1)

# plot heatmap
n = 5
sns.heatmap(iso_matrix.iloc[:n,-n:],cmap='Reds',cbar=False,linewidths=1,annot=True,fmt='.0f',ax=axs[0])
sns.heatmap(iso_matrix,cmap='Reds',linewidths=1,annot=True,fmt='.0f',ax=axs[1],
            annot_kws={"size": 9},
            cbar_kws={'label': 'Transaction Value ($ M)'})

# fix y axis limits
for ax,title in zip(axs,[f'Top {n} Countries','All Countries']):
    ax.set_title(title,fontsize=18)
    b, t = ax.get_ylim() # discover the values for bottom and top
    b += 0.5 # Add 0.5 to the bottom
    t -= 0.5 # Subtract 0.5 from the top
    ax.set_ylim(b, t);

png

1
2
# Cluster example
# pairwise_ dist

Transaction chains between originators, filers, and beneficiaries

The transaction chains highlight the parties initiating and facilitating payments. Below are the top 3 transaction chains by value and by volume of transactions.

1
2
3
cols = ['originator_bank_', 'filer_org_name_', 'beneficiary_bank_']
transaction_chains = transactions.groupby(cols).agg({'number_transactions':'sum',
                                                     'amount_transactions':'sum'}).reset_index()
1
2
# top chains in terms of transcation value
transaction_chains.sort_values(axis=0,by='amount_transactions',ascending=False)[:3]
originator_bank_filer_org_name_beneficiary_bank_number_transactionsamount_transactions
102AMSTERDAM TRADE BANK NVTHE BANK OF NEW YORK MELLON CORPROSBANK4.02.747023e+09
1985RIGENSIS BANK ASDEUTSCHE BANK AGING NETHERLAND NV47.01.201172e+09
1347ING NETHERLAND NVDEUTSCHE BANK AGRIGENSIS BANK AS33.01.199220e+09
1
2
3
# top chains in terms of transaction number.
# Bank of New York Mellon
transaction_chains.sort_values(by='number_transactions',ascending=False)[:3]
originator_bank_filer_org_name_beneficiary_bank_number_transactionsamount_transactions
139AS EXPOBANKTHE BANK OF NEW YORK MELLON CORPCREDIT SUISSE AG296.08.880407e+08
101AMSTERDAM TRADE BANK NVSTANDARD CHARTERED PLCRAIFFEISEN BANK INTERNATIONAL AG190.04.026005e+08
701CIMB BANK BERHADTHE BANK OF NEW YORK MELLON CORPBARCLAYS BANK PLC186.01.731369e+08

We can employ several analytical and visualisation techniques to more easily inpsect and unpick the transaction chains. This includes flow and Sankey diagrams, and network analyses.

1
2
#quantile = transaction_chains.groupby(cols)['amount_transactions'].sum().quantile(q=0.9)
filtered = transaction_chains.sort_values(by='number_transactions',ascending=False)[:50]

Here is A funciton to convert any edge list to the Plotly Sankey diagram format. I have created a gist here.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
def df_2_sankey(df,cols,values,hover_value=None):
    """ Helper function to convert a dataframe of relationships to 
        Plotly Sankey format.
    """
    colors = ['#D0EDA6','#EDEF7B','#EF7B84','#00308f','#eedc82','#c66']
    out = df.copy()
    entities = dict()
    vals_ = [values,hover_value]
    try: 
        vals_.remove(None)
    except:
        pass
    
    # get dict of uid for each entity in each level from 0...n
    for n,c in enumerate(cols):
        if n == 0:
            out[f'{c}_'] = pd.factorize(out[c])[0]
        else:
            out[f'{c}_'] = pd.factorize(out[c])[0] + (out[f'{cols[n-1]}_'].max()+1)
        entities[n] = dict(out[[cols[n]+'_',cols[n]]].values)

    # create df edge list between source,targets
    edge_list = list()
    for i in range(0, len(cols), 1):
        slice_ = cols[i:i+2]
        if len(slice_)==2:
            el_ = out.groupby([f'{c}_' for c in slice_],as_index=False)[vals_].sum()
            el_.columns = ['source','target']+vals_
            edge_list.append(el_)
    edge_list = pd.concat(edge_list,ignore_index=True)

    # Sankey node definitions are simply the ordered uid/names of each entity 
    labels = [v[i] for k,v in entities.items() for i in v] # use v[i] for name, i for uid
    node_colors = [colors[k] for k in entities.keys() for i in entities[k]]

    # Sankey edge definitions
    source = list(edge_list['source'])
    target = list(edge_list['target'])
    values = list(edge_list[values])
    if hover_value is None:
        hover_values = list()
    else:
        hover_values = list(edge_list[hover_value])

    if len(source) == len(target) == len(values)==len(hover_values):
        return edge_list,entities,labels,node_colors,source,target,values,hover_values
    else:
        raise Exception('Output test fail: lists are of unequal lengths')
1
edge_list,entities,labels,colors,source,target,values,hover_values = df_2_sankey(filtered,cols,values='number_transactions',hover_value='amount_transactions')
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=False)

fig = go.Figure(data=[go.Sankey(
    valueformat = ".0f",
    node = dict(
      pad = 5,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = labels,
      color = colors,
      customdata=hover_values,
      hovertemplate='Source: %{label}<br />'+
                    'Amount: %{value}<br />'+
                    'Transactions: %{customdata}<extra></extra>'
    ),
    link = dict(
      source = source, # indices correspond to labels, eg A1, A2, A2, B1, ...
      target = target,
      value = values,
      customdata=hover_values,
      hovertemplate='Source: %{source.label}<br />'+
                    'Target: %{target.label}<br />'+
                    'Amount: %{value}<br />'+
                    'Transactions: %{customdata}<extra></extra>', 
  ))])

fig.update_layout(autosize=True,width=800,height=800,title_text="Origin > Filer > Beneficiary", font_size=10)
fig.show()
# iplot(fig,show_link=False)