David Bholat, Senior Analyst, Advanced Analytics Division, Bank of England
Maciej Piechocki, Financial Services Partner, BearingPoint
Iman van Lelyveld, Statistics Division, The Netherlands Bank and VU Amsterdam
Many observers believe big data has the potential to open up new possibilities for monetary policy-making, financial supervision, and economic research. Nowcasting, text mining, machine learning and other new techniques have been made possible by improvements in processing technology and new, larger, more granular datasets. Nevertheless, questions remain over how central banks can make best use of the new methods.
This forum draws on the viewpoints of three experts who discuss financial stability and supervisory applications, direct uses in economics and modelling, who should ‘own’ big data, e-sourcing and budgets, future developments, as well as the operational challenges of gathering, structuring, storing and processing data. The online forum took place live on September 28, and is also available as a video.
Central Banking: What does big data mean to you?
David Bholat, Bank of England: Defining big data, you could do worse than follow the standard schema of the ‘three Vs’: volume, velocity and variety. For a central bank like the Bank of England (BoE), regarding data with more volume, it’s about complementing our traditional focus on macroeconomic time series and data extracted from financial statements with having more granular financial transaction data.
Data that has greater velocity, such as Twitter data, complements the data that flows into the central bank traditionally on a slower cycle. Finally, for variety we’re talking not only about structured numerical data, but also unstructured data, like text.
Iman van Lelyveld, The Netherlands Bank and VU Amsterdam: I would add that the data is large but that means it’s approaching the population, which is different from the traditional statistics where you would sample part of the population. So you have to look at different analytical tools.
Maciej Piechocki, BearingPoint: What was mentioned about the three different aspects of big data is really interesting. I believe there are also different maturities across central banks – big data is within the industry and serving not only the central banks but also the wider financial and non-financial services industries.
Central Banking: Some within central banks, IT departments especially, view big data as “just another fad”. Do you view big data as a fad?
Iman van Lelyveld: No, it is here to stay. What might be faddish is the claim that it gives you the tools to analyse everything – in the sense that you now know everything, therefore you can analyse everything. That’s pushing it because what you can’t see are things that people don’t want to reveal. Big data, as in totally unstructured and scraped from somewhere, has a limit.
Maciej Piechocki: I have a different view because, for many IT staff at central banks, it’s actually a great topic. There are several completely new technologies that some of the more innovative IT personnel at central banks love to work with.
I see a danger there as well because big data shouldn’t be treated solely from an IT perspective – it’s extremely important to analyse it from the use-case perspective. Big data has to be user-driven and not just what we were getting before – a huge warehousing project where data was merely stored but not really used.
David Bholat: At one level it is a fad for some of the nomenclatures, while at another level, it is not – the tools are here to stay. What’s quite hot now is ‘data-driven analysis’, though I’m not entirely sure how that’s different from what we used to call ‘evidence-based policy-making’.
Central Banking: Looking at some of the main applications, there’s a lot of activity going on around financial stability and supervision. What do you see as the main applications for big data?
Maciej Piechocki: As far as supervisory, statistical and stability applications are concerned, the big drivers are for derivatives data; so if you look at Dodd–Frank and European Market Infrastructure Regulation (Emir) requirements, they are bringing great granularity and the possibility of definitely going beyond the aggregates. Also, for continental Europe, there is a commotion about loan-level data. The loan-by-loan view of the borrowers and, on the whole, markets and flows are there, which is interesting if you start combining this with the aggregates the central banks are holding. It allows much more flexibility in finding out where the risks are really sitting without the industry actually incurring the burden.
Iman van Lelyveld: The key is the ability to link up data sets. For large data sets, this starts in payment systems where this kind of data is readily available, and you can match up the information from the payment systems and interbank lending with supervisory data, for instance.
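As an illustration of the linking described here, the following is a minimal Python sketch. It assumes both data sets carry a shared bank identifier. The field names (`sender`, `amount`, `tier1_capital`), the identifiers and all figures are hypothetical, invented for illustration, and do not come from any real payment system or supervisory return.

```python
from collections import defaultdict

# Hypothetical interbank payment records (illustrative values only).
payments = [
    {"sender": "BANK_A", "receiver": "BANK_B", "amount": 120.0},
    {"sender": "BANK_A", "receiver": "BANK_C", "amount": 80.0},
    {"sender": "BANK_B", "receiver": "BANK_C", "amount": 50.0},
    {"sender": "BANK_D", "receiver": "BANK_A", "amount": 10.0},
]

# Hypothetical supervisory data keyed by the same identifier.
supervisory = {
    "BANK_A": {"tier1_capital": 400.0},
    "BANK_B": {"tier1_capital": 100.0},
}


def link_payments_to_supervisory(payments, supervisory):
    """Aggregate outgoing payments per bank and join them to supervisory data."""
    outgoing = defaultdict(float)
    for payment in payments:
        outgoing[payment["sender"]] += payment["amount"]

    linked, unmatched = {}, []
    for bank_id, total in outgoing.items():
        record = supervisory.get(bank_id)
        if record is None:
            # Keep track of identifiers that cannot be matched,
            # rather than silently dropping them.
            unmatched.append(bank_id)
            continue
        linked[bank_id] = {
            "total_outgoing": total,
            "tier1_capital": record["tier1_capital"],
            "outgoing_to_capital": total / record["tier1_capital"],
        }
    return linked, unmatched


linked, unmatched = link_payments_to_supervisory(payments, supervisory)
```

Returning the unmatched identifiers alongside the linked records is a design choice in this sketch: when data sets come from different reporting frameworks, the records that fail to match are often as informative as those that do.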
Central Banking: What about the direct usage for big data when it comes to the economic side, the modelling, and
David Bholat: Big data is not only applicable in the financial stability domain, but also when performing real-economy analysis. For example, in the UK labour market, we have been pleasantly surprised that, coming out of the financial crisis, the employment numbers have been quite strong. At the same time, there are some worries about automation increasing unemployment, and we are using vacancies data from websites to understand this.
Maciej Piechocki: Returning to the issue of volume and velocity, that’s a good use case for the supervision of financial stability, where central banks have a good grip on structured data. Unstructured data is coming along extremely well. Having very different data sources, you can pull the most from the unstructured big data and support the structured data sets.
Central Banking: How are some of the governance rules developing with regard to big data? How are people managing all this data in terms of governance?
Maciej Piechocki: For many central banks, it is a matter of staffing and skill sets, and there is a huge demand for data scientists, for example, everywhere in the industry. Another important topic is governance within central banks. There are a few central banks that have a designated chief data officer (CDO), while others are looking at the statistical function to fill this role, so it’s a question of who has ownership of big data within central banks.
Iman van Lelyveld: It’s very important to keep the ownership of the data within the business. As soon as it becomes a statistics or – even worse – an IT thing, there is a distance between the users and the definition of the data, solely between the way it’s stored and the way it’s accessed.
David Bholat: Many outside observers of central banks focus mostly on policy decisions, our forecasts, and so on, and don’t actually realise that, at very senior levels, data is discussed. It’s taken very seriously, and increasingly so.
Central Banking: Some central banks have set up data officers, some focus on the tech side, while others are looking at it from the statistics side. Is there a correct approach or a desirable approach people can take?
Maciej Piechocki: It doesn’t really matter as long as it’s a user-driven function. The danger with having too much tech is that it will become a purpose in itself, which it shouldn’t be. It doesn’t really matter if there is board-level ownership of the big data, which does happen at some central banks. Whether it is driven by statistics, stability or supervision depends on the central bank’s governance, with strong support from the technology departments.
Iman van Lelyveld: It depends to some degree on which part of the data, and whether it is really something operational. For instance, we look at how banknotes are checked, a very process-driven, huge-volume data set. It’s easy because nobody else needs this data, whereas for other data sets there might be multiple uses – and then it is much more important to perhaps have a management layer that co-ordinates these requests for the same data from different parts of the banks.
David Bholat: In terms of data, the concept of ‘ownership’ can be quite unhelpful because it leads to silo thinking. The real push, at least inside the BoE, is to be very ‘one bank’ and cross-organisational – we’re trying to set up a kind of ecosystem. The BoE owns the data. We have three divisions spearheading different efforts: our Statistics and Regulatory Data Division, which collects much of the regulatory report data; the Chief Data Officer Division, which puts in place the IT infrastructure to facilitate sharing; and Advanced Analytics, into which we’re bringing some data science machine-learning techniques.
Iman van Lelyveld: This is an important point – try to make everyone the owner of this data and make this kind of data as widely available as possible because that’s the only way to unlock the potential of combining different pieces of information in ways you haven’t thought about before. If you are in a silo organisation, it’s the death of data since you’re not using it. And, if you don’t use it, it will become polluted, you can’t use it any more and it will die.
David Bholat: A key initiative that our CDO division has been spearheading is the creation of a data inventory, a central register of all of the data sets – whether proprietary or purchased or open-source public – that are actually in the building. Previously, there would be someone working on financial stability issues who might have access to an interesting data set that their colleague – for example, an economist working on monetary analysis – didn’t even know existed in the building. If you have a central register, you can share more effectively.
Central Banking: Are there any special considerations regarding some of the confidential data? In certain functions, there’s information that is given for those purposes that obviously has great value in other parts of the central bank but, perhaps for confidentiality or even legal reasons, cannot be shared or needs to be shared in a way that doesn’t pinpoint where it’s coming from.
Maciej Piechocki: Certainly there are consequences, especially for central banks with monetary as well as supervisory functions. There have been conversations about how governance should be organised around data sharing between these functions to ensure they are sufficiently separated.
Also, if you look at the large volumes of data, they’re often related to more granular contract data, down to the personal-level data. There has been much discussion around loan-by-loan collections, and the extent to which central banks are allowed to reach this level of information.
Iman van Lelyveld: These are important issues, but most of this personal data is already shared somewhere, and we need to sort them out in a proper way. What you would like to do from an analytic point of view, for loan-level data or mortgages, is also to have some information on the tax situation.
Central Banking: Now that central banks have access to all this data from trading and clearing platforms, are all the reporting requirements on major market participants necessary? They are costly and could put large firms at a disadvantage.
Iman van Lelyveld: If you look at most of this data, specifically the trade repository data, it’s not up to scratch. We cannot replace other surveys or reports with this data just yet. Naturally, if that was possible, I’d be all for it. So we need to flesh out the areas where other reports have the same bits of information and see if we can replace that with trade repository data.
Maciej Piechocki: The first central bank to trial this approach is in Austria. It’s an interesting example, where instead of collecting the aggregated data, which is quite a costly burden, they’re trying to obtain the contract-level data, and are transforming the whole regulatory value chain. As for logistics becoming much more granular, I can well imagine in the future that plugging into the trading systems even earlier could lead to a situation in which you could remove the whole value chain and the classical current one by pulling the information that a central bank needs.
Central Banking: People have been working on it for some time, so why are the trade repositories not up to scratch?
Iman van Lelyveld: In any data set that you build up, you need some time to get the quality up. With a trade repository it’s such a huge set and the governance is so dispersed that it’s difficult to get a feedback loop between the users of the data and the people submitting the data. As you’re trying to get a very wide coverage, relatively small non-financial firms need to report, but they’re not matched up to the regular reporting frameworks.
Central Banking: If there are so many challenges in getting the structured data right, how can one hope to get all the unstructured data?
David Bholat: In a sense, the structured/unstructured divide is artificial. This is what makes the Austrian central bank approach quite innovative – when you go back to what a financial instrument or a product is, it’s a contract, something unstructured. Then, as it proceeds through different parts of an individual firm and that data is disseminated to regulators and the regulators publish the aggregate data, it becomes structured. We want to get more granular because then that reduces regulatory reporting costs in the long term.
Central Banking: Databases are relatively easy for IT departments to centralise, but what about derived data series and data produced and modified by analysts on the business side? Do good enough tools exist to achieve this without the involvement of IT departments?
Maciej Piechocki: This is a very important topic because, if you are aiming at flexible working with data, you are automatically switching to self-service-oriented aspects of the business that can dive into data. We are beginning to see the flexibility of systems that can handle this, but the aim is also to provide self-service for the business departments, not only to construct new analyses, but to provide aggregated reports with the possibility of drill-down and to broadcast them to other departments across the organisational boundaries.
Iman van Lelyveld: It puts the burden on the analyst to create an analysis suite in a more structured way. It will do more than just load things onto an Excel spreadsheet, build a series of graphs and say: ‘here’s my analysis’. Over time, we will move to coding it up in some way and having a paper trail of analysis. Then, with some sort of centralised IT function that keeps the source data – a ‘golden copy’ – you can trace back your analysis.
David Bholat: Spreadsheets have traditionally been a workhorse, but with the problem that when changes are made you don’t know who made them, at what time and in what direction. What is valuable about some of these coding technologies is that you can actually trace how that data was brought into the data frame.
The principle is to have one golden copy of data, not multiple versions of the data in different spreadsheets. With a golden source of data you have a coding interface on top that is traceable and auditable in terms of what kind of analysis was done.
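The idea of a golden copy with a traceable coding layer on top can be sketched in Python as follows. This is only an illustration under stated assumptions: the `TracedAnalysis` class, the `fingerprint` helper, the firm names and the exposure figures are all hypothetical, and the sketch does not describe any central bank's actual tooling.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical golden copy: a single read-only source data set.
GOLDEN_COPY = (
    {"firm": "FIRM_1", "exposure": 10.0},
    {"firm": "FIRM_2", "exposure": 30.0},
    {"firm": "FIRM_3", "exposure": 60.0},
)


def fingerprint(rows):
    """Return a SHA-256 digest identifying the exact data that was used."""
    payload = json.dumps(list(rows), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TracedAnalysis:
    """Apply steps to a copy of the source and log who did what, and when."""

    def __init__(self, source, analyst):
        # Work on a copy so the golden source is never modified.
        self.rows = [dict(row) for row in source]
        self.analyst = analyst
        self.log = [self._entry("load", fingerprint(source))]

    def _entry(self, step, detail):
        return {
            "analyst": self.analyst,
            "step": step,
            "detail": detail,
            "at": datetime.now(timezone.utc).isoformat(),
        }

    def apply(self, name, func):
        """Run a named transformation and record it in the audit log."""
        self.rows = func(self.rows)
        self.log.append(self._entry(name, fingerprint(self.rows)))
        return self


analysis = TracedAnalysis(GOLDEN_COPY, analyst="analyst_a")
analysis.apply(
    "keep_exposure_over_20",
    lambda rows: [row for row in rows if row["exposure"] > 20.0],
)
total_exposure = sum(row["exposure"] for row in analysis.rows)
```

Each log entry records the analyst, the step name, a timestamp and a digest of the data after the step. This is the kind of information that is usually lost when a spreadsheet is edited by hand.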
Maciej Piechocki: A single data dictionary is a great idea, with all terms defined across the central bank. But this goes against the object of velocity because the time taken to create this dictionary, or to enter something new into it, is time that you are losing from working with the data.
Iman van Lelyveld: I disagree, I don’t think it is time lost. For instance, what is an entity? Are we looking at a consolidated entity or a sub-consolidated entity? You need a legal entity identifier, some sort of unique identifier of the bits you need to put together. Generally, this has not been given much thought, so people make their own constellation of a series of legal entity identifiers and then, two months later, you have to do the same analysis.
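The point about consolidated versus sub-consolidated entities can be made concrete with a small Python sketch. It assumes a shared mapping from each entity identifier to its parent. The identifiers below are hypothetical placeholders (real legal entity identifiers are 20-character codes), and the exposure figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchy: each identifier maps to its parent, or None
# if the entity sits at the top of its group.
PARENT_OF = {
    "SUB_A1": "GROUP_A",
    "SUB_A2": "GROUP_A",
    "GROUP_A": None,
    "SUB_B1": "GROUP_B",
    "GROUP_B": None,
}


def ultimate_parent(entity_id, parent_of):
    """Follow parent links until the top-level (consolidated) entity."""
    seen = set()
    current = entity_id
    while parent_of.get(current) is not None:
        if current in seen:
            raise ValueError(f"cycle in entity hierarchy at {current}")
        seen.add(current)
        current = parent_of[current]
    return current


def consolidate(exposures, parent_of):
    """Sum entity-level exposures up to the consolidated group level."""
    totals = defaultdict(float)
    for entity_id, amount in exposures:
        totals[ultimate_parent(entity_id, parent_of)] += amount
    return dict(totals)


# Hypothetical entity-level exposures.
exposures = [
    ("SUB_A1", 5.0),
    ("SUB_A2", 7.0),
    ("GROUP_A", 3.0),
    ("SUB_B1", 4.0),
]
consolidated = consolidate(exposures, PARENT_OF)
```

With one agreed mapping, two analysts running the same consolidation months apart get the same grouping, instead of each rebuilding their own set of identifiers.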
Central Banking: What are some of the security issues related to big data, whether it’s cyber security, physical security, confidentiality when using external providers, and so on?
David Bholat: When we went through our strategic review and Mark Carney took over as governor, one of the outputs was to set up an Information Security Division. They use some machine-learning approaches, and information security has ratcheted up the agenda because I’m constantly doing compliance exercises.
Iman van Lelyveld: I’m in the same position. Security is being ratcheted up, and I certainly can’t share confidential data outside of our systems, so any kind of innovation needs to be a local installation; I cannot go through the internet, for example.
Maciej Piechocki: As a solutions service provider to central banks, over the last decade I have found security has improved vastly, and the requirements of providers in terms of certifications and security screening are definitely growing.
Central Banking: How can central banks attract the correct sort of talent to work in this area – a combination of economists, statisticians and computer scientists? They’re still not that common, so how do central banks recruit the people they need? They’re competing against some of the technology pioneers, so it can’t be easy.
David Bholat: Where we can compete, I think, is if you’re intellectually interested in working on really tough problems. Often, what we find when we’re recruiting tends to be people who are just coming out of university, who are interested in trying to address these big-picture problems and also have a sense of service. We exist at the BoE to serve the people of the UK, to promote their common good and that’s the kind of service ethos that attracts a lot of people to us.
Iman van Lelyveld: It’s the same pitch we make, but we also offer something regarding the balance of work. We have interesting data and questions, but also space to actually investigate. There are important things to come up with solutions for, which for inquisitive people straight from university is fantastic.
Maciej Piechocki: That’s interesting because we are aiming for exactly the same profiles of the market and industry. I’ve seen approaches from several central banks to sponsor universities or high schools. Also, economics is losing some competitive edge and we are aiming more at mathematicians, those with majors in physics and natural sciences, because many already have a good grip on data.
Iman van Lelyveld: It’s important to give them time to learn the language. I’ve been working with theoretical physicists, and they know a lot more about the structure in data, etc., but we need to learn to talk to each other. For instance, having algorithmic learning without actually knowing how the rules work doesn’t generally find anything. We need a combination of the two, at least in the same analysis team.
Central Banking: There is the whole aspect of causal reasons for correlations – if an online retailer sees those correlations between selling two different items, they’ll put it up and ask you if you want to buy it, whereas a central bank taking policy actions based on data correlations isn’t necessarily a great communication strategy, is it?
David Bholat: At central banks we have to tell stories. Correlation is never enough and, even if you get a really good correlation, that doesn’t of course mean causation.
Iman van Lelyveld: In terms of supervisory policies, we can’t go to a bank and say, “you have to build down your portfolio because there’s some correlation with the sale of a certain product”. You need a causal story of how that product affects the exposure on your derivatives portfolio. So that’s a challenge because, if there’s a high correlation, it might lead to further investigation in that area, but certainly not to direct policy actions.
Central Banking: If the data is perfect, to what degree can central banks depart from traditional economic methods and use or combine them with new machine-learning methods?
David Bholat: Assuming there are no data quality issues, then the value of using a machine-learning approach is if you think the pattern in the data is non-linear in form.
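A small Python sketch can illustrate why a non-linear pattern matters. It fits a straight line and a k-nearest-neighbour regressor to synthetic data generated from y = x². The nearest-neighbour method is a stand-in chosen for this sketch and is not a technique named in the discussion. The data are synthetic and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression: average the k closest training targets."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic, noise-free data with a non-linear (quadratic) pattern.
train_x = [i / 10 for i in range(-20, 21)]
train_y = [x ** 2 for x in train_x]

# Held-out points that lie between the training points.
test_x = [i / 10 + 0.05 for i in range(-19, 19)]
test_y = [x ** 2 for x in test_x]

line = fit_line(train_x, train_y)
knn = fit_knn(train_x, train_y, k=3)

linear_mse = mean_squared_error(line, test_x, test_y)
knn_mse = mean_squared_error(knn, test_x, test_y)
```

On this data the fitted line is almost flat, because the quadratic pattern is symmetric around zero, so its held-out error is much larger than that of the nearest-neighbour model. If the underlying relationship were linear, the comparison would favour the simpler model.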
Central Banking: What could be the big breakthrough in big data in coming years?
Iman van Lelyveld: I’m very excited about the Emir data because we’re looking to make big steps. But at the bank we’re also looking at other big data sets that are going to be helpful, one being older chamber of commerce data that tells us a lot about individual non-financial firms.
David Bholat: Over the next 12 to 18 months, we want to add the most value through the supervisory arm of the BoE, the Prudential Regulation Authority. There has been a lot of central bank research produced, but not much on the micro-prudential function. We’re text mining the letters that are sent from our supervisors to firms they regulate to understand whether we’re being consistent or systematically biased in our communication. We’re also going to work with all these new, large, granular data sets, as well as changing how supervisors go about their jobs by building more interactive and visually appealing dashboards.
Maciej Piechocki: I share the view that for the next 12 to 18 months the central banks will be exposed to Emir derivatives data and money markets statistics; the Markets in Financial Instruments Directive II is also coming with another large set of very granular data sets. Also repo data; and credit registers are being revamped, not only in Europe, but also outside Europe. There has been an exposure to large, granular data volumes that are then driving different analytical approaches and statistical applications, which are transforming the data functions in central banks.