Alastair Firrell, Data Scientist, Advanced Analytics, Bank of England
Maciej Piechocki, Financial Services Partner, BearingPoint
Leif Anders Thorsrud, Senior Researcher, Monetary Policy Research, Norges Bank
Bruno Tissot, Head of Statistics and Research Support, Bank for International Settlements, and Secretary, Irving Fisher Committee on Central Bank Statistics
Central Banking convened a panel of experts to discuss how central banks can harness big data for their needs, hopefully without falling foul of some of the many pitfalls that await.
Big data has emerged as a hot topic in central banking circles over the past few years, with many official institutions reviewing their data policies while increasing their staffing and systems capabilities in a bid to unlock big data’s potential. Public attention has often focused on the scraping, cleaning and analysis of unstructured publicly available data, which offers the potential for macroeconomic nowcasting, for example, or securing real-time communications feedback.
Central banks also hold large amounts of structured, sometimes confidential, data that they must categorise, hold securely and process in an appropriate manner – sometimes using big data techniques. Then there are the multi-faceted challenges associated with making use of all these different combinations of data, including textual data. In this forum, our panel discusses the results of the Big data in central banks 2016 survey, infrastructure support for large datasets, the effectiveness of central banks’ efforts, the real impact of machine learning, changing regulatory views, support from the executive level, and staffing and systems.
Central Banking: According to last year’s survey on big data in central banks, most central banks think of big data as large, unstructured, external data – is this an accurate perception?
Leif Anders Thorsrud, Norges Bank: Big data is something of a buzzword. Data is as useful as the questions you ask of it – big data is like any other type of data in that respect. Maybe the reason this came up is that most central bankers are more familiar with structured data, big or small. Unstructured data is more exotic, so people tend to think big data is something new – and therefore must be unstructured. Clearly, though, big data can be either structured or unstructured.
Alastair Firrell, Bank of England: In a lot of cases, big data is just data that your current systems don’t cope with very well. It’s bigger, more varied or stranger than the data you already deal with. Central banks have traditionally been good at middle-sized – particularly aggregated – numeric data, so we are seeing a lot of interest in textual data and in data that is unstructured, semi-structured or variably structured – it isn’t white noise, so it isn’t entirely unstructured. We are also seeing some properly large datasets, although in many cases what we would call large is not what much of the data community would.
Maciej Piechocki, BearingPoint: With big data you have different aspects, and there is relevance to how central banks deal with the data in general. When you look into the responses to the survey, they clearly show that, although it is unstructured data as far as the research is concerned, it could be structured and voluminous for other purposes – such as the credit register. I think there is a question about what the data is used for, and not so much the size or the structured versus unstructured demarcation.
Bruno Tissot, Bank for International Settlements (BIS): There are two camps. Firstly, there are those who say big data is primarily the type of unstructured data the private sector is dealing with. According to a recent BIS review, central banks are clearly interested too, for example, in looking at internet searches for nowcasting. A second area that is really key for central banks is in dealing with very large administrative and financial datasets. It is not simply because it is large that makes it big data, but because it is large and complex. In addition, you sometimes need big data techniques to facilitate/improve the analysis of relatively simple structured datasets.
Central Banking: There has been a lot of use of big data in research at central banks, but has there also been an important role for actual policymaking?
Leif Anders Thorsrud: In terms of big data, my impression is that most of the work has been going on in research departments in central banks, but I think we are now seeing more research being taken into policy analysis. Norges Bank has two different big data pillars – one focused on structured, granular data types, and one where we use unstructured data types. Both are motivated by policy questions, with the goal of making the research policy-relevant.
Maciej Piechocki: That is an interesting topic from a timing perspective, because policy proposals often start with the research department – and we should not forget that central banks can mandate certain data collections. Say you find it interesting to look at certain granular data on the housing market; what follows are initiatives such as the European credit register – central banks realising they need such data on a more regular basis.
Bruno Tissot: Regarding the private sector type of big data, it is fair to say that, on average, central banks are just starting to explore the internet of things; it’s not really in production on a large scale. But then there is the big part – dealing with administrative, commercial and financial micro-level datasets. What has really increased since the financial crisis is the need to manage this large and expanding amount of information and go beyond the aggregates, by making use of available micro datasets. This is a key factor driving central banks’ interest in big data techniques.
Central Banking: What are some of the uses of big data – filling in for a void in statistics or nowcasting requirements, for example? What progress is being made there, and are there concrete examples of where that really adds value?
Leif Anders Thorsrud: There are several projects where we have used unstructured textual data from newspapers, and crunched that using machine-learning techniques to put it into a nowcasting framework. We have found really good results when we do that. I’ve been working a lot with standard time‑series models, model combinations and forecast combination techniques – what we typically see is that we have different models, they work reasonably well, but it’s the data you put into it that determines how well you’ll do in terms of predicting the present.
Central Banking: Are there any areas where big data isn’t living up to its full potential?
Alastair Firrell: There are certainly areas in which it is very challenging to get value. We’ve utilised Twitter in a few scenarios, looking at specific events or for particular quantities of information or trends, and there’s a lot of noise in there. If you’re happy that you’re looking for the presence of a particular firm, then maybe that’s fine as they will always be tagged in the same way. But trying to get sentiment out of something like that can be wobbly.
That can come down simply to a lack of context. Take the name Mervyn King – there’s Mervyn King, the governor of the Bank of England; Mervyn King, the bowls player; Mervyn King, the darts player, and so on. Get it at the wrong time and suddenly there’s a spike, but people are talking about a different Mervyn King. These are not necessarily big data problems, just data problems – but if you are receiving a constant stream without much context around it, and without someone sitting there cleaning it, this can create problems quite quickly.
We have had a lot of good results from news and our own internal textual data. For example, we have regional agents who talk to firms and write up their interviews, using that information to look over a lengthy time period for indicators of different discussions and hot topics. There is an awful lot of value to be had, but as with all data analysis you have to know the context – whether what it is telling you is really meaningful.
Bruno Tissot: It seems to me that we are just at the beginning of making sense of the increasing volume and variety of data we can access. Micro-level datasets can be very complex. Sometimes you are merging information from different and inconsistent sources; so you have to choose among them, and this choice may depend on circumstances. You may also want to aggregate granular information in a way that can also evolve over time.
A good example is what we do at the BIS to compute international debt issuance statistics. We aggregate micro-level information based on the residency concept, to compute all the debt issued by the economic agents that are located in a given country (in line with System of National Accounts principles). But we can also aggregate the same statistics on a so-called “nationality basis”, by looking at all the debt issued not only by a resident firm of the country but also by the foreign entities controlled by this national firm – these affiliates are located outside of the country and are therefore not captured by a residency-based framework. Constructing such nationality-based statistics can be quite challenging: one has to identify the perimeter of global firms, reclassify their individual units and consolidate granular information at the group level.
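The two aggregation concepts can be illustrated with a minimal Python sketch. The issuer names, country codes and amounts below are entirely hypothetical, not BIS data; the point is only that the same security-level records group differently under the residency and nationality concepts:

```python
from collections import defaultdict

# Hypothetical security-level records: issuing unit, country of residence,
# country of the controlling parent ("nationality") and amount issued.
issues = [
    {"issuer": "BankCo Ltd",       "residence": "GB", "nationality": "GB", "amount": 100},
    {"issuer": "BankCo Cayman",    "residence": "KY", "nationality": "GB", "amount": 40},
    {"issuer": "AutoCorp Finance", "residence": "NL", "nationality": "DE", "amount": 60},
]

def aggregate(records, key):
    """Sum issuance by the chosen grouping concept."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["amount"]
    return dict(totals)

by_residence = aggregate(issues, "residence")      # where the issuing unit is located
by_nationality = aggregate(issues, "nationality")  # where the controlling parent sits

# On a residency basis, GB shows 100; on a nationality basis, the Cayman
# affiliate's 40 is consolidated back to the UK parent, giving GB 140.
```

In practice the hard part is not the grouping itself but, as noted above, identifying the perimeter of global firms so that each unit carries the correct nationality attribute.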
Central Banking: A lot is going on that is not this internet unstructured data. There is a lot of work with BIS involved too, so in terms of regulators holding large pools of data, are they likely to remain granular with regard to data analysis?
Bruno Tissot: We are facing a revolution in financial statistics that perhaps in the future people will compare to what happened in the 1930s for the real economy. At that time, the Great Depression influenced the development of the national accounts framework. Similarly, the recent financial crisis has triggered unprecedented efforts to collect more information on the financial sector – especially in the context of the Data Gaps Initiative endorsed by the Group of 20. Large micro datasets are in high demand in this context. For instance, you cannot just look at a group of various financial institutions in an aggregated way, you must also look at those that are systemic on an individual basis. Or you need to have a sense of the distribution of macro aggregates and look at “fat tails”, and so on.
Central Banking: How are you using big data techniques with regard to textual data – what projects do you have running?
Leif Anders Thorsrud: Norges Bank has an unstructured big data project, mostly using business newspaper data for nowcasting applications, but also for more structurally oriented analysis. We are working on US data to do the same there, constructing business cycle indicators and pricing returns on the stock exchange, using basically the same type of raw data throughout all of these applications – and we get really consistent results. In terms of techniques, we use a blend of the traditional toolkit and tools from the machine-learning literature – dynamic variable selection techniques and clustering algorithms.
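The news-based approach can be caricatured in a few lines of Python. This is only an illustrative sketch: the headlines are invented, and the hand-picked keyword sets stand in for topics that would in practice be learned by a topic model or clustering algorithm:

```python
import re

# Toy newspaper headlines with publication dates (hypothetical data).
articles = [
    ("2017-03-01", "Oil prices slump as supply glut persists"),
    ("2017-03-01", "Retail sales climb on strong consumer demand"),
    ("2017-03-02", "Oil producers cut output to lift prices"),
]

# Hand-picked vocabularies standing in for machine-learned topics.
topics = {
    "energy": {"oil", "supply", "output", "producers"},
    "consumption": {"retail", "sales", "consumer", "demand"},
}

def topic_intensity(text, keywords):
    """Share of an article's words that belong to a topic's vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in words if w in keywords) / len(words)

def daily_index(day, topic):
    """Average topic intensity across that day's articles - a candidate
    predictor that could feed a nowcasting regression."""
    scores = [topic_intensity(t, topics[topic]) for d, t in articles if d == day]
    return sum(scores) / len(scores)
```

A daily series of such indexes is one simple way of turning a stream of text into a numeric input for a standard time-series model.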
Central Banking: How can individual-level payments systems data be used to answer practical questions at a central bank?
Alastair Firrell: There is a range, so some questions that can be answered are primarily operational: How is your payment system doing? Which scenarios lead to system blockages? Is there a way to change the way banks and institutions interact with your payments systems to free that up, or to inform your IT personnel going forward?
There is information within the payments that is not just numeric bank‑to‑bank payment values, but whole streams from where the payment comes and where it goes to, and you have at least a certain amount of information about the kind of institutions or people at each end of those. From this you can start looking at the impact of decisions – changes in sectoral trading, for example, and geographic dispersion. So there’s a reasonable amount to be done – as well as looking at things such as anti-money laundering and financial crime, there is the imperative for anyone working on payment systems to be checking what’s going on.
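As a toy illustration of moving from individual payments to sector-level flows, here is a minimal sketch in Python; the sectors and values are invented:

```python
from collections import defaultdict

# Hypothetical individual payments: (sender sector, receiver sector, value).
payments = [
    ("banking", "retail", 120.0),
    ("retail", "banking", 80.0),
    ("banking", "insurance", 50.0),
]

def net_flows(records):
    """Net flow per sector (inflows minus outflows) - a first step towards
    the kind of sectoral and geographic analysis described above."""
    net = defaultdict(float)
    for sender, receiver, value in records:
        net[sender] -= value
        net[receiver] += value
    return dict(net)

sector_net = net_flows(payments)
```

Real payment streams carry far richer attributes (institution type, geography, time of day), but the same aggregation pattern applies.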
Bruno Tissot: Apart from the traditional role played by central banks in payments systems, there are also new uses of these data for economic analysis. In Portugal, where tourism is important, payments systems data has been used to assess its impact on the economy. But the use of specific datasets will often depend on the policy question you have in mind, so not all countries will have the same practices.
Central Banking: In last year’s survey, many central bankers cited a lack of support from policymakers as the most significant challenge to increasing the use of datasets in their institution. What needs to change for big data analysis to gain more support?
Maciej Piechocki: With topics on data – especially big data – there are two issues. One is that regulators and central banks are getting more data – there’s a lot coming, and you need to install governance that can handle it properly. The second is accountability – the more data you have, the more accountability you need in order to extract the information and derive valuable insights.
Bruno Tissot: Our recent Irving Fisher Committee work shows that there is strong policy support for exploring big data within central banks. But this involves a lot of costs, resources and time. More importantly, a holistic approach is needed to ensure a continuum between the IT organisation, the collection of data, the statistical processing and policy use. Ensuring that this vast amount of information is not just collected and prepared but is actually useful really is key.
Alastair Firrell: There is the desire from senior personnel to use the data – seeing there is benefit, whether in the larger or the more varied datasets, to complement or completely replace the traditional sets. If you don’t see the value, you’re never going to sponsor it. Then there is the desire to actually manage it properly – to put the structures in place, instead of just saying “that’s great, let’s use that” and hoping they magically appear.
Leif Anders Thorsrud: It is important to clearly separate what is in production and what is in the development phase. There are different requirements in these environments – in the development phase the researcher or the analyst working on the data may be more in charge of the data, but if something useful comes out of it and it goes into the production phase then issues of ownership and data governance become more important.
Maciej Piechocki: The survey is clear about executive sponsorship, and I think this holds regardless of whether it’s a research-type project or a regular production project. I agree on the operational governance of these topics, but executive sponsorship – whether it comes from the chief data officer or the head of statistics is less relevant – is what helps to get these topics moving forward.
Leif Anders Thorsrud: You need some support, but in terms of how many bells and whistles you put onto the governance of a short-term or exploratory project, I think it is important to have a fair degree of flexibility. If not, it will be years before you are actually able to do something with that data.
Central Banking: Eighty-five per cent of survey respondents said there was no single allocated budget for data, and there is also the issue of whether to have a chief data officer – someone who flies the flag for data at an institution. Do you think the fact there is no single budget is part of the problem?
Bruno Tissot: It is key to extract information from the data collected. To this end you need an IT infrastructure, adequate statistical applications, sometimes legal and HR support – there is a full production chain to get from data to information. This requires good co-ordination. But whether central banks should have a specific way of organising themselves – set up a data framework or a “data lake”, appoint a “chief data officer” – depends on circumstances. And it is not the key issue; what matters is not so much the organisational structure as the coherence of the information management process to transform “data” into (useful) “information”.
Central Banking: A lot of key developments in big data have come from IT personnel – not necessarily from the front‑end parties that have been calling for use of it. They are saying that some traditional economists’ mindsets may be obsolete. Have you encountered this?
Alastair Firrell: Because there is an awful lot of blather around what big data is and the value of machine learning, many don’t see concrete examples of what can be done. These ideas buzz around but don’t necessarily make it to where they need to. I think it is incumbent on technicians and data people to try things out and ask the economists, statisticians and policy people the right questions: if we did something a bit like this, does that spark any ideas? If I show you clustering, does that make you think? If I show you anomaly detection, does that capture anything? If I show you topic modelling out of text, does that grab you in any way?
Otherwise, people will say: “These policymakers never ask us for anything, just the same old stats”, and the policymakers will say: “I don’t really know what data we’ve got, and I don’t know how we can use it, so I’m just going to ask for the same old stats”. Without this dialogue, it is not always easy to marry the burning question with the person who knows how they might be able to answer it.
Maciej Piechocki: It’s also a generational question. Some skilled IT graduates are coming from universities where they have access to all types of open‑source tools and data. They come to a central bank used to working with these tools, and they have innovative ideas for taking top-notch technologies and popularising them within central banks.
Central Banking: How useful has machine learning been? Leif has already described it as a new buzzword, but what real impact is it making?
Leif Anders Thorsrud: Machine learning comes with a new set of tools, and it is always better to have more tools in the toolkit. I think it is useful – we are applying it in most of our work these days alongside more standard tools.
Alastair Firrell: We use anomaly detection in trading patterns and the like, which would be a very traditional machine-learning application. We’re also using it in a fuzzier way for fuzzier tasks – extracting information from job advertisements to find out what the job truly is, for example, which is harder than you would think. It’s about a toolbox full of tools, and at different times you’ll use different ones.
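A minimal sketch of the anomaly-detection idea, in Python: a robust, median-based outlier flag, deliberately far simpler than anything used in production, applied to an invented series of settlement values:

```python
import statistics

def mad_anomalies(series, k=5.0):
    """Flag points whose distance from the median exceeds k times the
    median absolute deviation (MAD) - a simple, robust detector that is
    not distorted by the outliers it is looking for."""
    med = statistics.median(series)
    mad = statistics.median([abs(x - med) for x in series])
    return [i for i, x in enumerate(series) if abs(x - med) > k * mad]

# A quiet series of daily values with one spike at the end.
print(mad_anomalies([10, 11, 9, 10, 12, 10, 11, 100]))  # flags index 7
```

Production systems would work on rolling windows or model-based residuals, but the principle of flagging deviations from an expected pattern is the same.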
Bruno Tissot: These techniques can address things that we saw as important during the crisis but are difficult to model – non-linearities and network analysis, for example. But the choice of a given technique will depend on the questions you face. It’s very good to have new, sophisticated tools available, but the risk is to develop black boxes, which cannot deliver meaningful messages. It is essential to explore these techniques, work on specific projects and, perhaps more importantly, to define exactly the question you want to answer. This exploratory work can be shared within and across institutions. Central banks want to see precisely what other central banks and authorities are doing in terms of big data projects.
Central Banking: Big data is not always representative data – how should central banks ensure they have adequate accuracy, confidentiality, responsiveness and representativeness?
Bruno Tissot: This is an important issue that is sometimes overlooked. People tend to think that, because it’s a very large dataset, by definition it is a reliable source of information. But you cannot really judge the accuracy of a dataset if you don’t know its coverage bias, which can be significant. Even extremely large big data samples can compare unfavourably with (smaller) traditional probabilistic samples – that are, in contrast, designed to be representative of the population of interest. We should be mindful of these limitations: the risk is to have misguided policy decisions if they are based on inaccurate data.
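The point can be made with a small simulation: a very large but selection-biased sample estimates a population mean worse than a small random sample. Everything below is synthetic, with made-up parameters:

```python
import random

random.seed(42)

# A synthetic population: ~30% of agents are "online" and spend more
# on average than the "offline" majority.
population = [("online", random.gauss(120, 10)) if random.random() < 0.3
              else ("offline", random.gauss(80, 10)) for _ in range(100_000)]
true_mean = sum(v for _, v in population) / len(population)

# A huge but biased "big data" sample: only online activity is observed.
big_biased = [v for g, v in population if g == "online"]
# A small but properly random (probabilistic) sample.
small_random = [v for _, v in random.sample(population, 500)]

biased_err = abs(sum(big_biased) / len(big_biased) - true_mean)
random_err = abs(sum(small_random) / len(small_random) - true_mean)
# The 500-observation random sample lands close to the true mean;
# the ~30,000-observation biased sample misses it badly.
```

No amount of extra volume fixes the coverage bias, which is exactly the limitation described above.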
Central Banking: What are the challenges people have to deal with when coping with unstructured open‑source, textual and confidential data?
Maciej Piechocki: I think that proper data management and data strategies are key here; for many central banks the challenges lie in matching different datasets – matching findings from research on unstructured data with semi-automated analyses that you can run on the granular datasets. The matching of data is still at the beginning, but we are starting to see a lot of streams on big data in the research area and a lot of dynamics in terms of the granular datasets going beyond aggregate. What I have not seen much is this being brought together.
Central Banking: What is the optimal way of storing data? Most central banks still use their own data platforms rather than commercial platforms, so what in-house solutions do they have and why are they superior?
Alastair Firrell: Central banks face a big security challenge regarding their ability to use cloud infrastructure. Various other organisations might be able to farm a greater proportion of their data out to cloud infrastructure – at least for storage, but ideally for processing, analytics and all sorts of other things.
Maciej Piechocki: You should never answer a question about storing data before you know what you are going to do with it – but, interestingly, for many central banks the question of storage comes first. If you want to run big data techniques you will need one kind of storage; if you want to do simple querying on very structured data, then a structured query language database will do well.
Central Banking: We’ve seen databases experiencing problems recently and there are concerns around whether there are techniques that could detect issues such as these in the future. Are central banks on top of potential breaches?
Bruno Tissot: It is at the top of the agenda – because of the importance of the information the central bank has access to, you may have privacy issues, you may have a legal issue if there is a leak of confidential information, and you may face financial consequences. Key is perhaps reputation risk. If authorities collect confidential private information and this information is not protected adequately, it can be very damaging for their reputation and credibility.
Maciej Piechocki: It’s a critical piece of the infrastructure, especially on the collection side. The US Securities and Exchange Commission uses a portal that collects data from listed and regulated entities – which is what every central bank does as well – and there is a lot of sensitive data. If personal data also comes into play, it is extremely exposed to cyber risk.
Alastair Firrell: The techniques and the ability to apply them are there, for central banks to spot a lot of these issues. Security is paramount, even if it constrains our use of commercial platforms for bigger data analysis.
This is a summary of the forum that was convened by Central Banking and moderated by Central Banking’s editor, Christopher Jeffery. The commentary and responses to this forum are personal and do not necessarily reflect the views and opinions of the panellists’ respective organisations.