‘Data Lakes’ are gaining explosive traction with payments businesses in Europe and North America. Proponents say they offer a more flexible, scalable and cheaper data storage solution than traditional data warehousing, alongside the improved analytics capacity the payments industry craves. However, these same advantages may make them riskier than traditional data solutions. PCM Editor James Wood dives in…
Companies across the payments value chain are waking up to the huge potential of big data. In March 2019, FICO and Equifax announced a partnership to sell pools of anonymised data to banks and other financial services companies, along with the software tools to help analyse that data. Alongside these “data supermarket” offerings, major players like IBM, Dell, Amazon, Google and Microsoft have all announced the introduction of “data lake” solutions as an alternative to traditional means of storing data such as warehousing on physical servers or the cloud.
Data lakes are centralised repositories that allow storage of both structured and unstructured data. Data can be stored in the form in which it is received, without the structuring and sorting that traditional data warehousing solutions require. Various analytics solutions can then be applied using the latest processing and analytics packages. A recent estimate from Market Research Future suggests that the market for data lakes is set to grow by more than 20 percent per annum over the next five years, reaching $14 billion by 2023.
Proponents of data lakes argue that they are faster and cheaper to set up than traditional data warehousing solutions, and more flexible, since data is stored in its “raw” format rather than being sorted and formatted prior to storage. A 2017 study by the Boston Consulting Group estimated that between 60 and 75 percent of data storage costs were incurred in the acquisition, sorting and filtering of data prior to storage. Most intriguingly, the fact that data lakes hold unprocessed information makes it possible for companies to retrofit this data to any future analytics package. Compare this to previous practice, where data would be lost forever if it didn’t fit a particular warehouse structure, and the payments sector’s interest in data lakes becomes clear.
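The “raw storage now, structure later” idea is often called schema-on-read. A minimal sketch of the principle, with all names and record formats invented for illustration:

```python
import json

# Schema-on-read sketch: raw payloads land in the "lake" untouched,
# and a schema is applied only when an analytics question is asked.
# Record formats and field names here are illustrative assumptions.

lake = []  # the lake: raw strings, no upfront schema


def ingest(raw: str) -> None:
    """Store data exactly as received -- no sorting or filtering step."""
    lake.append(raw)


def query_amounts() -> list:
    """A later analytics need: extract transaction amounts.
    Records that don't fit this particular 'schema' are simply skipped,
    but remain in the lake for some future package to interpret."""
    out = []
    for raw in lake:
        try:
            rec = json.loads(raw)
            out.append(float(rec["amount"]))
        except (ValueError, KeyError):
            continue
    return out


ingest('{"amount": "12.50", "currency": "EUR"}')
ingest('free-text log line, no schema at all')
ingest('{"amount": 3, "method": "wire"}')
```

In a warehouse, the second record would have been rejected (and lost) at ingestion; in the lake it survives until an analytics tool that understands it comes along.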
There’s little doubt that massive growth in analytical power and the capacity to capture customer data is driving this industry’s interest in data lakes. Worldwide data storage capacity is set to increase to more than 400 exabytes by 2020 – that’s 400 billion gigabytes. In a recent survey, only 20 percent of financial services firms said that their data storage requirement would be less than one terabyte this year, with 30 percent predicting a data storage requirement of more than ten petabytes, or ten million gigabytes.
Numbers like these help to explain why data storage is set to be one of the world’s fastest-growing segments in the next five years according to Forbes magazine. For the payments industry, specific applications of data lakes include the integration of data from different payment platform functions – credit, debit, wire transfers, mobile and other payment methods – into one data repository. Data lakes are also being used to help identify fraud patterns, predict customer and merchant behaviours, and in marketing and product development.
McKinsey and Company have been firm advocates of data lakes in financial services, noting in a 2017 paper that, “There’s a lot [for financial services companies] to like about data lakes – companies can use affordable, easy-to-obtain hardware, and data sets do not need to be indexed and prepped at induction.” In addition to being easy to use, McKinsey argue that the ever-decreasing cost of data storage (down from $10 per gigabyte in 2000 to three cents per gigabyte in 2015) renders the option of data lakes even more attractive. Finally, McKinsey believe the advent of Open Banking, which will make it possible to share huge customer data sets between acquiring and issuing banks, merchants and service providers via open APIs, makes data lakes seriously interesting.
Allen Pettis, EVP and Chief Customer Officer at TSYS, a provider of outsourced payments solutions, says his company runs data lakes on behalf of its clients. At present these lakes are operated within the company’s physical data warehousing facilities, rather than in the cloud. But TSYS does see the cloud as an integral part of the future for advanced analytics and insights. For Pettis, data lakes allow TSYS and its clients “to better understand everything that’s going on around customer interactions. We’re developing machine learning techniques that help our clients to identify customer insights from their data lakes. There’s no doubt that lakes — utilizing advanced analytics and data science — can offer richer insights to our clients than traditional data warehousing.”
However, Pettis cautions that data lakes are not necessarily cheaper than traditional data warehousing approaches, noting that the cost will depend on how much data is being stored, and the storage method (cloud or physical storage), as well as software development costs. In essence, although the unstructured nature of data lakes makes them faster and cheaper to set up than traditional data warehousing solutions, the management and analytics costs may quickly begin to rise, especially in modern payments, where terabytes of data are now generated each year from transactions across a wide range of payments methods.
Another cloud on the horizon for data lakes in the payments industry is the advent of General Data Protection Regulation (GDPR) legislation in a wide range of markets worldwide. Taking effect in May 2018 in the EU, and with “copycat” legislation now under consideration by markets such as Australia, Singapore, Brazil, Canada and at least four US states including New York and California, GDPR mandates that customers should have rights over the kind of data companies hold on them, and how that data is used. Such legislation would appear to require at least some level of sifting and sorting of customer data prior to storage to ensure that unpermitted data sets do not enter the storage ecosystem.
According to Amandeep Khurana, CEO and Co-founder of Okera, a data management company, the regulatory environment around data storage and management is “getting much more challenging now. Most organisations cannot adequately control access to data for compliance purposes while providing fast access to the high-quality data business users need. This can lead to compliance failures, vulnerability to breaches, low productivity, a lack of agility, higher costs for data management and general frustration among all stakeholders.”
Look before you leap
Because data lakes are unstructured, they present a unique problem from both regulatory and security perspectives. At the simplest level, companies may be unsure of exactly what is being held in their data lakes – a fact that can make “retrofitting” data lakes to suit future regulatory change a complex, if not impossible, task. With a structured database, any data field that runs contrary to regulatory requirements can be quickly deleted. By contrast, information must first be identified in a data lake, then isolated and purged – not necessarily a trivial task. Furthermore, hackers gaining access to a data lake would find a potential goldmine of fraud-enabling information – not just cardholder numbers, but names, addresses, account activity and customer histories.
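The contrast is easy to see in code. In a warehouse, removing a non-compliant field is one `DROP COLUMN`; in a lake, every heterogeneous record must be inspected. A minimal sketch of such a purge, assuming (purely for illustration) a lake of raw JSON records with varying schemas:

```python
import json

# Illustrative compliance purge over a lake of heterogeneous raw JSON
# records. The field name "email" and the record shapes are assumptions;
# real lakes also hold non-JSON payloads, making the job harder still.


def purge_field(record, field):
    """Recursively remove `field` wherever it appears in a nested record."""
    return {
        k: purge_field(v, field) if isinstance(v, dict) else v
        for k, v in record.items()
        if k != field
    }


raw_lake = [
    '{"txn_id": 1, "card": {"pan": "4111...", "email": "a@b.com"}}',
    '{"event": "login", "email": "a@b.com", "ip": "10.0.0.1"}',
]

# Every record is parsed and walked individually -- there is no single
# schema to consult, so the cost scales with the size of the lake.
cleaned = [purge_field(json.loads(r), "email") for r in raw_lake]
```

Even this toy version only works because the records happen to be parseable JSON; free-text or binary payloads would need separate detection logic for each format.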
Most consultants and industry experts agree that the answer lies in careful consideration of your company’s needs prior to initiating a data lake. Doug Wick, Chief Product and Marketing Officer at ALTR, a data protection and security company, argues that companies should engage a “data broker” to identify and analyse data prior to entry to the data lake, and that data should be tagged according to its origin to provide at least some level of identity for information inside the data lake. ALTR is developing protection solutions for data lakes, including the use of tokenisation and fragmentation via blockchain technology.
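A hedged sketch of the origin-tagging idea Wick describes: each payload is wrapped with provenance metadata before it enters the lake. The field names (`source`, `ingested_at`, `checksum`) are assumptions for illustration, not ALTR’s actual schema:

```python
import hashlib
import time

# Sketch of a "data broker" ingestion step: wrap each raw payload with
# provenance metadata so records inside the lake retain some identity.
# Field names are illustrative, not any vendor's real schema.


def tag_record(raw, source):
    """Attach origin, timestamp and an integrity checksum to a raw payload."""
    return {
        "source": source,                                          # originating system
        "ingested_at": time.time(),                                # ingestion time
        "checksum": hashlib.sha256(raw.encode()).hexdigest(),      # integrity tag
        "payload": raw,                                            # untouched raw data
    }


entry = tag_record('{"txn": 42}', source="pos-terminal-eu")
```

The payload itself stays raw – preserving the lake’s flexibility – while the wrapper gives later compliance or purge jobs something to search on.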
The application of blockchain to data storage is an intriguing development, since it would allow the source and type of data to be tagged quickly and effectively without the creation of complex rules and filters. ALTR’s blockchain data solution employs the same tokenisation techniques used by crypto-currencies such as Ripple and Ethereum. Using blockchain-enabled tagging, ALTR’s solution splits a company’s data up between different storage locations, then decrypts and reintegrates the data as needed for analytics and interpretation. ALTR’s solution is based on a private blockchain, helping to avoid the speed and energy use issues experienced by public blockchain systems.
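To make the fragment-and-reintegrate idea concrete, here is a drastically simplified sketch – no blockchain or encryption involved, and not ALTR’s actual design. A value is split character-wise across three separate stores, with a random token pointing at the fragments; no single store ever holds the whole value:

```python
import secrets

# Toy fragmentation/tokenisation sketch. A real system would encrypt the
# fragments and anchor the token map in a (private) ledger; everything
# here is an illustrative assumption.

stores = [{}, {}, {}]   # three separate storage locations
token_map = {}          # token -> list of (store_index, fragment_key)


def fragment_and_store(value):
    """Split `value` across the stores; return a token referencing it."""
    token = secrets.token_hex(8)
    pointers = []
    for i, store in enumerate(stores):
        frag_key = secrets.token_hex(8)
        store[frag_key] = value[i::len(stores)]   # every 3rd character
        pointers.append((i, frag_key))
    token_map[token] = pointers
    return token


def reassemble(token):
    """Fetch the fragments and interleave them back into the original."""
    frags = [stores[i][k] for i, k in token_map[token]]
    out = [""] * sum(len(f) for f in frags)
    for i, frag in enumerate(frags):
        out[i::len(stores)] = frag
    return "".join(out)


tok = fragment_and_store("4111111111111111")
```

A breach of any one store yields only an unusable slice of each value; reintegration requires the token map plus all three locations.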
Wick believes that most data lakes currently stored in physical warehouses will end up migrating to the cloud. “It’s essential to think about the future when it comes to the structure of your data lakes, and companies should create a lake which can be migrated to the cloud, since that’s the future of data storage.”
US payment processor Global Payments has taken exactly this approach, tokenising information stored in a data lake via the cloud. According to Mark Kubik, Vice President of Business Information at Global Payments, their system works only because they adopted a step-by-step approach to developing their data lake, consulting with customers at every opportunity: “One of our first drivers was customer feedback. We put the user interfaces out there so they could test it.” Kubik admits that the process was not simple, with the system’s first iteration “look[ing] like IT guys had designed it”, and that “it takes a long time to stand up infrastructure … we had to script everything.”
A final challenge to the concept of data lakes lies in the maturity of the analytics software packages used to interrogate each “lake.” Most data lakes run on Apache Hadoop, an open-source framework first released in 2006 but not settled until 2018. Given the need to tailor software to the specific requirements of each company, some may start to find development costs prohibitive. This will be even more the case now that TechGenie has listed Hadoop Developers as one of the ten most sought-after skills in technology for 2019, with salaries up by more than nine percent.
Despite these challenges, data lakes offer genuine benefits compared to traditional data warehousing – but only if set up and managed properly. Leaving aside the mooted cost advantages, which appear to be questionable, there’s no doubting the longer-term benefit of storing data in a format without limits on how the data can be used in the future, and sufficiently flexible to allow data to be used for everything from fraud management to new product development.
Payments companies looking to jump into data lakes should also be aware of current and future regulatory requirements – not least the longer-term trend towards protecting customer data from unscrupulous use, and the growing number of mandated security protocols for customer and transaction data.
McKinsey, BCG and other leading consultancy firms concur in advocating a multi-stage, firm-wide approach to structuring and integrating data lakes in a company’s operations. One interesting omission from McKinsey’s schematic, reproduced below, is the need to ensure data quality and integrity before storing in the data lake – something McKinsey’s own research recognises is a problem for more than 50 percent of the financial firms it spoke to in developing its White Paper.
In the final analysis, the jump in data volumes generated by digital payments and tremendous opportunities offered by new analytics methods mean that payments firms still using traditional data warehousing will have to rethink their approach. Negotiating the pitfalls presented by growing regulatory and security demands, to say nothing of rising development costs, will not be trivial. While the analytics opportunities presented by data lakes probably trump the challenges they present, creating and integrating a data lake that fits your company’s needs is not as simple as proponents of this approach would have us believe.
The post PCM Feature: Data Lakes gaining traction with payments businesses in Europe and North America appeared first on Payments Cards & Mobile.