This guest blog is penned by Ben Harden, Managing Director at CapTech Consulting, a Cloudera partner specializing in helping customers use data to make better decisions and drive innovation.
The highly regulated financial services industry demands a flexible data platform that combines the ability to store, mine and explore large amounts of data while maintaining strong security and governance capabilities to protect the organization. Cloudera Enterprise’s ability to meet these requirements is unique in the big data market.
CapTech worked with a Fortune 500 financial services firm that had built its own data repository filled with rich data (website traffic, audio calls, credit card and ACH payments, credit scoring, statistical model output and online transactions) but did not have the platforms in place to let data scientists and business users catalog, collaborate on and mine metadata. In addition to providing rich data for the data science community, the firm needed a repository that could comply with regulatory reporting requirements such as Basel reporting, AML reporting and Dodd-Frank. Working together with the client, Cloudera and CapTech implemented the client's vision for an enterprise data hub (EDH), built on Cloudera Enterprise, to address these requirements.
The customer wanted to understand the data ingested into its environment at a deeper level: Was it sensitive information? Was it considered high risk? What were the business definitions for the data? The customer also wanted data stewards to be able to review and approve business metadata to ensure its quality.
Solution
To realize this vision, the client chose Cloudera for its well-known security and governance features, including its standing as the only Hadoop distribution to have passed a compliance audit. The client engaged CapTech to help implement a custom metadata registry with Cloudera Navigator, the leading data management and governance platform for Hadoop. The registry was designed as a governance solution that catalogs business, operational, and technical metadata in a standard format that data consumers can search, segment and understand. In addition to business metadata, the solution automatically captured the technical schema and the movement and transformation of data through the ecosystem. This gave data analysts full attribute-level lineage of data from ingest through egress.
To ensure the integrity of the captured metadata, CapTech created a data governance workflow and notification engine that let data owners review and approve business metadata. CapTech also enabled real-time notifications when requirements for high-risk data were unmet or unreviewed. This empowered the client to protect and preserve its enterprise data assets and to reduce the risk of the reputational damage that comes with increasingly common data breaches.
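As a rough illustration of the review-and-notify idea described above, the following Python sketch models a metadata record that must be approved by its owner and flags high-risk records that are still unapproved. It is not CapTech's implementation and does not use any Cloudera Navigator API; every name in it (`MetadataRecord`, `ReviewStatus`, `approve`, `pending_high_risk_alerts`) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class MetadataRecord:
    """Hypothetical business-metadata record awaiting owner review."""
    dataset: str
    business_definition: str
    high_risk: bool
    owner: str
    status: ReviewStatus = ReviewStatus.PENDING


def approve(record: MetadataRecord, reviewer: str) -> None:
    """Mark a record approved; only the data owner may approve it."""
    if reviewer != record.owner:
        raise PermissionError(f"{reviewer} is not the owner of {record.dataset}")
    record.status = ReviewStatus.APPROVED


def pending_high_risk_alerts(records: list[MetadataRecord]) -> list[str]:
    """Return one alert message per high-risk record that is not yet approved."""
    return [
        f"ALERT: high-risk dataset '{r.dataset}' awaits review by {r.owner}"
        for r in records
        if r.high_risk and r.status is not ReviewStatus.APPROVED
    ]
```

In this sketch a scheduler or ingest hook (also an assumption) would call `pending_high_risk_alerts` and forward the messages to whatever notification channel is in use. Once the owner calls `approve`, the record no longer produces an alert.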
The solution was designed and built to work directly with Cloudera Enterprise and used Cloudera Navigator as the underlying metadata repository. CapTech built on Cloudera Navigator features such as data lineage and the searchable metadata repository, and extended them through the Cloudera Navigator APIs so that the application could capture all of the client-specific custom metadata within the Cloudera Navigator repository. Storing all the metadata together in Cloudera Navigator made rich search across all metadata straightforward, so users could easily see the entire business, technical, and operational picture for a given dataset.
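To make the benefit of a single repository concrete, here is a minimal in-memory sketch of a catalog entry that holds business, technical, and operational metadata side by side, so one search term can match any of them. It is an illustration only: `CatalogEntry` and `search` are hypothetical names, and the code does not call or mimic the actual Cloudera Navigator APIs.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry combining three kinds of metadata."""
    name: str
    business: dict = field(default_factory=dict)
    technical: dict = field(default_factory=dict)
    operational: dict = field(default_factory=dict)

    def matches(self, term: str) -> bool:
        """Case-insensitive substring match on the name or any key or value."""
        needle = term.lower()
        if needle in self.name.lower():
            return True
        for facet in (self.business, self.technical, self.operational):
            for key, value in facet.items():
                if needle in key.lower() or needle in str(value).lower():
                    return True
        return False


def search(catalog: list[CatalogEntry], term: str) -> list[str]:
    """Return the names of all entries matching the term, in catalog order."""
    return [entry.name for entry in catalog if entry.matches(term)]
```

Because all three kinds of metadata live on the same entry, a query for a business term such as a risk classification and a query for a technical term such as a file format go through the same search path.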
Results
The resulting application gave the client a holistic Hadoop metadata management solution that satisfied all of its enterprise metadata requirements. The application captured metadata at ingest time for more than 20,000 files per day, giving data scientists a single catalog of metadata for all data stored in the ecosystem. The notification system delivered near real-time alerts about activity in the EDH. Together, these capabilities enabled the customer to better understand and secure its data environment.
Most importantly, the solution has laid a foundation for the future, giving the client the option to move sensitive data into the EDH, which is expected to become its platform for high-risk data needs such as regulatory reporting and statistical modeling. This will save the client time and money, and open up future opportunities to address customer needs through a better understanding of its data.