Using Data Analytics for Loss Prevention

Jonathan Lowsley, CIO, ADrive

Every business has key performance indicators that measure its success. Across verticals, a variety of industry-specific metrics are used to gauge performance. Profitability, however, could be considered the universal metric of success and the primary defining factor of business performance.

While the obvious way to increase this performance is to increase revenue, the flip side, reducing cost, is just as important. Every CxO focuses on operating within a budget, delivering the systems infrastructure, and achieving the technology goals established by the needs of the business. Finding creative ways to decrease cost is vital in any business role, but perhaps the most overlooked way to positively affect performance is to engage in “cost validation.” By this I mean a hybrid of decreasing cost and increasing revenue, achieved by closely accounting for the services and infrastructure consumed by internal departments and external customers.

Before the advent of Big Data, the consumer data storage industry thrived on a model of oversubscription. Businesses could count on departments and customers using less of the product or service than they were being charged for. However, due to the data-intensive nature of cloud infrastructure and the rapidly growing rate of consumer data generation, the oversubscription model is less and less profitable. Customers and departments are quickly creating enormous amounts of data, and they are interested in paying only for what they use and, more importantly, using what they pay for.

You spent millions of dollars on your storage infrastructure, and a large portion of it is slipping away. Just as retail stores have departments dedicated to theft, loss prevention, and inventory control, IT departments and service providers need to think in the same manner. We have quota systems and auditing in place, but in our experience this is not enough to guarantee that all the bytes are accounted for and paid for. Instead of throwing hardware at a growing data problem, businesses need first and foremost to validate usage and be able to account for it.

Companies now employ data scientists to help make sense of patterns and identify data sinks in the infrastructure, as well as duplicated and orphaned data. Leveraging an in-house data scientist or a senior software engineer from your development team will offer immediate returns. What if they told you they could recover 100 terabytes of your storage, alleviating the need for costly expansions to your platform? What if they found 100 virtual machines that weren’t being billed to a department properly? This is all too common in large enterprises, where IT systems teams are responsible for petabytes of exponentially multiplying data sets while relying on a somewhat static employee base.
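As a rough illustration of the kind of pass a data scientist or engineer might start with, the sketch below flags duplicated and orphaned data. It is a minimal example under stated assumptions, not a description of any particular platform: the function names, the in-memory (path, content) input, and the owner-to-account mapping are all hypothetical, and a real system would stream file contents from storage rather than hold them in memory.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group files by SHA-256 content digest.

    `files` is an iterable of (path, content_bytes) pairs (a hypothetical
    input shape chosen for this sketch). Returns a dict mapping each digest
    that occurs more than once to its sorted paths, plus the number of bytes
    that could be reclaimed by keeping a single copy of each group.
    """
    paths_by_digest = defaultdict(list)
    size_by_digest = {}
    for path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        paths_by_digest[digest].append(path)
        size_by_digest[digest] = len(content)

    duplicates = {
        digest: sorted(paths)
        for digest, paths in paths_by_digest.items()
        if len(paths) > 1
    }
    reclaimable = sum(
        size_by_digest[digest] * (len(paths) - 1)
        for digest, paths in duplicates.items()
    )
    return duplicates, reclaimable


def find_orphans(file_owners, active_accounts):
    """Return the sorted paths whose owner is not an active account.

    `file_owners` maps path -> owner id; `active_accounts` is any iterable
    of owner ids. Both are assumptions made for illustration.
    """
    active = set(active_accounts)
    return sorted(
        path for path, owner in file_owners.items() if owner not in active
    )
```

Hashing content rather than comparing file names means renamed copies are still caught, and the reclaimable figure gives a first estimate of how much capacity could be recovered before buying more hardware.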

For industries using a “Freemium” pricing strategy, making sure that users are not becoming abusers is essential. By looking for patterns in what is being stored, how it is being stored, and user metadata, we can start to identify anomalies and focus on finding users who violate the acceptable use policy. It is crucial to identify these accounts, because if left unchecked they can chew through the usable space on your storage platforms. A single search identifier is not accurate enough, but combining multiple identifiers greatly increases the certainty of the findings. IP address analysis is a great example of this. As a single identifier it’s unreliable, because many unrelated users can share one public address behind IPv4 NAT overload (also known as PAT), but in combination with file metadata or email address similarities, we can start to uncover data hot spots and abusive clients. Collecting application event timestamps, email addresses (including plus-sign-delimited strings in the local part), IP addresses, customer names, file names, file sizes, file types, password hashes, client-specific identifiers (such as the browser user agent), and various other pieces of metadata is just the first step in being able to draw correlations between user groups and data sinks.
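The idea of combining identifiers can be sketched in a few lines. The example below is a simplified illustration, not a production detector: the account record fields (`id`, `email`, `ip`, `user_agent`), the function names, and the threshold of two shared identifiers are assumptions chosen for the example. It strips the plus-sign tag from the local part of each email address and then reports pairs of accounts that match on at least two identifiers, so a shared IP address alone is never enough to flag anyone.

```python
from itertools import combinations


def normalize_email(address):
    """Lowercase an address and drop any plus-sign tag in the local part.

    For example, "Bob+promo1@Example.com" becomes "bob@example.com".
    """
    local, _, domain = address.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"


def correlated_accounts(accounts, min_shared=2):
    """Return account pairs that share at least `min_shared` identifiers.

    `accounts` is a list of dicts with the hypothetical keys "id", "email",
    "ip", and "user_agent". Each result is (id_a, id_b, shared_identifiers).
    The pairwise comparison is quadratic and only suitable for illustration.
    """
    results = []
    for a, b in combinations(accounts, 2):
        shared = []
        if normalize_email(a["email"]) == normalize_email(b["email"]):
            shared.append("email")
        if a["ip"] == b["ip"]:
            shared.append("ip")
        if a["user_agent"] == b["user_agent"]:
            shared.append("user_agent")
        if len(shared) >= min_shared:
            results.append((a["id"], b["id"], shared))
    return results
```

With three accounts behind the same public address, only the two whose emails collapse to the same normalized form would be reported; the third shares nothing but the IP and is left alone. At real scale, the same logic would more likely be expressed as grouping on normalized keys than as a pairwise loop.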

When it comes to your applications, don’t skimp on the metadata.  As successful DevOps collaboration becomes a business priority, skilled software architects and systems engineers are coming up with creative ways to handle application metadata collection without compromising performance.  With the analytical tools and techniques available to us in the Big Data landscape, we should adopt a “more data is better” philosophy and let the data scientists sort it out.

Moving forward, it’s essential to look to data analytics as a key part of data accounting and cost validation. Using analytical tools and complex search algorithms to crunch Big Data is necessary for understanding how the user base is utilizing your infrastructure and, ultimately, for protecting your data storage resource pools. This in turn improves the efficiency of the systems infrastructure, leading to reduced cost and higher business performance.