Assessing What Matters in an EDR Solution
A brief history
Looking back at 2015, it’s hard to dispute that the security industry has been flooded with Endpoint Detection and Response (EDR) products. Walk the sponsor floor at any conference, or sample the white papers and marketing pitches on any vendor’s website, and you’ll see the same claims repeated ad nauseam: “Prevent, Detect, and Respond” at “enterprise scale” in “real time.” Throw in obligatory references to “anomaly detection,” “threat intelligence,” and “APTs” for good measure.
It’s no wonder that so many organizations struggle to down-select and evaluate vendors. Requests for Proposals yield vague or exaggerated responses. Demos and small proof-of-concept labs with staged intrusion scenarios can make any product look effective. Like any enterprise solution, EDR tools reveal their true strengths and weaknesses only over time, once fully deployed in real, complex, messy networks.
My role at Tanium might preclude me from claiming a truly vendor-neutral point of view; however, having spent over a decade as a consultant conducting security assessments, incident response investigations and remediation efforts, I continually try to remain mindful of the criteria that would have mattered most to me as a practitioner.
I’d like to focus here on three foundational attributes that impact any EDR solution’s effectiveness: the scope of data it provides, performance and scalability, and flexibility.
What scope of data does the solution provide?
The scope and timeliness of data that an endpoint product can search, analyze, or collect represents the absolute core of its capabilities. Nearly every EDR product now claims to provide “enterprise-wide search,” but the critical question is, “Of what information?” Many solutions make significant tradeoffs in the scope of data they make available in order to bolster scalability or performance. Imagine if Google only allowed users to search the content of sites it had indexed in the last 30 days; or conversely, if it provided an unlimited search timeframe, but only across the title tags of popular web pages.
Endpoint data can be roughly divided into two domains: historical and current-state. Why are both important? Products that continuously record endpoint telemetry, such as file I/O, network connections, process execution, log-on events, or registry changes, have become increasingly popular for incident detection and response. As I previously blogged during the launch of Tanium Trace, this capability can accelerate and simplify the effort required to triage a lead, generate alerts, or investigate a system. It preserves and enriches artifacts that might otherwise be lost to gaps in forensic evidence, and reduces the cadence of data retrieval needed to identify and retain short-lived events.
Yet relying exclusively on a sliding window of historical data comes with significant limitations, particularly when hunting at scale. Such solutions restrict both the timeframe of what’s retained and the breadth of data available for alerting and analysis. If it’s not recorded, you can’t find it.
To complement this narrower scope of information, effective incident detection and response also requires on-demand access to current-state data from all systems. That means being able to search for or collect volatile artifacts reflecting what is happening right now. I’ve encountered countless examples in investigations: Where is a compromised local administrator account currently logged in? What systems currently have a malicious DLL loaded in memory?
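To make the second question concrete, here is a minimal single-endpoint sketch in Python, assuming the psutil library is available and using a hypothetical indicator name ("evil.dll"); the point of an EDR platform is to ask this same question across tens of thousands of systems at once, not one machine at a time.

```python
# Minimal sketch: check one endpoint for a known-bad DLL currently loaded in memory.
# Assumes the psutil library is installed; "evil.dll" is a hypothetical indicator name.
import psutil

SUSPECT_DLL = "evil.dll"  # hypothetical module name to hunt for


def processes_with_module(module_name: str):
    """Yield (pid, process name) for every process with a matching module loaded."""
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            for mapping in proc.memory_maps():
                if mapping.path.lower().endswith(module_name.lower()):
                    yield proc.info["pid"], proc.info["name"]
                    break
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue  # skip processes we are not allowed to inspect


if __name__ == "__main__":
    for pid, name in processes_with_module(SUSPECT_DLL):
        print(f"{name} (PID {pid}) has {SUSPECT_DLL} loaded")
```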
Current-state data also encompasses latent artifacts that have not recently changed, or were out of scope for historical preservation, but may be crucial to scoping an incident. Consider the need to search across the environment for any type of file “at rest” by name or hash; a registry value that hasn’t been touched in a year; or a more esoteric forensic artifact like data in the WMI repository. What if you need to install the solution after systems have already been compromised? Working with a constrained scope of data inevitably leads to blind spots and investigative dead-ends.
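As an illustration of the “file at rest” case, a single-endpoint search by hash might look like the sketch below, again in Python; the known-bad hash value and the starting directory are hypothetical placeholders, and the real challenge is running the equivalent search across an entire environment in minutes rather than days.

```python
# Minimal sketch: scan one endpoint's file system for known-bad SHA-256 hashes.
# KNOWN_BAD_HASHES and the starting directory are hypothetical placeholders.
import hashlib
import os

KNOWN_BAD_HASHES = {
    "<sha256-of-known-bad-file>",  # hypothetical indicator of compromise
}


def sha256_of(path):
    """Return the SHA-256 of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    except OSError:
        return None  # locked, unreadable, or vanished file
    return digest.hexdigest()


def find_known_bad(root):
    """Yield paths under root whose hash matches a known-bad indicator."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if sha256_of(full_path) in KNOWN_BAD_HASHES:
                yield full_path


if __name__ == "__main__":
    for hit in find_known_bad(r"C:\Users"):
        print("Match:", hit)
```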
What is the solution’s performance and scalability?
Nearly every EDR vendor promises some variant of “real-time” speed that scales to “tens of thousands” of endpoints. In practice, it’s unfortunately easy for clever hand-waving to disguise a product’s true performance and scalability limitations, especially if an evaluation process is limited to small test labs. How can an organization ensure an EDR product performs well enough to meet its use cases? The key is to take a more holistic approach to assessing speed and scale.
First, consider the modes of interaction provided by the solution. Passive workflows include ad-hoc searching (“Where is this hash?”), detection and alerting (“Has an IOC hit or rule triggered?”), and data collection for anomaly analysis (“Obtain all autoruns, stack by frequency of occurrence”). Active workflows entail changing systems, be it enforcing a quarantine, killing a process, removing malware, or fixing a configuration vulnerability.
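As a concrete illustration of that last passive workflow, the sketch below stacks collected autorun entries by frequency of occurrence, assuming Python and a hypothetical CSV export with "hostname" and "image_path" columns; in a large environment, the rarest entries float to the top for analyst review.

```python
# Minimal sketch: "stack" collected autorun entries by frequency of occurrence.
# Assumes autoruns were already gathered into a CSV with hypothetical columns
# "hostname" and "image_path"; the rarest entries are the most interesting outliers.
import csv
from collections import Counter


def stack_autoruns(csv_path, limit=25):
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["image_path"].lower()] += 1
    # Sort ascending so low-frequency (anomalous) entries come first.
    return sorted(counts.items(), key=lambda item: item[1])[:limit]


if __name__ == "__main__":
    for image_path, count in stack_autoruns("autoruns_collected.csv"):
        print(f"{count:6d}  {image_path}")
```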
Next, overlay the scope of data made available to each of these modes of interaction. Does it include historical activity? Current activity? Latent files or other artifacts at rest? Finally, assess the performance and scalability of the solution along each of these sets of criteria. Some solutions may provide enterprise-wide access to a set of historical data, but are unable to easily work with current or latent data at scale (and vice versa). Do queries or actions take seconds for some tasks, and hours for others?
Organizations should also evaluate the infrastructure footprint and cost incurred to operate the solution at the desired level of performance and scale. For on-premises solutions that scale horizontally, maintenance costs rise and effectiveness declines over time. More servers mean more points of failure, and the need to re-architect and balance resource utilization as environments grow.
In contrast, cloud-based solutions need to govern the volume of data transmitted over the internet. This leads to reliance on client-side filtering and trigger mechanisms that can curtail the scope of endpoint data available on-demand or retroactively. Depending on your organization’s use cases, those concessions may be unacceptable.
How flexible is the platform?
We’ve already stressed that incident response requires fast, scalable access to a broad set of endpoint data. Every EDR tool is capable of working with “core” forensic evidence: process activity, file system metadata, network connections, account activity, OS-specific artifacts like the Windows registry, and so on. But just as attacker techniques rapidly evolve, so too do the sources of evidence introduced by new operating system updates, applications, and researcher discoveries. An EDR solution’s flexibility directly determines how quickly it can incorporate these new findings.
When comparing products, many organizations simply ask for a list of features and capabilities. I’d suggest going a step further to understand how the product has been updated in the past, and how it’s poised to continue maturing. That can include assessing the following:
- Ask to review the product’s change log for the past year to assess the pace of development. What types of new features have been added — and how quickly?
- Consider how the software is designed, and whether that lends itself to readily integrating new sources of data, or interacting with endpoints in new ways. How much control do customers have? What requires a vendor-supplied agent update?
- What is the state of the user community? Are other customers sharing capabilities that go beyond what’s “out-of-the-box”?
Finally, consider how thoroughly the product addresses the “Response” portion of “EDR.” Tactical remediation features, like killing a process or isolating a machine on the network, are commonplace among most tools in this space (though they may differ in scale or the ability to orchestrate such actions). But just as important, and often neglected, is whether the solution can actually protect systems, complement other preventative controls, and reduce endpoint attack surface. Key capabilities in this area include enforcing control over what is allowed to execute or communicate over the network, assessing and hardening security configuration settings, and maintaining patch levels for the OS and third-party applications. Simply put, if an EDR solution only makes you better at quickly detecting and responding to attacks, it’s not actually making your organization more resilient, and it’s not helping you break out of the cycle of re-compromise.
Testing, purchasing, and integrating any enterprise software is never an easy task. And each iteration of rip-and-replace for a product that failed to meet expectations brings significant operational risks and expenses. When considering an EDR solution, I hope that some of the points outlined in this blog post can help your organization form the right set of evaluation criteria and identify the product that best fits your needs.
Ryan Kazanciyan, Chief Security Architect
Interested in seeing Tanium in action? Schedule a one-to-one demo or attend our weekly webinar. Talk to our Tanium experts at our upcoming events.