Businesses pursuing a digital transformation that applies AI and analytics to their data often face a tough question: how much data will I need to manage? That's where building an AI data platform that can withstand great change is key.
It's a particularly difficult question because a successful AI strategy contains an element of positive feedback: if it works, the demand for more data can come as a surprise. The core benefit of deep learning, which has revolutionized AI over the past five years, is that more data delivers better decisions. Organizations therefore often find a growing need for more data to drive more accurate models.
This is where it pays off long term to consider carefully the AI data platform architecture and how it might scale from initially small to potentially very large, in terms of both performance and capacity. The ideal AI data platform will provide two routes forward, and everything in between:
- A performance route (think fast flash): scale out with more flash and storage-appliance horsepower to deliver it to applications.
- A capacity route: keep economics feasible whilst coping with large increases in data volumes.
Typically the candidates for your AI data platform come in two flavors: All Flash and Hybrid Flash + HDD. All Flash solutions tie you to the price of flash, which, whilst becoming more competitive, remains expensive as capacities get large.
Hybrid solutions, in principle, deliver the best of both worlds, but this is where special attention is needed to avoid trouble later.
Hierarchical storage management (HSM) has been the conventional way of managing hybrid systems at scale. With HSM, the primary storage pool is typically the one accessed by applications; based on some algorithm, data is migrated out to a low-cost tier whilst the metadata is kept in the primary pool. This means applications always get a seamless view of all the data, and costs are kept low as capacity spirals.
There are three problems with this approach:
- Administrators now must manage two completely different storage environments: one for primary storage and one for the lower tier
- End users experience very high latencies when their data is in the back-end tier. HSM requires the data to be pulled back into primary storage before the user's request can be satisfied (the sketch after this list illustrates the pattern), which often causes a lot of user frustration
- Data in the back-end layer might not be directly accessible. Often, if S3 is used as a back-end, the data staged out to it cannot be accessed directly because it is held in a proprietary format. This can be a major concern for organizations that need to maintain complete ownership of all their data; having to rely upon the front end to access all of it can be a non-starter
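To make the latency problem concrete, here is a minimal, purely illustrative Python sketch of the recall-on-read pattern that conventional HSM imposes. The directory names and timing output are hypothetical and are not tied to any particular product; real HSM systems track placement in the primary pool's metadata rather than in separate paths.

```python
import shutil
import time
from pathlib import Path

# Hypothetical tier locations, for illustration only.
PRIMARY = Path("/primary")   # fast flash pool visible to applications
ARCHIVE = Path("/archive")   # low-cost back-end tier

def read_file(relative_path: str) -> bytes:
    """Serve a read the way a conventional HSM does: if the file has been
    staged out, it must first be recalled into primary storage in full."""
    primary_copy = PRIMARY / relative_path
    if not primary_copy.exists():
        # The stub in primary storage points at the archive copy. The whole
        # file is copied back before a single byte reaches the application,
        # which is where the surprising latency comes from.
        archive_copy = ARCHIVE / relative_path
        start = time.monotonic()
        primary_copy.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(archive_copy, primary_copy)
        print(f"recalled {relative_path} in {time.monotonic() - start:.1f}s")
    return primary_copy.read_bytes()
```

The key point is the synchronous recall in the middle: the application simply asked to read a file, yet it waits for an entire stage-in before any data flows.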
For these reasons, DDN implemented our Hot Pools technology in EXAScaler. With Hot Pools, the system is set up with a Hot (Flash) pool and a larger Cool (HDD) pool. The Hot Pool is effectively a cache, and it can be very large. Applications move data in and out of flash under normal operation. When data is not frequently accessed, the Hot Pools technology *within the same filesystem and namespace* creates a copy on HDD and removes the copy in the flash layer. But wherever the data sits, it remains directly readable by applications, with no latency waiting for data staging. This all happens within one very scalable filesystem, so there is only one system to manage. Finally, all the data, wherever it resides, is held simply as files, not in some proprietary format.
So essentially, we have created a solution with all the benefits of HSM, but without the drawbacks. The penalty for accessing data from the HDD layer rather than the flash layer comes down entirely to the difference in device performance, so users are never surprised by higher-than-expected latencies. The underlying data movement technology is built natively into the filesystem and is also used to add data resilience and to create multiple data replicas for performance advantages. Hot Pools uses a file heat algorithm to select data for movement across pools: a sophisticated algorithm that retains some history of file accesses and can optimize for small files to always stay in cache.
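As a rough illustration of how a file-heat policy of this general kind might pick candidates for the Cool pool, here is a short Python sketch. The decay factor, heat threshold, and small-file cutoff are assumed values chosen for the example and do not reflect DDN's actual algorithm or tuning.

```python
from dataclasses import dataclass

# Hypothetical tuning values, for illustration only.
DECAY = 0.5                   # how quickly old accesses stop counting
HEAT_THRESHOLD = 4.0          # below this, a file is a candidate for the Cool pool
SMALL_FILE_BYTES = 64 * 1024  # small files stay in flash regardless of heat

@dataclass
class FileHeat:
    path: str
    size: int
    heat: float = 0.0

    def record_access(self) -> None:
        # Each access bumps the heat; periodic decay (below) lets it fade.
        self.heat += 1.0

    def decay(self) -> None:
        # Called once per policy interval, so older accesses still count,
        # but recent accesses count more.
        self.heat *= DECAY

def should_move_to_cool_pool(f: FileHeat) -> bool:
    """A file is moved to HDD only if it is cold and not a small file."""
    if f.size <= SMALL_FILE_BYTES:
        return False  # keep small files pinned in the flash pool
    return f.heat < HEAT_THRESHOLD
```

In the real system this selection happens inside the filesystem itself, so applications see a single namespace regardless of which pool currently holds the data.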
Combine Hot Pools with DDN's EXAScaler Appliances and you get something simpler to scale than anything else on the market: fewer controllers, fewer switches, and no complex back-end networks that make scaling a pain. DDN's AI data platforms provide the very best performance in the simplest packages, saving you power and administration pain and, with Hot Pools, keeping the economics attractive even as your storage volumes go through the roof.
Contact us if you’re ready to learn more!