Dissertation Defense

Efficient Data Center Architectures Using Non-Volatile Memory and Reliability Techniques

David Andrew Roberts

The cost of running a data center is increasingly dominated by energy consumption, driven by power provisioning, cooling, and server components such as processors, memories, and disk drives. This problem is made worse by the continued growth of data capacity and of the amount of data accessed by each workload. Meanwhile, emerging classes of complex data center workloads place a heavier burden on processing and storage hardware. Fortunately, emerging technologies promise improved efficiency. 3D die-stacking can increase I/O bandwidth and performance while reducing access energy for modern I/O-intensive data center workloads. Integration of non-volatile (NV) memories for applications such as disk caches can save significant energy. In recent developments, byte-addressable persistent storage such as phase-change memory (PCM) or Memristors can serve as both main memory and permanent storage, reducing layers of hierarchy and data transfer energy.

Further, because the CPU often dominates system power consumption, CPU energy-saving schemes have a significant impact on overall energy use. Unfortunately, current processor architectures cannot fully exploit voltage scaling: safety margins must be maintained, and large caches fail at higher voltages than logic circuits. We therefore also propose solutions to address these issues.

In this thesis, we improve the efficiency of data centers and servers via the following novel techniques:

1) We propose a distributed, energy-efficient data center architecture, replacing hard disk drives and DRAM main memory with non-volatile Memristors or PCM. The system is composed of a network of uniform building blocks called Nanostores that combine processors with a permanent data store. To reduce unnecessary data movement, DRAM and disk layers are eliminated, resulting in a flattened memory hierarchy.

2) Because NV memories wear out after a limited number of writes, we propose novel wear-leveling solutions. First, we propose distributed data center wear-leveling to address SSD-based and future Nanostore-based storage, with a 3.9x improvement in lifetime. Second, we propose server-level reliability improvements for Flash-memory-based disk caches that provide a 20x improvement in lifetime on average.

3) We propose a novel on-chip cache fault tolerance scheme that allows more than a 30% improvement in energy efficiency.
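One common way such schemes work (shown here as an illustrative sketch, not the specific scheme in the thesis) is to keep a fault map of SRAM lines that fail at a reduced supply voltage and exclude them from allocation; sacrificing a few ways per set lets the whole cache run at a lower, more energy-efficient voltage. All names below are hypothetical.

```python
# Illustrative fault-tolerant set-associative cache: lines known to fail
# at low supply voltage are marked in a fault map and never allocated,
# so the remaining lines can operate at the lower voltage.

class FaultTolerantCache:
    def __init__(self, num_sets, ways, faulty_lines):
        # faulty_lines: set of (set_index, way) pairs that fail at low Vdd
        self.num_sets = num_sets
        self.ways = ways
        self.faulty = set(faulty_lines)
        self.lines = {}                               # (set, way) -> tag
        self.lru = {s: [] for s in range(num_sets)}   # ways, MRU first

    def usable_ways(self, s):
        return [w for w in range(self.ways) if (s, w) not in self.faulty]

    def access(self, addr):
        """Return True on a hit, False on a miss (filling the line)."""
        s = addr % self.num_sets
        tag = addr // self.num_sets
        for w in self.usable_ways(s):
            if self.lines.get((s, w)) == tag:
                self.lru[s].remove(w)
                self.lru[s].insert(0, w)
                return True
        # Miss: fill an empty usable way, else evict the LRU usable way.
        free = [w for w in self.usable_ways(s) if (s, w) not in self.lines]
        victim = free[0] if free else self.lru[s][-1]
        if victim in self.lru[s]:
            self.lru[s].remove(victim)
        self.lines[(s, victim)] = tag
        self.lru[s].insert(0, victim)
        return False
```

The trade-off is visible directly: with a way disabled, two addresses mapping to the same set evict each other, whereas the fault-free cache holds both. The energy win comes from the voltage reduction this tolerance enables.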

Sponsored by

Trevor N. Mudge