Chowdhury receives VMware Award to further research on cluster-wide memory efficiency
Chowdhury’s work has produced important results that can make memory in data centers both cheaper and more efficient.
Prof. Mosharaf Chowdhury has been awarded the VMware Early Career Award to further his research on memory disaggregation. The award provides a $50,000 unrestricted gift for his work, which aims to eventually treat memory scattered across many machines as one huge, virtual block of memory.
To date, Chowdhury’s work in the area of memory use for large-scale computing applications has produced important results that can make memory in data centers both cheaper and more efficient. He makes heavy use of a technology called Remote Direct Memory Access (RDMA) that can connect all the unused memory in a data cluster and allow fast reading and writing between machines.
One of his recent projects with Prof. Barzan Mozafari and PhD student Dong Young Yoon, an algorithm called “Decentralized and Starvation-free Lock management with RDMA” (DSLR), tackles the problem of granting many users simultaneous access to a rack’s shared resources. The researchers’ experiments showed that DSLR delivers up to 2.8X higher throughput than all existing RDMA-based algorithms, while reducing average and tail latencies by up to 2.5X and 47X, respectively.
This project built on Infiniswap, Chowdhury’s prior major project led by PhD student Juncheng Gu, which improved memory utilization in a rack or cluster by up to 47 percent. The software lets servers instantly borrow memory from other servers in the cluster when they run out, instead of writing to slower storage media such as disks. Disks are orders of magnitude slower than memory, and data-intensive applications often crash or halt when servers need to page.
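The borrowing idea can be illustrated with a toy sketch. This is not Infiniswap's actual design or code: the `Server` class, the `place_page` function, and the page counts are hypothetical, and the rule of picking the peer with the most free memory is an assumption made only for illustration. It shows the order of preference described above: local memory first, then a peer's spare memory, and disk only as a last resort.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical server with a fixed number of memory pages."""
    name: str
    capacity_pages: int
    used_pages: int = 0

    def free_pages(self) -> int:
        return self.capacity_pages - self.used_pages


def place_page(local: Server, peers: list[Server]) -> str:
    """Decide where one new page goes: local memory, a peer's memory, or disk."""
    if local.free_pages() > 0:
        local.used_pages += 1
        return "local"
    # Local memory is full: try to borrow from the peer with the most free memory
    # (an illustrative policy, not necessarily the one Infiniswap uses).
    donor = max(peers, key=Server.free_pages, default=None)
    if donor is not None and donor.free_pages() > 0:
        donor.used_pages += 1
        return f"remote:{donor.name}"
    # No spare memory anywhere in the cluster: fall back to slow storage.
    return "disk"


if __name__ == "__main__":
    local = Server("a", capacity_pages=2)
    peers = [Server("b", capacity_pages=1), Server("c", capacity_pages=3, used_pages=3)]
    print([place_page(local, peers) for _ in range(4)])
```

In this toy run, the first two pages fit locally, the third is placed in server `b`'s spare memory, and only the fourth falls back to disk, once no server has memory left.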
Chowdhury expanded on the work further with a recent NSF CAREER Award, which supported his work on several unsolved problems facing memory sharing at the host level, network level, and end-to-end. The key challenges included bridging the latency gaps between RDMA and local memory access; addressing network-wide fault-tolerance, load imbalance, and performance isolation issues; scaling with the size of data centers and number of applications being run; and coexisting with other infrastructures.