Sparse matrix computations are critical for many applications in machine learning, computer vision, and scientific computing. However, optimizing sparse kernels, such as Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM), remains challenging because their performance is sensitive to input characteristics and to the high dimensionality of the scheduling search space. Specifically, this complexity arises from the interplay of factors such as matrix dimensions, sparsity patterns, sparse storage formats, hardware targets, and compiler-specific scheduling primitives, which together create a highly irregular and non-intuitive performance landscape. While prior work has introduced learned cost models to guide the selection of scheduling primitives, these cost models are typically kernel- and hardware-specific, and either require millions of training samples or depend heavily on expert-designed heuristics. In this work, we frame the optimization of sparse matrix kernels as a structured exploration problem and identify key limitations of prior work, including its inability to generalize across kernels and hardware, and to train cost models with limited data samples without relying on expert heuristics. We then propose a solution that automates the data collection effort for cost model training on emerging hardware accelerators. Our method augments a state-of-the-art (SOTA) framework with exploration-aware data sampling and multi-armed bandit-based active learning, enabling data-efficient fine-tuning with minimal manual intervention. Our experimental results demonstrate that these strategies substantially reduce reliance on large training datasets and expert heuristics, while achieving performance comparable to the SOTA.
@inproceedings{sudusinghe2025automated,title={Automated Data Selection for Efficient Cost Model Training to Optimize Sparse Matrix Kernels on Emerging Hardware Accelerators},author={Sudusinghe, Chamika and Gerogiannis, Gerasimos and Lenadora, Damitha and Block, Charles and Torrellas, Josep and Mendis, Charith},booktitle={The Exploration in AI Today Workshop at ICML 2025},year={2025},url={https://openreview.net/forum?id=GdKc5XcO9e}}
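For readers unfamiliar with the two kernels named in the abstract, their semantics are easy to state. A minimal SciPy/NumPy sketch (shapes and variable names are illustrative, not the paper's implementation):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Sparse operand S (64 x 32) in CSR; dense operands as ndarrays.
S = sp.random(64, 32, density=0.05, format="csr", random_state=0)
A = rng.standard_normal((32, 16))
B = rng.standard_normal((64, 16))
C = rng.standard_normal((32, 16))

# SpMM: sparse matrix times dense matrix -> dense result.
spmm_out = S @ A  # shape (64, 16)

# SDDMM: dense-dense product B @ C.T, sampled at the nonzero
# positions of S (only entries in S's sparsity pattern are kept).
sddmm_out = S.multiply(B @ C.T)  # sparse, pattern contained in S's
```

The scheduling search space the abstract refers to concerns how loops over these operations are tiled, reordered, and parallelized, not the mathematical definitions above.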
Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is challenging for two reasons: program performance is highly sensitive to variations in sparse inputs, and early-stage accelerators rely on expensive simulators. Therefore, ML-based cost models used for optimizing such programs on general-purpose hardware are often ineffective for early-stage accelerators, as they require large datasets for proper training. To this end, we introduce COGNATE, a novel framework that leverages inexpensive data samples from general-purpose hardware (e.g., CPUs) to train cost models, followed by few-shot fine-tuning on emerging hardware. COGNATE exploits the homogeneity of input features across hardware platforms while effectively mitigating heterogeneity, enabling cost model training with just 5% of the data samples needed by accelerator-specific models to achieve comparable performance. We conduct extensive experiments to demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47× (up to 5.46×) for SpMM and 1.39× (up to 4.22×) for SDDMM.
@inproceedings{sudusinghe2025cognate,title={{COGNATE}: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning},author={Sudusinghe, Chamika and Gerogiannis, Gerasimos and Lenadora, Damitha and Block, Charles and Torrellas, Josep and Mendis, Charith},booktitle={Forty-second International Conference on Machine Learning},year={2025},url={https://openreview.net/forum?id=EV0itGFjmm}}
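COGNATE's pretrain-then-fine-tune recipe can be caricatured with a toy cost model. The sketch below is purely illustrative (it is not COGNATE's architecture): a ridge-regression "cost model" is pretrained on plentiful CPU-labeled samples, then fine-tuned on a handful of accelerator samples by shrinking toward the pretrained weights.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ridge(X, y, w0=None, reg=1e-2):
    """Closed-form ridge regression; if w0 is given, shrink toward it
    (a crude stand-in for fine-tuning from pretrained weights)."""
    d = X.shape[1]
    target = np.zeros(d) if w0 is None else w0
    return np.linalg.solve(X.T @ X + reg * np.eye(d),
                           X.T @ y + reg * target)

# Plentiful "CPU" samples: schedule features -> measured cost.
X_cpu = rng.standard_normal((1000, 8))
w_cpu = rng.standard_normal(8)
y_cpu = X_cpu @ w_cpu

# Scarce "accelerator" samples from a related (slightly shifted) cost.
w_acc = w_cpu + 0.1 * rng.standard_normal(8)
X_acc = rng.standard_normal((20, 8))
y_acc = X_acc @ w_acc

w_pre = fit_ridge(X_cpu, y_cpu)                     # pretrain on cheap data
w_ft = fit_ridge(X_acc, y_acc, w0=w_pre, reg=1.0)   # few-shot fine-tune
w_scratch = fit_ridge(X_acc, y_acc, reg=1.0)        # accelerator-only baseline

X_test = rng.standard_normal((200, 8))
y_test = X_test @ w_acc
err_ft = np.mean((X_test @ w_ft - y_test) ** 2)
err_scratch = np.mean((X_test @ w_scratch - y_test) ** 2)
```

Because the CPU and accelerator cost functions share structure (the "homogeneity of input features" in the abstract), starting from the pretrained weights yields lower test error than fitting on the scarce accelerator samples alone.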
On-scalp digital printing of custom-designed, temporary-tattoo-like sensors represents a groundbreaking advancement in noninvasive brain-monitoring technologies, benefiting the fields of neuroscience, clinical diagnostics, and brain-computer interfaces (BCIs). Traditional electroencephalography (EEG) systems involve time-consuming manual electrode placement, conductive liquid gels, and cumbersome cables, which are prone to signal degradation and discomfort during prolonged use. Our approach overcomes these limitations by combining material innovations with non-contact, on-body digital printing techniques to fabricate e-tattoos that are self-drying, ultrathin, and compatible with hairy scalps. These skin-conformal EEG e-tattoo sensors enable comfortable, long-term, high-quality brain activity monitoring without the discomfort associated with traditional EEG systems. Using individual 3D head scans, custom sensor layout design, and a 5-axis microjet printing robot, we have created EEG e-tattoos with precise, tailored placement over the entire scalp. The inks for electrodes and interconnects have slightly different compositions to achieve low skin contact impedance and high bulk conductivity, respectively. After printing and self-drying, the inks form conductive, stretchable, and breathable thin films that ensure high signal fidelity, even over extended periods. This technology paves the way for noninvasive, high-performance, and user-friendly brain monitoring that will enhance both patient care and the understanding of the human brain. The broader significance of this technology lies in its potential applications beyond traditional EEG use. On-scalp printed ultrathin e-tattoos could play a pivotal role in developing BCIs for various industries, including prosthetics, virtual reality (VR), and human-robot teaming.
This work also opens the possibility of on-body digital manufacture of other types of e-tattoo devices in areas beyond the head, leading to large-area, skin-covered yet deformable and breathable functional e-tattoos.
@article{vasconcelos24scalp,author={{Scalco de Vasconcelos}, Luize and Yan, Yichen and Maharjan, Pukar and Kumar, Satyam and Zhang, Minsu and Yao, Bowen and Li, Hongbian and Duan, Sidi and Li, Eric and Williams, Eric and Tiku, Sandhya and Vidal, Pablo and Solorzano-Vargas, R. Sergio and Hong, Wen and Du, Yingjie and Liu, Zixiao and Iwane, Fumiaki and Block, Charles and Repetski, Andrew T. and Tan, Philip and Wang, Pulin and Martín, Martín G. and Millán, José del R. and He, Ximin and Lu, Nanshu},title={On-scalp printing of personalized electroencephalography e-tattoos},year={2024},month=dec,journal={Cell Biomaterials},url={https://doi.org/10.1016/j.celbio.2024.100004},issn={3050-5623},publisher={Elsevier}}
We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5× performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 nodes (or 65,536 cores) on NERSC Perlmutter.
@inproceedings{ranawaka24sc,author={Ranawaka, Isuru and Hussain, Md Taufique and Block, Charles and Gerogiannis, Gerasimos and Torrellas, Josep and Azad, Ariful},title={Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication},year={2024},month=nov,isbn={9798350352917},publisher={IEEE Press},url={https://doi.org/10.1109/SC41406.2024.00052},doi={10.1109/SC41406.2024.00052},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis},articleno={46},numpages={17},location={Atlanta, GA, USA},series={SC '24}}
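The TS-SpGEMM setting is easy to state on a single node; the paper's contribution lies in the distributed algorithm. A minimal SciPy sketch of the operand shapes and of the 1D row partitioning idea (communication, sparsity-aware tiling, and remote computation are deliberately elided; dimensions and rank count are illustrative):

```python
import scipy.sparse as sp

n, k = 4096, 64  # k << n: B is tall and skinny
A = sp.random(n, n, density=1e-3, format="csr", random_state=0)  # square, sparse
B = sp.random(n, k, density=1e-2, format="csr", random_state=1)  # tall-skinny, sparse

C = A @ B  # TS-SpGEMM: sparse @ sparse -> sparse, shape (n, k)

# 1D row partitioning: each of p "ranks" owns a block of A's rows
# and computes the corresponding row block of C (data movement elided).
p = 4
rows = n // p
blocks = [A[r * rows:(r + 1) * rows, :] @ B for r in range(p)]
C_dist = sp.vstack(blocks)
```

Stacking the per-rank row blocks reproduces the single-node result exactly, which is why the algorithmic work goes into scheduling which pieces of B each rank must fetch, not into the local multiplies themselves.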
Resource-constrained system-on-chips (SoCs) are increasingly heterogeneous with specialized accelerators for various tasks. Acceleration taxes due to control and data movement, however, diminish end-to-end speedups from hardware acceleration. Meanwhile, emerging workloads are increasingly task-diverse with several, potentially shared, fine-grained acceleration candidates. This motivates a paradigm of parallel and disaggregated acceleration. Compared to a monolithic accelerator, disaggregation provides higher flexibility, reuse, and utilization, but at the cost of higher control and data acceleration taxes. We propose a novel SoC architecture, Mozart, that enables efficient accelerator disaggregation by leveraging shared-memory to tame control and data acceleration taxes. To address the control tax, Mozart includes a lightweight, modular, and general accelerator synchronization interface (ASI). ASI eliminates the typical CPU-centric accelerator control in favor of a decentralized, uniform synchronization interface through shared-memory. This enables accelerators to directly and transparently synchronize with each other (or CPUs) using the same shared-memory interface as CPUs. To address the data tax, Mozart leverages the Spandex-FCS heterogeneous coherence protocol, which supports decentralized data movement and per-word coherence specialization. We demonstrate the first RTL implementation of Spandex-FCS and the first evaluation of its benefits for a heterogeneous SoC with fixed-function accelerators, running real-world applications with Linux. Mozart simultaneously enables, for the first time, (1) finer-grained acceleration than previously possible, (2) programmable and transparent composition of fine-grained, disaggregated accelerators, (3) efficient accelerator pipelining through shared-memory and decentralization, and (4) a performance-competitive disaggregated alternative to specialized monolithic accelerators. 
We demonstrate these capabilities of Mozart with a comprehensive one-of-a-kind evaluation of more than 70 hardware configurations prototyped on an FPGA employing various accelerators, running real-world applications on Linux, and a scalability analysis with up to 15 accelerators. We also present an analytical performance model to understand and explore system design choices and to validate the results.
@inproceedings{suresh24pact,author={Suresh, Vignesh and Mishra, Bakshree and Jing, Ying and Zhu, Zeran and Jin, Naiyin and Block, Charles and Mantovani, Paolo and Giri, Davide and Zuckerman, Joseph and Carloni, Luca P. and Adve, Sarita V.},title={Mozart: Taming Taxes and Composing Accelerators with Shared-Memory},year={2024},month=oct,isbn={9798400706318},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3656019.3676896},doi={10.1145/3656019.3676896},booktitle={Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques},pages={183–200},numpages={18},keywords={Accelerator Synchronization, Cache Coherence, Disaggregated Acceleration, Heterogeneous Systems, Shared-Memory},location={Long Beach, CA, USA},series={PACT '24}}
Touch-sensitive stretchable electronic skins (e-skins) hold promise for soft robots, prosthetics, bio-mimetics, and bio-sensors. However, a long-standing challenge has been the interference of stretching in pressure readings. Addressing this, we introduce an intrinsically stretchable hybrid response pressure sensor (SHRPS) composed of a laminate featuring a barely conductive porous nanocomposite and an ultrathin dielectric layer situated between two stretchable electrodes. The combined piezoresistive and piezocapacitive responses of the SHRPS enable ultrahigh pressure sensitivity while effectively negating stretch-induced interference. Our findings are underpinned by an experimentally validated electromechanical model. In practical applications, SHRPS mounted on inflatable probes demonstrated safe and precise palpation on the human wrist and conformable and firm gripping of contoured objects. The debut of SHRPS promises to significantly expand the versatile applications of e-skins.
@article{ha24shrps,title={Stretchable hybrid response pressure sensors},journal={Matter},volume={7},number={5},pages={1895-1908},year={2024},month=may,issn={2590-2385},doi={10.1016/j.matt.2024.04.009},url={https://www.sciencedirect.com/science/article/pii/S2590238524001644},author={Ha, Kyoung-Ho and Li, Zhengjie and Kim, Sangjun and Huh, Heeyong and Wang, Zheliang and Shi, Hongyang and Block, Charles and Bhattacharya, Sarnab and Lu, Nanshu},keywords={stretchable pressure sensor, electronic skin, decoupling of pressure and stretch, inflatable soft robots, human-computer interaction, heartbeat palpation}}
Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific computing to graph neural networks. Typically, when SpMM is executed in a distributed platform, communication costs dominate. Such costs depend on how communication is scheduled. If it is scheduled in a sparsity-unaware manner, such as with collectives, execution is often inefficient due to unnecessary data transfers. On the other hand, if communication is scheduled in a fine-grained sparsity-aware manner, communicating only the necessary data, execution can also be inefficient due to high software overhead. We observe that individual sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, we develop a model that partitions communication into sparsity-unaware and sparsity-aware components. Leveraging the partition, we develop a new algorithm that performs collective communication for the denser regions, and fine-grained, one-sided communication for the sparser regions. We call the algorithm Two-Face. We show that Two-Face attains an average speedup of 2.11× over prior work when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.
@inproceedings{block24twoface,author={Block, Charles and Gerogiannis, Gerasimos and Mendis, Charith and Azad, Ariful and Torrellas, Josep},title={Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM},year={2024},month=apr,isbn={9798400703850},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3620665.3640427},doi={10.1145/3620665.3640427},booktitle={Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},pages={1200–1217},numpages={18},keywords={high-performance computing, distributed algorithms, sparse matrices, SpMM},location={La Jolla, CA, USA},series={ASPLOS '24}}
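Two-Face's core decision, collective communication for denser regions and one-sided communication for sparser ones, can be caricatured as a density-threshold partition of the matrix. An illustrative sketch (the block size, threshold, and function name are arbitrary choices for this example; the paper's actual model is more involved):

```python
import scipy.sparse as sp

def partition_col_blocks(S, block=256, dense_thresh=0.02):
    """Classify column blocks of a sparse matrix as 'collective'
    (denser: fetched with collective communication) or 'one-sided'
    (sparser: entries fetched individually)."""
    S = S.tocsc()
    m, n = S.shape
    collective, one_sided = [], []
    for start in range(0, n, block):
        stop = min(start + block, n)
        density = S[:, start:stop].nnz / (m * (stop - start))
        (collective if density >= dense_thresh else one_sided).append((start, stop))
    return collective, one_sided

# A mostly sparse matrix with one artificially dense column region.
S = sp.random(1000, 1000, density=0.01, format="csr", random_state=0).tolil()
S[:, :100] = 1.0
S = S.tocsr()

coll, ones = partition_col_blocks(S, block=100)
```

On this input, only the dense leading block is routed to the collective path; everything else falls through to the fine-grained one-sided path, mirroring the hybrid scheduling idea in the abstract.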