@aras I would still go for a two-pass approach here, conceptually:
for (chunks of N elements) {
    for (groups of 4 streams) {
        read and interleave N values from each of the 4 streams, store to stack
    }
    for (elements in chunk) {
        read and interleave groups of 4 streams from stack, sum into running total, store to dest
    }
}
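
To make the loop structure concrete, here is a rough scalar sketch of the same two passes. It is not the actual implementation. Everything named here is an assumption on my part: the function name `unfilter_two_pass`, the constants `kChunk` and `kMaxStride`, the layout (`stride` byte streams stored back to back in `src`, each `count` bytes long), and the idea that each stream is delta-coded on its own starting from 0. A real version would do the 4-stream interleave with SIMD unpack/transpose instructions instead of the inner byte loops.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <algorithm>

// Hypothetical layout: stream s lives at src[s * count .. (s + 1) * count) and holds
// the delta-coded byte s of every element. Each stream is assumed to be delta-coded
// independently, starting from 0. Returns false for strides this sketch does not handle.
bool unfilter_two_pass(const uint8_t* src, uint8_t* dst, size_t count, size_t stride)
{
    constexpr size_t kChunk = 64;     // N: elements processed per chunk
    constexpr size_t kMaxStride = 64; // cap so the scratch buffer stays small (4 KB)
    if (stride == 0 || stride % 4 != 0 || stride > kMaxStride)
        return false;

    uint8_t scratch[kChunk * kMaxStride]; // the "stack" buffer from the pseudocode
    uint8_t total[kMaxStride] = {};       // running total, one byte per stream

    const size_t groups = stride / 4;
    for (size_t base = 0; base < count; base += kChunk) {
        const size_t n = std::min(kChunk, count - base);

        // Pass 1: for each group of 4 streams, interleave n values into scratch.
        // Group g owns scratch[g * kChunk * 4 ...], 4 bytes per element.
        for (size_t g = 0; g < groups; ++g) {
            uint8_t* out = scratch + g * kChunk * 4;
            for (size_t i = 0; i < n; ++i)
                for (size_t k = 0; k < 4; ++k)
                    out[i * 4 + k] = src[(g * 4 + k) * count + base + i];
        }

        // Pass 2: for each element, gather its 4-byte groups from scratch,
        // add them to the running total (wrapping at 8 bits), store to dest.
        for (size_t i = 0; i < n; ++i) {
            uint8_t* d = dst + (base + i) * stride;
            for (size_t g = 0; g < groups; ++g) {
                const uint8_t* in = scratch + g * kChunk * 4 + i * 4;
                for (size_t k = 0; k < 4; ++k) {
                    total[g * 4 + k] = uint8_t(total[g * 4 + k] + in[k]);
                    d[g * 4 + k] = total[g * 4 + k];
                }
            }
        }
    }
    return true;
}
```

The point of the split is that pass 1 reads each stream sequentially in runs of N bytes, and pass 2 only touches the small scratch buffer plus the destination, so neither pass has to juggle all streams and the output at once.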