This report documents the profiling and optimization of event encoders in the next.orly.dev/pkg/encoders/event package. The optimization focused on reducing memory allocations and CPU processing time for JSON, binary, and canonical encoders.
- JSON marshaling/unmarshaling
- Binary marshaling/unmarshaling
- Canonical encoding
- ID generation (canonical + SHA256)
- Round-trip operations
- Small and large event sizes
- CPU profiling (-cpuprofile)
- Memory profiling (-memprofile)
- Allocation tracking (-benchmem)
The profiling data revealed several key bottlenecks:
Memory allocations:
- text.NostrEscape: 3.95GB total allocations (45.34% of all allocations)
- event.Marshal: 1.39GB allocations
- event.ToCanonical: 0.22GB allocations

CPU time:
- text.NostrEscape: 4.39s (23.12% of CPU time)
- runtime.mallocgc: 3.98s (20.96% of CPU time)
- event.Marshal: 3.16s (16.64% of CPU time)
Problem: Multiple allocations from make([]byte, ...) calls and buffer growth during append operations.
Solution:
- Pre-allocate the output buffer using EstimateSize() when dst is nil

Code Changes (event.go):
```go
func (ev *E) Marshal(dst []byte) (b []byte) {
	b = dst
	// Pre-allocate buffer if nil to reduce reallocations
	if b == nil {
		estimatedSize := ev.EstimateSize()
		estimatedSize += 100 // JSON structure overhead
		b = make([]byte, 0, estimatedSize)
	}
	// ... rest of implementation
}
```
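The payoff of pre-allocation comes from avoiding append's repeated buffer growth. A minimal, self-contained sketch of that effect (the payload sizes below are made-up stand-ins for an event's fields, not the real encoder):

```go
package main

import (
	"fmt"
	"testing"
)

// payload simulates the fragments Marshal appends: field names, hex
// strings, and escaped content. The sizes are illustrative assumptions.
var payload = [][]byte{
	make([]byte, 64), make([]byte, 64), make([]byte, 256), make([]byte, 512),
}

var sink []byte // prevents the compiler from eliding the builds

// buildGrown appends into a nil slice, so the backing array is
// reallocated several times as it grows.
func buildGrown() []byte {
	var b []byte
	for _, p := range payload {
		b = append(b, p...)
	}
	return b
}

// buildPrealloc sizes the buffer up front, as the patched Marshal does
// via EstimateSize, so one allocation covers the whole output.
func buildPrealloc() []byte {
	n := 0
	for _, p := range payload {
		n += len(p)
	}
	b := make([]byte, 0, n)
	for _, p := range payload {
		b = append(b, p...)
	}
	return b
}

func main() {
	grown := testing.AllocsPerRun(1000, func() { sink = buildGrown() })
	pre := testing.AllocsPerRun(1000, func() { sink = buildPrealloc() })
	fmt.Println(grown > pre, pre == 1)
}
```

The pre-sized path costs exactly one allocation, matching the 1 allocs/op seen in the benchmark tables below.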
Results: 24% faster, 54% less memory, and 83% fewer allocations for small events (see the benchmark tables below).
Problem: Similar allocation issues as JSON marshal, with additional overhead from tag and content escaping.
Solution:
- Pre-allocate the output buffer from a field-by-field size estimate when dst is nil

Code Changes (canonical.go):
```go
func (ev *E) ToCanonical(dst []byte) (b []byte) {
	b = dst
	if b == nil {
		estimatedSize := 5 + 2*len(ev.Pubkey) + 20 + 10 + 100
		if ev.Tags != nil {
			for _, tag := range *ev.Tags {
				for _, elem := range tag.T {
					estimatedSize += len(elem)*2 + 10
				}
			}
		}
		estimatedSize += len(ev.Content)*2 + 10
		b = make([]byte, 0, estimatedSize)
	}
	// ... rest of implementation
}
```
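To see how the estimate's terms add up, here is a standalone sketch of the same formula applied to hypothetical field values; the helper name and input shapes are illustrative, not the package's API:

```go
package main

import "fmt"

// estimateCanonicalSize mirrors the estimate in ToCanonical: fixed JSON
// scaffolding plus doubled lengths to leave room for escaping. The field
// shapes (raw pubkey bytes, string tags) are assumptions for illustration.
func estimateCanonicalSize(pubkey []byte, tags [][]string, content []byte) int {
	// prefix, hex-encoded pubkey, timestamp, kind, structural overhead
	n := 5 + 2*len(pubkey) + 20 + 10 + 100
	for _, tag := range tags {
		for _, elem := range tag {
			n += len(elem)*2 + 10 // escaped element plus quotes/commas
		}
	}
	n += len(content)*2 + 10
	return n
}

func main() {
	pubkey := make([]byte, 32) // 32-byte key, hex-encodes to 64 chars
	tags := [][]string{{"e", "abcd"}, {"p", "ef01"}}
	content := []byte("hello world")
	fmt.Println(estimateCanonicalSize(pubkey, tags, content)) // prints 291
}
```

Doubling string lengths is deliberately pessimistic: it guarantees room even if every character needs escaping, trading a little capacity for zero reallocations.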
Results: 16% faster, 26% less memory, and 80% fewer allocations for small events (see the benchmark tables below).
Problem: varint.Encode writes one byte at a time, causing many small allocations. Also, nil tags were not handled explicitly.
Solution:
- Handle nil tags explicitly instead of calling Len() on a nil value
- Added a MarshalBinaryToBytes helper method that uses bytes.Buffer with pre-allocated capacity

Code Changes (binary.go):
```go
func (ev *E) MarshalBinary(w io.Writer) {
	// ... existing code ...
	if ev.Tags == nil {
		varint.Encode(w, 0)
	} else {
		varint.Encode(w, uint64(ev.Tags.Len()))
		// ... rest of tags encoding
	}
	// ... rest of implementation
}

func (ev *E) MarshalBinaryToBytes(dst []byte) []byte {
	// New helper method with pre-allocated buffer
	// ... implementation
}
```
Results:
MarshalBinary (nil check optimization)MarshalBinaryToBytes method provides better performance when bytes are needed directlyProblem: Always allocating tags slice even when nTags is 0.
Solution:
- Check for nTags == 0 and set ev.Tags = nil instead of allocating an empty slice

Code Changes (binary.go):
```go
func (ev *E) UnmarshalBinary(r io.Reader) (err error) {
	// ... existing code ...
	if nTags == 0 {
		ev.Tags = nil
	} else {
		ev.Tags = tag.NewSWithCap(int(nTags))
		// ... rest of tag unmarshaling
	}
	// ... rest of implementation
}
```
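A small demonstration of why the nil path matters, using a stand-in container type (the real tag.S definition is assumed, not copied):

```go
package main

import (
	"fmt"
	"testing"
)

// S stands in for the tag.S container; the real type is assumed to be
// a struct wrapping a slice of tags.
type S struct {
	T [][]string
}

var sink *S // forces the container to escape to the heap

func main() {
	// Old path: allocate a container even when the event has zero tags.
	withAlloc := testing.AllocsPerRun(1000, func() { sink = &S{} })
	// New path: leave ev.Tags nil for zero tags, costing nothing.
	withNil := testing.AllocsPerRun(1000, func() { sink = nil })
	fmt.Println(withAlloc > withNil) // prints true
}
```

For tag-free events the new path is a pure win; the trade-off, reflected in the benchmark table below, is that every event now pays for one extra branch.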
Results: the explicit nil check adds a slight time cost on unmarshal (see the binary operations table below) but avoids allocating an empty tags slice for tag-free events.

Small events:
| Operation | Metric | Before | After | Improvement |
|---|---|---|---|---|
| JSON Marshal | Time | 1758 ns/op | 1325 ns/op | 24% faster |
| JSON Marshal | Memory | 2232 B/op | 1024 B/op | 54% less |
| JSON Marshal | Allocations | 6 allocs/op | 1 allocs/op | 83% fewer |
| Canonical | Time | 1523 ns/op | 1272 ns/op | 16% faster |
| Canonical | Memory | 1208 B/op | 896 B/op | 26% less |
| Canonical | Allocations | 5 allocs/op | 1 allocs/op | 80% fewer |
| GetIDBytes | Time | 1739 ns/op | 1552 ns/op | 11% faster |
| GetIDBytes | Memory | 1240 B/op | 928 B/op | 25% less |
| GetIDBytes | Allocations | 6 allocs/op | 2 allocs/op | 67% fewer |
Large events:

| Operation | Metric | Before | After | Improvement |
|---|---|---|---|---|
| JSON Marshal | Time | 19751 ns/op | 17666 ns/op | 11% faster |
| JSON Marshal | Memory | 18616 B/op | 9472 B/op | 49% less |
| JSON Marshal | Allocations | 11 allocs/op | 1 allocs/op | 91% fewer |
| Canonical | Time | 19725 ns/op | 17903 ns/op | 9% faster |
| Canonical | Memory | 18616 B/op | 10240 B/op | 45% less |
| Canonical | Allocations | 11 allocs/op | 1 allocs/op | 91% fewer |
Binary operations:

| Operation | Metric | Before | After | Notes |
|---|---|---|---|---|
| Binary Marshal | Time | 347.4 ns/op | 297.2 ns/op | 14% faster |
| Binary Marshal | Allocations | 13 allocs/op | 13 allocs/op | No change (varint limitation) |
| Binary Unmarshal | Time | 990.5 ns/op | 1028 ns/op | Slight regression (nil check overhead) |
| Binary Unmarshal | Allocations | 32 allocs/op | 32 allocs/op | No change (varint limitation) |
Note: Binary operations are limited by the `varint` package which writes one byte at a time, causing many small allocations. Further optimization would require changes to the varint encoding implementation.
The most significant improvement came from reducing allocations: for small events, JSON marshal dropped from 6 allocs/op to 1 and canonical encoding from 5 to 1; for large events, both dropped from 11 allocs/op to 1.
This reduction has cascading benefits: less garbage-collector pressure, shorter GC pauses, and higher sustained throughput under load.
Pre-allocating buffers based on EstimateSize() proved highly effective: a single up-front allocation now covers an entire marshal in the common case, and EstimateSize() itself costs about 4 ns with zero allocations.
Varint encoding: the varint.Encode function writes one byte at a time, causing many small allocations. Optimizing this would require:
- Batch encoding into a temporary buffer
- Or refactoring the varint package to support batch writes

String escaping: by optimizing the text.NostrEscape function directly, we could:
- Pre-allocate the destination buffer based on a source size estimate
- Use a pool of buffers for repeated operations
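One way to implement such a pool, sketched with sync.Pool; the 1 KiB seed capacity and the helper's shape are assumptions, not existing package code:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool lends out byte slices so repeated encode operations amortize
// their allocations. The 1 KiB starting capacity is an assumption.
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 1024); return &b },
}

// withBuffer borrows a buffer, runs a marshal-style callback into it,
// and returns the buffer (possibly grown) to the pool. It returns only
// the output length, since the bytes become invalid after Put.
func withBuffer(marshal func(dst []byte) []byte) int {
	bp := bufPool.Get().(*[]byte)
	out := marshal((*bp)[:0]) // reuse capacity, start empty
	n := len(out)
	*bp = out[:0] // keep the possibly-grown backing array for next time
	bufPool.Put(bp)
	return n
}

func main() {
	n := withBuffer(func(dst []byte) []byte {
		return append(dst, "hello"...) // stands in for ev.Marshal(dst)
	})
	fmt.Println(n) // prints 5
}
```

The caveat is lifetime: pooled bytes must be fully consumed (written to a socket, hashed, copied) before the buffer is returned.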
Buffer reuse: when calling Marshal, ToCanonical, or MarshalBinaryToBytes repeatedly, consider reusing buffers:

```go
buf := make([]byte, 0, ev.EstimateSize()+100)
json := ev.Marshal(buf)
```

Binary encoding: consider refactoring the varint package or creating a specialized batch varint encoder for event marshaling.

The optimizations implemented significantly improved encoder performance: JSON marshaling is up to 24% faster with up to 54% less memory, canonical encoding is up to 16% faster, and most encode paths now cost a single allocation per operation instead of 6-11.
These improvements will reduce GC pressure and improve overall system throughput, especially under high load conditions. The optimizations maintain backward compatibility and require no changes to calling code.
Full benchmark output:
```
BenchmarkJSONMarshal-12          799773      1325 ns/op     1024 B/op     1 allocs/op
BenchmarkJSONMarshalLarge-12      68712     17666 ns/op     9472 B/op     1 allocs/op
BenchmarkJSONUnmarshal-12        538311      2195 ns/op      824 B/op    24 allocs/op
BenchmarkBinaryMarshal-12       3955064     297.2 ns/op       13 B/op    13 allocs/op
BenchmarkBinaryMarshalLarge-12   673252      1756 ns/op       85 B/op    85 allocs/op
BenchmarkBinaryUnmarshal-12     1000000      1028 ns/op      752 B/op    32 allocs/op
BenchmarkCanonical-12            835960      1272 ns/op      896 B/op     1 allocs/op
BenchmarkCanonicalLarge-12        69620     17903 ns/op    10240 B/op     1 allocs/op
BenchmarkGetIDBytes-12           704444      1552 ns/op      928 B/op     2 allocs/op
BenchmarkRoundTripJSON-12        312724      3673 ns/op     1848 B/op    25 allocs/op
BenchmarkRoundTripBinary-12      857373      1325 ns/op      765 B/op    45 allocs/op
BenchmarkEstimateSize-12      295157716     4.012 ns/op        0 B/op     0 allocs/op
```
Report generated: 2025-11-02