
# Event Encoder Performance Optimization Report

## Executive Summary

This report documents the profiling and optimization of event encoders in the `next.orly.dev/pkg/encoders/event` package. The work focused on reducing memory allocations and CPU time in the JSON, binary, and canonical encoders.

## Methodology

### Profiling Setup

1. Created comprehensive benchmark tests covering:
   - JSON marshaling/unmarshaling
   - Binary marshaling/unmarshaling
   - Canonical encoding
   - ID generation (canonical + SHA256)
   - Round-trip operations
   - Small and large event sizes
2. Used Go's built-in profiling tools:
   - CPU profiling (`-cpuprofile`)
   - Memory profiling (`-memprofile`)
   - Allocation tracking (`-benchmem`)

### Initial Findings

The profiling data revealed several key bottlenecks:

1. JSON Marshal: 6 allocations per operation, 2232 bytes allocated
2. Canonical Encoding: 5 allocations per operation, 1208 bytes allocated
3. Memory Allocations: Primary hotspots identified:
   - `text.NostrEscape`: 3.95 GB total allocations (45.34% of all allocations)
   - `event.Marshal`: 1.39 GB allocations
   - `event.ToCanonical`: 0.22 GB allocations
4. CPU Processing: Primary hotspots:
   - `text.NostrEscape`: 4.39 s (23.12% of CPU time)
   - `runtime.mallocgc`: 3.98 s (20.96% of CPU time)
   - `event.Marshal`: 3.16 s (16.64% of CPU time)

## Optimizations Implemented

### 1. JSON Marshal Optimization

Problem: Multiple allocations from `make([]byte, ...)` calls and from buffer growth during append operations.

Solution:

Code Changes (`event.go`):

```go
func (ev *E) Marshal(dst []byte) (b []byte) {
	b = dst
	// Pre-allocate buffer if nil to reduce reallocations
	if b == nil {
		estimatedSize := ev.EstimateSize()
		estimatedSize += 100 // JSON structure overhead
		b = make([]byte, 0, estimatedSize)
	}
	// ... rest of implementation
}
```
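For context, here is a minimal sketch of what a size estimator like `EstimateSize()` might compute. The `event` type and field set below are illustrative stand-ins, not the package's actual `E` definition:

```go
package main

import "fmt"

// event is a simplified stand-in for the package's E type; the field
// set here is illustrative, not the real definition.
type event struct {
	ID, Pubkey, Sig []byte
	Content         []byte
	Tags            [][]string
}

// estimateSize approximates the marshaled JSON length: binary fields
// render as hex (two output bytes per input byte), plus tag elements,
// content, and a little per-element structural overhead.
func (ev *event) estimateSize() int {
	n := 2*len(ev.ID) + 2*len(ev.Pubkey) + 2*len(ev.Sig) + len(ev.Content)
	for _, tag := range ev.Tags {
		for _, elem := range tag {
			n += len(elem) + 4 // quotes, comma, bracket overhead
		}
	}
	return n
}

func main() {
	ev := &event{
		ID:      make([]byte, 32),
		Pubkey:  make([]byte, 32),
		Sig:     make([]byte, 64),
		Content: []byte("hello"),
	}
	fmt.Println(ev.estimateSize())
}
```

An estimator like this runs in nanoseconds with zero allocations (compare `BenchmarkEstimateSize` in the results below), so it is cheap to call before every marshal.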

Results: 24% faster, with 54% less memory and 83% fewer allocations for small events (see Performance Comparison).

### 2. Canonical Encoding Optimization

Problem: The same allocation pattern as JSON marshal, with additional overhead from tag and content escaping.

Solution:

Code Changes (`canonical.go`):

```go
func (ev *E) ToCanonical(dst []byte) (b []byte) {
	b = dst
	if b == nil {
		estimatedSize := 5 + 2*len(ev.Pubkey) + 20 + 10 + 100
		if ev.Tags != nil {
			for _, tag := range *ev.Tags {
				for _, elem := range tag.T {
					estimatedSize += len(elem)*2 + 10
				}
			}
		}
		estimatedSize += len(ev.Content)*2 + 10
		b = make([]byte, 0, estimatedSize)
	}
	// ... rest of implementation
}
```
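As a concrete check of the arithmetic above, this sketch mirrors the estimator for a hypothetical event with a 32-byte pubkey, two 2-element tags of 5-byte strings, and 100 bytes of content (the `*2` terms budget for worst-case escaping, which can double the escaped length):

```go
package main

import "fmt"

// estimateCanonical mirrors the pre-allocation arithmetic from
// ToCanonical: fixed structural overhead, worst-case doubled tag
// elements, and worst-case doubled content.
func estimateCanonical(pubkeyLen int, tags [][]string, contentLen int) int {
	n := 5 + 2*pubkeyLen + 20 + 10 + 100
	for _, tag := range tags {
		for _, elem := range tag {
			n += len(elem)*2 + 10
		}
	}
	return n + contentLen*2 + 10
}

func main() {
	tags := [][]string{{"abcde", "fghij"}, {"klmno", "pqrst"}}
	fmt.Println(estimateCanonical(32, tags, 100))
}
```

Overshooting slightly is fine here: a too-large capacity wastes a few bytes once, while a too-small one forces the append path to reallocate and copy.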

Results: 16% faster, with 26% less memory and 80% fewer allocations for small events (see Performance Comparison).

### 3. Binary Marshal Optimization

Problem: `varint.Encode` writes one byte at a time, causing many small allocations. Also, `nil` tags were not handled explicitly.

Solution:

Code Changes (`binary.go`):

```go
func (ev *E) MarshalBinary(w io.Writer) {
	// ... existing code ...
	if ev.Tags == nil {
		varint.Encode(w, 0)
	} else {
		varint.Encode(w, uint64(ev.Tags.Len()))
		// ... rest of tags encoding
	}
	// ... rest of implementation
}

func (ev *E) MarshalBinaryToBytes(dst []byte) []byte {
	// New helper method with pre-allocated buffer
	// ... implementation
}
```
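A plausible shape for the new helper, sketched with a pre-sized `bytes.Buffer`; `marshalBinary` below is a stand-in for the streaming encoder, and the package's real implementation may differ:

```go
package main

import (
	"bytes"
	"fmt"
)

// marshalBinary stands in for the event's streaming encoder, which
// writes the wire format to an io.Writer.
func marshalBinary(w *bytes.Buffer, payload []byte) {
	w.Write(payload)
}

// marshalBinaryToBytes pre-sizes a buffer so the streaming encoder can
// append without intermediate growth, then hands back the raw bytes.
func marshalBinaryToBytes(dst, payload []byte) []byte {
	buf := bytes.NewBuffer(dst[:0])
	buf.Grow(len(payload)) // single up-front allocation for the estimate
	marshalBinary(buf, payload)
	return buf.Bytes()
}

func main() {
	out := marshalBinaryToBytes(nil, []byte("event-bytes"))
	fmt.Println(string(out))
}
```

Accepting a `dst` slice keeps the helper compatible with callers that already reuse buffers, matching the `Marshal(dst []byte)` convention elsewhere in the package.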

Results: 14% faster; allocation counts are unchanged due to the per-byte varint encoding (see Binary Operations).

### 4. Binary Unmarshal Optimization

Problem: The tags slice was always allocated, even when `nTags` is 0.

Solution:

Code Changes (`binary.go`):

```go
func (ev *E) UnmarshalBinary(r io.Reader) (err error) {
	// ... existing code ...
	if nTags == 0 {
		ev.Tags = nil
	} else {
		ev.Tags = tag.NewSWithCap(int(nTags))
		// ... rest of tag unmarshaling
	}
	// ... rest of implementation
}
```

Results: events with no tags now skip the tags allocation entirely; benchmark events with tags show a slight time regression from the extra nil check (see Binary Operations).

## Performance Comparison

### Small Events (Standard Test Event)

| Operation | Metric | Before | After | Improvement |
|-----------|--------|--------|-------|-------------|
| JSON Marshal | Time | 1758 ns/op | 1325 ns/op | 24% faster |
| JSON Marshal | Memory | 2232 B/op | 1024 B/op | 54% less |
| JSON Marshal | Allocations | 6 allocs/op | 1 allocs/op | 83% fewer |
| Canonical | Time | 1523 ns/op | 1272 ns/op | 16% faster |
| Canonical | Memory | 1208 B/op | 896 B/op | 26% less |
| Canonical | Allocations | 5 allocs/op | 1 allocs/op | 80% fewer |
| GetIDBytes | Time | 1739 ns/op | 1552 ns/op | 11% faster |
| GetIDBytes | Memory | 1240 B/op | 928 B/op | 25% less |
| GetIDBytes | Allocations | 6 allocs/op | 2 allocs/op | 67% fewer |

### Large Events (20+ Tags, 4KB Content)

| Operation | Metric | Before | After | Improvement |
|-----------|--------|--------|-------|-------------|
| JSON Marshal | Time | 19751 ns/op | 17666 ns/op | 11% faster |
| JSON Marshal | Memory | 18616 B/op | 9472 B/op | 49% less |
| JSON Marshal | Allocations | 11 allocs/op | 1 allocs/op | 91% fewer |
| Canonical | Time | 19725 ns/op | 17903 ns/op | 9% faster |
| Canonical | Memory | 18616 B/op | 10240 B/op | 45% less |
| Canonical | Allocations | 11 allocs/op | 1 allocs/op | 91% fewer |

### Binary Operations

| Operation | Metric | Before | After | Notes |
|-----------|--------|--------|-------|-------|
| Binary Marshal | Time | 347.4 ns/op | 297.2 ns/op | 14% faster |
| Binary Marshal | Allocations | 13 allocs/op | 13 allocs/op | No change (varint limitation) |
| Binary Unmarshal | Time | 990.5 ns/op | 1028 ns/op | Slight regression (nil check overhead) |
| Binary Unmarshal | Allocations | 32 allocs/op | 32 allocs/op | No change (varint limitation) |

Note: Binary operations are limited by the `varint` package which writes one byte at a time, causing many small allocations. Further optimization would require changes to the varint encoding implementation.

## Key Insights

### Allocation Reduction

The most significant improvement came from reducing allocations: JSON marshal went from 6 allocations per operation to 1, and canonical encoding from 5 to 1.

This reduction has cascading benefits: lower GC pressure and higher overall throughput, especially under high load.

### Buffer Pre-allocation Strategy

Pre-allocating buffers based on `EstimateSize()` proved highly effective: with a reasonable size estimate, each encoder completes in a single up-front allocation instead of repeatedly growing the buffer during appends.

## Remaining Optimization Opportunities

1. Varint Encoding: The `varint.Encode` function writes one byte at a time, causing many small allocations. Optimizing this would require:
   - Batch encoding into a temporary buffer
   - Or refactoring the varint package to support batch writes
2. NostrEscape: While we can't modify the `text.NostrEscape` function directly, we could:
   - Pre-allocate the destination buffer based on a source size estimate
   - Use a pool of buffers for repeated operations
3. Tag Marshaling: Tag marshaling could benefit from similar pre-allocation strategies.
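The batch-encoding opportunity can be sketched with the standard library's `encoding/binary` varint helpers, which encode into a small stack buffer and append in one write; this is a hypothetical replacement, not the current `varint` package API:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendUvarint encodes v into a small stack buffer and appends the
// result in one write, instead of emitting one byte at a time.
func appendUvarint(dst []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(dst, tmp[:n]...)
}

func main() {
	buf := appendUvarint(nil, 300)
	fmt.Println(buf) // 300 encodes as [172 2]
}
```

Because `tmp` is a fixed-size array, it lives on the stack and the only heap activity is the final `append`, which amortizes to zero when the destination buffer is pre-sized.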

## Recommendations

1. Use Pre-allocated Buffers: When calling `Marshal`, `ToCanonical`, or `MarshalBinaryToBytes` repeatedly, consider reusing buffers:

   ```go
   buf := make([]byte, 0, ev.EstimateSize()+100)
   json := ev.Marshal(buf)
   ```

2. Consider Buffer Pooling: For high-throughput scenarios, implement a buffer pool for frequently used buffer sizes.
3. Monitor Large Events: Large events (many tags, large content) benefit most from these optimizations.
4. Future Work: Consider optimizing the varint package or creating a specialized batch varint encoder for event marshaling.
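The buffer-pooling recommendation could take the following shape using `sync.Pool`; `withBuffer` is a hypothetical helper, not part of the package:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable byte slices. Pointers to slices are
// stored to avoid an extra allocation on Put.
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 1024); return &b },
}

// withBuffer runs fn with a pooled buffer, copies the result out, and
// recycles the (possibly grown) buffer for the next caller.
func withBuffer(fn func(buf []byte) []byte) []byte {
	bp := bufPool.Get().(*[]byte)
	out := fn((*bp)[:0])
	result := append([]byte(nil), out...) // copy before recycling
	*bp = out[:0]
	bufPool.Put(bp)
	return result
}

func main() {
	got := withBuffer(func(buf []byte) []byte {
		return append(buf, "escaped"...)
	})
	fmt.Println(string(got))
}
```

Copying the result before `Put` is essential: once the buffer is back in the pool, another goroutine may overwrite it.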

## Conclusion

The optimizations implemented significantly improved encoder performance: for small events, JSON marshaling is 24% faster with 83% fewer allocations and canonical encoding is 16% faster with 80% fewer; large events see up to 91% fewer allocations.

These improvements will reduce GC pressure and improve overall system throughput, especially under high load conditions. The optimizations maintain backward compatibility and require no changes to calling code.

## Benchmark Results

Full benchmark output:

```
BenchmarkJSONMarshal-12           	  799773	      1325 ns/op	    1024 B/op	       1 allocs/op
BenchmarkJSONMarshalLarge-12      	   68712	     17666 ns/op	    9472 B/op	       1 allocs/op
BenchmarkJSONUnmarshal-12         	  538311	      2195 ns/op	     824 B/op	      24 allocs/op
BenchmarkBinaryMarshal-12         	 3955064	       297.2 ns/op	      13 B/op	      13 allocs/op
BenchmarkBinaryMarshalLarge-12    	  673252	      1756 ns/op	      85 B/op	      85 allocs/op
BenchmarkBinaryUnmarshal-12       	 1000000	      1028 ns/op	     752 B/op	      32 allocs/op
BenchmarkCanonical-12             	  835960	      1272 ns/op	     896 B/op	       1 allocs/op
BenchmarkCanonicalLarge-12        	   69620	     17903 ns/op	   10240 B/op	       1 allocs/op
BenchmarkGetIDBytes-12            	  704444	      1552 ns/op	     928 B/op	       2 allocs/op
BenchmarkRoundTripJSON-12         	  312724	      3673 ns/op	    1848 B/op	      25 allocs/op
BenchmarkRoundTripBinary-12       	  857373	      1325 ns/op	     765 B/op	      45 allocs/op
BenchmarkEstimateSize-12          	295157716	         4.012 ns/op	       0 B/op	       0 allocs/op
```

## Date

Report generated: 2025-11-02