| Measure | What it captures | Use case |
|---|---|---|
| Degree | Raw connections | First-pass popularity. |
| Betweenness | Sits on shortest paths | Bridge between communities. |
| Closeness | Inverse of the average shortest-path distance to all other nodes | Speed of broadcast. |
| Eigenvector | Connected to other influential nodes | Quality of network, not just quantity. |
| PageRank | Random-walk stationary distribution | Authority. |
The problem
Pick an influencer for a shoe brand. The naïve heuristic (most followers) ignores:
- Followers can be bought.
- A 1M-follower generic celeb may have lower engagement in your niche than a 50k niche expert.
- Structural position (bridging communities) matters more than raw size.
We needed to measure influence by network structure, not by follower count.
The data
- ~70,000 nodes scraped from the shoe-collector / sneakerhead community on Instagram.
- Edges = following relationships.
- Sparse + scale-free — most nodes have few connections, a few have thousands.
- Standard large-graph problems: a plot of 70k nodes is unreadable, and computing centrality on the full graph is slow (exact betweenness, for example, needs a shortest-path search from every node).
Graph reduction
You can't just sample randomly — it destroys structural properties (community structure flattens, hubs disappear).
Approach:
- K-core decomposition: repeatedly remove nodes with fewer than k connections until every remaining node has at least k, leaving only the "engaged core".
- Snowball sampling from a seed → keeps neighbourhood structure of the seed's community.
- Subgraph by tag / hashtag affinity — narrow to nodes posting about the relevant niche.
Result: a reduced graph that preserved community structure instead of looking like a random Erdős–Rényi graph.
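The first two reduction steps can be sketched in plain Python. This is a minimal illustration, not the pipeline actually used: it assumes an undirected graph stored as a dict mapping each node to a set of neighbours, and the function names (`k_core`, `snowball`, `induced_subgraph`) and the toy graph are made up for the example.

```python
from collections import deque

def k_core(adj, k):
    """Return the nodes of the k-core: what is left after repeatedly
    removing every node whose remaining degree is below k."""
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    removed = set()
    queue = deque(node for node, d in degree.items() if d < k)
    while queue:
        node = queue.popleft()
        if node in removed:
            continue
        removed.add(node)
        # Removing a node lowers the degree of its surviving neighbours.
        for nbr in adj[node]:
            if nbr not in removed:
                degree[nbr] -= 1
                if degree[nbr] < k:
                    queue.append(nbr)
    return set(adj) - removed

def snowball(adj, seed, waves):
    """Return every node within `waves` hops of `seed` (breadth-first)."""
    seen = {seed}
    frontier = [seed]
    for _ in range(waves):
        next_frontier = []
        for node in frontier:
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return seen

def induced_subgraph(adj, keep):
    """Keep only the chosen nodes and the edges among them."""
    return {node: adj[node] & keep for node in keep}

# Toy graph (assumed data): a triangle a-b-c with a tail c-d-e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
core_nodes = k_core(toy, 2)               # the tail d-e is peeled away
core_graph = induced_subgraph(toy, core_nodes)
sample_nodes = snowball(toy, "e", 2)      # two waves out from the seed "e"
```

Peeling has to be iterative: removing `e` drops `d` below the threshold, so `d` goes in the next step even though it started with degree 2.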
Centrality measures
Each measure (degree, betweenness, closeness, eigenvector, PageRank) captures a different kind of influence.
A good influencer scores high on both betweenness and eigenvector centrality, not just on degree.
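To make the difference between degree, betweenness and eigenvector centrality concrete, here is a small stdlib-only sketch. It assumes an undirected, unweighted graph stored as a dict of neighbour sets. The betweenness function follows Brandes' algorithm, and the eigenvector function is a plain power iteration. The function names and the toy graph are illustrative assumptions, and on a real 70k-node graph a graph library would be the practical choice.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness (Brandes) for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        # BFS from the source, counting shortest paths (sigma) to each node.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

def eigenvector(adj, iterations=100, tol=1e-9):
    """Eigenvector centrality by power iteration, normalised to unit length."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # Adding x[v] itself (iterating A + I) avoids oscillation on bipartite graphs.
        new = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = sum(val * val for val in new.values()) ** 0.5
        new = {v: val / norm for v, val in new.items()}
        if sum(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x

# Toy graph (assumed data): triangles a-b-c and e-f-g joined through d.
bridge_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
deg = {v: len(nbrs) for v, nbrs in bridge_graph.items()}
bet = betweenness(bridge_graph)
eig = eigenvector(bridge_graph)
```

In this toy graph `d` has only two connections, fewer than `c` or `e`, yet it has the highest betweenness because every path between the two triangles runs through it. Ranking by degree alone would miss it.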
Community detection
We used Louvain to find communities — modularity-based, fast, multi-level.
Why it matters for influencer choice:
- Some influencers sit inside one community (deep but narrow reach).
- Some sit at the boundary between communities (broad reach, high diffusion).
- Brands usually want the boundary type — content spreads further.
Visualised with Gephi + ForceAtlas2 layout. Coloured by community.
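Once communities are labelled, the "inside versus boundary" distinction can be quantified. The sketch below assumes the labels already exist (from Louvain or any other method) as a dict from node to community id. It does not implement Louvain itself. `modularity` computes the quality score that Louvain optimises, and `boundary_fraction` is a hypothetical helper, not part of any library, that gives the share of a node's neighbours living in a different community.

```python
def modularity(adj, community):
    """Modularity of a partition of an undirected graph:
    sum over communities of (internal edges / m) - (community degree / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    internal = {}    # edges with both endpoints inside the community
    degree_sum = {}  # total degree of the community's nodes
    for node, nbrs in adj.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        # Each internal edge is seen from both endpoints, hence the 0.5.
        same = sum(1 for n in nbrs if community[n] == c)
        internal[c] = internal.get(c, 0.0) + 0.5 * same
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

def boundary_fraction(adj, community):
    """Hypothetical helper: share of each node's neighbours that sit
    in a different community (0.0 = fully inside, 1.0 = fully outside)."""
    result = {}
    for node, nbrs in adj.items():
        if not nbrs:
            result[node] = 0.0
            continue
        outside = sum(1 for n in nbrs if community[n] != community[node])
        result[node] = outside / len(nbrs)
    return result

# Toy graph (assumed data): two triangles joined by the edge c-d.
two_groups = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
labels = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

q = modularity(two_groups, labels)
edge_share = boundary_fraction(two_groups, labels)
boundary_nodes = sorted(v for v, share in edge_share.items() if share > 0)
```

Here only `c` and `d` have neighbours in the other community, so they are the boundary-type candidates. On real data a threshold on the fraction, or a count of distinct communities reached, would be a judgment call.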
SIR diffusion simulation
To test "if X posts about our brand, who sees it?":
- S: Susceptible (has not seen the post yet, could be exposed).
- I: Infected (has seen the post and is sharing it).
- R: Recovered (has seen the post, has stopped sharing it, and has moved on).
Parameters:
- β (infection rate): engagement-rate proxy.
- γ (recovery rate): speed of moving past the post.
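Run the simulation with each candidate as the seed node and count the final number of R nodes, which is the total reach. The process is random, so average over repeated runs per candidate, then pick the candidate with the highest expected reach.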
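A discrete-time version of this simulation fits in a few lines. It is a sketch under assumptions, not the exact simulator used. Time advances in steps; each infected node exposes each susceptible neighbour with probability `beta` per step and recovers with probability `gamma` per step. The graph maps each account to the set of accounts that see its posts. The function names, parameter values and toy graph are all invented for illustration.

```python
import random

def simulate_sir(adj, seed, beta, gamma, rng):
    """One SIR run from `seed`; returns the final number of recovered nodes,
    i.e. everyone who saw the post at some point."""
    if not 0 < gamma <= 1:
        raise ValueError("gamma must be in (0, 1] or the run never ends")
    infected = {seed}
    recovered = set()
    while infected:
        newly_infected = set()
        # Sorted iteration keeps runs reproducible for a fixed rng seed.
        for node in sorted(infected):
            for nbr in sorted(adj[node]):
                susceptible = (nbr not in infected
                               and nbr not in recovered
                               and nbr not in newly_infected)
                if susceptible and rng.random() < beta:
                    newly_infected.add(nbr)
        newly_recovered = {n for n in sorted(infected) if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return len(recovered)

def expected_reach(adj, seed, beta, gamma, runs=200, rng_seed=0):
    """Average final reach over `runs` independent simulations."""
    rng = random.Random(rng_seed)
    total = sum(simulate_sir(adj, seed, beta, gamma, rng) for _ in range(runs))
    return total / runs

# Toy audience graph (assumed data): two triangles joined through d.
followers = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
candidates = ["a", "d"]
# beta and gamma below are placeholder values, not fitted to real engagement data.
best = max(candidates,
           key=lambda c: expected_reach(followers, c, beta=0.3, gamma=0.5))
```

Two sanity checks are worth running before trusting the output: with `beta=1.0` and `gamma=1.0` the reach equals the size of the seed's connected component, and with `beta=0.0` it is always 1 (the seed alone).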