Instagram Influencer Graph — Cheat Sheet

Network analysis of a 70k-node Instagram shoe community: reducing the graph, ranking influence with centrality and community detection, and simulating diffusion with an SIR model.

Updated December 2023
1

The problem

Pick an influencer for a shoe brand. The naïve heuristic (most followers) ignores:

  • Followers can be bought.
  • A 1M-follower generic celeb may have lower engagement in your niche than a 50k-follower niche expert.
  • Structural position (bridging communities) matters more than raw size.

We needed a measure of influence based on network structure, not on follower count.

2

The data

  • ~70,000 nodes scraped from the shoe-collector / sneakerhead community on Instagram.
  • Edges = following relationships.
  • Sparse + scale-free — most nodes have few connections, a few have thousands.
  • Standard large-graph problems: a plot of all 70k nodes is an unreadable hairball, and computing centrality on the full graph is slow (exact betweenness alone is O(V·E)).

3

Graph reduction

You can't just sample randomly — it destroys structural properties (community structure flattens, hubs disappear).

Approach:

  1. K-core decomposition: repeatedly remove nodes with fewer than k connections until every remaining node has at least k, leaving only the "engaged core".
  2. Snowball sampling from a seed: expand outward from a seed account wave by wave, which keeps the neighbourhood structure of the seed's community.
  3. Subgraph by tag / hashtag affinity: narrow to nodes posting about the relevant niche.

Result: a reduced graph that preserved community structure instead of looking like a random Erdős–Rényi graph.
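Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes the NetworkX library, the function name `reduce_graph` and its `waves` parameter are hypothetical, and the toy graph stands in for the scraped data. Snowball sampling is approximated with `nx.ego_graph`, which keeps every node within a fixed number of hops of the seed. Step 3 is left out because it depends on post metadata.

```python
import networkx as nx

def reduce_graph(G, k, seed_node=None, waves=2):
    """k-core first, then an optional snowball sample around seed_node."""
    H = nx.Graph(G)  # undirected copy; also accepts a DiGraph of follow edges
    H.remove_edges_from(list(nx.selfloop_edges(H)))  # k_core rejects self-loops
    core = nx.k_core(H, k=k)  # maximal subgraph where every node has degree >= k
    if seed_node is None:
        return core
    if seed_node not in core:
        raise ValueError(f"{seed_node!r} was peeled off by the {k}-core")
    # Snowball: keep the seed plus everything within `waves` hops of it.
    return nx.ego_graph(core, seed_node, radius=waves)

# Toy stand-in for the scraped graph: two dense groups (nodes 0-5 and 6-11)
# joined by one edge, plus a few low-degree accounts hanging off them.
G = nx.disjoint_union(nx.complete_graph(6), nx.complete_graph(6))
G.add_edge(5, 6)
G.add_edges_from([(0, 12), (12, 13), (7, 14)])

core = reduce_graph(G, k=3)                          # drops nodes 12, 13, 14
sample = reduce_graph(G, k=3, seed_node=0, waves=1)  # node 0 and its direct neighbours
print(sorted(core.nodes))
print(sorted(sample.nodes))
```

One caveat worth checking on real data: a k-core can also peel off thin bridges between communities, so k is a trade-off between size and structure.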

4

Centrality measures

Each measures a different "influence":

Measure      | What it captures                                         | Use case
Degree       | Raw connections                                          | First-pass popularity.
Betweenness  | How often the node sits on shortest paths                | Bridge between communities.
Closeness    | Inverse of the average shortest-path distance to others  | Speed of broadcast.
Eigenvector  | Connections to other influential nodes                   | Quality of network, not just quantity.
PageRank     | Stationary distribution of a random walk (with damping)  | Authority.

A good influencer scores high on betweenness + eigenvector, not just degree.
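To see how the measures can disagree, here is a small sketch, assuming the NetworkX library (the source does not say which tool computed the scores). The helper names `centrality_table` and `top_nodes` are hypothetical. The toy graph is two cliques joined through a single bridge node (node 5), which has the lowest degree but the highest betweenness. PageRank is left out to keep the example dependency-light; `nx.pagerank(G)` could be added to the dictionary in the same way.

```python
import networkx as nx

def centrality_table(G):
    """Return {measure name: {node: score}} for four of the measures above."""
    return {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    }

def top_nodes(scores, n=3):
    """Nodes with the n highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Two 5-node cliques (0-4 and 6-10) connected only through node 5.
G = nx.barbell_graph(5, 1)

table = centrality_table(G)
for name, scores in table.items():
    print(f"{name:12s}", top_nodes(scores))
```

On a graph the size of the original, exact betweenness is the expensive call; `nx.betweenness_centrality` accepts a `k` argument to estimate it from a sample of source nodes.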

5

Community detection

We used Louvain to find communities — modularity-based, fast, multi-level.

Why it matters for influencer choice:

  • Some influencers sit inside one community (deep but narrow reach).
  • Some sit at the boundary between communities (broad reach, high diffusion).
  • Brands usually want the boundary type — content spreads further.

Visualised in Gephi with the ForceAtlas2 layout, nodes coloured by community.
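A rough way to separate "inside" from "boundary" accounts is to run Louvain and count how many other communities each node's neighbours belong to. The sketch below assumes NetworkX's `louvain_communities` (available in recent NetworkX releases); the scoring rule, the function name `boundary_scores`, and the toy graph are illustrative assumptions, not the project's method.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def boundary_scores(G, seed=0):
    """Run Louvain, then count the distinct foreign communities each node touches."""
    communities = louvain_communities(G, seed=seed)  # list of sets of nodes
    membership = {node: i for i, members in enumerate(communities) for node in members}
    scores = {}
    for node in G:
        foreign = {membership[nb] for nb in G[node]} - {membership[node]}
        scores[node] = len(foreign)  # 0 = fully inside its own community
    return communities, scores

# Toy graph: four separate 5-node cliques, plus one "broker" account
# that is connected to two members of every clique.
G = nx.caveman_graph(4, 5)  # cliques are nodes 0-4, 5-9, 10-14, 15-19
BROKER = 20
for clique in range(4):
    first = 5 * clique
    G.add_edges_from([(BROKER, first), (BROKER, first + 1)])

communities, scores = boundary_scores(G)
best = max(scores, key=scores.get)
print(len(communities), "communities; most boundary-spanning node:", best)
```

Louvain is randomised, so the `seed` argument is fixed here to make repeated runs comparable.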

6

SIR diffusion simulation

To test "if X posts about our brand, who sees it?":

  • S: Susceptible (has not seen the post yet, could be exposed).
  • I: Infected (has seen the post and is sharing it).
  • R: Recovered (has stopped sharing and moves on; still counts as reached).

Parameters:

  • β (infection rate): how readily a sharer passes the post on to a susceptible follower; an engagement-rate proxy.
  • γ (recovery rate): how quickly a sharer stops sharing and moves past the post.

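Seed the simulation with each candidate as the only infected node, run until no infected nodes remain, and count the final R as total reach. The process is random, so average over many runs per candidate, then pick the candidate with the highest expected reach.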
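A minimal sketch of that loop, assuming a discrete-time SIR model in which β and γ are per-step probabilities (the source does not specify the exact variant). The function names, the `followers` mapping (account → accounts that see its posts), and the parameter values are all hypothetical. It uses only the standard library and accepts any mapping from a node to its neighbours.

```python
import random

def simulate_sir(adj, seed_node, beta, gamma, rng):
    """One discrete-time SIR run; returns the final number of recovered nodes."""
    if not 0 < gamma <= 1:
        raise ValueError("gamma must be in (0, 1], otherwise the run never ends")
    infected, recovered = {seed_node}, set()
    while infected:
        newly_infected = set()
        for node in infected:
            for nb in adj[node]:
                # Only susceptible neighbours can be infected.
                if nb not in infected and nb not in recovered and rng.random() < beta:
                    newly_infected.add(nb)
        newly_recovered = {n for n in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return len(recovered)  # everyone who was ever infected, seed included

def expected_reach(adj, seed_node, beta, gamma, runs=200, seed=0):
    """Average final R over several independent runs."""
    rng = random.Random(seed)
    return sum(simulate_sir(adj, seed_node, beta, gamma, rng) for _ in range(runs)) / runs

# Toy audience graph: each key's posts are seen by the accounts in its list.
followers = {
    "hub": ["a", "b", "c", "d"],
    "a": ["e"], "b": ["e"], "c": [], "d": ["f"], "e": ["f"], "f": [],
    "loner": ["a"],
}

candidates = ["hub", "loner"]
reach = {c: expected_reach(followers, c, beta=0.3, gamma=0.5) for c in candidates}
best = max(reach, key=reach.get)
print(reach, "->", best)
```

With β = 1 and γ = 1 the run degenerates into a breadth-first search, so the reach equals the number of accounts reachable from the seed; that makes a handy sanity check before tuning β and γ against real engagement data.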