Assignment 7: Develop, Prototype, and Launch with AI. A structured evidence appendix presenting the complete Develop and Deliver phases — from research and divergent approaches through critical assessment, action planning, simulated deployment, and cultural enablement.
A long-tenured subscriber to a premium DTC wine community needs the operational experience of receiving their order to be held to the same standard of care as the product itself. The need is acute because the company's primary measure of customer health (a prompted, managed satisfaction metric) consistently reports 'excellent', whilst the unsolicited, peer-to-peer community forum — where customers speak to each other, not to the brand — reveals a pattern of quiet, irreversible disengagement expressed not as complaint but as conclusion: 'I am not angry. I am finished.'
This gap between the managed metric and the lived reality is not a data discrepancy. It is where the highest-value customers are disappearing.
User: A 5+ year subscriber who pre-funds independent makers and co-owns the brand story. Need: The final physical delivery treated with the same care and intention as the product itself. Insight: Because when a cost-optimised carrier fails that moment, it doesn't produce a complaint — it produces a conclusion: 'I am not angry. I am finished.'
Figures derived from group-level HY26 interim results published December 2025. UK segment estimated at 43% of group revenue per FY24 segmental disclosure. All figures should be verified against full annual report segmental tables before board submission.
| Metric | Value | Source |
|---|---|---|
| Annual churn rate — highest LTV cohort | 24% | CRM analysis |
| Estimated UK high-LTV Angel cohort | ~24,000 | Segment estimate |
| Annualised revenue per Angel | ~£324 | HY26 derived |
| Replacement cost per churned Angel | ~£398 | CAC estimate |
| Acquisition payback window | 44 months | HY26 derived |
| Annual replacement burden (current) | ~£2.3m | Calculated |
| Revenue preserved per 5pp churn reduction | ~£478,000 | Calculated |
| Group NPS | 76 | HY26 reported |
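The derived figures reconcile as simple arithmetic. The sketch below is an illustrative check only: it reads 'revenue preserved per 5pp churn reduction' as avoided replacement cost (1,200 fewer Angels to replace at ~£398 each), which reproduces the table's ~£478,000; the cohort size, churn rate, and CAC are the estimates stated above and should be verified against the annual report before board submission.

```python
# Illustrative check of the derived figures; all inputs are the table's estimates.
cohort = 24_000           # estimated UK high-LTV Angel cohort
churn_rate = 0.24         # annual churn rate, highest-LTV cohort
replacement_cac = 398     # estimated replacement cost per churned Angel (GBP)

churned_per_year = cohort * churn_rate                    # 5,760 Angels
replacement_burden = churned_per_year * replacement_cac   # GBP 2,292,480 (~GBP 2.3m)

# Reading "revenue preserved per 5pp reduction" as avoided replacement cost:
preserved = cohort * 0.05 * replacement_cac               # GBP 477,600 (~GBP 478,000)

print(f"{churned_per_year:,.0f} Angels, £{replacement_burden:,.0f}, £{preserved:,.0f}")
```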
Assignment 6 completed the first diamond and overshot into the second. The course leader's clarification corrected the scope. Assignment 7 restarts the Develop and Deliver phases through the taught frameworks, scoped to the Horizon 1 MVP prototype. The diagram below maps both diamonds, the overshoot zone, the scope correction, and the A7 execution path.
The overshoot documents demonstrate what happens when AI-assisted work proceeds without the taught frameworks as constraints. The feasibility study used a bespoke RAG assessment rather than the Desirability/Feasibility/Viability framework. The implementation plan scoped a 48-week full deployment rather than an MVP prototype. The methodology critique referenced these excluded outputs as validated inputs — a contamination chain. Each document is individually competent; collectively they represent a project that had moved past its assignment scope.
The decision to identify, assess, and deliberately exclude these documents rather than retrofit them to the correct frameworks is the learning. Retrofitting would have preserved the outputs but obscured the methodology. Exclusion preserves the methodology at the cost of the outputs — and for a Design Thinking assignment, the methodology is the point.
HITL gating: No prompt was executed without Tim's review and explicit confirmation. Each output was reviewed and challenged before it fed into the next. The LLM generates; the human decides.
Session isolation: Each prompt was run in a separate chat. Where a prompt depends on a prior output, the dependency is stated explicitly and the output is in the shared knowledge base — not carried through conversation history.
Context degradation awareness: Long chat threads degrade contextual fidelity. The project was deliberately structured across isolated sessions to counteract this. The knowledge base is the shared memory; conversation history is disposable.
Framework derivation: Structure derives from the course material frameworks. The Concept Development document defines what the system is; the frameworks define how to plan, assess, and deliver it. When these sources suggest different structures, the frameworks govern.
MVP scope discipline: 'Launch' means launch the MVP, gather feedback, determine next steps, return to the double diamond. Not full system deployment. This constraint was applied at every prompt.
The metric-truth gap as the core insight: The project's analytical foundation is that managed metrics (NPS) actively obscure the signal that matters. Unmanaged peer-to-peer community data is the only reliable source. Any output that relies on prompted metrics as a primary input fails the test set by the HITL Critical Finding.
The complete sequence from Assignment 6 through the transition to Assignment 7. The stages span two assignments, connected by a deliberate knowledge-base curation that determined exactly which prior outputs the second diamond could see. Each HITL gate marks a point where a human assessor reviewed AI output against source evidence before the next stage was authorised to proceed.
The flow below shows the A6 stages (slate), the overshoot branch (dashed red — produced and excluded), the Transfer Gate (amber — scope correction), the A7 prompts (blue), and the return loop (green). Arrows are colour-coded by section. The overshoot branch shows what was produced outside the correct frameworks; the Transfer Gate shows the curation that corrected scope before A7 began.
The same contamination-prevention principle operates at two scales across the project. In Assignment 6, it manufactured genuine divergence during ideation. In Assignment 7, it prevents context degradation across the execution sequence. The knowledge base — not conversation history — is the only mechanism through which context persists between sessions.
Three separate browser windows. Only the anonymised POV as input. No shared context, no financial data, no knowledge base. Convergence across isolated sessions is evidence, not artefact. Community Vocabulary Shift appeared independently in all three sessions — the single most structurally validated idea in the 161-idea corpus. If an idea surfaces in three sessions that cannot see each other, its emergence is structural rather than prompted.
21 documents assessed against a single question: does this document provide context that Assignment 7 needs, without prescribing outputs that the taught frameworks should determine? The Concept Development document transfers (defines what the system is). The feasibility study is excluded (prescribes how to assess viability using the wrong framework — bespoke RAG, not D/F/V). The implementation plan is excluded (48-week full deployment, not MVP). The methodology critique is excluded (references excluded outputs as validated — a contamination chain).
The Financial Context document is also excluded — but its content is embedded verbatim in the project instructions, making it available to every prompt without creating a separate document dependency.
Each prompt executed in its own chat session. Dependencies stated explicitly; prior outputs present in the shared knowledge base, not carried through conversation history. This prevents the known risk that long threads erode contextual fidelity, causing later outputs to drift from the original evidence base. The knowledge base contains only reviewed, finalised outputs and source material — not hedged responses, abandoned reasoning, or intermediate drafts.
21 documents reconciled. 12 transferred to the Assignment 7 project. 9 excluded. Every exclusion has a specific rationale grounded in preventing contaminated context from shaping the A7 outputs.
| Document | Format | Decision | Rationale |
|---|---|---|---|
| Transfer — 12 documents in the Assignment 7 knowledge base | | | |
| Concept_Development | Google Doc | TRANSFER | The QSS concept: chosen direction entering A7 Develop phase. |
| Silent_Exit_Ideation 1 | Google Doc | TRANSFER | Session 1: 52 raw ideas. Session isolation rigour evidenced. |
| Silent_Exit_Ideation 2 | Google Doc | TRANSFER | Session 2: 53 raw ideas. |
| Silent_Exit_Ideation 3 | Google Doc | TRANSFER | Session 3: 56 raw ideas. Independent convergence on Community Vocabulary Shift. |
| silent_exit_ideation_consolidated | Google Doc | TRANSFER | 18 priority ideas across 6 clusters from 161 total. |
| NakedWines_UK_AI_Risk_Assessment | Google Doc | TRANSFER | AI Risk Register: 7 dimensions, UK regulatory lens. DPIA gate context. |
| Image (Part 1 — In-Depth Research) | PNG | TRANSFER | Assignment 7 brief screenshot — Part 1. |
| Image (Part 2 — Summary/Stretch Goal) | PNG | TRANSFER | Assignment 7 brief screenshot — Part 2 and Stretch Goal. |
| Develop: Selecting Solutions | Course material | TRANSFER | Weighted Scoring Matrix and selection framework. |
| Develop: Preparation | Course material | TRANSFER | Five key aspects of prototyping preparation. |
| Develop: Continuous Assessment | Course material | TRANSFER | D/F/V sweet spot assessment. |
| Building the Thing Right: Deliver | Course material | TRANSFER | Deliver phase frameworks (Finalise, Launch, Monitor). |
| Exclude — 9 documents left in Assignment 6 project | | | |
| Assignment6_Complete_Prompt_Sequence_FINAL | Google Doc | EXCLUDE | Contains every prompt that generated excluded outputs. POV, HITL Finding, Financial Context embedded in project instructions instead. |
| feasibility_study | Google Doc | EXCLUDE | Bespoke RAG framework, not D/F/V. Would prescribe the wrong assessment method for A7. |
| QSS_Implementation_Planning | Google Doc | EXCLUDE | 48-week full deployment scope, not MVP. Would anchor the action plan to the wrong scale. |
| methodology_critique | Google Doc | EXCLUDE | References excluded outputs as validated. Would contaminate A7 with superseded conclusions. |
| Assignment6_Ideation_Prompt_Summary | Google Doc | EXCLUDE | Redundant — content covered by the three ideation outputs and the consolidation document. |
| Assignment 6 | Google Doc | EXCLUDE | A6 brief. Backward-looking — A7 brief defines current requirements. |
| The_Quiet_Signal_System___Visual_Outputs.pdf | PDF | EXCLUDE | Visuals of the full four-layer system, not MVP scope. Would expand scope beyond what A7 tests. |
| quiet_signal_visuals.html | HTML | EXCLUDE | Same content as the PDF visuals — duplicate exclusion. |
| Financial_Context — HY26 Public Data | Text file | EXCLUDE | Content embedded verbatim in project instructions — available to every prompt without a separate document. |
Each prompt maps to a named requirement in the assignment brief. Framework derivation — from the course material, not the Concept Development document's internal architecture — was a non-negotiable discipline throughout.
| # | Prompt Title | Brief Requirement | KB / Framework Inputs |
|---|---|---|---|
| 1 | Research & Three Approaches | Part 1: in-depth research, three approaches | Concept Dev, Risk Register, Financial Context |
| 2 | Critical Assessment & Selection | Part 1: assess approaches; Part 2: rationale | Course material (selection methods, D/F/V rubrics); Prompt 1 outputs |
| 3 | Action Plan (Four Phases) | Part 1: action plan, all four phases | Course material (Develop: Preparation; Deliver); Concept Dev, Risk Register |
| 4 | Three Simulated Variations | Part 1: simulate three variations | Prompt 3 output, Concept Dev, Risk Register |
| 5 | Cultural Enablement | Stretch Goal: items a, b, appendix for c | HITL Finding, Concept Dev, Deliver: change management framework |
HITL gating: No prompt was executed without Tim's review and explicit confirmation. Each output was reviewed and challenged before it fed into the next prompt.
Session isolation: Each prompt is self-contained. Where a prompt depends on a prior output, the dependency is stated explicitly and the output is in the shared knowledge base — not carried through conversation history.
Framework derivation: Structure derives from the course material frameworks, not the Concept Development document's internal architecture. When these sources suggest different structures, the course material governs.
MVP scope discipline: 'Launch' means launch the MVP prototype, gather feedback from relationship stewards, and determine what happens next — then return to the start of the double diamond for the next iteration. Not full system deployment.
Assignment 6 gates: after POV validation before ideation began; after the feasibility study before concept development proceeded; after concept development before success criteria were written; after the full critique output before submission. At each gate, a human assessor reviewed AI output against source evidence before the next stage was authorised.
Transition gate: The include/exclude decision on all 21 documents was a human assessment of what context the new project should inherit. The overshoot documents were identified, assessed, and deliberately excluded — not retrofitted to the correct frameworks. The exclusion is the learning.
Assignment 7 gates: Each of the five prompts was reviewed and challenged before it fed into the next. Tim's HITL role included: challenging whether the scenarios selected in Prompt 4 were the right ones; confirming that Approach C was selected on genuine constraint analysis rather than convenience; verifying that the cultural shifts in Prompt 5 are specific to this organisation and this project, not generic change management. The outputs in this appendix reflect the finalised, reviewed versions — not the first draft in every case.
In-depth research into comparable real-world systems where organisations have tried to detect customer disengagement using community or unmanaged signals rather than prompted satisfaction metrics. Obstacle analysis specific to implementing this MVP at Naked Wines UK. Three genuinely different implementation approaches — not tonal variations but structurally distinct choices about how to build and deploy the MVP.
The specific proposal at the core of this MVP — using unmanaged peer community language as the primary signal source for detecting silent disengagement, in preference to prompted satisfaction metrics — sits at a genuine frontier. Most of the retention technology industry is not doing what this system describes. The following analysis maps what exists, what it tells us, and where the transferable lessons lie.
Quitlo, a churn intelligence platform, analysed over 50,000 AI exit conversations with churned SaaS subscribers and found a consistent pattern: companies with reportedly healthy NPS scores were losing customers at rates their survey data never predicted. Their analysis identified the structural reason: NPS captures sentiment from the roughly 4.5% who respond, while churn happens among the 95.5% who stay silent. The most common reason churning customers gave for not completing exit surveys was that they did not believe anyone would read it — a learned response to years of feedback going nowhere.
CustomerGauge, a B2B retention platform, makes a complementary observation: the majority of churn they have observed across enterprise accounts is preceded not by negative feedback but by an absence of signal. Their framing is direct — churn comes from a lack of feedback and data, not from the feedback itself. The absence of complaint is not evidence of satisfaction; it is the signal itself.
Relevance to this project: These findings directly validate the core insight of the Quiet Signal System. The metric-truth gap is not a quirk of Naked Wines' data. It is a structural feature of prompted satisfaction measurement. A group NPS of 76 coexisting with a 24% annual churn rate in the highest-LTV cohort is consistent with the pattern documented across the industry. The anomaly is not the churn. The anomaly is the NPS.
The current industry standard for churn prediction is the customer health score, operationalised by platforms such as Gainsight, ChurnZero, and Planhat. These systems aggregate product usage, support ticket history, NPS/CSAT responses, and engagement data into composite health scores, typically visualised as red/amber/green. Gainsight reports that 84% of companies with formal customer success programmes use health scoring as a cornerstone of their retention strategy.
These platforms are genuinely useful for B2B SaaS retention where the signals are digital and frequent — login frequency, feature adoption, support ticket volume. They are structurally weaker for consumer subscription models where the product is physical, the community is emotional, and the churn signal is expressed in peer-to-peer language rather than product telemetry.
ChurnZero's own analysis of enterprise account data reveals a specific failure mode relevant to this project: health scores perform well for customers in the early stages of a subscription (high sensitivity to early disengagement signals) but significantly underperform for tenured customers who are expected to renew. The long-tenure, high-NPS customer who churns is a known blind spot in health score architectures — exactly the profile of the Naked Wines Angel at risk of silent departure.
Relevance to this project: The health score ecosystem defines the existing state of the art. The Quiet Signal System is differentiated not by its goal (detecting churn) but by its data source (unmanaged community language) and its interface design (verbatim-first, not score-first). The existing platforms solve a different problem for a different customer profile.
Several academic and commercial research programmes have examined whether sentiment in brand-adjacent online communities predicts customer behaviour. The most relevant evidence base is from the open-source software and gaming communities, where product forums function similarly to the Naked Wines community — unmediated peer-to-peer conversation, not brand-managed channels.
Research from the Wharton School (Netzer et al., "Mine Your Own Business") found that analysing customer discussions in product forums predicted market share movements more accurately than surveys, with a lead time of several months. The mechanism was not sentiment polarity (positive/negative) but semantic content — what customers were discussing, how they were framing the product's role in their lives, and whether the framing was changing over time.
Supportbench, a customer success platform, identifies linguistic changes as a leading indicator of disengagement: the shift from collaborative language ('we should fix this') to transactional language ('your product does not do X') to observational language ('I noticed that other products handle this differently') tracks a consistent trajectory toward departure. The emotional register cools before the cancellation is submitted. This directly validates the Silence Classifier's design: not just what the customer says, but how they say it and whether the register is changing.
Relevance to this project: These findings establish that language trajectory — how customers talk about a product over time — is more predictive than point-in-time sentiment. The Community Vocabulary Shift signal is the operationalisation of this principle: longitudinal NLP analysis of individual posting histories, detecting the shift from inside-the-community language to outside-the-community language.
Academic research by Iñigo-Mora (2004) on pronoun use in group discourse establishes that first-person plural ('we', 'our', 'us') versus third-person references ('they', 'them', 'the company') are measurable indicators of group identity alignment. When group members refer to the group in the third person, they have psychologically departed even if their formal membership continues. This finding has been replicated across organisational, political, and consumer contexts.
The application to subscription churn has not been commercially operationalised — no known retention platform currently uses longitudinal pronoun trajectory as a primary churn signal. This represents both the theoretical validity and the practical novelty of the Community Vocabulary Shift component. The academic evidence is strong; the commercial application is untested at scale.
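To make the operationalisation concrete, the sketch below computes a minimal we-to-they trajectory over a posting history. It is an assumption-laden illustration, not the production signal: the word lists, the neutral default of 0.5, and the quarterly bucketing are invented here for clarity.

```python
import re
from collections import Counter

# Minimal sketch of the pronoun-trajectory idea (illustrative assumptions only).
WE_WORDS = {"we", "our", "us", "ours"}
THEY_WORDS = {"they", "them", "their", "theirs"}

def we_ratio(post: str) -> float:
    """Share of group references framed as first-person plural."""
    tokens = Counter(re.findall(r"[a-z']+", post.lower()))
    we = sum(tokens[w] for w in WE_WORDS)
    they = sum(tokens[w] for w in THEY_WORDS)
    return we / (we + they) if (we + they) else 0.5  # neutral when no references

def quarterly_trajectory(posts_by_quarter: list[list[str]]) -> list[float]:
    """Mean we-ratio per quarter; a sustained decline is the candidate signal."""
    return [
        sum(we_ratio(p) for p in q) / len(q) if q else 0.5
        for q in posts_by_quarter
    ]

# Example: a subscriber drifting from 'our wines' to 'their wines'.
history = [["We love our December case"], ["Has anyone tried their new reds?"]]
print(quarterly_trajectory(history))  # [1.0, 0.0] — a we-to-they shift
```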
The strongest indirect validation for this project: the Community Vocabulary Shift idea emerged independently across all three isolated ideation sessions. If the idea surfaces in three sessions that cannot see each other, its emergence reflects something the evidence base supports — not a prompt artefact.
Peloton's retention model (documented in their investor materials following the 2022 subscription crisis) is the closest documented example of multi-dimensional community engagement analysis in a DTC subscription context. Their retention analysis found that churn was 60% lower among subscribers who engaged with two or more content disciplines per month — and critically, that engagement variety was more predictive than engagement frequency. A subscriber who used the platform every day for one type of content was more at risk than one who used it three times a week across multiple content types.
The relevant insight is not the specific finding but the methodology — they identified that multi-dimensional engagement patterns were more predictive than any single metric. The Quiet Signal System's three-signal corroboration approach (Forum Divergence Score, Silence Classifier, Community Vocabulary Shift) follows the same structural logic: no single signal is dispositive; convergence across signals is the reliable indicator.
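A minimal sketch of that corroboration logic follows; the 2-of-3 rule and the 0.7 cut-off are illustrative assumptions, not values specified anywhere in the concept.

```python
from dataclasses import dataclass

@dataclass
class SignalSnapshot:
    forum_divergence: float    # 0..1, elevated = drifting from community norms
    silence_class: float       # 0..1, probability of concluded (not satisfied) silence
    vocabulary_shift: float    # 0..1, strength of the we-to-they trajectory

def corroborated(s: SignalSnapshot, threshold: float = 0.7) -> bool:
    """No single signal is dispositive; flag only when at least two converge."""
    elevated = [
        s.forum_divergence >= threshold,
        s.silence_class >= threshold,
        s.vocabulary_shift >= threshold,
    ]
    return sum(elevated) >= 2
```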
The design principle of surfacing verbatim community language at the top of every subscriber brief — before any model score or risk category — has no direct precedent in the retention technology stack. Every platform in the market (Gainsight, ChurnZero, Braze Predictive Churn, the Pedowitz Group's sentiment-driven churn framework) presents a score first and evidence second.
The Quiet Signal System inverts this deliberately. The rationale is architectural: if the steward sees a number first, the number becomes the decision input and the verbatim language becomes supporting evidence. The system then behaves like a more sophisticated version of the managed metric it was designed to replace. If the steward reads the subscriber's actual words first, the emotional register of the community signal is preserved through to the human decision point. The controlled comparison built into the Approach C prototype — verbatim-first versus score-first presentation of the same briefs — is designed to test whether this architectural commitment produces the expected advantage in practice.
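The controlled comparison can be made concrete with a small rendering harness: the same brief shown verbatim-first and score-first. Field names and values below are hypothetical.

```python
# Hypothetical rendering harness for the verbatim-first vs score-first comparison.
def render(brief: dict, verbatim_first: bool) -> str:
    verbatim = "\n".join(f'  "{q}"' for q in brief["verbatim_excerpts"])
    score = f"  Risk score: {brief['model_score']:.2f}"
    blocks = [verbatim, score] if verbatim_first else [score, verbatim]
    return f"Subscriber {brief['subscriber_id']}\n" + "\n".join(blocks)

brief = {
    "subscriber_id": "A-0042",
    "verbatim_excerpts": ["I'm not angry. I'm finished.", "Their reds used to feel like ours."],
    "model_score": 0.81,
}
print(render(brief, verbatim_first=True))  # steward reads the words before the number
```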
| Source | Transferable Lesson |
|---|---|
| Quitlo (50k exit interviews) | NPS captures sentiment from ~4.5% who respond; churn happens among the 95.5% who stay silent. The metric-truth gap is structural, not anomalous. |
| CustomerGauge | The majority of churn is preceded by an absence of signal, not negative feedback. Silence is the most dangerous indicator. |
| ChurnZero health scores | Health scores fail to predict churn among tenured customers expected to renew. The long-tenure, high-NPS churner is a known blind spot. |
| Wharton / Netzer et al. | Forum semantic analysis predicts behaviour months ahead of surveys. What customers discuss — and how framing changes over time — matters more than sentiment polarity. |
| Peloton retention model | Multi-dimensional engagement patterns are more predictive than any single metric. Corroboration across signals is the reliable architecture. |
| Supportbench silent churn | Linguistic changes (collaborative to transactional to observational language) are a leading indicator of disengagement, preceding behavioural signals. |
| Iñigo-Mora (pronoun research) | Pronoun choices ('we' vs 'they') measurably reflect group identity alignment. The shift is real but has not been commercially operationalised as a churn signal. |
The following obstacles are organised by category but deliberately not ranked. Materiality assessment is a decision for the project owner, informed by which implementation approach is selected.
DPIA sequencing gate. The mandatory DPIA under UK GDPR Article 35 is a hard sequencing constraint on the critical path. The specific obstacle is not completing the DPIA itself — that is a procedural step — but the findings it may produce. If the DPIA concludes that community forum posts constitute special category data when processed through AI behavioural profiling, the lawful basis analysis becomes significantly more complex. The ICO's guidance on AI and data protection is clear that automated profiling with significant effects requires either explicit consent or a documented Legitimate Interests Assessment. Consent is unsuitable given the commercial power imbalance in a subscription context (GDPR Recital 43). The Legitimate Interests Assessment must demonstrate that the processing is necessary, proportionate, and that the data subject's interests do not override the business interest.
Community forum terms of service. Naked Wines' community forum presumably has existing terms governing how user-contributed content may be used. If those terms do not explicitly contemplate AI-driven behavioural analysis of individual posting patterns, there is a gap between what the subscriber consented to when posting and what the system does with their words. This requires legal review and potentially updated terms with grandfathering provisions for existing content. The legitimacy question — can the organisation honestly say to an Angel 'we read what you wrote and used it to understand how you were feeling about us?' — is distinct from the legal question, and both need answering.
ICO exposure. The AI Risk Register identifies the ICO maximum fine exposure at group level as approximately £8m at current revenue. Angel retention damage from a regulatory investigation would compound this materially. The DPIA gate is not a bureaucratic formality — it is the point at which the project either earns regulatory clearance to proceed with live data or determines that it cannot.
System integration. The MVP requires access to individual subscriber forum posting histories linked to subscriber accounts, correlated with CRM data (tenure, referral history, pre-funded balance, email engagement). Forum data and CRM data are likely in different systems with different schemas. The technical effort to join them into a single subscriber record is non-trivial, and the data engineering team's capacity and willingness to prioritise this for an experimental MVP is uncertain. This is the single largest technical dependency for Approaches A and B; Approach C bypasses it entirely.
Coverage gap. Not all Angels post on the forum. If only a fraction of the 24,000 target cohort are active forum contributors, the system's coverage is immediately limited. The Forum Divergence Score and Community Vocabulary Shift signals only work for subscribers who generate community language. For the silent majority who never posted, only the behavioural telemetry signals (referral velocity reversal, email open-time decay) are available. The MVP must account for this coverage gap honestly — a system that covers 30% of the target cohort is a different proposition from one that covers 80%.
Data quality and longitudinal depth. The Community Vocabulary Shift detection requires longitudinal analysis of individual posting histories — not a single-point classification but a trajectory over months or years. This requires a well-structured historical dataset and a clear definition of what constitutes a meaningful shift versus normal linguistic variation. If forum posts were not consistently linked to individual subscriber records over the full tenure window, the longitudinal signal cannot be computed.
The NPS challenge. The system's core premise is that the company's primary measure of customer health (NPS 76) is structurally misleading. This is not a neutral analytical observation — it is a direct challenge to the metric that the leadership team reports to the board and the market. The HY26 interim results cite NPS 76 as evidence of customer health. A system that explicitly argues this metric is obscuring the departure of the highest-value customers will encounter resistance from anyone whose reporting, compensation, or credibility is anchored to NPS. This is the single most likely reason the project stalls at the organisational level.
The relationship steward role. The concept requires a 'relationship steward' with the authority to make relational intervention decisions. Naked Wines' current customer service structure presumably does not include this role in the form the system requires. Creating it — even for the MVP — requires someone with budget authority to allocate staff time. If the MVP is positioned as a technology experiment, it may get engineering resource but not operational resource. It needs both. The MVP cannot be tested without a human steward to review the briefs.
Operational culture and decision-making norms. The system asks stewards to make qualitative, interpretive decisions based on verbatim language rather than threshold-triggered scripts. Organisations whose customer service functions are built on scripted, metric-driven processes may find this genuinely uncomfortable — not because of resistance to the concept, but because the capability infrastructure for qualitative decision-making does not exist.
NLP classification difficulty. The classification task at the core of the MVP — distinguishing concluded departure from active complaint, and satisfied silence from concluded silence — is genuinely difficult. Standard pre-trained sentiment models will not do this out of the box. The system needs to recognise the specific linguistic signatures of conclusion: the absence of emotional language, the shift in pronoun framing, the decline in posting frequency interpreted as signal rather than noise. The MVP can use an LLM as the classifier rather than training a custom model, which reduces the technical barrier significantly, but the prompt engineering and validation work is still substantial.
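A sketch of what the LLM-as-classifier route could look like. The category definitions paraphrase the distinctions above; the prompt wording is an assumption, and `llm_complete` is a stand-in for whichever model API the project adopts, passed in rather than invented here.

```python
# Sketch of an LLM-based relational-state classifier (prompt wording assumed).
CATEGORIES = ["concluded_departure", "active_complaint", "satisfied_silence_context"]

PROMPT_TEMPLATE = """You are classifying a community forum post for relational state.
Categories:
- concluded_departure: calm, settled, past-tense framing; no ask; 'we' has become 'they'.
- active_complaint: emotional, present-tense; still asking the brand to fix something.
- satisfied_silence_context: neutral or warm; reduced posting explained by life context.
Return exactly one category name.

Post:
{post}
"""

def classify_post(post: str, llm_complete) -> str:
    label = llm_complete(PROMPT_TEMPLATE.format(post=post)).strip()
    # Validation matters as much as the prompt: reject anything off-schema.
    return label if label in CATEGORIES else "needs_human_review"
```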
Longitudinal analysis requirement. As noted under the data obstacles, vocabulary shift detection is a trajectory over months or years, not a single-point classification. Computing it is straightforward; defining a meaningful shift against normal linguistic variation is not. Context matters enormously: a subscriber who posts less in December may be on holiday rather than disengaging.
False positive rate and steward fatigue. If the system flags too many subscribers who are not at risk, the steward's attention will be diluted and the intervention quality will decline. If it flags too few, it misses the subscribers it was designed to catch. Calibrating the false positive rate requires real data, which Approach C — operating on synthetic signals — cannot provide. This means the threshold calibration problem is deferred to the next iteration.
Steward skill requirement. The steward's value depends on her ability to read verbatim community language and make a qualitative judgement about the subscriber's relational state. This is a skill that most customer service environments do not systematically develop. The MVP cannot be used to train this skill before the test — training before evaluation would contaminate the results. But deploying an undertrained steward risks invalidating the prototype results for a different reason. The orientation protocol in the action plan (Phase 3.3) navigates this tension by providing minimal orientation without interpretive instruction.
The experience of being watched. The system monitors language subscribers wrote in the context of speaking to each other, not to the brand. There is a version of this system that subscribers would experience as surveillance if they learned about it. The legitimacy of the system depends on whether Naked Wines can honestly say to an Angel: 'We read what you wrote in the community forum and used it to understand how you were feeling about us.' If that sentence sounds creepy rather than caring, the system has a legitimacy problem that no DPIA can resolve. The surveillance concern is raised as an explicit data capture point in the MVP review sessions (Phase 4.2 of the action plan).
Three structurally different approaches — not cautious/moderate/ambitious versions of the same plan. Each makes meaningfully different choices about data sources, technology, regulatory sequencing, and what assumption is tested first.
Approach A — Retrospective Validation. Steward test: Review ten retrospective subscriber briefs for Angels who are already gone. Assess whether the signal was real and whether the brief format supported confident decision-making.
DPIA position: Operates on historical, anonymised data for internal analytical purposes. Substantially more defensible than live profiling — legal counsel's position should be confirmed, but retrospective analysis of anonymised historical data for internal model validation is a different regulatory proposition from live behavioural profiling of current subscribers.
What it prioritises: Evidence before infrastructure. This approach answers the foundational question — does the signal exist in the data? — before any new system is built. It also sidesteps the DPIA sequencing gate for the MVP phase, because it operates on historical data for internal analysis rather than live profiling of current subscribers.
What it trades off: It does not test the live operational loop — the steward receiving a real-time flag and deciding what to do. It validates the signal but not the workflow. There is also a risk of hindsight bias: knowing that the subscriber churned may make the steward see signals that she would not have noticed prospectively.
Assumptions it relies on: That the historical forum and CRM data is accessible and linkable at the individual subscriber level. That churned Angels' forum posting histories are still available in the system. That the legal position on retrospective analysis of historical data is distinct from live profiling.
Approach B — Live Shadow System. Steward test: Review live briefs and record decisions (Intervene, Monitor, Escalate) over a 90-day observation window. Track the system's predictions against actual subscriber behaviour.
DPIA position: Requires the DPIA to be completed before deployment. Live profiling of current subscribers, even without intervention, triggers Article 35.
What it prioritises: Predictive validation. This is the only approach that generates genuine evidence of whether the system's signals predict future behaviour, not just whether they describe past behaviour. It also tests the full HITL interface — the steward experiences the dashboard as she would in production, makes real decisions under real conditions, and can provide informed feedback on whether the verbatim signal was genuinely more useful than the model score.
What it trades off: It requires the DPIA to be completed before deployment, which puts the DPIA on the critical path and may add 8–16 weeks depending on DPO capacity and ICO consultation requirements. It requires the full data integration pipeline (forum data linked to CRM records) to be built — the largest technical dependency. It requires a steward to commit real time over 90 days to reviewing briefs that produce no action — a harder organisational sell than a one-off retrospective exercise.
Assumptions it relies on: That the DPIA can be completed within the project timeline. That the data engineering team can deliver the forum-CRM integration. That a suitably senior steward can be allocated for 90 days of shadow operation. That 90 days provides enough observation time to see whether flagged subscribers subsequently churn.
Approach C — Simulated Signal Test. Steward test: Present briefs to stewards who do not know which represent concluded departure and which represent active complaint or satisfied silence. Assess decision quality and interface effectiveness.
DPIA position: No DPIA dependency. No internal subscriber data is processed. Operates entirely on public proxy data and synthetic profiles.
What it prioritises: Speed and HITL design validation. This approach can be built and tested in 2–3 weeks with no data integration, no DPIA dependency, and no access to internal systems. It tests the single thing the concept document identifies as the MVP's success criterion: can the steward, on reviewing a flagged subscriber's brief, make a confident decision and confirm that the verbatim signal was the most useful element? It also enables a direct controlled comparison: the same briefs presented in verbatim-first and score-first formats to test whether the information hierarchy changes the steward's decision-making.
What it trades off: It does not test whether the AI can actually detect the signals in Naked Wines' real data. It validates the interface, the HITL workflow, and the steward's decision-making — but uses hand-crafted signals rather than algorithmically generated ones. It also relies on proxy data that may not fully represent the linguistic register of Naked Wines' own community.
Assumptions it relies on: That public proxy data is a reasonable stand-in for community language during the prototype phase. That stewards can be recruited and briefed without requiring formal organisational approval. That the value of testing the HITL interface independently of the signal pipeline is understood and accepted.
The choice between the three approaches is a strategic decision about what the MVP is for — which risk the project owner judges most material to test first.
| If the biggest risk is… | Choose… | Because… |
|---|---|---|
| The signal does not exist in the data at all | Approach A: Retrospective Validation | It answers the foundational question first with the lowest overhead and fewest dependencies. |
| The signal exists but does not predict future behaviour | Approach B: Live Shadow System | It is the most rigorous validation, generating genuine predictive evidence over a 90-day window. |
| Even a perfect signal will be wasted by a poor interface or unsuitable HITL workflow | Approach C: Simulated Signal Test | It tests the human decision layer independently and can be executed immediately. |
None of the three approaches is wrong. They test different parts of the same system. They could also be sequenced: C first to validate the interface, then A to validate the signal, then B to validate the prediction — which is in effect a three-phase MVP development path. That sequencing decision sits with the project owner and informs the selection rationale in the Critical Assessment tab.
Weighted Scoring Matrix and Sweet Spot Analysis applied to the three implementation approaches from Prompt 1. HITL convergent step — Tim reviewed, challenged, and selected. The selection reflects an honest assessment of the constraints under which this project actually operates, not an optimistic reading of what might be achievable.
Desirability (weight: 3) — Does the approach produce something that the intended user (the relationship steward) would genuinely want to use, find credible, and act on? Lower weight than Feasibility because a desirable approach that cannot be built is not an MVP.
Feasibility (weight: 5) — Can the approach be executed within the constraints that actually exist: no internal data access, no DPIA completed, project lead as sole resource? Joint highest weight. An approach that cannot be executed produces no evidence, regardless of its theoretical merit.
Viability (weight: 4) — Does the approach produce evidence that justifies further investment? A result that is internally interesting but insufficient to influence a senior stakeholder's decision is not viable in the context of this project.
HITL Integrity (weight: 5) — Does the approach preserve the HITL architecture as specified: verbatim community language surfaced before any model score, human steward as the decision-maker, no AI output reaching the subscriber without review? Joint highest weight because the HITL design is a non-negotiable constraint, not a preference.
Approach A: Retrospective Validation — Score: 3. The steward reviews historical cases with known outcomes. Her decisions are retrospective rather than prospective — she knows, or can infer, that the subscriber in question has already churned. This makes the experience partially artificial. The HITL interface can be tested but the steward's confidence in a brief where the outcome is already determined is not the same as her confidence in a live decision. Scored at 3 rather than lower because the verbatim-first format can still be evaluated for its communicative effectiveness.
Approach B: Live Shadow System — Score: 5. The steward reviews live subscriber briefs with no knowledge of how the subscriber subsequently behaves. Her decisions are genuinely prospective, uncontaminated by hindsight. This is the highest-fidelity test of whether the system produces something the steward genuinely finds actionable. If she can make confident decisions on live briefs and later evidence shows those decisions correlated with actual subscriber behaviour, the Desirability case is conclusive.
Approach C: Simulated Signal Test — Score: 3. The steward reviews synthetic briefs constructed from public proxy data mapped onto fictional subscriber profiles. The interface can be tested and the HITL workflow validated. But the steward knows — or can reasonably infer — that the data is simulated. The emotional register of proxy language (Reddit, Trustpilot, app store reviews) may not match the specific tone of Naked Wines' community. The steward is evaluating a demonstration, not using a tool. Sufficient to test the interface design but not sufficient to test whether the steward would trust and act on the system in practice.
Approach A: Retrospective Validation — Score: 4. Operates on historical data for internal analytical purposes. The DPIA position is substantially more defensible than live profiling — retrospective analysis of anonymised historical data for internal model validation is a different regulatory proposition from live behavioural profiling of current subscribers. Requires access to historical forum data linked to CRM records at the individual subscriber level — an organisational dependency but not a regulatory one. Does not require real-time data infrastructure. Scored at 4 rather than 5 because it still requires internal data access, which depends on organisational approval that has not yet been secured.
Approach B: Live Shadow System — Score: 2. Requires the DPIA to be completed before deployment — live profiling of current subscribers, even without intervention, triggers UK GDPR Article 35. The DPIA is on the critical path and may add 8–16 weeks depending on DPO capacity and ICO consultation requirements. Requires the full forum-to-CRM data integration pipeline to be built — the single largest technical dependency in the entire project. Requires a suitably senior steward to commit real time over 90 days to reviewing briefs that produce no action. Scored at 2 because two of these three dependencies (DPIA completion and data pipeline construction) are substantial and neither is within the project's direct control.
Approach C: Simulated Signal Test — Score: 5. No DPIA dependency. No internal subscriber data is processed. No data integration pipeline is required. Operates entirely on public proxy data and synthetic profiles that the project team constructs. Can be built and tested in 2–3 weeks. The only organisational dependency is recruiting stewards for the review exercise, which can be done informally. This is the most executable approach by a significant margin.
Approach A: Retrospective Validation — Score: 4. Produces a specific, evidenced answer to the foundational question: does the signal exist in the historical data? If the retrospective analysis shows that the Forum Divergence Score, Silence Classifier, and Community Vocabulary Shift indicators were consistently elevated in the 12–18 months before high-LTV Angels churned, this is direct evidence that the signal is real and detectable. This is the evidence a senior leadership team would need to justify further investment. Scored at 4 rather than 5 because retrospective evidence is inherently weaker than prospective evidence — it demonstrates correlation in historical data, not prediction of future behaviour.
Approach B: Live Shadow System — Score: 4. Produces the strongest possible evidence: prospective predictions tested against actual subscriber behaviour over a 90-day observation window. If the system flags subscribers who subsequently churn, and does not flag subscribers who remain, the predictive case is made. This is the evidence that would convert the business case from 'plausible' to 'proven.' Scored at 4, level with Approach A, because while the evidence it produces is stronger, whether the project ever reaches the point of producing that evidence is a Feasibility problem — and evidence that is never generated has zero viability regardless of its theoretical strength.
Approach C: Simulated Signal Test — Score: 2. Validates the dashboard interface and the steward's decision-making workflow. Does not validate whether the AI can actually detect the signals in Naked Wines' real data. A board-level audience asking 'does this work?' would receive the answer: 'the interface works and the stewards find it usable — but we have not yet tested whether the underlying signal detection is accurate.' Useful evidence for the next iteration but insufficient evidence to justify significant further investment on its own.
Approach A: Retrospective Validation — Score: 4. The HITL architecture is fully testable: the steward reviews subscriber briefs with verbatim community language presented first, before any model score. The verbatim-first principle is preserved. However, the hindsight bias risk introduces a qualification: the steward who knows the subscriber has churned will read the verbatim language differently from one making a genuinely prospective decision. The HITL design is intact; the steward's cognitive state is not fully equivalent to production conditions.
Approach B: Live Shadow System — Score: 5. The HITL architecture is tested in the highest-fidelity conditions. The steward reads verbatim language for a subscriber whose status is genuinely unknown. She makes a decision — Intervene, Monitor, Escalate — without knowing whether it will prove correct. This is the authentic test of whether the verbatim-first design enables better decisions than a score-first format under real operational conditions. No qualification applies.
Approach C: Simulated Signal Test — Score: 4. The verbatim-first design can be tested in the prototype and the controlled comparison between verbatim-first and score-first formats can be run. The HITL workflow is validated. The qualification is that the verbatim language is synthetic, not algorithmically detected from real Naked Wines community data — so the test of whether the steward would trust the system in practice is limited by the artificiality of the data source.
| Criterion | Weight | A: Raw | A: Wtd | B: Raw | B: Wtd | C: Raw | C: Wtd |
|---|---|---|---|---|---|---|---|
| Desirability | 3 | 3 | 9 | 5 | 15 | 3 | 9 |
| Feasibility | 5 | 4 | 20 | 2 | 10 | 5 | 25 |
| Viability | 4 | 4 | 16 | 4 | 16 | 2 | 8 |
| HITL Integrity | 5 | 4 | 20 | 5 | 25 | 4 | 20 |
| TOTAL (Weighted) | | | 65 | | 66 | | 62 |
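The totals can be recomputed directly from the raw scores and weights above:

```python
# Recomputing the matrix totals from the raw scores and criterion weights.
weights = {"Desirability": 3, "Feasibility": 5, "Viability": 4, "HITL Integrity": 5}
raw = {
    "A": {"Desirability": 3, "Feasibility": 4, "Viability": 4, "HITL Integrity": 4},
    "B": {"Desirability": 5, "Feasibility": 2, "Viability": 4, "HITL Integrity": 5},
    "C": {"Desirability": 3, "Feasibility": 5, "Viability": 2, "HITL Integrity": 4},
}
totals = {a: sum(weights[c] * s for c, s in scores.items()) for a, scores in raw.items()}
print(totals)  # {'A': 65, 'B': 66, 'C': 62}
```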
The chart makes two things immediately visible: B is tall on Desirability and HITL but collapsed on Feasibility — the most rigorous approach that cannot be built within current constraints. C is tall on Feasibility but collapsed on Viability — the most executable approach that does not answer the foundational question. A occupies the middle ground on most criteria.
The sweet spot of innovation sits at the intersection of Desirability (Heart), Feasibility (Hands), and Viability (Head). A solution failing on any one dimension produces a specific failure mode. The following analysis maps each approach to this framework and positions them relative to the failure zones.
A sits closest to the sweet spot but carries residual risk of producing a 'dream' — retrospective evidence that the signal exists but insufficient proof that the steward can act on it in real time. B sits in the Desirability circle but outside the Feasibility circle — the most rigorous test that the project cannot currently execute. C sits firmly in the Feasibility circle but outside Viability — the most executable approach that does not answer the foundational question.
| Approach | Strongest Dimension | Weakest Dimension | Sweet Spot Risk |
|---|---|---|---|
| A: Retrospective Validation | Feasibility–Viability axis | Desirability (hindsight bias) | Dream: retrospective evidence compelling but prospective decision-making untested |
| B: Live Shadow System | Desirability + HITL Integrity | Feasibility (DPIA, pipeline, 90-day commitment) | Dream: ideal test that cannot be executed within current constraints |
| C: Simulated Signal Test | Feasibility (5) | Viability (2) | Non-adoption: proves the interface works but not that the underlying signal is real |
Approach A is the only approach that directly answers the foundational question: does the signal exist in the data? If it does not — if the Forum Divergence Score, Silence Classifier, and Community Vocabulary Shift indicators show no consistent pattern before historical churns — then the entire Quiet Signal System concept requires fundamental revision. Approach B would eventually answer this question too, but only after significantly greater investment. Approach C does not answer it at all.
The trade-off is: Approach A answers the most important question first, but under imperfect conditions (retrospective, with hindsight bias). The alternative is to answer a less important question under better conditions (Approach C), or to answer all questions under ideal conditions but at a cost that may prevent the test from happening (Approach B).
Approach B is the only approach that requires the DPIA to be completed before deployment. This is not a bureaucratic hurdle — it is a substantive gate. Live profiling of current subscribers, even in shadow mode without intervention, triggers UK GDPR Article 35 automated decision-making requirements. The DPIA timeline is not within the project's direct control and depends on DPO capacity and potentially ICO consultation. Approach A occupies an intermediate position: it uses internal historical data but for retrospective analysis rather than live profiling, which is a materially different regulatory proposition. Approach C sidesteps the question entirely.
The trade-off is: the most rigorous approach (B) carries the highest regulatory risk and the longest critical path; the most executable approach (C) avoids regulatory engagement entirely but also avoids the data that matters.
The three approaches produce evidence of fundamentally different kinds. Approach A produces correlational evidence: the signal was present before historical churns. Approach B produces predictive evidence: the signal identified subscribers who subsequently churned. Approach C produces usability evidence: the interface enables effective steward decision-making. Each type of evidence serves a different audience and answers a different question. Correlational evidence tells the data science team the signal is real. Predictive evidence tells the board the system works. Usability evidence tells the operations team the dashboard is fit for purpose. None is wrong; they are different. The trade-off is which evidence gap the project owner judges most dangerous to carry into the next iteration.
The three approaches could be sequenced: C first to validate the interface, then A to validate the signal, then B to validate the prediction — a legitimate three-phase MVP development path. But it is important to recognise what that sequencing implies: if C is chosen first, the project spends its first iteration learning whether the dashboard is usable without learning whether the signal is real. If the signal turns out not to exist in the data (tested in the second iteration via A), the interface validated in the first iteration was validated against synthetic signals that the real system cannot reproduce. This is a real cost, acknowledged and accepted.
The weighted scoring matrix does not produce a clear winner — the three approaches score within four points of each other (62–66) and each is strong on different dimensions. The selection is therefore determined not by the matrix alone but by an honest assessment of the constraints under which this project actually operates.
The decisive constraint is data access. This project does not have access to Naked Wines' internal data — neither historical CRM records, nor forum archives linked to individual subscriber profiles, nor live behavioural data. This is a hard boundary, not an unsecured dependency. Approach A scores a Feasibility of 4 in the matrix, but that score assumes internal data access is an organisational dependency that could be obtained. An independent critical assessment of the scoring confirms this is overstated: without internal data, a retrospective analysis would operate on fabricated historical data, which is functionally identical to Approach C with a retrospective framing bolted on. The distinction between A and C collapses when neither has access to the real data that gives A its analytical advantage. Approach B requires not only data access but a completed DPIA and a 90-day steward commitment — dependencies that are further from resolution than A's.
Approach C is the only approach that can be fully executed and demonstrated within the constraints that actually exist. It operates entirely on public proxy data and synthetic subscriber profiles. It requires no DPIA, no data integration pipeline, and no internal organisational approval. It can be built, tested with stewards, and iterated within the assignment timeline. This is not a compromise selection — it is the selection that takes the project's own Feasibility weighting (5, the joint highest) seriously.
The Viability weakness is real and is accepted, not explained away. Approach C does not answer the foundational question: is the signal real in Naked Wines' data? It validates the dashboard interface, the steward's decision-making workflow, the verbatim-first information hierarchy, and the HITL architecture — but does so against synthetic signals. A leadership team reviewing the results would know the tool is usable but would not know whether the underlying signal detection works. This is the correct limitation to carry into the next iteration. What happens next, if Approach C validates the interface and workflow, is Approach A — testing the signal against real data, with the interface design already grounded in steward feedback.
The HITL controlled comparison — presenting the same briefs in verbatim-first and score-first formats — is a specific methodological advantage of Approach C that neither A nor B offers. This directly tests the project's core architectural commitment: that surfacing verbatim community language before model scores produces better steward decisions. If this comparison shows no difference, the information hierarchy requires revision regardless of whether the signal is real. This is valuable evidence in its own right. A completed Approach C with a known Viability gap is more useful than an incomplete Approach A with no findings.
Approach C: Simulated Signal Test. Four phases structured from course material frameworks: Develop: Preparation (five key aspects) and Deliver: Finalise and Prepare for Deployment, Implement and Launch, and Monitor and Iterate Post-Launch. Each activity identifies what it is, who owns it, what it produces, and its dependencies.
Lay the groundwork to transform the selected concept into something testable, enabling rapid learning through steward feedback. The mindset is 'build to learn,' not 'build to launch.'
A focused effort to outline the immediate next steps required to build a basic, testable version of the Relational Health Dashboard. This is not a full project plan. It answers three questions: What are the absolute first actions? Who is responsible? What is the quickest way to create something stewards can react to?
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Define the ten subscriber brief archetypes covering the signal spectrum: concluded departure, active complaint, satisfied engagement, and ambiguous/borderline cases | Project lead (Tim) | Brief specification document listing the ten profiles with target language patterns and CRM characteristics for each | POV statement and HITL Critical Finding from Assignment 6 |
| Confirm the steward review protocol: blind review (steward does not know which briefs are concluded vs active), verbatim-first format vs score-first comparison, and structured decision capture form | Project lead | Review protocol document specifying session format, question sequence, and data capture method | Brief specification (above) |
| Establish the build timeline: two-week sprint from brief specification sign-off to steward review sessions | Project lead | Sprint plan with milestones: Week 1 (data construction and prototype build), Week 2 (steward recruitment, briefing, and review sessions) | Resource availability confirmed |
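The brief specification lends itself to a simple structured representation. The sketch below is a minimal Python rendering of the four archetype classes named above; the class, field names, and value ranges are illustrative assumptions, not taken from the actual build.

```python
from dataclasses import dataclass, field
from enum import Enum

class SignalType(Enum):
    CONCLUDED_DEPARTURE = "concluded_departure"
    ACTIVE_COMPLAINT = "active_complaint"
    SATISFIED_ENGAGEMENT = "satisfied_engagement"
    BORDERLINE = "borderline"

@dataclass
class BriefArchetype:
    """Specification for one of the ten synthetic subscriber briefs."""
    brief_id: str
    ground_truth: SignalType      # withheld from the steward (blind review)
    tenure_years: int             # realistic range: 5 to 12 years
    prefunded_balance_gbp: float
    referral_count: int
    simulated_nps: int | None     # None models the silent non-responder
    forum_posts: list[str] = field(default_factory=list)        # proxy verbatim language
    language_patterns: list[str] = field(default_factory=list)  # target patterns for this archetype
```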
Pinpointing the essential people, tools, and materials required for the initial prototype. The guiding principle is to use what is readily available and cost-effective.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| People and skills: identify and recruit two to three relationship stewards (or equivalent senior customer-facing staff) willing to participate in a one-hour blind review session each | Project lead, with informal support from retention team lead | Confirmed steward participants with scheduled session times | Organisational willingness to release staff time for an experimental exercise; no formal approval gate required (no real subscriber data involved) |
| Tools and software: confirm the prototyping tool (interactive HTML, built via LLM-assisted development) and the feedback capture method (structured form or spreadsheet) | Project lead | Tool selection confirmed; template feedback form drafted | None (low-fidelity tools, no procurement) |
| Materials and data: assemble public proxy community language from Reddit, Trustpilot, and app store reviews of DTC subscription services; map to the language patterns identified in the project (concluded departure, we-to-they shift, satisfied silence); see the corpus sketch after this table | Project lead with LLM support | Curated proxy language corpus organised by signal type, ready for brief construction | Public proxy sources accessible; language patterns defined in Assignment 6 ideation outputs |
| Budget: confirm zero direct cost for this prototype phase — all tools are available, all data is public, steward time is the only resource cost | Project lead | Budget note confirming nil incremental cost | None |
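The proxy corpus itself can be held as tagged excerpts organised by signal type, preserving provenance for the technical documentation. A minimal sketch with illustrative field names:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ProxyExcerpt:
    """One public proxy language sample, ready for brief construction."""
    text: str
    source: str        # "reddit" | "trustpilot" | "app_store"
    signal_type: str   # "concluded_departure" | "we_to_they_shift" | "satisfied_silence"
    url: str           # provenance, recorded for the technical documentation

corpus: dict[str, list[ProxyExcerpt]] = defaultdict(list)

def add_excerpt(excerpt: ProxyExcerpt) -> None:
    """File each excerpt under its signal type, ready for archetype mapping."""
    corpus[excerpt.signal_type].append(excerpt)
```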
A clear decision on the scope and fidelity of the first prototype. The principle is Minimum Testable Prototype — the smallest thing that can be built to learn whether the verbatim-first HITL design produces better steward decisions than the current score-first paradigm.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| 'What' to prototype: the Relational Health Dashboard as the steward experiences it — subscriber queue, verbatim community language displayed first, trajectory data (posting frequency, referral velocity, email engagement), confidence intervals (not binary flags), and steward decision options (Intervene, Monitor, Escalate); the hierarchy is sketched after this table | Project lead | Prototype functional specification: what the steward sees, the information hierarchy, and the decision workflow | Brief specification and review protocol from 1.1 |
| 'How' to prototype: interactive HTML built with simulated data. High enough fidelity that the steward can experience the workflow realistically, but not a production system. Ten synthetic subscriber briefs with fabricated forum posts, dummy behavioural telemetry, and simulated NPS scores illustrating the metric-truth gap | Project lead with LLM-assisted development | Working interactive HTML prototype with ten populated subscriber briefs | Proxy language corpus from 1.2; brief archetypes from 1.1 |
| Comparison condition: build a second view of the same briefs in score-first format (model score displayed prominently, verbatim language secondary) to enable a direct controlled comparison of the information hierarchy's effect on steward decision-making | Project lead | Score-first variant of the prototype interface for the same ten briefs | Verbatim-first prototype completed first |
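To make the information-hierarchy commitment concrete, the sketch below shows how both prototype variants could order the same underlying brief data so that only the display sequence differs. The field names are illustrative assumptions:

```python
def brief_sections(brief: dict, condition: str) -> list[tuple[str, object]]:
    """Order the same underlying brief data for each test condition.
    Only the display order differs between the two prototype variants."""
    sections = {
        "verbatim": brief["forum_posts"],    # verbatim community language
        "trajectory": brief["trajectory"],   # posting frequency, referral velocity, email engagement
        "score": brief["model_score"],       # confidence interval, not a binary flag
    }
    order = (["verbatim", "trajectory", "score"]
             if condition == "verbatim_first"
             else ["score", "trajectory", "verbatim"])
    return [(name, sections[name]) for name in order]

DECISION_OPTIONS = ("Intervene", "Monitor", "Escalate")  # steward workflow endpoints
```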
The primary purpose of this prototype is not to build a perfect product but to enable rapid learning through steward testing.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Define the critical questions this prototype must answer: (1) Does the verbatim-first information hierarchy change the steward's decision compared to score-first? (2) Can the steward make a confident decision on a flagged subscriber brief? (3) Does the steward confirm that verbatim community language was the most useful element? | Project lead | Documented learning questions linked to the MVP completion criterion | MVP completion criterion from Concept Development; review protocol from 1.1 |
| Design the feedback collection instrument: a structured form capturing decision made (Intervene/Monitor/Escalate), confidence level (1–5), most useful element (verbatim language, trajectory data, model score, other), and free-text observations (a record-layout sketch follows this table) | Project lead | Steward feedback form ready for use in review sessions | Learning questions defined (above) |
| Establish the success and failure criteria for this iteration: Success = stewards make confident decisions and identify verbatim signal as most useful in at least 7 of 10 briefs. Failure = stewards cannot distinguish signal types, or find the model score more useful than verbatim language. Both outcomes are informative | Project lead | Documented success/failure criteria with explicit statement that failure is a valid learning outcome | Learning questions and MVP completion criterion |
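The feedback instrument reduces naturally to one record per brief per condition, which keeps the later analysis mechanical. A minimal sketch with illustrative field names, including the 7-of-10 check defined in the success criteria above:

```python
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One row of the steward feedback form (field names are illustrative)."""
    steward: str
    brief: str
    condition: str    # "verbatim_first" or "score_first"
    decision: str     # "Intervene" | "Monitor" | "Escalate"
    confidence: int   # 1 (low) to 5 (high)
    most_useful: str  # "verbatim_language" | "trajectory_data" | "model_score" | "other"
    notes: str = ""

def verbatim_most_useful_count(records: list[FeedbackRecord]) -> int:
    """Count briefs where verbatim language was judged most useful.
    The stated success threshold is at least 7 of the 10 briefs."""
    return sum(1 for r in records if r.most_useful == "verbatim_language")
```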
This phase is foundational for the iterative nature of Design Thinking. It prepares the project for a continuous loop of refinement, recognising that this first prototype is rarely the last.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Define the iteration pathway: this MVP (Approach C) tests the interface and steward workflow. The next iteration (Approach A: Retrospective Validation) tests whether the signal exists in real historical data. The third iteration (Approach B: Live Shadow System) tests predictive accuracy. Each iteration returns to the start of the double diamond | Project lead | Documented iteration sequence showing what each cycle tests and how it feeds the next | Critical Assessment and Approach Selection document (sequencing argument) |
| Identify the pivot criteria: if the steward review reveals that the interface design is fundamentally wrong (e.g. verbatim language is overwhelming rather than useful, or the decision options do not match how stewards actually think about subscriber relationships), the next iteration redesigns the interface before proceeding to signal validation | Project lead | Pivot criteria documented as part of the learning framework | Success/failure criteria from 1.4 |
Ensure the prototype is polished, documented, compliant, and ready for the steward review sessions. The goal is a prototype that functions reliably and can be evaluated honestly.
The final stage of perfecting the prototype before it reaches the stewards. This involves addressing any interface issues and ensuring consistent design quality throughout.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Review all ten subscriber briefs for consistency: realistic CRM data ranges (tenure 5–12 years, pre-funded balances, referral histories), plausible forum language that does not read as obviously fabricated, and coherent behavioural telemetry trajectories | Project lead | Reviewed and corrected brief dataset; change log documenting any revisions | Prototype build complete from Phase 1 |
| Test the steward workflow end-to-end: navigate the subscriber queue, open each brief, read verbatim language, review trajectory data, make a decision, complete the feedback form. Identify and fix any interface friction | Project lead (self-test), plus one informal tester if available | Bug list resolved; confirmed smooth workflow from queue to decision to feedback capture | Working prototype and feedback form |
| Verify the comparison condition: confirm that the score-first variant presents identical underlying data with only the information hierarchy changed, so any difference in steward decisions can be attributed to the display order rather than data differences (made mechanical in the sketch below) | Project lead | Verified parity between verbatim-first and score-first prototype variants | Both prototype variants built |
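The parity verification can be made mechanical if each variant embeds its brief data as a JSON payload. A minimal sketch, assuming hypothetical file names for the two variants:

```python
import json

def variants_share_data(verbatim_first_file: str, score_first_file: str) -> bool:
    """True when both prototype variants embed identical brief data, so any
    difference in steward decisions is attributable to display order alone."""
    with open(verbatim_first_file) as a, open(score_first_file) as b:
        return (json.dumps(json.load(a), sort_keys=True)
                == json.dumps(json.load(b), sort_keys=True))
```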
Creating clear, accessible materials for the stewards who will interact with the prototype and for the project record.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Steward briefing document: a one-page guide explaining what the steward will see, what they are being asked to do, and what they should not assume (i.e. that this is a finished product or that the data is real). Must not reveal which briefs are concluded departure vs active complaint | Project lead | One-page steward briefing document, reviewed for clarity and absence of bias cues | Prototype finalised from 2.1 |
| Review session facilitation guide: step-by-step instructions for running the session, including the order of brief presentation, when to switch from verbatim-first to score-first format, and how to capture feedback without leading the steward | Project lead | Session facilitation guide | Review protocol from Phase 1; steward briefing document |
| Technical documentation: how the prototype was built, what data sources were used, how the simulated signals were constructed, and what the known limitations are. Required for the assignment appendix and for any future iteration team | Project lead | Technical specification document covering build method, data provenance, and known limitations | Prototype build complete |
A critical check to ensure the prototype adheres to all relevant laws, regulations, and ethical guidelines. Approach C sidesteps the DPIA sequencing gate because no internal subscriber data is processed, but this position must be formally confirmed.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Confirm the regulatory position: Approach C uses only public proxy data and synthetic profiles. No real Naked Wines subscriber data is processed. Confirm with legal counsel that this characterisation is accurate and that the steward review sessions — using fictional briefs with recruited participants — do not trigger any data protection obligations | Project lead (confirm with DPO if available) | Written confirmation of regulatory position for the record | Prototype specification finalised |
| Initiate the DPIA for the next iteration: even though the DPIA is not required for Approach C, the DPIA process for Approach A (Retrospective Validation) should be initiated now. The DPIA may take 8–16 weeks. Starting it during Phase 2 of Approach C means it will be complete (or close to complete) when Approach C concludes and Approach A is ready to begin | Project lead, with DPO as gating authority | DPIA initiation document for Approach A (retrospective analysis of historical subscriber data); timeline established | DPO availability and organisational approval |
| Ethical review of the steward briefing: ensure the briefing document does not prime the steward to prefer verbatim language. The prototype must capture the steward's genuine response, not a response shaped by the briefing | Project lead with independent reader review | Briefing document confirmed as bias-free | Briefing document from 2.2 |
Ensuring the prototype is technically stable and accessible for the steward review sessions.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Deploy the prototype to a stable, accessible URL: a static hosting service (e.g. Cloudflare Pages, GitHub Pages) that the steward can access from any device without installation or login | Project lead | Live prototype URL accessible on desktop and mobile | Prototype build and polishing complete |
| Test prototype performance and cross-browser compatibility: confirm the interface renders correctly on Chrome, Safari, and Edge. Confirm the feedback form submits correctly | Project lead | Cross-browser test log; confirmed form submission | Deployment complete |
| Prepare a fallback: a PDF or printed version of the subscriber briefs in case of technical failure during a session | Project lead | Printed brief set as session backup | Brief content finalised |
Final approval before the prototype is released to stewards. At the MVP stage, the stakeholder group is small — the project lead and the retention team lead at minimum.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Retention team lead review: share the prototype, the briefing document, and the session protocol with the retention team lead. Address any concerns about content, process, or steward selection before sessions begin | Project lead | Retention team lead sign-off (informal, documented) | All Phase 2 outputs complete |
| Confirm steward availability: final confirmation that the recruited stewards are available for their scheduled sessions and have received the briefing document | Project lead | Confirmed session schedule | Steward briefing document from 2.2 |
Execute the steward review sessions and gather the evidence the prototype was designed to produce. 'Launch' for this MVP means running the steward review sessions — not public deployment or subscriber-facing rollout.
The deployment is the steward review sessions. Two to three individual one-hour sessions, each using the same protocol and the same ten briefs.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Session sequencing: run the verbatim-first format across all ten briefs first, then re-present the same ten briefs in score-first format. This preserves the verbatim-first experience as the primary, uncontaminated condition while enabling the controlled comparison on every brief, as the Phase 4 brief-by-brief analysis requires. The order is consistent across all stewards so that any sequence effect is held constant | Project lead (session facilitator) | Consistent session protocol across all stewards | Session facilitation guide from 2.2 |
| Stagger sessions by at least 48 hours: stewards must not discuss the prototype with each other between sessions. The facilitation guide includes an explicit instruction not to share opinions before all sessions are complete | Project lead | Session schedule with adequate separation | Steward availability confirmed |
At MVP stage, 'market introduction' means introducing the prototype to the stewards — not external communication.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Opening orientation (10 minutes per session): walk the steward through the interface, explain what each element of the brief contains, demonstrate the decision workflow with a sample brief that is not one of the ten test briefs. Do not instruct the steward on how to interpret the signals — the prototype must capture the unguided response | Project lead | Steward oriented and ready to begin; orientation script used consistently | Briefing document and prototype deployed |
| Explain the purpose without disclosing the hypothesis: tell the steward that the exercise is testing a new interface design and that there are no right or wrong answers. Do not reveal that the project is specifically testing whether verbatim language produces better decisions than a model score — this would prime the steward's response | Project lead | Blinded session conditions maintained | Facilitation guide from 2.2 |
At MVP stage, training is minimal and deliberately so. The steward's unguided response is the data.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Orientation only: the facilitation guide specifies exactly what to show and explain, and what not to explain. The steward should understand the mechanics (how to navigate, how to submit decisions) but not the interpretive framework (what signals to look for, how to weight verbatim language vs trajectory data) | Project lead | Consistent orientation across all sessions; no interpretive contamination | Facilitation guide from 2.2 |
| Structured debrief (15 minutes per session): after the steward has completed all ten briefs, conduct a debrief structured around open questions: what did you find most useful? What was confusing? What would you want to know that you could not see? Do not reveal the ground truth (which briefs were concluded departure) until after the debrief is complete | Project lead | Debrief notes per steward; verbatim captures of key observations | All ten briefs reviewed and decisions submitted |
| Resistance signal log: specifically capture any steward comments about the surveillance dimension — whether they have concerns about the ethics of monitoring community language that subscribers wrote to each other, not to the brand. This is critical data for the Cultural Enablement plan | Project lead | Surveillance/legitimacy reaction captured per steward, with verbatim quotes where possible | Debrief notes |
The MVP introduces a new workflow and a new decision-making paradigm. Even at prototype stage, the steward is experiencing something that may challenge her existing understanding of what the retention function does.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Acknowledge the role shift explicitly in the debrief: ask the steward whether the stewardship model (qualitative judgement, no save script) feels meaningfully different from the current retention model. Do not advocate for the change — capture the steward's honest response to the contrast | Project lead | Steward perspective on the role-identity dimension of the change, captured in debrief notes | Debrief complete |
| Identify early adopters and resisters: note which stewards engage with the verbatim-first format readily and which find it difficult or uncomfortable. This is the early signal for how the capability-building programme needs to be structured in the post-MVP phase | Project lead | Informal adopter/resister profile for each steward participant | Session and debrief observations |
Analyse the evidence from the steward sessions, identify what the prototype has proven and what it has not, and determine what happens next. The double diamond loop closes here and reopens for the next iteration.
Quantitative analysis of the session data against the success and failure criteria established in Phase 1.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Calculate steward confidence scores: average confidence (1–5) per brief and per steward, across both verbatim-first and score-first conditions. Note any outlier briefs — high or low confidence — that may reveal something about the brief construction or the interface design | Project lead | Confidence score summary: per-brief, per-steward, per-condition | All feedback forms submitted |
| Analyse the most useful element data: calculate what percentage of briefs had verbatim language selected as the most useful element, across all stewards and both conditions. Compare this against the success criterion (at least 7 of 10 briefs) | Project lead | Most useful element distribution: verbatim language vs trajectory data vs model score vs other | Feedback forms submitted |
| Calculate the comparison condition effect: for each brief where the steward reviewed both formats, note whether the decision changed. Record how many decisions changed, in which direction, and whether the steward's confidence increased or decreased when switching to score-first. This is the primary test of the verbatim-first architectural commitment | Project lead | Format comparison table: brief-by-brief decision and confidence delta between verbatim-first and score-first conditions | Both format conditions completed per steward |
| Calculate inter-steward agreement: for each brief, note whether different stewards made the same decision. High agreement on concluded-departure briefs and active-complaint briefs (the clearly differentiated cases) is evidence that the interface communicates the signal clearly. Low agreement on borderline cases is expected and acceptable | Project lead | Agreement matrix: per-brief inter-steward agreement rate | All sessions complete |
| Assemble a metrics dashboard: a single summary document presenting all quantitative outputs in one place, ready for the evidence report in 4.3 (an aggregation sketch follows this table) | Project lead | One-page metrics summary with all key quantitative findings | All calculations above complete |
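A sketch of how the Phase 4.1 aggregation could run over flattened feedback records (one dict per brief per condition; the field names are illustrative, mirroring the feedback form sketched in Phase 1):

```python
from collections import defaultdict
from statistics import mean

def summarise(records: list[dict]) -> dict:
    """Aggregate session feedback into the headline Phase 4.1 metrics.
    Each record: {"steward": ..., "brief": ..., "condition": ...,
                  "decision": ..., "confidence": 1-5, "most_useful": ...}"""
    # Mean confidence per condition
    conf = defaultdict(list)
    for r in records:
        conf[r["condition"]].append(r["confidence"])
    mean_confidence = {c: round(mean(v), 2) for c, v in conf.items()}

    # Decisions changed between conditions, per steward and brief
    decisions = {(r["steward"], r["brief"], r["condition"]): r["decision"] for r in records}
    changed = sum(
        1 for (s, b, c), d in decisions.items()
        if c == "verbatim_first"
        and decisions.get((s, b, "score_first")) not in (None, d)
    )

    # Inter-steward agreement under the primary (verbatim-first) condition
    per_brief = defaultdict(set)
    for r in records:
        if r["condition"] == "verbatim_first":
            per_brief[r["brief"]].add(r["decision"])
    agreement = sum(1 for ds in per_brief.values() if len(ds) == 1) / max(len(per_brief), 1)

    return {"mean_confidence": mean_confidence,
            "decisions_changed": changed,
            "inter_steward_agreement": round(agreement, 2)}
```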
Qualitative analysis of the debrief notes and free-text feedback. The most important data from this prototype may be what the stewards said, not what they scored.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Aggregate the quantitative data and organise the free-text observations by theme: what was useful, what was confusing, what was missing, what surprised the steward | Project lead | Aggregated feedback report: quantitative summary plus thematic analysis of qualitative observations | All review sessions and debriefs completed |
| Identify the steward's experience of the prototype at a personal level: did they find it engaging or tedious? Did the simulated data feel realistic enough to evaluate the interface honestly? Would they use this tool if the data were real? These questions matter because the prototype's value proposition depends on the steward wanting to use it, not just being able to | Project lead (via debrief notes and follow-up if needed) | Steward experience summary capturing adoption-relevant qualitative data | Debrief notes from Phase 3 |
| Capture the surveillance reaction: does the steward have concerns about the ethics of monitoring community language that subscribers wrote to each other, not to the brand? If so, record the nature and intensity of the concern. This is critical input for the Cultural Enablement plan | Project lead | Ethics concern log per steward, feeding into the Cultural Enablement plan | Debrief notes and resistance signal log from Phase 3 |
Proactively analysing the collected data and feedback to identify both issues and potential areas for growth.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Synthesise the metrics and feedback to answer the three learning questions from Phase 1 (1.4): (1) Does the verbatim-first hierarchy change decisions? (2) Can stewards decide confidently? (3) Is verbatim signal the most useful element? Present findings as evidence statements, not opinions | Project lead | Evidence report answering the three learning questions with data | Metrics dashboard and aggregated feedback report from 4.1 and 4.2 |
| Conduct root cause analysis on any problems identified: if steward confidence was low on specific briefs, why? If inter-steward agreement was poor, was it the signal, the interface, or the brief construction? If the comparison condition showed no difference, what does that imply about the information hierarchy hypothesis? | Project lead | Root cause analysis per identified problem, distinguishing between prototype issues (fixable) and conceptual issues (requiring rethink) | Evidence report (above) |
| Spot opportunities: did the stewards suggest enhancements, raise use cases the project had not considered, or identify elements of the brief that should be added or removed? Did the exercise reveal anything about how stewards currently make relationship decisions that the project should incorporate? | Project lead | Opportunity log capturing steward-originated suggestions and unexpected findings | Debrief notes and free-text feedback |
Based on the evidence gathered, the project determines what happens next and returns to the start of the double diamond for the next iteration. This is where the Design Thinking loop closes and reopens.
| Activity | Owner | Produces | Dependencies |
|---|---|---|---|
| Make the iteration decision based on the evidence. Three possible paths: (1) Success — the interface works and stewards value verbatim signal; proceed to Approach A (Retrospective Validation) to test whether the signal exists in real data. (2) Partial success — stewards value the concept but the interface needs redesign; iterate on the prototype before proceeding. (3) Failure — stewards do not find the approach useful; revisit the foundational assumptions before investing further | Project lead (decision owner) | Documented iteration decision with evidence-based rationale, explicitly stating which path and why | Evidence report and root cause analysis from 4.3 |
| Produce the iteration brief for the next cycle: what was learned, what assumption is tested next, what changes carry forward, what the entry conditions are for the next iteration (including, if Approach A is next, the DPIA as a hard gate) | Project lead | Iteration brief — a one-page document that serves as the starting context for the next double diamond cycle | Iteration decision (above); DPIA dependency note from Phase 2 (2.3) |
| Update the financial case: based on the evidence, refine (or maintain) the projected value. If the steward review was successful, the financial case for the next iteration is: the cost of running Approach A (retrospective analysis of historical data) against the potential to preserve approximately £478,000 in annual revenue through a 5-percentage-point churn reduction in the high-LTV cohort (a worked reconstruction of the headline figures follows this table) | Project lead | Updated financial case note for the iteration brief | Evidence report; financial context from project instructions |
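For the record, the headline figures reconstruct directly from the project's metric table (cohort ~24,000, churn 24%, CAC ~£398). The sketch assumes that 'revenue preserved per 5pp reduction' denotes the avoided replacement cost of the ~1,200 retained Angels, which is the derivation consistent with the stated ~£478,000:

```python
cohort = 24_000       # estimated UK high-LTV Angel cohort
churn = 0.24          # annual churn rate, highest-LTV cohort
cac = 398             # replacement cost per churned Angel (GBP)

churned_per_year = cohort * churn             # 5,760 Angels lost per year
replacement_burden = churned_per_year * cac   # 2,292,480 -> the ~GBP 2.3m figure

retained_per_5pp = cohort * 0.05              # 1,200 Angels kept per 5pp reduction
preserved = retained_per_5pp * cac            # 477,600 -> the ~GBP 478k figure

print(f"Annual replacement burden: £{replacement_burden:,.0f}")
print(f"Preserved per 5pp churn reduction: £{preserved:,.0f}")
```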
Three genuinely different outcomes — not optimistic, moderate, and pessimistic versions of the same trajectory. Each variation traces a structurally different way the plan could play out, testing a different failure or learning mode. Scenario selection was delegated to the LLM; Tim's HITL role was to assess whether the chosen scenarios represented genuinely different failure modes rather than variations of the same scenario.
| Variation | What Goes Right | What Goes Wrong | Core Learning |
|---|---|---|---|
| Variation 1 | The HITL design and interface | Proxy language fidelity — the known limitation becomes the blocking finding | Interface validated; data source insufficient; DPIA becomes the critical path |
| Variation 2 | Prototype build and session execution | The foundational design principle — stewards prefer the format the project was designed to replace | Verbatim-first hierarchy challenged; redesign loop before proceeding; commercial differentiation at risk |
| Variation 3 | The concept broadly works as designed | Nothing breaks — the scope shifts upward | Stewards identify a more valuable, earlier signal; scope expands; ethical stakes rise |
Phases 1–2 are the common trunk: identical build and preparation across all three variations. Phase 3 (the steward review sessions) is where the paths diverge. All three paths return to the start of the double diamond — but with different questions, different entry conditions, and different timelines.
Phases 1 and 2 proceed as planned. The ten subscriber briefs are constructed from public proxy language (Reddit, Trustpilot, app store reviews of DTC subscription services) mapped onto fictional subscriber profiles with realistic CRM data. The brief archetypes cover the full signal spectrum: three concluded departure, two active complaint, three satisfied engagement, and two borderline cases. The interactive HTML prototype is built in Week 1. Both verbatim-first and score-first variants are tested and polished. The steward briefing document is prepared. The compliance position is confirmed: no real subscriber data, no DPIA dependency. Two relationship stewards are recruited and scheduled for individual one-hour sessions in Week 2. Nothing in Phases 1 or 2 signals a problem. The build is clean, the documentation is complete, and the stewards are briefed without issue.
The first steward completes the verbatim-first review of all ten briefs. Her quantitative feedback is strong: confidence scores average 4.2 out of 5 across all ten briefs. She identifies verbatim community language as the most useful element in 8 of 10 briefs. Her decisions on the concluded-departure briefs are correct and confident. Inter-format comparison shows she changes her decision on 2 of 10 briefs when switching to score-first, both times toward less nuanced choices. On the quantitative metrics, this is a pass.
However, in the debrief, she raises a concern the project had anticipated as a known limitation but not as a session-stopping objection. She observes that the forum language in the briefs does not sound like Naked Wines Angels. The vocabulary, the emotional register, and the relational context are wrong. Reddit posts about generic subscription boxes use different idioms, different levels of emotional investment, and a different assumed audience than the Naked Wines community forum. She puts it directly: the concluded-departure briefs are convincing as a concept, but she cannot evaluate whether this interface would work on real data because the proxy language does not feel real.
The second steward independently raises the same concern, though with a different emphasis. He finds the active complaint briefs particularly unconvincing — the proxy sources express frustration differently from how Angels in the Naked Wines community express frustration. His confidence scores are lower (average 3.6) and he attributes the lower confidence specifically to doubting the language, not the interface design.
The evidence report answers the three learning questions with mixed results. Does the verbatim-first hierarchy change decisions? Yes — the comparison condition shows a measurable difference. Can stewards decide confidently? Conditionally — Steward 1 yes, Steward 2 only when he brackets the language concern. Is verbatim signal the most useful element? Yes, when it is believed to be authentic; the value proposition collapses when the language feels constructed. The root cause analysis distinguishes between a prototype issue and a conceptual issue. The interface design is not the problem. The information hierarchy works. The brief format supports confident decision-making. The problem is that Approach C, by design, cannot answer the question the stewards need answered: does this language actually exist in the Naked Wines community?
The project team confirms the planned iteration pathway: proceed to Approach A (Retrospective Validation). The interface has been validated to the degree that simulated data allows. The next iteration must test whether the signal exists in real historical data from churned Angels' actual forum posting histories. This triggers the DPIA dependency. The iteration brief for Cycle 2 carries forward the validated interface design (verbatim-first hierarchy confirmed as superior to score-first) and adds a new entry condition: real community language from the actual Naked Wines forum must be used, which requires data access approval and DPIA sequencing.
Direct cost of this iteration: nil incremental (steward time only). The financial case for the next iteration is unchanged: the potential to preserve approximately £478,000 in annual revenue through a 5-percentage-point churn reduction remains the target. The cost of the next iteration increases because Approach A requires DPIA completion (estimated 8–16 weeks of DPO and legal resource) and data engineering to link forum posting histories to CRM records for churned Angels. The project team estimates this at 2–3 months of elapsed time and a modest internal resource cost, weighed against the £2.3m annual replacement burden. The ratio is favourable, but the timeline extends.
| What Feeds Back into the Double Diamond | |
|---|---|
| Validated | Verbatim-first information hierarchy produces different (and, by steward judgement, better) decisions than score-first. Interface design and steward workflow are fit for purpose. |
| Invalidated | Public proxy language is not a sufficient stand-in for real community data, even for interface testing. This limitation was anticipated but is now confirmed as a hard constraint. |
| New entry condition | DPIA sequencing becomes the critical path item. The project cannot progress without access to real subscriber data. |
| Reframed question | The next iteration no longer asks 'Does the interface work?' It asks 'Does the signal exist in the actual data, and does it look like what we simulated?' |
Phases 1 and 2 proceed identically to Variation 1. The prototype is built, polished, documented, and ready. The same ten briefs, the same two stewards, the same protocol. Nothing in the preparation phases distinguishes this variation from the first.
The first steward completes the verbatim-first review. Her decision confidence is adequate but not strong: average 3.4 out of 5 across all ten briefs. She correctly identifies the three concluded-departure briefs but notes that reading the verbatim language took longer than she expected and that she found herself scrolling past the forum posts to reach the trajectory data and the scores. In her feedback, she selects 'trajectory data' as the most useful element in 5 of 10 briefs and 'verbatim language' in only 3.
When she switches to the score-first format for the comparison condition, her confidence rises to 4.1 out of 5 and her decision speed improves noticeably. She makes the same decisions on 8 of 10 briefs, but with greater confidence and in roughly half the time. In the debrief, she explicitly states that she finds the score-first format more useful: 'I know what I'm looking at immediately. The verbatim stuff is interesting, but it's a lot to read before I can make a decision.'
The second steward produces a similar pattern. His verbatim-first confidence averages 3.6, rising to 4.3 under score-first. He changes his decision on 3 of 10 briefs when switching formats — in two cases, the score-first format makes him more confident about a concluded-departure brief; in one case, it makes him less nuanced about a borderline case. His overall preference is clearly for the score-first format.
The evidence report answers the three learning questions unfavourably. Does the verbatim-first hierarchy change decisions? Yes — but in the opposite direction from the hypothesis: stewards perform better under score-first conditions. Can stewards decide confidently? Yes, under score-first conditions. Is verbatim signal the most useful element? No — trajectory data and model scores rank above verbatim language for both stewards.
The root cause analysis must distinguish between two possible explanations. First, the interface presentation of verbatim language may be the problem — the language is too long, too unstructured, or presented in a format that creates cognitive load rather than clarity. Second, the verbatim-first principle itself may be wrong — stewards genuinely process information better when they have a quantitative anchor first, and the qualitative language functions as contextual support rather than primary signal. These two explanations require different responses.
The project team chooses not to abandon the verbatim-first principle on the basis of a two-person test with simulated data. The stewards' preference for scores may itself be evidence of the managed-metric problem: if the operational culture is conditioned to trust numbers over language, then a preference for score-first is predictable but not necessarily correct. However, the team also acknowledges that if the redesigned verbatim-first format still loses to score-first in the next iteration, the hypothesis must be revised. The redesign targets three specific changes: shorter, curated verbatim extracts (two to three sentences, not full posts); a one-line plain-language signal summary above the verbatim text; and a clearer visual hierarchy that reduces the cognitive load of unstructured text. The redesigned prototype is tested in a second iteration before proceeding to Approach A.
The immediate financial impact is a delay — the planned progression from Approach C to Approach A is paused while the interface is redesigned and retested. This adds approximately 2–4 weeks to the overall timeline. Direct cost remains minimal. There is a second-order financial implication: if the eventual finding is that the verbatim-first hypothesis is wrong and stewards genuinely perform better with a score-first format, the project's differentiation from existing customer health scoring platforms (Gainsight, ChurnZero) weakens significantly. The commercial case for the Quiet Signal System rests partly on it doing something structurally different from standard churn prediction. If the HITL design converges toward a conventional dashboard with a score at the top, the question becomes whether the underlying signal (community language analysis) is different enough to justify the investment, even if the interface is conventional.
| What Feeds Back into the Double Diamond | |
|---|---|
| Validated | The comparison condition works as a testing instrument. Stewards can evaluate both formats and articulate clear preferences. The session protocol and feedback capture method are robust. |
| Challenged | The verbatim-first information hierarchy — the project's core HITL design principle. Not invalidated (the test conditions are too limited), but the evidence from this iteration does not support it. |
| New design question | Is the problem the format of the verbatim presentation (fixable) or the principle of verbatim-first itself (fundamental)? The next iteration must disambiguate. |
| Commercial risk | If the verbatim-first principle is ultimately wrong, the Quiet Signal System converges toward conventional health scoring. The signal source (community language) may still differentiate, but the architectural commitment to verbatim-first is what makes the system genuinely novel. |
Phases 1 and 2 proceed identically to the prior variations. The prototype is built, polished, documented, and ready. The same ten briefs, the same two stewards, the same protocol. The preparation is clean and complete.
Both stewards perform well on the concluded-departure briefs. Steward 1's confidence averages 4.4 across the concluded-departure and active-complaint briefs — high and consistent. She identifies verbatim language as the most useful element in 7 of 10 briefs. The comparison condition shows she changes her decision on 2 of 10 briefs when switching to score-first, both times toward less nuanced choices. On the defined success criteria, this is a clear pass.
However, the most interesting data comes from the borderline briefs — the two cases where the subscriber is drifting but has not yet concluded. Both stewards spend significantly more time on these briefs, re-reading the verbatim language multiple times and expressing uncertainty. Steward 1 says: 'This one is the hardest. By the time the language says they're finished, it's already too late. I want to see the ones who are starting to drift.' Steward 2 makes the same observation independently, framing it as a workflow question: 'If this is what they sound like when they're leaving, what do they sound like six months before that? That's when I could actually do something.'
Both stewards explicitly request an earlier intervention point. Neither is dissatisfied with the current prototype; both are articulating a more ambitious use case that the current MVP does not address. The steward feedback is not a critique of what the system does — it is a specification of what they want it to do instead.
The evidence report answers the three learning questions positively. Does the verbatim-first hierarchy change decisions? Yes, in the expected direction. Can stewards decide confidently? Yes, on clearly differentiated briefs. Is verbatim signal the most useful element? Yes, with high consistency. The prototype has succeeded. The more significant finding is the steward's articulation of a more ambitious requirement: earlier detection of the relational shift that precedes conclusion, not just detection of conclusion itself.
The root cause analysis of the borderline brief difficulty reveals that the Community Vocabulary Shift component — designed to detect the 'we' to 'they' shift — is targeting the wrong point in the departure trajectory. By the time the vocabulary shift is fully expressed in the concluded-departure pattern, the subscriber has already decided. The stewards are identifying the stage where the shift is beginning but has not yet settled — a harder NLP problem but a higher-value intervention window.
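As a deliberately naive illustration of why the earlier signal is harder (this is not the actual Community Vocabulary Shift component), a pronoun-share proxy makes the trade-off visible: a lower detection threshold catches the drift earlier but inflates the false-positive rate.

```python
import re

WE = re.compile(r"\b(we|us|our|ours)\b", re.IGNORECASE)
THEY = re.compile(r"\b(they|them|their|theirs)\b", re.IGNORECASE)

def we_share(posts: list[str]) -> float:
    """Share of first-person-plural pronouns among all we/they pronouns in a
    window of forum posts. A falling share across successive windows is a
    crude proxy for the 'we' to 'they' shift."""
    text = " ".join(posts)
    we, they = len(WE.findall(text)), len(THEY.findall(text))
    return we / (we + they) if (we + they) else 0.5  # neutral when no pronouns

def shift_emerging(early: list[str], recent: list[str], drop: float = 0.15) -> bool:
    """Flag a subscriber whose we-share has fallen by more than `drop`.
    Lowering `drop` widens the intervention window but raises false positives."""
    return we_share(early) - we_share(recent) > drop
```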
The project team accepts the scope expansion. The prototype validated that the interface works and the steward is willing to use verbatim community language as a decision input. The next iteration expands the signal detection target to include the pre-conclusion drift stage — a subscriber whose language is changing but who has not yet concluded. This requires the Community Vocabulary Shift component to detect an emerging pattern rather than a settled one, which is a harder NLP problem with a higher false-positive rate. The next iteration (Approach A, with real historical data) must therefore test not only whether the conclusion signal exists in the data, but whether the pre-conclusion shift is detectable and distinguishable from normal linguistic variation. A fourth decision option ('Early Conversation') is added to the steward's workflow, requiring a redesign of the decision interface.
The financial implications of this variation are the most favourable of the three. The steward feedback suggests that catching subscribers at the earlier stage of relational shift, rather than at conclusion, would increase the effective intervention window and therefore the probability of retention. If the system can detect pre-conclusion shift and enable a steward conversation that prevents the shift from reaching conclusion, the per-subscriber save rate improves. The current financial model assumes a 5-percentage-point reduction in churn, preserving approximately £478,000 in annual revenue. This assumes intervention at the conclusion stage, where the steward is essentially attempting to reverse a decision the subscriber has already made. If intervention occurs at the pre-conclusion stage, the save rate could plausibly be higher because the subscriber has not yet decided. The financial model does not need to change at this stage, but the next iteration should include a mechanism to estimate the differential save rate between early-stage and late-stage intervention. The offsetting risk is scope expansion: detecting early vocabulary shift is a harder NLP problem, with a higher false-positive rate and greater steward cognitive load.
| What Feeds Back into the Double Diamond | |
|---|---|
| Validated | Verbatim-first interface design, steward workflow, comparison condition protocol, and the fundamental proposition that community language contains a signal invisible to NPS. |
| Reframed | The primary target signal shifts from concluded departure to pre-conclusion vocabulary shift. The system's value proposition moves from 'catch them leaving' to 'catch them thinking about leaving.' |
| New design requirement | A fourth steward decision option ('Early Conversation') and a redesigned brief format that presents shift-in-progress language differently from concluded-departure language. |
| New technical question | Can the Community Vocabulary Shift component detect the earlier signal reliably, and what is the false-positive rate? This becomes the primary question for the Approach A retrospective. |
| Ethical implication | Earlier detection of relational shift intensifies the surveillance concern. The steward is now reading language that expresses emerging doubt, not settled conclusion — a more intimate form of monitoring that raises the legitimacy standard. |
| Dimension | V1: Source Rejected | V2: Score Wins | V3: Earlier Signal |
|---|---|---|---|
| What breaks | Proxy language fidelity — the known limitation materialises | Verbatim-first hierarchy — the core design principle | Nothing breaks — scope shifts upward |
| Interface validated? | Yes | Challenged — needs redesign | Yes — strongly |
| Next iteration | Approach A with DPIA as hard gate | Redesign loop (2–4 weeks), then DPIA + Approach A | Approach A with expanded scope (earlier signal detection) |
| Financial case | Unchanged — £478k per 5pp still the target | At risk if score-first prevails — differentiation weakens | Potentially stronger — earlier intervention improves save rate |
| Biggest risk forward | Signal doesn't exist in real data | Verbatim-first hypothesis is fundamentally wrong | Early signal too ambiguous to detect reliably at scale |
| Timeline impact | 2–3 months (DPIA sequencing) | 2–4 weeks (redesign) then 2–3 months (DPIA) | 2–3 months (DPIA + expanded data requirement) |
The Quiet Signal System does not ask Naked Wines UK to adopt a new technology. It asks the organisation to accept that its most trusted measure of customer health is actively obscuring the departure of its highest-value subscribers. Every cultural change that follows descends from this single, uncomfortable premise. This document identifies what the organisation must believe, value, and tolerate differently — and then applies the course material change management framework to the cultural changes identified.
NPS is embedded in how the organisation talks to itself about whether it is doing well. The HY26 interim results cite NPS 76 as evidence of customer health. A 24% annual churn rate in the highest-LTV cohort coexists with this score without apparent contradiction in the current reporting framework. The cultural change required is not supplementing NPS with an additional metric — it is accepting that NPS is a managed artefact that functions as an institutional comfort mechanism rather than a diagnostic instrument.
Survey response rates in most DTC contexts run at roughly 4.5%. NPS therefore captures the sentiment of the 4.5% willing to respond in a managed, brand-prompted context — not the 95.5% who stay silent, and certainly not the Angels who are departing quietly. The metric conflates willingness to respond with satisfaction. A subscriber who has already decided to leave does not complete the survey. NPS is structurally blind to exactly the departure it is trusted to detect.
This shift is not asking the organisation to abandon NPS — it is asking the organisation to understand what NPS actually measures, report it accordingly, and stop using it as the primary evidence of subscriber health when evidence from unmanaged peer-to-peer data says something materially different.
A retention agent works from a triggered event and applies a scripted response. The trigger is a cancellation attempt or a churn risk flag; the response is a save offer calibrated to the subscriber's value band. Success is measured by save rate: how many subscribers who were going to leave were persuaded not to. This model is optimised for the subscriber who is ambivalent — willing to stay if offered the right incentive.
A relationship steward works from an ambient signal and exercises qualitative judgement. The input is verbatim community language that the subscriber wrote to other subscribers, not to the brand. The decision is not 'what save offer should I make?' but 'is this subscriber experiencing a recoverable problem or an irreversible conclusion — and if the latter, how do I ensure the departure is dignified?' A steward who recommends against intervention — because the departure is concluded and the subscriber deserves to be respected rather than intercepted — has succeeded, not failed.
This is a genuinely different role. It is not the retention role with better tools. It requires different skills (interpretive literacy, qualitative judgement, comfort with ambiguity), different authority (the steward's decision must be respected even when it means accepting a departure), and different performance metrics (intervention quality and subscriber experience of departure, not save rate).
Cancellation is currently a terminal event. It closes the CRM record, removes the subscriber from the Angel base, and triggers the acquisition cost to replace the revenue. There is no designed experience of departure — only its prevention and, failing that, its administrative processing. The subscriber who leaves is no longer a customer; she is a churn statistic and an acquisition target.
The Alumni Network concept treats departure as a transition to be managed with the same care as onboarding. An Angel who leaves after seven years of pre-funded investment and community contribution is not a failed retention — she is a relationship in a different state. She knows the makers she has supported. She has co-owned the brand story. Her departure, if respected, preserves the relational equity that makes return possible. Her departure, if handled as a terminal event with a save script, destroys it.
The specific cultural change required is accepting that a subscriber who is well-served in her departure — who feels that the organisation saw her, valued her tenure, and respected her decision — is more likely to return and more likely to recommend the brand than one who was pressured to stay and eventually left anyway. The short-term revenue loss of letting a concluded subscriber leave well is real; the long-term relationship value is the argument for absorbing it.
The primary input to the steward's decision is a block of verbatim text — the subscriber's actual words in a forum post written to other subscribers, not to the brand. This language is ambiguous, context-dependent, and resistant to aggregation. It cannot be reduced to a number without destroying the signal it carries. It surfaces things the organisation may not want to hear: frustration with decisions the leadership made, disillusionment with changes to the community, a sense that the brand has drifted from its original commitment to independent makers.
The managed metric exists partly because it filters this discomfort. NPS of 76 is easy to work with. A community post that reads 'I used to recommend Naked Wines to everyone I knew; now I'm embarrassed to' is not. The cultural change required is not just tolerating this input — it is building the operational infrastructure to act on it: the steward review process, the verbatim-first interface design, the qualitative feedback loop from steward decisions back into the signal system.
Organisations that are accustomed to threshold-based decision rules (flag at NPS below X; escalate at churn probability above Y) will find the ambiguity of qualitative decision-making genuinely uncomfortable. This is not a communication problem or a training problem — it is a cultural problem, and it is the deepest institutional change the Quiet Signal System requires.
The Quiet Signal System monitors community language that subscribers wrote in the context of speaking to each other, not to the brand. The obstacle analysis in the Research tab names this directly: there is a version of this system that subscribers would experience as surveillance if they learned about it. The legitimacy of the system depends on whether Naked Wines can honestly say to an Angel: 'We read what you wrote in the community forum and used it to understand how you were feeling about us.' If that sentence sounds creepy rather than caring, the system has a legitimacy problem that no DPIA can resolve.
This is a cultural challenge, not a compliance challenge. The DPIA addresses the legal basis for processing. Legitimacy addresses whether the organisation has the relational standing to use the data in the way the system proposes. Earning that standing requires Naked Wines to build reciprocal transparency: if the company reads what Angels write to each other, the company must be equally willing to say what it knows and what it has decided to do about it. The Honest Annual Report and the Tenure Council exist in the broader concept specifically to create the institutional architecture that makes this reciprocity credible.
The longer-term cultural shift is from passive data extraction to active relational contract. The Disagreement Forum takes this further: creating a space where Angels can disagree with the company publicly, and the company commits to responding substantively. These are not features. They are cultural infrastructure that earns the legitimacy the Quiet Signal System requires.
The six framework elements applied to the five cultural changes identified above. Each is applied to the specific context of the Quiet Signal System at Naked Wines UK, not reproduced generically. Sequenced across three horizons: MVP phase (concrete), post-MVP validation (directional), and institutional maturity (conditional on MVP evidence).
The cultural changes required by this project affect different stakeholders in different ways. The following analysis identifies each stakeholder group, their relationship to the cultural shifts, and the nature of their interest — supportive, threatened, or conditional.
| Stakeholder Group | Primary Cultural Shift Affected | Nature of Interest | Engagement Priority |
|---|---|---|---|
| Board / CFO | Metric-Trust Shift | Conditional. NPS is reported to investors. Accepting that it obscures churn requires an alternative narrative for the market. The financial case (£478k revenue preserved per 5pp churn reduction) provides the bridge — but the board must first accept that the anomaly (NPS 76 + 24% churn in highest-LTV cohort) is an anomaly worth investigating. | Critical — gatekeeper for the entire programme |
| Retention Team Lead | Retention to Stewardship | Threatened. The shift from save-rate optimisation to relational stewardship redefines success for the function they manage. A steward who recommends against intervention is succeeding — but the team lead's current KPIs do not reflect this. Role and identity are both at stake. | High — operational owner of the steward role |
| Relationship Stewards | Tolerance for Uncomfortable Signal | Directly affected. They are the users of the system and the people who must develop qualitative interpretive skill. MVP steward feedback from the prototype sessions is the primary evidence for how this group responds to the new role requirement. | High — system users and primary feedback source |
| Data and Engineering | Metric-Trust Shift | Conditional. They built the systems that produce and report NPS. The project implicitly says their primary output is misleading. They also own the data infrastructure required to join forum data with CRM records for Approach A. | Medium — enablers, not decision-makers at MVP stage |
| Community Team | Surveillance-to-Legitimacy Boundary | Supportive but cautious. They manage the forum and understand community dynamics. They are likely to recognise the signal's validity but may have the strongest reservations about whether AI monitoring of peer-to-peer language crosses a relational boundary. Custodians of the data source and the community relationship. | High — must be co-designers of the legitimacy framework |
| Angel Subscribers (indirect) | Surveillance-to-Legitimacy; Departure as Relationship State | Not engaged directly at MVP stage. Their interests are represented through the HITL architecture (no AI output reaches them without steward review) and the legitimacy framework. They become direct stakeholders at the institutional maturity phase through the Tenure Council. | Deferred — addressed through system design, not direct engagement, at MVP stage |
The cultural changes create operational, identity, and institutional impacts across the organisation. The following table maps each shift to its impact dimensions.
| Cultural Shift | Operational Impact | Identity Impact | Institutional Impact |
|---|---|---|---|
| Metric-Trust Shift | NPS repositioned as lagging verification rather than leading indicator; Forum Divergence Score becomes the primary signal for high-LTV cohort health. Reporting cadence and dashboards require redesign. | Teams that have built competence around NPS interpretation must accept that their headline metric is insufficient in a specific and material way. This is an identity challenge, not just a process change. | Investor communications require a new narrative. The HY26 report cites NPS 76 as evidence of customer health; the next report must either maintain this framing or begin transitioning the investor narrative. |
| Retention to Stewardship | KPIs shift from save rate to intervention quality. The measurement system for the retention function requires redesign before the new role can be fairly assessed. | Retention agents face genuine role redefinition. The new role requires qualitative judgement and relational literacy — different skills, different aptitudes. Not all current retention staff will be suited to the stewardship model. | Headcount may not change, but the hiring profile and training investment shift materially. The cost of a steward is higher than the cost of a retention agent, offset by the value of better decisions. |
| Departure as Relationship State | Cancellation workflows add a steward review step and an Alumni Network pathway. The subscriber's departure experience is designed rather than defaulted. | The organisation stops treating cancellation as failure. This is emotionally significant for teams whose performance is measured by churn prevention. | Churn numbers may rise in the short term as coerced retentions are released. The financial case depends on demonstrating that Alumni return rates and reduced replacement costs exceed the short-term revenue loss. |
| Tolerance for Uncomfortable Signal | Decision-making workflows incorporate verbatim community language as a primary input. Dashboards surface qualitative data alongside quantitative scores. | Teams accustomed to threshold-based decision rules must develop comfort with ambiguity. The steward reads language and exercises judgement, not a decision tree. | The organisation's decision-making culture shifts from certainty-seeking to evidence-tolerant. This is the deepest institutional change and the slowest to embed. |
| Surveillance-to-Legitimacy | A transparency and reciprocity framework must be developed alongside the signal system. Forum terms of service require updating. Disclosure practices change. | The organisation must see itself as accountable to the community it monitors, not merely as a processor of community data. This is a shift from data-as-resource to data-as-relationship. | Legal, compliance, and community functions must coordinate on a legitimacy framework that exceeds DPIA requirements. The standard is not 'lawful' but 'honest.' |
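The Forum Divergence Score named in the first row has no fixed definition at this stage. The sketch below is one candidate formulation, offered only to show the shape of the metric: the gap between the health the managed metric implies and the health the forum language shows. The normalisation, the 'concluded language' classification, and the example figures are all assumptions.

```python
# One possible formulation of the Forum Divergence Score, sketched for
# illustration only; the metric's actual definition is a design decision
# for the prototype, and every figure here is a placeholder assumption.

def forum_divergence_score(nps: float, concluded_share: float) -> float:
    """Gap between what the managed metric implies and what forum language shows.

    nps:             reported NPS for the cohort, -100..100.
    concluded_share: share of the cohort's recent forum posts classified as
                     'concluded departure' language (0..1), however that
                     classification is eventually implemented.
    """
    # Rescale NPS to an implied 0..1 'health' figure.
    implied_health = (nps + 100) / 200
    # Forum-observed health is the complement of concluded language.
    observed_health = 1 - concluded_share
    # Positive scores mean the managed metric is running ahead of reality.
    return implied_health - observed_health

# The anomaly in this document, expressed through the sketch: NPS 76 alongside
# a hypothetical 24% of high-LTV forum language reading as concluded.
print(round(forum_divergence_score(nps=76, concluded_share=0.24), 2))  # 0.12
```

A positive score reads as 'the managed metric is running ahead of what the community is saying', which is precisely the anomaly this document describes.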
The communication challenge at the core of this project is that the founding message — 'our most trusted metric is not measuring what we think it is measuring' — is inherently threatening. If communicated as an accusation ('NPS has been misleading us'), it produces defensiveness. If communicated as a discovery ('we have found a signal that NPS cannot see'), it creates curiosity. The framing is everything.
During the MVP phase, the communication strategy is deliberately narrow. The audience is the steward team and the retention team lead, not the broader organisation. The message is not 'NPS is wrong'; it is 'we are testing a new interface to see if there is useful information in community language that our current tools cannot surface.' This framing is honest — it accurately describes Approach C — and it does not require the steward to accept any of the five cultural shifts before the prototype has produced evidence.
The communication vehicles are the steward briefing document (produced in Phase 2 of the action plan), the session facilitation, and the debrief. Nothing is communicated to the broader organisation during the MVP phase. The evidence is gathered first; the argument is made from the evidence, not in advance of it.
If the MVP validates the interface and the steward finds verbatim signal more useful than the model score, the communication widens. The primary audience becomes the retention team lead and the data team lead. The message becomes: 'We ran a test. Here is what the stewards found. The signal in community language is different from what NPS reports. Here is what that might mean for how we understand subscriber health.' The data from the MVP (steward confidence scores, format comparison results, debrief observations) is the argument. The financial context (£2.3m replacement burden, £478k per 5pp reduction) provides the business case.
The board conversation — the Metric-Trust Shift at its most challenging — does not happen until after the Approach A retrospective provides evidence that the signal is real in actual data. Approach C produces usability evidence; Approach A must produce correlational evidence before the investor narrative can be revisited.
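The shape of that correlational evidence can be sketched in advance even though the data cannot. The fragment below shows the minimal retrospective test, assuming hypothetical monthly cohort readings; the arrays are illustrative placeholders, not findings.

```python
# A sketch of the retrospective test Approach A must pass before the board
# conversation: does the divergence signal, computed on historical forum
# data, correlate with the churn that actually followed? The arrays below
# are illustrative placeholders, not real figures.
import numpy as np

# Hypothetical monthly cohort readings: divergence score at month start,
# and the churn observed in that cohort over the following quarter.
divergence = np.array([0.02, 0.05, 0.11, 0.09, 0.15, 0.21, 0.18, 0.24])
subsequent_churn = np.array([0.04, 0.06, 0.12, 0.10, 0.14, 0.22, 0.17, 0.25])

r = np.corrcoef(divergence, subsequent_churn)[0, 1]
print(f"Pearson r between divergence and subsequent churn: {r:.2f}")

# A correlation alone is not the whole case. The retrospective would also
# need lead-time analysis (does divergence move before churn?) and a check
# against confounders such as seasonality and price changes.
```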
At the institutional maturity phase, the communication extends to Angel subscribers themselves. The Honest Annual Report is the primary vehicle: a document sent exclusively to long-tenure subscribers that acknowledges what went wrong, what was learned, and what remains uncertain. This is the communication act that earns the right to use community language as a signal — the organisation has been transparent about what it hears and what it does with it.
Resistance to this project will be specific, predictable, and legitimate in most cases. The management approach for each pattern is designed to take the resistance seriously rather than dismiss it as a misunderstanding.
| Resistance Pattern | How It Manifests | Management Approach |
|---|---|---|
| NPS Loyalty | 'Our NPS is 76 — the forum is a vocal minority.' This is the most common and superficially plausible objection. It conflates a high aggregate score with evidence of health in the specific cohort at risk. | Present the NPS-churn divergence as a data anomaly warranting investigation, not an accusation. The question is not 'is NPS wrong?' but 'why does NPS 76 coexist with 24% annual churn in the highest-LTV cohort?' Framing it as a puzzle generates curiosity; framing it as a verdict generates defensiveness. |
| Score-First Preference | Stewards prefer the score-first format. This emerges in the MVP prototype comparison condition — Variation 2 of the simulated scenarios. If stewards find the verbatim-first format harder to use, the architectural commitment is challenged from the inside. | Record as evidence about organisational readiness rather than evidence that the principle is wrong. The steward's preference for scores may itself reflect the managed-metric culture the system is designed to change. Redesign the verbatim presentation (curated extracts, plain-language signal summary) before abandoning the principle. If the redesigned format still loses, revise the hypothesis. |
| Surveillance Objection | 'You can't read community posts — that's not what they're for.' This objection may come from the community team, from legal counsel, or from senior leadership concerned about the reputational risk of AI-monitored peer-to-peer language. | Take the objection seriously rather than reframing it as a misunderstanding. The legitimacy question is real. Use it as the entry point for the reciprocal transparency conversation: 'If we cannot honestly tell an Angel that we use their community language to understand the relationship, we should not use it. What would make it honest?' This converts resistance into design input for the legitimacy framework. |
| Role-Identity Threat | Retention agents whose competence is built on save-script execution may resist a role redefinition that devalues their existing skills. This resistance is legitimate — the stewardship role genuinely requires different capabilities, and not all current retention staff will be suited to it. | Acknowledge the skill transition honestly. The stewardship role is not a promotion of the retention role — it is a different job. Offer development pathways for those who can transition and honest conversations for those who cannot. Do not disguise the change as an enhancement of the existing role, because that framing will be seen through and will generate distrust. |
| Signal Absorption | The organisation adopts the Forum Divergence Score as an additional KPI on the existing dashboard, NPS retains primacy, and the verbatim layer is stripped out. This is the most insidious resistance because it looks like adoption — the system is technically implemented but culturally neutralised. | Build the architectural safeguard into the system design: the Concept Development document specifies that verbatim language appears at the top of every subscriber brief, before any score (a sketch of this safeguard follows this table). This is not a presentation preference — it is the design decision that prevents the system from recreating the managed-metric problem it was built to solve. If the organisation removes the verbatim layer, the system has been co-opted. The safeguard is technical; the detection is human. |
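The architectural safeguard in the final row is simple enough to express directly. The sketch below shows one way the verbatim-first ordering could be enforced in code rather than left to dashboard configuration; the field names and rendering are hypothetical.

```python
# A minimal sketch of the safeguard named above: the subscriber brief is
# rendered so that verbatim forum language structurally precedes any score.
# Field names are hypothetical; the point is that the ordering is enforced
# in code, not left to dashboard configuration.
from dataclasses import dataclass

@dataclass
class SubscriberBrief:
    crm_id: str
    verbatim_extracts: list[str]   # curated forum language, steward-facing
    divergence_score: float        # quantitative signal, deliberately second

    def render(self) -> str:
        # Verbatim first, always. If the extracts are missing, the brief
        # refuses to render a score-only view; that absence is the
        # co-option signal the resistance table describes.
        if not self.verbatim_extracts:
            raise ValueError("Brief has no verbatim layer; do not render a score alone.")
        lines = [f"Subscriber {self.crm_id}", "In their own words:"]
        lines += [f"  - {extract}" for extract in self.verbatim_extracts]
        lines.append(f"Divergence score (context, not verdict): {self.divergence_score:.2f}")
        return "\n".join(lines)

brief = SubscriberBrief(
    crm_id="A-10482",
    verbatim_extracts=["I'm not angry. I'm finished."],
    divergence_score=0.12,
)
print(brief.render())
```

The design choice worth noting: the brief refuses to render without a verbatim layer, so stripping that layer breaks the system visibly rather than silently recreating the managed-metric problem.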
The capability gap this project creates is not primarily technical. The dashboard is a relatively straightforward interface. The capability gap is interpretive: the steward must be able to read verbatim community language and make a qualitative judgement about the subscriber's relational state. This is a skill that most organisations do not systematically develop, because most organisations do not use unprocessed qualitative data as an operational input.
At the MVP stage, the action plan (Phase 3.3) specifies lightweight orientation: walk the steward through the interface, explain what each element of the brief contains, demonstrate the decision workflow with a sample brief. The orientation is deliberately not instruction on how to interpret the signals. The MVP's value depends on capturing the steward's unguided response — if they are trained to interpret before they evaluate, the prototype cannot distinguish between the interface's value and the training's influence. The capability is measured, not built, at this stage.
If the MVP validates the interface and the subsequent iteration validates the signal, the steward team requires structured development in what might be called community signal literacy: the ability to read peer-to-peer community language and distinguish between active complaint (which demands operational response), satisfied silence (which requires no intervention), concluded departure (which warrants a relational response), and — as Variation 3 of the simulated scenarios reveals — the shifting language that precedes conclusion (which represents the highest-value intervention window). This capability is built through supervised practice with real examples, not through classroom instruction. The training model is closer to clinical supervision than to a learning module: the steward reviews flagged subscriber briefs, makes a decision, and discusses the reasoning with a senior steward or the project lead. The quality of the decision improves through calibrated repetition, not through rule-following.
At the institutional maturity phase, the capability extends beyond the steward team. The Tenure Council requires facilitation skills for genuine subscriber deliberation. The Disagreement Forum requires moderation skills that preserve dissent rather than resolving it. The Honest Annual Report requires editorial judgement about what to disclose and how to frame failure honestly. These are all expressions of the same underlying organisational capability: working with qualitative, ambiguous, emotionally textured information as a basis for operational decisions. This is a multi-year capability development programme, not a training course.
The Concept Development document identifies senior leadership sponsorship as Condition 5 for the system's feasibility. The Tenure Council and Disagreement Forum are deliberately sequenced as Ambitious-horizon initiatives requiring a separate governance conversation. They are not on the critical path for the first sprint, but they are present in the concept as the longer-term institutional architecture the Quick Wins are designed to support. Their inclusion signals the cultural commitment required.
Leadership sponsorship for this project must be visible, specific, and sustained. It is not sufficient for a senior leader to approve the project. They must be willing to do three things, each uncomfortable in its own right.
A senior leader must be willing to say, in a context that matters (a board meeting, an all-hands, an investor communication), that NPS is not measuring what the organisation believed it was measuring. This is not the same as saying NPS is wrong — it is saying that NPS is incomplete in a specific, material way that has financial consequences. The £2.3m annual replacement burden is the evidence. The 24% high-LTV churn rate alongside a 76 NPS is the anomaly. The leader's role is to name it — not as a failure, but as a discovery that creates an opportunity. Without this act of naming, every other cultural change is building on an unstated foundation, and the resistance to that foundation will eventually surface in ways that are harder to manage.
The stewardship model only works if the steward's decision is respected. If a steward recommends against intervention and is overruled by a manager who wants to hit a save-rate target, the system has failed at the human level regardless of what the algorithm produces. Leadership must explicitly protect the steward's judgement authority against operational pressure to maximise short-term retention numbers. This means accepting that the save rate will decline in the short term as concluded departures are respected rather than intercepted. It means saying publicly — in a context that the steward team can hear — that a steward who recommends against intervention and proves to be correct has done excellent work. Without this explicit protection, the stewardship role will revert to retention behaviour under pressure.
The legitimacy of monitoring community language depends on the organisation being equally transparent about what it learns and what it does with that learning. A senior leader must be willing to sponsor the Honest Annual Report — a document that tells long-tenure Angels what went wrong, what the company learned, and what it remains uncertain about. This is the institutional act that earns the right to listen. It is uncomfortable because it requires the organisation to say, in writing, to its highest-value subscribers, that it has made mistakes and that it is still learning. But without it, the Quiet Signal System is technically functional and relationally illegitimate. The technical function extracts signal from a community that the organisation has not yet earned the right to monitor at this depth. The reciprocal transparency is what converts extraction into relationship.