FamePilot: Aspect-Based Review Sentiment at Scale

Client: FamePilot
Period: 2020
Outcome: aspect-level sentiment extraction with LLM-generated review reports across customer feedback at scale
Stack: spaCy, Relation Classifier, Prompt Engineering, Python, NLP

The problem

FamePilot helps brands manage their reputation across review platforms. Reviews arrive by the thousands every day across Google, Yelp, app stores, and industry-specific platforms. The product needed two things from this flood of unstructured text.

First, aspect-level sentiment. "The food was great but the service was slow" is not a single sentiment. It is positive about food and negative about service. A document-level sentiment score loses the entire signal a brand needs to act on.

Second, scale. Per-review LLM calls were too expensive given the volume, and the cost would have grown linearly with the customer base. The system had to handle real production load on a budget that did not break the unit economics.

What I built

Aspect extraction with spaCy. I trained custom spaCy NER pipelines to extract domain-specific aspects ("delivery time", "battery life", "checkout process", "wait staff") from review text. spaCy gave us fast inference, deterministic output, and easy retraining as new domains came online.

Opinion extraction. A parallel extractor pulled out opinion spans, the phrases that carry sentiment ("was terrible", "loved it", "took forever"). Aspects without opinions are just nouns. Opinions without aspects are floating polarity. The combination is what makes ABSA actually work.
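To make the output of the two extractors concrete, here is a minimal sketch of the span format they might produce. It is a lexicon-lookup stand-in in plain Python, not the trained spaCy NER pipelines described above. The `Span` type, the `ASPECT_TERMS` and `OPINION_TERMS` word lists, and `extract_spans` are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicons for illustration only. The production system used
# trained extractors, not fixed word lists.
ASPECT_TERMS = ["delivery time", "battery life", "checkout process",
                "wait staff", "food", "service"]
OPINION_TERMS = ["was terrible", "loved it", "took forever", "great", "slow"]


@dataclass(frozen=True)
class Span:
    kind: str   # "ASPECT" or "OPINION"
    text: str   # surface text as it appears in the review
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def extract_spans(review):
    """Find aspect and opinion spans by case-insensitive phrase lookup."""
    spans = []
    for kind, terms in (("ASPECT", ASPECT_TERMS), ("OPINION", OPINION_TERMS)):
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, review, flags=re.IGNORECASE):
                spans.append(Span(kind, match.group(0), match.start(), match.end()))
    # Return spans in reading order so later stages can reason about position.
    return sorted(spans, key=lambda s: s.start)


review = "The food was great but the service was slow"
spans = extract_spans(review)
for span in spans:
    print(span.kind, repr(span.text), span.start, span.end)
```

Keeping character offsets on every span is the useful part of this shape: the linking step needs to know what text sits between an aspect and an opinion, not just which words were found.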

A custom relation classifier. The hard part of aspect-based sentiment is linking each opinion to the correct aspect. "The food was great but the service was slow" has two aspects and two opinions, and you have to bind them correctly. I built a relation classifier that takes (aspect span, opinion span, surrounding context) and predicts whether they are linked. This is the piece most off-the-shelf sentiment APIs skip entirely, and it is where the production accuracy lives.
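The linking step can be sketched as a function over (aspect span, opinion span, surrounding context). The version below is a hand-written rule standing in for the trained relation classifier, whose features and model are not described in this write-up. The features (`token_gap`, `crosses_clause`), the `CLAUSE_BREAKS` pattern, and the `max_gap` threshold are assumptions for illustration.

```python
import re
from dataclasses import dataclass

# Assumed clause-boundary markers; a trained model would learn these cues.
CLAUSE_BREAKS = re.compile(r"\b(but|however|although|while)\b|[,;.]", re.IGNORECASE)


@dataclass(frozen=True)
class Span:
    text: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def pair_features(review, aspect, opinion):
    """Describe the text lying between an aspect span and an opinion span."""
    first, second = sorted((aspect, opinion), key=lambda s: s.start)
    between = review[first.end:second.start]
    return {
        "token_gap": len(between.split()),
        "crosses_clause": bool(CLAUSE_BREAKS.search(between)),
    }


def is_linked(review, aspect, opinion, max_gap=4):
    """Heuristic decision rule used here in place of a trained classifier."""
    features = pair_features(review, aspect, opinion)
    return not features["crosses_clause"] and features["token_gap"] <= max_gap


def link_pairs(review, aspects, opinions):
    """Score every candidate (aspect, opinion) pair and keep the linked ones."""
    return [(a.text, o.text)
            for a in aspects for o in opinions
            if is_linked(review, a, o)]


review = "The food was great but the service was slow"
aspects = [Span("food", 4, 8), Span("service", 27, 34)]
opinions = [Span("great", 13, 18), Span("slow", 39, 43)]
print(link_pairs(review, aspects, opinions))
# [('food', 'great'), ('service', 'slow')]
```

The two cross pairs (food with slow, service with great) are rejected because the word "but" sits between them. A rule like this breaks quickly on real text, which is the reason to learn the decision from labeled pairs instead.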

Prompt engineering for review reports. Once aspect-sentiment pairs were extracted, the system aggregated them into review reports for brand managers. I kept LLM use deliberate here: the prompts generated natural-language summaries from structured aspect-sentiment data instead of feeding raw reviews to a model. This confined LLM cost to the report-generation step rather than per-review processing, and made the output more controllable and consistent.
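A minimal sketch of that idea, assuming the extracted pairs have already been reduced to (aspect, sentiment label) tuples: aggregate the pairs into counts, then build the prompt from the counts alone. The function names, the prompt wording, and the `positive`/`negative` labels are hypothetical, and the call to an actual model is left out because this write-up does not name one.

```python
from collections import Counter, defaultdict


def aggregate(pairs):
    """Count sentiment labels per aspect from (aspect, sentiment) tuples."""
    counts = defaultdict(Counter)
    for aspect, sentiment in pairs:
        counts[aspect][sentiment] += 1
    return counts


def build_report_prompt(brand, counts):
    """Build an LLM prompt from aggregated counts only, never raw review text."""
    lines = [
        f"Write a short review report for the brand manager of {brand}.",
        "Use only the aspect-level counts below and do not invent figures.",
        "",
        "aspect | positive | negative",
    ]
    for aspect in sorted(counts):
        c = counts[aspect]  # a Counter returns 0 for labels it has not seen
        lines.append(f"{aspect} | {c['positive']} | {c['negative']}")
    return "\n".join(lines)


pairs = [("food", "positive"), ("food", "positive"), ("service", "negative")]
prompt = build_report_prompt("Example Bistro", aggregate(pairs))
print(prompt)
```

The prompt size here depends on the number of distinct aspects, not on the number of reviews, which is what keeps the LLM step small as volume grows.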

The engineering work that made it ship

Domain customization was constant. A "wait time" in a restaurant context means service speed; in a healthcare context it means appointment scheduling. I built taxonomies per industry and trained the extractors against domain-specific labeled data. Re-training cycles were short because spaCy makes that workflow fast.
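One way to picture a per-industry taxonomy is a mapping from surface phrases to canonical aspect labels, keyed by domain. The sketch below is an assumption about the shape of such a structure, not the taxonomy format used in the project. The label names (`service_speed`, `appointment_scheduling`, `staff`) and the `canonical_aspect` helper are hypothetical.

```python
from typing import Optional

# Hypothetical taxonomies: the same phrase maps to different canonical
# aspects depending on the industry.
TAXONOMIES = {
    "restaurant": {
        "wait time": "service_speed",
        "wait staff": "staff",
    },
    "healthcare": {
        "wait time": "appointment_scheduling",
    },
}


def canonical_aspect(domain: str, phrase: str) -> Optional[str]:
    """Map an extracted aspect phrase to its canonical label for a domain.

    Returns None when the domain or the phrase is not in the taxonomy.
    """
    taxonomy = TAXONOMIES.get(domain, {})
    return taxonomy.get(phrase.strip().lower())


print(canonical_aspect("restaurant", "Wait time"))   # service_speed
print(canonical_aspect("healthcare", "wait time"))   # appointment_scheduling
print(canonical_aspect("healthcare", "wait staff"))  # None
```

Normalizing to canonical labels before aggregation means reports group "wait time" mentions under the meaning that applies to that customer's industry.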

Cost engineering mattered as much as accuracy. Per-review LLM scoring would have cost an order of magnitude more than the deterministic spaCy + classifier pipeline I used. The LLM only entered at the report aggregation step, where the input was already small and structured.
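The scaling argument can be written down as a small cost model. Every number below is an arbitrary placeholder in abstract cost units, not a figure from this project; the point is only which term grows with review volume. `daily_cost` and its parameters are hypothetical names.

```python
def daily_cost(reviews_per_day, reports_per_day, llm_call_cost,
               pipeline_cost_per_review, llm_per_review):
    """Daily processing cost in abstract units under two designs.

    llm_per_review=True:  one LLM call for every review.
    llm_per_review=False: deterministic pipeline for every review, with
                          LLM calls only for report generation.
    """
    if llm_per_review:
        return reviews_per_day * llm_call_cost
    return (reviews_per_day * pipeline_cost_per_review
            + reports_per_day * llm_call_cost)


# Placeholder values chosen only to show the shape of the comparison.
params = dict(reports_per_day=50, llm_call_cost=1000, pipeline_cost_per_review=10)

for reviews in (5_000, 10_000):
    per_review = daily_cost(reviews, llm_per_review=True, **params)
    report_only = daily_cost(reviews, llm_per_review=False, **params)
    print(reviews, per_review, report_only)
```

In the per-review design the LLM term scales with review volume. In the report-only design the LLM term scales with the number of reports, and the part that scales with reviews is the much cheaper deterministic pipeline.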

Edge cases were the long tail: sarcasm, mixed sentiment within a single span, negation scope, and comparison phrases ("better than last time"). Each of these required pattern-level work, and getting them right was the difference between a demo that worked on cherry-picked reviews and a production system that handled real customer text.
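As one example of what pattern-level work can look like, here is a sketch of negation-scope handling with a fixed token window. This is a common simple heuristic and an assumption on my part for illustration, not a description of the exact patterns used in production. The `NEGATORS` and `POLARITY` lexicons and the `window` size are hypothetical.

```python
import re

# Hypothetical lexicons for illustration.
NEGATORS = {"not", "never", "no", "hardly"}
POLARITY = {"great": 1, "good": 1, "slow": -1, "terrible": -1}


def scored_opinions(text, window=3):
    """Return (opinion word, polarity) pairs, flipping polarity when a
    negator appears within `window` tokens before the opinion word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        if token not in POLARITY:
            continue
        scope = tokens[max(0, i - window):i]
        negated = any(t in NEGATORS or t.endswith("n't") for t in scope)
        polarity = -POLARITY[token] if negated else POLARITY[token]
        results.append((token, polarity))
    return results


print(scored_opinions("The food was not good"))
# [('good', -1)]
print(scored_opinions("The service wasn't slow"))
# [('slow', 1)]
print(scored_opinions("Not sure about parking, but the food was great"))
# [('great', 1)]
```

The third example shows why scope matters: the negator at the start of the sentence is outside the window and must not flip "great". Sarcasm and comparisons need different handling and are not covered by this sketch.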

The outcome

Metric: Result
Review volume processed: thousands per day across customer base
Aspect-level granularity: per-aspect sentiment per review
LLM cost per review: near zero (LLM only at report aggregation)
Domain customization: per-industry taxonomies and extractors
Output: aggregated review reports with aspect-level breakdowns

The pipeline ran in production powering FamePilot's review analytics for its brand customers.

What this case is proof of

The productization pattern I now use across consulting engagements has its roots in this build. Raw input goes in (review text plus domain), a structured intermediate representation is computed cheaply and deterministically (aspects, opinions, links, sentiments), and the LLM only enters at the step where natural language is actually needed (the report). The result stays cost-efficient at scale, controllable, and accurate.

If you have a high-volume NLP problem where running everything through an LLM is too expensive and a single sentiment score is too coarse, this is the architecture I would design for you.