What Makes a Shampoo 'Natural'? A Developer's Guide to Ingredient Classification

What Makes a Shampoo 'Natural'? A Developer's Guide to Ingredient Classification

Published Jun 19, 202617 min read

A product manager drops a one-line ticket into your sprint: "Add a 'natural' badge to product pages." Two-hour job, they figure. You open your ingredient data source expecting a is_natural boolean to query against, and there's nothing there. Welcome to the quiet trap of building a natural ingredient shampoo filter that actually holds up — because the word "natural" has no field, no flag, and no regulatory anchor behind it.

Three problems surface within the first ten minutes. First, "natural" has no legal definition in cosmetics, so there's no authoritative source returning the value you want to display. Second, INCI lists are dense with chemical-sounding names for plant-derived compounds — Sodium Coco-Sulfate is coconut-derived, Tocopherol is vitamin E — so any naive string-match that flags "looks like a chemical" breaks on the first label. Third, the front-label marketing ("botanical," "plant-powered," "100% natural") almost never matches what the back-label INCI panel actually declares.

Get the classification wrong and the consequences are concrete: you mislead users, expose the brand to greenwashing complaints, or fail a compliance review. That last risk is sharpening — according to a NaTrue briefing, the EU is rolling out anti-greenwashing directives that ban vague environmental claims and demand verifiable, specific substantiation.

This guide hands you a classification model you can implement against an ingredient list — schema design, parsing logic, edge-case handling, and the single-versus-batch API decision. Not a marketing essay about what "natural" should mean. A working spec.

A developer's desktop split-screen — left side a code editor showing a JSON ingredient array, right side a shampoo bottle with a "100% Natural Botanicals" front label propped against the monitor. Overhead/slightly-angled desk shot, natural

Table of Contents

Why 'Natural' Has No API-Ready Definition

A natural ingredient shampoo feature fails at the data layer before you write a line of UI code, because no authority returns a queryable definition of "natural" you can trust. This is the single technical truth that shapes everything downstream.

In the United States, the FDA has no regulatory definition of "natural" for cosmetic labeling — the term simply does not appear in cosmetic regulations, according to the Michigan State University Center for Research on Ingredients & Safety. A brand can print "natural" on a bottle whose formula is 90% petrochemical-derived and break no labeling rule.

Contrast that with "organic," which is anchored. Under the USDA National Organic Program, a cosmetic carrying the USDA Organic seal must meet the same four tiers as food, per the USDA Agricultural Marketing Service: "100% organic" (every ingredient organic), "organic" (≥95% organic content), "made with organic [X]" (≥70% organic), and below 70% where only individual organic ingredients may be named — guidance echoed by NSF. Those are measurable thresholds. They map cleanly to a numeric field. "Natural" maps to nothing.

The EU doesn't fix this either, though it constrains it differently. Cosmetic claims fall under Article 20 of Regulation 1223/2009 and Regulation (EU) No 655/2013, which set six common criteria — legal compliance, truthfulness, evidential support, honesty, fairness, and informed decision-making — according to EcoMundo's compliance guide and Cosmetics Europe. A "natural" claim must be substantiated and not misleading. But the regulation defines the criteria for making a claim, not the threshold for being natural. It governs how you may speak, not what the word means.

There's a near-miss worth knowing. A proposed U.S. bill floated a ≥70% natural-substance threshold (excluding water and salt) for any product labeled "natural," per RegASK and ArentFox Schiff. Proposed. Not law. You can't build against a bill that may never pass.

Natural is not a property of an ingredient — it's a policy decision your product makes and must be able to defend.

The downstream consequence is unavoidable: because no source returns an is_natural boolean, every team encodes its own definition. The real engineering task is designing a defensible classification rule, not "looking up the truth value." That reframing is the thesis of everything that follows.

Flat-lay (directly overhead) of three shampoo bottles turned to show "natural"/"botanical"/"plant-powered" front-label claims, each with a printed INCI panel slip placed beside it showing the actual chemical-sounding ing

The feature is worth building because demand is real and measurable. In personal-care surveys, 74% of respondents say organic ingredients matter to them, and 65% want clear ingredient lists they can understand, according to NSF International. In clean-beauty tracking, 68% of consumers actively seek "clean" or natural beauty products, per ESW. And the organic shampoo market alone sits at roughly USD 4.54 billion in 2023, forecast toward USD 7.04 billion by 2030, according to Grand View Research. The appetite for the badge is settled. The defensibility of the badge is your problem to solve.

The Four Classification Models You Can Actually Implement

With no external truth value to query, you pick a model and encode it. Four are realistic to implement. Each makes a different trade between rigor and effort.

Origin-based classification labels each ingredient as plant- or mineral-derived versus synthetic. It's the simplest baseline and maps directly to a single origin/source field per ingredient. You ask one question — where did this come from — and aggregate the answers.

Processing-based (ISO 16128-aligned) uses the ISO 16128 three-class system: "natural" (unmodified plant or mineral source, simple physical processing only), "derived natural" (more than 50% natural origin by molecular weight or renewable carbon content after chemical processing), and "non-natural" (predominantly petrochemical or synthetic). ISO 16128 also produces a natural origin index from 0 to 1 for each ingredient and for the whole formula, letting you compute statements like "87% ingredients of natural origin," according to SILAB, CTPA, Sophim, and the ISO 16128-2 standard. This is the most standards-defensible model.

Certification-mirroring replicates COSMOS or Ecocert allowlists. COSMOS requires nearly all ingredients to be of natural origin with a narrow allowlist of permitted synthetics — specific preservatives, for example — per Ecocert and COSMOS-standard AISBL. Ecocert "natural cosmetics" require ≥95% of total ingredients to be of natural origin, plus packaging environmental criteria, per the Ecco-Verde Ecocert summary and Vivify. Highest defensibility, highest data burden.

Exclusion-list-based defines "natural" as the absence of a blocklist — no sulfates, no parabens, no silicones. Easiest for users to grasp, weakest scientifically. Excluding sulfates doesn't make a product natural; it makes it sulfate-free.

ModelData fields requiredImplementation complexityStandards basisUser-intuitiveness
Origin-basedOrigin/source per ingredientLowNone (ad hoc)High
Processing-based (ISO 16128)Origin + natural-origin index (0–1)MediumISO 16128 (formal)Medium
Certification-mirroringFull allowlist + % compositionHighCOSMOS / EcocertLow
Exclusion-list-basedBlocklist match flagsLowNone (consumer heuristic)High

Match the model to your audience. E-commerce and DTC product pages are usually best served by exclusion-list or origin-based logic, where speed and an intuitive badge matter more than formal rigor — shoppers want a fast, legible signal. Compliance and formulation teams should reach for certification-mirroring or ISO 16128, where defensibility outweighs simplicity and someone may eventually ask you to justify the claim in writing. Scanner apps fit origin-based logic with per-ingredient resolution, since the user is inspecting natural shampoo ingredients one at a time and wants a verdict on each.

The trade-off runs in one direction and it's worth internalizing: intuitiveness and defensibility pull against each other. The most explainable model — the exclusion list — is the least scientifically grounded. The most rigorous — ISO 16128 — needs per-ingredient natural-origin index data that most teams don't have sitting in their database. Picking a model is really picking which of those two costs you're willing to pay.

Reading an INCI List Like a Classifier, Not a Chemist

A chemist reads a label to understand chemistry. You're building a pipeline that routes each INCI entry to a classification programmatically, and a natural ingredient shampoo verdict is the aggregate of many small, repeatable decisions. Here's the per-ingredient sequence.

Step one: normalize the INCI name. Strip casing and whitespace, and handle Latin botanical names consistently — Cocos Nucifera Oil, Aloe Barbadensis Leaf Juice. Normalization is what lets two records of the same ingredient collide into one match instead of two near-misses.

Step two: resolve synonyms and identifiers. Collapse marketing and common names to a canonical INCI name, then disambiguate using CAS and EC numbers. Two ingredients can share a common name; a CAS number is unique. Without a canonical numeric ID, you can't safely decide whether two database rows are the same entity.

Step three: look up origin metadata. Plant, mineral, or synthetic. This is the field that most classification models actually pivot on, and it's the one you cannot infer from the name alone.

Step four: apply the processing rule. Run the entry against your chosen model from the previous section — under ISO 16128, ask whether natural-origin content exceeds 50% to decide between "derived natural" and "non-natural."

Step five: flag ambiguous derived surfactants. Sodium Coco-Sulfate is coconut-derived but heavily processed and contested. Cocamidopropyl Betaine is coconut-derived but a documented irritant and sensitizer — a CAPB safety assessment in Cosmetics journal and a report in Dermatitis both record irritation and sensitization at use concentrations. These pass a naive origin check and fail a closer one. Route them to manual review.

Step six: emit a label and its basis. Output the classification and the field that justified it. "Derived natural — natural-origin content >50%" is auditable. "Natural" with no basis is a liability.

Two trap categories deserve their own attention. Nature-identical actives are synthetic molecules chemically identical to natural ones; the name and the chemistry both look natural while the origin is not. And plant-derived-but-heavily-processed surfactants clear the origin gate but stumble on the processing test.

This pipeline runs many times per product. A typical retail shampoo carries 10–30 INCI entries, according to Healthline, and some mainstream formulas reach 30–60, per Allpa Botanicals — so design step one through six to be cheap, because you'll execute them dozens of times per bottle.

Mapping Ingredient Data Fields to a 'Natural' Label

A defensible rule is only possible with structured fields. Scraping front-label text gives you marketing copy — "natural," "botanical" — not source, not CAS, not irritancy. These are the concrete fields your classifier should query, and why each one matters.

Origin / source. This is the baseline natural-versus-synthetic signal and the field most models pivot on. Without it you're guessing from the ingredient name, and the name lies in both directions: Tocopherol sounds synthetic and is vitamin E, while a pretty botanical name can sit on a heavily processed derivative. Origin metadata is the difference between classification and guesswork.

CAS / EC identifiers. These disambiguate ingredients that share common names. A canonical numeric ID is what lets two database rows be merged or kept distinct reliably. When a user-submitted INCI token is ambiguous, the CAS number is your tiebreaker.

Synonyms. Collapse marketing aliases and trade names down to a canonical INCI so your classifier sees one entity, not five. The same compound may appear under several names across suppliers; without synonym resolution you'll double-count, mismatch, or miss it entirely.

Comedogenicity and irritancy scores (0–5). This is where "natural ≠ safe" lives, and where secondary filtering happens. Dermatologists are blunt about it: Cleveland Clinic experts note that "natural" and "clean" are unregulated terms and that natural ingredients can cause severe allergic reactions. Cocamidopropyl betaine — a coconut-derived surfactant marketed as mild — is a documented sensitizer per the CAPB safety assessment and the Dermatitis report cited earlier. A numeric irritancy score lets your UI tell the truth.

A shampoo can be 100% plant-derived and still score a 4 on irritancy — natural and gentle are different columns in your database.

severity labels and safety_status. Surface these alongside the natural flag so the interface never conflates "natural" with "gentle." The question users implicitly ask — is natural shampoo gentle — is a separate query against separate fields, and your schema should keep it that way.

This is precisely the field set a structured ingredient API returns: origin, CAS/EC, synonyms, comedogenicity and irritancy scores, severity labels, and an overall safety_status. The Dermalytics API exposes exactly these. The architectural point stands on its own — query structured data, not label text — but it's worth noting which platforms already deliver this shape.

Building the Classifier: Single Lookup vs. Batch Analysis

Two integration patterns are available, and the right one depends on whether you're inspecting ingredients one at a time or scoring a whole product.

Single lookup via GET /v1/ingredients/{name} suits a scanner app streaming one ingredient as the user scans, or a "tap an ingredient to learn more" interaction. You fetch per-entry data and assemble the natural verdict client-side. The pattern is granular and responsive, at the cost of orchestrating N calls per product.

Batch analyze via POST /v1/analyze submits the whole natural shampoo formulation in one request and returns per-ingredient data plus an aggregate safety_status. This fits an e-commerce page computing a single product-level natural score at page-build or cache-warm time. One round trip, one response to parse.

DimensionGET /v1/ingredients/{name} (single)POST /v1/analyze (batch)
Best-fit use caseScanner / tap-to-inspectProduct-page / catalogue score
Requests per productOne per ingredient (10–30)One per formulation
Aggregate scoringComputed client-sideReturns aggregate safety_status
Latency profileSub-100ms per callSub-100ms median, single round trip
Credit behaviorCharged per successful matchCharged per successful match
Implementation effortHigher (orchestrate N calls)Lower (one call, parse response)

Scoping matters for both cost and latency. A 10–30 ingredient shampoo, the count Healthline reports for typical retail formulas, is one small batch call — not 30 separate round trips. At catalogue scale, collapsing N requests into one is the difference between a page build that finishes and one that times out. The credit-on-successful-match model means unmatched or garbage INCI tokens don't cost you — useful when user-submitted scan data is noisy.

So the decision splits cleanly. Scanners want streaming single lookups, because the user is inspecting ingredients sequentially and expects a per-ingredient verdict as the camera resolves each name. Catalogues want batch plus caching, because you compute a product-level score once and serve it from cache until the formulation changes. With sub-100ms median latency and a 99.9% uptime SLA, either pattern stays inside a reasonable budget — but the architecture you pick should follow the interaction, not the other way around.

For the integration path itself, the official SDKs are published on npm and PyPI, and the full API contract lives as OpenAPI 3 at api.dermalytics.dev. Generate a client from the contract and you've skipped most of the request-plumbing work before you've written your first classification rule.

A smartphone held in-hand showing an ingredient-scanner app screen — camera has captured a shampoo INCI panel and the UI overlays a per-ingredient "Natural / Derived natural / Synthetic" tag on three of the visible ingredients. Close, over-

Avoiding the Greenwashing Trap: Edge Cases That Break Your Rules

A clean model survives the demo and dies in production. The ingredients below are the ones that generate bug reports, support tickets, and legal questions on a natural ingredient shampoo feature. Each needs an explicit handling rule.

Water (Aqua). Almost always the number-one ingredient by weight, and a quiet disaster for any percentage-by-weight model — is water natural, inert, or neither, and how do you score the single largest line on the label? Recommended behavior: exclude water and salt from natural-percentage calculations. This mirrors the USDA convention, which measures agricultural content by weight excluding water and salt per the USDA AMS, and the proposed U.S. bill's 70%-excluding-water-and-salt framing per RegASK. Hardcode the exclusion or every formula skews toward "mostly water, therefore mostly natural."

Nature-identical compounds. Synthetic molecules chemically identical to natural ones — some preservatives qualify. Under ISO 16128 these typically count as non-natural despite identical chemistry. Recommended behavior: classify by origin and process, not by molecular identity, and document that choice so it's defensible when challenged.

"Derived-from" surfactants. Heavily processed coconut and palm derivatives like Sodium Coco-Sulfate and Cocamidopropyl Betaine pass a naive origin check but can fail the ISO 16128 >50% natural-origin test depending on processing. CAPB allergy commonly presents as eyelid, facial, scalp, and neck dermatitis, according to a ScienceDirect topic overview and a Dermatitis article on coconut-derived surfactant allergenicity. Recommended behavior: route these to the "derived natural" class and decouple the safety flag entirely.

Fragrance / Parfum. An opaque composite that can hide dozens of undisclosed components, which defeats per-ingredient classification on its face. Recommended behavior: treat Parfum as "unclassifiable/unknown" rather than silently assuming it's natural. A false "natural" on the most opaque entry in the list is the easiest greenwashing claim to lose.

Single-ingredient greenwashing. Products that overemphasize one plant ingredient while the rest of the formula is conventional, or lean on green imagery and color to imply an eco-benefit that the INCI panel doesn't support — patterns documented by Provenance, MPR Labs, and the Urban Ecology Center. Recommended behavior: require a formula-level threshold, never a single-ingredient trigger. The EU's anti-greenwashing directives, per NaTrue, now demand verifiable, specific claims — a single hero botanical won't clear that bar.

The ingredient that breaks your classifier is never the obvious chemical — it's the water, the fragrance, and the coconut-derived surfactant nobody flagged.

The framing that keeps all of this honest: "natural" is a marketing category, not a toxicological one. Evaluate safety, irritation potential, and exposure — not origin alone — as the Skin Science Authority and cosmetic science educator Dr. Michelle Wong of Lab Muffin both argue. Your classifier can report origin accurately and still be dangerous if the UI lets users read "natural" as "safe."

Implementation Checklist: Shipping a Defensible Classifier

Run this before you ship the badge. Each item points back to the relevant section without re-litigating it.

  1. Pick and document your classification model. Choose one of the four — origin, ISO 16128 processing, certification-mirroring, or exclusion-list — and write the rule down. The documentation is your defense if a greenwashing claim is ever raised against the natural ingredient shampoo label.
  2. Define your data source and enrichment path. Confirm you can retrieve origin, CAS/EC, and synonyms for every INCI entry. String-matching front-label text is not a data source; it's a guess dressed as one.
  3. Choose single-lookup versus batch architecture for your use case. Scanner interactions point to GET /v1/ingredients/{name}; product pages point to POST /v1/analyze. Pick the pattern that follows the interaction.
  4. Separate the "natural" flag from safety and irritancy in schema and UI. Keep severity, irritancy (0–5), and safety_status as distinct columns so the badge can never quietly imply "gentle."
  5. Handle the known edge cases explicitly. Exclude water and salt from percentage rules, treat Parfum as unclassifiable, and route derived surfactants to "derived natural." These are the entries that break otherwise-clean logic.
  6. Surface the basis of the label to users. Show why something is classified natural. Transparency is the strongest greenwashing defense you have, and it aligns with EU claim-substantiation rules.
  7. Test against real shampoo INCI lists before shipping. Run 10–30-ingredient real formulations through the pipeline and review the ambiguous-flag bucket by hand. The edge cases only reveal themselves against production data.

For a low-friction first step, make a single GET /v1/ingredients/{name} call against a known ingredient — try Sodium Coco-Sulfate or Cocamidopropyl Betaine — using the OpenAPI 3 contract at api.dermalytics.dev or the official npm and PyPI SDKs, and inspect the returned origin, CAS, and irritancy fields. That single response is enough to prototype rule number one and see exactly which fields your classifier will pivot on.

Frequently Asked Questions

Is "natural" the same as "organic" for shampoo?

No. "Organic" is certifiable and regulated — the USDA AMS requires ≥95% organic content for the "organic" seal and ≥70% for "made with organic [X]." "Natural" has no FDA definition at all, per MSU CRIS. One is a measurable threshold you can query; the other is a policy you define and must be ready to defend.

Can an ingredient be both natural and synthetic?

Effectively yes — that's the "nature-identical" category: a synthetic molecule chemically identical to a naturally occurring one, common among certain preservatives. Under ISO 16128 these usually count as non-natural, because classification follows origin and processing rather than molecular identity, according to SILAB and CTPA.

How many ingredients are in a typical shampoo INCI list?

Most retail shampoos list 10–30 ingredients, and some mainstream formulas reach 30–60, especially with multiple fragrance components and plant extracts, per Healthline and Allpa Botanicals. That range is your batch-call sizing target when budgeting latency and credits.

Does a higher natural percentage mean a gentler shampoo?

No — natural and gentle are unrelated axes, and the answer to "is natural shampoo gentle" lives in a different field. Cocamidopropyl betaine is coconut-derived yet a documented irritant and sensitizer, and Cleveland Clinic dermatologists note natural ingredients can cause severe allergic reactions. Keep irritancy as its own column.