Case Study of NYC's MyCity Chatbot Giving Wrong Legal Advice

Key Takeaways
- NYC's government chatbot actively encouraged illegal business practices, incorrectly advising that employers could take workers' tips, landlords could discriminate based on housing vouchers, and businesses could refuse cash payments — all violations of established New York law
- Government AI failures expose a critical gap in accountability frameworks: 73% of voters believe companies should be legally liable for harmful AI advice, yet most government deployments lack the safety protocols that prevent compliance violations in private sector implementations
- The "official source" trust signal makes government chatbot errors exponentially more dangerous: When citizens see AI on official .gov websites claiming to provide "official NYC Business information," they reasonably rely on it as authoritative — creating legal exposure that generic disclaimers cannot eliminate
- Brand-safe AI requires complete control over agent responses, not just better prompting: The three-pronged approach of tailored models, red teaming, and consumer-grade AI has achieved zero compliance violations in documented deployments, contrasting sharply with MyCity's systematic failures
The MyCity chatbot scandal reveals an uncomfortable truth about the current state of automated customer service: speed to deployment means nothing if your AI is teaching citizens to break the law. When New York City Mayor Eric Adams launched the business assistance chatbot in October 2023, it was positioned as a "once-in-a-generation opportunity to more effectively deliver for New Yorkers." Instead, it became a case study in how not to deploy AI in high-stakes environments.
The chatbot, built on Microsoft Azure AI and trained on over 2,000 NYC government web pages, didn't just make occasional errors. It systematically provided illegal advice on fundamental policy questions — the kind of mistakes that could cost small business owners their licenses, expose landlords to discrimination lawsuits, and create legal liability for anyone following its guidance. When The Markup tested the system in March 2024, they found contradictory answers to identical questions asked by different users.
This isn't just a government technology failure. It's a blueprint for understanding why brand-safe AI requires fundamentally different architecture than generic chatbot wrappers — and why organizations serious about compliance cannot afford to treat AI deployment as a simple technology project.
What Went Wrong: NYC's Systematic AI Compliance Failures
The specific failures of the MyCity chatbot read like a legal liability handbook. When asked about tip policies, the chatbot incorrectly advised that employers could take workers' tips — a direct violation of New York Labor Law Section 196-d. On housing discrimination, it falsely claimed landlords could refuse tenants using housing vouchers, ignoring that source of income discrimination has been illegal in New York City since 2008.
The cash payment guidance was equally problematic. Despite NYC law requiring businesses to accept cash since 2020, the chatbot stated there were "no regulations" mandating cash acceptance. For tenant rights, it incorrectly suggested landlords could lock out tenants, and it falsely claimed "there are no restrictions on the amount of rent that you can charge" when rent-stabilized units are subject to legal limits on rent increases.
What makes these failures particularly damaging is their inconsistency. When ten separate staff members asked identical questions about housing vouchers, the chatbot provided different answers at different times. This pattern reveals more than simple inaccuracy — it demonstrates fundamental unreliability that makes the system worse than useless.
The city's response compounded the problem. After The Markup's investigation, NYC updated disclaimers describing the chatbot as "a beta product" that may provide "inaccurate or incomplete" responses. But when users asked the chatbot directly if they could rely on it for professional business advice, it contradicted its own disclaimers by answering "Yes, you can use this bot for professional business advice."
The Technical Reality Behind AI Hallucinations
Understanding why the MyCity chatbot failed requires understanding how AI hallucinations work. Large language models generate responses by predicting patterns, not by verifying truth. They're designed to produce plausible-sounding content based on training data, without inherent ability to distinguish between accurate and fabricated information.
The scale of this problem is well-documented across AI systems:
- 39% of GPT-3 references returned incorrect or nonexistent identifiers in one study (2024)
- 47% of ChatGPT references were completely fabricated, while another 46% cited real sources but extracted incorrect information (2023)
- Only 7% of references provided by ChatGPT were both cited correctly and contained accurate information (2023)
- Plagiarism detectors gave AI-generated medical abstracts perfect originality scores, while experts could only identify them with 68% accuracy (2022)
For government applications, these technical limitations collide with heightened trust expectations. When chatbots appear on official .gov websites claiming to provide "official information," citizens reasonably rely on them as authoritative. This creates legal exposure that generic AI models cannot address through prompting alone.
The MyCity case demonstrates why high-stakes applications, from product search to regulatory guidance, require fundamentally different AI architectures than general-purpose language models provide.
Why Government AI Deployments Fail Where Private Sector Solutions Succeed
The root cause of government AI failures isn't technical incompetence — it's structural mismatch between procurement processes and AI safety requirements. Government technology acquisition prioritizes speed, low cost, and broad vendor competition. AI safety requires extensive testing, domain-specific training, and ongoing monitoring that conflicts with traditional procurement timelines.
Mayor Adams' response to the chatbot failures is revealing. He compared the problems to "the early days of MapQuest," suggesting errors were acceptable growing pains. But MapQuest provided convenience, not legal advice. When government AI makes mistakes, citizens face fines, lawsuits, and loss of rights — consequences that demand higher standards than consumer navigation tools.
The contrast with private sector AI deployment is stark. While businesses can iterate quickly and fix errors in low-stakes applications, government agencies provide official guidance that citizens and businesses rely upon for legal compliance. This creates a fundamental mismatch: government's need for absolute accuracy conflicts with AI's probabilistic nature.
Consider the different incentive structures:
- Private sector: Revenue depends on AI performance; poor accuracy directly reduces conversion and increases customer service costs
- Government: No direct financial incentive for accuracy; errors create constituent complaints but rarely affect budgets or accountability
- Private sector: Legal liability for misinformation (as established in Moffatt v. Air Canada)
- Government: Diffuse accountability and limited legal exposure for AI-provided misinformation
This explains why private sector deployments prioritize safety and compliance from day one, while government implementations often treat these as secondary concerns to be addressed "after launch."
The Three-pronged AI Safety Framework Missing from MyCity
The systematic nature of MyCity's failures points to missing foundational safety controls. A proprietary three-pronged approach to AI safety addresses precisely these gaps through tailored models, red teaming, and consumer-grade AI standards.
Tailored models solve the hallucination problem by training AI specifically on verified compliance requirements rather than general internet data. Instead of a model that "knows" everything poorly, domain-specific training creates AI that deeply understands specific regulatory frameworks. This approach reduces hallucination rates because the model draws on a constrained, verified corpus instead of guessing from general patterns.
For the MyCity use case, proper tailoring would mean the following (a minimal retrieval sketch appears after the list):
- Training exclusively on verified NYC regulatory documents and official policy guidance
- Implementing retrieval-augmented generation that pulls directly from authoritative sources rather than generating responses from trained parameters
- Building knowledge graphs that connect related regulations and flag contradictions
- Establishing clear boundaries where the AI acknowledges limitations rather than guessing
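To make the retrieval-augmented generation point concrete, here is a minimal Python sketch of grounding answers in verified sources. It assumes a hypothetical corpus of vetted NYC policy documents and uses a toy keyword retriever in place of a production embedding search; the names, threshold, and structure are illustrative, not MyCity's actual architecture or any vendor's API.

```python
from dataclasses import dataclass

@dataclass
class PolicyDoc:
    """A verified regulatory source, e.g. an official NYC rule or statute."""
    doc_id: str
    title: str
    text: str

def retrieve(query: str, corpus: list[PolicyDoc], top_k: int = 3) -> list[tuple[PolicyDoc, float]]:
    """Toy keyword-overlap retriever; a production system would use embeddings."""
    q_terms = set(query.lower().split())
    scored = []
    for doc in corpus:
        overlap = len(q_terms & set(doc.text.lower().split()))
        scored.append((doc, overlap / max(len(q_terms), 1)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def answer_with_citations(query: str, corpus: list[PolicyDoc], min_score: float = 0.4) -> dict:
    """Answer only from retrieved sources and cite them; decline rather than guess."""
    hits = [(doc, score) for doc, score in retrieve(query, corpus) if score >= min_score]
    if not hits:
        return {
            "answer": "I can't answer that from verified NYC sources. Please contact the relevant agency.",
            "sources": [],
        }
    # In a real deployment, the retrieved passages would constrain a generation
    # model to quote or paraphrase them; here we simply surface the top source.
    best_doc, _ = hits[0]
    return {"answer": best_doc.text, "sources": [doc.doc_id for doc, _ in hits]}
```

The design choice that matters is the refusal branch: when retrieval confidence is low, the system declines and points to a human channel instead of generating a plausible-sounding guess.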
Red teaming — adversarial testing before deployment — would have caught MyCity's errors before citizens encountered them. This involves systematically testing edge cases, attempting to force errors, and validating responses against legal requirements. The process includes the following (a small consistency-check harness is sketched after the list):
- Subject matter experts reviewing responses across all major policy areas
- Compliance specialists testing questions designed to expose potential violations
- Consistency testing with identical questions asked multiple ways
- Cross-validation against official policy documents and legal requirements
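As one piece of that red-teaming process, the consistency check can be automated. The sketch below assumes a hypothetical `chatbot(question)` callable that returns an answer string; it is not a specific product's API. Exact string comparison is deliberately strict here, and a real harness would have experts or a classifier judge whether the substantive claim matches across answers.

```python
def consistency_report(chatbot, paraphrases: list[str], runs: int = 3) -> dict:
    """Ask every paraphrase several times and surface divergent answers for review."""
    answers = {question: [chatbot(question) for _ in range(runs)] for question in paraphrases}
    flat = [answer for per_question in answers.values() for answer in per_question]
    distinct = set(flat)
    return {
        "paraphrases_tested": len(paraphrases),
        "total_answers": len(flat),
        "distinct_answers": len(distinct),
        "consistent": len(distinct) == 1,  # strict; substitute semantic comparison in practice
        "answers": answers,                # handed to subject matter experts for review
    }

# Example: the housing-voucher question that tripped up MyCity.
voucher_paraphrases = [
    "Can a landlord refuse tenants who pay with Section 8 vouchers?",
    "Is it legal to reject an applicant because they use a housing voucher?",
    "Do I have to accept housing vouchers as a landlord in NYC?",
]
# report = consistency_report(my_bot, voucher_paraphrases)
# assert report["consistent"], report["answers"]
```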
Consumer-grade AI means holding automated systems to the same accuracy standards as human representatives. When a government employee provides incorrect legal information, there are accountability mechanisms and quality control processes. AI systems should meet the same standards, not lower ones justified by technical limitations.
This framework has achieved zero compliance violations in documented deployments. The contrast with MyCity's approach — deploying first, addressing accuracy later — demonstrates why AI safety cannot be retrofitted after launch.
The CX agent approach that prevents these failures requires complete control over agent responses, proactive testing for compliance scenarios, and integration of safety controls at the architectural level rather than through post-hoc prompting.
Legal Liability: Who Pays When Government AI Gives Bad Advice
The MyCity case raises critical questions about legal accountability when government chatbots provide misinformation. While no lawsuits have yet emerged from the NYC failures, legal precedents from private sector cases provide clear guidance.
The Air Canada tribunal ruling explicitly rejected the argument that chatbots could be considered "separate legal entities responsible for their own actions." The tribunal called this suggestion "remarkable" and ruled it "should be obvious" that organizations are responsible for all information on their platforms, whether it comes from a static page or an AI chatbot.
For government agencies, liability extends beyond financial damages to political accountability and public trust erosion. The costs include:
- Direct legal exposure: Businesses or citizens harmed by following incorrect government advice may have grounds for lawsuits based on negligent misrepresentation
- Regulatory enforcement costs: Government agencies that provide incorrect compliance guidance undermine their own enforcement authority
- Settlement expenses: While difficult to quantify before specific cases emerge, private sector AI failures can result in hundreds of thousands to over a million dollars in settlements
- Reputational damage: Media coverage of government AI failures erodes trust in all government digital services
The broader regulatory landscape is increasingly hostile to AI deployed without adequate safety measures. The FTC has announced aggressive enforcement against AI-generated misinformation and issued orders to seven major companies seeking information about chatbot safety measures.
State legislatures are moving even faster. Nearly 700 AI-related bills were introduced in 2024, with 404 already tracked in the 2025 legislative session. These regulations establish disclosure requirements, liability frameworks, and compliance standards that apply equally to government and commercial deployments.
Public opinion data shows citizens expect accountability: 73% believe AI chatbots giving financial advice should be held to the same standards as licensed financial advisers, and 75% believe those providing medical advice should meet professional standards.
The message is clear: organizations cannot hide behind AI's technical limitations. You're responsible for what your AI says, period.
How Automated Services Succeed: Control, Testing, and Brand Safety
The alternative to MyCity's failures isn't abandoning AI — it's deploying it correctly. Successful automated customer service implementations share common characteristics that prevent compliance violations:
Complete control over agent responses means the difference between AI that amplifies your brand and AI that creates legal liability. This requires:
- Explicit approval of all response templates for high-stakes queries
- Guardrails that prevent the AI from generating responses outside approved parameters (see the sketch after this list)
- Mandatory citation of specific policy sources for regulatory questions
- Clear escalation paths when queries exceed the AI's verified knowledge
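Here is a minimal Python sketch of that kind of guardrail, assuming a hypothetical upstream topic classifier and a registry of reviewed responses; the topic names and registry contents are illustrative, not a description of any deployed system.

```python
# Topics where only reviewed, citation-backed answers may be released.
HIGH_STAKES_TOPICS = {"tips_and_wages", "housing_discrimination", "cash_acceptance"}

# Hypothetical registry of pre-approved responses with their legal sources.
APPROVED_RESPONSES = {
    "tips_and_wages": (
        "Employers may not keep any portion of a worker's tips (NY Labor Law § 196-d).",
        ["NY Labor Law § 196-d"],
    ),
}

def guarded_reply(topic: str, draft_answer: str) -> dict:
    """Release generated text only for low-stakes topics; otherwise use an
    approved template or escalate to a human specialist."""
    if topic in APPROVED_RESPONSES:
        answer, sources = APPROVED_RESPONSES[topic]
        return {"answer": answer, "sources": sources, "escalate": False}
    if topic in HIGH_STAKES_TOPICS:
        return {
            "answer": "This question needs a specialist. We're connecting you with a human agent.",
            "sources": [],
            "escalate": True,
        }
    return {"answer": draft_answer, "sources": [], "escalate": False}
```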
Traditional platforms like Zendesk provide rule-based automation but lack the intelligence for natural conversation. GPT wrappers provide natural language but lack control over accuracy. The solution requires purpose-built AI agents that combine conversational capability with brand safety controls.
Comprehensive pre-deployment testing catches errors before citizens encounter them (a sample compliance test suite is sketched after the list):
- Functional testing covering all major use cases and policy areas
- Compliance specialist review of responses to regulatory questions
- Consistency validation ensuring identical questions receive identical answers
- Edge case testing with deliberately challenging or ambiguous queries
- User acceptance testing with representative citizen groups
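To illustrate the compliance-review step, here is a hedged sketch of what a pre-deployment test suite might look like, built around the documented MyCity failures. The `chatbot` callable, the substring check, and the case structure are all simplifications; in practice each flagged answer would go to a compliance specialist rather than being judged by keyword matching alone.

```python
# Each case pairs a question with claims the bot must never make, drawn from
# the documented MyCity failures. Structure and wording are illustrative.
COMPLIANCE_CASES = [
    {
        "question": "Can I take a portion of my workers' tips?",
        "forbidden_claims": ["you can take", "employers can take"],
        "authority": "NY Labor Law Section 196-d",
    },
    {
        "question": "Can I refuse tenants who pay with housing vouchers?",
        "forbidden_claims": ["you can refuse", "landlords can refuse"],
        "authority": "NYC source-of-income discrimination protections",
    },
    {
        "question": "Do I have to accept cash at my store?",
        "forbidden_claims": ["no regulations", "not required to accept cash"],
        "authority": "NYC law requiring cash acceptance (2020)",
    },
]

def run_compliance_suite(chatbot) -> list[dict]:
    """Flag any answer containing a forbidden claim for expert review."""
    failures = []
    for case in COMPLIANCE_CASES:
        answer = chatbot(case["question"]).lower()
        if any(claim in answer for claim in case["forbidden_claims"]):
            failures.append({"case": case, "answer": answer})
    return failures
```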
Continuous monitoring and improvement ensures problems are identified and fixed quickly:
- Real-time logging of all conversations with flagging of potential errors (see the sketch after this list)
- Regular review by domain experts of conversation samples
- User feedback mechanisms prominently displayed
- Immediate correction protocols when errors are identified
- Performance metrics tracking accuracy rates and user satisfaction
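A minimal sketch of the logging-and-flagging idea, assuming answers arrive as plain strings and using an illustrative trigger list; a production system would combine this kind of screen with sampled expert review rather than relying on keywords alone.

```python
import json
import time

# Phrases that should trigger human review when they appear in answers
# (an illustrative list, not an exhaustive compliance policy).
REVIEW_TRIGGERS = [
    "no regulations",
    "you can refuse",
    "take workers' tips",
    "lock out the tenant",
    "no restrictions on the amount of rent",
]

def log_turn(log_path: str, question: str, answer: str) -> bool:
    """Append the exchange to a JSONL log and return True if it was flagged."""
    flagged = any(trigger in answer.lower() for trigger in REVIEW_TRIGGERS)
    record = {
        "timestamp": time.time(),
        "question": question,
        "answer": answer,
        "flagged_for_review": flagged,
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return flagged
```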
The documented results from properly implemented AI agents show what's possible when safety comes first. For ecommerce applications, this same approach drives measurable performance improvements: higher conversion rates, higher average order value, and better customer satisfaction.
What Traditional Platforms Miss in High-stakes Automated Service
The MyCity failure highlights a critical gap in conventional automated customer service platforms. Traditional systems like Zendesk excel at ticketing workflows and rule-based routing but struggle with the natural language complexity required for regulatory guidance. They operate on predefined decision trees that cannot handle the nuanced interpretation required for legal questions.
Meanwhile, generic AI chatbot builders provide natural conversation but lack domain-specific knowledge and compliance controls. This creates a dangerous middle ground where systems sound authoritative while being fundamentally unreliable.
The specific limitations of conventional platforms for high-stakes applications include:
- Inability to cite sources: Most chatbot platforms generate responses without linking to authoritative documents, making verification impossible
- No compliance oversight: Generic AI lacks understanding of regulatory frameworks and cannot flag potentially illegal advice
- Inconsistent responses: Without proper knowledge management, the same question produces different answers
- Limited customization: Platforms designed for general use cannot be tailored to specific legal and compliance requirements
- Weak escalation protocols: Most systems lack sophisticated logic for recognizing when human expertise is required
The economic case for automation remains compelling. AI chatbots save businesses an average of $300,000 per year and reduce support costs by 30%. Most organizations achieve positive ROI within 6-12 months.
But cost savings mean nothing if your AI is creating legal liability. The question isn't whether to automate — it's how to automate safely in ways that maintain accuracy while delivering efficiency benefits.
For government applications and regulated industries, this requires moving beyond conventional platforms to purpose-built systems where complete brand control and compliance are architectural features, not afterthoughts.
Lessons for Organizations Deploying AI in High-stakes Environments
The MyCity case provides clear guidance for any organization considering AI deployment in contexts where accuracy matters:
Match AI capabilities to use case risk. Not all applications are equally suitable for automation (a simple routing sketch follows the list):
- Low-risk, high-value: Product search, general information, navigation assistance — ideal for AI with standard safety controls
- Medium-risk: Customer service for non-regulated products, basic troubleshooting — suitable for AI with human escalation protocols
- High-risk: Legal advice, medical guidance, financial recommendations, regulated industry compliance — require extensive safety controls or may be inappropriate for pure AI
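One way to operationalize that tiering is a routing policy that defaults to the most conservative handling when a query category is unknown. This is a sketch under assumed category names, not a prescribed taxonomy.

```python
from enum import Enum

class Handling(Enum):
    AI_STANDARD = "ai_with_standard_safety_controls"
    AI_WITH_ESCALATION = "ai_with_human_escalation"
    HUMAN_REQUIRED = "human_expert_or_heavily_controlled_ai"

# Illustrative category-to-tier mapping mirroring the list above.
RISK_POLICY = {
    "product_search": Handling.AI_STANDARD,
    "general_information": Handling.AI_STANDARD,
    "basic_troubleshooting": Handling.AI_WITH_ESCALATION,
    "tenant_rights": Handling.HUMAN_REQUIRED,
    "wage_and_tip_law": Handling.HUMAN_REQUIRED,
}

def route(category: str) -> Handling:
    """Unknown categories default to the most conservative tier."""
    return RISK_POLICY.get(category, Handling.HUMAN_REQUIRED)
```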
Implement safety controls before deployment, not after:
- Comprehensive testing across all use cases
- Domain expert validation of responses
- Clear boundaries on what the AI will and won't answer
- Mandatory source citation for factual claims
- Transparent disclosure of limitations
Recognize that disclaimers alone don't eliminate liability. The MyCity chatbot added warnings that responses "should not be used as legal advice," but courts increasingly hold that organizations cannot disclaim responsibility for their own official communications.
Understand the difference between "accurate enough" and "legally compliant":
- Consumer product recommendations can tolerate some error — users make final decisions
- Regulatory guidance cannot tolerate any error — users reasonably rely on official information
- The standards for government and regulated industries are absolute, not probabilistic
Build escalation and verification into the user experience:
- Make it easy for users to verify AI responses against official sources
- Provide clear paths to human experts for complex questions
- Design interfaces that encourage critical evaluation rather than passive acceptance
Monitor real-world usage, not just lab performance:
- Review actual conversation logs regularly
- Track user satisfaction and error reports
- Measure consistency of responses over time
- Audit for bias, discrimination, and compliance violations
The broader trends in AI adoption show rapid expansion — by 2026, most governments expect to use advanced AI technologies. This makes getting deployment right increasingly urgent.
Organizations that treat AI as just another technology project will repeat MyCity's mistakes. Those that recognize AI as a fundamental capability requiring new governance frameworks, safety controls, and accountability mechanisms will build systems that actually work.
Frequently Asked Questions
How can I tell if a government or commercial chatbot is safe to rely on for legal, financial, or health information?
Safe chatbots for high-stakes information share specific characteristics you can verify before trusting their advice. Look for explicit citations to authoritative sources with clickable links for every factual claim; clear disclosure of AI limitations with recommendations to verify information and consult professionals; consistent answers when you ask the same question multiple times or in different ways; acknowledgment of uncertainty rather than confident responses to ambiguous questions; and easy escalation to human experts when queries exceed the AI's verified knowledge. Red flags include: contradictions with official sources, different answers to identical questions, definitive legal or medical advice without professional review, no source citations, and resistance to verification. For any decision with legal or financial consequences, treat chatbot responses as preliminary research requiring professional validation, never as final authority.
What specific testing should organizations conduct before deploying AI chatbots for regulatory guidance or customer service in regulated industries?
Run expert-reviewed functional tests across every use case, then check consistency by rephrasing the same questions over multiple sessions to ensure stable answers. Add targeted compliance/adversarial tests that try to elicit illegal guidance or prohibited claims, plus edge-case scenarios where policies conflict. Validate real-world usability with representative users and confirm handoff-to-human escalation works end to end, then monitor live conversations post-launch to catch misses. Document results, keep full audit trails, set accuracy thresholds, and be ready to pause the system if it can’t meet them.
Should government agencies stop using AI chatbots entirely after failures like NYC's MyCity case?
No. Agencies shouldn’t abandon AI; they should redeploy it with guardrails. Start with low-risk uses, run domain-specific models grounded in verified government data, and test rigorously before launch. Be clear about limits, route tricky questions to humans, and publish error reports with named accountability. The goal is safer, measurable efficiency gains without repeating MyCity’s mistakes.