Lesson Report:
Title
From Intentional Manipulation to Algorithmic Bias: Training Data, Proxies, and Predictive Pitfalls
Synopsis: In this Week 7 session, the class shifted from intentional disruption of public discourse to unintended, systemic algorithmic bias. Students defined “bias,” analyzed where bias enters AI systems (especially through training data and proxy choices), and applied these concepts by designing and critiquing a simple predictive algorithm to flag first-year students at risk of dropping out. The session set up Thursday’s case discussion on a real-world welfare risk-scoring system (Allegheny) and clarified the first major writing assignment (policy memo).

Attendance
– Absent students mentioned: 0
– Notes: Initial in-room count announced as 16; later confirmed as the full 24. Late arrivals noted (e.g., Gavin, Sahar). One brief breakout-room mismatch resolved.

Topics Covered (chronological)
1) Admin and assessment updates
– Week framing and modality: Acknowledged the shift from online to in-person and confirmed Week 7.
– Major assignment announced: Policy memo due November 1, 23:59 (Bishkek time). Not a midterm exam. Approximately 4 pages proposing one concrete policy solution to a specific AI-and-democracy problem discussed in the course. Syllabus to be updated; more details on Thursday.
– Video journals: Clarified expectations for this due cycle; the requirement to reference a classmate’s video is paused this week. A poll on Thursday will decide whether to drop that requirement permanently.

2) Lecture: Reframing bias (from malign actors to unintended harms)
– Transition from “intentional disruption” to “good intentions that still produce harm.”
– Defining “bias”:
– Student inputs highlighted common definitions (prejudice, unfairness, stereotyping, partiality).
– Instructor synthesis:
– Bias can be neutral: a systematic tendency to favor one outcome independent of data.
– Social biases (e.g., racism) are one subset; algorithmic systems can exhibit bias without intent.
– Key concept: Algorithmic bias — when an algorithm systematically prefers certain outputs or classes in ways that are not reflective of reality or instructions.

3) Case illustration: “Kyrgyzstan” vs “Kyrgyzia” in Russian-language LLM responses
– Phenomenon: An LLM (e.g., ChatGPT) defaults to “Kyrgyzia” in Russian, even after being corrected to use “Kyrgyzstan.”
– Explored causes:
– Not programmer malice; rather training-data effects.
– Language-specific corpora: Russian-language texts (including Soviet/older sources) frequently used “Kyrgyzia,” producing a strong statistical association.
– Context window limits: Even when corrected in-session, once context falls out of memory, responses revert to prior high-probability patterns.
– Contrast: In Kyrgyz, the model tends to use “Kyrgyzstan,” indicating corpus differences by language.
– Takeaway: Models reproduce distributional patterns of the data they ingest; “bias” here is a byproduct of historical and linguistic corpora.
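To make the takeaway concrete, here is a minimal Python sketch with invented corpus counts. It is not how an LLM works internally; it only illustrates why a purely distribution-driven model defaults to whichever surface form dominates its training texts.

from collections import Counter

# Invented toy "corpora": how often each surface form appears in texts of each language.
# Real training data is vastly larger; these counts only illustrate the frequency effect.
russian_corpus = ["Kyrgyzia"] * 800 + ["Kyrgyzstan"] * 200   # older Soviet-era usage dominates
kyrgyz_corpus = ["Kyrgyzstan"] * 950 + ["Kyrgyzia"] * 50

def most_likely_form(corpus):
    """Return the surface form a frequency-driven model would prefer."""
    return Counter(corpus).most_common(1)[0][0]

print(most_likely_form(russian_corpus))  # -> "Kyrgyzia": the high-frequency historical pattern wins
print(most_likely_form(kyrgyz_corpus))   # -> "Kyrgyzstan"

Once an in-session correction falls outside the context window, nothing in this frequency picture has changed, so responses revert to the dominant pattern.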

4) Student-sourced examples of algorithmic bias (diagnosing roots)
– Image generation:
– “Ideal city” prompts defaulted to white people and modern Western skylines (e.g., New York/San Francisco aesthetics).
– “Feminist city” skewed pink, reflecting stereotyped gender-color associations embedded in training tags and imagery.
– Reason: Overrepresentation of Western imagery and tags in training sets; the model turns abstract prompts (“feminist,” “ideal”) into concrete visual tropes it has seen most often.
– Language model prompt mapping:
– Russian “riddles” prompt returned probability/logic puzzles rather than classic riddles—likely due to how “zagadka”/related terms co-occur with math/logic content in the training distribution.
– Cultural mismatch in advice:
– Psychological guidance skewed individualistic and Western when asked about Kyrgyz family/household dynamics (e.g., newly married women living with husband’s family).
– Reason: Psych literature and counseling scripts in the corpus are predominantly Western; the model defaults to those norms absent strong local textual anchors.
– Facial recognition failures:
– In Afghanistan, systems performed poorly for women wearing headscarves or traditional clothing; more false rejects or misidentifications.
– In Western deployments, systems misidentified Black faces at higher rates; documentary “Coded Bias” cited with examples of wrongful flags and real-world harms.
– Root cause: Training data underrepresents certain facial features/clothing/contexts, producing unequal error rates (dataset imbalance and feature generalization gaps).
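To illustrate the unequal-error-rates point, a minimal Python sketch with invented numbers follows; it shows only the arithmetic of comparing false-reject rates across groups once labeled evaluation outcomes exist, not any real system’s results.

from collections import defaultdict

# Hypothetical evaluation records: (group, was_the_genuine_match_accepted).
# All counts are invented purely to illustrate computing per-group error rates.
records = (
    [("group_a", True)] * 970 + [("group_a", False)] * 30
    + [("group_b", True)] * 830 + [("group_b", False)] * 170
)

totals, false_rejects = defaultdict(int), defaultdict(int)
for group, accepted in records:
    totals[group] += 1
    if not accepted:
        false_rejects[group] += 1  # a genuine match the system rejected

for group in sorted(totals):
    print(f"{group}: false-reject rate = {false_rejects[group] / totals[group]:.1%}")
# group_a: 3.0% vs group_b: 17.0% -> one shared model, unequal error rates

A single overall accuracy number would hide exactly this kind of gap.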

5) Preparing for Thursday’s case: The Allegheny child welfare risk model
– Framing questions:
– How can a well-intended system (e.g., to prioritize services or triage risk) end up perpetuating inequity?
– Two core failure modes introduced: biased data and flawed proxies (to be traced in the reading).

6) Applied activity: Building and critiquing a predictive algorithm (dropout risk)
– Task setup (Round 1: Build):
– Prompt: As a university analytics team, design a simple algorithm to flag first-year students at high risk of dropping out.
– Deliverable: 5–7 data types you would collect (examples surfaced: attendance logs, GPA thresholds, high school grades, parent engagement, club participation, lateness, fee payment status, etc.).
– Introduced key concept: Proxy — a measurable indicator used as a stand-in for an unobservable target (e.g., “risk of dropping out”). Example proxies provided by instructor (see the illustrative sketch after this task setup):
– Attendance proxy: Missed 3 classes in a row triggers a risk flag.
– Academic proxy: GPA below 2.3 triggers a risk flag.
– Workflow: Small groups drafted data types and at least one proxy threshold per type in Google Docs.
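A minimal Python sketch of the two instructor-provided proxy thresholds; the field names and sample records are hypothetical, not taken from any group’s document.

# Minimal sketch of the attendance and academic proxies discussed in class.
# Field names and sample records are invented for illustration only.
students = [
    {"name": "A", "consecutive_absences": 4, "gpa": 3.1},
    {"name": "B", "consecutive_absences": 0, "gpa": 2.1},
    {"name": "C", "consecutive_absences": 1, "gpa": 3.5},
]

def risk_flags(student):
    """Apply the two proxy thresholds; return the reasons (if any) a student is flagged."""
    reasons = []
    if student["consecutive_absences"] >= 3:  # attendance proxy: missed 3 classes in a row
        reasons.append("attendance")
    if student["gpa"] < 2.3:                  # academic proxy: GPA below 2.3
        reasons.append("gpa")
    return reasons

for s in students:
    flags = risk_flags(s)
    print(s["name"], "->", flags if flags else "no flag")

The thresholds themselves (3 absences, GPA 2.3) are design choices, which is exactly what Round 2 asks groups to interrogate.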
– Task setup (Round 2: Exchange and critique):
– Groups swapped docs to identify:
– Invisible students: Who is likely to be missed by this model even though they are at risk (coverage gaps)?
– Misleading signals: Where proxies would trigger a risk flag for students who are not at risk, or fail to flag those who are (false positives/false negatives)?
– Debrief highlights:
– Parental involvement as a proxy:
– Could miss at-risk students whose parents are highly involved but who themselves struggle with mental health, isolation, or private stressors.
– Could also look fine on paper if fees are paid and meetings attended, while the student drops out for idiosyncratic reasons (e.g., entrepreneurial pursuit).
– Engagement proxies (clubs/sports, chronic lateness):
– Clubs are optional; non-participation doesn’t equal disengagement.
– Chronic lateness may reflect commute realities (e.g., long travel, traffic) or family responsibilities, not intent to withdraw.
– Instructor synthesis:
– Two systemic pitfalls:
1) Biased data: The dataset omits or underrepresents certain student realities (e.g., commuting patterns, caregiving burdens), creating coverage bias and “invisible students.”
2) Flawed proxies: Indicators like “late 5 times” or “no club activity” may not measure the target outcome (dropout risk), but rather socioeconomic context, schedule constraints, or preference—leading to false positives/negatives (illustrated in the sketch below).
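A minimal Python sketch (with invented records) of how comparing flags against actual outcomes surfaces the two failure modes after the fact:

# Hypothetical history: did the proxy flag the student, and did they actually drop out?
history = [
    {"flagged": True,  "dropped_out": False},  # false positive: e.g., a long-commute student flagged for lateness
    {"flagged": False, "dropped_out": True},   # false negative: an "invisible" student the proxies never saw
    {"flagged": True,  "dropped_out": True},
    {"flagged": False, "dropped_out": False},
]

false_positives = sum(1 for r in history if r["flagged"] and not r["dropped_out"])
false_negatives = sum(1 for r in history if not r["flagged"] and r["dropped_out"])

print("misleading signals (false positives):", false_positives)
print("invisible students (false negatives):", false_negatives)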
– Bridge to Thursday:
– Students to read the condensed chapter on the Allegheny model and track where biased data and flawed proxies produced inequitable outcomes.

Actionable Items
Urgent (by Thursday)
– Post/update on LMS:
– Finalize and publish the policy memo brief, rubric, and submission link; confirm due date/time (Nov 1, 23:59 Bishkek).
– Upload/provide access to the condensed Allegheny reading (verify all students can open the file).
– Clarify video journal (“rambles”) prompt and deadline; confirm that peer-referencing is waived for this week.
– In-class plan for Thursday:
– Prepare guiding questions that map the Allegheny case to “biased data” and “flawed proxies” (students should be able to name concrete examples from the text).
– Run the poll on whether to permanently remove the “respond to a peer’s video” requirement.
– Collect artifacts:
– Ensure all groups have shared their Google Doc links with data types and proxies; archive for reference/discussion Thursday.

Upcoming (next 1–2 weeks)
– Policy memo support:
– Offer example policy areas (e.g., mitigating LLM linguistic bias, auditing facial recognition in public services, culturally aware AI counseling guidance).
– Announce office hours or a Q&A segment for memo scoping and evidence expectations.
– Syllabus refresh:
– Update the syllabus calendar to reflect the memo due date and any adjustments made to the video journal requirement.

Reminders and follow-ups
– Cultural/linguistic context in AI:
– Consider sharing a short note or resources on language-specific biases (e.g., Russian “Kyrgyzia” vs “Kyrgyzstan”) and best practices for prompting/correcting models across languages.
– Optional resource: Recommend “Coded Bias” for students who raised facial recognition concerns.
– Activity feedback:
– Provide a brief recap handout summarizing “proxy design best practices” and common pitfalls (coverage bias, spurious correlations, threshold setting).
– If time permits, invite groups next session to quickly present one proxy they’d revise and how (to reinforce applied learning).

Homework Instructions:
ASSIGNMENT #1: Reading on the “Allegheny” Child Welfare Algorithm (Biased Data and Flawed Proxies)

You will prepare for Thursday’s discussion by completing the assigned chapter on the Allegheny child welfare algorithm and identifying where well-intentioned systems can go wrong due to biased data and flawed proxies—directly connecting to today’s examples (e.g., Kyrgyzstan/Kyrgyzia naming, image generators defaulting to “white,” facial recognition and headscarves).

Instructions:
1) Locate the assigned chapter/excerpt on the “Allegheny” (child welfare) algorithm (the shortened ~16-page version the professor mentioned).
2) Read it in full before Thursday’s class.
3) As you read, annotate for:
– Biased data: Who was overrepresented or underrepresented in the data, and how did that shape outcomes?
– Flawed proxies: Which indicators stood in for the real outcomes of interest, and why were those proxies misleading?
4) Write down at least two concrete examples you can share in class:
– One example of biased data and its consequence.
– One example of a flawed proxy and its consequence.
5) Be ready to explain one “invisible” group the system missed and one “misleading signal” (a proxy that triggered a false flag), echoing our in-class “freshman dropout” proxy exercise.
6) Optional connection: Link one example from the reading to a case we discussed today (e.g., training-data bias causing Kyrgyzia defaults, facial recognition errors with headscarves).

Due: Complete before Thursday’s class.

ASSIGNMENT #2: Video Response Journal (“Ramble”) — Reflecting on Algorithmic Bias

You will reinforce today’s learning by recording a brief reflection that applies the biased data vs. flawed proxies framework to a concrete example from your experience or from class.

Instructions:
1) Record a brief video reflection on today’s topic (algorithmic bias). Keep it concise and focused.
2) Choose a focus (examples you may use):
– A personal encounter with algorithmic bias (search, translation, LLMs, image generators, facial recognition).
– The Kyrgyzstan/Kyrgyzia naming case: explain how training data produced this behavior.
– The difference between biased data and flawed proxies using a clear, non-technical example (e.g., attendance as a proxy for dropout risk).
3) State clearly whether the issue you discuss is primarily due to:
– Biased data, or
– A flawed proxy, or
– Both
Provide a short justification.
4) This week only: you do not need to reference or respond to a classmate’s video (the professor explicitly waived that requirement for this week).
5) Upload your video using the usual submission link for Video Response Journals.

Due: Submit before Thursday night.

ASSIGNMENT #3: Policy Memo — Proposing a Solution for an AI-and-Democracy Problem

You will draft a ~4-page policy memo that advocates one concrete solution to a specific AI-and-democracy problem we’ve discussed, practicing targeted, solution-oriented writing.

Instructions:
1) Select your problem: Choose one specific challenge at the intersection of AI and democracy that we’ve covered (e.g., algorithmic bias in public discourse, harmful recommendation systems, inequitable automated decision-making).
2) Choose your solution: Identify one actionable policy or intervention you will advocate (e.g., a change to training data practices, proxy selection/validation, transparency/oversight, safeguards for “invisible” populations).
3) Define your audience and implementer: Decide whom you are advising (e.g., a government agency, a platform company, a university) and tailor your proposal accordingly.
4) Outline your memo (about four pages):
– Problem: Define it succinctly and concretely.
– Proposal: Describe your solution and how it would work in practice.
– Rationale: Explain why it is likely to work (mechanism, expected benefits).
– Risks and trade-offs: Anticipate downsides and challenges.
– Implementation: Identify key steps, stakeholders, and resources.
– Next steps: Provide specific, near-term actions.
5) Draft and revise: Use insights from class (e.g., training-data bias, flawed proxies, “invisible” groups) to strengthen your argument and examples.
6) Check the syllabus for current guidance; additional details will be provided in Thursday’s class (the professor noted this explicitly).
7) Submit your completed memo by the deadline.

Due: November 1 at 23:59 (Bishkek time).
