27 March 2017

Initial thoughts on fairness in paper recommendation?

There are a handful of definitions of "fairness" lying around, of which the most common is disparate impact: the rate at which you hire members of a protected category should be at least 80% of the rate you hire members not of that category. (Where "hire" is, for our purposes, a prediction problem, and 80% is arbitrary.) DI has all sorts of issues, as do many other notions of fairness, but all the ones I've seen rely on a pre-ordained notion of "protected category".
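To make the 80% rule concrete, here's a minimal sketch of the disparate impact check described above. All names here (`predictions`, `protected`, the functions themselves) are my own invention for illustration, not from any particular fairness library:

```python
# Sketch of the 80% disparate-impact rule: the positive-prediction ("hire")
# rate for the protected group should be at least 80% of the rate for
# everyone else.
def disparate_impact_ratio(predictions, protected):
    """predictions: list of 0/1 decisions; protected: parallel list of bools."""
    pos_in = [p for p, g in zip(predictions, protected) if g]
    pos_out = [p for p, g in zip(predictions, protected) if not g]
    rate_in = sum(pos_in) / len(pos_in)      # positive rate, protected group
    rate_out = sum(pos_out) / len(pos_out)   # positive rate, everyone else
    if rate_out == 0:
        return float("inf")
    return rate_in / rate_out

def satisfies_disparate_impact(predictions, protected, threshold=0.8):
    return disparate_impact_ratio(predictions, protected) >= threshold
```

Note that the whole check presupposes the `protected` column exists, which is exactly the assumption that breaks down in the paper-recommendation setting below.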

I've been thinking a lot about something many NLP/ML people have thought about in their musing/navel-gazing hours: something like a recommender system for papers. In fact, Percy Liang and I built something like this a few years ago (called Braque), but it's now defunct, and its job wasn't really to recommend, but rather to do offline search. Recommendation was always lower down the TODO list. I know others have thought about this a lot because over the last 10 years I've seen a handful of proposals and postdoc ads go out on this topic, though I don't really know of any solutions.

A key property that such a "paper recommendation system" should have is that it be fair.

But what does fair mean in this context, where the notion of "protected category" is at best unclear and at worst a bad idea? And to whom should it be fair?

Below are some thoughts, but they are by no means complete and not even necessarily good :P.

In order to talk about fairness of predictions, we have to first define what is being predicted. To make things concrete, I'll go with the following: the prediction is whether the user wants to read the entire paper or not. For instance, a user might be presented with a summary or the abstract of the paper, and the "ground truth" decision is whether they choose to read the rest of the paper.

The most obvious fairness concept is authorship fairness: that whether a paper is recommended or not should be independent of who the authors are (and what institutions they're from). On the bright side, a rule like this attempts to break the rich-get-richer effect, and means that even non-famous authors' papers get seen. On the dark side, authorship is actually a useful feature for determining how much I (as a reader) trust a result. Realistically, though, no recommender system is going to model whether a result is trustworthy: just that someone finds a paper interesting enough to read beyond the abstract. (Though the two are correlated.)

A second obvious but difficult notion of fairness is that performance of the recommender system should not be a function of, e.g., how "in domain" the paper is. For example, if our recommender system relies on generating parse trees (I know, comical, but suppose...), and parsing works way better on NLP papers than ML papers, this shouldn't yield markedly worse recommendations for ML papers. Or similarly, if the underlying NLP fares worse on English prose that is slightly non-standard, or slightly non-native (for whatever you choose to be "native"), this should not systematically bias against such papers.

A third notion of fairness might have to do with underlying popularity of topics. I'm not sure how to formalize this, but suppose there are two topics that anyone ever writes papers about: deep learning and discourse. There are far more DL papers than discourse papers, but a notion of fairness might establish that they be recommended at similar rates.

This strong rule seems somewhat dubious to me: if there are lots of papers on DL then probably there are lots of readers, and so probably DL papers should be recommended more. (Of course it could be that there exists an area where tons of papers get written and none get read, in which case this wouldn't be true.)

A weaker version of this rule might state conditions on one-sided error rates. Suppose that every time a discourse paper is recommended, it is read (high precision), but that only about half of the recommended DL papers get read (low precision). Such a situation might be considered unfair to discourse papers because tons of DL papers get recommended when they shouldn't, but not so for discourse papers.
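The one-sided check amounts to computing precision separately per topic and comparing. A sketch, with made-up data mirroring the discourse/DL example (the function name and data are hypothetical):

```python
from collections import defaultdict

def precision_by_topic(recommendations):
    """recommendations: list of (topic, was_read) pairs for recommended papers.
    Returns, per topic, the fraction of recommended papers that were read."""
    shown = defaultdict(int)
    read = defaultdict(int)
    for topic, was_read in recommendations:
        shown[topic] += 1
        read[topic] += int(was_read)
    return {t: read[t] / shown[t] for t in shown}

# Every recommended discourse paper is read; only half the DL ones are.
recs = [("discourse", True), ("discourse", True),
        ("DL", True), ("DL", False), ("DL", True), ("DL", False)]
# precision_by_topic(recs) gives discourse = 1.0, DL = 0.5: "unfair" to
# discourse in the sense above, since many DL papers get recommended
# when they shouldn't.
```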

Now, one might argue that this is going to be handled by just maximizing accuracy (aka click-through rate), but this is not the case if the number of people who are interested in discourse is dwarfed by the number interested in DL. Unless otherwise constrained, a system might completely forgo performance on those interested in discourse in favor of those interested in DL.

This is all fine, except that the world doesn't consist of just DL papers and just discourse papers (and nary a paper in the intersection, sorry Yi and Jacob :P). So what can we do then?

Perhaps a strategy is to say: I should not be able to predict the accuracy of recommendation on a specific paper, given its contents. That is: just because I know that a paper includes the words "discourse" and "RST" shouldn't tell me anything about what the error rate is on this paper. (Of course it does tell me something about the recommendations I would make on this paper.) You'd probably need to soften this with some empirical confidence intervals to handle the fact that many papers will have very few observations. You could also think about making a requirement/goal like this simultaneously on both false positives and false negatives.
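One crude way to operationalize this "contents shouldn't predict error" idea is a per-word audit: for each word, compare the recommender's error rate on papers containing that word against the overall error rate, softened by a confidence interval so that rarely-seen words don't get flagged on noise. This is a hypothetical sketch of my own (a normal-approximation interval, not any standard fairness metric), not a worked-out formalization:

```python
import math

def word_error_audit(papers, min_count=5, z=1.96):
    """papers: list of (set_of_words, recommender_erred) pairs, one per paper.
    Returns words whose papers' error rate differs from the overall error
    rate by more than a normal-approximation confidence half-width."""
    overall_rate = sum(err for _, err in papers) / len(papers)
    by_word = {}  # word -> (num papers containing it, num errors among them)
    for words, err in papers:
        for w in words:
            n, k = by_word.get(w, (0, 0))
            by_word[w] = (n + 1, k + int(err))
    flagged = {}
    for w, (n, k) in by_word.items():
        if n < min_count:
            continue  # too few observations to say anything
        half_width = z * math.sqrt(overall_rate * (1 - overall_rate) / n)
        if abs(k / n - overall_rate) > half_width:
            flagged[w] = k / n
    return flagged
```

A recommender would pass this (very rough) test if the audit flags nothing on held-out data: knowing a paper contains "discourse" or "RST" then tells you nothing about the error rate on it.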

A related issue is that of bubbles. I've many times been told that one of my (pre-neural-net) papers was done in neural nets land ten years ago; I've many times told-or-wanted-to-tell the opposite. Both of these are failures of exploration. Not out of malice, but just out of lack-of-time. If a user chooses to read papers if and only if they're on DL, should a system continue to recommend non-DL papers to them? If so, why? This directly contradicts the notion of optimizing for accuracy.

Overall, I'm not super convinced by any of these thoughts enough to even try to really formalize them. Some relevant links I found on this topic:

16 March 2017

Trying to Learn How to be Helpful (IWD++)

Over the past week, in honor of International Women's Day, I have had several posts, broadly around the topic of women in STEM. Previous posts in this series include: Awesome People: Bonnie Dorr, Awesome People: Ellen Riloff, Awesome People: Lise Getoor, Awesome People: Karen Spärck Jones and Awesome People: Kathy McKeown. (Today's is delayed one day, sorry!)

I've been incredibly fortunate to have a huge number of influential women in my life and my career. Probably the only other person (of any gender) who contributed to my career as much as those above is my advisor, Daniel. I've been amazingly supported by these women, and I've been learning and thinking and trying to do a lot over the past few years to do what I can to support women in our field.
There are a lot of really good articles out there on how to be a male ally in "tech." Some of these are more applicable to academia than others and I've linked to a few below.

Sometimes "tech" is not the same as "academia", and in the context of the academy, easily the best resource I've seen is Margaret Mitchell's writeup, a Short Essay on Retaining and Increasing Gender Diversity, with Focus on the Role that Men May Play. You should go read this now.

No really, go read it. I'll be here when you get back.

On Monday I attended a Male Allies in Tech workshop put on by ABI, with awesome organization by Rose Robinson and Lauren Murphy, in which many similar points were made. This post is basically a summary of the first half of that workshop, with my own personal attempt to interpret some of the material into an academic setting. Many thanks especially to Natalia Rodriguez, Erin Grau, Reham Fagiri and Venessa Pestritto on the women's perspectives panel, and Dan Storms, Evin Robinson, Chaim Haas and Kip Zahn on the men's perspectives panel (especially Evin Robinson!).

The following summary of the panels has redundancy with many of Margaret's points, which I have not suppressed and have tried to highlight.

  1. Know that you're going to mess up and own it. I put this one first because I'm entirely sure that even in writing this post, I'm going to mess up. I'm truly uncomfortable writing this (and my fingernails have paid the price) because it's not about me, and I really don't want to center myself. On the other hand, I also think it's important to discuss how men (i.e., me) can try to be helpful, and shying away from discussion also feels like a problem. The only place I feel like I can honestly speak from is my own experience. This might be the worst idea ever, and if it is, I hope someone will tell me and talk to me about it. So please, feel free to email me, message me, come find me in person or whatever.
  2. Pretty much the most common thing I've heard, read, whatever, is: listen to and trust women. Pretty much all the panelists at the workshop mentioned this in some form, and Margaret mentions this in several places. As an academic, though, there's more that I've tried to do: read some papers. There's lots of research on the topic of things like unconscious bias in all sorts of settings, and studies of differences in how men and women are cited, and suggested for talks, and everything else under the sun. A reasonable "newbie" place to start might be Lean In (by Sheryl Sandberg) which, for all the issues that it has, provides some overview of research and includes citations to literature. But in general, doing research and reading papers is something I know how to do, so I've been trying to do it. Beyond being important, I've honestly found it really intellectually engaging.
  3. Another very frequently raised topic on the panels, and something that Margaret mentions too, is to say something when you see or hear something sexist. Personally, I'm pretty bad at thinking of good responses in these cases: I'm a very non-type-A person, I'm not good with confrontation, and my brain totally goes into "flight" mode. I've found it really useful to have a cache of go-to responses. An easy one is something like "whoah" or just "not cool", whose primary benefit is being easy. Both seem to work pretty well, and take very little thought/planning. A more elaborate alternative is to ask for clarification. If someone says something sexist, ask what's meant by that. Often in the process of trying to explain it, the issue becomes obvious. (I've been on the receiving side of both such tactics, too, and have found them both effective there as well.)

    Another standard thing in meetings is for men to restate what a woman has stated as their own idea. A suggested response from Rose Robinson (one of the organizers) at the workshop is "I'm so glad you brought that up because maybe it wasn't clear when [woman] brought it up earlier." I haven't tried this yet, but it's going into my collection of go-to responses so I don't have to think too much. I'd love to hear other suggestions!
  4. A really interesting suggestion from the panel at the workshop was "go find a woman in your organization with the same position as you and tell her your salary." That said, I've heard personally from two women at two different universities that they were told they could not be given more of a raise because then they'd be making more than (some white guy). I'm not sure what I can do about cases like that. A related topic is startup: startup packages in a university are typically not public, so a variant of this is to tell your peers what your startup was.
  5. There were a lot of suggestions around the idea of making sure that your company's content has broad representation; I think in academia this is closely related to the first three of Margaret's points about suggesting women for panels, talks or interviews in your stead. I would add leadership roles to that list. One thing I've been trying to do when I'm invited to regular seminar series is to look at their past speakers and decide whether I would be contributing to the problem by accepting. This is harder for one-off things like conference talks/panels (because there's often no history), but even in those cases it's easy enough to ask if I'll be on an all-male panel. In cases where I've done this, the response has been positive. I've also been trying to be more openly intentional recently: if I do accept something, I'll try to explicitly say that I'm accepting because I noticed that past speakers were balanced. Positive feedback is good. A personally useful thing I did was write template emails for turning down invitations or asking for more information, with a list of researchers from historically excluded groups in CS (including but not limited to women) who could be invited in my stead. I almost never send these exactly as is, but they give me a starting point.

    There's a dilemma here: if every talk series, panel, etc., were gender balanced, women would be spending all their time going around giving talks and would have less time for research. I don't have a great solution here. (I do know that a non-solution is to be paternalistic and make decisions for other people.) One option would be to pay honoraria to women speakers and let the "market" work. This doesn't address the dilemma fully (time != money), but I haven't heard of or found other ideas. Please help!

    Turning down invitations to things as an academic is really hard. I recognize my relative privilege here that I already have tenure and so the cost to me for turning down this or that is pretty low in comparison to someone who is still a Ph.D. student or an untenured faculty member. That is to say: it's easy for me to say that I'm willing to take a short term negative reward (not giving a talk) in exchange for a long term very positive reward (being part of a more diverse community that both does better science and is also more supportive and inclusive). If I were still pre-tenure, this would definitely get clouded with the problem that it's great if there's a better environment in the future but not so great for me if I'm not part of it. On the other hand, pre-tenure is definitely a major part of the leaky pipeline, and so it's also really important to try to be equitable here. Each person is going to have to find a balance that they're comfortable with.

    One last thought on this topic is something that I was very recently inspired to think about by Hanna Wallach. My understanding is that she, like most people at companies, cannot accept honoraria, and so she recently started asking places to donate her honoraria to good causes instead. I can accept honoraria for talks, but perhaps by donating these funds to organizations like ABI or BlackGirlsCode, I can try to help other parts of the pipeline. (There are tons of organizations out there I've thought about supporting; I like BGC for intersectionality reasons.)
  6. I've been working hard to follow women on social media, as well as members of other historically excluded groups. This has been super valuable to me for expanding my views on tons of topics.
  7. The final topic at the workshop was a talk by the two authors of a new book on how and why men can mentor women called Athena Rising. This was really awesome. Mentoring in tech is different than advising in academia, but not that different. Or at least there are certainly some parallels. Looking back at Hal-a-few-years ago, I very much had fallen into the trap of "okay I advise a diverse group of PhD students ergo I'm supporting diversity." This is painfully obvious now when I re-read old grant proposals. A consistent thing I've heard is that this is a pretty low bar, especially because women who do the extra work required to get to our PhD program are really really amazing.

    I still think this is an important factor, but this discussion at the workshop made me realize that I can also go out and learn how to be a better advisor, especially to students whose lived experiences are very different than my own. And that it's okay if students don't want the same path in life that I do: "hone don't clone" was the catch-phrase here. This discussion reminded me of a comment one of the PhD students made to me after going to Grace Hopper: she really appreciated it because she could ask questions there that she couldn't ask me. I think there will always be such questions (because my lived experience is different), but I've decided to try to close the gap a bit by learning more here.
  8. Finally (and really, thank you if you've read this far), a major problem that was made apparent to me by Bonnie Webber is that one reason that women receive fewer awards in general is because women are nominated for fewer awards (note: this is not the only reason). Nominating women for awards is a super easy thing for me to do. It costs a few hours of my time to nominate someone for an award, or to write a letter (of course for serious awards, it's far more than a few hours to write a letter, but whatev). This includes internal awards at UMD, as well as external awards like ACL (or ACM or whatever) fellows, etc. Whenever I get an email for things like this, I'm trying to think about: who could I nominate for this that might otherwise be overlooked (Margaret's point on page 2!).
I promised some other intro resources; I would suggest:
  1. GeekFeminism: Allies
  2. GeekFeminism: Resources for Allies
  3. GeekFeminism: Good sexism comebacks
  4. Everyday Feminism: Male Feminist Rules to Follow
  5. GeekFeminism: Allies Workshop
Like I said at the beginning, what I really hope is that people will reply here with (a) suggestions for things they've been trying that seem to be working (or not!), (b) critical feedback that something here is really a bad idea and that something else is likely to be much more effective, (c) and general discussion about the broad issues of diversity and inclusion in our communities.

Because of the topic of the workshop, this is obviously focused in particular on women, but the broader discussion needs to include topics related to all historically excluded groups because what works for one does not necessarily work for another. Especially when intersectionality is involved. Rose Robinson ended the ABI Workshop saying "To get to the same place, women have to do extra. And Black women have to do extra extra." What I'm trying to figure out is what extra I can do to try to balance a bit more. So please, please, help me!

14 March 2017

Awesome people: Kathy McKeown (IWD++)

To honor women this International Women's Day, I have several posts, broadly around the topic of women in STEM. Previous posts in this series include: Awesome People: Bonnie Dorr, Awesome People: Ellen Riloff, Awesome People: Lise Getoor and Awesome People: Karen Spärck Jones.

Continuing on the topic of "who has been influential in my career and helped me get where I am?" today I'd like to talk about Kathy McKeown, who is currently the Director of the Institute for Data Sciences and Engineering at Columbia. I had the pleasure of writing a mini-bio for Kathy for NAACL 2013 when I got to introduce her as one of the two invited speakers, and learned during that time that she was the first woman chair of computer science at Columbia and also the first woman to get tenure in the entirety of Columbia's School of Engineering and Applied Science. Kathy's name is near synonymous with ACL: she's held basically every elected position there is in our organization, in addition to being a AAAI, ACM, and ACL Fellow, and having won the Presidential Young Investigator Award from NSF, the Faculty Award for Women from NSF and the ABI Women of Vision Award.

One aspect of Kathy's research that I find really impressive is something that was highlighted in a nomination letter for her to be an invited speaker at NAACL 2013. I no longer have the original statement, but it was something like "Whenever a new topic becomes popular in NLP, we find out that Kathy worked on it ten years ago." This rings true of my own experience: recent forays into digital humanities, work on document and sentence compression, paraphrasing, technical term translation, and even her foundational work in the 80s on natural language interfaces to databases (now called "semantic parsing").

Although---like Bonnie and Karen---I met Kathy through DUC as a graduate student, I didn't start working with her closely until I moved to Maryland and I had the opportunity to work on a big IARPA proposal with her as PI. That was the first of two really big proposals that she'd lead and I'd work on. These proposals involved both a huge amount of new-idea-generation and a huge amount of herding-professors, both of which are difficult in different ways.

On the research end, in the case of both proposals, Kathy's feedback on ideas has been invaluable. She's amazingly good at seeing through a convoluted idea and pushing on the parts that are either unclear or just plain don't make sense. She's really helped me hone my own ideas here.

On the herding-professors end, I am so amazed with how Kathy manages a large team. We're currently having weekly phone calls, and one of the other co-PIs and I have observed in all seriousness that being on these phone calls is like free mentoring. I hope that one day I'll be able to manage even half of what Kathy manages.

One of my favorite less-research-y memories of Kathy was when our previous IARPA project was funded, she invited the entire team to a kickoff meeting in the Hamptons. It was the fall, so the weather wasn't optimal, but a group of probably ten faculty and twenty students converged there, ran around the beach, cooked dinner as a group, and bonded. And we discussed some research too. I still think back to this event regularly, because it's honestly not something I would have felt comfortable doing in her position. I have a tendency to keep my work life and my personal life pretty separate, and inviting thirty colleagues over for a kickoff meeting would've been way beyond my comfort zone: I think I worry about losing stature. Perhaps Kathy is more comfortable with this because of personality or because her stature is indisputable. Either way, it's made me think regularly about what sort of relationship with students and colleagues I want and am comfortable with.

Spending any amount of time with Kathy is a learning experience for me, and I also have to thank Bonnie Dorr for including me on the first proposal with Kathy that kind of got me in the door. I'm incredibly indebted to her amazing intellect, impressive herding abilities, and open personality.

Thanks, Kathy!

13 March 2017

Awesome people: Karen Spärck Jones (IWD++)

To honor women this International Women's Day, I have several posts, broadly around the topic of women in STEM. Previous posts in this series include: Awesome People: Bonnie Dorr, Awesome People: Ellen Riloff and Awesome People: Lise Getoor.

Today is the continuation of the theme "who has been influential in my career and helped me get where I am?" and in that vein, I want to talk about another awesome person: Karen Spärck Jones. Like Bonnie Dorr, Karen is someone I first met at the Document Understanding Conference series back when I was a graduate student.

Karen has done it all. First, she invents inverse document frequency, one of those topics that's so ingrained that no one even cites it anymore. I'm pretty sure I didn't know she invented IDF when I first met her. Frankly, I'm not sure it even occurred to me that this was something someone had to invent: it was like air or water. She's the recipient of the AAAI Allen Newell Award, the BCS Lovelace Medal, the ACL Lifetime Achievement Award, the ASIS&T Award of Merit, and the Gerard Salton Award; she was a fellow of the British Academy (and VP thereof), and a fellow of AAAI, ECCAI and ACL. I highly recommend reading her speech from her ACL fellow award. Among other things, I didn't realize that IDF was the fourth attempt to get the formulation right!

If there are two things I learned from Karen, they are:

  1. simple is good
  2. examples are good
Although easily stated, these two principles are quite difficult to follow. I distinctly remember giving a talk at DUC on BayeSum and, afterward, Karen coming up to talk to me to try to get to the bottom of what the model was actually doing and why it was working, basically sure that there was a simpler explanation buried under the model.

I also can't forget Karen routinely pushing people for examples in talks. Giving a talk on MT that doesn't have example outputs of your translation system? Better hope Karen isn't in the audience.

Karen was also a huge proponent of breaking down gender barriers in computing. She's famously quoted as saying:
I think it's very important to get more women into computing. My slogan is: "Computing is too important to be left to men."
This quote is a wonderful reflection both of Karen's seriousness and of her tongue-in-cheek humor. She was truly one of the kindest people I've met.

In particular, more than any of these specifics I just remember being so amazed and grateful that even as a third year graduate student, Karen, who was like this amazing figure in IR and summarization, would come talk to me for a half hour to help me make my research better. I was extremely sad nearly ten years ago when I learned that Karen had passed away. Just a week earlier, we had been exchanging emails about document collections, and the final email I had from her on the topic read as follows:
Document collections is a much misunderstood topic -- you have to think what its for and eg where (in retrieval) you are going to get real queries from. Just assembling some bunch of stuff and saying hey giys, what would you like to do with this is useless.
This was true in 2007 and it's true today. In fact, I might argue that it's even more true today. We have nearly infinite ability to create datasets today, be they natural, artificial or whatever, and it's indeed not enough just to throw some stuff together and cross your fingers.

I miss Karen so much. She had this joy that she brought everywhere and my life is less for that loss.  

10 March 2017

Awesome people: Lise Getoor (IWD++)

To honor women this International Women's Day, I have several posts, broadly around the topic of women in STEM. Previous posts in this series include: Awesome People: Bonnie Dorr and Awesome People: Ellen Riloff.

Today is the continuation of the theme "who has been influential in my career and helped me get where I am?" and in that vein, I want to talk about another awesome person: Lise Getoor.

Lise is best known for her deep work in statistical relational learning, link mining, knowledge graph models and tons of applications to real world inference problems where data can be represented as a graph. She's currently a professor in CS at UCSC, but I had the fortune to spend a few years with her while she was still here at UMD. During this time, she was an NSF Career awardee, and is now a Fellow of the AAAI. At UMD when you're up for promotion, you give a "promotion talk" to the whole department, and I still remember sitting in her Full Prof promotion talk and being amazed---despite having known her for years at this point---at how well she made both deep technical contributions and also built software and tools that are useful for a huge variety of practitioners.

Like Bonnie Dorr, Lise was something of an unofficial mentor to me. Ok I'll be honest. She's still an unofficial mentor to me. Faculty life is hard work, especially when one has just moved; going through tenure is stressful in a place where you haven't had years to learn how things work; none of which is made easier by simultaneously having personal-life challenges. Lise was always incredibly supportive in all of these areas, and I don't think I realized until after she had moved to UCSC how much I benefited from Lise's professional and emotional labor in helping me survive. And how helpful it is to have an openly supportive senior colleague to help grease some gears. I always felt like Lise was on my side.

Probably one of the most important things I learned from Lise is how to be strategic, both in terms of research (what is actually worth putting my time and energy into) and departmental work (how can we best set ourselves up for success). As someone who has a tendency to spread himself too thin, it was incredibly useful to have a reminder that focusing on a smaller number of deeper things is more likely to have real lasting impact. I also found that I greatly respected her attention to excellence: my understanding (mostly from her students and postdocs) is that her personal acceptance rate on conference submissions is incredibly high (like almost 1.0), because her own internal bar for submission is generally much higher than any reviewer's. This is obviously something I haven't been able to replicate, but I think incredibly highly of Lise for this.

Lise and I got promoted the same year---her to full prof, me to associate prof---and so we had a combined celebration dinner party at one of the (many) great Eritrean restaurants in DC followed by an attempt to go see live jazz at one of my favorite venues across the street. The music basically never showed up, but it was a really fun time anyway. Lise gave me a promotion gift that I still have on my desk: a small piece of wood (probably part of a branch of a tree) with a plaque that reads "Welcome to the World of Deadwood." This is particularly meaningful to me because Lise is so far from deadwood that it puts me to shame, and I can only hope to be as un-deadwood-like as her for the rest of my career.

Thanks Lise!

09 March 2017

Awesome people: Ellen Riloff (IWD++)

To honor women this International Women's Day, I have several posts, broadly around the topic of women in STEM. Previous posts in this series include: Awesome People: Bonnie Dorr.

Today is the continuation of the theme "who has been influential in my career and helped me get where I am?" and in that vein, I want to talk about another awesome person: Ellen Riloff. Ellen is a professor of computer science at the University of Utah, and literally taught me everything I know about being a professor. I saw a joke a while ago that the transition from being a PhD student to a professor is like being trained for five years to swim and then being told to drive a boat. This was definitely true for me, and if it weren't for Ellen I'd have spent the past N years barely treading water. I truly appreciate the general inclusive and encouraging environment I belonged to during my time at Utah, and specifically appreciate everything Ellen did. When I think of Ellen as a researcher and as a person, I think: honest and forthright.

Ellen is probably best known for her work on bootstrapping (for which she and Rosie Jones received a AAAI Classic Paper award in 2017) and information extraction (AAAI Classic Paper honorable mention in 2012), but has also worked more broadly on coreference resolution, sentiment analysis, active learning, and, in a wonderful project that also reveals her profound love of animals, veterinary medicine. Although I only "officially" worked on one project with her (on plot units), her influence on junior-faculty-Hal was deep and significant.

It would be impossible to overstate how much impact Ellen has had on me as a researcher and a person. I still remember that on my first NSF proposal, I sent her a draft and her main comment was "remove half." I was like "nooooooo!!!!" But she was right, and ever since then I've repeated this advice to myself every time I write a proposal.

One of the most important scientific lessons I learned from Ellen is that how you construct your data matters. NLP is a field that's driven by the existence of data, but if we want NLP to be meaningful at all, we need to make sure that that data means what we think it means. Ellen's attention to detail in making sure that data was selected, annotated, and inspected correctly is deeper and more thoughtful than anyone else I've ever known. When we were working on the plot units stuff, we each spent about 30 minutes annotating a single fable, followed by another 30 minutes of adjudication, and then reannotation. I think we did about twenty of them. Could we have done it faster? Yes. Could we have had mechanical turkers do it? Probably. Would the data have been as meaningful? Of course not. Ellen taught me that when one releases a new dataset, this comes with a huge responsibility to make sure that it's carefully constructed and precise. Without that, the number of wasted hours of others that you run the risk of creating is huge. Whenever I work on building datasets these days, the Ellen-level-of-quality is my (often unreached) aspiration point.

I distinctly remember a conversation Ellen and I had about advising Ph.D. students in my first or second year, in which I mentioned that I was having trouble figuring out how to motivate different students. Somewhat tongue-in-cheek, Ellen pointed out that different students are actually different people. Obvious (and amusing) in retrospect, but as I never saw my advisor interacting one on one with his other advisees, it actually had never occurred to me that he might have dealt with each of us differently. Like most new faculty, I also had to learn how to manage students, how to promote their work, how to correct them when they mis-step (because we all mis-step), and also how to do super important things like write letters. All of these things I learned from Ellen. I still try to follow her example as best I can.

I was lucky enough to have the office just next door to Ellen's, and we were both in our offices almost every weekday. Her openness to having me stick my head in her door to ask questions about anything, from what are interesting grand research questions to how to handle issues with students, from how to write proposals to what do you want for lunch, was amazing. I feel like we had lunch together almost every day (that's probably an exaggeration, but that's how I remember it), and I owe her many thanks for helping me flesh out research ideas and generally learn to function as a junior faculty member. She was without a doubt the single biggest influence on my life as junior faculty, and I remain deeply indebted to her for everything she did, directly and behind the scenes.

Thanks Ellen!

08 March 2017

Awesome people: Bonnie Dorr (IWD++)

To honor women this International Women's Day, I have a several posts, broadly around the topic of women in STEM.

This is the first, and the topic is "who has been influential in my career and helped me get where I am?" There are many such people, and any list will be woefully incomplete, but today I'm going to highlight Bonnie Dorr (who founded the CLIP lab together with Amy Weinberg and Louiqa Raschid, and who also is a recent fellow of the ACL!).

For those who haven't had the chance to work with Bonnie, you're missing out. I don't know how she does it, but the depth and speed at which she interacts, works, produces ideas and gets things done is stunning. Before leaving for a program manager position at DARPA and then later to IHMC, Bonnie was a full professor (and then associate dean) here at UMD. At DARPA she managed basically two PMs' worth of projects, and was always excited about everything. During her time as a professor here at UMD (after earning her Ph.D. from MIT), Bonnie was an NSF Presidential Faculty Fellow, a Sloan Fellowship recipient, a recipient of the NSF Young Investigator Award, and an AAAI Fellow.

I learned a lot from Bonnie. I first met her back when I was a graduate student and Daniel and I had a paper in the Document Understanding Conference (basically the summarization workshop of the day) on evaluation. It was closely related to something Bonnie had worked on previously, and I was really thrilled to get feedback from her. Fast forward six years and then I'm writing proposals with Bonnie, advising postdocs and students together, and otherwise trying to learn as much as possible by osmosis and direct instruction.

One of the most important things I learned from Bonnie was: if you want it done, just do it. Bonnie is a do-er. This is reflected in her incredibly broad scientific contributions (summarization, machine translation, evaluation, etc.) as well as the impact she had on the department. It was clear almost immediately that the faculty here really respected Bonnie's opinion; her ability to move mountains was evident.

On a more personal note, although she was not my official senior-faculty-mentor when I came to UMD, Bonnie was one of two senior faculty members here who really did everything she could to help me---both professionally and personally. Whenever I was on the fence about how to handle something, I knew that I could go to Bonnie and get her opinion and that her opinion would be well reasoned. I wouldn't always take it (sometimes to my own chagrin), but she was always ready with concrete advice about specific steps to take about almost any topic. I've also been on two very-large grant proposals with her (one successful and one not) which have both been incredible learning experiences. Getting a dozen faculty to work on a 30 page document is no easy task, and Bonnie's combination of just-do-it and lead-by-example is something I still try to mimic when I'm in a similar (if smaller) position. Even when she was at DARPA, as well as now, as professor emerita at UMD, she's still actively supporting both me and other faculty here, and clearly really cares that people at UMD are successful.

In addition to Bonnie's seriousness and excellence in research and professional life, I also really appreciated her more laid back side. When I visited UMD back before accepting a job here, she hosted a visit day dinner for prospective grad students at her house, which overlapped with one of her student's Ph.D. defense: hence, a combined party. To honor the student, Bonnie had written a rap, which she then performed with her son beatboxing. It was in that moment that I realized truly how amazing Bonnie is not just as a researcher but as a person. (Of course, she attacked this task with exactly the same high intensity that she attacks every other problem!)

Overall, Bonnie is one of the most amazing researchers I know, one of the strongest go-getters I know, and someone I've been extremely lucky to have not just as a collaborator, but also as a colleague and mentor.

Thanks Bonnie!

12 December 2016

Should the NLP and ML Communities have a Code of Ethics?

At ACL this past summer, Dirk Hovy and Shannon Spruit presented a very nice opinion paper on Ethics in NLP. There's also been a great surge of interest in FAT-everything (FAT = Fairness, Accountability and Transparency), typified by FATML, but there are others. And yet, despite this recent interest in ethics-related topics, none of the major organizations that I'm involved in has a Code of Ethics: not the ACL, the NIPS foundation, nor the IMLS. After Dirk's presentation, he, Meg Mitchell, Michael Strube, some others, and I spent a while discussing ethics in NLP and the possibility of a code of ethics (Meg also brought this up at the ACL business meeting), which eventually gave rise to this post. (A different subset, including Dirk, Shannon, Meg, Hanna Wallach, Michael and Emily Bender, went on to form a workshop that will take place at EACL.)

(Note: the NIPS foundation has a "Code of Conduct" that I hadn't seen before that covers similar ground to the (NA)ACL Anti-Harassment Policy; what I'm talking about here is different.)

A natural question is whether this absence is unusual among professional organizations. The answer is most definitely yes: it's very unusual. The Association for Computing Machinery, the British Computer Society, the IEEE, the Linguistic Society of America and the Chartered Institute of Linguistics all have codes of ethics (or codes of conduct). In most cases, agreeing to the code of ethics is a prerequisite to membership in these communities.

There is a nice set of UMD EE slides on Ethics in Engineering that describes why one might want a code of Ethics:

  1. Provides a framework for ethical judgments within a profession
  2. Expresses the commitment to shared minimum standards for acceptable behavior
  3. Provides support and guidance for those seeking to act ethically
  4. Provides a formal basis for investigating unethical conduct and, as such, may serve both as a deterrent to and a means of discipline for unethical behavior

Personally, I've never been a member of any of these societies that have codes of ethics. Each organization has a different set of codes, largely because they need to address issues specific to their field of expertise. For instance, linguists often do field work, and in doing so often interact with indigenous populations.

Below, I have reproduced the IEEE code as a representative sample (emphasis mine) because it is relatively brief. The ACM code and BCS code are slightly different, and go into more details. The LSA code and CIL code are related but cover slightly different topics. By being a member of the IEEE, one agrees:
  1. to accept responsibility in making decisions consistent with the safety, health, and welfare of the public, and to disclose promptly factors that might endanger the public or the environment;
  2. to avoid real or perceived conflicts of interest whenever possible, and to disclose them to affected parties when they do exist;
  3. to be honest and realistic in stating claims or estimates based on available data;  
  4. to reject bribery in all its forms;  
  5. to improve the understanding of technology, its appropriate application, and potential consequences;
  6. to maintain and improve our technical competence and to undertake technological tasks for others only if qualified by training or experience, or after full disclosure of pertinent limitations;  
  7. to seek, accept, and offer honest criticism of technical work, to acknowledge and correct errors, and to credit properly the contributions of others;  
  8. to treat fairly all persons and to not engage in acts of discrimination based on race, religion, gender, disability, age, national origin, sexual orientation, gender identity, or gender expression;
  9. to avoid injuring others, their property, reputation, or employment by false or malicious action;  
  10. to assist colleagues and co-workers in their professional development and to support them in following this code of ethics.

The pieces I've highlighted are things that I think are especially important to think about, and places where I think we, as a community, might need to work harder.

After this past ACL, I spent some time combing through the Codes of Ethics mentioned before and tried to synthesize a list that would make sense for the ACL, IMLS or NIPS. This is in very "drafty" form, but hopefully the content makes sense. Also to be 100% clear, all of this is basically copy and paste with minor edits from one or more of the Codes linked above; nothing here is original.

  1. Responsibility to the Public
    1. Make research available to general public
    2. Be honest and realistic in stating claims; ensure empirical bases and limitations are communicated appropriately
    3. Only accept work, and make statements, on topics in which you believe you have competence
    4. Contribute to society and human well-being, and minimize negative consequences of computing systems
    5. Make reasonable effort to prevent misinterpretation of results
    6. Make decisions consistent with safety, health & welfare of public
    7. Improve understanding of technology, its application and its potential consequences (positive and negative)
  2. Responsibility in Research
    1. Protect the personal identification of research subjects, and abide by informed consent
    2. Conduct research honestly, avoiding plagiarism and fabrication of results
    3. Cite prior work as appropriate
    4. Preserve original data and documentation, and make available
    5. Follow through on promises made in grant proposals and acknowledge support of sponsors
    6. Avoid real or perceived COIs, disclose when they exist; reject bribery
    7. Honor property rights, including copyrights and patents
    8. Seek, accept and offer honest criticism of technical work; correct errors; provide appropriate professional review
  3. Responsibility to Students, Colleagues, and other Researchers
    1. Recognize and properly attribute contributions of students; promote student contributions to research
    2. No discrimination based on gender identity, gender expression, disability, marital status, race/ethnicity, class, politics, religion, national origin, sexual orientation, age, etc. (should this go elsewhere?)
    3. Teach students ethical responsibilities
    4. Avoid injuring others, their property, reputation or employment by false or malicious action
    5. Respect the privacy of others and honor confidentiality
    6. Honor contracts, agreements and assigned responsibilities
  4. Compliance with the code
    1. Uphold and promote the principles of this code
    2. Treat violations of this code as inconsistent with membership in this organization

I'd love to see (and help, if wanted) ACL, IMLS and NIPS foundation work on constructing a code of ethics. Our fields are more and more dealing with problems that have real impact on society, and I would like to see us, as a community, come together and express our shared standards.

09 December 2016

Whence your reward function?

I ran a grad seminar in reinforcement learning this past semester, which was a lot of fun and also gave me an opportunity to catch up on some stuff I'd been meaning to learn but hadn't had a chance to, and on old stuff I'd largely forgotten about. It's hard to believe, but my first RL paper was eleven years ago at a NIPS workshop, where Daniel Marcu, John Langford and I had a first paper on reducing structured prediction to reinforcement learning, essentially by running Conservative Policy Iteration. (This work eventually became Searn.) Most of my own work in the RL space has focused on imitation learning/learning from demonstrations, but my students and I have recently been pushing more into straight-up reinforcement learning algorithms, applications, and explanations (also see Tao Lei's awesome work in a similar explanations vein, and Tim Vieira's really nice TACL paper too).

Reinforcement learning has undergone a bit of a renaissance recently, largely due to the efficacy of its combination with good function approximation via deep neural networks. Arguably, this advance has been due even more to the increased availability of, and interest in, "interesting" simulated environments, mostly video games and typified by the Atari game collection. In much the same way that ImageNet made neural networks really work for computer vision (by being large, and by capitalizing on the existence of GPUs), I think it's fair to say that these simulated environments have provided the same large-data setting for RL, which can likewise be combined with GPU power to build impressive solutions to many games.

In a real sense, many parts of the RL community are going all-in on the notion that learning to play games is a path toward broader AI. The usual refrain that I hear arguing against that approach is based on the quantity of data. The argument is roughly: if you actually want to build a robot that acts in the real world, you're not going to be able to simulate 10 million frames (from the Deepmind paper, which is just under 8 days of real time experience).

I think this is an issue, but I actually don't think it's the most substantial issue. I think the most substantial issue is the fact that game playing is a simulated environment and the reward function is generally crafted to make humans find the games fun, which usually means frequent small rewards that point you in the right direction. This is exactly where RL works well, and something that I'm not sure is a reasonable assumption in the real world.

Delayed reward is one of the hardest issues in RL, because (a) it means you have to do a lot of exploration and (b) you have a significant credit assignment problem. For instance, if you imagine a variant of (pick your favorite video game) where you only get a +1/-1 reward at the end of the game that says whether you won or lost, it becomes much much harder to learn, even if you play 10 million frames or 10 billion frames.

That's all to say: games are really nice settings for RL because there's a very well defined reward function and you typically get that reward very frequently. Neither of these things is going to be true in the real world, regardless of how much data you have.

At the end of the day, playing video games, while impressive, is really not that different from doing classification on synthetic data. Somehow it's better because the people doing the research are not those who invented the synthetic data, but games---even recent games that you might play on (insert whatever the current popular gaming system is)---are still heavily designed: built in such a way that they are fun for their human players, which typically means increasing difficulty/complexity and a relatively regular reward function.

As we move toward systems that we expect to work in the real world (even if that is not embodied---I don't necessarily mean the difficulty of physical robots), it's less and less clear where the reward function comes from.

One option is to design a reward function. For complex behavior, I don't think we have any idea how to do this. There is the joke example in the R+N AI textbook where you give a vacuum cleaner a reward function for number of pieces of gunk picked up; the vacuum learns to pick up gunk, then drop it, then pick it up again, ad infinitum. It's a silly example, but I don't think we have much of an understanding of how to design reward functions for truly complex behaviors without significant risk of "unintended consequences." (To point a finger toward myself, we invented a reward function for simultaneous interpretation called Latency-Bleu a while ago, and six months later we realized there's a very simple way to game this metric. I was then disappointed that the models never learned that exploit.)
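The vacuum example is easy to work out with numbers (mine, not the textbook's): under a discounted reward of +1 per piece of gunk picked up, cycling pick-up/drop on a single piece strictly beats honestly cleaning the room and stopping.

```python
# Discounted return of two vacuum policies under the naive reward
# "+1 per piece picked up" (toy numbers for illustration).
gamma = 0.99

def honest_return(n_pieces):
    # pick up each piece once, one per time step, then stop
    return sum(gamma ** t for t in range(n_pieces))

def gaming_return(horizon):
    # pick up / drop / pick up the same piece: +1 on every other step
    return sum(gamma ** t for t in range(0, horizon, 2))

honest = honest_return(3)      # a room with 3 pieces of gunk: ~2.97
gaming = gaming_return(1000)   # cycle for a long time: ~50
```

The reward function is "correct" in the sense that picking up gunk really is rewarded; the problem is that it rewards the event rather than the state of the room.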

This is one reason I've spent most of my RL effort on imitation learning (IL)-like settings, typically ones where you can simulate an oracle. I've rarely seen an NLP problem that's been solved with RL where I haven't thought that it would have been much better and easier to just do IL. Of course IL has its own issues: it's not a panacea.

One thing I've been thinking about a lot recently is forms of implicit feedback. One cool paper in this area, which I learned about when I visited GATech a few weeks ago, is Learning from Explanations using Sentiment and Advice in RL by Samantha Krening and colleagues. In this work they basically have a coach sitting on the side of an RL algorithm giving it advice, and they use that advice to shape things that I think of as more immediate reward. I generally think of this kind of feedback like a baby does. There's some built-in reward signal (it can't be turtles all the way down), but what we think of as a reward signal (like a friend saying "I really don't like that you did that") only turns into true reward through a learned model that tells me that that's negative feedback. I'd love to see more work in the area of trying to figure out how to transform sparse and imperfect "true" reward signals into something that we can actually learn to optimize.

28 November 2016

Workshops and mini-conferences

I've attended and organized two types of workshops in my time, one of which I'll call the ACL-style workshop (or "mini-conference"), the other of which I'll call the NIPS-style workshop (or "actual workshop"). Of course this is a continuum, and some workshops at NIPS are ACL-style and vice versa. As I've already given away with phrasing, I much prefer the NIPS style. Since many NLPers may never have been to NIPS or to a NIPS workshop, I'm going to try to describe the differences and explain my preference (and also highlight some difficulties).

(Note: when I say ACL-style workshop, I'm not thinking of things like WMT that have effectively evolved into full-on co-located conferences.)

To me, the key difference is whether the workshop is structured around invited talks, panels, discussion (NIPS-style) or around contributed, reviewed submissions (ACL-style).

For example, Paul Mineiro, Amanda Stent, Jason Weston and I are organizing a workshop at NIPS this year on ML for dialogue systems, Let's Discuss. We have seven amazing invited speakers: Marco Baroni, Antoine Bordes, Nina Dethlefs, Raquel Fernández, Milica Gasic, Helen Hastie and Jason Williams. If you look at our schedule, we have allocated 280 minutes to invited talks, 60 minutes to panel discussion, and 80 minutes to contributed papers. This is a quintessential NIPS-style workshop.

For contrast, a standard ACL-style workshop might have one or two invited talks with the majority of the time spent on contributed (submitted/reviewed) papers.

The difference in structure between NIPS-style and ACL-style workshops has some consequences:

  • Reviewing in the NIPS-style tends to be very light, often without a PC, and often just by the organizers.
  • NIPS-style contributed workshop papers tend to be shorter.
  • NIPS-style workshop papers are almost always non-archival.
My personal experience is that NIPS-style workshops are a lot more fun. Contributed papers at ACL-style workshops are often those that might not cut it at the main conference. (Note: this isn't always the case, but it's common. It's also less often the case at workshops that represent topics that are not well represented at the main conference.) On the other hand, when you have seven invited speakers who are all experts in their field and were chosen by hand to represent a diversity of ideas, you get a much more intellectually invigorating experience.

(Side note: my experience is that many many NIPS attendees only attend workshops and skip the main conference; I've rarely heard of this happening at ACL. Yes, I could go get the statistics, but they'd be incomparable anyway because of time of year.)

There are a few other effects that matter.

The first is the archival-ness of ACL workshops, which have proceedings that appear in the anthology (the rules I have in mind are EACL's, but they're the same across the board). I personally believe it's absurd that workshop papers are considered archival but papers on arxiv are not. By forcing workshop papers to be archival, you run the significant risk of guaranteeing that many submissions are things that authors have given up on getting into the main conference, which can lead to a weaker program.

A second issue has to do with reviewing. Unfortunately, as of about three years ago, the ACL organizing committee all but guaranteed that ACL workshops have to be ACL-style and not NIPS-style (personally, I believe this is massive bureaucratic overreach and micromanaging): by requiring a program committee and reviewing, we're largely locked into the ACL-style workshop. Of course, some workshops ignore this and do more of a NIPS-style anyway, but IMO this should never have been a rule.

One tricky issue with NIPS-style workshops is that, as I understand it, some students (and perhaps faculty/researchers) might be unable to secure travel funding to present at a non-archival workshop. I honestly have little idea how widespread this factor is, but if it's a big deal (e.g., perhaps in certain parts of the world) then it needs to be factored in as a cost.

A second concern I have about NIPS-style workshops is making sure that they're inclusive. A significant failure mode is that of "I'll just invite my friends." In order to prevent this outcome, the workshop organizers have to work hard to find invited speakers who are not simply drawn from their narrow social networks. Having a broader set of workshop organizers can help. I think that when NIPS-style workshops are proposed, they should be required to list potential invited speakers (even if these people have not yet been contacted), and a significant part of the review process should be to make sure that these lists represent a diversity of ideas and a diversity of backgrounds. In the best case, this can lead to a more inclusive program than ACL-style workshops (where basically you get whatever you get as submissions), but in the worst case it can be pretty horrible. There have been plenty of examples of the pretty-horrible case at NIPS in the past few years.

At any rate, these aren't easy choices, but my preference is strongly for the NIPS-style workshop. At the very least, I don't think that ACL should predetermine which type is allowed at its conferences.

08 November 2016

Bias in ML, and Teaching AI

Yesterday I gave a super duper high level 12-minute presentation about some issues of bias in AI. I should emphasize (in case it's not clear) that this is something I am not an expert in; most of what I know comes from reading great papers by other people (there is a completely non-academic sample at the end of this post). This blog post is a variant of that presentation.

Structure: most of the images below are prompts for talking points, which are generally written below the corresponding image. I think I managed to link all the images to the original source (let me know if I missed one!).

Automated Decision Making is Part of Our Lives

To me, AI is largely the study of automated decision making, and the investment therein has been growing at a dramatic rate.

I'm currently teaching undergraduate artificial intelligence. The last time I taught this class was in 2012. The amount that's changed since then is incredible. Automated decision making is now a part of basically everyone's life, and will only be more so over time. The investment is in the billions of dollars per year.

Things Can Go Really Badly

If you've been paying attention to headlines even just over the past year, the number of high stakes settings in which automated decisions are being made is growing, and it's growing into areas that dramatically affect real people's real lives, their well being, their safety, and their rights.

This is obviously just a sample of some of the higher profile work in this area, and while all of this is work in progress, even if there's no impact today (hard for me to believe), it's hard to imagine that this isn't going to be a major societal issue in the very near future.

Three (out of many) Sources of Bias

For the remainder, I want to focus on three specific ways that bias creeps in. I'll spend the most time on the first, because we understand it best and because it's closely related to work that I've done over the past ten years or so, albeit in a different setting. These three are:
  1. data collection
  2. objective function
  3. feedback loops

Sample Selection Bias

The standard way that machine learning works is to take some samples from a population you care about and run them through a machine learning algorithm to produce a predictor.

The magic of statistics is that if you then take new samples from that same population, then, with high probability, the predictor will do a good job. This is true for basically all models of machine learning.

The problem arises when your samples come from a subpopulation (or a different population) than the one on which you're going to apply your predictor.

Both of my parents work in marketing research and have spent a lot of their respective careers doing focus groups and surveys. A few years ago, my dad had a project working for a European company that made skin care products. They wanted to break into the US market, and hired him to conduct studies of what the US population is looking for in skin care. He told them that he would need to conduct four or five different studies to do this, which they gawked at. They wanted one study, perhaps in the midwest (Cleveland or Chicago). The problem is that skin care needs are very different in the southwest (moisturizer matters) and the northwest (not so much), versus the northeast and southeast. Doing one study in Chicago and hoping it would generalize to Arizona and Georgia is unrealistic.

This problem is often known as sample selection bias in the statistics community. It also has other names, like covariate shift and domain adaptation depending on who you talk to.
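A core idea running through this literature, from Heckman onward, is importance weighting: if you can estimate the density ratio between the test and training populations, you can reweight training samples to estimate test-population quantities. Here's a minimal sketch with toy distributions of my own choosing (train on N(0,1), deploy on N(2,1)), where the density ratio works out to exp(2x - 2):

```python
import math
import random

rng = random.Random(0)

def ratio(x):
    # density ratio p_test(x) / p_train(x) for N(2,1) over N(0,1)
    return math.exp(2.0 * x - 2.0)

f = lambda x: 1.0 if x > 1.0 else 0.0   # some prediction we care about

# samples come from the *training* population only
xs = [rng.gauss(0.0, 1.0) for _ in range(100000)]

# naive estimate: just averages over the training population (~0.16)
unweighted = sum(f(x) for x in xs) / len(xs)

# self-normalized importance weighting recovers the test-population
# average (~0.84) without ever sampling from the test population
w = [ratio(x) for x in xs]
weighted = sum(wi * f(x) for wi, x in zip(w, xs)) / sum(w)
```

The catch, which much of the machine learning work wrestles with, is that in practice the density ratio must itself be estimated, and the weights can have very high variance when the two populations barely overlap.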

One of the most influential pieces of work in this area is the 1979 Econometrica paper by James Heckman, for which he won the 2000 Nobel Prize in economics. He's pretty happy about that! If you haven't read this paper, you should: it's only about 7 pages long, it's not that difficult, and you won't believe the footnote in the last section. (Sorry for the clickbait, but you really should read the paper.)

There's been a ton of work in machine learning land over the past twenty years, much of which builds on Heckman's original work. To highlight one specific paper: Corinna Cortes is the Head of Google Research New York and has had a number of excellent papers on this topic over the past ten years. One in particular is her 2013 paper in Theoretical Computer Science (with Mohri) which provides an amazingly in depth overview and new algorithms. Also a highly recommended read.

It's Not Just that Error Rate Goes Up

When you move from one sample space (like the southwest) to another (like the northeast), you should first expect error rates to go up.

Because I wanted to run some experiments for this talk, here are some simple adaptation numbers for predicting sentiment on Amazon reviews (data due to Mark Dredze and colleagues). Here we have four domains (books, DVDs, electronics and kitchen appliances), which you should think of as stand-ins for the different regions of the US, or for different demographic qualifiers.

The figure shows error rates when you train on one domain (columns) and test on another (rows). The error rates are normalized so that we have ones on the diagonal (actual error rates are about 10%). The off-diagonal shows how much additional error you suffer due to sample selection bias. In particular, if you're making predictions about kitchen appliances and don't train on kitchen appliances, your error rate can be more than two times what it would have been.

But that's not all.

These data sets are balanced: 50% positive, 50% negative. If you train on electronics and make predictions on other domains, however, you get different false positive/false negative rates. This figure shows the fraction of test items predicted positive; you should expect it to be 50%, which is basically what happens in electronics and DVDs. However, if you predict on books, you underpredict positives, while if you predict on kitchen, you overpredict positives.

So not only do the error rates go up, but the way the errors are exhibited changes, too. This is closely related to issues of disparate impact, which have been studied recently by many people, for instance by Feldman, Friedler, Moeller, Scheidegger and Venkatasubramanian.
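The connection to disparate impact is easy to illustrate with synthetic scores (these numbers are made up, not the actual review data): apply a single global decision threshold to two groups whose score distributions are shifted, and the predicted-positive rates diverge past the classic 80% mark.

```python
import random

rng = random.Random(0)

# classifier scores for two groups; group B's scores are shifted down,
# as if the model were fit to group A's data (synthetic numbers)
scores_a = [rng.gauss(0.2, 1.0) for _ in range(20000)]
scores_b = [rng.gauss(-0.4, 1.0) for _ in range(20000)]

positive_rate = lambda scores: sum(s > 0 for s in scores) / len(scores)
rate_a = positive_rate(scores_a)   # roughly 0.58
rate_b = positive_rate(scores_b)   # roughly 0.35

# disparate impact compares this ratio to 0.8 (the "80% rule")
di_ratio = rate_b / rate_a
```

Nothing in this sketch is adversarial: the same threshold, honestly applied, produces the disparity purely because the score distributions differ across groups.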

What Are We Optimizing For

One thing I've been trying to get undergrads in my AI class to think about is what are we optimizing for, and whether the thing that's being optimized for is what is best for us.

One of the first things you learn in a data structures class is how to do graph search, using simple techniques like breadth first search. In intro AI, you often learn more complex things like A* search. A standard motivating example is how to find routes on a map, like the planning shown above for me to drive from home to work (which I never do because I don't have a car and it's slower than metro anyway!).

We spend a lot of time proving optimality of algorithms in terms of shortest path costs, for fixed costs that have been given to us by who-knows-where. I challenged my AI class to come up with features that one might use to construct these costs. They started with relatively obvious things: length of that segment of road, wait time at lights, average speed along that road, whether the road is one-way, etc. After more pushing, they came up with other ideas, like how much gas mileage one gets on that road (either to save the environment or to save money), whether the area is “dangerous” (which itself is fraught with bias), what is the quality of the road (ill-repaired, new, etc.).

You can tell that my students are all quite well behaved. I then asked them to be evil. Suppose you were an evil company: how might you come up with path costs? Then you get things like: maybe businesses have paid me to route more customers past their stores. Maybe if you're driving the brand of car that my company owns or has invested in, I route you along better (or worse) roads. Maybe I route you so as to avoid billboards from competitors.

The point is: we don't know, and there's no strong reason to a priori assume that what the underlying system is optimizing for is my personal best interest. (I should note that I'm definitely not saying that Google or any other company is doing any of these things: just that we should not assume that they're not.)

A more nuanced example is that of a dating application for, e.g., multi-colored robots. You can think of the color as representing any sort of demographic information you like: political leaning (as suggested by the red/blue choice here), sexual orientation, gender, race, religion, etc. For simplicity, let's assume there are way more blue robots than others, and let's assume that robots are at least somewhat homophilous: they tend to associate with other similar robots.

If my objective function is something like “maximize number of swipe rights,” then I'm going to want to disproportionately show blue robots because, on average, this is going to increase my objective function. This is especially true when I'm predicting complex behaviors like robot attraction and love, and I don't have nearly enough features to do anywhere near a perfect matching. Because red robots, and robots of other colors, are more rare in my data, my bottom line is not affected greatly by whether I do a good job making predictions for them or not.
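A tiny simulation (with made-up numbers) makes this concrete: when one global decision rule is tuned to overall accuracy and the groups have different preferences, the rule converges to what works for the majority group and quietly does worse on the minority.

```python
# Sketch: a single threshold tuned to *overall* accuracy serves the
# majority group and underserves the minority. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)

n_blue, n_red = 9000, 1000            # blue robots dominate the data
x_blue = rng.uniform(0, 1, n_blue)    # some compatibility feature
x_red = rng.uniform(0, 1, n_red)
y_blue = (x_blue > 0.6).astype(int)   # blue robots "swipe right" above 0.6
y_red = (x_red > 0.3).astype(int)     # red robots have a different cutoff

x = np.concatenate([x_blue, x_red])
y = np.concatenate([y_blue, y_red])

# Pick the single threshold that maximizes overall accuracy.
thresholds = np.linspace(0, 1, 101)
accs = [((x > t).astype(int) == y).mean() for t in thresholds]
t_star = thresholds[int(np.argmax(accs))]

acc_blue = ((x_blue > t_star).astype(int) == y_blue).mean()
acc_red = ((x_red > t_star).astype(int) == y_red).mean()
# t_star lands near 0.6 (the blue cutoff): near-perfect for blue robots,
# but red robots with x between 0.3 and 0.6 are systematically misclassified.
```

Nothing here is malicious: the objective simply never had a reason to care about the red robots, because getting them wrong barely moves the overall number.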

I highly recommend reading Version Control, a recent novel by Dexter Palmer. I especially recommend it if you have, or will, teach AI. It's fantastic.
There is an interesting vignette that Palmer describes (don't worry, no plot spoilers) in which a couple of engineers build a dating service, like Crrrcuit, but for people. In this thought exercise, the system's objective function is to help people find true love, and they are wildly successful. They get investors. The investors realize that when their product succeeds, they lose business. This leads to a more nuanced objective in which you want to match most people (to maintain trust), but not perfectly (to maintain clientèle). But then, to make money, the company starts selling its data to advertisers. And different individuals' data may be more valuable: in particular, advertisers might be willing to pay a lot for data from members of underrepresented groups. This provides incentive to actively do a worse job than usual on such clients. In the book, this thought exercise proceeds by human reasoning, but it's pretty easy to see that if one set up, say, a reinforcement learning algorithm for predicting matches that had long-term company profit as its objective function, it could learn something similar and we'd have no idea that that's what the system was doing.

Feedback Loops

Ravi Shroff recently visited the CLIP lab and talked about his work (with Justin Rao and Sharad Goel) related to stop-and-frisk policies in New York. The setup here is that the "stop and frisk" rule (in 2011, over 685k people were stopped; this has subsequently been declared unconstitutional in New York) gave police officers the right to stop people with much lower thresholds than probable cause, to try to find contraband weapons or drugs. Shroff and colleagues focused on weapons.

They considered the following model: a police officer sees someone behaving strangely, and decides that they want to stop and frisk that person. Before doing so, the officer enters a few values into their computer, and the computer either gives a thumbs up (go ahead and stop) or a thumbs down (let them live their life). One question was: can we cut down on the number of stops (good for individuals) while still finding most contraband weapons (good for society)?

In their results, we can see that if the system thumbs-downed 90% of stops (so only 10% of the people police would have stopped actually get stopped), it still recovers about 50% of the weapons. By stopping only about a third of individuals, it recovers 75% of weapons. This is a massive reduction in privacy violations while still successfully keeping the majority of weapons off the streets.

(Side note: you might worry about sample selection bias here, because the models are trained on people the police actually stopped. Shroff and colleagues get around this by the assumption I stated before: the model is only run on people whom police have already decided are suspicious and would have stopped and frisked anyway.)
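The underlying computation is a standard precision-style tradeoff curve, which can be sketched as follows (scores and labels below are synthetic, not Shroff et al.'s data or model): rank candidate stops by a model's predicted probability of carrying a weapon, then ask what fraction of weapons the top-scoring stops would recover.

```python
# Sketch of the stop-rate vs. weapons-recovered tradeoff. Rank candidate
# stops by a model score, then compute cumulative weapon recall.
# Scores and labels are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
has_weapon = rng.random(n) < 0.03          # weapons are rare
# A noisy model: carriers score higher on average, with lots of overlap.
score = rng.normal(loc=has_weapon.astype(float), scale=1.0)

order = np.argsort(-score)                  # most "suspicious" first
recovered = np.cumsum(has_weapon[order]) / has_weapon.sum()
stop_frac = np.arange(1, n + 1) / n

# Fraction of weapons recovered if we only make the top 10% of stops:
recall_at_10pct = recovered[int(0.10 * n) - 1]
```

Even this crude synthetic model beats random stopping (which would recover only 10% of weapons at a 10% stop rate); the real models, with real features, do much better still.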

The question is: what happens if and when such a system is deployed in practice?

The issue is that police officers, like humans in general, are not stationary entities. Their behavior changes over time, and it's reasonable to assume that their behavior would change when they get this new system. They might feed more people into the system (in "hopes" of thumbs up) or feed fewer people into the system (having learned that the system is going to thumbs-down them anyway). This is similar to how the sorts of queries people issue against web search engines change over time, partially because we learn to use the systems more effectively, and learn not to ask a search engine to do things for us that we know will fail.

Now, once we've (hypothetically) deployed this system, it's collecting its own data, which is going to be fundamentally different from the data it was originally trained on. It can continually adapt, but we need good technology for doing this that takes into account the changing behavior of the officers.

Wrap Up and Discussion

There are many things that I didn't touch on above that I think are nonetheless really important. Some examples:
  1. All the example “failure” cases I showed above have to do with race or (binary) gender. There are other things to consider, like sexual orientation, religion, political views, disabilities, family and child status, first language, etc. I tried and failed to find examples of such things, and would appreciate pointers. For instance, I can easily imagine that speech recognition error rates skyrocket for users with speech impairments, or with non-standard accents, or who speak a dialect of English that is not the “status quo academic English.” I can also imagine that visual tracking of people might fail badly on people with motor impairments or who use a wheelchair.
  2. I am particularly concerned about less “visible” issues because we might not even know. The standard example here is: could a social media platform sway an election by reminding people who (it believes) belong to a particular political party to vote? How would we even know?
  3. We need to start thinking about qualifying our research better with respect to the populations we expect it to work on. When we pick a problem to work on, who is being served? When we pick a dataset to work on, who is being left out? A silly example is the curation of older datasets for object detection in computer vision, which (I understand) decided on which objects to focus on by asking five year old relatives of the researchers constructing the datasets to name all the objects they could see. As a result of socio-economic status (among other things), mouse means the thing that attaches to your computer, not the cute furry animal. More generally, when we say we've “solved” task X, does this really mean task X or does this mean task X for some specific population that we haven't even thought to identify (i.e., “people like me” aka the white guys problem)? And does “getting more data” really solve the problem---is more data always good data?
  4. I'm at least as concerned with machine-in-the-loop decision making as fully automated decision making. Just because a human makes the final decision doesn't mean that the system cannot bias that human. For complex decisions, a system (think even just web search!) has to provide you with information that helps you decide, but what guarantees do we have that that information isn't going to be biased, either unintentionally or even intentionally? (I've also heard that, e.g., in predicting recidivism, machine-in-the-loop predictions are worse than fully automated decisions, presumably because of some human bias we don't understand.)
If you've read this far, I hope you've found some things to think about. If you want more to read, here are some people whose work I like, who tweet about these topics, and for whom you can citation chase to find other cool work. It's a highly biased list.
  • Joanna Bryson (@j2bryson), who has been doing great work in ethics/AI for a long time and whose work on bias in language has given me tons of food for thought.
  • Kate Crawford (@katecrawford) studies the intersection between society and data, and has written excellent pieces on fairness.
  • Nick Diakopoulos (@ndiakopoulos), a colleague here at UMD, studies computational journalism and algorithmic transparency.
  • Sorelle Friedler (@kdphd), a former PhD student here at UMD!, has done some of the initial work on learning without disparate impact.
  • Suresh Venkatasubramanian (@geomblog) has co-authored many of the papers with Friedler, including work on lower bounds and impossibility results for fairness.
  • Hanna Wallach (@hannawallach) is the first name I think of for machine learning and computational social science, and has recently been working in the area of fairness.
I'll also point to less biased sources. The Fairness, Accountability and Transparency in Machine Learning workshop takes place in New York City in a week and a half; check out the speakers and papers there. I also highly recommend the very long reading list on Critical Algorithm Studies, which covers more than just machine learning.