On Ethics and Algorithms

Photo by Franck V. on Unsplash

An article on the front page of the Observer, Revealed: how drugs giants can access your health records, caught my eye this week. In summary, the article highlights that the Department of Health and Social Care (DHSC) has been selling the medical data of NHS patients to international drugs companies and has “misled” the public that the information contained in the records would be “anonymous”.

The data in question is collated from GP surgeries and hospitals and, according to “senior NHS figures”, can “routinely be linked back to individual patients’ medical records via their GP surgeries.” Apparently there is “clear evidence” that companies have identified individuals whose medical histories are of “particular interest.” The DHSC has replied by saying it only sells information after “thorough measures” have been taken to ensure patient anonymity.

As with many articles like this, it is frustrating when some of the more technical aspects are not fully explained. Whilst I understand the importance of keeping the general readership on board and not frightening them too much with the intricacies of statistics or cryptography, it would be nice to know a bit more about how these records are being made anonymous.

There is a hint of this in the Observer report when it states that the CPRD (the Clinical Practice Research Datalink) had said the data made available for research was “anonymous” but, following the Observer’s story, changed the wording to say that the data from GPs and hospitals had been “anonymised”. This is a crucial difference. One of the more common methods of ‘anonymisation’ is to obscure or redact some bits of information. So, for example, a record could have patient names removed and ages and postcodes “coarsened”, that is, only the first part of a postcode is included (e.g. SW1A rather than SW1A 2AA) and ages are placed in a range rather than using someone’s actual age (e.g. 60-70 rather than 63).
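As a rough illustration, here is a minimal sketch (in Python, with entirely invented field names and records, not the actual CPRD schema) of this style of anonymisation: names are dropped, postcodes are truncated to their outward code, and ages are bucketed into ten-year bands.

```python
# A minimal, hypothetical sketch of record "anonymisation" by redaction and coarsening.
# Field names and records are invented for illustration only.

def coarsen(record):
    """Drop direct identifiers and coarsen quasi-identifiers."""
    decade = (record["age"] // 10) * 10
    return {
        # the name is removed entirely
        "postcode": record["postcode"].split()[0],   # "SW1A 2AA" -> "SW1A"
        "age_band": f"{decade}-{decade + 9}",        # 63 -> "60-69"
        "diagnosis": record["diagnosis"],            # the useful clinical payload stays
    }

patients = [
    {"name": "A. Smith", "postcode": "SW1A 2AA", "age": 63, "diagnosis": "type 2 diabetes"},
    {"name": "B. Jones", "postcode": "B15 2TT",  "age": 47, "diagnosis": "asthma"},
]

for p in patients:
    print(coarsen(p))
# {'postcode': 'SW1A', 'age_band': '60-69', 'diagnosis': 'type 2 diabetes'}
# {'postcode': 'B15', 'age_band': '40-49', 'diagnosis': 'asthma'}
```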

The problem with anonymising data records is that they are prone to what is referred to as data re-identification or de-anonymisation. This is the practice of matching anonymous data with publicly available information in order to discover the individual to whom the data belongs. One of the more famous examples of this is the competition Netflix organised to encourage people to improve its recommendation system, offering a $1m prize for a 10% improvement (with $50,000 annual progress prizes along the way). The Netflix Prize was started in 2006, and a planned follow-up competition was abandoned in 2010 in response to a lawsuit and Federal Trade Commission privacy concerns. Although the dataset released by Netflix to allow competition entrants to test their algorithms had supposedly been anonymised (i.e. user names were replaced with a meaningless ID and no gender or zip code information was included), a PhD student from the University of Texas was able to find out the real names of people in the supplied dataset by cross-referencing it with Internet Movie Database (IMDb) ratings, which people post publicly using their real names.
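The cross-referencing itself needs nothing clever. The toy sketch below (hypothetical data and field names, not the actual Netflix or IMDb schemas) shows the basic idea: if an “anonymised” record and a public record share enough quasi-identifiers, such as which obscure titles someone rated and roughly when, the two can be joined and the name recovered.

```python
# Toy illustration of a linkage (re-identification) attack.
# Both datasets are invented; real attacks use far larger, messier data.

anonymised = [  # released with user names replaced by opaque IDs
    {"id": "u1", "ratings": {("Obscure Film A", "2005-03-01"), ("Obscure Film B", "2005-03-04")}},
    {"id": "u2", "ratings": {("Blockbuster C", "2005-05-10")}},
]

public = [  # e.g. reviews posted publicly under real names
    {"name": "Alice Example", "ratings": {("Obscure Film A", "2005-03-01"), ("Obscure Film B", "2005-03-04")}},
]

for anon in anonymised:
    for pub in public:
        overlap = anon["ratings"] & pub["ratings"]
        if len(overlap) >= 2:  # enough matching quasi-identifiers to link the records
            print(f"{anon['id']} is probably {pub['name']} (matched on {len(overlap)} ratings)")
```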

Herein lies the problem with the anonymisation of datasets. As Michael Kearns and Aaron Roth highlight in their recent book The Ethical Algorithm, when an organisation releases anonymised data it can try to make an intelligent guess as to which bits of the dataset to anonymise, but it is difficult (probably impossible) to anticipate what other data sources either already exist or could be made available in the future that could be used to correlate records. This is the reason the computer scientist Cynthia Dwork has said “anonymised data isn’t” – meaning either it isn’t really anonymous or so much of the dataset has had to be removed that it is no longer data (at least in any useful way).

So what to do? Is it actually possible to release information about a dataset out into the wild with any degree of confidence that it can never be de-anonymised? Thankfully something called differential privacy, invented by the aforementioned Cynthia Dwork and colleagues, gives us a principled way to do just that. Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset whilst withholding information about the individuals in it.
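For the more mathematically inclined (and as an aside not in the Observer piece), the standard definition can be stated in one line. A randomised algorithm M is ε-differentially private if, for every pair of datasets D and D′ that differ in a single person’s record, and every set S of possible outputs:

Pr[M(D) ∈ S] ≤ exp(ε) × Pr[M(D′) ∈ S]

Informally, whatever is published would have been almost exactly as likely whether or not your record was in the dataset, so the output can reveal very little about you specifically; the smaller ε is, the stronger the privacy guarantee.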

To understand how differential privacy works, consider this example*. Suppose we want to conduct a poll of people in London to find out how many have driven after taking non-prescription drugs. One way of doing this is to randomly sample a suitable number of Londoners, asking them if they have ever driven whilst under the influence of drugs. The data collected could be entered into a spreadsheet and various statistics, e.g. number of men, number of women, maybe ages etc., derived. The problem is that, whilst collecting this information, lots of compromising personal details may be gathered which, if the data were stolen or leaked, could be used against the respondents.

In order to avoid this problem consider the following alternative. Instead of asking people the question directly, first ask them to flip a coin but not to tell us how it landed. If the coin comes up heads they tell us (honestly) whether they have driven under the influence. If it comes up tails, however, they give us a random answer instead: they flip the coin again and tell us “yes” if it comes up heads or “no” if it comes up tails. This polling protocol is a simple randomised algorithm and a basic form of differential privacy. So how does this work?

If your answer is no, the randomised response answers no two out of three times. It answers no only one out of three times if your answer is yes. Diagram courtesy Michael Kearns and Aaron Roth, The Ethical Algorithm 2020

When we ask people if they have driven under the influence using this protocol, half the time (i.e. when the coin lands heads up) the protocol tells them to tell the truth. When the protocol tells them to respond with a random answer (i.e. when the coin lands tails up), half of that time they just happen to randomly give us the right answer. So they give us the right answer 1/2 + (1/2 x 1/2), or three-quarters, of the time. The remaining one quarter of the time they tell us a lie, and there is no way of telling true answers from lies. Surely, though, this injection of randomisation completely masks the true results and the data is now highly error prone? Actually, it turns out, this is not the case.

Because we know how this randomisation is introduced we can reverse engineer the answers we get, remove the noise and obtain an approximation of the right answer. Here’s how. Suppose one-third of people in London have actually driven under the influence of drugs. Of that one-third, whose truthful answer is “yes”, three-quarters will answer “yes” under the protocol, that is 1/3 x 3/4 = 1/4. Of the two-thirds whose truthful answer is “no”, one-quarter will report “yes”, that is 2/3 x 1/4 = 1/6. So we expect 1/4 + 1/6 = 5/12 of the population to answer “yes”. Knowing this relationship we can invert it: if a fraction q of people answer “yes”, the true rate is 2q - 1/2, and for q = 5/12 that gives back exactly 1/3.
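To see that this really does recover the underlying rate, here is a small simulation of the coin-flipping protocol and the de-biasing step. It is a sketch only: the one-third figure and the sample size are arbitrary assumptions made for illustration.

```python
import random

def randomized_response(truth: bool) -> bool:
    """One respondent follows the coin protocol: heads -> answer honestly,
    tails -> answer with a second, independent coin flip."""
    if random.random() < 0.5:        # first coin: heads
        return truth
    return random.random() < 0.5     # first coin: tails -> random yes/no

def estimate_true_rate(responses):
    """Invert the protocol: the observed 'yes' rate q satisfies q = p/2 + 1/4,
    so the true rate p is estimated as 2q - 1/2."""
    q = sum(responses) / len(responses)
    return 2 * q - 0.5

random.seed(0)
true_rate = 1 / 3                    # assumed ground truth for the simulation
population = [random.random() < true_rate for _ in range(100_000)]
responses = [randomized_response(t) for t in population]

print(f"observed 'yes' rate: {sum(responses) / len(responses):.3f}")  # ~0.417, i.e. ~5/12
print(f"estimated true rate: {estimate_true_rate(responses):.3f}")    # ~0.333
```

No individual response can be taken at face value, yet the aggregate estimate comes out very close to the true one-third.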

So what is the point of doing the survey like this? Simply put, it allows the true answer to be hidden behind the protocol. If the data were leaked and an individual in it was identified as being suspected of driving under the influence, they could always argue they were told to say “yes” because of the way the coins fell.

In the real world a number of organisations, including the US Census Bureau, Apple, Google and Privitar (with its Lens product), use differential privacy to limit the disclosure of private information about individuals whose data appears in published datasets.

It would be nice to think that the NHS data supposedly being used by US drugs companies is protected by some form of differential privacy. If it were, and if this could be explained to the public in a reasonable and rational way, then surely we would all benefit, both in the knowledge that our data is safe and that it is maybe even being put to good use in protecting and improving our health. After all, wasn’t this meant to be the true benefit of living in a connected society where information is shared for the betterment of all our lives?

*Based on an example from Kearns and Roth in The Ethical Algorithm.

The Price of Privacy

So we finally have the official price of privacy. AT&T (one of the largest telecommunications companies in America) have announced that their GigaPower super-fast broadband service can be obtained at a discount if customers “let us use your individual Web browsing information, like the search terms you enter and the web pages you visit, to tailor ads and offers to your interests.” The cost of not letting AT&T do this? $29 a month. And don’t think you can use your browser’s privacy settings to stop AT&T tracking your browsing history or search requests. It looks like they use deep packet inspection to examine the data packets that pass through their network, allowing them to eavesdrop on your data.

So far, so bad, but it gets worse. It is not at all clear what GigaPower subscribers get when they pay their $29 fee to opt out of the snooping service. AT&T says that it “may collect and use web browsing information for other purposes, as described in our Privacy Policy, even if you do not participate in the Internet Preferences program.” In other words, even if you pay your ‘privacy tax’ there is no actual guarantee that AT&T won’t snoop on you anyway!

The even worse thing about this, as Bruce Schneier points out here, is that “privacy becomes a luxury good”, meaning only those who can afford the tax can have their privacy respected, thereby driving even more of a wedge between the digital haves and have-nots.

In many ways, of course, AT&T are at least being transparent: they tell you what they do and give you the option of opting out (whatever that really means) or of not taking their service at all (assuming you don’t live in a part of the country where they have a virtual monopoly). Google, on the other hand, offers a ‘free’ email service on the basis that it scans your emails to display what it considers relevant ads, in the hope that the user is more likely to click on them and generate more advertising revenue. This is a service you cannot opt out of. Maybe it’s time for us Gmail users to switch to services like those offered by Apple, which has a different business model that does not rely on building “a profile based on your email content or web browsing habits to sell to advertisers”. They just make a fortune selling us nice, shiny gadgets.

Why I Became a Facebook Refusenik

I know it’s a new year, and that is generally a time to make resolutions, give things up, do something different with your life and so on, but that is not the reason I have decided to become a Facebook refusenik.

Image Copyright http://www.keepcalmandposters.com

Let’s be clear, I’ve never been a huge Facebook user, amassing hundreds of ‘friends’ and spending half my life on there. I’ve tended to use it to keep in touch with a few family members and ‘real’ friends and also as a means of contacting people with a shared interest in photography. I’ve never found the user experience of Facebook particularly satisfying and indeed have found it completely frustrating at times, especially when posts seem to come and go, seemingly at random. I also hated the ‘feature’ that meant videos started playing as soon as you scrolled them into view. I’m sure there was a way of preventing this but I was never interested enough to figure out how to disable it. I could probably live with these foibles, however, as by and large the benefits outweighed the unsatisfactory aspects of Facebook’s usability.

What has finally persuaded me to deactivate my account (and yes, I know it’s still there just waiting for me to break and log back in again) is the insidious way in which Facebook is creeping into our lives and breaking down all aspects of privacy and even our self-determination. How so?

First off was the news in June 2014 that Facebook had conducted a secret study involving 689,000 users in which friends’ postings were manipulated to influence moods. Various tests were apparently performed. One test altered users’ exposure to their friends’ “positive emotional content” to see how it affected what they posted. The study found that emotions expressed by friends influence our own moods, and it was the first experimental evidence for “massive-scale emotional contagion via social networks”. What’s so terrifying about this is captured in the questions Clay Johnson, the co-founder of Blue State Digital, asked via Twitter: “Could the CIA incite revolution in Sudan by pressuring Facebook to promote discontent? Should that be legal? Could Mark Zuckerberg swing an election by promoting Upworthy (see later) posts two weeks beforehand? Should that be legal?”

As far as we know this was a one-off, which Facebook apologised for, but the mere fact that they thought they could get away with such a tactic is, to say the least, breathtaking in its audacity, and it does not make for an organisation I am comfortable entrusting my data to.

Next was the article by Tom Chatfield called The Attention Economy, in which he discusses the idea that “attention is an inert and finite resource, like oil or gold: a tradable asset that the wise manipulator (i.e. Facebook and the like) auctions off to the highest bidder, or speculates upon to lucrative effect. There has even been talk of the world reaching ‘peak attention’, by analogy to peak oil production, meaning the moment at which there is no more spare attention left to spend.” Even though I didn’t believe Facebook was grabbing too much of my attention, I was starting to become a little concerned that it was often the first site I visited in the morning and that I was even becoming diverted by some of those posts in my newsfeed with titles like “This guy went to collect his mail as usual but you won’t believe what he found in his mailbox”. Research is beginning to show that doing more than one task at a time, especially more than one complex task, takes a toll on productivity and that the mind and brain were not designed for heavy-duty multitasking. As Danny Crichton argues here, “we need to recognize the context that is distracting us, changing what we can change and advocating for what we can hopefully convince others to do.”

The final straw that made me throw in the Facebook towel, however, was reading The Virologist by Andrew Marantz in The New Yorker magazine, about Emerson Spartz, the so-called ‘king of clickbait’. Spartz is twenty-seven and has been successfully launching Web sites for more than half his life. In 1999, when Spartz was twelve, he built MuggleNet, which became the most popular Harry Potter fan site in the world. Spartz’s latest venture is Dose, a photo- and video-aggregation site whose posts are collections of images designed to tell a story. The posts have names like “You May Feel Bad For Laughing At These 24 Accidents…But It’s Too Funny To Look Away”. Dose gets most of its traffic through Facebook. A bored teenager absent-mindedly clicking links will eventually end up on a site like Dose. Spartz’s goal is to make the site so “sticky”—attention-grabbing and easy to navigate—that the teenager will stay for a while. Money is generated through ads – sometimes there are as many as ten on a page – and Spartz hopes to develop traffic-boosting software that he can sell to publishers and advertisers. Here’s the slightly disturbing thing though. Algorithms for analysing users’ behaviour are “baked in” to the sites Spartz builds. When a Dose post is created, it initially appears under as many as two dozen different headlines, distributed at random to different Facebook users. An algorithm measures which headline is attracting clicks most quickly, and after a few hours, when a statistically significant threshold is reached, the “winning” headline automatically supplants all others. Hence users are the “bait”, unknowingly taking part in a “test” to see how quickly they respond to a headline.
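The headline-testing mechanism Marantz describes is essentially a multi-way A/B test. The sketch below is not Spartz’s actual code; the headlines, numbers and the deliberately crude “significance” rule are all assumptions, but it shows the shape of the thing: serve variants at random, count clicks, and once one variant’s lead looks convincing, serve only that variant.

```python
import random
from collections import defaultdict

# Hypothetical headline variants for one post; the real system reportedly used up to ~24.
headlines = ["Headline A", "Headline B", "Headline C"]

impressions = defaultdict(int)
clicks = defaultdict(int)
winner = None

def serve_headline():
    """Serve a random variant until a winner is declared, then always serve the winner."""
    h = winner or random.choice(headlines)
    impressions[h] += 1
    return h

def record_click(h):
    clicks[h] += 1
    maybe_pick_winner()

def maybe_pick_winner(min_impressions=1000, min_lead=0.01):
    """A crude stand-in for a proper significance test: declare a winner once every
    variant has enough impressions and the best click-through rate leads by a clear margin."""
    global winner
    if winner or any(impressions[h] < min_impressions for h in headlines):
        return
    rates = sorted((clicks[h] / impressions[h], h) for h in headlines)
    if rates[-1][0] - rates[-2][0] > min_lead:
        winner = rates[-1][1]

# Simulated traffic: variant B has the (hypothetical) highest true click-through rate.
true_ctr = {"Headline A": 0.03, "Headline B": 0.06, "Headline C": 0.04}
random.seed(1)
for _ in range(20_000):
    h = serve_headline()
    if random.random() < true_ctr[h]:
        record_click(h)

print("winner:", winner, {h: round(clicks[h] / impressions[h], 3) for h in headlines})
```

Every reader who sees one of the trial headlines is, whether they know it or not, a data point in that experiment.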

The final, and most sinister, aspect of what Spartz is trying to do with Dose and similar sites is left to the end of Marantz’s article, where Spartz gives his vision of the future of media:

“The lines between advertising and content are blurring,” he said. “Right now, if you go to any Web site, it will know where you live, your shopping history, and it will use that to give you the best ad. I can’t wait to start doing that with content. It could take a few months, a few years—but I am motivated to get started on it right now, because I know I’ll kill it.”

The ‘content’ that Spartz talks about is news. In other words, he sees his goal as being to feed us the news articles his algorithms calculate we will like. We will no longer be reading the news we want to read but rather that which some computer program thinks we should be reading, coupled of course with the ads the same program thinks we are most likely to respond to.

If all of this is not enough to concern you about what Facebook is doing (and the sort of companies it collaborates with) then the recent announcement of ‘keyword’ or ‘graph’ search might. Keyword search allows you to search content previously shared with you by entering a word or phrase. Privacy settings aren’t changing, and keyword search will only bring up content shared with you, like posts by friends or that friends commented on, not public posts or ones by Pages. But if a friend wanted to easily find posts where you said you were “drunk”, now they could. That accessibility changes how “privacy by obscurity” effectively works on Facebook. Rather than your posts being effectively lost in the mists of time (unless your friends want to methodically step through all your previous posts, that is), your previous confessions and misdemeanours are now just a keyword search away. Maybe now is the time to take a look at your Timeline, or search for a few dubious words alongside your name, to check for anything scandalous before someone else does? As this article points out, there are enormous implications of Facebook indexing trillions of our posts, some of which we can see now but others we can only begin to guess at as ‘Zuck’ and his band of researchers do more and more to mine our collective consciousness.

So that’s why I have decided to deactivate my Facebook account. For now my main social media interactions will be through Twitter (though that too is, of course, working out how it can make money out of better and more targeted advertising). I am also investigating Ello, which bills itself as “a global community that believes that a social network should be a place to empower, inspire, and connect — not to deceive, coerce, and manipulate.” Ello takes no money from advertising and reckons it will make money from value-added services. It is early days for Ello yet and it still receives venture capital money for its development. Who knows where it will go, but if you’d like to join me on there I’m @petercripps (contact me if you want an invite).

I realise this is a somewhat different post from my usual ones on here. I have written posts before on privacy in the internet age but I believe this is an important topic for software architects and one I hope to concentrate on more this year.

Let’s Build a Smarter Planet – Part IV

This is the fourth and final part of the transcript of a lecture I recently gave at the University of Birmingham in the UK. In Part I of this set of four posts I tried to give you a flavour of what IBM is and what it is trying to do to make our planet smarter. In Part II I looked at my role in IBM and in Part III I looked at what kind of attributes IBM looks for in its graduate entrants. In this final part I take a look at what I see as some of the challenges we face in a world of open and ubiquitous data, where potentially anyone can know anything about us, and what implications that has for the people who design the systems that allow this to happen.

So let’s begin with another apocryphal tale…

Target is the second largest discount retail store in America (behind Walmart). Using advanced analytics software, one of Target’s data analysts identified 25 products that, when purchased together, indicate a woman is likely to be pregnant. The value of this information was that Target could send coupons to the pregnant woman at an expensive and habit-forming period of her life.

In early 2012 a man walked into a Target store outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation. “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

Two of the greatest inventions of our time are the internet and the mobile phone. When Tim Berners-Lee appeared from beneath the semi-detached house that lifted up from the ground of the Olympic stadium during the London 2012 opening ceremony, and the words “this is for everyone” flashed up around the edge of the stadium, there can surely be little doubt that he had earned his place there. However, as with any technology, there is a downside as well as an upside. A technology that gives anyone, anywhere access to anything they choose has to be treated with great care and responsibility (as Spiderman’s uncle said, “with great power comes great responsibility”). The data analyst at Target was only trying to improve his company’s profits by identifying potential new consumers of its baby products. Inadvertently, however, he was uncovering information that would previously have been kept very private and known only to a few people. What should companies do in balancing a person’s right to privacy with a company’s right to identify new customers?

There is an interesting book out at the moment called Age of Context in which the authors examine the combined effects of five technological ‘forces’ that they see coming together to form a ‘perfect storm’ which they believe is going to change our world forever. These five forces are mobile, social media, (big) data, sensors and location-aware services. As the authors state:

The more the technology knows about you, the more benefits you will receive. That can leave you with the chilling sensation that big data is watching you…

In the Internet of Things paradigm, data is gold. However, making that data available relies on a ‘contract’ between suppliers (usually large corporations) and consumers (usually members of the public). Corporations provide a free or nominally-priced service in exchange for a consumer’s personal data. This data is either sold to advertisers or used to develop further products or services useful to consumers. Third-party applications that build on the core service can, however, poach customers (and the related customer data), which for established networks and large corporations can be a detrimental practice. In such a scenario, large corporations need to balance their approach to openness with commercial considerations.

Companies know that there is a difficult balancing act between doing what is commercially advantageous and doing what is ethically right. As the saying goes, a reputation takes years to build but can be destroyed in a matter of minutes.

IBM has an organisation within it called the Academy of Technology (AoT), whose membership comprises around 1,000 IBMers from its technical community. The job of the AoT is to focus on “uncharted business and technical opportunities” that help to “facilitate IBM’s technical development” as well as “more tightly integrate the company’s business and technical strategy”. As an example of the way IBM concerns itself with issues like those highlighted by the Target story, one of the studies the Academy looked at recently was into the ethics of big data and how the company should approach the kinds of problems mentioned here. Out of that study came a recommendation for a framework the company should follow in pursuing such activities.

This ethical framework is articulated as a series of questions that should be asked when embarking on a new or challenging business venture.

  1. What do we want to do?
  2. What does the technology allow us to do?
  3. What is legally allowable?
  4. What is ethically allowable?
  5. What does the competition do?
  6. What should we do?

As an example of this consider the insurance industry.

  • The Insurance Industry provides a service to society by enabling groups of people to pool risk and protect themselves against catastrophic loss.
  • There is a duty to ensure that claims are legitimate.
  • More information could enable groups with lower risk factors to reduce their cost basis, but those with higher risk factors would see their costs increase.
  • Taken to the extreme, individuals may no longer be able to buy insurance – e.g. using genetic information to determine medical insurance premiums.

How far should we take using technology to support this extreme case? Whilst it may not be breaking any laws to raise someone’s insurance premium to a level where they cannot afford it, is it ethically the right thing to do? Make no mistake, the challenges we face in making our planet smarter through the proper and considered use of information technology are considerable. We need to address questions such as how we build the systems we need, where the skilled and creative workforce comes from that can do this, and how we approach problems in new and innovative ways whilst at the same time doing what is legally and ethically right.

The next part is up to you…

Thank you for your time this afternoon. I hope I have given you a little more insight into the type of company IBM is, how and why it is trying to make the planet smarter and what you might do to help if you choose to join us. You can find more information about IBM and its graduate scheme here and you can find me on Twitter and Linkedin if you’d like to continue the conversation (and I’d love it if you did).

Thank you!

What Now for Internet Piracy?

So SOPA is to be kicked into the long grass, which means it is at least postponed if not killed altogether. For those who have not been following the Stop Online Piracy Act debate, this is the bill proposed by a U.S. Republican Representative to expand the ability of U.S. law enforcement to fight online trafficking in copyrighted intellectual property (IP) and counterfeit goods. Supporters of SOPA said it would protect IP as well as the jobs and livelihoods of people (and organisations) involved in creating books, films, music, photographs and so on. Opponents reckoned the legislation threatened free speech and innovation and would enable law enforcement officers to block access to entire internet domains, as well as violating the First Amendment. Inevitably much of the digerati came out in flat opposition to SOPA and staged an internet blackout on 18th January in which many sites “went dark” and Wikipedia was unavailable altogether. Critics of SOPA cited the fact that the bill was supported by the music and movie industries as an indication that it was just another way for these industry dinosaurs to protect their monopoly over content distribution. So, a last minute victory for the new digital industry over the old analogue one? And yet…

Check out this TED talk by digital commentator Clay Shirky called Why SOPA is a bad idea. Shirky, in his usual compelling way, puts a good case for why SOPA is bad (the talk was published before the recent announcement of the bill being postponed), but the real interest for me in this talk was in the comments about it. There are many people saying that, yes, SOPA may be a bad bill, but there is nonetheless a real problem with content being given away that should otherwise be paid for, and that content creators (whether they be software developers, writers or photographers) are simply losing their livelihoods because people are stealing their work. Sure, there are copyright laws that are meant to prevent this sort of thing happening, but who can really chase down the web sites and peer-to-peer networks that “share” content they have not created or paid for? SOPA may have been a bad bill, and may really have been about protecting the interests of large corporations who just want to carry on doing what they have always done without having to adapt or innovate. However, without some sort of regulation that protects the interests of individuals or small start-ups wishing to earn a living from their art, killing SOPA has not moved us forward in any way and has certainly not protected their interests. Unfortunately some sort of internet regulation is inevitable.

For a historical perspective on why this is likely to be so, see the TED talk by the Liberal Democrat Paddy Ashdown called The global power shift. Ashdown argues that “where power goes governance must follow” and that there is plenty of historical evidence showing what happens when this is not the case (the recent/current financial meltdown to name but one).

So SOPA may be dead but something needs to replace it, and if we are to get the right kind of governance we must all participate in the debate, or else the powerful special interest groups will get their own way. Clay Shirky argued that if SOPA failed to be passed it would be replaced by something else. Now, then, is our chance to ensure that whatever that is, it is right for content creators as well as distributors.