A review of 2017’s IRSG Search Solutions conference
The amount of searchable content online, coupled with the main monetisation technique on the web (advertising), has made better search tools a necessity to avoid the spread of misinformation and undesirable content. It is understandable that the focus of current commercial platforms with search tools (e.g. Google, Facebook, …) is not solely centered on addressing this issue. There is no such thing as a “free lunch”: we can’t expect to get the perfect tool for searching the web for free. It has to be a community-driven effort, and it is the responsibility of the IR community to be a strong driver of this. There is a lot of work being done on improving search to address issues like this one, and the work shared at Search Solutions supports this. However, the fishbowl (panel) session at the conference also made apparent that the communication of this work to non-IR experts is still lacking and should be improved.
Recently, I had the pleasure of attending Search Solutions 2017 at the British Computer Society’s offices in London. Search Solutions is a yearly event, running for 11 years now, that brings together members of the academic community and the industry to discuss the current state of Information Retrieval (IR) research and its commercial applications. The speakers this year ranged from purely academic speakers (talking about the EU-funded Pheme project) to researchers applying IR at big names in the industry (Google, Facebook, Microsoft, Bloomberg, LexisNexis, Elsevier and Elastic). The attendees came from different backgrounds, which resulted in a very diverse fishbowl session (a more dynamic version of a panel), mainly about the fake news problem and the responsibility the IR community has in solving it. The startup world was also represented in the fishbowl session by Factmata’s co-founder, Dhruv Ghulati. This post will review the event by going through the 4 sessions it consisted of, in the following order: Beyond Keyword Matching: Searching with Taxonomies and Knowledge Graphs; From Search to Conversation; New Challenges: Misinformation, Fake News & Toxic Content; Beyond Web Search. To conclude, I will provide a review of the topics covered in the event.
Beyond Keyword Matching: Searching with Taxonomies and Knowledge Graphs
Search in IR is moving away from purely keyword-based methods towards more semantically meaningful retrieval methods. Dr. Edgar Meij (Bloomberg) and Mark Fea (LexisNexis) showed ways to improve the quality of search results through manual entity and topic annotation and supervised classification. Dr. Meij showcased some of Bloomberg’s products and how they improve search recall by leveraging manual topic tags and entity links assigned to news articles by their journalists all around the world (Bloomberg has more than 2,700 journalists and analysts in more than 120 countries). Meij noted how these annotations made it possible for the platform to perform multilingual search by doing an entity/topic-centered search instead of a keyword one. This improves recall by a large margin when searching news, since the search is no longer tied to the language of the query. Similarly, Mark Fea shared how manually building a language-specific taxonomy for their topics ensured more relevant search results. Fea achieved this increase in relevance by following a cascaded approach to the supervised classification of topics under this taxonomy. In the taxonomy, root topics were classified through trained machine learning models to improve recall, whilst leaves were classified using rules to improve precision. Both speakers in this session showed how human annotation can be implemented at scale, either by integrating the task into the publishing loop (when you have the benefit of also having content generation in your pipeline, like Bloomberg does), or by maximising the impact of human effort through careful selection of which topics to train classifiers for and the relationships between them.
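To make the difference between keyword and entity-centered search concrete, here is a minimal sketch in Python. It is not Bloomberg's implementation: the articles, the entity IDs (`ENT_ECB`, `ENT_OIL`) and the surface-form dictionary are all made up for illustration, and a real system would use a proper entity linker rather than a dictionary lookup.

```python
from collections import defaultdict

# Hypothetical articles. The "entities" field stands in for the
# language-independent tags a journalist would assign when publishing.
ARTICLES = [
    {"id": "a1", "lang": "en", "text": "ECB raises interest rates", "entities": {"ENT_ECB"}},
    {"id": "a2", "lang": "de", "text": "Die EZB erhöht die Leitzinsen", "entities": {"ENT_ECB"}},
    {"id": "a3", "lang": "en", "text": "Oil prices fall", "entities": {"ENT_OIL"}},
]

# Hypothetical dictionary mapping query strings to entity IDs.
SURFACE_FORMS = {
    "ecb": "ENT_ECB",
    "ezb": "ENT_ECB",
    "european central bank": "ENT_ECB",
}


def build_entity_index(articles):
    """Map each entity ID to the set of article IDs tagged with it."""
    index = defaultdict(set)
    for article in articles:
        for entity in article["entities"]:
            index[entity].add(article["id"])
    return index


def keyword_search(articles, query):
    """Plain substring match on the article text."""
    needle = query.lower()
    return sorted(a["id"] for a in articles if needle in a["text"].lower())


def entity_search(index, query):
    """Resolve the query to an entity ID, then look up tagged articles."""
    entity = SURFACE_FORMS.get(query.lower())
    return sorted(index.get(entity, set())) if entity else []


index = build_entity_index(ARTICLES)
print(keyword_search(ARTICLES, "ECB"))  # only the English article matches
print(entity_search(index, "ECB"))      # the German article is found too
```

The keyword query misses the German article because the string "ECB" never appears in it, whereas the entity query finds both because they share the same tag.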
Note: Manually thinking about the taxonomy will naturally improve the quality of the training data for any supervised or semi-supervised approach to topic classification. Since it is hard to define a topic, having this taxonomy beforehand makes it easier to know where an article fits.
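The cascaded approach described in this session (a trained model for root topics, rules for leaves) can be sketched roughly as follows. Everything here is an assumption made for illustration: the taxonomy, the rules and the training examples are invented, and the word-count scorer is only a stand-in for whatever trained models LexisNexis actually uses.

```python
import re
from collections import Counter, defaultdict

# Hypothetical two-level taxonomy: root topic -> {leaf topic: rule}.
LEAF_RULES = {
    "law": {
        "law/patents": re.compile(r"\bpatents?\b", re.I),
        "law/employment": re.compile(r"\b(dismissal|tribunal)\b", re.I),
    },
    "finance": {
        "finance/mergers": re.compile(r"\b(merger|acquisition)s?\b", re.I),
    },
}

# Tiny hypothetical training set for the root level.
TRAINING = [
    ("court rules on patent dispute", "law"),
    ("tribunal hears unfair dismissal claim", "law"),
    ("bank announces merger with rival", "finance"),
    ("shares rise after acquisition news", "finance"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train_root_model(examples):
    """'Train' by counting how often each word appears under each root."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(tokenize(text))
    return counts


def classify_root(model, text):
    """Pick the root whose training vocabulary overlaps most with the text."""
    tokens = tokenize(text)
    scores = {label: sum(c[t] for t in tokens) for label, c in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify(model, text):
    """Cascade: learned model for the root, then rules for its leaves only."""
    root = classify_root(model, text)
    if root is None:
        return None, []
    leaves = [leaf for leaf, rule in LEAF_RULES[root].items() if rule.search(text)]
    return root, leaves


model = train_root_model(TRAINING)
print(classify(model, "The court upheld the patent"))
```

The point of the cascade is that leaf rules are only evaluated under the root the model selected, so a precise rule never fires in the wrong part of the taxonomy.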
From Search to Conversation
Apart from moving to a more semantically meaningful approach to IR, current search engines have become more conversational and therefore personalised. In this session, Dr. Filip Radlinski (Google) gave a presentation based on his paper A theoretical framework for conversational search, and Dr. Fabrizio Silvestri (Facebook) talked about search at Facebook. Dr. Radlinski started off by establishing the similarity between using current search engines (e.g. Google) and having a conversation. He also noted that search is personalised to the information these engines have about the user and bounded by a search session context, which led him to the following key characteristics conversational search requires to be as effective as traditional search interfaces:
- User and System Revealment: System shares its capabilities and corpus and assists a user in expressing and/or discovering their information needs.
- Mixed Initiative: Understood as an organic collaboration between the system and the user to satisfy the user’s information needs.
- Memory: The system must allow the user to make references to past instances in the conversation.
- Set Retrieval: The system can understand a set of items and allow the user to get information about and manipulate this set.
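As a rough illustration of how some of these properties could look in code, the toy class below keeps past result sets (memory), lets the user narrow down an earlier set (set retrieval) and can list what it is able to search over (a crude form of system revealment). It is my own sketch, not something shown in the talk or taken from the paper; the catalogue and all names are hypothetical, and mixed initiative is left out.

```python
from dataclasses import dataclass, field

# Hypothetical catalogue the system can search over.
CATALOGUE = [
    {"name": "Trattoria Roma", "cuisine": "italian", "price": 2},
    {"name": "Casa Luigi", "cuisine": "italian", "price": 4},
    {"name": "Sushi Go", "cuisine": "japanese", "price": 3},
]


@dataclass
class Conversation:
    # Memory: one result set per turn, oldest first.
    history: list = field(default_factory=list)

    def capabilities(self):
        """System revealment: tell the user which fields can be queried."""
        return sorted({key for item in CATALOGUE for key in item})

    def search(self, **criteria):
        """Start from the full catalogue and keep exact matches."""
        results = [
            item for item in CATALOGUE
            if all(item[k] == v for k, v in criteria.items())
        ]
        self.history.append(results)
        return results

    def refine(self, predicate, turns_back=1):
        """Set retrieval: filter the result set from an earlier turn."""
        previous = self.history[-turns_back]
        results = [item for item in previous if predicate(item)]
        self.history.append(results)
        return results


chat = Conversation()
chat.search(cuisine="italian")                        # "Show me Italian places"
cheap = chat.refine(lambda r: r["price"] <= 2)        # "Only the cheap ones"
fancy = chat.refine(lambda r: r["price"] > 2, turns_back=2)  # "From the Italian ones, the pricier"
print([r["name"] for r in cheap], [r["name"] for r in fancy])
```

The `turns_back` argument is what lets the user refer to a set from earlier in the conversation instead of only the most recent one.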
Search at Facebook illustrates how the first two properties (revealment and initiative) are present in current search engines. Dr. Silvestri explained how, at Facebook, they allow users to search News, Images, Video, Posts, and People, but with the added complexity of personalising search suggestions and results to the user. He went through some examples of two different classes of suggestions Facebook makes when a user searches: rewritten queries and related queries.
Dr. Silvestri also noted that finding the right amount of personalisation (user-centered context) of the search experience (suggestions and results) is a hard problem to solve, and one that comes with privacy issues. It is evident that the tendency of search to be more conversational and personalised leads to a different dynamic between the user and the search system. This trend, in conjunction with an explosion of searchable content, spawns new challenges and social responsibilities for the IR community, which were covered in the next session.
New Challenges: Misinformation, Fake News & Toxic Content
This session dealt with the challenge of handling misinformation and undesirable or toxic content in search. Phil Bradley (independent consultant and blogger) gave an interesting presentation on fake news, where he explained its origin, its purpose, and how to identify it. The most controversial part of his talk was when he took the audience through some examples of fake news (e.g. cures for cancer, sensationalist stories that bring terror to the population, abortion advice websites funded by right wing groups, etc.), and how search engines return matching documents and not “facts” or “truth”. He accused the big search engines of being motivated only by the money they can make from advertising, instead of by the impact their technology has on their users.
Bradley’s talk was evidence of a major issue with the current view people outside the IR community have of fake news: the complexity of identifying fake news is not understood by those critiquing companies who deal with search (e.g. Facebook, Google, etc.). Identifying misinformation is a big problem in IR that is under active research, and this has to be better communicated outside of the IR community. There is a lot of effort being put into researching this subject. An example of this is the PHEME project, a research project spanning 7 countries in the EU, which focused on studying and building tools to identify and visualise misinformation in the news for journalism and medical applications. Dr. Anna Kolliakou presented her work on the project in a very detailed presentation. The takeaway of her talk was that no single solution can tackle this problem. Instead, it requires real-time human analysis of metrics, visualised in a useful way, which come from an automatic annotation system running on real-time data from multiple sources (e.g. Twitter, medical databases, knowledge bases, …).
Also in this session, Mark Harwood’s presentation (voted by the audience as the best presentation) showed a very practical example of finding undesirable content by cleverly treating the task as a recommendation engine implementation. He demoed the fetching of recommended content for a “user” who likes undesirable content. After that, Harwood also showed how existing search engines (e.g. Elasticsearch) can be leveraged to build tools for visualising content items in a platform and their relationships to other items (e.g. reviews on Amazon linked by their creation time). This leads to a robust toolset that can be used to automatically flag content producers that might be undesirable, which can later be reviewed manually. An interesting question was raised by an audience member about whether these tools can be published or whether they should be kept private and internal, since this is an adversarial space where knowledge of the content-flagging strategies can be exploited by the toxic content producers.
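The idea of treating content flagging as recommendation can be sketched with a simple co-occurrence recommender. This is not Harwood's Elasticsearch demo: the interaction log and item names below are hypothetical, and the scoring is the plainest co-occurrence count I could write. A pretend "user" who likes known-bad items is given recommendations, and those recommendations become candidates for manual review.

```python
from collections import Counter

# Hypothetical interaction log: user -> items they engaged with.
INTERACTIONS = {
    "u1": {"spam_a", "spam_b", "spam_c"},
    "u2": {"spam_a", "spam_c", "recipe"},
    "u3": {"recipe", "news", "sports"},
    "u4": {"spam_b", "spam_d"},
}


def recommend(seed_items, interactions, top_n=3):
    """Recommend items that co-occur with the seed items.

    Each real user votes for their other items, weighted by how many
    seed items they share with the synthetic "user".
    """
    scores = Counter()
    for items in interactions.values():
        overlap = len(items & seed_items)
        if overlap == 0:
            continue
        for item in items - seed_items:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# The synthetic "user" likes two items already known to be undesirable.
known_bad = {"spam_a", "spam_b"}
print(recommend(known_bad, INTERACTIONS))
```

In this toy log the top candidate is `spam_c`, but `recipe` also shows up further down the list, which is why the output should feed a human review queue rather than an automatic ban.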
Beyond Web Search and Lightning Talks
Apart from this currently very popular and controversial problem in search, there are still new and interesting ones in the field. In this session, Dr. Nicola Cancedda (Bing/Microsoft) and Mark Stanger (Elsevier) took us through some new applications of IR. Dr. Cancedda described the implementation of a new unreleased feature in Outlook called “Reply With:”. The problem to solve was to reduce the number of steps a user needs to take to reply to a file request email (i.e. an email prompting for a file to be attached). Dr. Cancedda used a supervised approach to identify file request emails. A query was then generated from the file request email to search the user’s mailbox for emails with attachments with similar content. In this way, Dr. Cancedda showed that attachments can be recommended automatically based on the content of file request emails. Afterwards, Stanger described the main challenges in building a new search engine for research data (e.g. papers, datasets, …). It was an interesting presentation that showcased why Elsevier won the most interesting 2017 search project award from Search Solutions. These presentations are evidence that IR is an ever-growing area of research which can be applied in the industry in many ways, making IR still a very attractive field to be in.
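The “Reply With:” pipeline as I understood it (detect a file request, turn the email into a query, search the mailbox for matching attachments) can be sketched as below. This is a guess at the overall shape, not Microsoft's implementation: the regular expression is a stand-in for the supervised classifier, the term-overlap score is a stand-in for the real ranking, and the mailbox contents are made up.

```python
import re
from collections import Counter

# Stand-in for the supervised file-request classifier: a keyword rule.
REQUEST_PATTERN = re.compile(
    r"\b(send|attach|share|forward)\b.*\b(file|report|slides|document|copy)\b",
    re.I | re.S,
)
STOPWORDS = {"could", "you", "me", "the", "a", "please", "send", "for", "from", "on", "of"}

# Hypothetical mailbox of previously received or sent emails.
MAILBOX = [
    {"subject": "Q3 budget report", "body": "Final numbers for the Q3 budget", "attachment": "q3_budget.xlsx"},
    {"subject": "Team lunch", "body": "Pizza on Friday?", "attachment": None},
    {"subject": "Holiday photos", "body": "Pictures from the trip", "attachment": "beach.jpg"},
]


def tokens(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def is_file_request(email_text):
    return bool(REQUEST_PATTERN.search(email_text))


def suggest_attachments(email_text, mailbox, top_n=1):
    """Return attachment names ranked by term overlap with the request."""
    if not is_file_request(email_text):
        return []
    query = Counter(tokens(email_text))
    scored = []
    for mail in mailbox:
        if mail["attachment"] is None:
            continue  # only emails that carry an attachment are candidates
        doc = Counter(tokens(mail["subject"] + " " + mail["body"]))
        score = sum(min(query[t], doc[t]) for t in query)
        if score > 0:
            scored.append((score, mail["attachment"]))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:top_n]]


print(suggest_attachments("Could you send me the Q3 budget report?", MAILBOX))
```

An email that is not a file request, such as an invitation to lunch, yields no suggestions at all, which mirrors the two-stage design: classification gates retrieval.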
To summarise, the event made it clear that IR researchers and engineers have to actively engage with the current problems society is facing. The amount of searchable content online, coupled with the main monetisation technique of web services, has made better search tools a necessity to avoid the spread of misinformation and undesirable content online. It is logical that search results shown by major search engines and social platforms are biased towards showing results that at least generate some revenue for them; in the same way, content producers like blogs and news sites are incentivised to produce as much content as possible, as quickly as possible, which leads to either “fake news” or questionable content. Search now has to be done through robust tools, in which every document has associated metadata you can trust (e.g. a document is always signed by a list of unique ids that securely link it to its creator(s)). For example, when validating a document, a user needs to know what the motivation of its creator is, and which other documents show opposing views. However, there is no such thing as a “free lunch”: we can’t expect to get the perfect tool for searching the web for free. It has to be a community-driven effort, and it is the responsibility of the IR community to be a strong driver of this, while actively engaging with non-IR experts and making research advances accessible to them and useful for the industry.