May 9, 2017

Businesses often want to track mentions in the news of themselves, their spokespeople and their competitors, in order to maintain a competitive edge. Supporting this tracking in media monitoring platforms requires sophisticated text analytics components. In particular, mentions of the different entities in the article’s text should be automatically identified and linked to their entries in a knowledge base (KB), such as Wikipedia or Freebase. This process is called Entity Linking (EL), and it can be very challenging because it involves coping with ambiguous names and multiple aliases of the same entity. An example of this process is illustrated in Figure 1.

Figure 1: Example from the Signal-1M dataset

The word ‘Worcester’ in the article’s text is a mention of an entity (a place in this case). EL has two steps: first, detecting that ‘Worcester’ represents an entity (detection), and second, linking it to the correct entry in a given KB (disambiguation). Disambiguation involves deciding which Worcester the word refers to, as many towns around the world share this name (according to Wikipedia, there are more than 10: https://en.wikipedia.org/wiki/Worcester_(disambiguation) ). The correct one in this case is the town in Worcestershire, England, which has a unique identifier in the Wikipedia KB, represented by a URL. There are a number of open-source EL systems with public APIs that we can use to perform this task on news articles. One example is Spotlight, which correctly linked ‘Worcester’ when we fed it the article’s text.
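As a rough illustration of what calling such a public API can look like, here is a minimal Python sketch. It assumes that "Spotlight" refers to DBpedia Spotlight and that its public annotate endpoint is reachable at the URL below. The function names are our own, the confidence threshold of 0.5 is an arbitrary choice, and the field names reflect our understanding of the service's JSON output rather than anything stated in this post.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia Spotlight endpoint; availability is not guaranteed.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build (but do not send) an annotate request for the given text."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"},
    )


def parse_links(payload):
    """Extract (surface form, character offset, KB URI) triples from a JSON response."""
    links = []
    # "Resources" is absent when the service links nothing in the text.
    for resource in payload.get("Resources", []):
        links.append(
            (resource["@surfaceForm"], int(resource["@offset"]), resource["@URI"])
        )
    return links


def link_entities(text, confidence=0.5):
    """Send the text to the service and return the linked mentions."""
    with urllib.request.urlopen(build_request(text, confidence), timeout=30) as resp:
        return parse_links(json.load(resp))
```

Calling `link_entities(article_text)` on an article would return one triple per linked mention. Detection and disambiguation both happen inside the service, so any mention it cannot link is simply missing from the result.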

How do EL systems do that?

Generally, effective EL systems depend heavily on the availability of a sufficient quantity of relevant information about the entities in the KB. Some examples include:

In other words, effective EL systems rely on rich contextual content and metadata describing the entities in their KBs. Such data is not usually rich, let alone available for all entities out there. Indeed, many off-the-shelf entity linking systems use general KBs such as Freebase and Wikipedia, which cover popular entities, e.g. big corporations and celebrities. Many less popular or domain-specific entities have a less complete profile or are not covered at all by general KBs, and therefore cannot be easily linked by these systems.

Long-tail Entities

We recently published a paper at the European Conference on Information Retrieval (ECIR 2017) studying the volume of long-tail entities in the news and the challenges associated with effectively identifying and disambiguating them. The term “long-tail entities” describes the large number of entities with relatively few mentions in text collections. They are usually characterised by a limited or missing general KB profile and by sparse or absent resources outside the KB. They are of particular interest to Signal, as they represent part of our target audience (small or medium organisations) or their spokespeople. A concrete example is shown in Figure 1: ‘Worcester’s Breakfast Club for HM Forces and Veterans’ is a long-tail entity. It is an organisation that has no profile in Wikipedia, and the Spotlight EL system did not succeed in identifying it.

In our paper, we developed an approach to automatically identify long-tail entities, so that we can measure their volume in news corpora. We also uncovered insights into the types of entities that cannot be easily linked by off-the-shelf EL systems, which rely on general knowledge bases.

The approach we developed works as follows. We take a large collection of news articles. In this case, we took all the news articles in the Signal-1M dataset, which is a collection of 1 million news articles published over a period of 1 month (September 2015) and sourced from tens of thousands of news and blog sources. We then fed each article into two different processes:

Let us return to our example in Figure 1, which shows the output of the above two processes on the example article: bolded words and phrases are the output of the EL process, while underlined ones are the ER output. Long-tail entities will typically be identified by ER and not by EL. Indeed, ‘Worcester’s Breakfast Club for HM Forces and Veterans’ was identified as an organisation entity by ER, but it was not linked. On the other hand, the popular entity Worcester, with a rich profile in Wikipedia, was picked up by both ER and EL.

Following this, we can use the overlap between the outputs of these two processes as a proxy for the volume of long-tail entities: the lower the overlap, the larger the volume of long-tail entities. Note, however, that when ER identifies an entity mention and EL does not link it to the KB, we cannot guarantee that the mention represents a long-tail entity. The missing link could instead be an EL error, since some entity mentions are highly ambiguous.
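The overlap measure can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the exact procedure from the paper: mentions are represented as character spans, an ER mention counts as linked only if EL produced a link for exactly the same span, and the function name and tuple layout are hypothetical.

```python
from collections import Counter


def no_overlap_by_type(er_mentions, el_spans):
    """Share of ER mentions, per entity type, that EL did not link.

    er_mentions: iterable of (doc_id, start, end, entity_type) tuples from ER.
    el_spans:    iterable of (doc_id, start, end) tuples linked by EL.
    Assumption: a mention overlaps only when the spans match exactly.
    """
    linked = set(el_spans)
    total = Counter()
    unlinked = Counter()
    for doc_id, start, end, entity_type in er_mentions:
        total[entity_type] += 1
        if (doc_id, start, end) not in linked:
            unlinked[entity_type] += 1
    # Every type in `total` has at least one mention, so no division by zero.
    return {t: unlinked[t] / total[t] for t in total}


# Toy input with made-up spans: ER finds two mentions, EL links only the first.
er = [("a1", 0, 9, "Location"), ("a1", 20, 72, "Organisation")]
el = [("a1", 0, 9)]
rates = no_overlap_by_type(er, el)
```

Here `rates` maps each entity type to its no-overlap rate. Exact span matching is a deliberate choice in this sketch: with partial matching, a long organisation name that contains a linked place name would wrongly count as linked.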

We repeated this over the million articles in the dataset, and aggregated the results over all the entities identified by ER for different entity types. We show the results in Table 1.

Table 1: Overlap results

Entity type     Total mentions (ER)    No overlap with EL
Person          7.71 million           54.38%
Location        5.52 million           7.33%
Organisation    5.37 million           14.54%

The first row in the table reads as follows: in total, the ER process identified 7.71 million mentions of people’s names in the 1 million news articles, and 54.38% of them (more than 4.1 million person mentions) could not be linked to an entry in the Wikipedia KB by EL. In other words, a large number of person mentions in the news could not be linked to general KBs at all, and many of these are likely to be long-tail entities. The second and third rows show that the no-overlap rate is lower for locations and organisations. However, there is still a large number of organisation mentions (773k) and location mentions (404k) that could not be linked and are likely to represent entities in the long tail. In summary, long-tail entities are very common in news, and they represent a challenge for EL systems that rely on general KBs.
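The counts of unlinked mentions follow from multiplying each total by its no-overlap share. The short sketch below reproduces that arithmetic from the rounded figures in Table 1; the names are our own. Because the totals are rounded to two decimal places, the products are only approximations and can differ by several thousand mentions from the exact counts quoted in the text.

```python
# Figures copied from Table 1: (total ER mentions, share with no EL overlap).
TABLE_1 = {
    "Person": (7.71e6, 0.5438),
    "Location": (5.52e6, 0.0733),
    "Organisation": (5.37e6, 0.1454),
}


def unlinked_estimate(total_mentions, no_overlap_share):
    """Approximate number of ER mentions that EL could not link."""
    return total_mentions * no_overlap_share


for entity_type, (total, share) in TABLE_1.items():
    estimate = unlinked_estimate(total, share)
    print(f"{entity_type}: roughly {estimate / 1e6:.2f} million unlinked mentions")
```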

We conducted another analysis to see how the overlap changes for more popular entity mentions. For this, we ordered the unique entity mentions identified by ER (2,029,235 in total) by frequency and estimated the mean no-overlap rate for different subsets. In Figure 2, we report the results for two subsets: the most frequent 0.1% of entity mentions (2031 entity mentions) and the most frequent 5%, along with the overall results already reported in Table 1. As expected, the no-overlap rate decreases for the more frequent mentions, but only marginally. In other words, even for very commonly mentioned entities, EL often fails to find them in Wikipedia.
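The subset analysis can be sketched as follows. This is a minimal illustration under our own assumptions: each unique mention carries a frequency and its own no-overlap rate, and the mean is an unweighted average over the unique mentions in the subset. The function name and input layout are hypothetical, and the paper may aggregate differently.

```python
import math


def mean_no_overlap_top(mention_stats, top_fraction):
    """Mean no-overlap rate over the most frequent unique mentions.

    mention_stats: dict mapping mention text -> (frequency, no_overlap_rate).
    top_fraction:  share of unique mentions to keep, e.g. 0.001 for the top 0.1%.
    """
    if not mention_stats:
        raise ValueError("mention_stats must not be empty")
    # Rank unique mentions from most to least frequent.
    ranked = sorted(mention_stats.values(), key=lambda s: s[0], reverse=True)
    # Keep at least one mention, rounding the subset size up.
    k = max(1, math.ceil(len(ranked) * top_fraction))
    top = ranked[:k]
    return sum(rate for _, rate in top) / len(top)
```

Calling the function with `top_fraction` values of 0.001, 0.05 and 1.0 would give the three groups compared in Figure 2.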

Figure 2: Overlap results for different mention subsets

Recall that a lack of overlap means either that the mention represents a long-tail entity (not covered in the KB) or that EL made a mistake because the entity is very ambiguous. We developed an approach to distinguish between these two cases when no overlap occurs. Full details are available in the paper, but the main findings were:

Table 2: Examples of entity mentions with low overlap with EL output

High frequency (EL mistakes)      Lower frequency (not in Wikipedia)
Mention      Frequency            Mention         Frequency
andy         904                  asigra          14
jaguars      1,073                mark gleeson    81
nomura       805                  mique juarez    12
sedar        773                  pryce           10

Going Forward

The findings of this study serve as a useful guideline for improving EL systems on news corpora. In particular, we can apply the procedure developed in our paper to automatically identify the above two classes of entity mentions: