New Wikipedia-Based Cognitive Model Available for Text Processing

 

Blog Posting By Alex Sakharo, Principal Member of Technical Staff, Data Mining Technologies

Explicit Semantic Analysis (ESA), a new feature in Oracle Advanced Analytics (OAA) Release 12.2, uses the concepts of an existing knowledge base as features, rather than the latent features derived by latent semantic analysis methods such as Singular Value Decomposition and Latent Dirichlet Allocation. Each row in the training data, for example a document, maps to a feature, that is, a concept. ESA works best with concepts represented by text documents. It has multiple applications in text processing, most notably semantic relatedness (similarity) and explicit topic modeling. Text similarity use cases include resume matching and searching for similar blog postings. Similarity indexes derived with OAA's ESA can also be added as new features to other records, e.g. Candidate, Age, Income, Job_description, Similarity_index_score.

The ESA model is basically an inverted index that maps words to relevant concepts of the knowledge base. This inverted index also incorporates weights reflecting the strength of association between words and concepts. ESA does not project the original feature space and does not reduce its dimensionality except for filtering out features with uninformative text.

Vast amounts of knowledge are represented as text. Textual knowledge bases are normally collections of general or domain-specific articles, and every article defines one concept. Such knowledge bases, Wikipedia among them, usually serve as sources for ESA models. Wikipedia is particularly good as a source for a general-purpose ESA model because it is a comprehensive knowledge base. Users can also build and use their own custom, domain-specific ESA models, e.g. for medicine, homeland security, or research and development.

Please refer to https://docs.oracle.com/database/122/DMAPI/explicit-semantic-analysis.htm for more information about ESA.

Distribution

Oracle distributes an ESA model built in 12.2.0.1 from the Wikipedia dump dated November 1, 2016. Wikipedia dumps are published at https://dumps.wikimedia.org/enwiki/.

See Oracle Machine Models to download the ESA Model 1.0 EN.  The model file is wiki_model12.2.0.1.dmp. The distribution includes two scripts: wiki_esa_setup.sql, wiki_esa_demo.sql. The wiki_esa_setup.sql script defines a text policy. The wiki_esa_demo.sql script contains sample queries for the model.

Setup

The following steps show how to load this model into your database, assuming you use the scott/tiger account provided in Oracle databases. The procedure is similar for other accounts. First, execute the following SQL as sysdba to grant the necessary privileges to this account:

SQL> GRANT CREATE ANY DIRECTORY TO SCOTT;
Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_DDL TO SCOTT;
Grant succeeded.

SQL> GRANT CREATE MINING MODEL TO SCOTT;
Grant succeeded.

The minimum recommended size of the tablespace is 1G.

SQL> CREATE TABLESPACE <your tablespace> DATAFILE '<directory>/<file>' SIZE 1G REUSE AUTOEXTEND ON MAXSIZE UNLIMITED;
Tablespace created.

It is necessary to define a database directory in order to import a model.

SQL> CREATE OR REPLACE DIRECTORY DBDIR AS '<directory>';
Directory created.

SQL> ALTER USER SCOTT QUOTA UNLIMITED ON <your tablespace>;
User altered.

Second, copy wiki_model12.2.0.1.dmp to your directory. Then execute this command in the shell, replacing TBS with the name of your tablespace:

impdp scott/tiger dumpfile=wiki_model12.2.0.1.dmp directory=DBDIR remap_schema=DMUSER:SCOTT remap_tablespace=TBS_1:TBS

Alternatively, you may execute the following SQL code to achieve the same result:

SQL> begin
dbms_data_mining.import_model (
             filename => 'wiki_model12.2.0.1.dmp',
             directory =>'DBDIR',
             schema_remap => 'DMUSER:SCOTT',
             tablespace_remap => 'TBS_1:TBS'
);
end; 
/
PL/SQL procedure successfully completed.

The imported model name is WIKI_MODEL. You can explore the imported model via view DM$VAWIKI_MODEL. If you use your own DB account, then make sure that the same privileges as for SCOTT are granted to that account.
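
To confirm that the import succeeded, a quick check along the following lines may help. It is only a sketch: USER_MINING_MODELS is the standard data dictionary view listing the mining models owned by the current user, and the second query simply peeks at a few rows of the model view mentioned above without assuming its column names.

```sql
-- List the imported model in the current schema (run as SCOTT or your own account).
SELECT model_name, mining_function, algorithm
FROM   user_mining_models
WHERE  model_name = 'WIKI_MODEL';

-- Peek at a few rows of the model view; SELECT * avoids assuming column names.
SELECT *
FROM   DM$VAWIKI_MODEL
WHERE  ROWNUM <= 5;
```

If the first query returns no rows, the import most likely went into a different schema than the one you are connected to.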

Now you can run:

SQL> @wiki_esa_setup.sql

This script sets up the text policy wiki_txtpol which is referred to from the model. Make sure that the size of SGA is sufficient for fast scoring. The minimum recommended settings are:

sga_max_size=1G

sga_target=1G
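
One way to apply these settings is sketched below, assuming the instance uses a server parameter file (spfile) and that you are connected as sysdba. Because sga_max_size is a static parameter, the new values take effect only after the instance is restarted. In a shared or centrally managed environment, check with your DBA before changing memory settings.

```sql
-- Assumes an spfile-based instance; run as sysdba.
ALTER SYSTEM SET sga_max_size = 1G SCOPE = SPFILE;
ALTER SYSTEM SET sga_target   = 1G SCOPE = SPFILE;

-- Restart the instance here (for example, SHUTDOWN IMMEDIATE followed by
-- STARTUP in SQL*Plus) so that the static parameter takes effect.

-- After the restart, verify the values now in effect.
SELECT name, value
FROM   v$parameter
WHERE  name IN ('sga_max_size', 'sga_target');
```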

Once SGA is properly sized, you can run sample scoring queries:

SQL> @wiki_esa_demo.sql

 

Scoring

All queries against WIKI_MODEL score textual data. The data must be supplied as a single column named TEXT; if your text comes from a table column, alias that column to TEXT. The text policy wiki_txtpol must be defined before scoring. The scoring function feature_set is used for topic modeling, and the function feature_compare is used for semantic similarity.
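
As a sketch of scoring text stored in a table, the example below aliases a column to TEXT in the USING clause. The table blog_posts and its columns post_id and body are hypothetical and exist only for illustration. The query follows the same feature_set pattern as the topic modeling examples in this post.

```sql
-- Hypothetical table holding the documents to score.
CREATE TABLE blog_posts (
  post_id NUMBER PRIMARY KEY,
  body    VARCHAR2(4000)
);

INSERT INTO blog_posts VALUES (1, 'Central banks lowered interest rates to fight the recession.');
INSERT INTO blog_posts VALUES (2, 'Astronomers photographed a planet orbiting a distant star.');
COMMIT;

-- Top 3 Wikipedia topics per post; the body column is aliased to TEXT.
SELECT p.post_id, s.feature_id, s.value
FROM   (SELECT post_id,
               FEATURE_SET(wiki_model, 3 USING body AS text) AS fset
        FROM   blog_posts) p,
       TABLE(p.fset) s
ORDER  BY p.post_id, s.value DESC;
```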

 

Explicit topic modeling

The ESA Wikipedia model helps discover the most relevant topics for a given text document, which can be anything from a single word to a long document. Here are the most relevant Wikipedia topics for the word 'bank':

SQL> select s.feature_id, s.value from
    (select feature_set(wiki_model, 10 using *) fset from
    (SELECT 'bank' AS text FROM dual)) t,
    table(t.fset) s order by s.value desc;

FEATURE_ID          VALUE
------------------  -----
Bank                 .101
Bank of America      .099
National bank        .099
Central bank         .099
National Bank Act    .096

The next example shows Wikipedia topics for one sentence:

SQL> select s.feature_id, s.value from
      (select feature_set(wiki_model, 10 using *) fset from
       (SELECT 'A group of European-led astronomers has made a photograph of what appears to be a planet orbiting another star. If so, it would be the first confirmed picture of a world beyond our solar system.'
  AS text FROM dual)) t, table(t.fset) s order by s.value desc;

FEATURE_ID                                   VALUE
-------------------------------------------  -----
Solar System                                  .144
Exoplanet                                     .138
Planet                                        .138
Formation and evolution of the Solar System   .127
Planetary system                              .127

Here is yet another example in which topic modeling is done for a paragraph:

SQL> select s.feature_id, s.value from
    (select feature_set(wiki_model, 10 using *) fset from
    (SELECT 'The more things change… Yes, I''m inclined to agree, especially with regards to the historical relationship between stock prices and bond yields. The two have generally traded together, rising during periods of economic growth and falling during periods of contraction. Consider the period from 1998 through 2010, during which the U.S. economy experienced two expansions as well as two recessions: Then central banks came to the rescue. Fed Chairman Ben Bernanke led from Washington with the help of the bank''s current $3.6T balance sheet. He''s accompanied by Mario Draghi at the European Central Bank and an equally forthright Shinzo Abe in Japan. Their coordinated monetary expansion has provided all the sugar needed for an equities moonshot, while they vowed to hold global borrowing costs at record lows' AS text FROM dual)) t,
table(t.fset) s order by s.value desc;

FEATURE_ID                     VALUE
-----------------------------  -----
Recession                       .147
Mario Draghi                    .138
Lost Decade (Japan)             .132
Ben Bernanke                    .120
Federal Open Market Committee   .093

 

Semantic similarity

The ESA Wikipedia model can be used to calculate semantic similarity. One can score semantic similarity for short and long documents alike. The following two queries capture the fact that words 'street' and 'avenue' are semantically closer than 'street' and 'farm'.

SQL> select 1-feature_compare(wiki_model using 'street' as text and using 'avenue' as text) comp from dual;
  COMP
------
  .235
SQL>
SQL> select 1-feature_compare(wiki_model using 'street' as text and using 'farm' as text) comp from dual;
  COMP
------
  .004

In the next example, the first pair of sentences scores higher because Nick Price is a golfer born in South Africa. Note that the two sentences from the first pair have no common words.

SQL> SELECT 1-FEATURE_COMPARE(wiki_model USING 'There are several PGA tour golfers from South Africa' text AND USING 'Nick Price won the 2002 Mastercard Colonial Open' text) comp FROM DUAL;
  COMP
------
  .119
SQL>
SQL> SELECT 1-FEATURE_COMPARE(wiki_model USING 'There are several PGA tour golfers from South Africa' text AND USING 'John Elway played quarterback for the Denver Broncos' text) comp FROM DUAL;
  COMP
------
  .003

In the following example, one paragraph referring to al-Qa'ida and Saudi Arabia is compared to two other paragraphs. The first counterpart paragraph refers to similar matters even though it does not mention al-Qa'ida or Osama bin Laden, and this pair scores a high similarity according to the Wikipedia-based ESA model. The second counterpart paragraph refers to unrelated topics, and this pair scores low, as expected.

SQL> select 1-feature_compare(wiki_model using
  'Senior members of the Saudi royal family paid at least $560 million to Osama bin Laden terror group and the Taliban for an agreement his forces would not attack targets in Saudi Arabia, according to court documents. The papers, filed in a $US3000 billion ($5500 billion) lawsuit in the US, allege the deal was made after two secret meetings between Saudi royals and leaders of al-Qa ida, including bin Laden. The money enabled al-Qa ida to fund training camps in Afghanistan later attended by the September 11 hijackers. The disclosures will increase tensions between the US and Saudi Arabia.' as text
  and using
  'The Saudi Interior Ministry on Sunday confirmed it is holding a 21-year-old Saudi man the FBI is seeking for alleged links to the Sept. 11 hijackers. Authorities are interrogating Saud Abdulaziz Saud al-Rasheed "and if it is proven that he was connected to terrorism, he will be referred to the sharia (Islamic) court," the official Saudi Press Agency quoted an unidentified ministry official as saying.' as text) comp from dual;
  COMP
------
  .583
SQL>
SQL> select 1-feature_compare(wiki_model using
  'Senior members of the Saudi royal family paid at least $560 million to Osama bin Laden terror group and the Taliban for an agreement his forces would not attack targets in Saudi Arabia, according to court documents. The papers, filed in a $US3000 billion ($5500 billion) lawsuit in the US, allege the deal was made after two secret meetings between Saudi royals and leaders of al-Qa ida, including bin Laden. The money enabled al-Qa ida to fund training camps in Afghanistan later attended by the September 11 hijackers. The disclosures will increase tensions between the US and Saudi Arabia.' as text
  and using
  'Russia defended itself against U.S. criticism of its economic ties with countries like Iraq, saying attempts to mix business and ideology were misguided. "Mixing ideology with economic ties, which was characteristic of the Cold War that Russia and the United States worked to end, is a thing of the past," Russian Foreign Ministry spokesman Boris Malakhov said Saturday, reacting to U.S. Defense Secretary Donald Rumsfeld statement that Moscow economic relationships with such countries sends a negative signal.' as text) comp from dual;
  COMP
------
  .095
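
The same function can score table rows against a reference text, which is how a similarity index could be added as a feature in the resume matching use case mentioned at the beginning. The sketch below assumes a hypothetical table resumes with columns candidate and resume_text, and it keeps this post's convention of reporting 1-feature_compare.

```sql
-- Hypothetical table of candidate resumes.
CREATE TABLE resumes (
  candidate   VARCHAR2(100),
  resume_text VARCHAR2(4000)
);

INSERT INTO resumes VALUES ('A', 'Ten years of experience tuning Oracle databases and writing PL/SQL.');
INSERT INTO resumes VALUES ('B', 'Pastry chef specializing in French desserts and wedding cakes.');
COMMIT;

-- Rank candidates by semantic similarity to a job description.
SELECT r.candidate,
       1 - FEATURE_COMPARE(wiki_model
             USING r.resume_text AS text
             AND USING 'Database administrator with SQL performance tuning skills' AS text)
         AS similarity_index_score
FROM   resumes r
ORDER  BY similarity_index_score DESC;
```

The resulting similarity_index_score column can then be joined to other candidate attributes and used as an input feature for another model.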

 

References

E. Gabrilovich and S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. IJCAI, v. 7, pp. 1606-1611, 2007.

E. Gabrilovich and S. Markovitch. Wikipedia-based Semantic Interpretation for Natural Language Processing. Journal of Artificial Intelligence Research, v. 34, pp. 443-498, 2009.

from Oracle Blogs | Oracle Data Mining (ODM) Blog https://blogs.oracle.com/datamining/how-to-get-started-using-oracle-advanced-analytics-122-new-feature%3A-explicit-semantic-analysis-v2
