The Hyperlink-Induced Topic Search (HITS) algorithm is a link-analysis algorithm that rates webpages, developed by Jon Kleinberg. It analyzes the link structure of the web to discover and rank the webpages most relevant to a particular search.
HITS uses hubs and authorities to define a recursive relationship between webpages: a good hub is a page that links to many good authorities, and a good authority is a page that is linked to by many good hubs. Before walking through the HITS algorithm, we first need to understand hubs and authorities.
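To make this recursive relationship concrete, here is a minimal Python sketch of the idea (the tiny graph and variable names here are made up for illustration and are not the example used below): each page's authority score is the sum of the hub scores of the pages linking to it, each page's hub score is the sum of the authority scores of the pages it links to, and both score vectors are normalized after every update:

# Minimal HITS sketch on a made-up 3-page graph (illustration only).
edges = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
nodes = list(edges)
hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}
for _ in range(20):  # a fixed number of iterations k
    # authority(p) = sum of hub scores of pages that link to p
    auth = {n: sum(hub[m] for m in nodes if n in edges[m]) for n in nodes}
    # hub(p) = sum of authority scores of pages that p links to
    hub = {n: sum(auth[m] for m in edges[n]) for n in nodes}
    # normalize both vectors (Euclidean norm, as in the worked example below)
    a_norm = sum(x * x for x in auth.values()) ** 0.5
    h_norm = sum(x * x for x in hub.values()) ** 0.5
    auth = {n: x / a_norm for n, x in auth.items()}
    hub = {n: x / h_norm for n, x in hub.items()}
print(hub)
print(auth)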
For example, suppose we want to calculate the hub and authority scores for the following adjacency matrix using the HITS algorithm with k = 3 iterations.
Matrix A is:
The adjacency matrix with its nodes labeled is:
Then, we draw the graph for these nodes:
After that, we show the ranks given by the hub and authority scores:
Now, we calculate the transpose matrix Aᵀ:
In this step, we assume that the initial hub weight vector u is all ones. So, we calculate the authority weight vector:
v = Aᵀ * u
Next, the hub weight vector is updated:
u = A * v
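As a quick NumPy sketch of this matrix form: the original adjacency matrix was given as a figure and is not reproduced in the text, so the matrix below is only an assumed stand-in that happens to be consistent with the numbers used in the rest of the example (column sums 1, 1, 2, 4), not necessarily the exact matrix from the figure:

import numpy as np

# Assumed stand-in adjacency matrix (consistent with the example's numbers,
# but not confirmed to be the matrix shown in the original figure).
# A[i, j] = 1 means page i links to page j.
A = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
])

u = np.ones(A.shape[0])   # initial hub weight vector, all ones
v = A.T @ u               # authority weights: v = A^T u -> [1, 1, 2, 4]
u = A @ v                 # updated hub weights: u = A v  -> [7, 6, 5, 4]
print(v)
print(u)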
Now, we draw the graph again for k = 1:
Then, we calculate the new (normalized) authority vector for k = 1:
|v1|² = 1² + 1² + 2² + 4² = 22
v1 = [1/√22, 1/√22, 2/√22, 4/√22]
v1 = [0.213, 0.213, 0.426, 0.853]
Similarly, we calculate the new hub vector for k = 1:
|u1|² = 7² + 6² + 5² + 4² = 126
u1 = [7/√126, 6/√126, 5/√126, 4/√126]
u1 = [0.623, 0.535, 0.445, 0.356]
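These normalization steps are easy to verify numerically; here is a short check of the k = 1 values above (note that 7/√126 ≈ 0.624, which the example rounds to 0.623):

import numpy as np

v = np.array([1.0, 1.0, 2.0, 4.0])   # raw authority weights at k = 1
u = np.array([7.0, 6.0, 5.0, 4.0])   # raw hub weights at k = 1

v1 = v / np.linalg.norm(v)   # divide by sqrt(1^2 + 1^2 + 2^2 + 4^2) = sqrt(22)
u1 = u / np.linalg.norm(u)   # divide by sqrt(7^2 + 6^2 + 5^2 + 4^2) = sqrt(126)

print(v1)   # approximately [0.213, 0.213, 0.426, 0.853]
print(u1)   # approximately [0.624, 0.535, 0.445, 0.356]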
The diagram for k = 2 will be:
Then, we calculate the new authority vector for k = 2:
|v2|² = 0.213² + 0.213² + 0.426² + 0.853² = 0.999
v2 = [0.213/√0.999, 0.213/√0.999, 0.426/√0.999, 0.853/√0.999]
v2 = [0.213, 0.213, 0.426, 0.853]
Similarly, we calculate the new hub vector for k = 2:
|u2|² = 0.623² + 0.535² + 0.445² + 0.356² = 0.999
u2 = [0.623/√0.999, 0.535/√0.999, 0.445/√0.999, 0.356/√0.999]
u2 = [0.623, 0.535, 0.445, 0.356]
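Note that this k = 2 step only re-normalizes the k = 1 vectors; because those vectors already have (almost) unit length, dividing by √0.999 leaves the values essentially unchanged, which is exactly the convergence pointed out below. A quick check:

import numpy as np

v1 = np.array([0.213, 0.213, 0.426, 0.853])   # normalized authority vector from k = 1
u1 = np.array([0.623, 0.535, 0.445, 0.356])   # normalized hub vector from k = 1

print((v1 ** 2).sum())            # about 0.999: already nearly unit length
print((u1 ** 2).sum())            # about 0.999
print(v1 / np.linalg.norm(v1))    # essentially unchanged -> the scores have converged
print(u1 / np.linalg.norm(u1))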
The diagram for k = 3 will be:
So, the scores at k = 2 ≈ the scores at k = 3.
The hub and authority scores have converged to stable values.
Once the scores stop changing like this, the algorithm is finished!
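In practice, "the scores have stopped changing" is checked with a tolerance on the difference between successive score vectors; the library code below uses the same idea to stop its power iteration. A minimal sketch (has_converged is a hypothetical helper name used only here):

import numpy as np

def has_converged(current, previous, tol=1.0e-8):
    # stop when the scores change by less than tol between iterations (L1 norm)
    return np.abs(current - previous).sum() < tol

v1 = np.array([0.213, 0.213, 0.426, 0.853])   # authority scores at k = 1
v2 = np.array([0.213, 0.213, 0.426, 0.853])   # authority scores at k = 2 (unchanged)
print(has_converged(v2, v1))                  # True: the iteration can stop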
The Python library you can use for HITS is NetworkX. An (older) version of its library code for HITS is shown below:
#!/usr/bin/env python
# Copyright (C) 2008 by
# Aric Hagberg <hagberg@lanl.gov>
# Dan Schult <dschult@colgate.edu>
# Pieter Swart <swart@lanl.gov>
# All rights reserved.
# BSD license.
# NetworkX:http://networkx.lanl.gov/
__author__ = """Aric Hagberg (hagberg@lanl.gov)"""
import networkx
from networkx.exception import NetworkXError
__all__ = ['hits','hits_numpy','hits_scipy','authority_matrix','hub_matrix']
def hits(G,max_iter=100,tol=1.0e-8,nstart=None):
    """Return HITS hubs and authorities values for nodes.

    Parameters
    ----------
    G : graph
      A networkx graph
    max_iter : integer, optional
      Maximum number of iterations in power method.
    tol : float, optional
      Error tolerance used to check convergence in power method iteration.
    nstart : dictionary, optional
      Starting value of each node for power method iteration.

    Returns
    -------
    (hubs,authorities) : two-tuple of dictionaries
      Two dictionaries keyed by node containing the hub and authority
      values.

    Examples
    --------
    >>> G=nx.path_graph(4)
    >>> h,a=nx.hits(G)

    Notes
    -----
    The eigenvector calculation is done by the power iteration method
    and has no guarantee of convergence. The iteration will stop
    after max_iter iterations or an error tolerance of
    number_of_nodes(G)*tol has been reached.

    The HITS algorithm was designed for directed graphs but this
    algorithm does not check if the input graph is directed and will
    execute on undirected graphs.

    For an overview see:
    A. Langville and C. Meyer, "A survey of eigenvector methods of web
    information retrieval." http://citeseer.ist.psu.edu/713792.html
    """
    import networkx
    if type(G) == networkx.MultiGraph or type(G) == networkx.MultiDiGraph:
        raise Exception("hits() not defined for graphs with multiedges.")
    # if not G.weighted:
    #    raise Exception("hits(): input graph must be weighted")
    # choose fixed starting vector if not given
    if nstart is None:
        h=dict.fromkeys(G,1.0/G.number_of_nodes())
    else:
        h=nstart
    # normalize starting vector
    s=1.0/sum(h.values())
    for k in h: h[k]*=s
    nnodes=G.number_of_nodes()
    # power iteration: make up to max_iter iterations
    for i in range(max_iter):
        hlast=h
        h=dict.fromkeys(hlast.keys(),0)
        a=dict.fromkeys(hlast.keys(),0)
        # this "matrix multiply" looks odd because it is
        # doing a left multiply a^T=hlast^T*G
        for n in h:
            for nbr in G[n]:
                a[nbr]+=hlast[n]*G[n][nbr].get('weight',1)
        # now multiply h=Ga
        for n in h:
            for nbr in G[n]:
                h[n]+=a[nbr]*G[n][nbr].get('weight',1)
        # normalize hub vector
        s=1.0/sum(h.values())
        for n in h: h[n]*=s
        # normalize authority vector
        s=1.0/sum(a.values())
        for n in a: a[n]*=s
        # check convergence, l1 norm
        err=sum([abs(h[n]-hlast[n]) for n in h])
        if err < tol:
            return h,a
    raise NetworkXError("HITS: power iteration failed to converge in %d iterations."%(i+1))
def authority_matrix(G,nodelist=None):
    """Return the HITS authority matrix."""
    import networkx
    M=networkx.to_numpy_matrix(G,nodelist=nodelist)
    return M.T*M
def hub_matrix(G,nodelist=None):
    """Return the HITS hub matrix."""
    import networkx
    M=networkx.to_numpy_matrix(G,nodelist=nodelist)
    return M*M.T
def hits_numpy(G,max_iter=100,tol=1.0e-6,nodelist=None):
    """Return a NumPy array of the hubs and authorities."""
    import numpy
    import networkx
    M=networkx.to_numpy_matrix(G,nodelist=nodelist)
    (n,m)=M.shape # should be square
    A=M.T*M # authority matrix
    x=numpy.ones((n,1))/n
    # power iteration on authority matrix
    for i in range(max_iter):
        xlast=x
        x=numpy.dot(A,x)
        x=x/x.sum()
        # check convergence, l1 norm
        err=numpy.abs(x-xlast).sum()
        if err < n*tol:
            a=numpy.asarray(x).flatten()
            # h=M*a
            h=numpy.asarray(numpy.dot(M,a)).flatten()
            h/=h.sum()
            return h,a
    raise NetworkXError("page_rank: power iteration failed to converge in %d iterations."%(i+1))
def hits_scipy(G,max_iter=100,tol=1.0e-6,nodelist=None):
    """Return a SciPy sparse array of the hubs and authorities."""
    import numpy
    import scipy.sparse
    import networkx
    M=networkx.to_scipy_sparse_matrix(G,nodelist=nodelist)
    (n,m)=M.shape # should be square
    A=M.T*M # authority matrix
    x=scipy.ones((n,1))/n # initial guess
    # power iteration on authority matrix
    for i in range(max_iter):
        xlast=x
        x=A*x
        x=x/x.sum()
        # check convergence, l1 norm
        err=scipy.absolute(x-xlast).sum()
        if err < n*tol:
            a=numpy.asarray(x).flatten()
            # h=M*a
            h=numpy.asarray(M*a).flatten()
            h/=h.sum()
            return h,a
    raise NetworkXError("page_rank: power iteration failed to converge in %d iterations."%(i+1))