14.7. "Clusterization" analysis type


Cluster analysis is a mathematical procedure for multidimensional analysis that groups objects into clusters based on a set of indicators characterizing those objects. Objects are grouped so that the objects within one cluster are more similar to each other than to objects in other clusters.

The analysis is based on calculating the distances between objects; the objects are then grouped into clusters according to these distances. The distance can be measured in different ways (using different metrics). The following metrics are available:

  • Euclidean metric
  • Squared Euclidean metric
  • City block metric
  • Maximum metric
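
For reference, the four metrics can be sketched outside the platform using their usual textbook definitions. This is an illustration in Python, not platform code: the function names are hypothetical, and the source does not say whether the platform normalizes values or how it treats non-numeric columns before measuring distance.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def city_block(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25
print(city_block(p, q))         # 7
print(maximum(p, q))            # 4
```

The same pair of points is at a different distance under each metric, which is why the choice of metric can change how objects are grouped.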

After determining the distances between objects, one of several algorithms for distributing objects among clusters can be used. The following clustering methods are available:

  • Nearest neighbor
  • Furthest neighbor
  • K-means
  • Centroid
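
The difference between these methods can be sketched as follows. The sketch assumes the usual textbook meanings: nearest neighbor and furthest neighbor take the smallest and largest pairwise distance between two clusters, centroid compares cluster means, and K-means alternates between assigning points to the nearest center and recomputing the centers. All names are hypothetical, and the platform's actual implementation is not described in the source.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest_neighbor(c1, c2):
    """Cluster distance = distance between the closest pair of members."""
    return min(dist(a, b) for a in c1 for b in c2)

def furthest_neighbor(c1, c2):
    """Cluster distance = distance between the most distant pair of members."""
    return max(dist(a, b) for a in c1 for b in c2)

def centroid_distance(c1, c2):
    """Cluster distance = distance between the cluster means."""
    return dist(centroid(c1), centroid(c2))

def k_means(points, centers, iterations=10):
    """Assign each point to its nearest center, then move each center
    to the mean of its points; repeat a fixed number of times."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

left = [(0, 0), (0, 2)]
right = [(4, 0), (6, 0)]
print(nearest_neighbor(left, right))   # 4.0
print(furthest_neighbor(left, right))  # approximately 6.32
print(centroid_distance(left, right))  # approximately 5.10

centers, groups = k_means(left + right, centers=[(0, 0), (6, 0)])
print(centers)  # [(0.0, 1.0), (5.0, 0.0)]
```

The first three methods differ only in how the distance between two clusters is measured, while K-means works with a fixed number of cluster centers.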

Schematically, the functionality of cluster analysis can be presented as follows:

Fig. 468. Cluster analysis layout

The data source is passed to the DataAnalysis object. The data source can be a query result, a value table, or a cell area of a spreadsheet document. Each source column is marked as input or unused. All column type values are contained in the DataAnalysisColumnTypeClusterization system enumeration. This enumeration also contains values other than input and unused; those values are used when building forecasts.

The analysis is performed in accordance with the set analysis parameters.

We will use the following code fragment as an example illustrating the capability of cluster analysis:

&AtClient
Procedure ClusterAnalysis(Command)
    Result = AnalysisClusterization();
EndProcedure

&AtServerNoContext
Function AnalysisClusterization()

    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisClusterization");

    // Data source: items (not folders) of the "Legal entities" group.
    Group = Catalogs.Counterparties.FindByDescription("Legal entities");
    Query = New Query;
    Query.Text = "
    |SELECT
    |    Counterparties.Ref,
    |    Counterparties.RetailShopsCount,
    |    Counterparties.VehiclesCount,
    |    Counterparties.CompanyOperationTime,
    |    Counterparties.ContractSigningTime,
    |    Counterparties.ContractType,
    |    Counterparties.RelationsTermination
    |FROM
    |    Catalog.Counterparties AS Counterparties
    |WHERE
    |    NOT Counterparties.IsFolder
    |    AND Counterparties.Parent = &Parent";

    Query.SetParameter("Parent", Group);

    Analysis.DataSource = Query.Execute();

    // Select the distance metric.
    Analysis.Parameters.DistanceMeasure.Value =
        DataAnalysisDistanceMetricType.SquaredEuclidean;

    // Select the clusterization method.
    Analysis.Parameters.ClusterizationMethod.Value = ClusterizationMethod.KMeans;

    AnalysisResult = Analysis.Execute();

    // Output the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisClusterization");

    SpreadsheetDoc = New SpreadsheetDocument;
    Builder.Output(AnalysisResult, SpreadsheetDoc);

    Return SpreadsheetDoc;

EndFunction

The query runs against the Counterparties catalog. The query condition selects only catalog items (not folders) that belong to the Legal entities group.

Executing the code above defines the following initial data analysis settings. Some of them are set explicitly; the others take default values:

Fig. 469. Analysis parameters

The set of columns is determined by the fields selected in the query. By default, all columns have equal weight. Columns of the Number and Date types are assigned the Continuous data type; columns of other types are assigned the Discrete type. Column parameters can be changed by analogy with the fragment below:

Analysis.ColumnsSetting.VehiclesCount.AdditionalParameters.Weight = 2;

In this line, the weight is increased for the VehiclesCount column.
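
The source does not describe how the platform applies column weights or compares discrete values. The Python sketch below shows one common convention only, as an assumption: each squared difference is multiplied by the column weight, and discrete values contribute 0 when equal and 1 otherwise. The function name and its parameters are hypothetical.

```python
def weighted_squared_distance(a, b, weights, discrete):
    """Squared distance with per-column weights (assumed convention).

    Numeric columns contribute weight * (difference squared).
    Discrete columns contribute weight * 1 when the values differ, else 0.
    """
    total = 0.0
    for x, y, w, is_discrete in zip(a, b, weights, discrete):
        if is_discrete:
            diff = 0.0 if x == y else 1.0
        else:
            diff = float(x - y)
        total += w * diff ** 2
    return total

# Two hypothetical objects described by VehiclesCount and ContractType.
a = (4, "Distributor")
b = (2, "Dealer")
kinds = (False, True)  # VehiclesCount is numeric, ContractType is discrete

print(weighted_squared_distance(a, b, weights=(1, 1), discrete=kinds))  # 5.0
print(weighted_squared_distance(a, b, weights=(2, 1), discrete=kinds))  # 9.0
```

Under this assumption, doubling the weight of VehiclesCount makes a difference in vehicle counts count twice as much when objects are compared.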

The analysis is performed on the following data selection:

Counterparty                    Number of retail shops  Number of vehicles  Company operation time   Contract signing time    Contract type      Relations condition
------------------------------  ----------------------  ------------------  -----------------------  -----------------------  -----------------  --------------------------
Smith CJSC                      1                       0                   Less than a year         Less than a year         Dealer             Contract violation
Furniture CJSC                  15                      4                   From three to ten years  Less than a year         Distributor        Terminated by counterparty
Furniture CJSC                  1                       10                  From three to ten years  From one to three years  Distributor        Terminated by counterparty
Forest LLC                      1                       1                   From one to three years  Less than a year         Dealer             Terminated by counterparty
Shop No. 15                     1                       1                   Over ten years           From three to ten years  Permanent partner  Not terminated
Gross LLC                       3                       2                   Less than a year         Less than a year         Permanent partner  Not terminated
Consultant LLC                  7                       3                   From three to ten years  From one to three years  Permanent partner  Terminated by counterparty
Trust LLC                       2                       2                   Over ten years           From three to ten years  Permanent partner  Not terminated
Individual Entrepreneur Taylor  0                       1                   Less than a year         Less than a year         Dealer             Not terminated
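
To get a feel for the distances involved, the squared Euclidean metric can be applied by hand to a few rows of the selection above. The Python sketch below is a simplification: it uses only the two numeric columns (number of retail shops, number of vehicles) and ignores the discrete columns, column weights, and any normalization the platform may apply.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# (number of retail shops, number of vehicles) taken from the selection above
rows = {
    "Smith CJSC": (1, 0),
    "Forest LLC": (1, 1),
    "Consultant LLC": (7, 3),
}

# (1 - 1)^2 + (0 - 1)^2 = 1
print(squared_euclidean(rows["Smith CJSC"], rows["Forest LLC"]))      # 1

# (1 - 7)^2 + (0 - 3)^2 = 36 + 9 = 45
print(squared_euclidean(rows["Smith CJSC"], rows["Consultant LLC"]))  # 45
```

On these two columns alone, Smith CJSC is much closer to Forest LLC than to Consultant LLC, so the first two are more likely to fall into the same cluster.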

The analysis produces the following result:

Fig. 470. Cluster analysis result

Note that the analysis returns data about the clusters found (their number, their centers, and the distances between them). It does not report which objects (in our case, counterparties) belong to which cluster. This is the behavior when the analysis parameters (namely, the TableFillType parameter) are not set explicitly.

To see how objects are distributed among clusters, add the following line of code before executing the analysis (but after setting its type):

Analysis.Parameters.TableFillType.Value = DataAnalysisResultTableFillType.UsedFields;