14.7. "Clusterization" analysis type

Cluster analysis is a mathematical procedure for multidimensional analysis that groups objects into clusters based on a set of indicators characterizing the objects. The objects are grouped so that objects within one cluster are more similar to one another than to objects in other clusters.

The analysis is based on calculating the distance between objects; objects are grouped into clusters according to these distances. The distance can be calculated in different ways (according to different metrics). The following metrics are available:

Euclidean metric
Squared Euclidean metric
City block metric
Maximum metric
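
For illustration, the four metrics can be written out in the built-in language using their standard textbook definitions. The Distance function below is a hypothetical helper, not part of the platform, and the metric names passed as strings are likewise made up for this sketch. The platform's internal calculation (for example, any normalization of values or use of column weights) may differ.

// Hypothetical helper, not a platform API.
// A and B are arrays of numeric indicator values of equal length.
Function Distance(A, B, Metric)
    Sum = 0;
    Largest = 0;
    For Index = 0 To A.UBound() Do
        Difference = A[Index] - B[Index];
        // The absolute value is obtained through Max().
        AbsDifference = Max(Difference, -Difference);
        If Metric = "Euclidean" Or Metric = "SquaredEuclidean" Then
            Sum = Sum + Difference * Difference;
        ElsIf Metric = "CityBlock" Then
            Sum = Sum + AbsDifference;
        ElsIf Metric = "Maximum" Then
            Largest = Max(Largest, AbsDifference);
        EndIf;
    EndDo;

    If Metric = "Euclidean" Then
        Return Sqrt(Sum);
    ElsIf Metric = "Maximum" Then
        Return Largest;
    Else
        // "SquaredEuclidean" or "CityBlock"
        Return Sum;
    EndIf;
EndFunction

For two objects with indicator values (15, 4) and (3, 2), the differences are 12 and 2, so the function returns 148 for "SquaredEuclidean", the square root of 148 (about 12.17) for "Euclidean", 14 for "CityBlock" and 12 for "Maximum".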

After determining the distances between objects, one of several algorithms for distributing objects among clusters can be used. The following clustering methods are available:

Nearest neighbor
Furthest neighbor
K-means
Centroid
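
The nearest neighbor and furthest neighbor methods differ in how the distance between two clusters is derived from the distances between their objects: in the usual definitions, the former takes the smallest pairwise distance and the latter the largest. A minimal sketch under that assumption follows. ClusterDistance, its parameters and the string method names are hypothetical and are not platform APIs.

// Hypothetical helper, not a platform API.
// PairDistances is an array of arrays: PairDistances[I][J] is the distance
// between object I of the first cluster and object J of the second cluster.
Function ClusterDistance(PairDistances, Method)
    Result = Undefined;
    For Each Row In PairDistances Do
        For Each PairDistance In Row Do
            If Result = Undefined Then
                Result = PairDistance;
            ElsIf Method = "NearestNeighbor" Then
                Result = Min(Result, PairDistance);
            Else
                // "FurthestNeighbor"
                Result = Max(Result, PairDistance);
            EndIf;
        EndDo;
    EndDo;
    Return Result;
EndFunction

With pairwise distances (2, 5) in the first row and (3, 7) in the second, the sketch returns 2 for "NearestNeighbor" and 7 for "FurthestNeighbor".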

Schematically, the functionality of cluster analysis can be presented as follows:

Fig. 468. Cluster analysis layout

The data source is passed to the DataAnalysis object. The data source can be the result of a query, a value table, or a cell area of a spreadsheet document. Each source column is designated as input or unused. All possible column types are listed in the DataAnalysisColumnTypeClusterization system enumeration. The enumeration contains other values besides input and unused, but those are used only when building forecasts.

The analysis is performed in accordance with the set analysis parameters.

The following code fragment illustrates cluster analysis:

&AtClient
Procedure ClusterAnalysis(Command)
    Result = AnalysisClusterization();
EndProcedure

&AtServerNoContext
Function AnalysisClusterization()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisClusterization");

    // Selecting source data: items of the "Legal entities" group.
    Group = Catalogs.Counterparties.FindByDescription("Legal entities");
    Query = New Query;
    Query.Text = "
    |SELECT
    |Counterparties.Ref,
    |Counterparties.RetailShopsCount,
    |Counterparties.VehiclesCount,
    |Counterparties.CompanyOperationTime,
    |Counterparties.ContractSigningTime,
    |Counterparties.ContractType,
    |Counterparties.RelationsTermination
    |FROM
    |Catalog.Counterparties AS Counterparties
    |WHERE
    |(Not Counterparties.IsFolder AND Counterparties.Parent = &Parent)";

    Query.SetParameter("Parent", Group);

    Analysis.DataSource = Query.Execute();

    // Selecting metric.
    Analysis.Parameters.DistanceMeasure.Value =
        DataAnalysisDistanceMetricType.SquaredEuclidean;

    // Selecting clusterization method.
    Analysis.Parameters.ClusterizationMethod.Value = ClusterizationMethod.KMeans;

    AnalysisResult = Analysis.Execute();

    // Outputting the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisClusterization");

    Spreadsheet = New SpreadsheetDocument;
    Builder.Output(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction

The query is run against the Counterparties catalog. According to the query condition, only items that are not folders and whose parent is the Legal entities group are selected.

Running the code above produces the following initial analysis settings. Some of them are set explicitly, and the others take default values:

Fig. 469. Analysis parameters

The set of columns is determined by the selection fields of the query. By default, all columns have equal weight. Columns of the Number and Date types are treated as continuous data, and columns of other types as discrete data. Column parameters can be changed as in the following fragment:

Analysis.ColumnsSetting.VehiclesCount.AdditionalParameters.Weight = 2;

In this line, the weight is increased for the VehiclesCount column.

The analysis is performed on the following data:

Counterparty	Number of retail shops	Number of vehicles	Company operation time	Contract signing time	Contract type	Relations condition
Smith CJSC	1	0	Less than a year	Less than a year	Dealer	Contract violation
Furniture CJSC	15	4	From three to ten years	Less than a year	Distributor	Terminated by counterparty
Furniture CJSC	1	10	From three to ten years	From one to three years	Distributor	Terminated by counterparty
Forest LLC	1	1	From one to three years	Less than a year	Dealer	Terminated by counterparty
Shop No. 15	1	1	Over ten years	From three to ten years	Permanent partner	Not terminated
Gross LLC	3	2	Less than a year	Less than a year	Permanent partner	Not terminated
Consultant LLC	7	3	From three to ten years	From one to three years	Permanent partner	Terminated by counterparty
Trust LLC	2	2	Over ten years	From three to ten years	Permanent partner	Not terminated
Individual Entrepreneur Taylor	0	1	Less than a year	Less than a year	Dealer	Not terminated

The analysis produces the following result:

Fig. 470. Cluster analysis result

Note that the analysis returns data on the clusters found (their number, centers, and the distances between them). It does not report which objects (in this case, counterparties) belong to which cluster. This is the behavior when the analysis parameters (namely, the TableFillType parameter) are not set explicitly.

To see how objects are distributed among clusters, add the following line of code before running the analysis (but after setting its type):

Analysis.Parameters.TableFillType.Value = DataAnalysisResultTableFillType.UsedFields;
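
Placed in the AnalysisClusterization function from the earlier example, the line might be positioned as follows. The fragment is a sketch that reuses only names from that example; the query setup and the metric and method selection are omitted.

Analysis = New DataAnalysis;
// The analysis type is set first.
Analysis.AnalysisType = Type("DataAnalysisClusterization");

// The data source is set as in the earlier example.
Analysis.DataSource = Query.Execute();

// Requesting the distribution of objects among clusters.
Analysis.Parameters.TableFillType.Value = DataAnalysisResultTableFillType.UsedFields;

// The analysis is run only after all parameters are set.
AnalysisResult = Analysis.Execute();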