Create the Taxonomy for Unintentional Insider Threat via Text Mining and Hierarchical Clustering Analysis
Baugher and Qu

The unintentional activities of system users can jeopardize the confidentiality, integrity, and availability of data on information systems. These activities, known as unintentional insider threat activities, account for a significant percentage of data breaches. One way to mitigate or prevent this threat is to use smart systems or artificial intelligence (AI). Constructing an AI requires a taxonomy of activities. The literature review, conducted using Endnote and Google Scholar, focused on data breach threats, mitigation tools, taxonomy usage in cybersecurity, and taxonomy development. This study aims to develop a taxonomy of unintentional insider threat activities based on narrative descriptions of breach events in public data breach databases. The public databases were from the California Department of Justice, US Health and Human Services, and Verizon, yielding 1850 examples of human error. A taxonomy was constructed to specify the dimensions and characteristics of objects. Text mining and hierarchical cluster analysis were used to create the taxonomy, indicating a quantitative approach. Ward’s agglomeration coefficient was used to ensure the clustering was valid. The resulting top-level taxonomy categories are application errors, communication errors, inappropriate data permissions, lost media, and misconfigurations.


Introduction
According to Morgan [1], the expected 2023 annual global cost of cybercrime is approximately $11 trillion. A particular concern for cybersecurity is the "insider threat," defined by Tsiostas et al. [2] as an individual who has authorized access to organizational assets and information and who acts either maliciously or accidentally in a manner that negatively affects the organization. According to the Ponemon Institute [3], the average cost to resolve all insider threat activities was $15,380,000 per company. The cost per incident from malicious insider threat incidents was $648,062, and the cost per incident for correcting negligent activities was $484,931. The average time to contain an incident was 85 days. Insider threats have been divided into two categories: malicious and unintentional. In its foundational study, the US Computer Emergency Response Team [4] defines the unintentional insider threat as an entity that causes harm or increases the risk of future damage to the confidentiality, integrity, or availability of an information technology enterprise without malicious intent. For 2022, approximately 20% of all data breaches were caused by internal threats and 80% by external threats, with 18% of all breaches caused by unintentional insider actions [5]. Schoenherr and Thomson [6] continue the theme and state that insufficient research has been focused on the unintentional insider threat, with malicious and unintentional threat definitions and mitigations commingled.
An approach for dealing with the increasing cyber threat is the use of artificial intelligence (AI), with the current trend to apply artificial intelligence to various cybersecurity problems [7]-[12]. Historically, Chandrasekaran [13] was the first to foresee a need for a primitive taxonomy of terms and ontology of actions for the creation of future expert systems. Therefore, one must have a functional taxonomy and ontology to build an expert system.
A taxonomy, or classification scheme with labels, shows relationships between objects, with the relationship often displayed as a structure [14]. In computer science, an ontology was first described by Gruber [15] as a formal specification that defines the concepts and relationships that exist for an agent or a community of agents. Guarino et al. [16] provided a more rigorous definition using modeling language, often referenced in the literature. Olivares-Alarcos et al. [17] show how ontologies support artificial intelligence in their discussion of comparing different ontology-based methods for robot autonomy. Various taxonomies and ontologies for cybersecurity have been proposed or are used in expert systems [18]-[23]. Canham et al. [24] focused on the causes of unintentional and intentional data breaches, providing an overview of human error research and comparing studies for root causes. They determined it is challenging to decide on sources or devise mitigation strategies without a standardized taxonomy. Meanwhile, because of the lack of a taxonomy, databases and reports do not differentiate the different types of non-malicious activities [5]. Database entries are labelled human error but do not use any further delineations. Text mining can be used to create the taxonomy tree, indicating a quantitative approach. Thus, a hierarchical clustering analysis approach using quantitative methods to create and evaluate the artifact was used in this study.

Taxonomies for Insider Threats
There have been attempts at creating taxonomies for insider threats, primarily focused on malicious threats. Chaipa et al. [21] developed a combined taxonomy of insider threats, focusing on malicious threats. They combined previous malicious insider threat taxonomies and created trees based on masqueraders, traitors, explorers, pure insiders, and logically present ones. After surveying the literature on malicious insider threats, Al-Mhiqani et al. [25] considered insider threat taxonomy development a problem for future consideration. Canham et al. [24] similarly concluded that cybersecurity professionals require a taxonomy of employees' unintentional errors to understand root causes and mitigate risks. Homoliak et al. [26] provided initial work on a taxonomy for unintentional insider threats, creating a structure that consisted of slips and mistakes. There was no further decomposition of slips or mistakes. Yeo and Banfield [27] described their findings from evaluating accidental data breaches contained in the Department of Health and Human Services database of data breaches. The problems include email, misplaced hard drives or documents, and accidental uploads to public databases. Unfortunately, they did not take the next step and turn their findings into a taxonomy. However, their observations were included in the taxonomy developed for this research. According to the CEOs of Mandiant and CybSafe, the largest insider threat risk is unintentional or accidental acts, not malicious ones. Even so, unintentional threats were underrepresented in the literature and often excluded from the insider threat definition [28].

Ontologies for Insider Threats
Under a contract with IARPA, Greitzer et al. [29] updated and expanded the ontology for insider threats. This work was based on an existing taxonomy of malicious threat activities. However, the effort did not focus on the unintentional insider threat. Unintentional insider threat activities are labeled human error, with no further delineation. When all activities are labeled human error, no mitigations can be defined. Further, although the existing Sociotechnical and Organizational Factors for Insider Threat (SOFIT) ontology class structure provided an initial baseline for a comprehensive description of activities, Greitzer et al. [30] continue to improve it. They included several subject matter experts' risk assessments to validate the threat assessment model. This approach was necessary since there is little real-world data for malicious insider threat activities, so human observations effectively served as a substitute. Canito et al. [31] developed an ontology to improve interoperability between various cybersecurity systems, focusing on critical infrastructure and cyber-physical systems, particularly those systems found at airports. Their work included evaluating the SOFIT model for inclusion in their overall ontology.

Taxonomy Validation
Ralph [32] provided guidance for validating taxonomies. He stated that a taxonomy's class structure should match the observed data, that a researcher should be able to draw conclusions based on entity class membership, and that a taxonomy should meet its goal. As taxonomies are typically developed using qualitative metrics [33], evaluation is also generally qualitative. The most common approach to verifying a taxonomy has been to determine whether it remains germane when applied to new data. Humbatova et al. [34] developed a taxonomy of deep learning systems development problems based on GitHub analysis, Stack Overflow discussions, and interviews with 20 software developers. To verify the taxonomy, they interviewed 20 different software developers to validate its design and completeness. Lebeuf et al. [35] created a taxonomy for software bots. They validated the taxonomy by comparing it to other classifications of software bots, demonstrating its utility by classifying public bots, and using experts to determine whether the taxonomy was complete and correct. Mountrouidou et al. [36] developed a general taxonomy of Internet of Things (IoT) devices, including devices spanning consumer to industrial and healthcare use. Their validation used set theory to establish completeness, timelessness, and precision. Completeness was defined as ensuring that a new device can be placed into at least one leaf of each major branch of the taxonomy tree and that non-IoT devices cannot. Precision was defined as every device belonging to one and only one leaf, and timelessness meant that categories were generalized sufficiently such that all new types of IoT devices could be included.
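The set-based criteria attributed to Mountrouidou et al. [36] can be read as simple membership checks. The Python sketch below is illustrative only: the function names, the (item, leaf) pair representation, and the sample entries are assumptions made for this example rather than part of the cited work, and it simplifies the "each major branch" condition to a single flat set of leaves.

```python
from collections import Counter

def leaves_per_item(assignments):
    """Count the distinct leaves each item is placed in.

    `assignments` is an iterable of (item, leaf) pairs; duplicates are ignored.
    """
    return Counter(item for item, _leaf in set(assignments))

def is_complete(items, assignments):
    """Completeness (simplified): every item lands in at least one leaf."""
    counts = leaves_per_item(assignments)
    return all(counts[item] >= 1 for item in items)

def is_precise(items, assignments):
    """Precision: every item belongs to one and only one leaf."""
    counts = leaves_per_item(assignments)
    return all(counts[item] == 1 for item in items)

# Hypothetical entries, used only to exercise the checks.
items = ["thermostat", "insulin pump", "smart lock"]
assignments = [
    ("thermostat", "consumer/home"),
    ("insulin pump", "healthcare/wearable"),
    ("smart lock", "consumer/home"),
    ("smart lock", "industrial/access"),  # second leaf violates precision
]
complete = is_complete(items, assignments)
precise = is_precise(items, assignments)
```

With these sample entries the taxonomy is complete but not precise, because one item sits in two leaves. Timelessness is a judgment about how general the categories are and is not captured by a check of this kind.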

Problem Statement, Hypothesis Statements, and Research Questions

Problem Statement
The problem is that cybersecurity professionals cannot cost-effectively build and maintain comprehensive expert systems to mitigate the threat of accidental data breaches since they do not have a standard taxonomy for unintentional insider threat activities [24].

Hypothesis Statement
It is possible to use text mining and hierarchical clustering analysis to create and maintain a standard taxonomy of unintentional threat activities to allow cybersecurity professionals to build and maintain cost-effective and comprehensive expert systems that mitigate the threat of data breaches.

Research Question
How can hierarchical clustering analysis be applied to ensure both the creation and maintenance of a standard taxonomy for unintentional insider threat activities?

Artifact Creation
Text mining was used to examine the relationships and hierarchies between the descriptions of human errors that cause data breaches. Common text mining tools include word clouds, word frequency counts, and cluster dendrograms [37]. Word clouds are a pictorial representation of the frequency of a word, while word frequency counts are the number of times a word occurs. Cluster dendrograms use hierarchical clustering to identify groups in the dataset. Unlike the more traditional k-means clustering approach, hierarchical clustering does not require a pre-determined number of clusters. Clustering uses the concept of distance to define how similar two elements are to each other. The classic distance measure is Euclidean [38]. Hierarchical clustering can be used to create trees known as dendrograms [39]. Thus, Euclidean distances will be used to calculate the hierarchical clusters that will create the dendrograms. The dendrograms can help shape a taxonomy tree.
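To make the word frequency and distance ideas concrete, the following Python sketch builds a small term-document matrix and compares terms by Euclidean distance. It is a minimal illustration, not the study's R workflow: the breach descriptions are invented, the helper names are hypothetical, and comparing terms (rather than whole descriptions) is an assumption about how the dendrogram input is formed.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(docs):
    """Map each term to its list of per-document counts."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in vocabulary}

def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical breach descriptions (not taken from the study's databases).
docs = [
    "laptop lost on train",
    "laptop stolen from car",
    "email sent to wrong address",
]
tdm = term_document_matrix(docs)
word_frequency = {term: sum(row) for term, row in tdm.items()}
near = euclidean(tdm["laptop"], tdm["lost"])   # terms that share a document
far = euclidean(tdm["laptop"], tdm["email"])   # terms that never co-occur
```

Terms that appear in the same descriptions end up closer together than terms that never co-occur, which is the property a hierarchical clustering step relies on when it groups terms into a dendrogram.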

According to Nickerson et al. [33], the taxonomy creation steps are iterative. They further suggest ascertaining whether the approach is empirical-to-conceptual or conceptual-to-empirical. Since the data are presented as text descriptions, the approach is empirical-to-conceptual. Fig. 1 is a graphical depiction of the empirical-to-conceptual process for taxonomy creation.
The first step is preparatory and establishes the meta-characteristics of the taxonomy. In other words, it determines what the taxonomy includes, what it does not contain, and other meta-characteristics. This taxonomy does not include phishing or social engineering attacks since they originate externally, and the human is the victim. Verizon [5] separates human error activities from activities in response to an external threat; in other words, phishing and ransomware attacks are considered separate from pure human mistakes. According to Schlackl et al. [40], human factors can be part of the reason for a data breach and comprise social engineering attacks and human error. Social engineering attacks are generated by an external entity, while human error is internal and accidental.
The Verizon report [5] and its database are consistent with Schlackl et al.'s [40] analysis, which separates social engineering attacks from human error. Thus, this taxonomy only includes accidental human errors leading to breaches. The second step is also preparatory and determines the ending conditions. The ending criteria are listed in Table I, summarizing the work of Nickerson et al. [33]. Nickerson et al. [33] also state that the taxonomy should be concise, with the number of dimensions allowing the taxonomy to be meaningful without being unwieldy or overwhelming. A heuristic for this condition is ensuring that the number of dimensions falls in the range of seven plus or minus two [41].
The next steps are the mechanics of building a taxonomy tree. Fig. 2 shows the entire process, including creation, validation, and maintenance. Each step will add more data, 100 cases at a time. The first two groups are taxonomy creation and validation, with the last grouping for maintenance. Validation ensures that no changes are needed from the dataset and that the taxonomy describes all the data. The difference between the maintenance and validation phases is that minor lower-level changes are expected during maintenance. In other words, new stems and leaves can be added, but the top two levels of the taxonomy should stay the same, or a redesign is needed. This approach is similar to maintenance for the biological taxonomy by Linnaeus [42], with new species added but the fundamentals of the taxonomy persisting. A redesign means a taxonomy should be created from the beginning with new data, creating new branches or a new tree.
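The maintenance rule described here (lower-level additions are acceptable, changes to the top two levels call for a redesign) can be sketched as a comparison of two taxonomy snapshots. The nested-dictionary layout, the function names, and the leaf entries below are assumptions chosen for illustration; only the Lost Media subcategory names come from the paper.

```python
def top_two_levels(taxonomy):
    """Reduce a {category: {subcategory: [leaves]}} tree to its top two levels."""
    return {category: frozenset(subs) for category, subs in taxonomy.items()}

def needs_redesign(baseline, updated):
    """True when categories or subcategories changed; new leaves are allowed."""
    return top_two_levels(baseline) != top_two_levels(updated)

baseline = {
    "lost media": {
        "lost storage devices": ["usb drive"],      # leaf is hypothetical
        "lost paper": ["printed records"],          # leaf is hypothetical
    },
}

# Maintenance case: a new leaf under an existing subcategory.
with_new_leaf = {
    "lost media": {
        "lost storage devices": ["usb drive", "backup tape"],
        "lost paper": ["printed records"],
    },
}

# Redesign case: a new top-level category appears.
with_new_category = {**with_new_leaf, "hypothetical category": {}}

leaf_change = needs_redesign(baseline, with_new_leaf)
category_change = needs_redesign(baseline, with_new_category)
```

A check like this only reports that the upper structure moved; deciding whether the change is justified still requires the analyst's judgment.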
Word clouds, word frequency counts, and a cluster dendrogram will create the initial tree, with later iterations adding subgroups by identifying subsets of objects (i.e., forming the taxonomy tree), delineating common characteristics and grouping objects (i.e., creating the tree branches), and grouping characteristics into dimensions (i.e., developing the tree stems, with individual activities as leaves). As required, an agglomerative coefficient will be measured to determine the strength of the clustering in the dendrogram, with a value close to one implying strong clustering. Using this coefficient will help delineate the more subtle linkages. The last step is to decide whether the ending conditions are met and iterate until they are achieved. The ending conditions are summarized in Table I.
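The agglomerative coefficient is commonly defined as follows: for each observation, take the height at which it first merges into a cluster, divide by the height of the final merge, and average one minus that ratio over all observations. The Python sketch below implements that definition with a naive bottom-up clustering. It is an assumption that the study's R scripts computed the coefficient this way; the function name and sample points are hypothetical, and the Ward update shown (Lance-Williams on unsquared Euclidean distances) is one of several conventions, so values may differ from a particular R function.

```python
import math

def agglomerative_coefficient(points, method="ward"):
    """Agglomerative coefficient of a bottom-up clustering of `points`."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two observations")
    size = {i: 1 for i in range(n)}                      # cluster id -> member count
    dist = {frozenset((i, j)): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}  # pairwise Euclidean
    first_merge = {}                                     # observation -> first merge height
    next_id, height = n, 0.0
    while len(size) > 1:
        pair = min(dist, key=dist.get)                   # closest pair of clusters
        a, b = tuple(pair)
        height = dist.pop(pair)
        for cluster in (a, b):
            if cluster < n:                              # still a single observation
                first_merge[cluster] = height
        na, nb = size.pop(a), size.pop(b)
        for k, nk in size.items():
            dak = dist.pop(frozenset((a, k)))
            dbk = dist.pop(frozenset((b, k)))
            if method == "single":
                merged = min(dak, dbk)
            elif method == "complete":
                merged = max(dak, dbk)
            elif method == "ward":                       # Lance-Williams update
                total = na + nb + nk
                squared = ((na + nk) * dak**2 + (nb + nk) * dbk**2
                           - nk * height**2) / total
                merged = math.sqrt(max(squared, 0.0))
            else:
                raise ValueError(f"unknown method: {method}")
            dist[frozenset((next_id, k))] = merged
        size[next_id] = na + nb
        next_id += 1
    if height == 0:
        raise ValueError("all observations are identical")
    return sum(1 - first_merge[i] / height for i in range(n)) / n

# Hypothetical one-dimensional data with two tight, well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
coefficients = {m: agglomerative_coefficient(points, m)
                for m in ("single", "complete", "ward")}
```

Because every observation in this toy data joins its neighbor at a small height relative to the final merge, all three coefficients are close to one. The loop is quadratic per merge and is meant for reading, not for datasets of the size used in the study.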

Population and Sample
The target population for this study is the publicly available data breach databases from Verizon [43], the State of California Department of Justice [44], and the US Department of Health and Human Services (HHS) [45]. Each database is available for research, is continually updated, and contains all the reported breaches for events affecting more than 500 users. The Verizon database contains over 10,000 entries, the HHS database approximately 5000, and the California database around 3500. The HHS database focuses on healthcare breaches, while the Verizon database considers all breaches. The California database is only concerned with California residents. A subset of the databases concerns human error. The Verizon report [5] shows almost 20% of breaches involve human error. Thus, the estimated population is approximately 3700 entries.
However, breach descriptions may not be sufficiently detailed to determine the root cause of the breach event, and the databases may overlap. The goal was to use 1000 entries to derive the initial taxonomy tree and then validate the taxonomy with 500 new events. Therefore, the sample size is 1500 database entries, with 1000 entries used for artifact creation and 500 entries used for validation. Another 350 entries were used to demonstrate artifact maintenance, resulting in a need for 1850 entries. These data were sufficient to create, validate, and maintain the taxonomy.
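The population estimate and sample sizes quoted above follow from simple arithmetic, reproduced in the short Python sketch below. The variable names are illustrative, and the database sizes are the rounded figures given in the text, so the result is an approximation.

```python
# Approximate database sizes quoted in the text.
database_entries = {"Verizon": 10_000, "HHS": 5_000, "California": 3_500}
human_error_percent = 20  # "almost 20%" per the Verizon report [5]

estimated_population = sum(database_entries.values()) * human_error_percent // 100

# Entries allotted to each phase of the study.
phase_entries = {"creation": 1_000, "validation": 500, "maintenance": 350}
sample_size = phase_entries["creation"] + phase_entries["validation"]
total_needed = sum(phase_entries.values())
```

The 1850 entries needed are about half of the estimated 3700, which leaves room for descriptions that are too sparse to classify and for overlap between the databases.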

Definitions
Dendrogram: A dendrogram is a graphical representation of a hierarchical clustering technique. It is typically plotted as a tree [46].
Hierarchical Clustering: Forina et al. [46] state that hierarchical clustering is used for unsupervised pattern recognition and consists of visual techniques, hierarchical methods of agglomerative and divisive techniques, and non-hierarchical methods. Hierarchical clustering often uses Euclidean distances to determine the clusters and is used as a machine-learning model.
Insider Threat: Tsiostas et al. [2] define an insider threat as someone who has an organization's credentials and permissions and acts in a way that harms the organization. The insider threat may be malicious or unintentional.
Malicious Insider Threat: A malicious insider threat is an entity that intentionally causes damage, including sabotage, intellectual property theft or disclosure, and release of proprietary or personal information [47]. They are motivated to do harm to an organization.
Ontology: In Computer Science, an ontology is a formal specification that defines the concepts and relationships that exist for an agent or a community of agents [15]. Guarino et al. [16] provided a more rigorous definition using modeling language. An example of how ontologies support artificial intelligence is described by Olivares-Alarcos et al. [17] in their discussion of comparing different ontology-based methods for robot autonomy. In Computer Science, an ontology describes the artificial world of behaviors, user stories, threads, schema, objects, interactions, and hierarchies needed to create an application. Ontologies are part of the Semantic Model for Artificial Intelligence [48].
Semantic Model: In Artificial Intelligence, a semantic model is one where the knowledge representation is a language with specific syntax and semantics. Semantic models include signature language-based, event embedding, and ontology learning [48].
Taxonomy: A taxonomy is a classification scheme with labels. It shows relationships between objects, with the relationship often displayed as a structure [14]. An example is the biological taxonomy developed by Carl Linnaeus [42], consisting of kingdom, phylum, class, order, family, genus, and species. A species can only exist in one hierarchical description, a key criterion for a taxonomy [33].
Unintentional Insider Threat: US Computer Emergency Response Team [4] defines an unintentional insider threat as an entity that causes harm or increases the risk of future damage to an information technology enterprise's confidentiality, integrity, or availability without malicious intent. This term is often referred to as an Accidental Insider Threat or Negligent Insider Threat. There is no intent to harm the organization. Under the right circumstances, all individuals can become unintentional insider threats. US federal government reports often abbreviate this term as UIT.

Results
Examples of human error were scraped from the State of California Department of Justice [44] and the US Department of Health and Human Services (HHS) [45] databases and copied into an Excel spreadsheet. The HHS and California databases were also used for validation since there were over 1800 entries in both databases. The total number of events captured from the HHS, California, and Verizon databases was 1850. The Verizon database and the remainder of the HHS and California databases were used to demonstrate a maintenance capability, using 350 events. The distribution of breaches caused by human errors was Gaussian, with their presence uniformly distributed. Human error events were approximately 20% of the scraped data.
For the creation of the taxonomy, 1000 entries were used from the California and HHS databases. R scripts created word frequency counts, word clouds, and cluster dendrogram plots using references from Paradis [49], the University of Cincinnati [39], and Silge and Robinson [37]. Cluster dendrograms were created using Euclidean distances, the classic distance measure [38]. The initial sparsity value was 0.85, which highlights the dominant connections.
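The text does not say how the sparsity value is applied. One common reading, assumed here, is that a term is dropped when the share of documents that do not contain it reaches the threshold, so that only widely used terms feed the dendrogram. The Python sketch below illustrates that reading with invented counts; the function names and data are hypothetical and do not reproduce the study's R code.

```python
def sparsity(row):
    """Fraction of documents in which the term does not occur."""
    return sum(1 for count in row if count == 0) / len(row)

def drop_sparse_terms(tdm, max_sparsity=0.85):
    """Keep only terms whose sparsity is below `max_sparsity`."""
    return {term: row for term, row in tdm.items() if sparsity(row) < max_sparsity}

# Hypothetical counts across ten breach descriptions.
tdm = {
    "laptop": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],   # absent from 7 of 10 documents
    "mailed": [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],   # absent from 8 of 10 documents
    "kiosk":  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],   # absent from 9 of 10 documents
}
kept = drop_sparse_terms(tdm)
```

Under this reading, lowering the threshold keeps fewer terms and yields a smaller, easier-to-read dendrogram, while raising it admits rarer terms.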

Five top-level categories of human error were chosen based on initial analysis: application problems, inappropriate data permissions, misconfigurations, lost media, and communications problems. The key metric is the agglomerative coefficient, which is a measure of clustering [50].
An agglomerative coefficient was calculated for each of these categories to show the strength of the clustering. The coefficient was calculated for each of the common hierarchical clustering methods: single, complete, and Ward's method [51]. The method with the coefficient closest to 1.0 was chosen for the dendrogram. In each case, Ward's method had the highest coefficient. Therefore, Ward's method was used to create the dendrogram. Noise words that did not contribute to the taxonomy process were removed by the R code. This process was iterative, using 100 cases on each iteration, creating ten passes through the original 1000 cases. Each pass created or modified the taxonomy design, using the stopping conditions defined in Table I [33] and recording the agglomerative coefficients. Word clouds, word frequency counts, and cluster dendrograms for the first 1000 events are presented for the Lost Media category in Figs. 3a-3c.
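The iteration scheme and the method selection rule can be sketched in a few lines of Python. Two assumptions are made: each pass is taken to be cumulative (100 more cases than the previous pass), and the coefficients shown are placeholders invented for illustration, not values from Table II. The function names are hypothetical.

```python
def cumulative_batches(cases, step=100):
    """Yield the first 100, 200, ... cases so each pass adds `step` more."""
    for end in range(step, len(cases) + step, step):
        yield cases[:end]

def pick_method(coefficients):
    """Return the linkage method whose coefficient is closest to 1.0."""
    return min(coefficients, key=lambda method: abs(1.0 - coefficients[method]))

cases = [f"case-{i}" for i in range(1000)]   # stand-ins for breach descriptions
passes = [len(batch) for batch in cumulative_batches(cases)]

# Placeholder coefficients, invented for illustration; not values from Table II.
example_coefficients = {"single": 0.41, "complete": 0.63, "ward": 0.74}
chosen = pick_method(example_coefficients)
```

In a full workflow, each batch would be text mined and clustered, the coefficient recorded, and the stopping conditions checked before the next 100 cases are added.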
Agglomerative coefficients were measured to determine the strength of the clustering in the dendrogram and within the category, with a measure close to one implying a strong relationship. The final agglomerative coefficient for each category is presented in Table II, showing that Ward's method is the preferred method of clustering. All clustering values are greater than 0.5, with the lowest value of 0.576 for the category Misconfigurations and the highest value of 0.787 for Application Problems.
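A threshold check of this kind is easy to automate. The Python sketch below uses only the two coefficients quoted in the text (the remaining Table II values are not reproduced here) and flags any category at or below the 0.5 level; the function name and the choice to treat 0.5 itself as failing are assumptions.

```python
THRESHOLD = 0.5  # level that all reported clustering values exceed

# Only the two category values quoted in the text are included here.
final_coefficients = {"Misconfigurations": 0.576, "Application Problems": 0.787}

def needs_review(coefficients, threshold=THRESHOLD):
    """List categories whose coefficient does not exceed the threshold."""
    return sorted(name for name, value in coefficients.items() if value <= threshold)

flagged = needs_review(final_coefficients)
weakest = min(final_coefficients, key=final_coefficients.get)
```

Neither quoted category is flagged. Tracking the weakest category across phases gives an early signal of where the taxonomy may later need to be split or revised.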
The most common approach to verifying a taxonomy is to determine whether it remains germane when applied to new data [32]. The final taxonomy was evaluated with 500 more entries from the California and HHS databases, resulting in a total of 1500 entries. The new data did not change the structure of the taxonomy, and all stop conditions established in Table I were met. There were no significant differences in the dendrograms. Another method to demonstrate taxonomy completeness is to generate an ontology. An ontology thread for email communications errors is shown in Fig. 4.
Data were added from the Verizon database to the remaining entries from the California and HHS databases to prove the concept of taxonomy maintenance. The Verizon database contained entries from Canada, the United Kingdom, France, India, South Korea, and Japan, as well as the United States. There were no changes to the major portion of the taxonomy other than to add a new type of accidental misconfiguration. The agglomerative coefficients for creation, validation, and maintenance are also presented in Table III.
The category for Application Problem changed the most with each phase. This is somewhat expected as this category is the most broadly defined with little commonality among problem areas. As this subject area is more researched, it is likely that the category will evolve and possibly break into two or more categories. The category of Lost Media consists of lost CPUs, lost paper, lost storage devices, and lost smart devices. To demonstrate the clustering in the subcategories, a word cloud, word frequency counts, and a hierarchical dendrogram for lost storage devices are presented in Figs. 5a-5c. These data were derived from the maintenance database for Lost Media, highlighting the entries pertaining to lost storage devices. The agglomerative coefficient for this subcategory was calculated to be 0.709 using Ward's method. The Lost Storage coefficient is higher than the coefficient for all Lost Media, implying this subcategory is better clustered.
Fig. 4. Ontology thread for email, using taxonomy definitions. (Node labels recovered from the figure: Addresses; Email Groups; User Home Address; Allows Non-Organizational Mail; Hidden Data / Metadata; No Checks on Attachments.)

Taxonomy Description
Taxonomies are often displayed as trees or hierarchical tables [14]. A hierarchical table of the taxonomy after creation, validation, and maintenance is shown in Fig. 6. Three additional items were added in the validation phase (web links and problems with mailing storage devices), while two items were added in the maintenance phase (Zoom and ChatGPT). These are denoted in yellow and blue highlights for validation and maintenance, respectively. All the items added were at the sub-sub-category level, meaning the design of the taxonomy is appropriate. The taxonomy consists of five top-level categories: application problems, inappropriate data permissions, lost media, misconfigurations, and communications problems.
Defining three classes of actors (user, maintainer, developer) is helpful when looking ahead to an eventual ontology. A user is one with limited privileges. A maintainer ensures configuration settings are correct, while a developer writes code. The difference between a misconfiguration and an application problem is complexity; application problems include coding errors or privileged user activities. Table IV provides a mapping between the taxonomy categories and actor classes.
Application problems consist of software errors on websites and mobile applications that leak data. The cause is developer error. A major area of concern is business analytics and digital tracking, which are often embedded on a website, as these functions can cause data leaks. The integration of mobile applications and web applications was also observed to lead to software problems, causing data breaches. JavaScript Object Notation (JSON) is widely used and allows easy, text-readable data transport. The text-readable feature can also create accidental data leakage. The application coder may accidentally leave admin privileges open, leaking data. The application coder may use third-party software that leaks data. Lastly, because the types of software errors are effectively unbounded, there is a miscellaneous category for other types of application errors.
Communication errors include problems with physical mail, email, unapproved communication methods, and printer/fax errors. The significant problems with physical mail are inappropriate recipients, labels containing sensitive information such as PII, and using postcards where the content is sensitive. Examples are mass mailing to the wrong individuals, mailing labels that contain social security numbers, and postcards that discuss medical procedures. Physical mail can also contain storage media that is sent to the wrong addresses or contains sensitive hidden data that the sender did not perceive. Email can have wrong addresses, wrong attachments, incorrect use of the blind copy function, or no encryption, or it can be accidentally sent to a home address. For example, it is common for organizations to use a standard corporate email address (firstname.lastname@company.com). It is also common for users to use a similar standard for their home address (firstname.lastname@gmail.com). Confusing these two addresses is to be expected and can cause data breaches. Enterprises also control forms of communication for their sensitive data and disallow the use of other applications such as WhatsApp and cell phone text messages. Printers and faxes can also have the wrong address, with printer addressing for remote work being especially problematic.
Inappropriate data permissions occur when an entity is granted privileges that it should not have. This category is for problems within an enclave or enterprise. An example would be an employee who transfers to another position
Misconfigurations occur when data accidentally leak to the public as a result of relatively simple configuration problems. There are two large sub-categories: unsecured public databases within the cloud and an organization accidentally making its internal databases public. There are numerous examples of unsecured databases and misconfigured Amazon S3 buckets in the cloud. Organizations can accidentally make their data public by posting data on a public repository such as GitHub or Docker, providing data to an AI tool such as ChatGPT or Google Bard, placing files on a publicly accessible server, creating web links that allow public access to servers, and misconfiguring FTP sites. Other sub-categories include incorrect or missing firewall and router settings, making a test or training platform publicly accessible but with actual data, and unsecured remote employee connections.

Summary
The concept of using text mining tools for taxonomy development is practical and repeatable. Word clouds and frequency counts can establish high-level taxonomy categories. An agglomerative coefficient can quantitatively measure how much a subject area is clustered and can be used to provide metrics for when a clustering area may need revisions. As this is a relatively new area of

Conclusion
An advantage of categorizing unintentional insider threat activities is the possibility of creating tailored mitigation strategies. This paper demonstrated that a taxonomy of unintentional insider threat activities can be created, validated, and maintained using text mining and hierarchical clustering analysis techniques. Because the method is general, the approach could be extended to other fields that require taxonomies. A standard taxonomy is necessary to create an ontology, enabling the development of artificial intelligence and machine learning tools. Ward's agglomeration coefficient can be used, with values greater than 0.5 taken to indicate that the clustering is valid. This approach removes some of the subjectivity of taxonomy creation and allows taxonomies to be maintained.

TABLE I: Ending Criteria for Taxonomy Creation

TABLE II: Agglomerative Coefficients for Each Category by Type
Noise words removed by the R code include months, medicine, and state names.

TABLE III: Agglomerative Coefficients for Each Category by Process

TABLE IV: Actor Class Mapped to Subcategory

TABLE V: Data Breach Incidents for Each Category by Database
There were several instances of websites accidentally capturing sensitive data, particularly on medical websites. Software tools, such as cloud-based transcription services, are potentially problematic and can create data spills. An emerging problem for inappropriate permissions was training or testing with real data rather than anonymized data. Unfortunately, when the test or training platform is inadequately protected, any data it holds are spilled. The latest problem area for lost media is smart devices. These devices may capture significant amounts of data, with potentially damaging consequences if lost. They are also typically unencrypted. As devices become more intelligent, this will become an area of concern. The emerging area of concern for misconfigurations is when public repositories or artificial intelligence tools (e.g., GitHub and ChatGPT) are used for development or problem-solving, leading to live data being accidentally spilled. In the communications problem area, modern text messaging systems such as WhatsApp and Signal are emerging areas of concern, as these systems are often unmanaged. These newer systems allow large file transfers over systems that are not necessarily secure. Data spills are already occurring. Other forms of communication (e.g., Slack, Teams, Dropbox, and other corporate forums) are also potential areas of concern.