PSALM – PATENT MINING TOOL FOR COMPETITIVE INTELLIGENCE

Original scientific paper

Patent documents are a valuable source of information. However, it is neither easy to extract useful information from patents nor simple to keep track of all patents that may be relevant. This paper describes PSALM (Patent Search and Analysis for Landscaping and Management), a recently developed software tool for competitive intelligence based on patent data. PSALM transforms raw patent data into meaningful and useful information for business decision making. The tool is based on a MySQL database and a web robot, both supported by routines developed in Java and PHP. PSALM assembles patent data from publicly available databases, collects and analyses bibliographic parameters of patents, and also performs text mining and clustering. The objective of this paper is to describe the structure and functions of the developed software, to show the efficiency and accuracy of its modules (text processing, clustering, visualisation), and to demonstrate its usability through an in-depth case study.


Introduction
A patent is a complex legal instrument and a powerful business tool. Based on patents, companies can gain a monopoly position in the market, block and disadvantage competitors, attract investors or make additional profits through licensing. At the same time, patents are a unique and valuable source of information. WIPO and EPO estimate that approximately 80 % of the scientific and technical information disclosed in patents is never published in any other form [1,2]. In addition to technical data, a patent document provides a lot of information relevant for legal, business and public policy use (Tab. 1 summarises the format and information content of patent documents). The validity of this information is reinforced by the fact that all data found in a patent document is collected, verified and presented in a systematic manner according to internationally agreed standards. Patents therefore offer a full spectrum of possibilities for use in key areas of competitive intelligence and technology management, including [3]: competitor monitoring, observation of technology trends, identification and assessment of potential partners, and R&D portfolio management. Researchers and inventors, R&D managers, patent professionals, entrepreneurs and policy-makers are interested in using this information in strategic decision making, as well as in everyday operations.
However, doing so is neither easy nor simple. There were 9,45 million patents in force in 2013 [4], with an increasing number of pages and claims per patent, difficult language and unclear relations between patents. To overcome these barriers, various software tools have been developed in the patent field [5,6]. They can analyse patent portfolios, compute basic statistics, and visualize, map and landscape patent data. Most of these tools use statistical methods to analyse patent data and represent patent trends with various graphs and tables. They provide various features and representations for researchers, managers and R&D specialists. However, most of the patent databases and tools available today are expensive, complex or require strong expertise in the field of intellectual property. Therefore, SMEs and academic institutions, especially in developing countries, do not take full advantage of patents as a source of information for their own research, market and innovation activities [2]. Responding to this challenge, our research group has developed a tool for patent data analysis and management.
The software tool is named Patent Search and Analysis for Landscaping and Management (PSALM). It allows fast access to free patent databases and provides an easy way to analyse patents automatically. PSALM is designed to collect and analyse both structured (bibliographic parameters) and unstructured (free text) patent data and to visualize the results of both analyses.
The objective of this paper is to describe the structure and functions of the developed software, to show its clustering efficiency and accuracy, and to demonstrate its usability through a case study. Different features of the tool were presented and tested in several conference papers [7,8,9,10]. However, this is the first holistic presentation of PSALM's structure and functions, together with an in-depth application in a real-life case study.
The rest of the paper is organized as follows. Section 2 describes the structure and functional modules of PSALM, while Section 3 assesses their performance. The user interface is presented in Section 4. Implications for practice are discussed in Section 5. Finally, Section 6 gives a conclusion with a summary of results and outlines further research.

Web Robot
The front end [...] "web robot". It collects [...] database, namely [...] European Patent [...] robot [7] [...]
The importance of words within each patent document, derived by the TF-IDF method, is used to create the dissimilarity matrix. The dissimilarity matrix given by (4) is a square distance matrix in which $\delta_{i,j}$ is the distance function that measures the level of dissimilarity between a pair of patent documents, calculated using the cosine of the vectors assigned to each patent document, and $I$ is the number of patent documents in the considered case.
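The construction of a dissimilarity matrix from TF-IDF weights can be sketched as follows. This is a minimal illustration, not the PSALM implementation: the TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and the use of 1 minus the cosine as the distance are assumptions, and the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Assumed weighting: relative term frequency times log(N / df).
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_dissimilarity(u, v):
    """Return 1 - cos(u, v); returns 1.0 if either vector is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dissimilarity_matrix(docs):
    """Square matrix of pairwise cosine dissimilarities between documents."""
    vectors = tfidf_vectors(docs)
    return [[cosine_dissimilarity(u, v) for v in vectors] for u in vectors]


# Hypothetical toy "patents", already tokenised.
docs = [
    "audio coding signal".split(),
    "audio signal transmission".split(),
    "video decoding compression".split(),
]
delta = dissimilarity_matrix(docs)
```

In this sketch, documents that share no terms receive a dissimilarity of 1, while documents that share weighted terms receive a smaller value.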
This high-dimensional matrix is transformed into a space of much lower dimensionality, preserving the original structure as closely as possible, using the Multi-Dimensional Scaling (MDS) scheme [12]. The goal of MDS is to find vectors $x_1, \dots, x_I$ such that
$$\lVert x_i - x_j \rVert \approx \delta_{i,j} \quad \text{for all } i, j \in \{1, \dots, I\}.$$
The output of the MDS module is a 2-dimensional matrix that can easily be used to present patent similarities visually.
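A 2-dimensional embedding of a dissimilarity matrix can be sketched with the SMACOF iteration, one common way of solving the metric MDS problem. The text does not state which MDS solver PSALM uses, so the choice of SMACOF, the function names and the toy matrix below are all assumptions made for illustration.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def stress(delta, points):
    """Raw stress: squared mismatch between embedded and target distances."""
    d = pairwise_distances(points)
    n = len(points)
    return sum((d[i][j] - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof_2d(delta, iterations=200, seed=0):
    """Embed the objects described by `delta` into 2-D with SMACOF updates."""
    n = len(delta)
    rng = random.Random(seed)
    x = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        d = pairwise_distances(x)
        new_x = []
        for i in range(n):
            row = [0.0, 0.0]
            for j in range(n):
                if i != j and d[i][j] > 0.0:
                    # Guttman transform with unit weights.
                    ratio = delta[i][j] / d[i][j]
                    row[0] += ratio * (x[i][0] - x[j][0])
                    row[1] += ratio * (x[i][1] - x[j][1])
            new_x.append([row[0] / n, row[1] / n])
        x = new_x
    return x


# Hypothetical dissimilarities: objects 0-1 and 2-3 form two close pairs.
delta = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 0.9, 0.3, 0.0],
]
coords = smacof_2d(delta)
```

Each SMACOF step does not increase the stress, so the returned coordinates fit the target dissimilarities at least as well as the random starting configuration. The result may still be a local rather than a global optimum.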

Clustering
The output of the text processing module is a 2-dimensional matrix that is further processed by the clustering module. Clustering is the partitioning of a set $O = \{O_1, O_2, \dots, O_n\}$ of objects into homogeneous clusters, maximizing intra-cluster similarity while minimizing inter-cluster similarity. Clusters are formed without any prior information about the objects being grouped. Any labels associated with the objects are obtained solely from the data.
Clustering helps to identify meaningful patterns and undetected or unexpected groups in a set of unlabelled objects [13]. An unsupervised clustering technique groups a given unlabelled collection of patent documents into meaningful clusters without any prior information about the documents. As the number of patents increases and the volume of data grows, it becomes practically impossible to analyse a set of patents successfully without clustering. Clustering is therefore an essential function that any patent analysis tool should provide.
Four unsupervised learning algorithms are analysed: k-means, fuzzy c-means, neural gas and the re-organising neural network (ronn). Because of their importance, their main features are briefly explained in the following text and later tested in an experimental setting.

k-means clustering
k-partitioning attempts to detect k optimal clusters by an iterative relocation method based on an optimization function. The most popular k-partitioning method is k-means clustering [14]. k-means clustering partitions the n observations into k clusters, where each observation belongs to the cluster with the nearest mean. It uses the mean (centroid) as the representative of a cluster. One of the main strengths of k-means clustering is its scalability. The k-means clustering problem is NP-hard; however, efficient approximate solutions exist.
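The iterative relocation idea can be sketched with Lloyd's algorithm, the usual heuristic for k-means. This is an illustrative sketch rather than the Matlab function used in the experiments. The random initialisation from the data points, the function name and the toy 2-dimensional points are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    # Assumed initialisation: k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D coordinates, e.g. the output of an MDS step.
points = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.2),
          (50.0, 50.0), (51.0, 50.5), (50.3, 51.2)]
labels, centroids = kmeans(points, k=2)
```

The result depends on the initial centroids, which is why practical implementations often run several random restarts and keep the partition with the lowest within-cluster distance.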

Fuzzy c-means clustering (fcm)
Fuzzy c-means clustering is very similar to k-means. The difference is that each object has a degree of membership in every cluster, rather than belonging to just one cluster. Thus, objects on the edge of a cluster belong to it to a lesser degree than objects in its interior. Fuzzy c-means is an important algorithm in image processing, where it is used for clustering objects inside an image [15]. A complete overview and comparison of fuzzy clustering methods is presented in [16].
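The degree-of-membership idea can be sketched with the standard fuzzy c-means update rules. This is an illustration under assumptions, not the function used in the experiments: the fuzzifier m = 2, the optional `initial_centers` argument, the function name and the toy points are all hypothetical choices.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iterations=100, seed=0,
                  initial_centers=None):
    """Return (memberships, centers); memberships[j][i] is the degree to
    which point j belongs to cluster i, and each row sums to 1."""
    rng = random.Random(seed)
    centers = [list(p) for p in (initial_centers or rng.sample(points, c))]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: closer centres receive a larger share.
        memberships = []
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centers]
            if min(dists) == 0.0:
                # The point coincides with a centre: full membership there.
                zero = dists.index(0.0)
                row = [1.0 if i == zero else 0.0 for i in range(c)]
            else:
                row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                       for d_i in dists]
            memberships.append(row)
        # Centre update: mean of all points weighted by membership ** m.
        for i in range(c):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            if total > 0.0:
                centers[i] = [
                    sum(w * p[dim] for w, p in zip(weights, points)) / total
                    for dim in range(len(points[0]))
                ]
    return memberships, centers


# Hypothetical 2-D points forming two groups.
points = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.2),
          (50.0, 50.0), (51.0, 50.5), (50.3, 51.2)]
memberships, centers = fuzzy_c_means(
    points, c=2, initial_centers=[points[0], points[3]])
```

A hard clustering comparable to k-means can be obtained by assigning each point to the cluster in which its membership is largest.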

Re-Organising Neural Network (ronn)
The Re-Organising Neural Network (ronn) algorithm is an iterative learning procedure. It iteratively adjusts node coordinates in the manner of the k-means algorithm until the nodes become relatively stable in their current positions. At the same time, it shifts nodes that turn out to be dead nodes into better positions. A complete overview of the ronn algorithm is presented in [17].

Neural Gas
Neural gas is an unsupervised learning technique that allows the uniform placement of representative prototypes in the vector space. The algorithm determines the prototypes in such a way that the Euclidean distance between data vectors and prototype vectors is minimal. Neural gas is a relatively simple algorithm for finding optimal data representations and is a robust alternative to k-means clustering. It is widely used where data compression is an issue, as in image processing or pattern recognition [18]. A detailed description of the algorithm is given in [19].
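The prototype adaptation can be sketched with the usual rank-based neural gas update, in which every prototype moves towards a randomly drawn data vector by an amount that decays with its distance rank. The learning-rate and neighbourhood schedules, their default values, the function name and the toy points are assumptions chosen for illustration and are not taken from [19].

```python
import math
import random


def neural_gas(points, n_prototypes, steps=2000,
               eps_start=0.5, eps_end=0.01, lam_end=0.01, seed=0):
    """Place prototypes with rank-based neural gas updates."""
    rng = random.Random(seed)
    prototypes = [list(p) for p in rng.sample(points, n_prototypes)]
    lam_start = n_prototypes / 2.0
    for t in range(steps):
        frac = t / steps
        # Assumed schedules: exponential decay of step size and neighbourhood.
        eps = eps_start * (eps_end / eps_start) ** frac
        lam = lam_start * (lam_end / lam_start) ** frac
        x = rng.choice(points)
        # Rank prototypes by distance to the sampled data vector.
        ranked = sorted(prototypes, key=lambda w: math.dist(x, w))
        for rank, w in enumerate(ranked):
            step = eps * math.exp(-rank / lam)
            for dim in range(len(w)):
                # In-place update: `ranked` holds the same list objects
                # as `prototypes`.
                w[dim] += step * (x[dim] - w[dim])
    return prototypes


# Hypothetical 2-D points forming two groups.
points = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.2),
          (50.0, 50.0), (51.0, 50.5), (50.3, 51.2)]
prototypes = neural_gas(points, n_prototypes=2)
```

Unlike k-means, every prototype is updated at each step, which makes the result less sensitive to the initial placement. Assigning each data vector to its nearest prototype then yields the clusters.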

Visualization
This module is responsible for visualizing the data and/or exporting the results. Patent data is processed with data and text mining techniques in order to present it in a visual form that makes it easier to understand and to interact with [20]. The clustered patent data space can be presented and visualized with respect to various contexts.
The tool enables visualization of both high- and low-dimensional data. High-dimensional data are visualized by mapping patents and clusters in proportion to each other in 2D space. This makes it very easy to locate the most developed areas in certain technologies. It also reveals outliers, i.e. patents that do not have much in common with the subject. Low-dimensional (structured) data, presented as bar charts and pie charts of bibliographic data, can also help in better understanding technology areas, changes in technology development, company competitiveness, etc.

PSALM performance assessment

Web robot performance assessment
In the first phase, the performance of the web robot was assessed using several patents with different data available on the "front page". The following patents were tested: US7962846 (patent with standard front page data), US7919816 (patent with additional field: Foreign Application Priority Data), US7962825 (patent with additional fields: Related U.S. Patent Documents and Parent Case Text), D503691 (US design patent with nonstandard data), and D254200 (US design patent with standard data).
Program execution and the time needed to acquire, parse and write data into the database were analysed, together with statistics on the amount of data written to the database. The speed of the Internet connection is an important factor in assessing program performance because it has the greatest effect on the speed of program operation. In the test case, the download speed was 6 Mbps and the upload speed was 0,36 Mbps. The amount of data to be processed and the time needed for processing are shown in Tab. 2 and Tab. 3.
In the second phase, the performance of the web robot was assessed using a portfolio of 1820 selected US patents. Statistical data on downloading the patent portfolio are presented in Tab. 4, while the average download time for the same portfolio is shown in Tab. 5.
The average time is calculated using unique patents only, i.e. patents that were actually downloaded and written into the database.Patents that already existed in the database (duplicates) were not downloaded but just linked with the current search.

Clustering performance assessment
To test clustering accuracy, a dataset of 72 US patents is used. All of the patents originate from the same company and cover the field of consumer electronics. Five experienced engineers clustered these patents into the following four groups: Audio, MPEG, Mobile Phone and TV. The Audio group consists of 16 patents related to audio coding, audio signal transmission, audio processing, wide-band audio coding and techniques related to audio editing and trick play features. These inventions could be implemented in audio home entertainment equipment, mobile devices and portable gadgets such as MP3 players. The MPEG group consists of 29 patents related to various optimization techniques for hardware and software video decoding, creating multi-streams of compressed video data, increasing image compression efficiency, improving error concealment, extracting coding parameters and improving the quality of scalable coding techniques. The Mobile Phone group consists of 15 patents related to call re-establishment and call transfer in telecommunication networks, software defined radio, signal filtering and equalization. The last group is the TV group, which consists of 12 patents related to image sharpness enhancement.
In order to test the accuracy of the clustering algorithms and select the best-performing one for patent data, the four clustering algorithms described above (k-means, neural gas, fuzzy c-means and ronn) are compared. Although artificial intelligence and machine learning have improved significantly over the last few years, patent analysis and clustering by human experts remains the safest and most accurate method. Therefore, the results of the clustering techniques have been compared to the experts' results as well. To do that, the following methodology was adopted:
- for the selected dataset, text processing is performed on different subsets of patent data: abstract, claims, patent description and IPC codes;
[...]
[...] 4 Techn[...] the ownership [...] assigned to Apple [...]rosystems (Fig. [...] Figure 5 [...] Step two was [...] which belong to [...] related patents. [...] analysis [...] similar areas indicated that [...] Mobility patents.
The closeness of the detected (Motorola Mobility) patents and the litigated (Android-related) patents revealed that Motorola's patents are relatively well distributed and related to patents which can harm Google. From that point of view, those who argued that Google's decision to buy Motorola Mobility was rooted in its patent portfolio were right. Full support for this understanding of the acquisition came directly from Google's CEO Larry Page in January 2014, after Google sold the device maker to Lenovo. He said [25]: "We acquired Motorola in 2012 to help supercharge the Android ecosystem by creating a stronger patent portfolio for Google […] Motorola's patents have helped create a level playing field, which is good news for all Android's users and partners."

Conclusion
This paper described the structure and functions of the developed software. It demonstrated the efficiency and accuracy of two crucial functional modules, for patent collection and for clustering, and demonstrated the usability of PSALM through the Android case study.
PSALM is a software tool for competitive intelligence based on patent data. It transforms raw patent data into meaningful and useful information for business decision making. PSALM is a simple tool with good ergonomics. It enables easy patent search over a selected database on the Internet, and automatic download and storage of selected patents in a local database. The tool provides its users with automatic analysis, enabling them to visualize low- and high-dimensional patent data, and to save and print the analyses and reports. The real power of the tool lies in analysing portfolios with a large number of patents.
Patent data analysis will remain difficult and will continue to consume experts' time and manpower, but PSALM could help improve the correctness and timeliness of decision making in a competitive environment by providing useful information and focusing experts' time and effort on the most interesting and most promising patents. For example, based on PSALM results, it is easier to target weak technology areas and to group and select patents that could be interesting for the company.
The results presented in this paper were obtained with the current version of PSALM, and improvements are expected in the coming period. A beta version of PSALM is currently available from the company RT-RK Computer Based Systems, and the next version of the tool will be publicly available on a commercial basis. Further research will be directed towards improving text processing, using WordNet for comparing words in the text and SAO structures for text analysis. Future work will also concentrate on extending the test data set in order to further verify the results and improve the data mining techniques and the clustering and visualization modules. The main drawbacks of the tool at the moment are the time needed to download the required data from the USPTO web site into a local database, and the fact that the current version of the tool uses only US patents for analysis.

Figure 2 Enhanced Text Processing [...]
[...] and the document frequency in order to scale the values. The calculation of TF-IDF is shown in Eqs. (1), (2) and (3).

Table 2 Amount of data written to the database in the test case

Table 3 Processing time in the test case

Table 4 Statistical data on processing the patent portfolio with 1820 patents

Table 5 Processing times for the patent portfolio with 1820 patents

Table 6 D[...]
- after text processing is finished, clustering is performed (four different functions in Matlab have been developed, one for each clustering algorithm); [...]
22, 6(2015), 1433-1[...]