Why Is Data Analysis Important for the Blockchain Industry?
Author: Riley Shu (ArcBlock Data Engineer)
It might not be intuitive to connect "Blockchain" with "data analysis". Why does a Blockchain start-up like ArcBlock need data analysis? What kind of data does it have to analyze? Today we will share our perspective on the value of on-chain data analysis and take a peek into ArcBlock's data pipeline. Data is not only part of ArcBlock's core products and services but also the foundation for every decision we make.
First, we should align on a definition of data: data is a set of values of qualitative or quantitative variables, derived from observation, experiment, or calculation. In general, data is the trace people leave behind, containing detailed information about their behavior and activities. Collecting and analyzing data enables us to raise questions that have never been asked, and to answer them. Currently, ArcBlock works with three kinds of data: on-chain data, product data, and system logs.
At ArcBlock, data analysis covers three areas:
- On-chain Data Analysis
- Product & User Data Analysis
- Security Auditing
Immutability is one of the core features of Blockchain: once a transaction is recorded on the chain, there is no way to modify or revert it. The Blockchain is like an open ledger available to everyone: as long as you have an address, you have access to the complete history of all the transactions that address was involved in.
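Conceptually, the ledger can be treated as an append-only list of transactions, and an address's complete history is just a filter over that list. The sketch below illustrates the idea with made-up records; the field names and sample data are hypothetical, not a real chain's schema.

```python
# Illustrative sketch: recover one address's complete history from an
# append-only list of transactions. Sample data is made up.

transactions = [
    {"from": "addr_A", "to": "addr_B", "value": 5},
    {"from": "addr_B", "to": "addr_C", "value": 2},
    {"from": "addr_C", "to": "addr_A", "value": 1},
]

def history(address, txs):
    """Return every transaction the address was involved in, in order."""
    return [t for t in txs if address in (t["from"], t["to"])]

# addr_B appears as sender or receiver in the first two transactions.
addr_b_history = history("addr_B", transactions)
```

On a real chain the same filter would run over indexed transaction data rather than an in-memory list, but the access pattern is the same.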
Since 2017, the Bitcoin price has climbed sharply, leading more and more people to use it as a trading platform and to explore other possibilities with Blockchain. This surge of interest has enriched both the type and the content of Bitcoin data. Since its launch in 2009, Bitcoin has stored about 1.7 billion transactions.
Even the younger Ethereum, launched in 2015, has already stored more than 6 million blocks, and a new block full of valuable data is produced roughly every 12 seconds.
These data represent users' financial behavior and carry a large amount of information, so many interesting questions can be asked. Where does the money come from? Where does it go? What does the entire money flow look like? Which addresses behave abnormally?
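The "where does the money go" questions reduce to bookkeeping over the transfer list: summing inflow and outflow per address. A minimal sketch, with made-up transfers (a real analysis would read indexed on-chain data):

```python
from collections import defaultdict

# Sketch: compute per-address money flow from a list of transfers.
# Sample transfers are illustrative only.

transfers = [
    {"from": "A", "to": "B", "value": 10.0},
    {"from": "B", "to": "C", "value": 4.0},
    {"from": "A", "to": "C", "value": 1.0},
]

def net_flow(txs):
    """Map each address to a (total inflow, total outflow) pair."""
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for t in txs:
        outflow[t["from"]] += t["value"]
        inflow[t["to"]] += t["value"]
    addresses = set(inflow) | set(outflow)
    return {a: (inflow[a], outflow[a]) for a in addresses}

flows = net_flow(transfers)
```

Addresses with an unusual inflow/outflow profile (for example, far more outflow than inflow, or sudden spikes) are natural candidates for the "abnormal behavior" question above.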
Why, then, is the market not flooded with analysis reports based on these publicly available data? One explanation is that, since Blockchain is decentralized, the public has little incentive to produce such reports; another is that analyzing Blockchain data requires knowledge of complex data structures and algorithms, which can make the data intimidating to dig into.
To better analyze the data, we have indexed the data on Bitcoin and Ethereum and run real-time listeners to keep it up to date. In previous articles, we introduced how we indexed on-chain data. For processing, we use AWS Kinesis to stream the data, convert it to Parquet format, and store it on S3. Meanwhile, the data is also fed into our data pipeline, where we perform cleaning and aggregation with Apache Spark and send the aggregated results to Redshift, ready for further queries and visualizations.
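In spirit, the cleaning-and-aggregation stage looks like the sketch below. To keep the example self-contained it is written in plain Python rather than Spark, and the record fields and sample data are hypothetical; in production the equivalent logic runs in Apache Spark over Parquet files on S3.

```python
from collections import Counter
from datetime import datetime, timezone

# Stand-in for the Spark cleaning/aggregation stage: drop malformed
# records, then count transactions per UTC day. Field names and sample
# records are hypothetical.

raw_records = [
    {"hash": "0xaa", "timestamp": 1525132800, "value": 3},
    {"hash": "0xbb", "timestamp": 1525132900, "value": 7},
    {"hash": None,   "timestamp": 1525219200, "value": 1},  # malformed: missing hash
    {"hash": "0xcc", "timestamp": 1525219200, "value": 2},
]

def clean(records):
    """Keep only records with both a transaction hash and a timestamp."""
    return [r for r in records if r["hash"] and r["timestamp"]]

def daily_counts(records):
    """Aggregate: number of transactions per UTC day."""
    day = lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
    return Counter(day(r["timestamp"]) for r in records)

counts = daily_counts(clean(raw_records))
```

The aggregated table (day, count) is what gets loaded into Redshift, which is far cheaper to query repeatedly than the raw transaction stream.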
Right now we use Apache Superset to visualize the DAU (Daily Active Users) and MAU (Monthly Active Users) of Ethereum. In the future, we will add more visualizations to our collection of on-chain analyses, such as user retention rate and the number of cold wallets.
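The DAU metric itself is simple: the number of distinct active addresses per day. A minimal sketch with made-up activity events (in our pipeline the equivalent query runs against the aggregated tables in Redshift):

```python
from collections import defaultdict

# Sketch of the DAU metric: distinct active addresses per day.
# Activity events are illustrative only.

activity = [
    ("2018-05-01", "addr_A"),
    ("2018-05-01", "addr_B"),
    ("2018-05-01", "addr_A"),  # duplicate activity, counted once
    ("2018-05-02", "addr_B"),
]

def dau(events):
    """Map each day to the number of distinct addresses active that day."""
    users_per_day = defaultdict(set)
    for day, addr in events:
        users_per_day[day].add(addr)
    return {day: len(addrs) for day, addrs in users_per_day.items()}

metrics = dau(activity)
```

MAU follows the same pattern with the distinct-address set taken over a month (or a rolling 30-day window) instead of a single day.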
The key to better service is understanding users. Many decisions have to be made while developing a product, such as which new feature to prioritize and which website layout provides the best user experience. The better we understand users, the more likely we are to make the right decisions in solving their problems. At ArcBlock, every decision-making process should be traceable, so we let data guide us with confidence, because data reflects what users are actually thinking: data never lie.
Because data matters so much, ArcBlock has paid close attention to data infrastructure from the very beginning. The user & product data pipeline fetches real-time user data, backs it up on S3, and performs aggregation with Spark. Once the calculation is done, the pipeline sends the data to Redshift for further analytical work. This pipeline is similar to the on-chain data pipeline, and as we develop more features it will keep being optimized.
Blockchain serves as the infrastructure for a new generation of networks, and security is always the first concern. Besides carefully designing our network security infrastructure, we also use data analysis to detect abnormal user behavior and stop malicious activity before it causes severe damage.
All system logs are stored on S3. The analysis engine compares historical data with current data and uses machine-learning models to detect anomalies in transactions. Once an anomaly is found, the system alerts the administrators and the suspicious operation is automatically suspended.
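A toy version of the "compare history with current" idea: flag a value as anomalous when it deviates from the historical mean by more than a few standard deviations. The real engine uses learned models over system logs; the data and the threshold below are purely illustrative.

```python
import statistics

# Toy anomaly check: flag values far from the historical baseline.
# Baseline values and the 3-sigma threshold are illustrative only.

baseline = [100, 98, 103, 101, 99, 102, 97, 100]  # e.g. past transaction rates

def is_anomalous(value, history, threshold=3.0):
    """True if value deviates from the historical mean by more than
    `threshold` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > threshold * stdev

flag_normal = is_anomalous(101, baseline)  # close to the baseline
flag_spike = is_anomalous(500, baseline)   # far outside the baseline
```

In the real system, a flagged value would trigger the alert-and-suspend path described above rather than just returning a boolean.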
Today we briefly introduced the three ways ArcBlock uses data. When we talk about Blockchain's past and future, whether for existing questions or questions that have not yet been raised, the data always holds the answer. All we need to do is collect the information and let it reveal the truth.