Data Engineering/Systems/AQS/Scaling/2020/Cluster Expansion
Expand the current AQS Restbase cluster with more hosts to grow the available storage space. The main reasons for growth are, first, to make sure the currently served data can continue to be handled, and second, to extend the service to serve historical pagecounts.
- Two Cassandra instances per node.
- Cassandra 2.2.6
Current State (2020-03-06)
The 6 current hosts handle about 19Tb of data, representing about 55% of the available space. Most (99.7%) of the storage is used by three tables (see this graph):
|table||storage taken (Tb)||storage taken (%)||Description|
|local_group_default_T_pageviews_per_article_flat.data||11.89||60.4%||4 years and 8 months of pageviews per-article daily|
|local_group_default_T_mediarequest_per_file.data||6.72||34.1%||5 years and 2 months of mediarequest per-file daily|
|local_group_default_T_top_pageviews.data||0.103||5.2%||4 years and 8 months of pageviews top-articles daily and monthly|
Based on those numbers, an approximate growth rate for our current datasets is an additional 4Tb per year.
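The figures above can be turned into a rough headroom estimate. The following is a minimal sketch, not part of the original planning: the function names are hypothetical, and it assumes linear growth and that all remaining space is usable, which is optimistic since Cassandra needs free disk space for compaction.

```python
def total_capacity_tb(used_tb: float, used_fraction: float) -> float:
    """Infer total cluster capacity from used space and the fraction it represents."""
    return used_tb / used_fraction


def years_until_full(used_tb: float, used_fraction: float,
                     growth_tb_per_year: float) -> float:
    """Years before the cluster fills up, assuming constant yearly growth."""
    free_tb = total_capacity_tb(used_tb, used_fraction) - used_tb
    return free_tb / growth_tb_per_year


if __name__ == "__main__":
    # Values quoted on this page: ~19Tb used, ~55% of space, ~4Tb growth per year.
    capacity = total_capacity_tb(19.0, 0.55)
    remaining_years = years_until_full(19.0, 0.55, 4.0)
    print(f"estimated capacity: {capacity:.1f} Tb")
    print(f"estimated time until full: {remaining_years:.1f} years")
```

With the values quoted here this gives roughly 34.5Tb of total capacity and a little under 4 years of growth for the existing datasets alone, before any historical data is added.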
We would like to add the historical per-article daily pagecounts dataset to the API (see related task above). Since this data is similar to the existing pageviews per-article daily dataset, we use the latter as the basis for the capacity planning of the former.
Pagecounts data is available from 2008 onward and stops at 2015-06, when pageviews start. This represents 7 years and 6 months of data. To account for growth over time (the number of pageviews was lower in 2008), we computed the number of rows (distinct wiki, page_title and day) to be loaded into Cassandra for every January, both for pagecounts (2008 to 2015) and pageviews (2016 to 2020). We took the average linear growth between years as the basis for the monthly rows to be loaded for every month of a given year: Rows-Jan-Y * 12 + (Rows-Jan-(Y+1) - Rows-Jan-Y). Based on the storage taken by pageviews, we then computed the expected storage taken by pagecounts. We used the same method for incomplete years, multiplying by the number of months instead of 12, and did not apply any variation to 2020 as it only represents 2 months.
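The row estimate described above can be sketched as a small function. This is an illustration only: the function name and the row counts in the example are hypothetical, and it assumes that for incomplete years the growth term is kept unchanged while only the month multiplier varies, which is one reading of the description.

```python
from typing import Optional


def rows_for_period(rows_jan: int, rows_next_jan: Optional[int],
                    months: int = 12) -> int:
    """Estimate rows to load for a period starting in January of year Y.

    Implements Rows-Jan-Y * months + (Rows-Jan-(Y+1) - Rows-Jan-Y).
    Pass rows_next_jan=None when no variation should be applied
    (as done for 2020, which only covers 2 months).
    """
    growth = 0 if rows_next_jan is None else rows_next_jan - rows_jan
    return rows_jan * months + growth


if __name__ == "__main__":
    # Hypothetical January row counts, NOT the real measurements.
    print(rows_for_period(1_000, 1_120))            # full year
    print(rows_for_period(1_000, 1_120, months=6))  # incomplete year
    print(rows_for_period(1_000, None, months=2))   # no variation applied
```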
With the method described, every row stored for pageviews per-article daily weighs on average 96 bytes, leading to an additional ~14.7Tb of storage (123% of currently stored pageviews) to add the historical pagecounts per-article daily.
Without going into similar detail, we can assume that top-pagecounts for the same period should weigh a similar ratio of top-pageviews, therefore approximately 130Gb.
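The two conversions used above (rows to storage, and scaling an existing table by a ratio) can be sketched as follows. The helper names are hypothetical, and the sketch assumes decimal units (1Tb = 10^12 bytes); the 96 bytes per row, the 123% ratio and the 0.103Tb top-pageviews size are the values quoted on this page.

```python
BYTES_PER_ROW = 96      # average row weight quoted above
BYTES_PER_TB = 10**12   # assumption: decimal terabytes


def storage_tb(rows: int, bytes_per_row: int = BYTES_PER_ROW) -> float:
    """Approximate storage in Tb for a given number of rows."""
    return rows * bytes_per_row / BYTES_PER_TB


def scaled_storage_gb(existing_tb: float, ratio: float) -> float:
    """Scale the size of an existing table by a ratio and return Gb."""
    return existing_tb * ratio * 1000


if __name__ == "__main__":
    # top-pageviews (0.103Tb) scaled by the same 123% ratio as per-article data.
    print(f"top-pagecounts: ~{scaled_storage_gb(0.103, 1.23):.0f} Gb")
```

Scaling top-pageviews this way gives a value slightly below the ~130Gb quoted above, which is consistent given the rounding involved.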
|date||months||rows||rows for the period||storage for the period (Tb)|
|date||months||rows||rows for the period||storage for the period (Tb)|
Note on sizing
The numbers presented above are approximations. It makes no sense to try to represent Cassandra storage in bytes per row, as compaction and compression are at play. However, given the similarity of the datasets we worked with, the approach feels reasonably correct.
- We probably want to take advantage of the expansion to move to Cassandra 3.x.
- Moving the ~20Tb total of data needs to be carefully planned and thought through.