Using Boost.Interprocess for Large-Scale Data Management in the Statistical Language R

December 3, 2011
Boost#7
@sfchaos

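About me

- Twitter ID: @sfchaos
- Occupation: data analyst
  - Data analysis for finance, healthcare, industry, and other fields, using R, C++, and other tools
- Boost: used uBLAS a little for storing and computing with matrices back when I worked in finance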

Agenda

1. Why R?
2. What's R?
3. Large-scale data management in R with Boost.Interprocess
4. Other points of contact between C++/Boost and R
5. Summary

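1. Why R?
- Machine learning, natural language processing, data mining, and related fields have been booming in recent years
  "I keep saying that the sexy job in the next 10 years will be statisticians," said Hal Varian, chief economist at Google. "And I'm not kidding."

- R is a handy tool that lets analysts explore data interactively on their own machines
- The aim of this talk is to present a case study of Boost being used in R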

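2. What's R?

- A language and environment for statistical computing and graphics
- Provides a wide variety of statistical techniques (linear and nonlinear models, classical statistical tests, time-series analysis, discriminant analysis, clustering, and others) and graphics
- Has been attracting a great deal of attention in recent years
  - Study groups in various cities (Tokyo.R, Tsukuba.R, Osaka.R, Hiroshima.R)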

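2.1 Strengths of R (a few examples)

- Objects are easy to manipulate
> # Arithmetic sequence: first term 1, last term 10, common difference 1
> x <- 1:10
> # Print the value of x
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> # Extract only the even terms
> x[x %% 2 == 0]
[1]  2  4  6  8 10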

- Powerful graphics capabilities
[Figure: "Average Yearly Sunspots", two panels plotting spots (0 to 150) against Year (1750 to 1950)]

- A rich collection of packages that provide the latest methods

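2.2 Weaknesses of R
- Even on a multi-core or multi-CPU machine, R basically runs on a single CPU
- R basically holds data in memory and performs its computations there
- Because R uses 32-bit integers, the number of elements in a vector, matrix, array, or similar object is capped at 2^31-1, even on a 64-bit OS
- Objects are basically passed by value, which consumes a large amount of memory

Speeding up processing on large data takes some ingenuity
→ High Performance Computing (HPC)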

Agenda
1. Why R?
2. What's R?
3. Large-scale data management in R with Boost.Interprocess
4. Other points of contact between C++/Boost and R
5. Summary

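3.1 Going beyond the in-memory constraint
- With R's standard features alone, data size is limited by RAM
- Several packages are available that address this problem
- The bigmemory package uses Boost.Interprocess to manage data in shared memory and memory-mapped files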

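3.2 Overview of Boost.Interprocess
- A library that simplifies the use of interprocess communication and synchronization mechanisms
  - Shared memory
  - Memory-mapped files
  - Semaphores, mutexes, condition variables, and upgradable mutex types that can be placed in shared memory or memory-mapped files
  - Named versions of those synchronization objects, similar to the UNIX/Windows sem_open/CreateSemaphore APIs
  - File locking
  - Relative pointers
  - Message queues, and so on
http://ohkuma.la.coocan.jp/tech/boost/Interprocess.html

- This talk takes only a brief look at shared memory and memory-mapped files, the parts relevant to R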

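3.3.1 Shared memory
- Header files to include
boost/interprocess/shared_memory_object.hpp
boost/interprocess/mapped_region.hpp

- Creating a shared memory segment
using namespace boost::interprocess;
// Open the shared memory segment, creating it if it does not exist yet
shared_memory_object shm_obj(open_or_create, "shared_memory", read_write);
// Set the size of the shared memory (requires read_write mode)
shm_obj.truncate(10000);

- Mapping the shared memory segment
using namespace boost::interprocess;
mapped_region region(shm_obj, read_write);

- Destroying the shared memory
using namespace boost::interprocess;
shared_memory_object::remove("shared_memory");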
3.3.2 Memory-mapped files
- Header files to include
boost/interprocess/file_mapping.hpp
boost/interprocess/mapped_region.hpp

- Creating a file mapping
file_mapping m_file("/usr/home/file", read_write);
- Mapping the file contents into memory
mapped_region region(m_file, read_write);
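
To make the idea of a memory-mapped file more concrete, here is a minimal sketch that maps a file with the POSIX mmap call directly. This is a swapped-in technique for illustration only: it is not Boost.Interprocess code, it works on POSIX systems only, and map_file and unmap_file are hypothetical helper names that do not appear in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an existing, non-empty file read-write (hypothetical helper).
// On success returns the start of the mapping and stores its size in *length;
// returns nullptr on failure.
char* map_file(const std::string& path, std::size_t* length) {
    int fd = open(path.c_str(), O_RDWR);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return nullptr;
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    // MAP_SHARED makes writes visible to other processes mapping the same file
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return nullptr;
    *length = size;
    return static_cast<char*>(addr);
}

// Write pending changes back to the file, then release the mapping.
bool unmap_file(char* addr, std::size_t length) {
    const bool synced = msync(addr, length, MS_SYNC) == 0;
    const bool unmapped = munmap(addr, length) == 0;
    return synced && unmapped;
}
```

After map_file succeeds, the returned pointer can be read and written like an ordinary array, and the operating system pages the file contents in and out on demand. That is what lets a data set larger than RAM be addressed as if it were in memory.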

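3.4 Classes that make up bigmemory
BigMatrix: abstract class for big matrices
  LocalBigMatrix: big matrix class that holds its data locally
  SharedBigMatrix: abstract class for shareable big matrices
    SharedMemoryBigMatrix: big matrix class backed by shared memory
    FileBackedBigMatrix: big matrix class backed by a memory-mapped file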

3.4.1 Abstract class for shareable big matrices
typedef boost::interprocess::mapped_region MappedRegion;
typedef boost::shared_ptr<MappedRegion> MappedRegionPtr;
typedef std::vector<MappedRegionPtr> MappedRegionPtrs;

class SharedBigMatrix : public BigMatrix
{
public:
  SharedBigMatrix() : BigMatrix() {_shared=true;}
  virtual ~SharedBigMatrix() {}
  std::string uuid() const {return _uuid;}
  std::string shared_name() const {return _sharedName;}

protected:
  virtual bool destroy()=0;
  bool create_uuid();  // generate a UUID
  bool uuid(const std::string &uuid) {_uuid=uuid; return true;}
  std::string _uuid;
  std::string _sharedName;
  MappedRegionPtrs _dataRegionPtrs;
};

// Generate a random UUID and store its string form in _uuid
bool SharedBigMatrix::create_uuid()
{
  try {
    std::stringstream ss;
    boost::uuids::basic_random_generator<boost::mt19937> gen;
    boost::uuids::uuid u = gen();
    ss << u;
    _uuid = ss.str();
    return true;
  } catch(std::exception &e) {
    printf("%s\n", e.what());
    printf("%s line %d\n", __FILE__, __LINE__);
    return false;
  }
}

3.4.2 Big matrix class backed by shared memory
class SharedMemoryBigMatrix : public SharedBigMatrix
{
public:
  SharedMemoryBigMatrix():SharedBigMatrix(){};
  virtual ~SharedMemoryBigMatrix(){destroy();};
  // ① Create a big matrix
  virtual bool create( const index_type numRow, const index_type numCol,
    const int matrixType, const bool sepCols);
  // ② Connect to an existing big matrix
  virtual bool connect( const std::string &uuid, const index_type numRow,
    const index_type numCol, const int matrixType, const bool sepCols);
protected:
  // ③ Destroy the big matrix
  virtual bool destroy();

  SharedCounter _counter;
};

① Creating a big matrix
bool SharedMemoryBigMatrix::create( const index_type numRow,
  const index_type numCol, const int matrixType, const bool sepCols )
{
#ifndef INTERLOCKED_EXCHANGE_HACK
  named_mutex mutex(open_or_create,
    (_sharedName+"_counter_mutex").c_str());
  mutex.lock();
#endif
  // ①-1 Initialize the counter
  _counter.init( _sharedName+"_counter" );
#ifndef INTERLOCKED_EXCHANGE_HACK
  mutex.unlock();
#endif
  switch(_matType) {
    // Create the shared big matrix according to the element type
    case 1:
      // ①-2 Creation engine for shared big matrices
      _pdata = CreateSharedMatrix<char>(_sharedName,
        _dataRegionPtrs, _nrow, _ncol);
      break;
    ･･･
  }
  return true;
}

①-1 Initializing the counter
bool SharedCounter::init( const std::string &resourceName ) {
  _resourceName = resourceName;
  try {
    // First connection: create the counter and set it to 1
    boost::interprocess::shared_memory_object shm(
      boost::interprocess::create_only,
      _resourceName.c_str(),
      boost::interprocess::read_write);
    shm.truncate( sizeof(index_type) );
    _pRegion = new boost::interprocess::mapped_region(shm,
      boost::interprocess::read_write);
    _pVal = reinterpret_cast<index_type*>(_pRegion->get_address());
    *_pVal = 1;
  } catch(std::exception &ex) {
    // The counter already exists: connect to it and increment it
    ･･･
    ++(*_pVal);
  }
  return true;
}

①-2 Creation engine for shared big matrices
template<typename T>
void* CreateSharedMatrix( const std::string &sharedName,
  MappedRegionPtrs &dataRegionPtrs, const index_type nrow,
  const index_type ncol)
{
  using namespace boost::interprocess;
  // Create the shared memory segment
  shared_memory_object shm(create_only, sharedName.c_str(), read_write);
  // Set the size of the shared memory (the size of the whole matrix)
  shm.truncate( nrow*ncol*sizeof(T) );
  // Map the segment and keep the mapped region alive
  dataRegionPtrs.push_back(
    MappedRegionPtr(new MappedRegion(shm, read_write)));
  return dataRegionPtrs[0]->get_address();
}
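
The creation engine sizes the segment as nrow*ncol*sizeof(T) and hands back an untyped pointer. The sketch below shows, under stated assumptions, how such a pointer could be sized safely and then addressed as a matrix. matrix_bytes and MatrixView are hypothetical names, not bigmemory classes, and the column-major layout is an assumption chosen because R itself stores matrices column by column.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Bytes needed for an nrow x ncol matrix of T (hypothetical helper).
// Returns false instead of silently wrapping around on overflow.
template <typename T>
bool matrix_bytes(std::size_t nrow, std::size_t ncol, std::size_t* bytes) {
    const std::size_t max = std::numeric_limits<std::size_t>::max();
    if (ncol != 0 && nrow > max / ncol) return false;
    const std::size_t cells = nrow * ncol;
    if (cells > max / sizeof(T)) return false;
    *bytes = cells * sizeof(T);
    return true;
}

// Typed view over an untyped buffer, assuming column-major layout:
// element (row, col) lives at offset col * nrow + row.
template <typename T>
class MatrixView {
public:
    MatrixView(void* data, std::size_t nrow, std::size_t ncol)
        : data_(static_cast<T*>(data)), nrow_(nrow), ncol_(ncol) {}

    T& at(std::size_t row, std::size_t col) {
        assert(row < nrow_ && col < ncol_);
        return data_[col * nrow_ + row];
    }

private:
    T* data_;
    std::size_t nrow_;
    std::size_t ncol_;
};
```

Because the view only stores a pointer and two dimensions, every process that maps the same segment can build its own view over the same bytes without copying the matrix.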

② Connecting to a big matrix
bool SharedMemoryBigMatrix::connect( const std::string &uuid,
  const index_type numRow, const index_type numCol,
  const int matrixType, const bool sepCols )
{
  // Attach to the associated mutex and counter;
#ifndef INTERLOCKED_EXCHANGE_HACK
  named_mutex mutex(open_or_create,
    (_sharedName+"_counter_mutex").c_str());
  mutex.lock();
#endif
  // ②-1 Initialize the counter (covered in ①-1, so omitted here)
  _counter.init( _sharedName+"_counter" );
#ifndef INTERLOCKED_EXCHANGE_HACK
  mutex.unlock();
#endif
  switch(_matType) {
    case 1:
      // ②-2 Connection engine for shared big matrices
      _pdata = ConnectSharedMatrix<char>(_sharedName,
        _dataRegionPtrs, _counter);
      break;
    ･･･
  }
  return true;
}

②-2 Connection engine for shared big matrices
template<typename T>
void* ConnectSharedMatrix( const std::string &sharedName,
  MappedRegionPtrs &dataRegionPtrs, SharedCounter &counter)
{
  using namespace boost::interprocess;
  // Open the existing shared memory segment
  shared_memory_object shm(open_only, sharedName.c_str(), read_write);
  // Map the segment and add it to the list of mapped regions
  dataRegionPtrs.push_back(
    MappedRegionPtr(new MappedRegion(shm, read_write)));
  return reinterpret_cast<void*>(dataRegionPtrs[0]->get_address());
}

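③ Destroying a big matrix
bool SharedMemoryBigMatrix::destroy() {
  using namespace boost::interprocess;
#ifndef INTERLOCKED_EXCHANGE_HACK
  named_mutex mutex(open_or_create,
    (_sharedName+"_counter_mutex").c_str());
  mutex.lock();
#endif
  // Only the last object still attached removes the shared resources
  bool destroyThis = (1==_counter.get());
  _dataRegionPtrs.resize(0);
  if (destroyThis) {
    shared_memory_object::remove(_uuid.c_str());
  }
#ifndef INTERLOCKED_EXCHANGE_HACK
  mutex.unlock();
  if (destroyThis) {
    named_mutex::remove((_sharedName+"_counter_mutex").c_str());
  }
#endif
  return true;
}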
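
The create, connect, and destroy steps together form a reference-counting scheme: the first user creates the counter at 1, each later connection increments it, and the holder that sees a count of 1 on destruction removes the shared resources. The following is a minimal in-process sketch of that lifetime rule. NamedResourceTable is a hypothetical stand-in that keeps counts in a std::map rather than in shared memory, and it assumes that each detach lowers the count by one (where the decrement happens is not shown on the slides).

```cpp
#include <cassert>
#include <map>
#include <string>

// In-process stand-in for named shared resources: name -> attach count.
// In bigmemory the count lives in its own shared memory segment instead.
class NamedResourceTable {
public:
    // Like SharedCounter::init: create the counter at 1,
    // or increment it if the resource already exists.
    long attach(const std::string& name) {
        std::map<std::string, long>::iterator it = counts_.find(name);
        if (it == counts_.end()) {
            counts_[name] = 1;
            return 1;
        }
        return ++it->second;
    }

    // Like destroy(): the holder that sees a count of 1 removes the resource.
    // Returns true if this call removed it.
    bool detach(const std::string& name) {
        std::map<std::string, long>::iterator it = counts_.find(name);
        if (it == counts_.end()) return false;
        if (it->second == 1) {
            counts_.erase(it);
            return true;
        }
        --it->second;
        return false;
    }

    bool exists(const std::string& name) const {
        return counts_.count(name) != 0;
    }

private:
    std::map<std::string, long> counts_;
};
```

In the real class the check and the removal happen while holding the named mutex, since two processes could otherwise both read a count of 1 or both miss it.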

3.4.3 Big matrix class backed by a memory-mapped file
class FileBackedBigMatrix : public SharedBigMatrix
{
public:
  FileBackedBigMatrix():SharedBigMatrix(){}
  virtual ~FileBackedBigMatrix(){destroy();}
  virtual bool create( const std::string &fileName,
    const std::string &filePath, const index_type numRow,
    const index_type numCol, const int matrixType, const bool sepCols);
  virtual bool connect( const std::string &fileName,
    const std::string &filePath, const index_type numRow,
    const index_type numCol, const int matrixType, const bool sepCols);
  std::string file_name() const {return _fileName;}
  bool flush();
protected:
  virtual bool destroy();

  std::string _fileName;
};

① Creating a big matrix
bool FileBackedBigMatrix::create( const std::string &fileName,
  const std::string &filePath, const index_type numRow,
  const index_type numCol, const int matrixType, const bool sepCols)
{
  // Create the memory-mapped file according to the element type
  switch(_matType) {
    case 1:
      // Creation engine for memory-mapped files
      _pdata = CreateFileBackedMatrix<char>(_fileName, filePath,
        _dataRegionPtrs, _nrow, _ncol);
      break;
    case 2:
      _pdata = CreateFileBackedMatrix<short>(_fileName, filePath,
        _dataRegionPtrs, _nrow, _ncol);
      break;
    ･･･
  }
  return true;
}

template<typename T>
void* ConnectFileBackedMatrix( const std::string &fileName,
  const std::string &filePath, MappedRegionPtrs &dataRegionPtrs)
{
  using namespace boost::interprocess;
  // Create the file mapping
  file_mapping mFile((filePath+"/"+fileName).c_str(), read_write);
  // Map the file contents into memory
  dataRegionPtrs.push_back(
    MappedRegionPtr(new MappedRegion(mFile, read_write)));
  return reinterpret_cast<void*>(dataRegionPtrs[0]->get_address());
}

② Connecting to a big matrix
bool FileBackedBigMatrix::connect( const std::string &fileName,
  const std::string &filePath, const index_type numRow,
  const index_type numCol, const int matrixType, const bool sepCols)
{
  // Connect to the memory-mapped file according to the element type
  switch(_matType) {
    case 1:
      _pdata = ConnectFileBackedMatrix<char>(_fileName, filePath,
        _dataRegionPtrs);
      break;
    case 2:
      _pdata = ConnectFileBackedMatrix<short>(_fileName, filePath,
        _dataRegionPtrs);
      break;
    ･･･
  }
  return true;
}

③ Destroying a big matrix
bool FileBackedBigMatrix::destroy()
{
  // Release the mapped regions, then remove the named resource
  _dataRegionPtrs.resize(0);
  shared_memory_object::remove(_fileName.c_str());
  return true;
}

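3.5 How well R's problems are solved
(◎ = fully solved, ○ = solved)
- Basically uses 1 CPU (core) even on a multi-CPU (multi-core) machine
  ○ Shared memory and memory-mapped files make parallel and concurrent computation possible
- Basically holds data in memory and computes there
  ○ Data far larger than RAM can be handled
- The number of elements in a vector, matrix, or array is capped at 2^31-1
  ○ The cap on the number of elements is raised to 2^52
- Objects basically cannot be passed by reference and are passed by value, so copies are made all over the place and consume memory
  ◎ Objects can be passed by reference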
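
The difference between passing by value and passing by reference in the last item can be illustrated in C++ terms. This is only an analogy under stated assumptions: MatrixHandle is a hypothetical type standing in for an object that carries a pointer to its data rather than the data itself, and the function names are made up for the example.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Value semantics: the callee works on a private copy,
// so the whole buffer is duplicated and the caller's data is unchanged.
void zero_first_by_value(std::vector<double> v) {
    if (!v.empty()) v[0] = 0.0;
}

// Reference semantics: only the small handle is copied;
// the underlying buffer is shared with the caller.
struct MatrixHandle {
    std::shared_ptr<std::vector<double> > data;
};

void zero_first_by_handle(MatrixHandle h) {
    if (!h.data->empty()) (*h.data)[0] = 0.0;
}
```

With the handle version, the cost of a call does not grow with the size of the data, which is the property that matters once a matrix occupies many gigabytes.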

3.6 A concrete example
- Data used
  - Data Expo 2009
    US airline flight data (1987 to 2008)
    http://stat-computing.org/dataexpo/2009/the-data.html
  - About 12 GB (about 123 million records, 29 fields)
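
A quick back-of-the-envelope check shows why this data set does not fit in a single standard R matrix: about 123 million records times 29 fields is roughly 3.57 billion cells, which is above the 2^31-1 (about 2.15 billion) element cap mentioned in section 2.2. The sketch below encodes that check; fits_in_base_r is a hypothetical helper name, and the record count is the rounded figure from the slide.

```cpp
#include <cassert>
#include <cstdint>

// Element cap for one vector, matrix, or array as stated in the talk: 2^31 - 1.
const std::int64_t kMaxElements = 2147483647LL;

// True if an nrow x ncol matrix stays within the cap.
// Assumes non-negative dimensions small enough that the product fits in 64 bits.
bool fits_in_base_r(std::int64_t nrow, std::int64_t ncol) {
    return nrow * ncol <= kMaxElements;
}
```

For example, fits_in_base_r(123000000, 29) is false, while a one-million-row sample of the same 29 fields passes the check.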

3.6.1 Creating and attaching to a memory-mapped file
> library(bigmemory)
> # Create the memory-mapped file (about 21 minutes on an Intel Core i7)
> airline <- read.big.matrix("AirlineAllData.csv",
+   header=TRUE, sep=",",
+   backingfile="AirlineAllData.bin",
+   descriptorfile="AirlineAllData.desc")

> # Attach to a memory-mapped file that has already been created (0.002 seconds)
> airline <- attach.big.matrix("AirlineAllData.desc")

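3.6.2 Summarizing the data
> library(bigtabulate)
>
> # Summary of each column (minimum, maximum, mean, number of NAs)
> summary(airline)
>
> # Number of flights per year and month
> bigtable(airline, c("Year", "Month"))
>
> # Statistics of arrival delay by day of week
> # (minimum, maximum, mean, standard deviation, number of NAs)
> bigtsummary(airline, "DayOfWeek", cols="ArrDelay", na.rm=TRUE)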

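3.6.3 Estimating the month each aircraft was manufactured
> library(bigtabulate)
>
> # Row numbers of the records for each aircraft code
> planeindices <- bigsplit(airline, 'TailNum')
>
> # Run in parallel on 2 cores
> library(doMC)
> registerDoMC(cores=2)
>
> # Estimate the month of manufacture (about 14 seconds)
> # birthmonth() is a user-defined function not shown on the slide
> planeStart <-
+   foreach(i=planeindices, .combine=c) %dopar% {
+     return(birthmonth(airline[i, c('Year','Month'),
+       drop=FALSE]))
+   }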
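
The pattern in this example is split, then apply: first collect the row numbers belonging to each aircraft, then compute one value per group. Here is a minimal sequential sketch of that pattern. split_rows and earliest_month are hypothetical names; the keys are assumed to be integer-coded aircraft identifiers, and since birthmonth is not defined on the slide, the sketch assumes it returns the earliest year and month seen for an aircraft, encoded as year*12 + month.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Group row numbers by key (here: one group of rows per aircraft code).
std::map<int, std::vector<std::size_t> > split_rows(const std::vector<int>& keys) {
    std::map<int, std::vector<std::size_t> > groups;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        groups[keys[i]].push_back(i);
    }
    return groups;
}

// Assumed stand-in for birthmonth: the earliest year/month among the given
// rows, encoded as year*12 + month so that months compare as plain integers.
int earliest_month(const std::vector<std::size_t>& rows,
                   const std::vector<int>& year,
                   const std::vector<int>& month) {
    assert(!rows.empty());
    int best = year[rows[0]] * 12 + month[rows[0]];
    for (std::size_t k = 1; k < rows.size(); ++k) {
        const int m = year[rows[k]] * 12 + month[rows[k]];
        if (m < best) best = m;
    }
    return best;
}
```

Each group is processed independently of the others and only reads the shared data, which is why the work can be handed to several workers at once without copying the matrix.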

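- Holding and tabulating data is now possible, but the functionality is still far from sufficient
- To extend it, a matrix that can hold only a single type is not enough
- A data frame that allows a different type for each column needs to be developed
- I am looking into whether this can be built with Boost.Variant, Boost.MPL, and similar libraries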
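4. Other points of contact between C++/Boost and R
- The Rcpp package, which allows concise interfaces between R and C++ (modeled on Boost.Python)
- The RBGL package, which calls the Boost.Graph library
- The RcppBDT package, which calls the Boost.Date_Time library, and so on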

Agenda
1. Why R?
2. What's R?
3. Large-scale data management in R with Boost.Interprocess
4. Other points of contact between C++/Boost and R
5. Summary

5. Summary
- Boost is used in many places to extend R's capabilities
- So that data analysts can keep doing their work comfortably, I wish the Boost community ever greater success!
