<divclass="documentation">
<h2>Welcome to the Ebulk tool documentation page</h2>
<h1>DESCRIPTION</h1>
<pclass="last">Ebulk tool makes easy to exchange or archive very large data sets. It performs data set ingestion or download from different storage inputs, to Wendelin-IA platform (based on stack <ahref="https://wendelin.nexedi.com/">Wendelin</a> - <ahref="https://neo.nexedi.com/">NEO</a> - <ahref="https://erp5.nexedi.com/">ERP5</a>). It also allows to perform local changes in data sets and to upload the added and modified files. One key feature of Ebulk is to be able to resume and recover from errors happening with interrupted transfers.</p>
<h1>REQUIREMENTS</h1>
<pclass="last">Java 8: Ebulk relies on Embulk-v0.9.7 bulk data loader Java application (please see <ahref="http://www.embulk.org/">Embulk-doc</a>), so Java 8 is required in order to install Ebulk tool.</p>
<pclass="last">Prints the synopsis and the list of commands and options, with a brief explanation of them.</p>
<pclass="command">-r</p>
<pclass="command">--readme</p>
<pclass="last">Access the README file for a detailed explanation of Ebulk installation and usage.</p>
<pclass="command">-e</p>
<pclass="command">--example</p>
<pclass="last">Prints some basic examples about Ebulk usage.</p>
<pclass="command">pull [<dataset>]</p>
<p>Download operation: downloads the content of the specified remote data set from the Wendelin-IA site into the target output. By default, the output is a directory named as the data set.</p>
<p><dataset> argument: unique reference of the remote data set. It is optional because if no data set is specified, the current directory will be used as data set reference and directory.</p>
<p>Data set reference must be one of the available datasets on the Wendelin-IA site.</p>
<p>Data set argument can be a path to a directory, then the directory name will be used as data set reference: e.g. ‘ebulk pull my_directory/sample/’ --> data set reference will be sample. That directory will be linked to the data set reference, so any future operation on it will refer to that data set, no matter if the directory is moved or renamed.</p>
<p>If pull operation is run on a previously downloaded data set, the tool will offer the options to update it or download it from scratch, warning about any conflict with local changes.</p>
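<p class="last">Some pull invocations matching the cases above (all names are hypothetical):</p>
<pre>
# download the remote data set 'sample' into ./sample/
ebulk pull sample

# use a directory path; the reference becomes 'sample'
ebulk pull my_directory/sample/

# inside an already linked data set directory, no argument is needed
ebulk pull
</pre>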
<p class="command">push [<dataset>]</p>
<p>Ingestion operation: uploads the content of the specified input data set to the Wendelin-IA site. By default, the input data set is the directory named after the data set.</p>
<p><dataset> argument: the unique reference for the data set. It is optional: if no data set is specified, the current directory is used as both the data set reference and the data set directory.</p>
<p>The data set argument can also be a path to a directory; in that case the directory name is used as the data set reference: e.g. 'ebulk push my_directory/sample/' uses 'sample' as the data set reference. That directory is linked to the data set reference, so any future operation on it refers to the same data set, even if the directory is moved or renamed.</p>
<p>- New data set ingestion: an ingestion with a new data set reference creates a new data set on the site.</p>
<p>- Data set contribution: ingestion of local changes made on a previously downloaded data set. If no local changes were marked as ready for ingestion (see the add/remove commands below), the push command uses all available local changes by default.</p>
<p>- Partial ingestion: allows ingesting into a data set without downloading it first, warning about any file conflicts. This makes it possible to upload portions of a very large data set in parallel from different locations/computers.</p>
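<p class="last">Some push invocations matching the cases above (all names are hypothetical):</p>
<pre>
# create a brand new data set 'my-new-dataset' on the site
ebulk push my-new-dataset

# contribute local changes to a previously downloaded data set
ebulk push sample

# partial ingestion: push a directory without pulling the data set first
ebulk push my_directory/sample/
</pre>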
<pclass="last">Lists the local changes in data set path. If no data set path is specified, the current directory will be used as data set directory. Lists any new, modified or deleted file in the local data set, indicating if they were marked for ingestion or not.</p>
<pclass="command">add <path></p>
<pclass="last">Marks new or modified files in path as ready for ingestion. The path can be a specific file or a directory. Any file in path that has been added or modified will be set as ready, then a future push operation will use the marked files for the ingestion.</p>
<pclass="command">remove <path></p>
<pclass="last">Marks the files in path for removal. The path can be a specific file or a directory. Any file in path (deleted or not) will be removed. Then a future push operation will delete from remote data set the files marked as removed. Note: if an existing file (not deleted) is marked for removal, the push operation will also delete it from local data set.</p>
<pclass="command">reset <path></p>
<pclass="last">Resets marked files in path. The path can be a specific file or a directory. Any file previously marked for ingestion (add or remove) will be reset.</p>
<h1>OPTIONS</h1>
<pclass="command">-d <path></p>
<pclass="command">--directory <path></p>
<pclass="last">Allows to use a custom location as data set directory. That directory will be linked to the data set reference, so any future operation on it will refer to that data set, no matter if the directory is moved or renamed.</p>
<pclass="command">-c <size></p>
<pclass="command">--chunk <size></p>
<pclass="last">Operations on large files are split into smaller chunks; by default, the size of the chunks is 50Mb. This command allows to set the size (in Megabytes) of the chunks in case is needed.</p>
<pclass="command">-dc</p>
<pclass="command">--discard-changes</p>
<pclass="last">Discards all the changes made in the local data set by downloading the corresponding original files from the remote data set.</p>
<pclass="command">-s <storage></p>
<pclass="command">--storage <storage></p>
<pclass="last">Uses the specified storage as input for data set ingestion. The storage must be one of the storages available in the installed Ebulk tool version. The tool will ask the corresponding inputs needed for that storage (like authentication, urls, etc.) and it will perform the ingestion of its content to the remote data set on the site. e.g. the command 'ebulk push my-dataset --storage ftp' allows to ingest the contents of a remote located dataset via ftp.</p>
<pclass="command">-a</p>
<pclass="command">--advanced</p>
<pclass="last">When using -s|--storage option, it allows to configure more advanced aspects of the specified storage, by editing the corresponding configuration file.</p>
<pclass="command">-cs</p>
<pclass="command">--custom-storage</p>
<p>Allows to use a custom storage as input that is not available yet in the tool. The storage must be one of the available in embulk site: 'http://www.embulk.org/plugins/#input'. The tool will attempt to automatically install the plugin and it will request the user to edit the corresponding configuration file.</p>
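<p class="last">A hypothetical invocation, assuming the option takes no argument and the tool prompts for the Embulk input plugin to install:</p>
<pre>
# install and configure an Embulk input plugin interactively, then ingest
ebulk push my-dataset --custom-storage
</pre>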
<h1>Ebulk + Wendelin = Big Data sharing platform</h1>
<p><ahref="erp5/web_site_module/fif_data_runner/#/?page=ebulk_doc">Ebulk</a> tool and <atarget="_blank"href="https://wendelin.nexedi.com/">Wendelin</a> platform are combined to form an easy to use Data Lake to share petabytes of data grouped into data sets. Big Data sharing is essential for research and startups, due building new A.I. models requires access to large data sets, usually available in big platforms such as Google or Alibaba which tend to keep them secret. This project offers a solution to the big data sharing problem by solving the following key points:</p>
<ul>
<li>Huge transfers (over slow and unreliable networks)</li>
</ul>
<p>Dozens of public and private big data sets are available on the platform: terabytes of data of all kinds, including binaries such as medical images, ndarrays and more. Do you want to download data sets or share your data? <a href="erp5/web_site_module/fif_data_runner/#/?page=download">Download</a> our Ebulk tool to transfer big data! Please <a href="erp5/web_site_module/fif_data_runner/#/?page=about">contact us</a> to register and get a user account. See our full <a href="erp5/web_site_module/fif_data_runner/#/?page=fifdata">data set list</a>!</p>
<h1>Ebulk tool</h1>
<p>The Ebulk tool is a wrapper for <a target="_blank" href="http://www.embulk.org/docs/">Embulk</a>, an open-source bulk data loader that helps transfer data between various databases, storages, file formats and cloud services. It supports any kind of input file format, parallel and distributed execution to deal with big data sets, transaction control to guarantee all-or-nothing file transfers, and operation resuming. Ebulk is as easy to use as git, allowing big data transfers to be done with very few commands. Please <a href="erp5/web_site_module/fif_data_runner/#/?page=download">download</a> Ebulk and check the <a href="erp5/web_site_module/fif_data_runner/#/?page=ebulk_doc">documentation</a>.</p>
<h1>Wendelin</h1>
<p><atarget="_blank"href="https://wendelin.nexedi.com/">Wendelin</a> is a big data framework designed for industrial applications based on python, NumPy, Scipy and other NumPy based libraries. It uses at its core the NEO distributed transactional NoSQL database to store petabytes of binary data. Wendelin combines the performance of scikit-learn machine learning with NEO distributed storage in order to provide out-of-core processing of large data sets. Its goal is to bring the best open source, big data engine based on Numpy python technologies and gather a wide community of contributors of new data analytics algorithms.</p>