An in-IDE data-cleaning and code-generation toolkit for data scientists — no Python environment needed.
Data Preprocessor brings your data-cleaning workflow directly into IntelliJ IDEA. Load CSV, Excel, or JSON data, profile every column, build point-and-click cleaning pipelines, and generate ready-to-run Python scripts from the same workflow. Pro unlocks reusable .dpp pipelines plus R and SQL generation — all without leaving the IDE.
| Tab | What you get |
|---|---|
| Preview | Configurable table preview of your loaded dataset |
| Profile | Per-column stats: type, null count, unique values, mean, median, std, min, max, mode |
| Clean | Point-and-click operations that build a reproducible, editable pipeline |
| Code | Auto-generated Python, R [Pro], or SQL [Pro] — save anywhere as .py, .R, or .sql, or copy to clipboard |
| Visualise [Pro] | Histogram and box plot per numeric column; charts update after every Apply |
- Open IntelliJ IDEA (any edition, 2024.3+)
- Go to Settings → Plugins → Marketplace
- Search for "Data Preprocessor"
- Click Install and restart the IDE
Visit plugins.jetbrains.com/plugin/31226-data-preprocessor and click Get.
- Load CSV, Excel, and JSON files — browse and open data without leaving the IDE; also available via right-click on supported files in the Project view
- Column Profiler — type, null count, unique count, mean, median, std deviation, min, max, mode
- Missing value handling — drop rows, fill with mean / median / mode, or supply a custom value
- Remove duplicates — deduplicate rows in one click
- Outlier removal — IQR fence method (1.5 × IQR) to drop statistical outliers
- Normalization — Min-Max scaling [0, 1], Z-Score standardisation (mean=0, std=1), or Robust Scaler using median/IQR
- Type casting — cast any column to int, float, boolean, or string
- Regex replace rules [Pro] — clean text columns with custom pattern-based find/replace, including `$1`, `$2` capture groups
- Pipeline editing and reuse — reorder, remove, clear, undo, redo, and use Pro import/export for `.dpp` JSON pipelines
- Multi-file batch mode [Pro] — apply the current pipeline to many CSV, Excel, or JSON files and export `_cleaned.csv` outputs beside each source file
- Python code generation — one click produces a complete, ready-to-run `pandas` script that mirrors every cleaning step you applied
- R code generation [Pro] — one click produces an equivalent base-R script; `readxl`, `jsonlite`, and `fastDummies` imported only when needed
- SQL code generation [Pro] — one click produces a PostgreSQL-style CTE template for database-side preprocessing
- Column visualisations [Pro] — Visualise tab renders a histogram or box plot for every numeric column; charts update automatically after Apply so you can see the effect of normalization or outlier removal instantly
- Save generated code — choose where to save generated `.py`, Pro `.R`, or Pro `.sql` code; it opens directly in the IntelliJ editor
- Export cleaned CSV — choose where to save the cleaned dataset
- Copy cleaned data as TSV — copy the applied result for direct paste into Excel or Google Sheets
- Settings page — configure preview row limit, default normalization operation, and default train/test ratio under Settings → Tools → Data Preprocessor
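
The outlier-removal and normalization bullets above reduce to a few lines of arithmetic. The following is a dependency-free Python sketch of those formulas, not the plugin's Java implementation and not its generated pandas output. All function names are hypothetical, and two details are assumptions: quartiles use linear interpolation (the `inclusive` method of `statistics.quantiles`), and Z-Score divides by the population standard deviation.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: linearly interpolated quartiles ("inclusive" method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]


def min_max(values):
    """Rescale to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Centre on the median and divide by the IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]
```

For example, `drop_outliers([1, 2, 3, 4, 100])` keeps `[1, 2, 3, 4]`, because the fences for that column work out to -1 and 7. Results from the plugin may differ slightly if it uses a different quartile or standard-deviation convention.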
Screenshots: CSV Preview tab, Column Profiler tab, Clean operations panel, Generated code tab.
- Open the Data Preprocessor tool window from the right-side panel, or
- Right-click any `.csv`, `.xlsx`, or `.json` file in the Project view → Open in Data Preprocessor
- Switch to the Clean tab
- Select a column and operation from the dropdowns, then click Add step
- Repeat for as many steps as needed
- Reorder steps with ↑ Up / ↓ Down, or use Undo / Redo while editing
- Click ▶ Apply steps to preview the cleaned result
- Click 🐍 Generate Python code, 🔵 Generate R code [Pro], or 🗄 Generate SQL code [Pro]
The Code tab populates with generated code for the selected language. Use Save as script… to choose a destination and open it in the editor, or copy and paste it into your notebook, R session, or database console.
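
To make the idea of "a script that mirrors every cleaning step" concrete, here is a minimal sketch of how an ordered list of steps could be translated into pandas source. It is an illustration only: the function name, the operation names, and the `(operation, column)` step shape are hypothetical and are not taken from the plugin's `CodeGenerator`, whose real output may look different. The emitted lines use only standard pandas calls (`read_csv`, `drop_duplicates`, `dropna`, `fillna`, `astype`).

```python
def generate_script(csv_path, steps):
    """Translate ordered (operation, column) steps into pandas source code."""
    lines = [
        "import pandas as pd",
        "",
        f"df = pd.read_csv({csv_path!r})",
    ]
    for operation, column in steps:
        col = repr(column)
        if operation == "drop_duplicates":
            lines.append("df = df.drop_duplicates()")
        elif operation == "drop_missing":
            lines.append(f"df = df.dropna(subset=[{col}])")
        elif operation == "fill_mean":
            lines.append(f"df[{col}] = df[{col}].fillna(df[{col}].mean())")
        elif operation == "cast_float":
            lines.append(f"df[{col}] = df[{col}].astype(float)")
        else:
            # Hypothetical operation names only; anything else is rejected.
            raise ValueError(f"unsupported operation: {operation}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_script("sales.csv", [("drop_duplicates", None),
                                        ("fill_mean", "price")]))
```

Because each step appends exactly one statement in order, reordering steps in the pipeline changes the order of statements in the script, which is why the generated code stays in sync with the Clean tab.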
- Export Pipeline [Pro] — saves the current cleaning steps as a `.dpp` JSON file
- Import Pipeline [Pro] — loads a saved `.dpp` file back into the Clean tab
- Imported pipelines warn when a step references a column that is not present in the currently loaded dataset
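
The import-time column check can be pictured as follows. This is a sketch under a loud assumption: the real `.dpp` schema is not documented here, so the `{"steps": [{"operation": ..., "column": ...}]}` layout, the function name, and the warning text are all hypothetical.

```python
import json


def missing_column_warnings(pipeline_json, dataset_columns):
    """Return one warning per step whose column is absent from the dataset."""
    document = json.loads(pipeline_json)
    available = set(dataset_columns)
    warnings = []
    # Hypothetical schema: {"steps": [{"operation": ..., "column": ...}, ...]}
    for index, step in enumerate(document.get("steps", []), start=1):
        column = step.get("column")
        # Steps without a column (e.g. whole-table operations) are not checked.
        if column is not None and column not in available:
            warnings.append(
                f"Step {index} ({step.get('operation')}) "
                f"references missing column '{column}'"
            )
    return warnings
```

Returning warnings rather than raising matches the behaviour described above, where the pipeline is still imported and the user is told which steps will not apply to the current dataset.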
- Batch processing is a Pro feature.
- Build a cleaning pipeline from a sample dataset, or import a saved `.dpp` pipeline before loading data
- Click Batch process with this pipeline
- Select multiple `.csv`, `.xlsx`, or `.json` files
- Each compatible file is loaded in the background, transformed with the current pipeline, and exported beside the source as `<filename>_cleaned.csv`
- Files missing required pipeline columns are skipped and reported in the batch summary
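
The batch flow above (load, check columns, transform, write beside the source, report skips) can be sketched in a few lines. This is not the plugin's `BatchProcessor`; it is a simplified, CSV-only illustration in which `batch_clean`, `required_columns`, and the `clean_rows` callback are hypothetical names, and Excel and JSON inputs are left out.

```python
import csv
from pathlib import Path


def batch_clean(paths, required_columns, clean_rows):
    """Apply clean_rows to each CSV; return (written, skipped) path lists."""
    written, skipped = [], []
    for path in map(Path, paths):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            headers = reader.fieldnames or []
            rows = list(reader)
        # Skip files that lack a column the pipeline depends on.
        if not set(required_columns) <= set(headers):
            skipped.append(path)
            continue
        # Write next to the source file as <filename>_cleaned.csv.
        target = path.with_name(f"{path.stem}_cleaned.csv")
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=headers)
            writer.writeheader()
            writer.writerows(clean_rows(rows))
        written.append(target)
    return written, skipped
```

The `skipped` list plays the role of the batch summary: a file that fails the column check is never partially written, so a run either produces a complete `_cleaned.csv` for a file or leaves it untouched.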
- 📤 Export cleaned CSV — opens a save dialog and writes the applied result as CSV
- Copy cleaned data as TSV — copies headers and cleaned rows for spreadsheet paste
- Save as script… — opens a save dialog for `preprocess_<filename>.py`, Pro `.R`, or Pro `.sql` output and opens it in the editor automatically
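
For readers wondering why TSV pastes cleanly into a spreadsheet: each row becomes one line and each cell is separated by a tab, which Excel and Google Sheets split into columns on paste. A minimal sketch of that serialisation, with a hypothetical `to_tsv` helper that is not part of the plugin, might look like this:

```python
import csv
import io


def to_tsv(headers, rows):
    """Serialise headers and rows as tab-separated text for clipboard paste."""
    buffer = io.StringIO()
    # The csv module quotes any cell that itself contains a tab or newline.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tsv(["id", "name"], [[1, "Ada"], [2, "Grace"]]), end="")
```

Using the `csv` writer instead of a plain `"\t".join(...)` keeps cells that contain tabs or line breaks from spilling into neighbouring columns.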
Open Settings → Tools → Data Preprocessor to configure:
- Preview row limit — limits how many rows the Preview tab renders
- Default normalization — selects the default normalization operation in the Clean tab
- Default train/test ratio — pre-fills the Train / Test split ratio field
Pro features are guarded with JetBrains Marketplace licensing. The plugin descriptor declares the Marketplace product code PDATAPREPROCESS; confirm this matches the Product Code assigned in JetBrains Marketplace before uploading a release.
```
src/main/java/com/datapreprocessor/
├── model/
│   ├── DataSet.java                          # In-memory tabular data model
│   ├── ColumnProfile.java                    # Per-column statistics
│   └── PipelineDocument.java                 # Serializable .dpp pipeline document
├── engine/
│   ├── DataLoader.java                       # CSV → DataSet (Apache Commons CSV)
│   ├── DataCleaner.java                      # All cleaning & transformation logic
│   ├── CodeGenerator.java                    # Generates Python, R, and SQL code
│   ├── DataChartFactory.java                 # JFreeChart histogram and box plot factory
│   ├── BatchProcessor.java                   # Applies one pipeline to many files
│   ├── PipelineExecutor.java                 # Shared pipeline execution for Apply and Batch
│   ├── PipelineSerializer.java               # Reads/writes .dpp pipeline JSON
│   ├── PipelineValidator.java                # Validates imported pipeline column references
│   └── DataExporter.java                     # CSV and .py / .R / .sql file export
├── licensing/
│   ├── ProFeature.java                       # Enum of all Pro-gated features
│   ├── ProFeatureGate.java                   # Central gate — reads JetBrains LicensingFacade
│   └── ProUpgradeUi.java                     # Opens IDE license manager / Marketplace pricing
├── platform/
│   └── IntellijPlatformCompat.java           # IDE API compatibility shims
├── settings/
│   ├── DataPreprocessorSettings.java         # Persistent plugin settings
│   └── DataPreprocessorConfigurable.java     # Settings UI under Tools
├── actions/
│   ├── OpenDataFileAction.java               # Right-click "Open in Data Preprocessor"
│   └── GeneratePreprocessingCodeAction.java  # Insert code at editor caret
└── toolwindow/
    ├── DataPreprocessorToolWindowFactory.java
    ├── DataPreprocessorToolWindow.java       # Coordinator — wires all panels (5 tabs)
    ├── HeaderBarPanel.java                   # Browse / path label / Reload
    ├── PreviewPanel.java                     # Tab 1 — raw data table
    ├── ProfilePanel.java                     # Tab 2 — per-column statistics
    ├── CleanPanel.java                       # Tab 3 — pipeline builder + actions
    ├── PipelineFileActions.java              # Import/export .dpp file workflow
    ├── CodePanel.java                        # Tab 4 — generated code viewer
    └── VisualisationPanel.java               # Tab 5 — histogram / box plot per column
```
- JDK 17+
- Use the included Gradle wrapper — do not use a system-installed Gradle

```bash
# Run the plugin in a sandbox IDE
./gradlew runIde

# Build the plugin
./gradlew buildPlugin
# Output: build/distributions/data-preprocessor-plugin-*.zip

# Publish to JetBrains Marketplace
export PUBLISH_TOKEN=<your-marketplace-token>
./gradlew publishPlugin
```

Note: Always run `./gradlew` (the wrapper), not `gradle`. The build requires Gradle 8.6, which the wrapper pins automatically. Using a system Gradle 9.x will fail.
To add a new cleaning operation:
- Add a variant to `CodeGenerator.Operation`
- Implement the logic in `DataCleaner`
- Add the generated-code translation in `CodeGenerator` for Python, R, and SQL where applicable
- Add a label in the `opSelector` combo box in `CleanPanel`
- Handle the new index in `CleanPanel.addStep()` and `CleanPanel.applySteps()`
Data Preprocessor does not collect, store, or transmit any user data. All data processing is performed entirely locally within your IDE. No CSV content, file paths, column names, or usage metrics are ever sent anywhere.
Full privacy policy: https://plugins.jetbrains.com/plugin/31226-data-preprocessor




