National Data Science Bowl (Classification)

Competition Overview


A competition to classify plankton images into 121 plankton classes.

National Data Science Bowl

Predict ocean health, one plankton at a time
https://www.kaggle.com/c/datasciencebowl


Description

Plankton are critically important to our ecosystem, accounting for more than half the primary productivity on earth and nearly half the total carbon fixed in the global carbon cycle. They form the foundation of aquatic food webs including those of large, important fisheries. Loss of plankton populations could result in ecological upheaval as well as negative societal impacts, particularly in indigenous cultures and the developing world. Plankton’s global significance makes their population levels an ideal measure of the health of the world’s oceans and ecosystems.

Traditional methods for measuring and monitoring plankton populations are time consuming and cannot scale to the granularity or scope necessary for large-scale studies. Improved approaches are needed. One such approach is through the use of an underwater imagery sensor. This towed, underwater camera system captures microscopic, high-resolution images over large study areas. The images can then be analyzed to assess species populations and distributions.

Manual analysis of the imagery is infeasible – it would take a year or more to manually analyze the imagery volume captured in a single day. Automated image classification using machine learning tools is an alternative to the manual approach. Analytics will allow analysis at speeds and scales previously thought impossible. The automated system will have broad applications for assessment of ocean and ecosystem health.

The National Data Science Bowl challenges you to build an algorithm to automate the image identification process. Scientists at the Hatfield Marine Science Center and beyond will use the algorithms you create to study marine food webs, fisheries, ocean conservation, and more. This is your chance to contribute to the health of the world’s oceans, one plankton at a time.


import numpy as np
import pandas as pd
import zipfile
import glob
from PIL import Image
import os
import matplotlib.pyplot as plt
import seaborn as sns
import random
import cv2

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.applications import EfficientNetB1
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import StratifiedKFold

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/datasciencebowl/plankton_identification.pdf
/kaggle/input/datasciencebowl/train.zip
/kaggle/input/datasciencebowl/sampleSubmission.csv.zip
/kaggle/input/datasciencebowl/test.zip

Loading the Data

  • The train and test sets are compressed as zip archives, so unzip them first.
with zipfile.ZipFile('/kaggle/input/datasciencebowl/train.zip','r') as z:
    z.extractall('train')
    
with zipfile.ZipFile('/kaggle/input/datasciencebowl/test.zip','r') as z:
    z.extractall('test')

Train Dataset

  • Build the train DataFrame, parsing each image's filename and label from its path.
train = glob.glob('train/*/*/*')  # paths look like train/train/<label>/<image>.jpg
train_df = pd.DataFrame({'path': train})
train_df['image'] = train_df['path'].apply(lambda x: x.split('/')[-1])
train_df['label'] = train_df['path'].apply(lambda x: x.split('/')[-2])
train_df
path image label
0 train/train/pteropod_theco_dev_seq/36451.jpg 36451.jpg pteropod_theco_dev_seq
1 train/train/pteropod_theco_dev_seq/44793.jpg 44793.jpg pteropod_theco_dev_seq
2 train/train/pteropod_theco_dev_seq/157712.jpg 157712.jpg pteropod_theco_dev_seq
3 train/train/pteropod_theco_dev_seq/4992.jpg 4992.jpg pteropod_theco_dev_seq
4 train/train/pteropod_theco_dev_seq/144126.jpg 144126.jpg pteropod_theco_dev_seq
... ... ... ...
30331 train/train/hydromedusae_shapeB/27912.jpg 27912.jpg hydromedusae_shapeB
30332 train/train/hydromedusae_shapeB/49210.jpg 49210.jpg hydromedusae_shapeB
30333 train/train/hydromedusae_shapeB/114615.jpg 114615.jpg hydromedusae_shapeB
30334 train/train/hydromedusae_shapeB/81391.jpg 81391.jpg hydromedusae_shapeB
30335 train/train/hydromedusae_shapeB/95843.jpg 95843.jpg hydromedusae_shapeB

30336 rows × 3 columns


Test Dataset

  • Build the test DataFrame.
test = glob.glob('./test/*/*')  # paths look like ./test/test/<image>.jpg
test_df = pd.DataFrame({'path': test})
test_df['image'] = test_df['path'].apply(lambda x: x.split('/')[-1])
test_df
path image
0 ./test/test/105188.jpg 105188.jpg
1 ./test/test/27651.jpg 27651.jpg
2 ./test/test/11940.jpg 11940.jpg
3 ./test/test/87538.jpg 87538.jpg
4 ./test/test/95396.jpg 95396.jpg
... ... ...
130395 ./test/test/34122.jpg 34122.jpg
130396 ./test/test/88776.jpg 88776.jpg
130397 ./test/test/78514.jpg 78514.jpg
130398 ./test/test/1830.jpg 1830.jpg
130399 ./test/test/8917.jpg 8917.jpg

130400 rows × 2 columns


Exploring the Data

  • Check the image shapes and sizes
  • Check the number of classes and their distribution

Viewing Sample Images

fig, axes = plt.subplots(4, 3)
fig.set_size_inches(15, 10)

for index in range(12):
    idx = random.randrange(len(train_df))
    # cv2 loads images as BGR, so convert to RGB before handing them to PIL
    img = cv2.cvtColor(cv2.imread(train_df.iloc[idx, 0], cv2.IMREAD_COLOR), cv2.COLOR_BGR2RGB)
    img = Image.fromarray(img).resize((256, 256))
    axes[index//3, index%3].imshow(img)
    axes[index//3, index%3].set_title(train_df['label'][idx])
    axes[index//3, index%3].axis('off')

(figure: 4×3 grid of randomly sampled training images with their class labels)


Checking a Sample Image's Size

sample_image = Image.open(train[1000])
print("Image size: ",sample_image.size)
sample_image.resize((128,128))
Image size:  (131, 251)

(figure: the sample image, resized to 128×128)


Mean Width and Height of the Train Dataset Images

# collect the width and height of every training image
w = []
h = []
for i in train:
    img = Image.open(i)
    w.append(img.size[0])
    h.append(img.size[1])
    img.close()

print("Train Images' mean width :", np.mean(w), "\nTrain Images' mean height :", np.mean(h))
Train Images' mean width : 73.50728507383967 
Train Images' mean height : 66.66182093881856
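
The mean alone hides how much the sizes vary. Since w and h from the cell above are still in scope, a short follow-up (not in the original notebook) shows the size quantiles, which helps justify the 150×150 target size used for the generators later on:

# quantiles of the image sizes, using w/h from the previous cell
print("width  5/50/95% :", np.percentile(w, [5, 50, 95]))
print("height 5/50/95% :", np.percentile(h, [5, 50, 95]))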

Label Distribution

print("The unique label numbers :",train_df['label'].nunique())

a,b = plt.subplots(1,1,figsize=(20,12))
plot = sns.countplot(train_df['label'])
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()
The unique label numbers : 121

(figure: number of training images per label across the 121 classes)

  • The number of images per label varies widely, i.e. the classes are imbalanced; a quick way to quantify this is sketched below.
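
To put numbers on the imbalance, value_counts() works directly on the label column. A minimal sketch, not in the original notebook:

counts = train_df['label'].value_counts()
print(counts.head(3))   # most populous classes
print(counts.tail(3))   # rarest classes
print("max/min class ratio:", counts.max() / counts.min())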

Test Dataset Preprocessing

idg_test = ImageDataGenerator()

# shuffle=False keeps the images in test_df row order, so prediction row i
# corresponds to test_df.iloc[i]
test_generator = idg_test.flow_from_dataframe(test_df,
                                              x_col='path',
                                              y_col=None,
                                              class_mode=None,
                                              shuffle=False,
                                              batch_size=64,
                                              target_size=(150, 150))
Found 130400 validated image filenames.
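
Because shuffle=False, the generator should yield files in test_df row order, which is what later makes it safe to concatenate the prediction matrix with test_df. A small sanity check, assuming keras_preprocessing's DataFrameIterator exposes its file list via filepaths:

# assumption: .filepaths returns the x_col values in dataframe order
assert list(test_generator.filepaths) == list(test_df['path'])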

Training the Model with Cross-Validation

k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
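
Given the class imbalance seen earlier, StratifiedKFold keeps each fold's label proportions close to those of the full dataset, so even rare classes appear in every validation split. A quick check on one split (illustrative, not in the original notebook):

# compare label proportions in one validation fold against the full dataset
tr_idx, va_idx = next(k_fold.split(train_df, train_df['label']))
print(train_df['label'].iloc[va_idx].value_counts(normalize=True).head(3))
print(train_df['label'].value_counts(normalize=True).head(3))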
result = 0
index = 0

for train_index, valid_index in k_fold.split(train_df, train_df['label']):

    X_train = train_df.iloc[train_index]
    X_val = train_df.iloc[valid_index]

    # augment only the training folds; validation stays untouched
    idg_train = ImageDataGenerator(horizontal_flip=True)
    idg_val = ImageDataGenerator()

    train_generator = idg_train.flow_from_dataframe(X_train,
                                                    x_col='path',
                                                    y_col='label',
                                                    batch_size=64,
                                                    target_size=(150, 150))

    val_generator = idg_val.flow_from_dataframe(X_val,
                                                x_col='path',
                                                y_col='label',
                                                batch_size=64,
                                                target_size=(150, 150))

    # tf.keras EfficientNet models rescale/normalize their inputs internally,
    # which is why the generators above use no rescale=1./255
    model = Sequential()
    model.add(EfficientNetB1(include_top=False, weights='imagenet', pooling='avg'))
    model.add(Dense(121, activation='softmax'))

    model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['acc'])

    es = EarlyStopping(patience=10, verbose=True)
    ckpt = ModelCheckpoint('best.h5', save_best_only=True, verbose=True, monitor='val_loss')
    rl = ReduceLROnPlateau(monitor='val_loss', patience=5, verbose=True)

    model.fit(train_generator,
              validation_data=val_generator,
              callbacks=[ckpt, rl, es],
              epochs=25)

    # reload the best (lowest val_loss) checkpoint before predicting
    model.load_weights('best.h5')

    # accumulate the fold's softmax predictions; dividing by 5 gives the
    # average over all five folds (soft voting)
    result += model.predict(test_generator, verbose=True) / 5

    index += 1
    print(f"\n\nFold {index} cross-validation complete ...\n\n")
Epoch 12/25
380/380 [==============================] - ETA: 0s - loss: 0.0739 - acc: 0.9796
Epoch 00012: val_loss did not improve from 0.93774
380/380 [==============================] - 112s 295ms/step - loss: 0.0739 - acc: 0.9796 - val_loss: 0.9785 - val_acc: 0.7618
Epoch 13/25
380/380 [==============================] - ETA: 0s - loss: 0.0572 - acc: 0.9852
Epoch 00013: val_loss did not improve from 0.93774
380/380 [==============================] - 112s 295ms/step - loss: 0.0572 - acc: 0.9852 - val_loss: 1.0230 - val_acc: 0.7636
Epoch 14/25
380/380 [==============================] - ETA: 0s - loss: 0.0483 - acc: 0.9866
Epoch 00014: val_loss did not improve from 0.93774
380/380 [==============================] - 112s 295ms/step - loss: 0.0483 - acc: 0.9866 - val_loss: 1.0412 - val_acc: 0.7620
Epoch 15/25
380/380 [==============================] - ETA: 0s - loss: 0.0404 - acc: 0.9894
Epoch 00015: val_loss did not improve from 0.93774

Epoch 00015: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
380/380 [==============================] - 112s 295ms/step - loss: 0.0404 - acc: 0.9894 - val_loss: 1.0788 - val_acc: 0.7603
Epoch 16/25
380/380 [==============================] - ETA: 0s - loss: 0.0322 - acc: 0.9927
Epoch 00016: val_loss did not improve from 0.93774
380/380 [==============================] - 113s 297ms/step - loss: 0.0322 - acc: 0.9927 - val_loss: 1.0791 - val_acc: 0.7623
Epoch 17/25
380/380 [==============================] - ETA: 0s - loss: 0.0290 - acc: 0.9941
Epoch 00017: val_loss did not improve from 0.93774
380/380 [==============================] - 112s 296ms/step - loss: 0.0290 - acc: 0.9941 - val_loss: 1.0779 - val_acc: 0.7625
Epoch 18/25
380/380 [==============================] - ETA: 0s - loss: 0.0285 - acc: 0.9938
Epoch 00018: val_loss did not improve from 0.93774
380/380 [==============================] - 113s 297ms/step - loss: 0.0285 - acc: 0.9938 - val_loss: 1.0824 - val_acc: 0.7640
Epoch 19/25
380/380 [==============================] - ETA: 0s - loss: 0.0260 - acc: 0.9948
Epoch 00019: val_loss did not improve from 0.93774
380/380 [==============================] - 113s 297ms/step - loss: 0.0260 - acc: 0.9948 - val_loss: 1.0838 - val_acc: 0.7651
Epoch 20/25
380/380 [==============================] - ETA: 0s - loss: 0.0253 - acc: 0.9952
Epoch 00020: val_loss did not improve from 0.93774

Epoch 00020: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.
380/380 [==============================] - 112s 296ms/step - loss: 0.0253 - acc: 0.9952 - val_loss: 1.0899 - val_acc: 0.7640
Epoch 00020: early stopping
2038/2038 [==============================] - 107s 53ms/step


Fold 5 cross-validation complete ...
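
Each fold adds predict(...) / 5 to result, so the final prediction is the arithmetic mean of the five fold models' softmax outputs (soft voting), which tends to lower log loss by averaging out per-fold variance. An equivalent standalone formulation, with hypothetical arrays:

# illustration only: stacking per-fold probability matrices and averaging
# over folds is the same as accumulating pred / n_folds inside the loop
fold_preds = [np.random.rand(3, 121) for _ in range(5)]  # hypothetical fold outputs
ensembled = np.mean(fold_preds, axis=0)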

Submission

  • Opening sampleSubmission.csv and checking the column order shows that the columns are not in alphabetical order.
  • They do not match the column order of result either, so in a case like this it is easier to build the submission file by hand.
with zipfile.ZipFile('/kaggle/input/datasciencebowl/sampleSubmission.csv.zip','r') as z:
    z.extractall('sub')

sub = pd.read_csv("sub/sampleSubmission.csv")
sub
image acantharia_protist_big_center acantharia_protist_halo acantharia_protist amphipods appendicularian_fritillaridae appendicularian_s_shape appendicularian_slight_curve appendicularian_straight artifacts_edge ... trichodesmium_tuft trochophore_larvae tunicate_doliolid_nurse tunicate_doliolid tunicate_partial tunicate_salp_chains tunicate_salp unknown_blobs_and_smudges unknown_sticks unknown_unclassified
0 1.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
1 10.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
2 100.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
3 1000.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
4 10000.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
130395 99994.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
130396 99995.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
130397 99996.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
130398 99997.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264
130399 99999.jpg 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 ... 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264 0.008264

130400 rows × 122 columns


# sort the sample-submission class columns alphabetically to match the
# generator's class order (flow_from_dataframe assigns indices alphabetically)
col_names = list(sub.columns[1:])
col_names.sort()
col_names
['acantharia_protist',
 'acantharia_protist_big_center',
 'acantharia_protist_halo',
 'amphipods',
 'appendicularian_fritillaridae',
 'appendicularian_s_shape',
 'appendicularian_slight_curve',
 'appendicularian_straight',
 'artifacts',
 'artifacts_edge',
 'chaetognath_non_sagitta',
 'chaetognath_other',
 'chaetognath_sagitta',
 'chordate_type1',
 'copepod_calanoid',
 'copepod_calanoid_eggs',
 'copepod_calanoid_eucalanus',
 'copepod_calanoid_flatheads',
 'copepod_calanoid_frillyAntennae',
 'copepod_calanoid_large',
 'copepod_calanoid_large_side_antennatucked',
 'copepod_calanoid_octomoms',
 'copepod_calanoid_small_longantennae',
 'copepod_cyclopoid_copilia',
  
  ...
 
 'pteropod_butterfly',
 'pteropod_theco_dev_seq',
 'pteropod_triangle',
 'radiolarian_chain',
 'radiolarian_colony',
 'shrimp-like_other',
 'shrimp_caridean',
 'shrimp_sergestidae',
 'shrimp_zoea',
 'siphonophore_calycophoran_abylidae',
 'siphonophore_calycophoran_rocketship_adult',
 'siphonophore_calycophoran_rocketship_young',
 'siphonophore_calycophoran_sphaeronectes',
 'siphonophore_calycophoran_sphaeronectes_stem',
 'siphonophore_calycophoran_sphaeronectes_young',
 'siphonophore_other_parts',
 'siphonophore_partial',
 'siphonophore_physonect',
 'siphonophore_physonect_young',
 'stomatopod',
 'tornaria_acorn_worm_larvae',
 'trichodesmium_bowtie',
 'trichodesmium_multiple',
 'trichodesmium_puff',
 'trichodesmium_tuft',
 'trochophore_larvae',
 'tunicate_doliolid',
 'tunicate_doliolid_nurse',
 'tunicate_partial',
 'tunicate_salp',
 'tunicate_salp_chains',
 'unknown_blobs_and_smudges',
 'unknown_sticks',
 'unknown_unclassified']
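
Sorting works because flow_from_dataframe assigns class indices in alphabetical order of the label names, so column j of result corresponds to the j-th name in sorted order. A hedged check against train_generator (the last fold's generator, still in scope):

# class_indices maps class name -> output column index, assigned alphabetically
assert col_names == sorted(train_generator.class_indices,
                           key=train_generator.class_indices.get)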

# result rows are in test_df order (shuffle=False) and its columns are in
# sorted class order, so pair them up and drop the path column
my_sub = pd.concat([test_df, pd.DataFrame(result, columns=col_names)], axis=1).drop(['path'], axis=1)
my_sub
image acantharia_protist acantharia_protist_big_center acantharia_protist_halo amphipods appendicularian_fritillaridae appendicularian_s_shape appendicularian_slight_curve appendicularian_straight artifacts ... trichodesmium_tuft trochophore_larvae tunicate_doliolid tunicate_doliolid_nurse tunicate_partial tunicate_salp tunicate_salp_chains unknown_blobs_and_smudges unknown_sticks unknown_unclassified
0 105188.jpg 3.407894e-05 5.234519e-07 4.727002e-07 2.523708e-05 6.414000e-07 1.429394e-04 7.389794e-05 2.510147e-05 8.997078e-08 ... 2.137576e-05 2.115729e-06 1.604188e-06 8.190606e-06 1.233781e-06 2.818242e-06 1.334955e-06 0.000023 7.377646e-05 0.000072
1 27651.jpg 2.613361e-06 2.237140e-07 1.978252e-06 6.420553e-07 1.818773e-06 1.416835e-05 8.689414e-06 2.637850e-07 2.187041e-08 ... 2.810447e-07 9.286011e-07 8.374320e-06 2.925759e-06 2.006822e-07 2.315101e-07 7.389501e-07 0.000014 2.626070e-06 0.000423
2 11940.jpg 6.104551e-06 2.521627e-06 2.513692e-07 4.636616e-05 2.085026e-06 1.489576e-05 8.225787e-05 1.137621e-05 3.021312e-07 ... 1.476521e-05 7.298316e-07 1.122896e-06 1.197497e-04 4.583527e-07 1.900000e-05 4.955790e-07 0.000482 3.988420e-05 0.002802
3 87538.jpg 1.197881e-05 4.275779e-06 1.933437e-05 3.641628e-03 1.369998e-03 2.135313e-04 4.967969e-04 7.344080e-04 8.830065e-06 ... 1.323840e-01 8.188009e-06 6.234657e-04 2.646256e-04 5.426758e-05 1.694718e-03 1.481177e-05 0.000878 4.522099e-02 0.106003
4 95396.jpg 1.079931e-07 1.477846e-06 4.025902e-07 9.353130e-05 6.802826e-07 3.575570e-06 4.541085e-07 1.869251e-07 2.157424e-07 ... 3.714577e-06 4.525396e-07 1.575890e-07 7.061017e-07 1.216337e-06 4.131224e-07 3.218001e-07 0.000011 8.910469e-07 0.000015
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
130395 34122.jpg 9.404398e-06 4.410570e-06 1.469415e-05 4.892956e-04 4.235670e-04 7.620276e-04 7.704680e-05 4.103683e-05 1.476081e-05 ... 1.900832e-04 1.661727e-05 9.112542e-02 4.120418e-04 3.442946e-04 3.249257e-04 1.036034e-05 0.033025 1.140596e-05 0.008211
130396 88776.jpg 1.188530e-04 3.108746e-06 7.981415e-06 2.125595e-07 5.005016e-08 5.221577e-07 6.974371e-07 1.526524e-06 1.525192e-05 ... 7.819511e-01 6.797829e-07 2.360160e-07 1.453742e-07 5.104931e-06 3.599948e-06 5.606108e-07 0.000010 1.053775e-04 0.000017
130397 78514.jpg 3.513707e-06 1.072021e-07 6.263767e-07 2.130197e-07 8.390234e-07 3.240724e-05 1.973183e-04 1.500386e-04 1.360382e-08 ... 1.262665e-07 2.861952e-08 2.094702e-04 1.114861e-06 8.004471e-07 1.075251e-05 1.032177e-07 0.000038 1.695668e-06 0.000293
130398 1830.jpg 2.687246e-08 6.054729e-10 1.001864e-07 2.510690e-08 9.267259e-09 2.659469e-08 6.407049e-08 6.902572e-08 9.994848e-01 ... 1.325866e-07 2.992527e-09 6.681008e-08 9.846344e-09 9.125281e-08 4.204970e-09 2.354498e-08 0.000091 7.961059e-08 0.000001
130399 8917.jpg 1.647273e-07 1.519561e-07 9.120576e-08 1.517367e-03 1.538393e-07 2.816295e-07 6.235562e-07 5.191119e-06 2.437675e-06 ... 7.615223e-07 3.302962e-06 1.213374e-02 8.384104e-01 9.281476e-05 2.865229e-03 1.414542e-05 0.000016 7.298431e-06 0.003105

130400 rows × 122 columns

my_sub.to_csv('bowl_sub.csv',index=False)  
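
The resulting CSV is large (130,400 rows × 122 columns). If the raw file is unwieldy to upload, pandas can write a gzip-compressed copy directly, and Kaggle generally accepts compressed submissions; an optional alternative:

# compression is inferred from the .gz extension
my_sub.to_csv('bowl_sub.csv.gz', index=False)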


Evaluation: Log Loss 0.65988
Rank: 35th out of 1,049
