How to create a dataset for text classification?

In this article I show how I built my own text classification dataset, which I needed so I could give commands to my smart home assistant. I work with the Hugging Face Transformers library; the code below uses its companion `datasets` package.

Defining the task for the dataset

Here I want to train my assistant so that when I say, for example, “Turn on the music”, it returns what I asked for as a command name. That is, I say “turn on the music” and it answers “music_on”. A few more examples:

  • Turn on the light – light_on
  • Turn off the light – light_off
  • Crank up some tunes – music_on
  • Kill the music, it's distracting – music_off
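
The mapping between command names and the integer class ids used for training can be kept in one place. This is only a minimal sketch: the order of `LABELS` (and therefore the ids) is an assumption made for illustration, `LABEL2ID` and `ID2LABEL` are hypothetical names, and the texts are English renderings of the example phrases.

```python
# Hypothetical label list; the order defines the integer ids (an assumption).
LABELS = ["music_off", "music_on", "light_on", "light_off"]

LABEL2ID = {name: idx for idx, name in enumerate(LABELS)}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

# A few (text, command) pairs based on the example phrases.
EXAMPLES = [
    ("Turn on the light", "light_on"),
    ("Turn off the light", "light_off"),
    ("Crank up some tunes", "music_on"),
    ("Kill the music, it's distracting", "music_off"),
]

# Convert command names to integer labels, the form a classifier is trained on.
rows = [{"text": text, "label": LABEL2ID[command]} for text, command in EXAMPLES]
print(rows[0])
```

Keeping `ID2LABEL` around makes it easy to turn a predicted class id back into a command name such as “music_on”.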

How to create a smart home dataset in Python?

This part of the article has two halves. The first covers creating such a dataset from scratch and adding a few items to it by hand; the second covers building it from CSV files.

How to create a dataset manually on a local machine?


First we create empty train and test datasets:

from datasets import DatasetDict, Dataset

train_dataset = Dataset.from_dict({
    "text": [],
    "label": []
})
test_dataset = Dataset.from_dict({
    "text": [],
    "label": []
})


Then we run an infinite loop that adds new data to the datasets. The code asks whether you want to add another item. If you do, type the text of a possible request and its class. The class must be given as an integer. For example, the text “turn off the music” gets label 0, “turn on the music” gets label 1, “turn on the light” gets label 2, and so on:

isDone = ""
while True:
    isDone = input("Add a new item? Y/N: ")
    if isDone.strip().upper() == "Y":
        train_dataset = train_dataset.add_item({"text": input("Enter train text: "), "label": int(input("Enter train label (integer): "))})
        test_dataset = test_dataset.add_item({"text": input("Enter test text: "), "label": int(input("Enter test label (integer): "))})
    else:
        break
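
The loop above stops with a `ValueError` if the label is not typed as an integer, and everything entered so far is lost. One possible guard is sketched below. `parse_label`, `ask_label` and `NUM_CLASSES` are hypothetical names, and the value 4 assumes the four commands from the task definition.

```python
NUM_CLASSES = 4  # assumption: music_off, music_on, light_on, light_off


def parse_label(raw, num_classes=NUM_CLASSES):
    """Return the label as an int, or None if the input is not a valid class id."""
    try:
        label = int(raw.strip())
    except ValueError:
        return None
    if 0 <= label < num_classes:
        return label
    return None


def ask_label(prompt):
    """Keep asking until the user types a valid class id."""
    while True:
        label = parse_label(input(prompt))
        if label is not None:
            return label
        print(f"Please enter an integer from 0 to {NUM_CLASSES - 1}.")
```

A call such as `ask_label("Enter train label (integer): ")` could then stand in for the bare `int(input(...))` calls.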


Next we put both datasets into one parent DatasetDict and save it to disk:

dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})
print("Saving dataset...", dataset_dict)
dataset_dict.save_to_disk("my_dataset_v1")


The complete script looks like this:


from datasets import DatasetDict, Dataset

train_dataset = Dataset.from_dict({
    "text": [],
    "label": []
})
test_dataset = Dataset.from_dict({
    "text": [],
    "label": []
})
isDone = ""
while True:
    isDone = input("Add a new item? Y/N: ")
    if isDone.strip().upper() == "Y":
        train_dataset = train_dataset.add_item({"text": input("Enter train text: "), "label": int(input("Enter train label (integer): "))})
        test_dataset = test_dataset.add_item({"text": input("Enter test text: "), "label": int(input("Enter test label (integer): "))})
    else:
        break


dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})
print("Saving dataset...", dataset_dict)
dataset_dict.save_to_disk("my_dataset_v1")

How to add data manually to a local dataset?


The story here is almost the same, with minor changes. First we load our dataset from disk, then run the same infinite loop, and finally save the result as a new dataset under a new name with the next version number:

from datasets import DatasetDict
datasetname = "my_dataset_v1"
# load_from_disk is a static method, so no empty DatasetDict is needed first
new_dataset_dict = DatasetDict.load_from_disk(datasetname)
print("Loaded dataset:", new_dataset_dict)

train_dataset = new_dataset_dict["train"]
test_dataset = new_dataset_dict["test"]


isDone = ""
while True:
    isDone = input("Add a new item? Y/N: ")
    if isDone.strip().upper() == "Y":
        train_dataset = train_dataset.add_item({"text": input("Enter train text: "), "label": int(input("Enter train label (integer): "))})
        test_dataset = test_dataset.add_item({"text": input("Enter test text: "), "label": int(input("Enter test label (integer): "))})
    else:
        break

new_dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})

# Bump the version suffix: "my_dataset_v1" -> "my_dataset_v2"
idx = datasetname.rfind("_v")
if idx != -1:
    version = int(datasetname[idx+2:]) + 1
    new_datasetname = datasetname[:idx] + f"_v{version}"
else:
    new_datasetname = datasetname + "_v1"

new_dataset_dict.save_to_disk(new_datasetname)
print("Saved dataset:", new_dataset_dict)
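
The version bump above relies on `rfind("_v")`, so a name such as "my_videos", where "_v" is not followed by a number, would make `int(...)` raise a `ValueError`. A slightly more defensive variant can be sketched with a regular expression; `next_version_name` is a hypothetical helper, not part of the `datasets` library.

```python
import re


def next_version_name(name):
    """Return `name` with its trailing _v<number> suffix incremented.

    Names without such a suffix get "_v1" appended.
    """
    match = re.fullmatch(r"(.*)_v(\d+)", name)
    if match is None:
        return name + "_v1"
    base, version = match.group(1), int(match.group(2))
    return f"{base}_v{version + 1}"


print(next_version_name("my_dataset_v1"))
```

The result could be passed to `save_to_disk` in place of `new_datasetname`.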

How to create a local dataset from a CSV file?


This is the most useful approach in practice: you already have training data in CSV files and need to turn it into a dataset for any of the models in the Transformers library. In my case the training data with requests to turn the music on and to turn it off lived in two separate files. Here is the code:

import csv
from datasets import DatasetDict, Dataset

train_dataset = Dataset.from_dict({
    "text": [],
    "label": []
})
test_dataset = Dataset.from_dict({
    "text": [],
    "label": []
})



with open('scripts/mydatasets/music_on.csv', 'r', encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    num = 0
    for row in reader:
        num = num + 1
        if num <= 30:  # the first 30 rows go to train, the rest to test
            train_dataset = train_dataset.add_item({"text": row[0], "label": int(row[1])})
        else:
            test_dataset = test_dataset.add_item({"text": row[0], "label": int(row[1])})

with open('scripts/mydatasets/music_off.csv', 'r', encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    num = 0
    for row in reader:
        num = num + 1
        if num <= 25:  # the first 25 rows go to train, the rest to test
            train_dataset = train_dataset.add_item({"text": row[0], "label": int(row[1])})
        else:
            test_dataset = test_dataset.add_item({"text": row[0], "label": int(row[1])})



dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})
print(train_dataset["text"], train_dataset["label"])
print("Saving dataset...", dataset_dict)
dataset_dict.save_to_disk("my_dataset")
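
Since `add_item` returns a new dataset on every call, an alternative is to collect the rows in plain lists first and build each split once. Below is a standard-library-only sketch of that collection step. `split_csv_rows` is a hypothetical helper, the sample data is made up, and the layout (header row, text in column 0, integer label in column 1) is assumed from the code above.

```python
import csv
import io


def split_csv_rows(file, train_size):
    """Read (text, label) rows and split them into train and test columns.

    Assumes a header row, the text in column 0 and an integer label in column 1.
    """
    reader = csv.reader(file)
    next(reader)  # skip the header row
    train = {"text": [], "label": []}
    test = {"text": [], "label": []}
    for num, row in enumerate(reader, start=1):
        # The first `train_size` rows go to train, the rest to test.
        target = train if num <= train_size else test
        target["text"].append(row[0])
        target["label"].append(int(row[1]))
    return train, test


# Made-up sample data standing in for a file such as music_on.csv.
sample = io.StringIO("text,label\nTurn on the music,1\nPlay something,1\nMusic please,1\n")
train, test = split_csv_rows(sample, train_size=2)
print(train)
print(test)
```

The returned dictionaries have the same `text`/`label` layout that is passed to `Dataset.from_dict` earlier, so each split could be created with a single `Dataset.from_dict(train)` style call.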

Now you can move on to training a text classification model on this dataset. The continuation is in a follow-up article.

In my VK community you can follow what I am working on right now, and maybe we can build something together, so join in.
