The Wayback Machine - https://web.archive.org/web/20200612220316/https://planetpython.org/


Planet Python

Last update: June 12, 2020 09:47 PM UTC

June 12, 2020


Stack Abuse

any() and all() in Python with Examples

Introduction to any() and all()

In this tutorial, we'll be covering the any() and all() functions in Python.

any(iterable) and all(iterable) are built-in functions in Python and have been around since Python 2.5. Both functions are equivalent to writing a series of or and and operators, respectively, between the elements of the passed iterable. They are convenience functions that shorten the code by replacing boilerplate loops.

Both methods short-circuit and return a value as soon as possible, so even with huge iterables, they're as efficient as they can be.
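As a quick sketch of that short-circuiting (the check() helper and the calls list are names made up for this illustration), we can record which elements actually get inspected:

```python
calls = []

def check(x):
    calls.append(x)  # record that this element was inspected
    return x > 2

# any() stops at the first truthy result
print(any(check(x) for x in [1, 3, 5, 7]))  # True
# the elements after 3 were never even checked
print(calls)  # [1, 3]
```

all() behaves symmetrically: it stops at the first falsy element.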

The and/or Operators

Let's remind ourselves how the and/or operators work, as these functions are based on them.

The or Operator

The or operator evaluates to True if any of the conditions (operands) are True.

print("(2 == 2) or (3 == 3) evaluates to: " + str((2 == 2) or (3 == 3)))
print("(2 == 2) or (3 == 2) evaluates to: " + str((2 == 2) or (3 == 2)))
print("(2 == 0) or (3 == 2) evaluates to: " + str((2 == 0) or (3 == 2)))

Output:

(2 == 2) or (3 == 3) evaluates to: True
(2 == 2) or (3 == 2) evaluates to: True
(2 == 0) or (3 == 2) evaluates to: False

We can chain multiple ors in a single statement, and it will again evaluate to True if any of the conditions are True:

print(False or False or False or True or False)

This results in:

True

The and Operator

The and operator evaluates to True only if all conditions are True:

print("(2 == 2) and (3 == 3) evaluates to: " + str((2 == 2) and (3 == 3)))
print("(2 == 2) and (3 == 2) evaluates to: " + str((2 == 2) and (3 == 2)))
print("(2 == 0) and (3 == 2) evaluates to: " + str((2 == 0) and (3 == 2)))

Output:

(2 == 2) and (3 == 3) evaluates to: True
(2 == 2) and (3 == 2) evaluates to: False
(2 == 0) and (3 == 2) evaluates to: False

Similarly to or, we can chain multiple and operators, and they will evaluate to True only if all the operands evaluate to True:

print(True and True and True and False and True)

This results in:

False

any()

The method any(iterable) behaves like a series of or operators between each element of the iterable we passed. It's used to replace loops similar to this one:

def my_any(some_iterable):
    for element in some_iterable:
        if element:
            return True
    return False

We get the same result by simply calling any(some_iterable):

print(any([2 == 2, 3 == 2]))
print(any([True, False, False]))
print(any([False, False]))

Running this piece of code would result in:

True
True
False

Note: Unexpected behavior may happen when using any() with dictionaries and data types other than boolean. If any() is used with a dictionary, it checks whether any of the keys evaluate to True, not the values:

# use a name like d to avoid shadowing the built-in dict
d = {True: False, False: False}

print(any(d))

This outputs:

True

If any() checked the values instead, the output would have been False.
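If the values (or key-value pairs) are what you care about, you can make that explicit with the dictionary's values() view; a quick sketch:

```python
d = {True: False, False: False}

print(any(d))           # True  -- iterates over the keys by default
print(any(d.values()))  # False -- no value is truthy
print(all(d))           # False -- the key False fails the check
```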

The method any() is often used in combination with the map() function and list comprehensions:

old_list = [2, 1, 3, 8, 10, 11, 13]
list_if_even = list(map(lambda x: x % 2 == 0, old_list))
list_if_odd = [x % 2 != 0 for x in old_list]

print(list_if_even)
print(list_if_odd)

print("Are any of the elements even? " + str(any(list_if_even)))
print("Are any of the elements odd? " + str(any(list_if_odd)))

This outputs:

[True, False, False, True, True, False, False]
[False, True, True, False, False, True, True]
Are any of the elements even? True
Are any of the elements odd? True

Note: If an empty iterable is passed to any(), the method returns False.

If you'd like to read more about the map(), filter() and reduce() functions, we've got you covered!

all()

The all(iterable) method evaluates like a series of and operators between each of the elements in the iterable we passed. It is used to replace loops similar to this one:

def my_all(iterable):
    for element in iterable:
        if not element:
            return False
    return True

The method returns True only if every element in iterable evaluates to True, and False otherwise:

print(all([2 == 2, 3 == 2]))
print(all([2 > 1, 3 != 4]))
print(all([True, False, False]))
print(all([False, False]))

This outputs:

False
True
False
False

Note: Just like with any(), unexpected behavior may happen when passing dictionaries and data types other than boolean. Again, if all() is used with a dictionary, it checks whether all of the keys evaluate to True, not the values.

Another similarity with any() is that all() is also commonly used in combination with the map() function and list comprehensions:

old_list = ["just", "Some", "text", "As", "An", "example"]
list_begins_upper = list(map(lambda x: x[0].isupper(), old_list))
list_shorter_than_8 = [len(x) < 8 for x in old_list]

print(list_begins_upper)
print(list_shorter_than_8)

print("Do all the strings begin with an uppercase letter? " + str(all(list_begins_upper)))
print("Are all the strings shorter than 8? " + str(all(list_shorter_than_8)))

This outputs:

[False, True, False, True, True, False]
[True, True, True, True, True, True]
Do all the strings begin with an uppercase letter? False
Are all the strings shorter than 8? True

Note: If an empty iterable is passed to all(), the method returns True! This is because the code for all() checks if there are any False elements in the iterable, and in the case of an empty list there are no elements and therefore there are no False elements either.
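The two empty-iterable edge cases side by side:

```python
print(any([]))  # False -- there is no truthy element
print(all([]))  # True  -- vacuously true: there is no falsy element either
```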

Boolean Conversion and any(), all() Functions

A common cause of confusion and errors when using any logical operators, and therefore when using any() and all() as well, is what happens when the elements aren't of the boolean data type. In other words, when they aren't exactly True or False but instead have to be evaluated to True or False.

Some programming languages don't evaluate non-boolean data types to booleans. For example, Java would complain if you tried something along the lines of if("some string") or if(15), telling you that the type you used can't be converted to boolean.

Python on the other hand does no such thing, and will instead convert what you passed to boolean without warning you about it.

Python converts most things to True, with a few exceptions: False, None, numeric zero of any type (0, 0.0, 0j), and empty sequences and collections ("", (), [], {}, set(), range(0)) all evaluate to False.
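We can verify this with bool(), which performs the same truthiness conversion that any() and all() apply to each element:

```python
falsy_values = [False, None, 0, 0.0, 0j, "", (), [], {}, set(), range(0)]
print([bool(value) for value in falsy_values])  # every entry is False

truthy_values = [-2, 0.1, "text", [0], (None,)]
print([bool(value) for value in truthy_values])  # every entry is True
```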

Here are a few examples of how we can use the way Python "boolean-izes" other data types with any() and all():

list_of_strings = ["yet", "another", "example", ""]
print("Do all strings have some content?", all(list_of_strings))

list_of_ints = [0, 0.0, -2, -5]
print("Are any of the ints different than 0?", any(list_of_ints))

This outputs:

Do all strings have some content? False
Are any of the ints different than 0? True

Keep in mind that you might still want to write more readable code by not using implicit boolean conversion like this.

Conclusion

Both the any() and all() functions exist for convenience's sake and should be used only when they make the code shorter while maintaining its readability.

In this article, we've jumped into the any() and all() functions and showcased their usage through several examples.

June 12, 2020 12:23 PM UTC


Real Python

The Real Python Podcast – Episode #13: PDFs in Python and Projects on the Raspberry Pi

Have you wanted to work with PDF files in Python? Maybe you want to extract text, merge and concatenate files, or even create PDFs from scratch. Are you interested in building hardware projects using a Raspberry Pi? This week on the show we have David Amos from the Real Python team to discuss his recent article on working with PDFs. David also brings a few other articles from the wider Python community for us to discuss.



June 12, 2020 12:00 PM UTC


Python Bytes

#185 This code is snooping on you (a good thing!)

Sponsored by Datadog: pythonbytes.fm/datadog

Brian #1: MyST - Markedly Structured Text (https://myst-parser.readthedocs.io/en/latest/)

  • I think this came from a tweet from Chris Holdgraf
  • A fully-functional markdown flavor and parser for Sphinx.
  • MyST allows you to write Sphinx documentation entirely in markdown. MyST markdown provides a markdown equivalent of the reStructuredText syntax, meaning that you can do anything in MyST that you can do with reStructuredText. It is an attempt to have the best of both worlds: the flexibility and extensibility of Sphinx with the simplicity and readability of Markdown.
  • MyST has the following main features:
      • A markdown parser for Sphinx. You can write your entire Sphinx documentation in markdown.
      • Call Sphinx directives and roles from within Markdown, allowing you to extend your document via Sphinx extensions.
      • Extended Markdown syntax for useful rST features, such as line commenting and footnotes.
      • A Sphinx-independent parser of MyST markdown that can be extended to add new functionality and outputs for MyST.
      • A superset of CommonMark markdown. Any CommonMark markdown (such as Jupyter Notebook markdown) is natively supported by the MyST parser.

Michael #2: direnv (https://direnv.net/)

  • via __dann__
  • direnv is an extension for your shell. It augments existing shells with a new feature that can load and unload environment variables depending on the current directory.
  • Use cases:
      • Load 12factor apps environment variables
      • Create per-project isolated development environments
      • Load secrets for deployment
  • Before each prompt, direnv checks for the existence of a .envrc file in the current and parent directories. If the file exists, it is loaded into a bash sub-shell and all exported variables are then captured by direnv and made available to the current shell.
  • It supports hooks for all the common shells like bash, zsh, tcsh and fish. This allows project-specific environment variables without cluttering the ~/.profile file.
  • Because direnv is compiled into a single static executable, it is fast enough to be unnoticeable on each prompt.

Brian #3: Convert a Python Enum to JSON (https://hultner.se/quickbits/2018-03-12-python-json-serializable-enum.html), by Alexander Hultner

Problem:

  • Enum values by default are not serializable, so you can't use them as values in JSON or as values passed to databases.

Solution:

  • Derived enumerations, like IntEnum or custom derived enumerations, are simple to define and serializable. You can convert them to JSON and store them as database values.

Example:

>>> from enum import Enum, IntEnum
>>> import json
>>> class Color(Enum):
...     red = 1
...     blue = 2
...
>>> c = Color.red
>>> c
<Color.red: 1>
>>> json.dumps(c)
Traceback (most recent call last):
  ...
TypeError: Object of type Color is not JSON serializable
>>> class Color(IntEnum):
...     red = 1
...     blue = 2
...
>>> c = Color.red
>>> c
<Color.red: 1>
>>> json.dumps(c)
'1'
>>> class Color(str, Enum):
...     red = "red"
...     blue = "blue"
...
>>> c = Color.red
>>> c
<Color.red: 'red'>
>>> json.dumps(c)
'"red"'

Michael #4: Pendulum: Python datetimes made easy (https://pendulum.eustace.io/)

  • via tuckerbeck
  • Drop-in replacement for the standard datetime class.
  • Time deltas:

dur = pendulum.duration(days=15)

# More properties
dur.weeks
dur.hours

# Handy methods
dur.in_hours()
360
dur.in_words(locale="en_us")
'2 weeks 1 day'

  • Intervals:

dt = pendulum.now()
# A period is the difference between 2 instances
period = dt - dt.subtract(days=3)
period.in_weekdays()

# A period is iterable
for dt in period:
    print(dt)

Brian #5: PySnooper - Never use print for debugging again (https://github.com/cool-RR/pysnooper)

  • Thanks @pylang23 for the suggestion.
  • With PySnooper you can just add one decorator line to a function and you get a play-by-play log of your function, including which lines ran and when, and exactly when local variables were changed.
  • Logs:
      • every modified variable with its value
      • which line of code is being run
      • return value
      • passed-in parameters
      • elapsed time
  • Options to:
      • isolate logging to a section of a function with a with block
      • log to a file instead of stdout
      • extend the watch to a list of non-local variables
      • extend the watch to functions called by the decorated function
  • All with a simple decorator and a pretty simple API

Michael #6: Fil: A New Python Memory Profiler for Data Scientists and Scientists (https://pythonspeed.com/articles/memory-profiler-data-scientists/)

  • via PyCoders
  • If your Python data pipeline is using too much memory, it can be very difficult to figure out where exactly all that memory is going.
  • There are existing memory profilers for Python that help you measure memory usage, but none of them are designed for batch processing applications that read in data, process it, and write out the result.
  • What you need is some way to know exactly where peak memory usage is, and what code was responsible for memory at that point. That's exactly what the Fil memory profiler does.
  • Because of the difference in lifetime, the impact of memory usage is different:
      • Servers: Because they run forever, memory leaks are a common cause of memory problems. Even a small amount of leakage can add up over tens of thousands of calls. Most servers process only small amounts of data at a time, so actual business-logic memory usage is usually less of a concern.
      • Data pipelines: With a limited lifetime, small memory leaks are less of a concern. Spikes in memory usage due to processing large chunks of data are a more common problem.
  • This is Fil's primary goal: diagnosing spikes in memory usage.
  • Many tools track just Python memory; Fil captures all allocations going to the standard C memory allocation APIs.

Extras:

Michael:

  • Student cohorts: training.talkpython.fm/cohorts/apply, but had to close after just a day due to high volume

Joke:

  • Senior dev: Where did you get the code that does this from?
  • Junior dev: Stack Overflow
  • Senior dev: Was it from the question part or from the answer part?

June 12, 2020 08:00 AM UTC


Janusworx

A Hundred Days of Code, Day 043

Continuing with the Flask course.

Today I learnt about how to loop, using Jinja loop blocks.
The syntax is slowly becoming clear to me.
Everything Python-related is enclosed in {% … %} blocks, except for variables, which use their own {{ … }} syntax.

What I am still confused about is the relationship between the various files I am writing. There is HTML, then there are templates, and then there are the Python files themselves. Hopefully that will get clearer in the days to come.
My naïve understanding, right now, is

  1. Some native python code is mainly for launching and running the app.
  2. The html templates pull data from …
  3. The flask python code I write (the routes file, as of now).

I also learnt how to extend templates. I created a base template that basically contains the header and the title, which will now be used by every new webpage I build. Right now, it's just the home page.
I can see how a footer or header or some such persistent element that needs to be on every page can be created once and then extended multiple times.
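For anyone following along, here is a minimal, self-contained sketch of the loop and extends ideas (the template names and strings are made up, and it uses the jinja2 library directly rather than through Flask):

```python
from jinja2 import Environment, DictLoader

# Hypothetical in-memory templates, just to illustrate the syntax
templates = {
    "base.html": "<title>{{ title }}</title>{% block content %}{% endblock %}",
    "index.html": (
        '{% extends "base.html" %}'
        "{% block content %}"
        "{% for item in items %}<li>{{ item }}</li>{% endfor %}"
        "{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(templates))
page = env.get_template("index.html").render(title="Home", items=["a", "b"])
print(page)  # <title>Home</title><li>a</li><li>b</li>
```

The base template supplies the persistent parts; the child template only fills in the block.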

More, tomorrow …

P.S. Looking at that finished app, and knowing my extremely rudimentary Python skills, I feel like an apprentice mason, hammer and a chisel in hand, wondering, how in heck, am I going to carve David?


June 12, 2020 07:41 AM UTC


Programiz

Python RegEx

In this tutorial, you will learn about regular expressions (RegEx), and use Python's re module to work with RegEx (with the help of examples).

June 12, 2020 03:08 AM UTC

June 11, 2020


Janusworx

A Hundred Days of Code, Day 042

Second day with the Flask course.

Beginning to realise that Flask is not a monolithic thing, but consists of a lot of moving parts.
Looking forward to learning what they are as I progress along.

Today I learnt how to set my Flask variable, and create an environment, so that I can run Flask consistently without problems.

Miguel also teaches a simple, yet effective way to combat yak shaving.
You know, where all you want is one simple thing, but then that depends on that other thing, which reminds you that you need that third thing and the next thing you know, you’re at the zoo, shaving a yak, all so you can wax your car.
Just don’t do that other thing.
Focus on what you are doing.
If there is something you need, use a dummy. Mock something up.
This is a very real, meta lesson, that I’ll carry with me for the rest of my days.

So I learnt how to return a web page.
And that got tiring really quickly.
Which is when Miguel introduced me to templates.
And I realise why they are needed.
I created a basic template and also learnt about conditionals in the templating language, Jinja and then I called a stop to the day.
More to follow tomorrow.


June 11, 2020 11:06 AM UTC


PSF GSoC students blogs

[Week 1] Check-in

1. What did you do this week?

This week's main job was to build a calculation graph. The core of the automatic differentiation system is the vjp (vector-Jacobian product), which is composed of calculation-graph construction and gradient calculation.

2. Difficulty


Computational graph construction is not as simple as I thought. The main problem encountered in the process is that some basic differential operators need not only the tensors to be passed in, but also some parameters of the function, as with np.sum. This requires that the parameters needed by the differential operator be passed in advance, when the calculation graph is constructed. In addition, depending on the calculation path, the parents of each node in the calculation graph have to be marked appropriately, so that gradients can flow along the correct path during back propagation.
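A toy illustration of the idea (not the project's actual code; Node and graph_sum are invented names): each node records its parents and the operator arguments captured at graph-construction time, so the backward pass can route gradients correctly.

```python
import numpy as np

class Node:
    """A toy computation-graph node: stores a value plus (parent, vjp) pairs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)

def graph_sum(x, axis=None):
    # The axis parameter must be captured now, at graph-construction time,
    # so the vjp can broadcast the incoming gradient back to x's shape.
    value = np.sum(x.value, axis=axis)

    def vjp(g):
        return np.broadcast_to(g, x.value.shape)

    return Node(value, parents=[(x, vjp)])

x = Node(np.array([1.0, 2.0, 3.0]))
y = graph_sum(x)
parent, vjp = y.parents[0]
print(y.value)   # 6.0
print(vjp(1.0))  # [1. 1. 1.]
```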

3. What is coming up next?


Next week's work is to implement simple back propagation and complete a full differential operation.

June 11, 2020 07:12 AM UTC

Weekly Check In - 1

What did I do till now?

As the Community Bonding phase finished, I started coding the HTTP/2 client protocol. I started simple by adding support for GET requests.

What's coming up next?

Next week I plan to

Did I get stuck anywhere?

Initially I was intimidated by some of the libraries I was using for my project; now I am comfortable working with them. I was stuck on combining, in the proper order, the different chunks of data received from the server for multiple streams, but that is fixed now 😊

June 11, 2020 06:02 AM UTC


Fabio Zadrozny

PyDev 7.6.0 (Python 3.8 parsing fixes and debugger improvements)

PyDev 7.6.0 is now available for download.

This release brings multiple fixes to parsing the Python 3.8 grammar (in particular, dealing with f-strings and iterable unpacking had some corner cases that weren't well supported).

Also, the debugger had a number of improvements, such as:





Besides those, there were a number of other improvements. Some noteworthy ones are support for the latest pytest, a faster PyDev Package Explorer, type inference for type comments with self attributes (i.e.: #: :type self.var: MyClass), and properly recognizing trailing commas on automatic import.

Acknowledgements

Thanks to Luis Cabral, who is now helping on the project, for doing many of those improvements (and to the Patrons at https://www.patreon.com/fabioz who enabled it to happen).

Thanks to Microsoft for sponsoring the debugger improvements, which are also available in Python in Visual Studio and the Python extension for Visual Studio Code.

Enjoy!
--
Fabio

June 11, 2020 05:13 AM UTC


Matt Layman

A View From Start To Finish - Building SaaS #60

In this episode, I created a view to add students from beginning to end. I used Error Driven Development to guide what I needed to do next to make the view, then wrote tests, and finished it all off by writing the template code. At the start of the episode, I gave a quick overview of the models in my application and which models I planned to focus on for the stream.

June 11, 2020 12:00 AM UTC

June 10, 2020


Nathan Piccini Data Science Dojo Blog

Building an AI-based Chatbot in Python


Chatbots have become extremely popular in recent years and their use in industry has skyrocketed. The chatbot market is projected to grow from $2.6 billion in 2019 to $9.4 billion by 2024. This really doesn't come as a surprise when you look at the immense benefits chatbots bring to businesses. According to a study by IBM, chatbots can reduce customer service costs by up to 30%.

In the third blog of A Beginner's Guide to Chatbots, we'll be taking you through how to build a simple AI-based chatbot with Chatterbot, a Python library for building chatbots.

Introduction to Chatterbot

Chatterbot is a Python-based library that makes it easy to build AI-based chatbots. The library uses machine learning to learn from conversation datasets and generate responses to user inputs. The library allows developers to train their chatbot instance with pre-provided language datasets as well as build their own datasets.

Training Chatterbot

A newly initialized Chatterbot instance starts off with no knowledge of how to communicate. To allow it to properly respond to user inputs, the instance needs to be trained to understand how conversations flow. Since Chatterbot relies on machine learning at its backend, it can very easily be taught conversations by providing it with datasets of conversations.

Chatterbot’s training process works by loading example conversations from provided datasets into its database. The bot uses the information to build a knowledge graph of known input statements and their probable responses. This graph is constantly improved and upgraded as the chatbot is used.

Chatterbot Knowledge Graph (Source: Chatterbot Knowledgebase)

Chatterbot Corpus

The Chatterbot Corpus is an open-source user-built project that contains conversational datasets on a variety of topics in 22 languages. These datasets are perfect for training a chatbot on the nuances of languages – such as all the different ways a user could greet the bot. This means that developers can jump right to training the chatbot on their custom data without having to spend time teaching common greetings.

Chatterbot has built-in functions to download and use datasets from the Chatterbot Corpus for initial training.

Chatterbot Logic Adapters

Chatterbot uses Logic Adapters to determine the logic for how a response to a given input statement is selected.

A typical logic adapter designed to return a response to an input statement will use two main steps to do this. The first step involves searching the database for a known statement that matches or closely matches the input statement. Once a match is selected, the second step involves selecting a known response to the selected match. Frequently, there will be a number of existing statements that are responses to the known match. In such situations, the Logic Adapter will select a response randomly. If more than one Logic Adapter is used, the response with the highest cumulative confidence score from all Logic Adapters will be selected.

How Logic Adapters Work (Source: Chatterbot Knowledgebase)

Chatterbot Storage Adapters

Chatterbot stores its knowledge graph and user conversation data in a SQLite database. Developers can interface with this database using Chatterbot’s Storage Adapters.

Storage Adapters allow developers to change the default database from SQLite to MongoDB or any other database supported by the SQLAlchemy ORM. Developers can also use these Adapters to add, remove, search and modify user statements and responses in the Knowledge Graph as well as create, modify and query other databases that Chatterbot might use.

Building a Chatbot

In this tutorial, we will be using the Chatterbot Python library to build an AI-based Chatbot.

We will be following the steps below to build our chatbot:

  1. Importing Dependencies
  2. Instantiating a ChatBot Instance
  3. Training on Chatterbot-Corpus Data
  4. Training on Custom Data
  5. Building a frontend

Importing Dependencies

The first thing we'll need to do is import the modules we'll be using. The ChatBot class will be used to instantiate our chatbot object. The ListTrainer class allows us to train our chatbot on a custom list of statements that we will define. The ChatterBotCorpusTrainer class contains code to download and train our chatbot on datasets that are part of the ChatterBot Corpus Project.

#Importing modules
from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer
from chatterbot.trainers import ChatterBotCorpusTrainer

Instantiating a ChatBot Instance

A chatbot instance can be created by creating a ChatBot object. The ChatBot object needs a name for the chatbot and must reference any logic or storage adapters you might want to use.

In case you don't want your chatbot to learn from user inputs after it has been trained, you can set the read_only parameter to True.

BankBot = ChatBot(name = 'BankBot',
                  read_only = False,                  
                  logic_adapters = ["chatterbot.logic.BestMatch"],                 
                  storage_adapter = "chatterbot.storage.SQLStorageAdapter")

Training on Chatterbot-Corpus Data

Training your chatbot agent on data from the Chatterbot-Corpus project is relatively simple. To do that, you need to instantiate a ChatterBotCorpusTrainer object and call its train() method. The ChatterBotCorpusTrainer takes in your ChatBot object as an argument. The train() method takes in the name of the dataset you want to use for training as an argument.

Detailed information about ChatterBot-Corpus Datasets is available on the project’s Github repository.

corpus_trainer = ChatterBotCorpusTrainer(BankBot)
corpus_trainer.train("chatterbot.corpus.english")

Training on Custom List Data

You can also train ChatterBot on custom conversations. This can be done by using the module’s ListTrainer class.

In this case, you will need to pass in a list of statements, where the order of each statement is based on its placement in a given conversation. Each statement in the list is treated as a possible response to its predecessor in the list.

The training can be undertaken by instantiating a ListTrainer object and calling the train() method. It is important to note that the train() method must be individually called for each list to be used.

greet_conversation = [
    "Hello",
    "Hi there!",
    "How are you doing?",
    "I'm doing great.",
    "That is good to hear",
    "Thank you.",
    "You're welcome."
]
 
open_timings_conversation = [
    "What time does the Bank open?",
    "The Bank opens at 9AM",
]
 
close_timings_conversation = [
    "What time does the Bank close?",
    "The Bank closes at 5PM",
]

#Initializing Trainer Object
trainer = ListTrainer(BankBot)

#Training BankBot
trainer.train(greet_conversation)
trainer.train(open_timings_conversation)
trainer.train(close_timings_conversation)

Building a Frontend

Once the chatbot has been trained, it can be used by calling the chatbot's get_response() method. The method takes a user string as input and returns a response string.

while True:
    user_input = input()
    if user_input == 'quit':
        break
    response = BankBot.get_response(user_input)
    print(response)

Conclusion

This blog was a hands-on introduction to building a simple AI-based chatbot in Python. The functionality of this bot can easily be increased by adding more training examples. You could, for example, add more lists of custom responses related to your application.

As we saw, building an AI-based chatbot is relatively easy compared to building and maintaining a Rule-based Chatbot. Despite this ease, chatbots such as this are very prone to mistakes and usually give robotic responses because of a lack of good training data.

A better way of building robust AI-based chatbots is to use the conversational AI tools offered by companies like Google and Amazon. These tools are based on complex machine learning models trained on millions of datasets. This makes them extremely intelligent and, in most cases, almost indistinguishable from human operators.

In the next blog in the series, we’ll be looking at how to build a simple AI-based Chatbot using Google’s DialogFlow Conversational AI Platform.

June 10, 2020 07:58 PM UTC


PSF GSoC students blogs

GSoC week 1

This week I've been largely focused on blog-posting bugs, notably #394 and #396. These two bugs are not actually part of the python-blogs codebase, but of the now-abandoned aldryn-newsblog project, which we use as a dependency.

Since aldryn_newsblog is abandoned and its repository set permanently read-only, the first step was to pull that module into the python-blogs tree. After spending a few hours unsuccessfully attempting to merge aldryn_newsblog's commit history into a python-blogs branch and then merge that branch into master, I gave up and just copied aldryn_newsblog's final version as a subdirectory (abandoning its five years of commit history in the process). Someday, if anyone really cares and has more git experience than I do, they can always run this merge "correctly".

$ pip uninstall aldryn_newsblog

Uninstalling aldryn-newsblog-2.2.1:
  Would remove:
    /usr/local/lib/python3.8/site-packages/aldryn_newsblog-2.2.1.dist-info/*
    /usr/local/lib/python3.8/site-packages/aldryn_newsblog/*
Proceed (y/n)?

So long global module, hello local module, and the website still works.

I'm now staring at the mess that is our tag cleaning. There appear to be at least three HTML cleaners used by the python-blogs codebase, but embarrassingly I found that most of #394 can be resolved by adding tags to gsoc/settings.py and has almost nothing to do with aldryn_newsblog.

June 10, 2020 05:47 PM UTC


Peter Bengtsson

./bin/huey-isnt-running.sh - A bash script to prevent lurking ghosts

tl;dr; Here's a useful bash script to avoid starting something when it's already running as a ghost process.

June 10, 2020 02:56 PM UTC


Real Python

SettingWithCopyWarning in Pandas: Views vs Copies

NumPy and Pandas are very comprehensive, efficient, and flexible Python tools for data manipulation. An important concept for proficient users of these two libraries to understand is how data are referenced as shallow copies (views) and deep copies (or just copies). Pandas sometimes issues a SettingWithCopyWarning to warn the user of a potentially inappropriate use of views and copies.

In this article, you’ll learn:

  • What views and copies are in NumPy and Pandas
  • How to properly work with views and copies in NumPy and Pandas
  • Why the SettingWithCopyWarning happens in Pandas
  • How to avoid getting a SettingWithCopyWarning in Pandas

You’ll first see a short explanation of what the SettingWithCopyWarning is and how to avoid it. You might find this enough for your needs, but you can also dig a bit deeper into the details of NumPy and Pandas to learn more about copies and views.

Free Bonus: Click here to get access to a free NumPy Resources Guide that points you to the best tutorials, videos, and books for improving your NumPy skills.

Prerequisites

To follow the examples in this article, you’ll need Python 3.7 or 3.8, as well as the libraries NumPy and Pandas. This article is written for NumPy version 1.18.1 and Pandas version 1.0.3. You can install them with pip:

$ python -m pip install -U "numpy==1.18.*" "pandas==1.0.*"

If you prefer Anaconda or Miniconda distributions, you can use the conda package management system. To learn more about this approach, check out Setting Up Python for Machine Learning on Windows. For now, it’ll be enough to install NumPy and Pandas in your environment:

$ conda install numpy=1.18.* pandas=1.0.*

Now that you have NumPy and Pandas installed, you can import them and check their versions:

>>>
>>> import numpy as np
>>> import pandas as pd

>>> np.__version__
'1.18.1'
>>> pd.__version__
'1.0.3'

That’s it. You have all the prerequisites for this article. Your versions might vary slightly, but the information below will still apply.

Note: This article requires you to have some prior Pandas knowledge. You’ll also need some knowledge of NumPy for the later sections.

To refresh your NumPy skills, you can check out the following resources:

To remind yourself about Pandas, you can read the following:

Now you’re ready to start learning about views, copies, and the SettingWithCopyWarning!

Example of a SettingWithCopyWarning

If you work with Pandas, chances are that you’ve already seen a SettingWithCopyWarning in action. It can be annoying and sometimes hard to understand. However, it’s issued for a reason.

The first thing you should know about the SettingWithCopyWarning is that it’s not an error. It’s a warning. It warns you that you’ve probably done something that’s going to result in unwanted behavior in your code.

Let’s see an example. You’ll start by creating a Pandas DataFrame:

>>>
>>> data = {"x": 2**np.arange(5),
...         "y": 3**np.arange(5),
...         "z": np.array([45, 98, 24, 11, 64])}

>>> index = ["a", "b", "c", "d", "e"]

>>> df = pd.DataFrame(data=data, index=index)
>>> df
    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64

This example creates a dictionary referenced by the variable data that contains:

  • The keys "x", "y", and "z", which will be the column labels of the DataFrame
  • Three NumPy arrays that hold the data of the DataFrame

You create the first two arrays with the routine numpy.arange() and the last one with numpy.array(). To learn more about arange(), check out NumPy arange(): How to Use np.arange().

The list attached to the variable index contains the strings "a", "b", "c", "d", and "e", which will be the row labels for the DataFrame.

Finally, you initialize the DataFrame df that contains the information from data and index. You can visualize it like this:

Read the full article at https://realpython.com/pandas-settingwithcopywarning/ »
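As a hedged illustration of what the article goes on to cover (the DataFrame below is the one from the excerpt; the assignment itself is illustrative, not taken from the article), chained indexing is the classic way to trigger the warning, and a single `.loc` assignment is the usual fix:

```python
import numpy as np
import pandas as pd

data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5),
        "z": np.array([45, 98, 24, 11, 64])}
df = pd.DataFrame(data=data, index=["a", "b", "c", "d", "e"])

# Chained indexing: df[df["z"] > 50] may return a copy, so assigning
# through it can raise SettingWithCopyWarning and leave df unchanged:
# df[df["z"] > 50]["z"] = 0

# The recommended single-step form indexes the original object with .loc:
df.loc[df["z"] > 50, "z"] = 0
```

After the `.loc` assignment, rows "b" and "e" (whose `z` values exceeded 50) hold 0, and no warning is issued.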


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

June 10, 2020 02:00 PM UTC


CubicWeb

Report of June 10th Cubicweb Meeting

Hi everyone,

We've just published the RC1 for CubicWeb https://pypi.org/project/cubicweb/3.28.0rc1/ and a new version 1.7.0 for logilab-common https://pypi.org/project/logilab-common/1.7.0/

Our current focus is finishing the last details for the release.

Milestone update

Current roadmap

Semver

One of our focuses right now is to make stable releases of our core projects that won't break all the things ™, and we've made a lot of improvements to our test suite to ensure that we test everything against our latest modifications before a release is made. Another problem we have right now is that CW only depends on a minimum version number for its dependencies. This means that if we want to make a new release of one of the dependencies that contains breaking changes, we risk breaking all new CW installations.

To solve this situation, we have decided to implement semantic versioning: we will only introduce breaking changes in major releases and, in addition, CW's dependencies will each be pinned to one specific major release at a time. This way, when we need to make a new release with breaking changes, it will be a major release and we won't break all new CW installations.

We plan to start implementing this strategy with CW version 4.0.

Various updates

See you next week!

June 10, 2020 01:39 PM UTC


Codementor

Django User Model

A small introduction to the Django framework, with a focus on the User model class and authentication.

June 10, 2020 09:35 AM UTC

June 09, 2020


PyCoder’s Weekly

Issue #424 (June 9, 2020)

#424 – JUNE 9, 2020
View in Browser »

The PyCoder’s Weekly Logo


Web Scraping in Python: Tools, Techniques, and Legality

Do you want to get started with web scraping using Python? Are you concerned about the potential legal implications? What are the tools required and what are some of the best practices? This week Kimberly Fessel is a guest on the Real Python Podcast to discuss her excellent tutorial created for PyCon 2020 online titled “It’s Officially Legal so Let’s Scrape the Web.”
REAL PYTHON podcast

Pedantic Configuration Management with Pydantic

Dealing with multiple configuration files for a Python application can be stressful. Learn how to take the edge off with a custom workflow centered around the Pydantic library.
REDOWAN NAFI • Shared by Redowan Delowar

Find Performance Bottlenecks in Python Code

alt

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” - Donald Knuth Blackfire is built to let you find the 3%. Quick install, appealing and user-friendly UI. →
BLACKFIRE sponsor

How Async Should Have Been

Sometimes synchronous and asynchronous code can look very similar. The only difference might be the use of async and await keywords. In this opinion piece, Nikita Sobolev argues that the potential for repeated code is a design fault of Python’s asyncio framework, and describes a solution that allows synchronous Python to execute asynchronous code.
NIKITA SOBOLEV • Shared by Nikita Sobolev opinion

Our Python Monorepo

Opendoor, a residential real-estate startup, has quite a few Python services. These services were spread across several git repositories, but Opendoor’s engineering team recently moved them all into a single monorepo. Learn about the challenges the team faced with many repositories and how they set up their monorepo to solve their problems.
DAN HIPSCHMAN

Fastest Way to Flatten a List in Python

Explore six different ways to flatten a list of lists in Python and see how their performance compares. The fastest of the six methods might surprise you!
CHRIS CONLAN

Nominees for 2020 Python Software Foundation Board Election

Ballots were sent out on June 8th.
PYTHON.ORG

Discussions

Classifying Values Based on Ranges That Contain Them

Avoid big if/else blocks by using the bisect module!
STACK OVERFLOW
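A minimal sketch of the technique (with hypothetical grade boundaries, not taken from the thread):

```python
import bisect

# Hypothetical boundaries: below 60 -> 'F', 60-69 -> 'D', 70-79 -> 'C', ...
breakpoints = [60, 70, 80, 90]
grades = "FDCBA"

def grade(score):
    # bisect.bisect returns how many breakpoints the score meets or
    # exceeds, which doubles as an index into the grades string.
    return grades[bisect.bisect(breakpoints, score)]

print([grade(s) for s in (55, 60, 77, 90, 100)])  # ['F', 'D', 'C', 'A', 'A']
```

The sorted `breakpoints` list replaces a chain of `if score < 60: ... elif score < 70: ...` branches with a single binary search.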

Can You Change the Value of the Integer 1 in Python?

Spoiler alert: yes! But you probably shouldn’t do that.
STACK OVERFLOW

Python Jobs

Senior Python Engineer (Remote)

Gorgias

Software Engineer Python (Remote)

Netizen Corporation

Senior Backend Engineer Python/Django/PostgreSQL (Remote)

Cybercoders

More Python Jobs >>>

Articles & Tutorials

Python Wheels Crosses 90% Adoption

Wheels are the new standard of Python distribution and are intended to replace “eggs.” This site tracks the top 360 most-downloaded packages on PyPI, showing which have been uploaded as wheel archives. As of today, 90% of the top Python packages are available as wheels on PyPI.
PYTHONWHEELS.COM

15 Amazing pytest Plugins

pytest plugins are an amazing way to supercharge your test suites, leveraging great solutions from people solving test problems all over the world. In this episode of the Test & Code Podcast, Michael Kennedy and Brian Okken discuss 15 favorite plugins that you should know about.
TESTANDCODE.COM podcast

Python 2 to 3 Migration: A Developer’s Experience

alt

Still haven’t migrated your Python 2 application to Python 3? This article provides guidelines, strategies and tips to make the process as easy as possible, along with a ready-to-use Python 2-to-3 runtime containing the most relevant packages. Check it out! →
ACTIVESTATE sponsor

Regular Expressions: Regexes in Python (Part 2)

In the previous tutorial in this series, you learned how to perform sophisticated pattern matching using regular expressions, or regexes, in Python. This tutorial explores more regex tools and techniques that are available in Python.
REAL PYTHON

Getting the Most Out of a Python Traceback

Learn how to read and understand the information you can get from a Python stack traceback. You’ll walk through several examples and see some of the most common tracebacks in Python.
REAL PYTHON video

Python Community Interview With Kattni Rembor

Kattni Rembor is a creative engineer at Adafruit Industries. In this interview, she talks about her work developing CircuitPython and the role mentorship has played in her career to date. She also shares her advice for anyone looking to start their first hardware project using CircuitPython.
REAL PYTHON

Combining Flask and Vue

Learn about three ways to combine Flask and Vue, the pros and cons of each, and some guidelines for when to use each method.
JACE MEDLIN • Shared by Jace Medlin

What is Python Redis? Enhance Python with Redis – The Fastest In-Memory Cloud Database

Install redis-py & Python Redis Client. Explore how Redis can enhance Python capabilities. Learn how to use Connection Pooling, SSL, Reading & Writing, & Opening a Connection with redis-py.
REDIS LABS sponsor

Django Stripe Tutorial

Learn how to configure a new Django website from scratch to accept one-time payments with Stripe Checkout.
MICHAEL HERMAN • Shared by Michael Herman

Why You Should Use More Enums in Python

Learn about Python’s Enum type and why you should consider using them in your own programs.
FLORIAN DAHLITZ • Shared by Florian Dahlitz

Python Debugging Tips

MARTIN HEINZ

Projects & Code

pydantic: Data Parsing and Validation Using Python Type Hints

GITHUB.COM/SAMUELCOLVIN

altair: Declarative Statistical Visualization Library for Python

GITHUB.COM/ALTAIR-VIZ

pyinfra: Automate Infrastructure Super Fast at Massive Scale

GITHUB.COM/FIZZADAR

falcon: The No-Nonsense, Minimalist Web Services and App Backend Framework

GITHUB.COM/FALCONRY

icl: An Interactive Memory Aid for One-Liners

GITHUB.COM/PLAINAS

falconify: Falcon Microservice Template for Quick Bootstrapping

GITLAB.COM/RAPHACOSTA


Happy Pythoning!
This was PyCoder’s Weekly Issue #424.
View in Browser »

alt

[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

June 09, 2020 07:30 PM UTC


Philippe Normand

WebKitGTK and WPE now supporting videos in the img tag

Using videos in the <img> HTML tag can lead to more responsive web-page loads in most cases. Colin Bendell blogged about this topic; make sure to read his post on the Cloudinary website. As it turns out, this feature has been supported for more than 2 years in Safari, but …

June 09, 2020 04:00 PM UTC


Real Python

Getting the Most Out of a Python Traceback

Python prints a traceback when an exception is raised in your code. The traceback output can be a bit overwhelming if you’re seeing it for the first time or you don’t know what it’s telling you. But the Python traceback has a wealth of information that can help you diagnose and fix the reason for the exception being raised in your code. Understanding what information a Python traceback provides is vital to becoming a better Python programmer.

By the end of this course, you’ll be able to:


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

June 09, 2020 02:00 PM UTC


Anwesha Das

Difference between chcon and semanage

SELinux

Security-Enhanced Linux (SELinux) is a mandatory access control mechanism in Linux distributions. This extra layer of security keeps the user's data in the system safe. An SELinux context consists of labels attached to each process and file, carrying the additional information that SELinux policy uses. The details about user, role, type, and sensitivity help make access control decisions. The context of a file is generally inherited from its parent directory.

chcon

To grant or deny access through SELinux, it is often necessary to alter the SELinux context. The chcon (change context) command is used to change the SELinux context. By default, files inherit the SELinux context of their parent directory.

$ mkdir /data
$ ls -Zd  /data

drwxr-xr-x. root root unconfined_u:object_r:default_t:s0 /data

We created a directory called /data; ls -Zd shows the SELinux context of the directory.

$ sudo chcon -t httpd_sys_content_t /data
$ ls -Zd /data
drwxr-xr-x. root root unconfined_u:object_r:httpd_sys_content_t:s0 /data

With chcon -t, we changed the SELinux context of /data to httpd_sys_content_t from its default context, default_t.

restorecon

restorecon restores the context of files and directories to their default SELinux context.

$ sudo restorecon -v /data
restorecon reset /data context unconfined_u:object_r:httpd_sys_content_t:s0->unconfined_u:object_r:default_t:s0

semanage

semanage is the policy management tool for SELinux. It can, for example, change the port type a service uses so that it differs from the default, without modifying or recompiling the policy sources. semanage can map usernames to SELinux user identities and manage the security context of objects like network ports, interfaces, and hosts. By default, SELinux only allows known services to bind to known ports; to let a service use a non-default port, we use semanage.

$ sudo semanage fcontext -a -t httpd_sys_content_t '/data(/.*)?'

This adds a persistent SELinux file-context rule for the /data directory.

$ sudo semanage fcontext -l


/data(/.*)?                                        all files      system_u:object_r:httpd_sys_content_t:s0

Then we can apply the new context by using the restorecon command.

$ sudo restorecon -v /data

restorecon reset /data context unconfined_u:object_r:default_t:s0->unconfined_u:object_r:httpd_sys_content_t:s0

Difference between semanage and chcon

With both the semanage and chcon commands, we can change the SELinux context of a file, process, or directory, but there is a significant difference between them: changes made with chcon are temporary, whereas changes made with semanage are permanent. The context of a file altered with chcon reverts to the default when restorecon is executed; restorecon relabels the file system using the contexts set by semanage, which is what makes semanage fcontext changes persistent. Therefore it is not advisable to use chcon to change the SELinux context permanently.

June 09, 2020 11:40 AM UTC

June 08, 2020


PSF GSoC students blogs

Weekly Blog Post | Gsoc'2020 | #2

Greetings, People of the world!

The coding period for GSoC 2020 started last week, and it has been as incredible as I imagined.

1. What did you do this week?

I worked on improving the UI for the project in Adobe XD, keeping in mind that all the required features should fit in perfectly. Out of the many possible ways to accomplish the desired features, I planned out the best ones, keeping in mind all the potential use cases for the project, after receiving valuable insights from the mentors.

2. What is coming up next?

I will be working on adding basic logic for icon customisation in the Icons picker API that is currently being used.

3. Did you get stuck anywhere?

Yes, I did. I was confused about where to place the components for the multiple-icon customisation features, and about whether I should use the icon picker for customisation of multiple icons, since it was already built and would make it feasible to select the desired icons before customisation. The placement of components was a bit confusing since a lot of components needed to fit in a small space, but eventually it was done with suggestions from the mentors.

June 08, 2020 10:48 PM UTC


Codementor

A Comprehensive Guide to Handling Exceptions in Python

The dos and don’ts of best-practice Python exception handling

June 08, 2020 09:53 PM UTC


PSF GSoC students blogs

Unexpected Things When You're Expecting

Hi everyone, I hope that you are all doing well and I wish you all good health! The last week has not been really kind to me, with a decent amount of academic pressure (my school year lasts until early July). It would be bold to say that I have spent 10 hours working on my GSoC project since the last check-in, let alone the 30-hours-per-week requirement. That being said, there were still some discoveries that I wish to share.

The multiprocessing[.dummy] wrapper

Most of the time I spent was on finalizing the multi{processing,threading} wrapper for the map function that submits tasks to the worker pool. To my surprise, it is rather difficult to write something that is not only portable but also easy to read and test.

By the latest commit, I realized the following:

  1. The multiprocessing module was not designed for the implementation details to be abstracted away entirely. For example, the lazy maps can be really slow without a suitable chunk size (used to cut the input iterable and distribute the chunks to workers in the pool). By suitable, I mean only an order of magnitude smaller than the length of the input. This defeats half of the purpose of making it lazy: allowing the input to be evaluated lazily. Luckily, in the use case I'm aiming for, the length of the iterable argument is small and the laziness is only needed for the output (to pipeline download and installation).
  2. Mocking imports for testing purposes can never be pretty. One reason is that we (Python users) have very little control over the calls of import statements and their lower-level implementation, __import__. In order to properly patch this built-in function, unlike for others of the same group, we have to monkeypatch the name from builtins (or __builtin__ under Python 2) instead of the module that imports stuff. Furthermore, because of the special namespacing, to avoid infinite recursion we need to alias the function to a different name for the fallback.
  3. To add to the problem, multiprocessing lazily imports the fragile module during pool creation. Since the failure is platform-specific (the lack of sem_open), it was decided to check upon import of pip's module. Although the behavior is easier to reason about in human language, testing it requires invalidating the cached import and re-importing the wrapper module.
  4. Last but not least, I now understand the pain of keeping Python 2 compatibility that many package maintainers still need to deal with every day (although Python 2 has reached its end of life, pip, for example, will still support it for another year).

The change in direction

Since last week, my mentor Pradyun Gedam and I have set up a weekly real-time meeting (a fancy term for a video/audio chat in the worldwide quarantine era) for the entire GSoC period. During the last session, we decided to put parallelization of downloads during resolution on hold, in favor of a more beneficial goal: partially downloading wheels during dependency resolution.

Assuming I'll reach the goal eventually

As discussed by Danny McClanahan and the maintainers of pip, it is feasible to download only a few kB of a wheel to obtain enough metadata for the resolution of dependencies. While this is only applicable to wheels (i.e. prebuilt packages), other packaging formats make up less than 20% of downloads (at least on PyPI), and the figure is much lower for the most popular packages. Therefore, this optimization alone could make the upcoming backtracking resolver's performance on par with the legacy one.

During the last few years, a lot of effort has been poured into replacing pip's current resolver, which is unable to resolve conflicts. While its correctness will be ensured by some of the most talented and hard-working developers in the Python packaging community, from the users' point of view, it would be better to have its performance not lag behind the old one. Aside from the increase in CPU cycles for more rigorous resolution, more I/O, especially networking operations, is expected to be performed. This is due to the lack of a standard and efficient way to acquire the metadata. Therefore, unlike most package managers we are familiar with, pip has to fetch (and possibly build) the packages solely for dependency information.

Fortunately, PEP 427 recommends that package builders place the metadata at the end of the archive. This allows the resolver to fetch only the last few kB using HTTP range requests for the relevant information. Simply appending Range: bytes=-8000 to the request header in pip._internal.network.download makes the resolution process lightning fast. Of course this breaks the installation, but I am confident that it is not difficult to implement this optimization cleanly.
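The idea can be illustrated without any networking: build a wheel-like zip in memory, keep only the final 8 kB (what a Range: bytes=-8000 response would contain), and read the metadata from that tail alone. This is a hedged sketch with a made-up package, not pip's actual implementation:

```python
import io
import zipfile

# Build a small wheel-like archive in memory (hypothetical package "pkg").
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/big_module.py", "x" * 50_000)  # bulk of the wheel
    zf.writestr("pkg-1.0.dist-info/METADATA",
                "Metadata-Version: 2.1\nName: pkg\nRequires-Dist: requests\n")
data = buf.getvalue()

# A client sending `Range: bytes=-8000` would receive only this tail:
tail = data[-8000:]

# The zip end-of-central-directory record sits at the very end, so zipfile
# can still enumerate the entries and read small members stored near the
# end, such as the METADATA file, from the truncated archive.
zf_tail = zipfile.ZipFile(io.BytesIO(tail))
meta = zf_tail.read("pkg-1.0.dist-info/METADATA").decode()
print(meta.splitlines()[1])  # Name: pkg
```

Reading the large payload from the tail would fail, of course, but the resolver only needs the dependency metadata at this stage.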

One drawback of this optimization is compatibility. Not every Python package warehouse supports range requests, and it is not possible to verify a partial wheel. While the first case is unavoidable, in the second, hash checking is usually used for pinned/locked-version requirements, so no backtracking is done during dependency resolution.

Either way, before installation, the packages selected by the resolver can be downloaded in parallel. This guarantees a larger pool of packages to download at once, compared to parallelization during resolution, where the number of downloads can be as low as one during trials of different versions of the same package.

Unfortunately, I have not been able to do much other than a minor clean-up. I am looking forward to accomplishing more this week and seeing where this path will lead us! At the moment, I am happy that I'm able to meet the blog deadline, at least in UTC!

June 08, 2020 08:56 PM UTC

Week 2 blog!

Hello everyone,

It's me again, excited to share my progress this week with you all. There was a lot of coding involved this week. I coded three classes in order to generate a polymesh using the recast tools directly through the Panda3D interface. First I coded for any .obj file to be loaded into a polymesh, but later changed it to load the polymesh from any geometry stored in a NodePath. A lot of time was spent on debugging. Properly converting the Z-up coordinate system to the Y-up coordinate system and vice versa, drawing the output polymesh using GeomNode, understanding how exactly GeomNode and various other Panda3D tools work, and proper knowledge of the recast library were all essential to get the code to compile and run successfully. Here is a 3D model from the Panda3D sample directory called roaming-ralph; the corresponding output polymesh (the walkable surface) is shown in red.

Apart from the main coding part, building the recast library and my classes with Panda3D is also important. This was one place where I was stuck for a long, long time, and every time my mentors helped me out. Though I might have been troubling them with this part, I got to learn a lot from them, so a big thanks to them for how they helped me this whole week.

For future work, I have a lot of stuff lined up. I have drafted a PR, and my mentors will ask me to make some necessary changes within a day or two. Apart from that, NavMesh generation and Detour integration are the main challenges of the journey in the coming weeks.

Thanks for reading the blog

Stay safe

June 08, 2020 07:26 PM UTC

Blog post for week 1: Introducing support for Redis

Scrapy uses queues for handling requests. The scheduler pushes requests to the queue and pops them from the queue when the next request is ready to be made. At the moment, there is no support for external message queues (e.g. Redis, Kafka, etc.) implemented in Scrapy, however, there are external libraries (https://github.com/rmax/scrapy-redis and others) that bridge Scrapy with external message queues.

The goal of the first week was to implement a new disk-based queue with Redis as the message queue backend. In the first iteration, which happened last week, Redis is used for storing and retrieving requests. Metadata (request fingerprints, etc.) is still saved on disk in the directory set by the JOBDIR setting. If the setting SCHEDULER_DISK_QUEUE is set to a class name, e.g. scrapy.squeues.PickleFifoRedisQueue, the Redis-based implementation is used as the queue backend.
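A hypothetical settings.py fragment opting in to this queue (the class path is the one mentioned above; the JOBDIR value is made up):

```python
# settings.py of a hypothetical Scrapy project: opt in to the
# Redis-backed disk queue while keeping crawl metadata on disk.
JOBDIR = "crawls/my-spider-run"  # fingerprints etc. still live here
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoRedisQueue"
```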

Implementation

The classes PickleFifoRedisQueue and PickleLifoRedisQueueNonRequest are wrappers around the actual Redis queue classes _FifoRedisQueue, _LifoRedisQueue and _RedisQueue that handle connecting to Redis and issuing commands. The only difference between a FIFO and a LIFO queue is the position from which an element is popped after it has been pushed to the queue (left side for a LIFO or right side for a FIFO). Therefore the implementation for both queues is based on the common abstract base class _RedisQueue where most of the code is implemented (except for the pop() method which is abstract and implemented in _FifoRedisQueue and _LifoRedisQueue). The implementation uses the redis-py library (https://pypi.org/project/redis/) under the hood. Redis-py is the recommended library for Python by the Redis project.
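The shared-base-class pattern can be sketched in memory, with comments noting the corresponding Redis commands (a simplified illustration, not the actual implementation):

```python
from abc import ABC, abstractmethod
from collections import deque

# Push is shared; only pop() differs between FIFO and LIFO, mirroring
# Redis LPUSH combined with either RPOP (FIFO) or LPOP (LIFO).
class _BaseQueue(ABC):
    def __init__(self):
        self._items = deque()

    def push(self, item):
        # Redis: LPUSH key item
        self._items.appendleft(item)

    @abstractmethod
    def pop(self):
        ...

class FifoQueue(_BaseQueue):
    def pop(self):
        # Redis: RPOP key -- pop from the side opposite the push
        return self._items.pop()

class LifoQueue(_BaseQueue):
    def pop(self):
        # Redis: LPOP key -- pop from the same side as the push
        return self._items.popleft()
```

With `1, 2, 3` pushed into each queue, `FifoQueue.pop()` returns `1` while `LifoQueue.pop()` returns `3`: the only behavioral difference is the pop side, which is why the shared logic lives in the base class.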

Testing

Although I was planning to write tests a bit later this month, I had some time and already experimented with testing the Redis integration. Scrapy already comes with tests for generic memory- and disk-based queues. An additional requirement in the case of a Redis queue is that the tests need redis-server to be running. In the case of a CI like Travis CI, this can be achieved by enabling redis-server in the CI‘s configuration file. However, tests should of course also be able to run outside of the CI with little manual intervention. Therefore, the usual approach in the Scrapy test code base is to start and stop a process if it is needed by the code under test. Further, tests should not be executed if redis-server is not available. Fortunately, pytest supports skipping tests based on a condition. I added a function that checks for the condition and decorators to the appropriate tests so that they are skipped if redis-server is not available.
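A hedged sketch of such a conditional skip (the helper and test names here are made up, not the actual Scrapy test code):

```python
import shutil

import pytest

# Hypothetical availability check: treat redis-server as usable if the
# binary is on PATH (the real helper may probe differently).
def redis_server_available():
    return shutil.which("redis-server") is not None

@pytest.mark.skipif(not redis_server_available(),
                    reason="redis-server is not available")
def test_fifo_redis_queue_roundtrip():
    # A real test would push requests through the Redis-backed queue here.
    assert True
```

When redis-server is missing, pytest reports the test as skipped with the given reason instead of failing it.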

Outlook

This week I am working on saving even more data in Redis and getting rid of storing meta information about the crawl job on the file system. The idea is to use Redis not only as a queue for requests but also as a store for meta information that needs to persist between crawls.

June 08, 2020 06:02 PM UTC